Features with reduced Unicode compatibility

Recently, I’ve received a number of requests within my organization to produce printed media in languages other than English, including letter content in Chinese. To produce most of our letters, we pass UTF-8 XML documents to PlanetPress. Whenever we produce letters created directly in the Connect Designer, they appear fine; however, whenever a letter is dynamically produced from XML content that includes Mandarin characters, the content is mostly substituted with ??? characters. I’ve read a number of forum questions asking for assistance with this, and the general indication is that Unicode is almost completely supported, but that there is a small number of components where Unicode support is limited.

Could someone from the team please provide a list of where Unicode support is limited, so that I and others can plan appropriate changes to our Workflow processes? Which plugins, converters, script plugins, endpoints, etc. should we avoid? Which Unicode character sets are fully supported and which are not?

In the OL Connect Designer/DataMapper, UTF-8 is fully supported. Obviously, your data must truly be UTF-8, which is not always the case, especially with XML. In XML, the absence of an encoding attribute in the <?xml ... ?> declaration means the file should be interpreted as UTF-8. But many applications simply don’t include the encoding attribute, even when the file contents actually use a different encoding.

For instance, an XML file encoded with ISO-8859-1 (the generic Latin encoding used in Western countries) must be marked this way:

<?xml version="1.0" encoding="ISO-8859-1"?>

So if your XML file is properly encoded and marked as such (or unmarked if it truly is encoded in UTF-8), then it will work just fine in OL Connect Designer/DataMapper.
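For completeness, a file that genuinely is UTF-8 can also declare its encoding explicitly instead of relying on the default:

<?xml version="1.0" encoding="UTF-8"?>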

That said, the OL Connect Workflow module is not UTF-8 compliant. Well, to be more accurate: a lot of it is not. As soon as you read from or write to the data file, Workflow uses the machine’s default locale to determine which encoding to use. Note: in this case, writing/reading means “writing/reading content to/from the file”; it does not apply to file operations like copying, uploading or moving a file.

So with UTF-8 files, these operations go through a conversion process, which can lead to data loss: any character that has no equivalent in the locale’s code page is typically replaced with ?, which is exactly the substitution you are seeing with Mandarin characters. Also, when you type a specific value in any task’s fields, that value is encoded using the same locale encoding, so those values may be converted as well when written to a file.
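To illustrate the mechanism, here is a minimal JScript sketch (runnable in a Workflow Run Script task; the file path is just an example). Scripting.FileSystemObject writes text using the machine’s ANSI code page, which mimics what the locale-bound tasks do:

// CreateTextFile(path, overwrite, unicode): unicode = false
// means the file is written in the machine's ANSI code page.
var fso = new ActiveXObject("Scripting.FileSystemObject");
var file = fso.CreateTextFile("C:\\temp\\ansi-test.txt", true, false);
file.WriteLine("Beijing: 北京"); // outside the ANSI code page
file.Close();

On a machine whose locale uses, say, Windows-1252, the Chinese characters in that line come out as ?? in the resulting file.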

There are some exceptions to this: Workflow’s Metadata is natively Unicode-aware. As long as you don’t manually write to/read from the file, the UTF-8 encoding is preserved. For instance, if you use an Execute Data Mapping task with a UTF-8 file and set the option to store the records in the Metadata, those metadata records will contain the correct values.

However, if you wanted to write those values to a file while retaining their original encoding, then you’d have to use a script because Workflow’s Script Engine is also Unicode-aware while, for instance, the Create File task is not.
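By way of illustration, here is a minimal JScript sketch of that approach, assuming the current job file already contains UTF-8 text (the output path is hypothetical). ADODB.Stream lets you name the character set explicitly, so the content never passes through the locale conversion:

// Read the current job file as UTF-8. Watch.GetJobFileName()
// is the standard Workflow scripting call returning its path.
var inStream = new ActiveXObject("ADODB.Stream");
inStream.Type = 2;                // 2 = adTypeText
inStream.Charset = "utf-8";
inStream.Open();
inStream.LoadFromFile(Watch.GetJobFileName());
var text = inStream.ReadText(-1); // -1 = adReadAll
inStream.Close();

// Write it back out as UTF-8, untouched by the locale's code page.
var outStream = new ActiveXObject("ADODB.Stream");
outStream.Type = 2;
outStream.Charset = "utf-8";
outStream.Open();
outStream.WriteText(text);
outStream.SaveToFile("C:\\out\\letter.xml", 2); // 2 = adSaveCreateOverWrite
outStream.Close();

One caveat: ADODB.Stream prefixes UTF-8 output with a byte order mark, which most XML parsers accept without complaint.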

Hopefully, this gives you a better idea of the potential pitfalls of using UTF-8 data.

Final note: before anyone asks, no, we can’t convert Workflow to a Unicode-aware application, as it would require a huge amount of work because of the language platform it was developed with. Just like we can’t convert it to a 64-bit application. That’s why we are developing, in parallel, a series of nodes for Node-RED that will, in time, be able to achieve most, if not all, of the operations currently available in Workflow. In 64-bit. And in Unicode.