Print input datamapping with regex

In workflow, since the WinQueue input does not actually use the full-text searchable version of the pdf file, but rather a copy acted upon by the PP printer driver, regex capabilities for datamapping are not possible, correct? And instead I would have to use a different input, such as a hot folder. Or am I mistaken?

Thanks in advance.

Hard to say; it depends on how much the printer driver mangles the character identities. The more “exotic” the character set, the more you risk losing the text. If you can get hold of the original PDF file, that would indeed be much more efficient.

Thanks for your reply,

I guess I ask because the copy itself isn’t text searchable, right? It’s my understanding that the PDF document, whether or not it was text searchable before, is no longer text searchable once it has gone through the PP printer driver and entered the workflow from the print input. Therefore, any advanced JavaScript regexes I use for my datamapping finds would not work, and would throw errors (as I’m getting). Or is OL capable of running a regex text-find on a document that isn’t text searchable (which wouldn’t make much sense to me)?

Thank you for your help.

Pretty much, but not always.

The problem with the WinQueue Input task is that the printer driver first converts the print stream produced by any application to PostScript before the WinQueue Input plugin converts it to PDF. The fact that the original was a PDF is completely coincidental: it could have been a web page, a Word document, etc., and the end result would be the same.

It is that conversion to PostScript that causes the issue. PostScript is a page description language, meaning that the end result is intended to be printed. On paper. So while carefully crafted PostScript code could be written to maintain character identities (i.e. the link between a glyph, which is the drawing of a character, and its human meaning), this is seldom done, especially in printer drivers, which are usually optimized to, well, print. :-p

Our PDF conversion library (Adobe Normalizer, the library version of Acrobat Distiller) does have some tricks up its sleeve to maintain character identities from PostScript. If the print stream is relatively clean, the original fonts are simple and the text uses just plain ASCII (i.e. English characters), there is a good chance that character identities will make it through. But anything exotic (CJK, Type 3 fonts, etc.) unfortunately likely won’t be recovered.

The exact same thing goes if you use a PDF as a background in PlanetPress 7. The PDF goes through the same PDF-to-PostScript-to-PDF conversion process. That’s why we usually advise using the “stamping” method when character data is important.

Note that OL Connect uses a different technology for PDF background, so results should be better.

So yeah… If you want to datamap a print job captured with the WinQueue Input task, chances are you won’t get very far. And you are right, the Connect Data Mapper (or any other application* for that matter) will not be able to extract or find text in a PDF if the text isn’t searchable. Rule of thumb: if copy-paste from Acrobat works, so will data mapping; if not, you’re done for. That is why, if the original is a PDF, it is much better to find an alternative way to get it into the system instead of going through a print operation.

(*) That is not entirely true. For example, the text could be OCR’ed. Since we start with a perfect print and not a scan, accuracy should be pretty good. But that’s a whole other ball game. So my statement is close enough to the truth to say it is true. ;-)
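
For what it’s worth, the copy-paste rule of thumb above can be roughly approximated in script, which may also help with regexes that throw errors when the extracted text is empty or garbled. This is only a minimal sketch in plain JavaScript: looksSearchable and safeMatch are hypothetical helper names, not part of the OL Connect API, and the 90% threshold assumes the job is expected to contain mostly plain ASCII text.

```javascript
// Hypothetical helpers for illustration; they are NOT part of the OL Connect API.

// Returns true when the text looks like real, readable characters.
// Assumption: the job is expected to contain mostly plain ASCII text.
function looksSearchable(text, minRatio) {
  if (typeof text !== "string" || text.trim().length === 0) {
    return false;
  }
  var threshold = (minRatio === undefined) ? 0.9 : minRatio;
  var readable = 0;
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    // Count printable ASCII plus tab, line feed and carriage return.
    if ((code >= 32 && code <= 126) || code === 9 || code === 10 || code === 13) {
      readable++;
    }
  }
  return readable / text.length >= threshold;
}

// Applies a regex only when the text is usable.
// Returns the first capture group (or the whole match), or null instead of throwing.
// Note: avoid the "g" flag here, since exec() then keeps state between calls.
function safeMatch(text, pattern) {
  if (!looksSearchable(text)) {
    return null;
  }
  var m = pattern.exec(text);
  if (m === null) {
    return null;
  }
  return (m.length > 1 && m[1] !== undefined) ? m[1] : m[0];
}
```

For example, safeMatch("Invoice No: 12345", /Invoice No:\s*(\d+)/) returns "12345", while an empty or mostly unreadable string returns null, so the script can fall back to a default value rather than fail.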

Thank you so very much for helping me better understand. You’ve 100% answered my question and I appreciate your time and effort in this response.