I have a PDF that I want to extract data from. I can easily access the numeric values, but the font used in the PDF is Arial and it is not embedded in PlanetPress; because of that, the character values are showing up differently.
How can I add that font in PlanetPress so that the Data Mapper can extract the correct values?
This is likely NOT because the font isn’t embedded, but because the font uses a non-standard (custom) encoding. If you select text in the PDF and copy/paste it into a text editor, the text will appear “scrambled”. That’s the same thing the Data Mapper sees, and this isn’t fixable. Whatever system or platform produces the PDF must be altered to embed the fonts with standard encoding.
Is there any other way to do it? We do not have control over the PDF file at the time of its generation.
There really is not. This is a case of the old saying “garbage in, garbage out”.
Font encoding works as a table of numbers and names, cross-referenced to a table of names and glyphs. The name is the name of the letter; in PostScript this looks like “/A” and “/B”, for example. In standard encoding, the letter A is actually the number 65 and the letter B is 66. When you copy/paste “text” you’re actually copying and pasting a series of numbers. To draw a letter in a shape specific to a font, “65” looks up “A”, and “A” looks up the glyph, which is the code that draws the letter shape.
What’s happened to you is that the letter names have been shuffled and arranged into a different set of numbers (that’s your “encoding” table). “A” isn’t in slot 65, “B” isn’t in slot 66.
Your PDF doesn’t care, but anything that extracts text from the PDF will get a string of numbers that do not match what any other software expects.
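To make the shuffled-table idea concrete, here is a minimal Python sketch. It is only an illustration: the rotate-by-3 `custom_encoding`, the `draw` and `extract` helpers, and the sample word are all made up, and a real PDF producer may arrange the slots in any order.

```python
# Standard-style table: code 65 -> glyph name "A", 66 -> "B", and so on.
letters = [chr(c) for c in range(65, 91)]
standard_encoding = {65 + i: letters[i] for i in range(26)}

# Hypothetical custom table: the same glyph names, moved to different slots.
# A rotation by 3 is assumed here purely for illustration.
custom_encoding = {65 + i: letters[(i + 3) % 26] for i in range(26)}


def draw(codes, encoding):
    """What the viewer does: code -> glyph name -> shape on the page."""
    return "".join(encoding[c] for c in codes)


def extract(codes):
    """What a naive extractor does: assume every code is a standard character code."""
    return "".join(chr(c) for c in codes)


# Codes the producer stores so that the page *displays* "INVOICE".
name_to_code = {name: code for code, name in custom_encoding.items()}
stored_codes = [name_to_code[ch] for ch in "INVOICE"]

displayed = draw(stored_codes, custom_encoding)  # what you see on screen
extracted = extract(stored_codes)                # what copy/paste or extraction yields

print(displayed)
print(extracted)
```

The page renders correctly because the viewer follows the font's own table, while the extracted string comes out scrambled because the extractor assumed the standard table.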
@TGREER, eloquently put. I wouldn’t have done better myself.
@djinkal, do you have an example to take a quick look? You can send it to me in a private message if it contains sensitive data.
Due to data privacy concerns, we are not allowed to share these PDFs with you. Is there any other way we can connect, so that I can show you the issue over a call?
Analysis would require opening the PDF in diagnostic tools. You could open a formal support ticket if that is more appropriate.
But otherwise, everything TGREER mentioned in his comment applies. If you can’t copy-paste the text from Acrobat to Notepad, then Connect won’t be able to either. The analysis wouldn’t change any of this; it would simply provide a more complete answer as to why.
@djinkal If you open the PDF in Acrobat, and from the menu select “Document Properties”, there is a “Font” tab. That should list all the fonts and whether or not they are embedded, as well as the font encoding. At the very least that should give you information to relay back to the PDF producer.
The first tab will also tell you what application created the PDF, and you can use that to do a web search on how to properly embed and encode fonts in that software.
Looking at the Fonts tab is a good first step, but it is often misleading. The ability to extract text is not related to the font type or whether it is embedded or not, and only remotely to the encoding.
The primary facility for text extraction is the ToUnicode table (a CMap) referenced from the font dictionary. If the ToUnicode table is not present, the software falls back to a series of heuristics which vary depending on the font type and encoding. This is where it gets messy.
Unfortunately, none of this is shown on the Fonts tab. You can infer part of it based on the info from there and the actual behavior, but the process is unreliable at best. Hence the need to look at the internals to figure out the precise reasons.
In any case, if the text can’t be copy-pasted, then going back to the producer is the only option. Looking at the application which produced the PDF as you suggested can be useful, as sometimes it can be updated or configured differently to produce a better output.
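The ToUnicode lookup and its fallbacks described above can be sketched roughly as follows. This is a simplified model, not how Connect or Acrobat is actually implemented: `code_to_text` is a hypothetical helper, plain dicts stand in for the ToUnicode table and the font encoding, and the glyph-name heuristic is reduced to "a single-letter name maps to itself".

```python
def code_to_text(code, to_unicode=None, encoding=None):
    """Map one character code to text, roughly the way an extractor might.

    to_unicode: optional dict of code -> Unicode string (stands in for ToUnicode)
    encoding:   optional dict of code -> glyph name (stands in for the font encoding)
    """
    # 1. Preferred path: an explicit ToUnicode mapping.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # 2. Fallback heuristic (simplified): a recognizable glyph name.
    if encoding is not None:
        name = encoding.get(code, "")
        if len(name) == 1:
            return name
    # 3. Last resort: treat the code as a standard character code.
    return chr(code)


# Assumed example: a subset font with arbitrary codes and opaque glyph names.
codes = [1, 2, 3]
encoding = {1: "g1", 2: "g2", 3: "g3"}
to_unicode = {1: "A", 2: "B", 3: "C"}

with_table = "".join(code_to_text(c, to_unicode, encoding) for c in codes)
without_table = "".join(code_to_text(c, None, encoding) for c in codes)

print(repr(with_table))
print(repr(without_table))
```

With the table present the text comes out as intended; without it, neither fallback recovers anything meaningful from this font, which is the kind of case where only the producer can fix the file.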
Thanks to everyone who answered my query. I tried using Adobe Reader and copied the text into Notepad. It works properly there, but it does not work in PlanetPress. What should be done in this case?
As you indicated that you are not able to share the PDF file due to data privacy concerns, which sounds very reasonable to me, I would recommend that you submit a support ticket via our support portal, because then someone from your local support team should be able to schedule a call with you.
And for anyone reading this post in the future, it is worth noting that if you can’t copy/paste from Acrobat, then you won’t be able to extract in the Data Mapper. BUT even if you can, you could still be unable to extract using the Data Mapper, as Acrobat can also rely on OCR (Optical Character Recognition), which Connect doesn’t.