Strange Results from PDF Extract

Hi all

I’m trying to do a simple address block extraction from a PDF file, but I’m getting strange results. The extracted fields show doubled letters even though the PDF clearly doesn’t have them. See image below.

Any ideas why?

Hi Duncan,
I have come across this issue before and, from memory, I resolved it by adjusting the Word spacing value under Settings in the Data Mapper.

image

Best Regards

Justin Leigh

Hi Justin

Thanks for your reply. However, changing the word spacing value didn’t seem to make any difference.

What is also unusual is that one of my extract fields comes out correctly, i.e. with no doubled letters/digits.

Hi Duncan,
Are you able to upload the Data Mapper? I have seen this before, but I’ll need to refresh my memory of how I fixed it.
One thing I do when I run into issues with PDFs used in the Data Mapper is open the PDF in Acrobat Reader and copy and paste the text that is causing problems into Notepad or Write. If the pasted text shows anomalies, then there’s an issue with the PDF itself; if not, then the issue is with the Data Mapper.

Best Regards

Justin Leigh

I just tried what you suggested, and the anomalies are also there when copying the address block and pasting it into EditPad. Looks like a PDF issue then?

Hi Duncan,
It definitely seems like the PDF is the issue. I’m still sure I’ve seen this before. If I find more information that may be of use I’ll post a new response.

Best Regards

Justin Leigh

Would there be a post function script that could revert the extraction to single letters when using the split lines extraction? I’m guessing I won’t be able to get another PDF to work with.

Something like the below, though I’m not sure how to integrate it:

const remaining = myString.split('').filter((char, i) => i % 2 !== 0).join('');
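The one-liner above assumes every single character in the string is doubled, so it would mangle a field that was extracted correctly (as one of the fields apparently is). Below is a slightly more defensive sketch in plain JavaScript. The function names collapseDoubledToken and collapseDoubledText are made up for illustration, and how the result gets wired into the extraction step is left open here. It only collapses a word when every character in it is immediately repeated, and leaves everything else alone.

```javascript
// Hypothetical helper: turns a token such as "DDuunnccaann" into "Duncan",
// but only when every character in the token is immediately repeated.
function collapseDoubledToken(token) {
  if (token.length === 0 || token.length % 2 !== 0) {
    return token; // an odd-length token cannot be fully doubled
  }
  let result = '';
  for (let i = 0; i < token.length; i += 2) {
    if (token[i] !== token[i + 1]) {
      return token; // not a doubled pair, so leave the token untouched
    }
    result += token[i];
  }
  return result;
}

// Applies the check to each run of non-whitespace characters.
// Whitespace, including line breaks, is left exactly as it was.
function collapseDoubledText(text) {
  return text.replace(/\S+/g, collapseDoubledToken);
}
```

One caveat: a value that is genuinely made of repeated pairs (a house number like 1122, or initials like AA) would also be collapsed, so this is a workaround to check against real data rather than a proper fix for the PDF.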

Could it be that this PDF was personalized in another program?
I know this problem from PrintShop Mail. There, existing text in a background PDF is often covered with a white frame so that personalized text can be placed over it.
When such a PDF is read in the Data Mapper, the “covered” original text is still present in the PDF and is extracted along with the visible text.