Supported PDF types in data mapping configuration

Hello,

I’m trying to set up a data mapping configuration to extract data from PDF files.

Some kinds of PDF files won’t open and produce errors.

Does anybody know what kinds of PDF files are supported by the DataMapper?

To be more precise:

pdf version: 1.4 … 1.9

font type: truetype, type 1 …

font encoding: ANSI, CID …

font embedding: yes, no

other: layer …

ERROR [24 Feb 2015 14:32:19,496][ModalContext] com.objectiflune.weaver.textextraction.rest.client.ExtractionRestClient.getPage(?:?) POST http://localhost:51316/rest/weaverengine/extractor/getPage/C:\Users\Administrator.vmwin7ol\Connect\temp\Connectdesigner\1740\inputdata.8300413086049324830.pdf/0 returned a response status of 500 Internal Server Error
ERROR [24 Feb 2015 14:32:19,673][main] com.objectiflune.datamining.ui.model.DataMiningModel.loadDocument(?:?) [COMPONENT=Data Mapping][SOURCE=Internal] Unable to open the document 1 (DME000049)
java.lang.Exception: com.objectiflune.datamining.pdf.pdfengine.textextract.TextExtractorException: Error while retrieving character data (DME000165)
at com.objectiflune.datamining.ui.model.DataMiningModel.loadDocument(Unknown Source)
at com.objectiflune.datamining.ui.model.DataMiningModel.setDocumentIndex(Unknown Source)
at com.objectiflune.datamining.ui.model.DataMiningModel.updateDocumentCount(Unknown Source)
at com.objectiflune.datamining.ui.model.RefreshBoundariesJob$1.run(Unknown Source)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:135)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4144)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3761)
at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:2701)
at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2665)
at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2499)
at org.eclipse.ui.internal.Workbench$7.run(Workbench.java:679)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:668)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
at com.objectiflune.application.Application.start(Unknown Source)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:353)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:180)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:629)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:584)
at org.eclipse.equinox.launcher.Main.run(Main.java:1438)
at org.eclipse.equinox.launcher.Main.main(Main.java:1414)
Caused by: com.objectiflune.datamining.pdf.pdfengine.textextract.TextExtractorException: Error while retrieving character data (DME000165)
at com.objectiflune.datamining.pdf.pdfengine.textextract.internal.WeaverExtractorEngine.analyze(Unknown Source)
at com.objectiflune.datamining.pdf.data.PDFDocumentData.analyzePage(Unknown Source)
at com.objectiflune.datamining.pdf.data.PDFDocumentData.reset(Unknown Source)
at com.objectiflune.datamining.pdf.data.PDFDocumentData.open(Unknown Source)
at com.objectiflune.datamining.Document.open(Unknown Source)
at com.objectiflune.datamining.ui.model.DataMiningModel$LoadDocumentRunnable.run(Unknown Source)
at org.eclipse.jface.operation.ModalContext$ModalContextThread.run(ModalContext.java:121)
Caused by: java.lang.NullPointerException
at nl.edmond.weaver.api.textextraction.TextAnalyzer.analyzePage(Unknown Source)
… 7 more

It’s difficult to tell without seeing the actual PDF file, but my first guess would be that it’s either password-protected or has some kind of restrictions on it. In theory, all types of PDF files can be handled by the DataMapper.
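If you want to rule out protection before uploading anything, a rough check is possible with plain Python. The sketch below is only a heuristic and is not something Connect provides: the function name `pdf_quick_facts` is made up for illustration. It reads the version from the `%PDF-x.y` header and looks for an `/Encrypt` entry, which is what both password protection and usage restrictions add to a PDF.

```python
import re


def pdf_quick_facts(data: bytes) -> dict:
    """Heuristic look at raw PDF bytes (hypothetical helper, not a Connect API).

    Returns the version from the %PDF-x.y header and whether an /Encrypt
    entry appears anywhere in the file. Password protection and usage
    restrictions both rely on an /Encrypt dictionary, so its absence
    suggests the file is not protected.
    """
    # The header normally sits at the very start of the file; search the
    # first kilobyte to tolerate a few leading junk bytes.
    header = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    version = header.group(1).decode("ascii") if header else None

    # \b keeps /EncryptMetadata from being counted on its own.
    encrypted = re.search(rb"/Encrypt\b", data) is not None

    return {"version": version, "encrypted": encrypted}
```

To try it on a real file, read it in binary mode and pass the bytes, for example `pdf_quick_facts(open("Sample01.pdf", "rb").read())`. Keep in mind that this is a byte scan, not a real PDF parser, so treat the result as a hint only.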

You can attach your file to a post by clicking the link button in the Editor toolbar and selecting Upload.

Attached you can find two sample files:

Sample01.pdf won’t open in the DataMapper.

Sample02.pdf can be opened, but when I extract any text I get only garbage.

Neither file has any restrictions or write protection, and both can be opened without issues in Adobe Reader.

Thanks for your help !

Sample02 is the classic case of a PDF from which we can’t extract valid data: the fonts used in the PDF lack an encoding table. You can’t extract anything from that PDF using PlanetPress Design, nor can you just select text in Adobe Reader and Copy/Paste it into Notepad. So that one is easy: the application that generated it must be set to include encoding tables in the PDF.
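For anyone who wants to spot this kind of file up front, here is a minimal sketch in plain Python. It rests on two assumptions: that the "encoding table" in question corresponds to the font's `/ToUnicode` entry, and that the font objects are stored uncompressed (fonts packed inside object streams will not be seen). The name `list_fonts` is hypothetical and is not part of Connect or PlanetPress.

```python
import re

# Rough patterns for uncompressed PDF objects; this is not a real parser.
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
FONT_RE = re.compile(rb"/Type\s*/Font\b")  # \b excludes /FontDescriptor
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")
SUBTYPE_RE = re.compile(rb"/Subtype\s*/(\w+)")


def list_fonts(data: bytes) -> list:
    """List font dictionaries found in raw PDF bytes (hypothetical helper).

    Each entry records the font name, its subtype, and whether the
    dictionary carries a /ToUnicode entry (assumed here to be the
    "encoding table" needed for text extraction).
    """
    fonts = []
    for match in OBJ_RE.finditer(data):
        body = match.group(1)
        if not FONT_RE.search(body):
            continue
        name = BASEFONT_RE.search(body)
        subtype = SUBTYPE_RE.search(body)
        subtype_text = subtype.group(1).decode("ascii") if subtype else "?"
        # Descendant CID fonts never carry /ToUnicode themselves; the
        # mapping belongs to their Type0 parent, so skip them.
        if subtype_text.startswith("CIDFontType"):
            continue
        fonts.append({
            "name": name.group(1).decode("latin-1") if name else "?",
            "subtype": subtype_text,
            "has_tounicode": b"/ToUnicode" in body,
        })
    return fonts
```

A font reported with `has_tounicode` set to `False` is not necessarily broken, since fonts using a standard encoding can still extract cleanly. It is only a hint about which fonts to look at first.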

Sample01 is more puzzling. In that one, the fonts are fine and the data can be extracted properly, but for some reason Connect fails to display the PDF (which prevents you from selecting and extracting data). It works in PlanetPress Design and when copy/pasting text from Adobe Reader to Notepad.

We will investigate further and attempt to fix the issue once we find the cause.

Our experts in the matter have found the reason for Sample01’s failure. It is caused by the logos appearing on the first three pages. Apparently the PDF was analyzed with an OCR library that adds invisible text to the PDF, allowing it to be extracted. That, in and of itself, is not an issue.

However, it also OCR’ed the logo with the waves and used a Type0 font named HiddenHorzOCR. This font is embedded but has no glyphs, so all glyphs were replaced by the “undefined character”. The absence of glyph definitions was the cause of the issue you encountered with Connect. We have fixed this for the upcoming 1.1 release.

Hi Phil,

You are right, Sample01 was processed with Adobe Acrobat XI Pro to add OCR information.

I’m fully satisfied with your answer and happy that the issue has been fixed in the upcoming release, which should avoid similar trouble in the future.

In the meantime, I’ll try to remove the HiddenHorzOCR font so I can continue.