Enrich offline then get data back

I don't know how to begin…

The extracted DataMapper output (as CSV) should be enriched by an external process, then sent back to the same or another workflow.

Reading a PDF, extracting some data, sorting it, adding some data, and then "stamping" some of that new data back into those documents.

Example:

A document (PDF) containing several addresses, stored in extracted fields.
Externally (manually) created geo-coordinates for each document ID, along with a geo-map image (and/or just plain text).

The next process should insert that image (or some text) into the original PDF.

How can I reference the original metadata in a subsequent process?

Ralf.

Not sure I correctly understand your request, but it would seem like a fairly simple procedure:

  • Make sure that in your initial DataMapper configuration, you have already created two fields (e.g. “geo_coordinates” and “geo_map”). Those fields will be empty in all records since the original data doesn’t contain those values, but that’s OK since you will be adding the values through Workflow.
  • In Workflow, after the data is extracted, use a script to store the geo coordinates and the name of the image file into those two fields in the metadata (this script obviously depends on where/how that information is stored externally; see the sketch after this list).
  • When executing the Create Content task, make sure to tick the “Update Records from metadata” option, which will ensure that the values you added to the metadata are stored back in the database’s original records.
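
For step 2, such a script could look roughly like this. This is an untested VBScript sketch for a Run Script task: the MetadataLib COM object and Watch.GetMetadataFilename are my assumptions about the Workflow scripting API (verify the exact names in the documentation), and the hard-coded geo values are placeholders you would replace with your external lookup:

    ' Minimal sketch: write values into the two (empty) DataMapper fields in the metadata.
    ' MetadataLib / Watch.GetMetadataFilename are assumed from the Workflow scripting API.
    Option Explicit

    Dim metaFile, group, doc, i, j

    Set metaFile = CreateObject("MetadataLib.MetaFile")
    metaFile.LoadFromFile Watch.GetMetadataFilename

    For i = 0 To metaFile.Job.Count - 1
      Set group = metaFile.Job.Group(i)
      For j = 0 To group.Count - 1
        Set doc = group.Document(j)
        ' Fill the two fields created (empty) in the DataMapper configuration
        doc.Fields.Add "_vger_fld_geo_coordinates", "50.1109,8.6821"    ' placeholder value
        doc.Fields.Add "_vger_fld_geo_map", "C:\maps\doc_0001.png"      ' placeholder value
      Next
    Next

    metaFile.SaveToFile Watch.GetMetadataFilename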

This is a very high-level view of how to achieve what you want, but it should hopefully get you going.

Phil, thanks for your reply; it seems I’ll owe you more than one beer…

Currently I’m working on the following:

  1. First step already done (that was my intention anyway).
  2. Data extraction is done in the DataMapper, saving those values (IDs, addresses, and two empty fields) to an external datasource along with the metadata.
  3. Updating those empty fields in the external datasource with the appropriate values.
  4. In another workflow: re-reading the original file and its metadata from step 2, plus the updated datasource.
  5. Updating the metadata with the values from the datasource (in a workflow script).
  6. Doing the rest (Job Creation, Output Creation).

Ralf.

I’m nearly there…

I. First workflow, processing the PDF

  1. Doing the data mapping, adding the empty fields
  2. Exporting the metadata to a temporary XML file (meta.export)
  3. XSLT transformation of that XML file to CSV, containing the ID and those empty fields
  4. Storing that CSV in a separate MySQL schema table (named after the original filename) via LOAD DATA INFILE… extremely fast (see the sketch after this list)
  5. Exporting the metadata and the original file to a temp folder
  6. Offline process: enriching that CSV
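
For reference, step 4 boils down to something like this. It is an untested VBScript sketch using ADODB/ODBC from a Run Script task: the DSN, table layout and CSV delimiters are placeholders for my setup, and Watch.GetOriginalFileName / Watch.GetJobFileName are the Workflow script API calls as I understand them:

    ' Sketch of step 4: create a per-file table and bulk-load the CSV produced by the XSLT.
    ' Assumes the CSV is the current job file at this point and is readable by the MySQL
    ' server (secure_file_priv); otherwise use LOAD DATA LOCAL INFILE with local_infile enabled.
    Option Explicit

    Dim conn, tableName, csvPath, sql

    tableName = "job_" & Replace(Watch.GetOriginalFileName, ".", "_")   ' table named after the original file
    csvPath   = Replace(Watch.GetJobFileName, "\", "/")                 ' forward slashes for the MySQL path literal

    Set conn = CreateObject("ADODB.Connection")
    conn.Open "DSN=enrichment_db"                                       ' placeholder MySQL ODBC DSN

    conn.Execute "CREATE TABLE IF NOT EXISTS `" & tableName & "` (" & _
                 " doc_id BIGINT PRIMARY KEY," & _
                 " geo_coordinates VARCHAR(64)," & _
                 " geo_map VARCHAR(255))"

    sql = "LOAD DATA INFILE '" & csvPath & "' " & _
          "INTO TABLE `" & tableName & "` " & _
          "FIELDS TERMINATED BY ';' ENCLOSED BY '""' " & _
          "LINES TERMINATED BY '\n' " & _
          "(doc_id, geo_coordinates, geo_map)"
    conn.Execute sql

    conn.Close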

II. Second workflow, awaiting that “enriched” CSV

  1. Loading the CSV into a temporary MySQL table via LOAD DATA INFILE, then updating the table from I.4 with an INNER JOIN (also incredibly fast)
  2. Fetching the original file and metadata (I.5) via Folder Capture and Metadata File Manager
  3. Iterating through the metadata, fetching the values from the MySQL table and writing them to the empty metadata fields: thisDoc.Fields.Add(“_vger_fld_myfield”, mysqlrecordset(“myfield”)) (see the sketch after this list)
  4. Create Print Content (with “Update records from metadata” enabled)
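
For completeness, steps 1 and 3 come down to something like this. Again an untested VBScript sketch: it assumes the enriched CSV has already been bulk-loaded into a temporary table tmp_enriched using the same LOAD DATA approach as in I.4, and the Metadata API accessors, the DSN, the ID field name and the table/column names (job_original_pdf stands for the table named after the original file) are placeholders to be adapted:

    ' II.1 - merge the enriched values from tmp_enriched into the job table,
    ' II.3 - copy the values from that table into the (still empty) metadata fields.
    Option Explicit

    Dim conn, rs, metaFile, group, doc, i, j, docId

    Set conn = CreateObject("ADODB.Connection")
    conn.Open "DSN=enrichment_db"                        ' placeholder DSN

    conn.Execute "UPDATE `job_original_pdf` d " & _
                 "INNER JOIN `tmp_enriched` t ON t.doc_id = d.doc_id " & _
                 "SET d.geo_coordinates = t.geo_coordinates, d.geo_map = t.geo_map"

    Set metaFile = CreateObject("MetadataLib.MetaFile")
    metaFile.LoadFromFile Watch.GetMetadataFilename

    For i = 0 To metaFile.Job.Count - 1
      Set group = metaFile.Job.Group(i)
      For j = 0 To group.Count - 1
        Set doc = group.Document(j)
        docId = doc.Fields.Item("_vger_fld_DocumentID")  ' assumed name of the ID field
        Set rs = conn.Execute("SELECT geo_coordinates, geo_map FROM `job_original_pdf` " & _
                              "WHERE doc_id = " & docId)
        If Not rs.EOF Then
          ' the trailing & "" coerces NULL values to an empty string
          doc.Fields.Add "_vger_fld_geo_coordinates", rs("geo_coordinates").Value & ""
          doc.Fields.Add "_vger_fld_geo_map", rs("geo_map").Value & ""
        End If
        rs.Close
      Next
    Next

    metaFile.SaveToFile Watch.GetMetadataFilename
    conn.Close

With 20k documents, one SELECT per document is slow; reading the whole table into a Scripting.Dictionary first and looking the values up in memory would be quicker.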

And there it fails. The metadata looks good, but:

[0010] W3001 : Error while executing plugin: HTTP/1.1 500 There was an error running the content creation process caused by ApplicationException: No record found with ID 1809031 (SRV000022)

I don’t know where I scrambled the record IDs since I’m reading the original file back (II.2).

Any idea?

Ralf.

Addendum: it works with smaller datasets, but with 20k rows (an input PDF with 180k pages) the above error occurs.

Hard to say, but one thing to look at is not the length of the job itself, but rather the duration of the entire process. You don’t mention whether these steps are performed immediately or over a period of hours or days.

Check that your cleanup service isn’t running in between processes as this could explain missing records. You can change the frequency of the cleanup process in the Server Preferences. The easiest way to test if the service is the source of the issue is to disable it for a while and see if all your jobs are processed properly. If they are, then it means the service runs too frequently, so you should adjust its schedule.

Phil, thanks again for your comments.
No cleanup process runs in between, and it shouldn’t matter anyway, since I’m re-reading the original source and its metadata.

I think you’re going to have to open a call with our Support team as this will require some more in-depth investigating.