I’m trying to simulate the PlanetPress Suite v7 data selection Line Condition in PlanetPress Connect.
For example, the data wanders within a general geographic region, but the value is always preceded by a static label tag such as Date:. I have several different values to extract, each with its own static label tag: Name:, Address:, DOB:, etc.
You will have to use Regular Expressions for this.
For instance, if you set up a loop that goes through each line on the page, you can add a condition inside that loop that checks if the following expression is true:
/(Date: )(.*)/i.exec(data.extract(1,80,0,1,"")) != null;
You can use multiple conditions to look for all the keywords.
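To see how that condition behaves outside the DataMapper, here is a minimal sketch in plain JavaScript. The sample lines are made up and simply stand in for whatever data.extract(1,80,0,1,"") would return for each line; the helper name lineMatches is hypothetical.

```javascript
// Hypothetical helper: true when the pattern is found somewhere in the line.
function lineMatches(line, pattern) {
  return pattern.exec(line) !== null;
}

// Made-up sample lines standing in for the text extracted from each line of the page.
var sampleLines = [
  "Name: Jane Doe",
  "   date: 2015-03-01",
  "Some unrelated text"
];

// Same expression as above; the "i" flag makes the match case-insensitive.
var datePattern = /(Date: )(.*)/i;

var hits = sampleLines.filter(function (line) {
  return lineMatches(line, datePattern);
});

console.log(hits); // only the line containing "date: " is kept
```

Note that the pattern requires a space after the colon, so a line containing "Date:2015-03-01" would not hit the TRUE side.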
I must be missing something about where to add the Regular Expression, because nothing I do appears to hit the TRUE side. Also, when extracting the data from a PDF, is a height of 1 correct, or should it be something more like 9.313332 to match the line spacing the loop detects?
I can get a Multiple Conditions case set to the value “Name:” and then set the extraction to a location. I can do the same for “Address:” (though I’m not sure how to handle the additional address lines).
I worry about the performance of my version, though, because each case has a separate extraction, and in the worst case there could be close to 50 separate extractions. I’m hoping it will only be half a dozen.
OK, that’s really not an easy one. You’re looking for moving targets and those are always challenging.
Here’s how I did it in a pet project of mine. Not sure it will work with your project, but it may put you on the right track.
So first, I have a loop that goes through all elements in the PDF page. That means every single line with content will be examined by the process.
Inside the loop, I have a Multiple Condition task whose left operand is set to this JavaScript code:
data.extract(52,209,0,5,"");
The (52,209) values indicate the left-right boundaries of the region I want to examine on each and every line. 0 is the current vertical offset while 5 is the height of each line of text.
My first Case statement sets the Right operand to Invoice #. So if that piece of text is found in the data extracted in my Multiple Condition’s left operand, then this branch is executed.
I have a single extraction step inside my Case branch. It extracts the text from the same region that the condition examined, but then uses a Regular Expression to only store in my field whatever is found to the right of my keywords (i.e. Invoice #). Here’s the code I use:
var line = data.extract(52.154663,208.95732,0,5,"");
var elements = line.match(/.*(Invoice #)(.*)/);
elements[2];
This portion of my sample file looks like this:
The code above extracts all text on the line that’s highlighted, which yields the result Invoice # INV6103083, and then the regular expression takes care of discarding anything that comes before Invoice # and only keeps the value to the right: INV6103083.
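As a rough illustration of that extraction step, the sketch below applies the same idea to a plain string. The function name valueAfterLabel is hypothetical, and the null check and the trim() call are additions that are not in the code above; in the DataMapper the line would come from data.extract(...) rather than a literal.

```javascript
// Hypothetical helper: returns whatever follows the label on the line,
// or an empty string when the label is not present.
function valueAfterLabel(line, label) {
  // Escape regex metacharacters so the label is matched literally.
  var escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  var elements = line.match(new RegExp(escaped + "(.*)"));
  if (elements === null) {
    return ""; // label not found: avoid reading an index of null
  }
  return elements[1].trim(); // drop the space between the label and the value
}

console.log(valueAfterLabel("Invoice # INV6103083", "Invoice #")); // INV6103083
console.log(valueAfterLabel("Total 12.00", "Invoice #") === "");   // true
```

The guard only matters if the extraction can run on a line that does not contain the keyword; inside a Case branch that has already matched, it is just a safety net.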
I then created new cases in my Multiple Conditions step using the same technique, looking for keywords such as Due Date: or Total Amount Due.
Finally, the last step inside the loop is a Goto Step set to Next line with content.
Here’s what the simplified data mapping config looks like:
As far as performance is concerned, the process is examining each and every line, which is where most of the time is spent. There isn’t much you can do about that. However, the overall procedure could definitely be improved by not extracting each line twice (once for evaluating the condition and once when actually extracting the data), but it would have been a bit more complex to explain here.
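For what it’s worth, the “extract each line once” idea could look something like the sketch below. This is plain JavaScript working on made-up sample lines, not an actual DataMapper configuration: the label table, the field names and the parseLine helper are all hypothetical, and how you would wire this into your own steps is not shown.

```javascript
// Hypothetical table mapping each static label to the field it should fill.
var labels = [
  { label: "Invoice #", field: "invoiceNumber" },
  { label: "Due Date:", field: "dueDate" },
  { label: "Total Amount Due", field: "totalDue" }
];

// Checks one already-extracted line against every label.
// Returns { field, value } for the first label found, or null if none is present.
function parseLine(line, labels) {
  for (var i = 0; i < labels.length; i++) {
    var pos = line.indexOf(labels[i].label);
    if (pos !== -1) {
      var value = line.substring(pos + labels[i].label.length).trim();
      return { field: labels[i].field, value: value };
    }
  }
  return null;
}

// Made-up lines standing in for one extraction per line of the page.
var lines = [
  "Invoice # INV6103083",
  "Due Date: 2015-04-30",
  "Total Amount Due $1,250.00",
  "Thank you for your business"
];

var record = {};
for (var n = 0; n < lines.length; n++) {
  var hit = parseLine(lines[n], labels); // each line is read only once
  if (hit !== null) {
    record[hit.field] = hit.value;
  }
}

console.log(record);
```

Each line is read once and then tested against all labels in memory, so adding more keywords grows the label table rather than the number of extractions.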
Let me know if that helps.
Yes, this does help.
Here are my Steps. I don’t know what performance will be like.
There is a keyword that matches twice, so that’s why there is a condition within the case condition. Some items are always on the same line as other keywords, so I was able to combine the extractions on those fields.
For the 2nd and 3rd address lines, which do not have a tag, I increased the height to grab 3 lines starting at address line 1 (“Address:”), but I don’t know yet if that will be an issue, as everything seems to float depending on the data values. There are 3 extra lines for the company address, which I believe is static based on my sample data; if not, it’s at the top of the page and I may be able to extract them in the normal fashion.
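If the fixed 3-line height turns out to be a problem, one alternative I’m considering (not what my steps currently do) is to keep reading lines after “Address:” until the next labelled line shows up. Here is a rough sketch of that logic in plain JavaScript on made-up lines; collectAddress, the label pattern and the 3-line cap are all assumptions on my part.

```javascript
// Hypothetical helper: collects the text after "Address:" plus the untagged
// lines that follow it, stopping at the next "Label:" line or after maxLines.
function collectAddress(lines, maxLines) {
  var labelPattern = /^\s*[A-Za-z #]+:/; // assumed shape of a static label tag
  var address = [];
  var inAddress = false;
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (!inAddress) {
      var m = line.match(/Address:\s*(.*)/);
      if (m !== null) {
        address.push(m[1].trim());
        inAddress = true;
      }
    } else {
      if (labelPattern.test(line) || address.length === maxLines) {
        break; // next tag reached, or the address is already full
      }
      address.push(line.trim());
    }
  }
  return address;
}

// Made-up lines standing in for the extracted page content.
var pageLines = [
  "Name: Jane Doe",
  "Address: 12 Main St",
  "Suite 4",
  "Springfield, IL 62701",
  "DOB: 1980-01-01"
];

console.log(collectAddress(pageLines, 3));
```

This would only work if every untagged address line really has no colon-terminated label at its start, which I still need to confirm against more sample data.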
Let us know how it turns out and if you stumble into problems (or solutions). I think this is a great demonstration of what the DataMapper can do, but anything we could add to improve these kinds of workflows would make it even more powerful.