When I set my page delimiter to “On Lines”, my boundary trigger of “On text” successfully picks up “Page 1 of” to parse my input stream into records with a varying number of pages.
But if I change my page delimiter to “On text”, my boundary trigger fails and I get a seemingly random number of pages per record.
I’d suggest keying in on a different value than the page.
The top of your data has a static string on each page. So we can use that to determine your page breaks.
From there, each page has a Claim Reference in a static location. I assume multi-page documents share the same claim number, even though that is not the case in the mocked-up sample data. I’m guessing that might just be an artifact of the anonymization.
So, below we trim the first line to get things lined up nicely. Then we set our page delimiter to be a snippet of the string at the top of each page.
Finally, we determine our record boundaries by looking at the Claim Reference number and checking whether it has changed from the previous page.
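The claim-change check described above can be sketched outside the datamapper in plain JavaScript. This is only an illustration under stated assumptions: the names groupPagesByClaim and claimFromLabel, and the "Claim Reference:" label format, are hypothetical and come neither from the data file nor from the Connect API.

```javascript
// Hypothetical standalone sketch: start a new record whenever the claim
// reference on a page differs from the one on the previous page.
function groupPagesByClaim(pages, extractClaim) {
  const records = [];
  let previousClaim = null;
  for (const page of pages) {
    const claim = extractClaim(page);
    // The first page always opens a record; later pages open one on change.
    if (records.length === 0 || claim !== previousClaim) {
      records.push([]);
    }
    records[records.length - 1].push(page);
    previousClaim = claim;
  }
  return records;
}

// Assumed label format for illustration only; the real file keeps the
// reference at a fixed position on the page instead.
function claimFromLabel(page) {
  const match = /Claim Reference:\s*(\S+)/.exec(page);
  return match ? match[1] : null;
}
```

One limitation of this approach: two consecutive documents that happen to carry the same claim number would be merged into a single record.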
The pages are supposed to be 130 lines long, but sadly the source system incorrectly creates some pages of 132 lines. (Other source errors will no doubt come to light in the future.)
For expediency I have put a conditional script in my workflow to reject any input file whose line count is not a multiple of 130.
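For reference, a check of that kind might look roughly like the plain JavaScript below. It is a minimal sketch, not the actual workflow script: the function name hasWholePages is hypothetical, and it assumes the whole file is available as a single string.

```javascript
// Hypothetical helper: true when the file's line count is an exact
// multiple of the expected page length.
function hasWholePages(text, linesPerPage) {
  const lines = text.split(/\r?\n/);
  // Ignore the single empty entry left behind by a trailing newline.
  if (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines.length > 0 && lines.length % linesPerPage === 0;
}
```

A file containing one 132-line page would fail this test, which is exactly why the check cannot cope with variable-length pages.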
But I do wish to learn how to handle variable length pages.
I can detect the top of a record more cleanly using “BENEFIT DECISION NOTICE” on line 6.
If I set Boundaries to “On Delimiter” I get each page as a record with a consistent top of page.
But I still can’t get “Page 1 of” working as a boundary: the last record contains two “Page 1 of 1” pages.
The most logical method for identifying boundaries in this datafile is to look for the Page x of y string and set a document boundary whenever x equals y (i.e. Page 1 of 1 or Page 4 of 4).
However, that string is found somewhere in the middle of a variable length page. To make sure the process isn’t thrown off by potential additional lines, you can write a script that checks for the x and y values and, when they match, waits until it finds the next header (identified, in this datafile, by the string Thurrock Council Benefits Department,) to set the document boundary.
The following script should achieve that:
var line = boundaries.get(region.createRegion(1, 1, 100, 1));
var re = /Page (\d+) of (\d+)/;
var matches = re.exec(line[0]);
if (matches !== null && matches.length == 3 && matches[1] == matches[2]) {
    // Last page of a document (x equals y): remember it, but set no boundary yet.
    boundaries.setVariable("found", true);
    logger.info(line[0]);
} else if (boundaries.getVariable("found") && line[0].slice(0, 37) == "Thurrock Council Benefits Department,") {
    // First header after a last page: the next document starts here.
    boundaries.set(0);
    boundaries.setVariable("found", false);
}
The script inspects each line. It then checks for the Page x of y construct and if it finds it, it compares the values of x and y. If they match, it sets a variable (found) to true, but does not set a boundary yet. The script keeps processing lines and checks for the header string (“Thurrock…”) and when it finds it AND the found variable is set to true, then the boundary is set.
To make this work, make sure that:
The page delimiter is set to Lines, with the number of lines set to 1.
The trigger is set to On script, with the above script.
If any given page contains more lines than the others, the script will still work because it doesn’t look for fixed page lengths.
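To try the same logic outside the datamapper, the boundary decision can be simulated over an array of lines in plain JavaScript. This is a sketch under assumptions: findBoundaryLines is a hypothetical name, no Connect objects are involved, and a local variable stands in for the found boundary variable.

```javascript
// Standalone simulation of the boundary logic (no Connect objects).
// Returns the zero-based indexes of the lines where the script above
// would set a boundary.
function findBoundaryLines(lines) {
  const header = "Thurrock Council Benefits Department,";
  const pageOf = /Page (\d+) of (\d+)/;
  const boundariesAt = [];
  let found = false; // stands in for boundaries.getVariable("found")

  lines.forEach(function (line, index) {
    const matches = pageOf.exec(line);
    if (matches !== null && matches[1] === matches[2]) {
      // Last page of a document seen; wait for the next header.
      found = true;
    } else if (found && line.slice(0, header.length) === header) {
      boundariesAt.push(index);
      found = false;
    }
  });
  return boundariesAt;
}
```

Feeding it pages of different lengths (for example a page with two extra lines after its "Page 2 of 2" line) shows that the boundary still lands on the following header line, since nothing in the logic counts lines.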