Conflict between page and boundary triggers in Text

When I set my page delimiter to “On Lines”, my boundary trigger of “On text” successfully picks up “Page 1 of” to parse my input stream into records with a varying number of pages.

But if I change my page delimiter to “On text”, my boundary trigger fails and I get a seemingly random number of pages per record.

Is there a fix for this?
This works:

This does not

I think I’m trying to create records which contain a variable number of variable length pages.

No wonder this is causing problems.

No, it shouldn’t be causing problems because variable number of pages of variable length is what the DataMapper eats for breakfast… :stuck_out_tongue:

Could you share some sample data? Would make things easier.

(Note that I have moved this topic to the DataMapper forum, as it didn’t belong in the Designer forum).

Hi @Joanne,

Have you already tried these settings?

Here is an example datafile, suitably anonymized and shortened, which exhibits the issue.

Record three is where the page length varies.
Example CLM349.txt (16.6 KB)

Those settings put the second pages of a record at the top of the following record.

But I like the idea of splitting the records based on the Page n of m, where n = m.

I guess a script to detect this?

I’d suggest keying in on a different value than the page.

The top of your data has a static string on each page. So we can use that to determine your page breaks.

From there, each page has a Claim Reference in a static location. I assume multi-page documents share the same claim number, even though that is not the case in the mocked-up sample data. I’m guessing that might just be an artifact of the anonymization.

So, below we trim the first line to get things lined up nicely. Then we set our page delimiter to be a snippet of the string at the top of each page.

Finally, we determine our record boundaries by looking at the Claim Reference number and checking whether it has changed from the previous page.
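To illustrate the idea outside the DataMapper, here is a minimal sketch in plain JavaScript of starting a new record whenever a key changes from one page to the next. The function name groupPagesByKey and the page objects are hypothetical and only for illustration; in the DataMapper itself this comparison is done through the boundary settings, not through this code.

```javascript
// Plain JavaScript sketch, not DataMapper API.
// groupPagesByKey and the shape of a "page" are assumptions for illustration.
function groupPagesByKey(pages, keyOf) {
  const records = [];
  let previousKey;
  for (const page of pages) {
    const key = keyOf(page);
    // Start a new record on the first page or whenever the key changes.
    if (records.length === 0 || key !== previousKey) {
      records.push([]);
    }
    records[records.length - 1].push(page);
    previousKey = key;
  }
  return records;
}

// Hypothetical pages, each carrying the claim reference read from it.
const samplePages = [
  { claim: "A100" }, { claim: "A100" }, { claim: "B200" }
];
const grouped = groupPagesByKey(samplePages, (p) => p.claim);
console.log(grouped.map((r) => r.length)); // [ 2, 1 ]
```

Note that this only works if every page of a document really does carry the same claim reference, which is the assumption stated above.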

Just noticed you’ve got a standard 130 lines per page as well, which means this works also:

The pages are supposed to be 130 lines long, sadly the source system incorrectly creates some pages of 132 lines. (Other source errors will no doubt come to light in the future).

For expediency I have put a conditional script in my workflow to reject any input file whose line count is not a multiple of 130.
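For anyone wanting to do the same, a check along these lines might look roughly like the sketch below. It is plain JavaScript; the function name hasWholePages is hypothetical, and how the file contents are read in the workflow is left out because that depends on the setup.

```javascript
// Sketch only: hasWholePages is a hypothetical helper, not a Workflow function.
// Returns true when the text splits into a whole number of fixed-length pages.
function hasWholePages(text, linesPerPage) {
  const lines = text.split(/\r?\n/);
  // A trailing newline produces one empty element at the end; ignore it.
  if (lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines.length > 0 && lines.length % linesPerPage === 0;
}
```

With 130 lines per page, a file containing a 132-line page fails the check unless the extra lines happen to cancel out elsewhere, so this is a coarse filter rather than a per-page validation.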

But I do wish to learn how to handle variable length pages.

I can detect the top of a record more cleanly using “BENEFIT DECISION NOTICE” on line 6.
If I set Boundaries to “On Delimiter” I get each page as a record with a consistent top of page.

But I still can’t get “Page 1 of” working as a boundary. The last record contains two “Page 1 of 1” pages.

Sorry folks, I think I may have been chasing the wrong issue.

It is only the last record in the dataset that fails to detect the boundary.

The most logical method for identifying boundaries in this datafile is to look for the Page x of y string and set a document boundary whenever x equals y (i.e. Page 1 of 1 or Page 4 of 4).

However, that string is found somewhere in the middle of a variable length page. To make sure the process isn’t thrown off by potential additional lines, you can write a script that checks for the x and y values and, when they match, waits until it finds the next header (identified, in this datafile, by the string Thurrock Council Benefits Department,) to set the document boundary.

The following script should achieve that:

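// Read the current line (the page delimiter is one line, so each "page" is a single line)
var line = boundaries.get(region.createRegion(1,1,100,1));
var re = /Page (\d+) of (\d+)/;
var matches = re.exec(line[0]);
if(matches!==null && matches.length==3 && matches[1]==matches[2]){
	// Last page of a document (x equals y): remember it, but do not set a boundary yet
	boundaries.setVariable("found", true);
	logger.info(line[0]);
} else if(boundaries.getVariable("found") && line[0].slice(0,37)=="Thurrock Council Benefits Department,") {
	// First header after the last page: the new record starts on this line
	boundaries.set(0);
	boundaries.setVariable("found", false);
}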

The script inspects each line. It checks for the Page x of y construct and, if it finds it, compares the values of x and y. If they match, it sets a variable (found) to true, but does not set a boundary yet. The script keeps processing lines and checks for the header string (“Thurrock…”). When it finds the header AND the found variable is true, it sets the boundary and resets found to false.

To make this work, make sure that:

  • The page delimiter is set to Lines, with the number of lines set to 1.
  • The trigger is set to On script, with the above script.

If any given page contains more lines than the others, the script will still work because it doesn’t look for fixed page lengths.
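If you want to try the logic outside the DataMapper, the sketch below simulates it in plain JavaScript (runnable in Node.js). The function splitRecords and the sample lines are made up for illustration; it does not use the boundaries, region or logger objects, and it only mirrors the flag-then-header idea of the script.

```javascript
// Standalone simulation, not DataMapper API. splitRecords is a hypothetical name.
function splitRecords(lines, header) {
  const pageRe = /Page (\d+) of (\d+)/;
  const records = [];
  let current = [];
  let found = false; // true once "Page x of y" with x equal to y has been seen

  for (const line of lines) {
    const m = pageRe.exec(line);
    if (m !== null && m[1] === m[2]) {
      // Last page of the record: remember it, but keep collecting lines.
      found = true;
    } else if (found && line.startsWith(header)) {
      // Next header after the last page: close the record before this line.
      records.push(current);
      current = [];
      found = false;
    }
    current.push(line);
  }
  // Whatever remains after the final header is the last record.
  if (current.length > 0) {
    records.push(current);
  }
  return records;
}

const HEADER = "Thurrock Council Benefits Department,";
const sample = [
  HEADER, "Page 1 of 2", "detail",
  HEADER, "extra line", "Page 2 of 2", "detail", "detail",
  HEADER, "Page 1 of 1",
  HEADER, "Page 1 of 1", "detail"
];
console.log(splitRecords(sample, HEADER).map((r) => r.length)); // [ 8, 2, 3 ]
```

The second page in the sample is longer than the first, and the last two records are both single “Page 1 of 1” pages, yet each still ends up in its own record because the split waits for the next header rather than counting lines.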


Thank you. This works well and is a great introduction to scripted boundaries.