Boundary Script Possibilities

Is it even possible to do the following within a boundary script for a multi-page PDF?

:white_check_mark: Every record has a minimum of 2 pages
  • A record can never be single‑page.
  • If the logical content is only 1 page, the following page must still be included.
:white_check_mark: The first page of a record always contains a name/address block
  • It may contain only the name/address
  • Or the name/address plus additional content
:white_check_mark: A blank page is part of the record
  • Blank pages are not separators to be discarded
  • The blank page belongs to the record
  • The page after the blank page starts the next record
:white_check_mark: Name/address matching is meaningful but not absolute
  • A mismatch does not automatically end a record
  • Context determines ownership
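
To make the rules above concrete, here is a minimal sketch in Python (not the boundary-script language itself) of the blank-page rule combined with the 2-page minimum. It assumes some earlier step has already flagged each page; the `Page` class, its `has_address` and `is_blank` flags, and the `split_records` function are hypothetical names used only for illustration. It deliberately does not use the name/address match to decide anything, since the rules say a mismatch is not decisive, so a record that does not end on a blank page would simply be merged into the following one until another signal is added.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Hypothetical per-page flags, assumed to be produced by an earlier step.
    has_address: bool  # page contains a name/address block
    is_blank: bool     # page is completely blank


def split_records(pages: List[Page]) -> List[List[int]]:
    """Group page indexes into records using only the blank-page rule."""
    records: List[List[int]] = []
    current: List[int] = []
    for index, page in enumerate(pages):
        # The page always joins the current record, blank or not.
        current.append(index)
        # A blank page closes the record, but never at fewer than 2 pages.
        if page.is_blank and len(current) >= 2:
            records.append(current)
            current = []
    # Pages left over after the last blank page form the final record.
    if current:
        records.append(current)
    return records


address = Page(has_address=True, is_blank=False)
detail = Page(has_address=False, is_blank=False)
blank = Page(has_address=False, is_blank=True)

pages = [address, blank, address, detail, detail, blank, address, blank]
records = split_records(pages)
page_counts = [len(record) for record in records]
```

With this sample input the pages group as 2, 4 and 2 pages. The blank page stays inside the record it closes, and the page after it opens the next record.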

Hello @marrd,

Given the provided information, I would start with the question “What distinguishes each record?”. For example, the following two statements taken together aren’t a very reliable basis for a boundary:

  • The first page of a record always contains an address block.
  • A change in address block does not always indicate the start of a new record.

This is the problem. It is a single PDF in which the page count varies from record to record.

All records are a minimum of 2 pages, but can be up to 30 pages.

Some records start with a page that is blank except for a name/address block. Others start with a full invoice (including a name/address block).

Some records end with a completely blank page, others end with additional invoice detail pages and/or terms and conditions.

Invoices within the PDF are from different companies and have completely different layouts, so it is impossible to set standard boundaries.

@marrd,

Even if we assume this could be doable, I would hate to see how you would then manage to extract the actual data values given that the formats of the invoices in the input file are so different.

That said, one of the main problems with determining your data boundaries is that the only “consistent” piece of information across all invoices is an address block on the first page… but address blocks are rarely formatted with consistency, especially if you’re dealing with addresses from multiple countries, on multiple invoice formats. So being able to actually recognize a postal address amidst all other information is going to prove extremely challenging.

Who (or what process) creates this mixed bag of invoices? Do you have any control over it? Is there any metadata inside the PDF that might contain information about each invoice?

Unfortunately, I have no control over the PDF creation, as it comes from a third party.

I don’t actually need to extract anything from the PDF itself. What I need to know is the total number of records and number of pages per record.

Pages are printed duplex (as mentioned earlier each record is a minimum of 2 pages), and I need to put an inserting machine barcode on the grouped output PDF.

PDF IN → SPLIT INTO RECORDS → CALC TOTAL RECORDS/PAGES PER RECORD → GENERATE/APPEND INTEL BARCODE → PDF OUT
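
For the CALC step in that flow, a minimal sketch in Python, assuming the split has already produced a list of page counts, one per record. The `summarize` function and its dictionary keys are hypothetical names, and the sheet count assumes each record starts on a new duplex sheet, which is an inference from the duplex remark rather than something stated outright. Barcode generation is left out because the format depends on the inserting machine.

```python
from typing import Dict, List, Union

Summary = Dict[str, Union[int, List[int]]]


def summarize(page_counts: List[int]) -> Summary:
    """Total records, total pages, and per-record page and sheet counts."""
    # Every record must have at least 2 pages, per the stated rules.
    for count in page_counts:
        if count < 2:
            raise ValueError("a record can never have fewer than 2 pages")
    return {
        "total_records": len(page_counts),
        "total_pages": sum(page_counts),
        "pages_per_record": list(page_counts),
        # Duplex printing: two pages per sheet, rounded up.
        "sheets_per_record": [(count + 1) // 2 for count in page_counts],
    }


summary = summarize([2, 4, 3])
```

The per-record page or sheet count is typically the value the barcode would need to carry, but which one applies is something to confirm against the inserter's specification.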