Is it even possible to do the following within a boundary script for a multi-page PDF?
Every record has a minimum of 2 pages
  A record can never be single‑page.
  If the logical content is only 1 page, the following page must still be included.

The first page of a record always contains a name/address block
  It may contain only name/address
  Or name/address plus additional content

A blank page is part of the record
  Blank pages are not separators by exclusion
  The blank page belongs to the record
  The page after the blank page starts the next record

Name/address matching is meaningful but not absolute
  A mismatch does not automatically end a record
  Context determines ownership
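
To make the rules above concrete, here is a rough Python sketch of how the grouping could work once each page has already been classified. It is only one possible reading of the rules, not a boundary script for any particular product. The `Page` class, its `is_blank` and `address_key` fields, and `group_pages` are all hypothetical names, and detecting blank pages or address blocks in the first place is assumed to happen elsewhere. "Context determines ownership" is simplified here to: a differing name/address only starts a new record once the current record already has its minimum 2 pages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    # Hypothetical per-page classification, produced by some earlier step.
    is_blank: bool = False
    address_key: Optional[str] = None  # normalised name/address, None if no block found


def group_pages(pages: List[Page], min_pages: int = 2) -> List[List[int]]:
    """Group 0-based page indexes into records using the rules listed above."""
    records: List[List[int]] = []
    current: List[int] = []
    current_key: Optional[str] = None

    for index, page in enumerate(pages):
        if page.is_blank:
            # A blank page belongs to a record and closes it;
            # the page after it starts the next record.
            if current:
                current.append(index)
                records.append(current)
                current = []
            elif records:
                records[-1].append(index)  # second blank in a row: keep with previous record
            else:
                current.append(index)      # leading blank: nothing earlier to attach it to
            continue

        # Simplification (assumption): a different address only ends a record
        # once that record has reached the minimum page count.
        starts_new = (
            len(current) >= min_pages
            and page.address_key is not None
            and page.address_key != current_key
        )
        if starts_new:
            records.append(current)
            current = []
        if not current:
            current_key = page.address_key
        current.append(index)

    if current:
        records.append(current)
    return records
```

Note the obvious weak spot: two consecutive records for the same name/address with no blank page between them would be merged, so this would need extra context to be reliable.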
Even if we assume this could be doable, I would hate to see how you would then manage to extract the actual data values given that the formats of the invoices in the input file are so different.
That said, one of the main problems with determining your data boundaries is that the only “consistent” piece of information across all invoices is an address block on the first page… but address blocks are rarely formatted with consistency, especially if you’re dealing with addresses from multiple countries, on multiple invoice formats. So being able to actually recognize a postal address amidst all other information is going to prove extremely challenging.
Who (or what process) creates this mixed bag of invoices? Do you have any control over it? Is there any metadata inside the PDF that might contain information about each invoice?
Unfortunately I have no control over the PDF creation, as it comes from a third party.
I don’t actually need to extract anything from the PDF itself. What I need to know is the total number of records and number of pages per record.
Pages are printed duplex (as mentioned earlier each record is a minimum of 2 pages), and I need to put an inserting machine barcode on the grouped output PDF.
PDF IN → SPLIT INTO RECORDS → CALC TOTAL RECORDS/PAGES PER RECORD → GENERATE/APPEND INTEL BARCODE → PDF OUT
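
The counting step in that flow is the easy part once the records are known. A small Python sketch, assuming the split step yields each record as a list of 0-based page indexes (the `summarise` function and the dictionary keys are made-up names, and what the inserter barcode actually needs to encode is not covered here). The sheet count assumes plain duplex printing, i.e. two pages per sheet.

```python
from math import ceil
from typing import Dict, List


def summarise(records: List[List[int]]) -> Dict[str, object]:
    """Return total records, total pages, and per-record page/sheet counts."""
    rows = []
    for number, page_indexes in enumerate(records, start=1):
        page_count = len(page_indexes)
        rows.append({
            "record": number,                    # 1-based sequence number
            "first_page": page_indexes[0] + 1,   # 1-based page number in the input PDF
            "pages": page_count,
            "sheets": ceil(page_count / 2),      # duplex: two pages per sheet
        })
    return {
        "total_records": len(records),
        "total_pages": sum(row["pages"] for row in rows),
        "records": rows,
    }
```

Each row could then feed whatever barcode generation is applied to the grouped output PDF.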