Trim PDF Pages Before Extracting Data

Hi,

I am very new to OL Connect. Is there a way to have Designer trim the last 2 pages from a PDF before setting my record boundaries?

Many of the PDF data files I am working with have this same structure: Summary (2 pages) > Content > Last Page Indicator (2 pages because duplex).

I am splitting the records on the text “PAGE 1 OF”, which only appears in the Content section.

Connect is already conveniently treating the Summary section as record 1 (I am guessing because it has not encountered the boundary definition yet). I am able to filter this out using the job preset settings by extracting an area which contains the words “SUMMARY” on the first page of only that record.

It is the Last Page Indicator that is creating a problem. It does not contain the boundary information “PAGE 1 OF”, so Connect is just treating it as the last two pages of the last record.
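To illustrate the behaviour described above, here is a minimal stand-alone sketch in plain JavaScript. It does not use the DataMapper API. The function name splitIntoRecords and the sample page texts are made up for the example, and it assumes a new record starts on every page that contains the boundary text.

```javascript
// Hypothetical helper (plain JavaScript, not an OL Connect API).
// Starts a new record on every page whose text contains the marker.
function splitIntoRecords(pages, marker) {
  var records = [];
  var current = [];
  for (var i = 0; i < pages.length; i++) {
    var startsRecord = pages[i].indexOf(marker) !== -1;
    if (startsRecord && current.length > 0) {
      records.push(current);
      current = [];
    }
    current.push(pages[i]);
  }
  if (current.length > 0) records.push(current);
  return records;
}

// Made-up page texts that mimic the file structure described above.
var samplePages = [
  "SUMMARY", "SUMMARY (back)",
  "PAGE 1 OF 2", "PAGE 2 OF 2",
  "PAGE 1 OF 1",
  "LAST PAGE", "LAST PAGE (back)"
];
var records = splitIntoRecords(samplePages, "PAGE 1 OF");
```

In this sketch the two Summary pages form the first record because no boundary has been seen yet, and the two trailing pages end up attached to the final content record, which matches the problem described.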

I would do it in Workflow using the Alambic API, like so:

// File System Object
var fs = new ActiveXObject('Scripting.FileSystemObject');

// PDFs declaration
var originalPDF = Watch.ExpandString("%o");

// Load the original PDF
var pdf = Watch.GetPDFEditObject();
pdf.Open(originalPDF, false);

// Check the number of pages and delete the last two if there are more than 2 pages
var nbPages = pdf.Pages().Count();
Watch.Log("nbPages at first= " + nbPages, 2);

if (nbPages > 2) {
  // Page indexes are treated as 0-based here; delete from the end
  // so the indexes of the remaining pages do not shift
  for (var i = nbPages - 1; i >= nbPages - 2; i--) {
    pdf.Pages().Delete(i);
  }
}

nbPages = pdf.Pages().Count();

Watch.Log("nbPages at end= " + nbPages, 2);
// Saves the PDF
pdf.Save(false);
CollectGarbage();
pdf.Close();


Thank you! I am not too familiar with Workflow yet, but I will give this a shot after I learn the basics and update you then.

Update as promised. I never did attempt that Workflow solution. Instead, I was able to script a second record boundary in the DataMapper that breaks the “Last Page” section into its own record. I then created extraction fields to extract keywords that only appear on the Summary and Last Page records, and filtered those records out in the Job profile. I do appreciate your solution though, and I may put it to use if the occasion arises.
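For anyone following along, the idea can be sketched in plain JavaScript. This is not the actual DataMapper boundary script or the Job settings. The helper names, the marker text "LAST PAGE INDICATOR" and the sample pages are all assumptions made for the illustration.

```javascript
// Hypothetical helpers (plain JavaScript, not OL Connect APIs).

// True when the page text contains at least one of the given strings.
function containsAny(text, needles) {
  for (var i = 0; i < needles.length; i++) {
    if (text.indexOf(needles[i]) !== -1) return true;
  }
  return false;
}

// Starts a new record on every page that contains any boundary marker.
function splitOnMarkers(pages, markers) {
  var records = [];
  var current = [];
  for (var i = 0; i < pages.length; i++) {
    if (containsAny(pages[i], markers) && current.length > 0) {
      records.push(current);
      current = [];
    }
    current.push(pages[i]);
  }
  if (current.length > 0) records.push(current);
  return records;
}

// Keeps only records whose first page has none of the filter keywords.
function dropFlaggedRecords(records, keywords) {
  var kept = [];
  for (var i = 0; i < records.length; i++) {
    if (!containsAny(records[i][0], keywords)) kept.push(records[i]);
  }
  return kept;
}

// Made-up page texts; "" stands for a blank duplex back side.
var pages = [
  "SUMMARY", "",
  "PAGE 1 OF 2", "PAGE 2 OF 2",
  "PAGE 1 OF 1",
  "LAST PAGE INDICATOR", ""
];
var allRecords = splitOnMarkers(pages, ["PAGE 1 OF", "LAST PAGE INDICATOR"]);
var contentRecords = dropFlaggedRecords(allRecords, ["SUMMARY", "LAST PAGE INDICATOR"]);
```

With the second marker in place, the trailing pages become their own record, so a keyword filter can drop them together with the Summary record and leave only the content records.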

Interesting approach! It might be worth adding an option to trim a document by a fixed number of pages as part of the Boundaries feature in DataMapper. I’m not sure how common this use case is, but it doesn’t sound particularly complex to implement.

Depending on the scenario, another approach would be to run the data through an OL Connect template after setting the initial boundaries on the “Summary” text. You could store the resulting page count (steps.totalPages) in a data field, for example pageCount. In the template, the section background could be set to the PDF generated by DataMapper for that record. A script could then calculate the page range, for example setting the last page to record.fields.pageCount - 2 (the subtraction could also be done in DataMapper).

Admittedly, this will take a bit more processing time, as the data needs to run through a merge engine.
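The page-range calculation itself is simple arithmetic. Here is a minimal sketch in plain JavaScript, with a hypothetical helper name and no template API calls. In a real template script the page count would come from record.fields.pageCount, and applying the range to the section background is left out here.

```javascript
// Hypothetical helper (plain JavaScript, not an OL Connect API).
// Returns the 1-based first and last page to keep, or null when
// nothing would be left after trimming the trailing pages.
function pageRangeWithoutTrailing(pageCount, trailingPages) {
  var last = pageCount - trailingPages;
  if (last < 1) return null;
  return { first: 1, last: last };
}

// Example: a 10-page record with 2 trailing pages to skip.
var range = pageRangeWithoutTrailing(10, 2);
```

Returning null for very short records is a design choice in this sketch, so that a record made up only of trailing pages is not given an invalid range.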

Erik

This would help if it were added to the DataMapper. I used Workflow to get rid of unwanted trailing pages.

Regards,
S
