Multi PDF Input

Hi,

We have a job where we can get 5000+ PDFs, and we're wondering what the best way is to process them through PReS. Each PDF is a record, and we need to merge packs for the same address into the same pack. At the moment the options I can see are:

Read in each PDF and run it through the data mapper without going into a template, then do a Retrieve Items to get all the files, and then run them through the template, etc.

Read in each PDF, add white text to the page to flag the start of a record, then output back to PDF for each input PDF. Merge the PDFs using the merge module, read them back in with another data mapper, and then merge, etc.

We've had problems in the past with the PDF merge, but the Retrieve Items approach can be very slow. I've seen that Workflow is still a 32-bit app, so it has a low limit on the amount of RAM it can use, which could cause issues when we have a lot of records.

Any thoughts on which would be the best route, or any other suggested ways of doing this?

James

Hi @jbeal84 ,

I assume that you would like to group records based on the record field “Address” and that you would like to split these groups (also called Document Sets) into separate PDF documents based on the record field “Address”. Can you please confirm whether this is correct?

Please note that, if the above is correct, you can use the following settings in a Job Creation Preset and an Output Creation Preset to group records on the record field “Address”:

  1. Connect Designer > File > Print Presets > Job Creation Settings…
    • Configuration Selection tab:
      • Use grouping [Checked]
    • Grouping Options tab:
      • Document Set Grouping Fields:
        • Move the record field “Address” from the list of Available Fields to the list of Selected Fields by clicking on the “Add” button in between these two lists.
  2. Connect Designer > File > Print Presets > Output Creation Settings…
    • Print Options tab:
      • Separation [Checked]
    • Separation Options tab:
      • Separation Settings:
        • Separation [Document Set]

P.S. A Job Creation Preset file can be used by the Create Job Workflow plugin, and an Output Creation Preset file can be used by the Create Output Workflow plugin.
P.P.S. After creating the Job Creation Preset and the Output Creation Preset, you can send these two files to the Workflow service via: Connect Designer > File > Send to Workflow.

I think @jbeal84’s request has to do with merging existing PDFs, rather than splitting a large PDF into grouped PDFs. If that’s the case, then it can be achieved with a Workflow process, without involving the DataMapper or a Connect template.

The idea is to get a listing of all PDFs that need to be merged based on the postal address, extract that address and concatenate the files that have the same address into single output PDFs. Here’s a sample process that would do that:

The key part of the process is the short script highlighted in yellow (see the code below).

That script reads the postal address from the input PDF and checks whether that address has been recorded before. If it has, the script assigns the same output file name so that the Send To Folder task concatenates the current PDF with all the other ones that had the same postal address. If the address is encountered for the first time, the script generates a new, unique output file name (using %u) and records that name, along with the associated postal address, in the allFiles variable (which is just a JSON array).

Here’s the script:

// Load the list of existing addresses and their corresponding file names
var allFiles = JSON.parse(Watch.GetVariable("allFiles"));
var inputFile = Watch.GetVariable("inputFile");

// Read the postal address from the current PDF
var inputPDF = Watch.GetPDFEditObject();
inputPDF.Open(inputFile, false);
var oneAddress = inputPDF.Pages(0).ExtractText2(0.40625, 1.67708, 4.83333, 2.52083).replace(/\n/g, "_");
CollectGarbage();
inputPDF.Close();

// Check if the address has already been recorded
var existingAddress = allFiles.filter(function (elem) {
  return elem.address == oneAddress;
});

// Reuse the existing file name; if this is the first time the address is
// encountered, record it and its corresponding file name in the allFiles variable
if (existingAddress.length == 1) {
  Watch.SetVariable("fileName", existingAddress[0].fileName);
} else {
  var newAddress = { address: oneAddress, fileName: Watch.ExpandString("%u") };
  allFiles.push(newAddress);
  Watch.SetVariable("fileName", newAddress.fileName);
  Watch.SetVariable("allFiles", JSON.stringify(allFiles));
}

Obviously, this code generates random (albeit unique) output file names, so you may want to adjust that using some of your own logic. In addition, the postal address is read from a specific location on Page 1 of each PDF, so you will also have to adjust that to match your own PDF files.
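For instance, if you wanted predictable file names instead of %u, one option is to derive the name from the extracted address itself. This is only a sketch of that idea (the sanitizing pattern and the .pdf extension are my own assumptions, not part of the original process):

// Hypothetical alternative to %u: build a stable file name from the address
// extracted earlier in the script (oneAddress), replacing characters that are
// not safe in file names with underscores.
var safeName = oneAddress.replace(/[^A-Za-z0-9]+/g, "_") + ".pdf";
Watch.SetVariable("fileName", safeName);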

Note that the allFiles variable has a default value of [] (i.e. an empty array). I could have implemented some code in the script to check whether or not the variable has already been initialized, but I went for the easier method instead.
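If you would rather not rely on that default value, the first line of the script could be replaced with a small guard along these lines (a sketch, assuming the variable simply comes back empty when it hasn't been initialized):

// Hypothetical initialization guard: fall back to an empty array when the
// allFiles variable has not been set to [] beforehand.
var raw = Watch.GetVariable("allFiles");
var allFiles = (raw && raw.length > 0) ? JSON.parse(raw) : [];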

Also note that the two Change Emulation tasks are critical: the main branch loops on XML data while the script needs to read PDF data, so you have to make sure the process's emulation is aware of those changes in data format along the way.

Hi Phil,

Thanks for this. It isn't quite what I wanted, but I can use it to achieve what I want with a slight amendment.

James