I have a job that should merge hundreds of PDF files, and the result should be a single PDF file. The current workflow is to take one PDF, create content, set the job ID for the content set, take another PDF, create content, set the job ID… you know how this works.
I would like to speed things up by using a self-replicating process and processing the PDF files in parallel. Setting up a self-replicating job is not a problem, and creating a separate process that restores the documents from memory based on the jobID and creates the output is not a problem either. The problem is knowing when all the PDF files have been processed by the self-replicating process, and therefore when to start output creation. Can this be automated?
The problem you face with that kind of process is that you don't initially know how many files will be processed. Even if you mass-copy all PDFs into a folder at once, the process will have time to start before all the PDFs have actually been written to that folder (the PDFs don't all appear in the folder simultaneously; there is a delay between the first and the last file).
Therefore, the way I would approach it is to have the self-replicating process create a dummy text file (e.g. “completed.txt”) in the same folder where your PDFs are originally stored. That file will be overwritten as each instance of the self-replicating process runs.
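If it helps to see the idea in code, here is a minimal standalone Python sketch of what each branch of the self-replicating process would do when it finishes a PDF. The hot-folder path and the completed.txt name are assumptions on my part; in Workflow you would typically do this with a small Run Script task instead.

```python
import os
import time

# Assumptions: the hot-folder path and the marker file name are examples only.
HOT_FOLDER = r"C:\PDF_IN"
MARKER = os.path.join(HOT_FOLDER, "completed.txt")

# Each instance of the self-replicating process simply (re)writes the marker,
# so its timestamp always reflects the most recently finished PDF.
with open(MARKER, "w") as marker:
    marker.write("last PDF finished at %s\n" % time.ctime())
```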
You would then have a different process that starts with a File Count input. It should monitor that same folder and it should have the following extension mask (treated as a regular expression): \.txt|\.pdf
which means it will count the total number of PDF and TXT files. If that count is equal to exactly 1, then it means only the TXT file remains in that folder and you can then kick off your output creation process. The nice thing about the File Count input is that if the condition isn’t met, then the process doesn’t run needlessly.
Make sure to delete the TXT file immediately after you have validated that it is the only remaining file in the folder, so that the File Count task doesn't find it again on its next run.
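Purely as an illustration of the logic the File Count input handles for you, here is a minimal Python sketch of that check, assuming the same hot folder and marker name as in the sketch above:

```python
import os
import re

HOT_FOLDER = r"C:\PDF_IN"          # assumption: same hot folder as above
MARKER_NAME = "completed.txt"

# Same extension mask as the File Count input, treated as a regular expression.
mask = re.compile(r"\.txt|\.pdf", re.IGNORECASE)

matching = [name for name in os.listdir(HOT_FOLDER) if mask.search(name)]

if len(matching) == 1 and matching[0].lower() == MARKER_NAME:
    # Only the marker file is left: every PDF has been picked up.
    os.remove(os.path.join(HOT_FOLDER, MARKER_NAME))   # clean up for the next cycle
    print("All PDFs processed - launch output creation")
else:
    print("%d matching file(s) still present - wait for the next poll" % len(matching))
```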
Using this method, there is an off chance that the last remaining PDF has not finished processing when the output creation process launches. One way to mitigate this is to look at the timestamp of the TXT file (the Folder Listing task would allow you to get that timestamp). You could delay the launch of the process until the file is at least, say, 2 minutes old (depending on how long it takes to process a single PDF file).
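Again only to make the timing check explicit (in Workflow, the Folder Listing task gives you the timestamp), here is a small Python sketch that gives the green light only once the marker has not been touched for two minutes:

```python
import os
import time

MARKER = r"C:\PDF_IN\completed.txt"   # assumption: same marker file as above
MIN_AGE_SECONDS = 120                 # "at least 2 minutes old"

age = time.time() - os.path.getmtime(MARKER)

if age >= MIN_AGE_SECONDS:
    print("Marker is %d seconds old - safe to start output creation" % age)
else:
    print("The last PDF may still be processing - try again later")
```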
Not overly complex to implement, but it still requires a bit of work. That's to be expected, though, when working with a mix of parallel and sequential processes.
That said, perhaps other users have dealt with this kind of requirement in a different way, so it would be interesting to learn how they went about it.
Personally, I find that dealing with batches is ultimately much faster. If the resulting size of your batches is not too large for PlanetPress, I'd start with a scripted PDF merge and then produce the output for the whole batch. Can you clarify whether there's anything about your job that requires each PDF to be processed individually first?
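As an example of what a scripted merge could look like, here is a minimal sketch using the pypdf library (my assumption; any PDF library you already have available would do just as well):

```python
import glob
from pypdf import PdfWriter   # assumption: pypdf is installed (pip install pypdf)

# Collect the batch of PDFs; the folder and sort order are examples only.
pdf_files = sorted(glob.glob(r"C:\PDF_IN\*.pdf"))

writer = PdfWriter()
for path in pdf_files:
    writer.append(path)       # append every page of each input PDF

with open(r"C:\PDF_OUT\merged.pdf", "wb") as out:
    writer.write(out)
```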
You mentioned you want to "merge hundreds of PDF files and the result should be a single PDF file".
Would the following process not achieve it?
Folder Capture that captures a trigger file (such as GO.txt) to initiate the process once all the PDFs have been copied over
Folder Listing (with a *.pdf mask), which creates a listing of all the PDFs in the folder
Then All in One to merge and generate a single PDF:
The datamapper will be based on the XML file generated by the Folder Listing, with each record being a PDF.
The Template will simply go look for each PDF and apply it as a dynamic section background. Here you could have several merge engines working simultaneously to create the print content.
Create Job in Passthrough mode (or use a job preset config if you want to sort, filter, group, etc. the PDFs; the information to do so should be available in the PDF names, which can be extracted in the datamapper).
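To illustrate the last point about pulling sort/group information out of the file names, here is a small Python sketch assuming a purely hypothetical naming convention such as INV_12345_2024-05-01.pdf (in a real setup this extraction would be done in the datamapper itself):

```python
import re

# Hypothetical naming convention: DOCTYPE_ACCOUNT_DATE.pdf
pattern = re.compile(r"^(?P<doctype>[A-Z]+)_(?P<account>\d+)_(?P<date>\d{4}-\d{2}-\d{2})\.pdf$")

name = "INV_12345_2024-05-01.pdf"     # example file name only
match = pattern.match(name)
if match:
    print(match.groupdict())          # {'doctype': 'INV', 'account': '12345', 'date': '2024-05-01'}
```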
@jouberto I use this technique every time I can, but in some jobs the filename is important and we can't merge all the PDF files before the datamapper. Yes, I could create a TXT file during the merge with filename/page number pairs and use it to split the input file, but I was hoping there is a better/easier solution.
@Rod That's great when I don't need to extract data from the PDFs. Unfortunately, that is often required.