Hey guys, I get an error when processing PDFs using AlambicEdit. The PDFs are optimized, but I still receive an error after 6-7 minutes of processing through the script. The error line indicates that InsertFrom2 is causing problems. I checked Resource Monitor: ppwcfg.exe keeps building up memory and throws the error at around 1 GB, even though I have more than enough free memory.
Error info is:
[0082] W3602 : Error 0 on line 318, column 13: AlambicEdit.AlambicEditPDF.1: Error inserting pages from 'source': Out of memory.
The scary thing about that error message is that it occurs on line 318 of the script… which means you have a very long script that undoubtedly does tons of things. It is an almost impossible task to debug these beasts on a forum like this one.
Here are a few pointers, however:
Workflow is a 32-bit application, so regardless of the total amount of memory on your system, it can't use more than 4 GB. In theory, variables cannot exceed 2 GB either, but in practice the limit is closer to 1 GB.
You are apparently copying pages from one PDF to another, and it's safe to assume you do that in a loop. If you are unable to do this in a single pass, you could make sure that both PDFs are closed every X pages or so, then reopened to resume the operation. This should allow the memory to be reclaimed each time the files are closed and reopened (see the sketch after these pointers).
Copying PDF pages is memory- and CPU-intensive: a page usually references several cached resources (think logos and fonts) that are stored once earlier in the file and reused across many pages. When you copy pages individually, you bring over all those resources for every single page you copy. This ultimately gets optimized when you save the receiving file (with pdf.save(true)), but until then you can have hundreds or thousands of copies of the same resources on the receiving end. So saving/optimizing the file from time to time during the copy process helps streamline the contents of the file. For the same reason, whenever possible, copy adjacent pages in batches instead of individually.
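To make that concrete, here's a rough JScript sketch of a batched copy with periodic save/close/reopen. The file paths and batch size are placeholders, and the exact Open()/Count()/InsertFrom2() signatures are assumptions on my part; adapt all of that to however your script actually calls AlambicEdit:

    // Hypothetical paths and batch size; adjust to your job. The
    // destination is assumed to exist already (created before the loop).
    var SRC = "C:\\in\\source.pdf";
    var DST = "C:\\out\\destination.pdf";
    var BATCH = 500;

    var src = Watch.GetPDFEditObject();
    var dst = Watch.GetPDFEditObject();

    src.Open(SRC, false);
    var total = src.Pages().Count();
    src.Close();

    for (var first = 0; first < total; first += BATCH) {
      var count = Math.min(BATCH, total - first);

      src.Open(SRC, false);
      dst.Open(DST, false);

      // Copy a run of adjacent pages in one call instead of one by one.
      dst.Pages().InsertFrom2(src, first, count, dst.Pages().Count());

      // Save with optimization, then close both files so the memory held
      // by duplicated resources can be reclaimed before the next batch.
      dst.Save(true);
      dst.Close();
      src.Close();
    }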
Thank you for your response, Phil. I'm aware of the memory limitations in PP, as well as of optimizing PDFs; there's basically nothing more I can do there. Your second pointer, however, is interesting. Maybe I could rewrite the script and try that.
The script does indeed work in a loop, extracting PDF pages based on a text file; in addition, several attachments are added based on conditions. If I could somehow just free the memory, that would save the day.
Also, I really doubt that pdf.save(true) does anything. I've checked it many times and see no difference in the results or the PDF size.
Extracting text is also a costly operation, memory-wise. And unfortunately, JScript’s garbage collection abilities are, uhm… limited?
I would definitely try closing/reopening the PDFs periodically as you loop through the pages to add. If that fails (or if it only marginally improves the procedure), then you could try doing this in a Workflow loop (using the Loop task). That way, you would close the entire scripting environment in between batches, which would allow the system to reclaim the memory immediately instead of waiting for JScript’s engine to do it whenever it feels like it.
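If you go the Loop task route, each iteration runs a fresh copy of the script that processes one batch and records where it stopped. Here's a minimal sketch of one iteration, assuming a Workflow variable named StartPage carried between iterations (the variable name, paths, and AlambicEdit signatures are all placeholders):

    // One Loop iteration: copy a single batch, then record progress in a
    // Workflow variable so the next iteration knows where to resume.
    var BATCH = 500;
    var first = parseInt(Watch.GetVariable("StartPage"), 10) || 0;

    var src = Watch.GetPDFEditObject();
    src.Open("C:\\in\\source.pdf", false);       // hypothetical path
    var total = src.Pages().Count();

    var dst = Watch.GetPDFEditObject();
    dst.Open("C:\\out\\destination.pdf", false); // hypothetical path

    var count = Math.max(0, Math.min(BATCH, total - first));
    if (count > 0) {
      dst.Pages().InsertFrom2(src, first, count, dst.Pages().Count());
      dst.Save(true);
    }
    dst.Close();
    src.Close();

    // The Loop task's condition can stop once StartPage reaches the total.
    Watch.SetVariable("StartPage", String(first + count));

Since the scripting engine is torn down between iterations, the system reclaims all of its memory each time, regardless of when JScript's garbage collector would have gotten around to it.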
Phil, thank you for your help. I can confirm that closing and reopening the PDFs every X pages while processing helps to free memory. It builds up and is released properly; my heart rate spiked when it climbed to 750 MB, but then it immediately dropped back to ~400 MB.
Oh, and by the way: this time I tested with 10,000 pages, whereas before it was crashing at 5,000.
I realize I forgot to address one of your points: what does pdf.Save(true) do?
I checked with our lead architect and he explained the various optimizations that take place when specifying true:
The PDF is linearized (for fast web view)
Unreferenced resources are removed (unlikely with a freshly created PDF, but more likely if the PDF has been modified a few times)
ASCII85, LZW, and unencoded streams are converted to Flate-encoded streams (resulting in smaller files, especially with large areas of color or repeating patterns)
Fonts and encodings are merged when possible
XObjects (i.e. elements that are used multiple times throughout the file) are merged to remove duplicates (which may occur when copying pages from one PDF to another)
If your PDF is already pretty much optimized, using pdf.save(true) will have little effect. But the more changes you make to existing PDFs, the more likely it is that optimizing will result in a smaller, more efficient file. As a general rule of thumb, you should always optimize when saving, since the operation has little impact on overall performance and may help control the size of the resulting PDF.
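For reference, optimizing an existing file is just a matter of re-saving it with the flag set (the path is a placeholder):

    // Re-save an existing PDF with optimization: linearize, Flate-encode
    // streams, and merge duplicated fonts and XObjects.
    var pdf = Watch.GetPDFEditObject();
    pdf.Open("C:\\out\\result.pdf", false);  // hypothetical path
    pdf.Save(true);                          // true = optimize on save
    pdf.Close();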
Hey, thanks for sharing the technical details. Personally, as I said, I haven't experienced any difference using that option, but I talked to my colleague and he confirmed that there were a couple of times when it had a significant impact on the PDF itself.