Performance with large XML input files

Hello,
I hope someone has an idea how to optimize my workflow for following requirement.

I get up to 10 XML files in a folder and I have to create PDF files based on Connect Design Templates (one file per record, with a variable output name). That's not a problem.
My problem is the size of the input files. Some of them are bigger than 100 MB (as XML files!), and in some cases the workflow stops responding. Even when the workflow runs through to the end, I can see in Task Manager that “mysql.exe” sits at 100% disk usage.

Is there an optimal way to process these large files?

Thanks,
Thomas

A single record that uses a 100MB XML file might be problematic. That’s a huge amount of data for one record. Are you extracting every single element of that XML file? If so, that’s why you see the MySQL task spiking at 100%: there must be thousands of fields in there!

Do you actually need all those fields? Can you pre-process the XML so that only the fields you actually need to extract are handed to the DataMapper? Can the XML file be split?
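As a rough sketch of what I mean by pre-processing (the element name InternalNotes is purely hypothetical; you would list whatever elements your data mapping configuration never touches), an identity transform can copy everything and simply drop the unwanted elements:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: drop this (hypothetical) element and everything inside it -->
  <xsl:template match="InternalNotes"/>
</xsl:stylesheet>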

Also, is your Workflow process set to self-replicate or is it processing all 10 files in sequence?

Sadly, I need all those fields to create tables. Splitting the XML would be fine, but I do not know how to do that properly in PlanetPress Workflow. The built-in plugins do not provide that functionality (e.g. splitting after every 20 records).

My Workflow process is not self-replicating. It processes all files in sequence.

First: using self-replication might help because each file will be processed independently and the process’ memory will be reclaimed after each file. When processing all files in a single process, memory may only be reclaimed at the very end of the process (i.e., after all 10 files have been processed).

Second: you can use an advanced XSLT splitter to split after 20 occurrences of your target element.

Granted, XSLT is no fun at all … unless you have some experience with it, but it is very powerful nonetheless. For instance, the following XSLT splits a file after every 20 occurrences of the CUSTOMER element:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Number of CUSTOMER elements to write into each output file -->
  <xsl:variable name="RecordsPerSplit" select="20" />
  <!-- The repeating element to split on; adjust the path to your own structure -->
  <xsl:variable name="NodeToSplit" select="//MyFile/CUSTOMER" />

  <xsl:template match="/">
    <!-- Outer loop: every 1st, 21st, 41st, ... CUSTOMER starts a new chunk -->
    <xsl:for-each select="$NodeToSplit[position() mod $RecordsPerSplit = 1]">
      <!-- Write each chunk to its own sequentially numbered XML file -->
      <xsl:result-document href="file:///{encode-for-uri('{WATCHTEMPFOLDER}')}{format-number(position(),'000000000')}.xml">
        <MyFile>
          <!-- Inner loop: the current CUSTOMER plus its next 19 siblings -->
          <xsl:for-each select=".|following-sibling::CUSTOMER[not(position() > $RecordsPerSplit -1)]">
            <xsl:apply-templates select="."/>
          </xsl:for-each>
        </MyFile>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>

  <!-- Identity template: copies each CUSTOMER (and its attributes and children) unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

It is composed of two loops: the outer loop starts a new chunk of data whenever (element index mod 20) equals 1, i.e. at the 1st, 21st, 41st, … CUSTOMER element.

The inner loop then looks at the following 19 siblings of the CUSTOMER node and adds them to the chunk.

Obviously, depending on the structure of your own XML file, you will have to make changes to the select statements in the XSLT code, but perhaps the above example can get you going.
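For instance (a purely hypothetical structure), if the root element were Orders and the repeating element ORDER instead of CUSTOMER, only the lines below would change; everything else stays the same:

<!-- repeating element is now ORDER under the root element Orders -->
<xsl:variable name="NodeToSplit" select="//Orders/ORDER" />

<!-- wrapper element and sibling axis follow suit -->
<Orders>
  <xsl:for-each select=".|following-sibling::ORDER[not(position() > $RecordsPerSplit -1)]">
    <xsl:apply-templates select="."/>
  </xsl:for-each>
</Orders>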


Thanks Phil! That is really great. Now I can split the file into multiple smaller files and process each generated file in a separate self-replicating process.

And if you go down the XSL route proposed by Phil, keep in mind that the XSLT library supplied with Workflow (AltovaXML) can be slow.
I managed to increase performance about 10x by using an external XSLT processor (Saxon, in my case) and optimising the XSLT template.
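For reference, a typical Saxon-HE command line looks roughly like this (the jar name and file paths are only placeholders; in Workflow you could call it through an External Program task):

java -cp C:\Saxon\saxon-he.jar net.sf.saxon.Transform -s:input.xml -xsl:split20.xsl -o:principal.xml

The chunk files themselves are written by xsl:result-document, so the -o output is just a placeholder for the (essentially empty) principal result.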
br
-i.