Performance with large XML input files

Hello,
I hope someone has an idea how to optimize my workflow for following requirement.

I get up to 10 XML files in a folder and I have to create PDF files based on Connect Design Templates (one file per record, with a variable output name). That's not a problem.
My problem is the size of the input files. Some of them are bigger than 100 MB (as XML files!), and in some cases the workflow stops responding. Even when the workflow runs through to the end, I can see in Task Manager that “mysql.exe” sits at 100% disk usage.

Is there an optimal way to process these large files?

Thanks,
Thomas

A single record that uses a 100MB XML file might be problematic. That’s a huge amount of data for one record. Are you extracting every single element of that XML file? If so, that’s why you see the MySQL task spiking at 100%: there must be thousands of fields in there!

Do you actually need all those fields? Can you pre-process the XML so that only the fields you actually need to extract are handed to the DataMapper? Can the XML file be split?
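As a rough sketch of what I mean by pre-processing (the element name InternalNotes is purely hypothetical; you would list whatever elements your data mapping configuration never touches), an identity transform can copy everything and simply drop the unwanted elements:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: drop this (hypothetical) element and everything inside it -->
  <xsl:template match="InternalNotes"/>
</xsl:stylesheet>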

Also, is your Workflow process set to self-replicate or is it processing all 10 files in sequence?

Sadly, I need all those fields to create tables. Splitting the XML would be fine, but I do not know how to do that properly in PlanetPress Workflow. The built-in plugins do not provide that functionality (e.g. splitting after every 20 records).

My Workflow process is not self-replicating. It processes all files in sequence.

First: using self-replication might help because each file will be processed independently and the process’ memory will be reclaimed after each file. When processing all files in a single process, memory may only be reclaimed at the very end of the process (i.e., after all 10 files have been processed).

Second: you can use an advanced XSLT splitter to split after 20 occurrences of your target element.

Granted, XSLT is no fun at all … unless you have some experience with it, but it is very powerful nonetheless. For instance, the following XSLT splits a file after every 20 occurrences of the CUSTOMER element:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Number of CUSTOMER elements to write into each output file -->
  <xsl:variable name="RecordsPerSplit" select="20" />
  <!-- The repeating element to split on; adjust the path to your own structure -->
  <xsl:variable name="NodeToSplit" select="//MyFile/CUSTOMER" />

  <xsl:template match="/">
    <!-- Outer loop: every 1st, 21st, 41st, ... CUSTOMER starts a new chunk -->
    <xsl:for-each select="$NodeToSplit[position() mod $RecordsPerSplit = 1]">
      <!-- Write each chunk to its own sequentially numbered XML file -->
      <xsl:result-document href="file:///{encode-for-uri('{WATCHTEMPFOLDER}')}{format-number(position(),'000000000')}.xml">
        <MyFile>
          <!-- Inner loop: the current CUSTOMER plus its next 19 siblings -->
          <xsl:for-each select=".|following-sibling::CUSTOMER[not(position() > $RecordsPerSplit -1)]">
            <xsl:apply-templates select="."/>
          </xsl:for-each>
        </MyFile>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>

  <!-- Identity template: copies each CUSTOMER (and its attributes and children) unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

It is composed of two loops: the outer loop starts a new chunk of data whenever (element index mod 20) equals 1, i.e. at the 1st, 21st, 41st, … CUSTOMER element.

The inner loop then looks at the following 19 siblings of the CUSTOMER node and adds them to the chunk.

Obviously, depending on the structure of your own XML file, you will have to make changes to the select statements in the XSLT code, but perhaps the above example can get you going.
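For instance (a purely hypothetical structure), if the root element were Orders and the repeating element ORDER instead of CUSTOMER, only the lines below would change; everything else stays the same:

<!-- repeating element is now ORDER under the root element Orders -->
<xsl:variable name="NodeToSplit" select="//Orders/ORDER" />

<!-- wrapper element and sibling axis follow suit -->
<Orders>
  <xsl:for-each select=".|following-sibling::ORDER[not(position() > $RecordsPerSplit -1)]">
    <xsl:apply-templates select="."/>
  </xsl:for-each>
</Orders>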


Thanks Phil! That is really great. Now I can split the file into multiple smaller files and process each generated file in a separate self-replicating process.

And if you go down the XSL route proposed by Phil, keep in mind that the XSLT library supplied with Workflow (AltovaXML) can be slow.
I managed to increase performance about 10x by using an external XSLT processor (Saxon, in my case) and optimising the XSLT template.
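For reference, a typical Saxon-HE command line looks roughly like this (the jar name and file paths are only placeholders; in Workflow you could call it through an External Program task):

java -cp C:\Saxon\saxon-he.jar net.sf.saxon.Transform -s:input.xml -xsl:split20.xsl -o:principal.xml

The chunk files themselves are written by xsl:result-document, so the -o output is just a placeholder for the (essentially empty) principal result.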
br
-i.