XML file size limit?

Is there a limit to the size of XML that the Workflow can process?
Is there a limit to the size of XML that the data mapper can process?

In Workflow, the theoretical size limit is 2GB, but in reality with XML, it’s much lower because the file needs to be mapped in memory and that can take as much as 10 times the size of the actual file. So I wouldn’t recommend trying anything over 150/200MB.

For the datamapper, the theoretical limit is much higher (it’s a 64-bit application), so it’s not the overall size of the file that can be a concern but rather the size of each individual record inside that file. If a single record contains tens of thousands of fields, the DataMapper’s GUI will become quite sluggish. However, when it runs as a service, it should be able to handle these sorts of records.

So I wouldn’t recommend trying anything over 150/200GB.

Should this be 150/200MB?

With this file in 2021.1, I get an out-of-memory error when adding the sample data as XML.


I need to convert the XML into a pipe-delimited CSV file. There is a detail section containing values, and the number of values changes based on another field (such as document type). The objective is to separate the resulting pipe-delimited lines into different files based on that field (such as document type).

<Batch>
    <Document>
        <Field1>DocType3</Field1>
        <Field2>INV12345</Field2>
        <Field3>Some Where A</Field3>
        <Field4>Some State A</Field4>
        <Field5>Some Zip A</Field5>
        <Details>
            <Item>
                <SubField1>Item 1</SubField1>
                <SubField2>Value 1</SubField2>
            </Item>
            <Item>
                <SubField1>Item 2</SubField1>
                <SubField2>Value 2</SubField2>
            </Item>
            <Item>
                <SubField1>Item 3</SubField1>
                <SubField2>Value 3</SubField2>
            </Item>
        </Details>
    </Document>
    <Document>
        <Field1>DocType5</Field1>
        <Field2>INV67890</Field2>
        <Field3>Some Where B</Field3>
        <Field4>Some State B</Field4>
        <Field5>Some Zip B</Field5>
        <Details>
            <Item>
                <SubField1>Item 1</SubField1>
                <SubField2>Value 2</SubField2>
            </Item>
            <Item>
                <SubField1>Item 2</SubField1>
                <SubField2>Value 2</SubField2>
            </Item>
            <Item>
                <SubField1>Item 3</SubField1>
                <SubField2>Value 3</SubField2>
            </Item>
            <Item>
                <SubField1>Item 4</SubField1>
                <SubField2>Value 4</SubField2>
            </Item>
            <Item>
                <SubField1>Item 5</SubField1>
                <SubField2>Value 5</SubField2>
            </Item>
        </Details>
    </Document>
</Batch>

The output would be two pipe-delimited files:

INV12345|Some Where A|Some State A|Some Zip A|Item 1|Item 2|Item 3

INV67890|Some Where B|Some State B|Some Zip B|Item 1|Item 2|Item 3|Item 4|Item 5

You’re right, I meant MB, not GB. I edited my original post to avoid confusing everyone.

As for your project, it seems pretty straightforward. You set the boundaries to the Document element and extract all you need from each document.

Then you use a PostProcessor script to go through all the records you’ve extracted and write them to your pipe-delimited file, as already explained in this post, among others.
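To give you an idea, here is a rough Post processor sketch. The output path and field names (Field2 to Field5, plus an Items field that I’m assuming you build during extraction by joining the SubField1 values with pipes) are placeholders, so adapt them to your data model; the data.records array and the openTextWriter() helper are the usual Post processor tools, but double-check the scripting reference for your version.

// Post processor sketch: writes one pipe-delimited line per extracted record.
var writer = openTextWriter("C:/out/documents.txt");
for (var i = 0; i < data.records.length; i++) {
    var f = data.records[i].fields;
    // Field2..Field5 come straight from the XML; Items is assumed to already
    // contain the joined item values (e.g. "Item 1|Item 2|Item 3").
    writer.write(f.Field2 + "|" + f.Field3 + "|" + f.Field4 + "|" + f.Field5 + "|" + f.Items);
    writer.newLine();
}
writer.close();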

Phil

I now have a situation where I have massive XML files. My sample is 3.29 GB. The XML contains a ton of information that I don’t need for my output. The notion would be to use a datamapper to extract only the necessary information to produce a much smaller XML file.

I have failed in my attempts to have them provide smaller XML files. Looking for ideas to split the XML into smaller chunks so we can process them.

Are you able to load up the 3.29 GB file in the Datamapper of the Design tool?

If so, then here is what I suggest to bypass the Workflow 32-bit 2 GB limitation:

  1. Using a text editor, make a very small version of the file.
  2. In your Workflow, always use that small version as the input file.
  3. In your Datamapper:
    • Add a runtime parameter named file, which will hold the path of the actual file you intend to use (the 3.29 GB one, or any other)
    • Add a Preprocessor script that holds this code: copyFile(automation.parameters.file, data.filename);

This way, you will trigger the Datamapper with the small file version but always pass the real big file’s path and name as a parameter, therefore bypassing the Workflow 32-bit limitation.
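Spelled out with comments, that Preprocessor script is just:

// Preprocessor sketch: swap the tiny dummy sample for the real file.
// 'file' is the runtime parameter defined above; Workflow passes the full
// path of the actual (multi-GB) XML in it. copyFile() overwrites the current
// sample data (data.filename) with that file before extraction starts.
copyFile(automation.parameters.file, data.filename);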

Let us know if that helps.

And if that works for you, why not output a JSON file from your Datamapper? I find it much easier to work with, especially if you have to manipulate it in a script further down your process.

Yes, I’m able to load the 3.29 GB XML file into the datamapper. I’m not able to select it as sample data in the Workflow, though. I will give it a go and let you know.

There are over 2000 nodes for each record, but I only need under 100 of them. The first data mapper is there just to extract those 100 nodes. The thought was to output an XML file with the 100 nodes per record. Then we will need to sort and split the XML and route the XML pieces to one of several different PPC servers based on a location node. Then we will use the XML piece in another datamapper and template.

Then in this case, you could use the Datamapper as explained in my previous post and add another script in the postprocessor that will output one file per record into a temp folder. Those files can then be captured by a Workflow process.
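For example, something along these lines (again just a sketch: the temp folder, the file naming and the field names are assumptions, and what gets written could just as well be an XML or JSON fragment instead of a pipe-delimited line):

// Post processor sketch: one output file per extracted record, dropped in a
// temp folder that a Workflow Folder Capture input can monitor.
for (var i = 0; i < data.records.length; i++) {
    var f = data.records[i].fields;
    var writer = openTextWriter("C:/temp/split/record_" + (i + 1) + ".txt");
    writer.write(f.Field2 + "|" + f.Field3 + "|" + f.Field4 + "|" + f.Field5 + "|" + f.Items);
    writer.newLine();
    writer.close();
}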

We would need to split based on the location node, so there would be around 5000+ records in each file, spread across fewer than 100 files. We would need to name the files with the location value so we can route them to the proper secondary PPC server. Or the folder could be a shared location and each server could pick up its files using a mask that includes the location value.

Same approach, different splitting condition. :wink:
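For instance (sketch only: the Location field name and the output folder are assumptions):

// Post processor sketch: one file per location value instead of one per record.
// Records sharing the same Location end up in the same file, which is named
// after that value so Workflow can route it using a filename mask.
var writers = {};   // keeps one open TextWriter per location value
for (var i = 0; i < data.records.length; i++) {
    var f = data.records[i].fields;
    var loc = f.Location;
    if (!writers[loc]) {
        writers[loc] = openTextWriter("C:/temp/split/" + loc + ".txt");
    }
    writers[loc].write(f.Field2 + "|" + f.Field3 + "|" + f.Field4 + "|" + f.Field5 + "|" + f.Items);
    writers[loc].newLine();
}
for (var key in writers) {   // close every writer once all records are written
    writers[key].close();
}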