Metadata access and update very slow

Hi,
I have a big problem with metadata updates when I have a large spool (500,000 records).
When I update the first 20,000 records there is no problem: the metadata inside the records is updated in 6 seconds. But if I try to update records 300,000 to 310,000, the update time increases dramatically (over 40 minutes).
The data being updated is of the same type as in the first 20,000 records.

How are you updating the metadata? Are you using the Metadata Fields Management task or a Script? How was the metadata initially created and what does it contain?

Hi,
I have used a script

var oFSO = new ActiveXObject("Scripting.FileSystemObject");
var DirectoryCorrente = Watch.GetVariable("Path_Source");
var pathfile = DirectoryCorrente + "\\output.mcr";
var outFile = oFSO.OpenTextFile(pathfile, 8, true); // 8 = ForAppending, create the file if needed
var myMeta = new ActiveXObject("MetadataLib.MetaFile");
myMeta.LoadFromFile(Watch.GetMetadataFilename());
var metaJob = myMeta.Job();
var metaGroup = metaJob.Group(0);
var metaDocument = "";
var numerofacce = "", numerofogli = "", IndirizzoIntestatario = "", Intestatario = "", locintera = "", Nominativo = "", indirizzo = "";

for (var i=0; i < metaGroup.Count; i++) {
    metaDocument = metaGroup.Item(i);
    metaDocument.Fields.Add('_vger_fld_PROGRESSIVONUMERICO', i + 1);

    numerofacce = metaGroup.Document(i).FieldByName("_vger_fld_NUMEROFACCE");
    numerofogli = parseInt(numerofacce) / 2;
    IndirizzoIntestatario =metaGroup.Document(i).FieldByName("_vger_fld_Indirizzo");
    IndirizzoIntestatario = IndirizzoIntestatario.split(",").join(" ");
    Intestatario = metaGroup.Document(i).FieldByName("_vger_fld_Nominativo");
    Intestatario = Intestatario.split(",").join(" ");
    outFile.WriteLine(numerofacce + "," + numerofogli + "," + metaGroup.Document(i).FieldByName("_vger_fld_PROGRESSIVONUMERICO") + "," + Intestatario + "," + IndirizzoIntestatario + "," +metaGroup.Document(i).FieldByName("_vger_fld_CAP") + "," + myMeta.Job().Group(0).Document(i).FieldByName("_vger_fld_Localita") + "," + metaGroup.Document(i).FieldByName("_vger_fld_Provincia") + "," + "Airc.ps" + "," + " " + "," + " " + "," + " " + "," + " " + "," + " " + "," + " " + "," + " " );

    locintera = metaDocument.FieldByName("_vger_fld_Localita");
    Nominativo = metaDocument.FieldByName("_vger_fld_Nominativo");
    indirizzo = metaDocument.FieldByName("_vger_fld_Indirizzo");

    if(locintera.length > 30)
    {
            locintera = locintera.substring(0, 30);
            metaDocument.Fields.Add('_vger_fld_Localita', locintera);
    }
    if(Nominativo.length > 40)
    {
            Nominativo = Nominativo.substring(0, 40);
            metaDocument.Fields.Add('_vger_fld_Nominativo', Nominativo);
    }
    if(indirizzo.length > 40)
    {
            indirizzo = indirizzo.substring(0, 40);
            metaDocument.Fields.Add('_vger_fld_Indirizzo', indirizzo);
    }

}
outFile.Close();
myMeta.SaveToFile (Watch.GetMetadataFilename());

Your issue is likely caused by the three lines in your script that add new fields to the metadata (Localita, Nominativo, Indirizzo). Each time you add a field, the entire metadata structure has to be checked to ensure that the structure of every metadata document is the same across the whole file. So as the file grows, the number of checks that have to be performed grows dramatically.

You should first make sure that all your documents contain those three fields when you initially create the metadata (even if they are empty). This guarantees that all records have the same structure. Then your script should use the metadata's Add2() method with the afReplace flag (i.e. flag = 0). For instance:

metaDocument.Fields.Add2('_vger_fld_Nominativo', Nominativo,0);

This should make the performance much more consistent, regardless of the number of documents.

EDIT: this answer is incorrect, see below.

Hi,
I did some tests. There was no improvement, but I noticed that with each Add2 call the processing time multiplies: for example, in the same script, one Add2 takes 30 seconds and four Add2 calls take 2 minutes. The further I move away from record 0, the more the process slows down.

Hi,
I’ve done other tests.
I think the problem is not Add2, and not even outFile.WriteLine, but access to the metadata itself.
For the first 50,000 records, processing takes 4 minutes. For the records between 100,000 and 150,000, processing takes 50 minutes.

Hi,
sorry, does anyone have an idea how to solve my problem?
I'm in trouble with production.

Well, my last post was wrong (I will cross it out after this post so as not to further confuse future readers). The Add() method is indeed faster than Add2() in most cases, so that's not the issue here.

Beyond the sheer number of records, it looks like the main culprit might be the FieldByName() method, which iterates through the entire field collection in order to find the specified field.

There is unfortunately not much you can do to speed up the process if you keep using metadata with so many records. Since the main function of this script is to export data to what looks like a CSV file, it would be more efficient in this case to do it from a DataMapper post-processing script. Alternatively, you could use the Retrieve Items task to fetch all the records as JSON and modify your script to use that instead of metadata.
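
For instance, assuming the Retrieve Items task is configured to output the records as JSON in the job file, the export loop could look roughly like this. The field names and the JSON layout below are assumptions based on your script, not something I have tested, so treat it as a sketch only:

var oFSO = new ActiveXObject("Scripting.FileSystemObject");
// Read the JSON job file produced by Retrieve Items (1 = ForReading)
var jsonText = oFSO.OpenTextFile(Watch.GetJobFileName(), 1).ReadAll();
// Plain JScript has no built-in JSON.parse, so evaluate the (trusted) text instead
var records = eval('(' + jsonText + ')');

var outFile = oFSO.OpenTextFile(Watch.GetVariable("Path_Source") + "\\output.mcr", 8, true);
for (var i = 0; i < records.length; i++) {
    var rec = records[i];                          // hypothetical record layout
    var numerofacce = rec.NUMEROFACCE;             // hypothetical field names
    var numerofogli = parseInt(numerofacce, 10) / 2;
    var intestatario = String(rec.Nominativo).split(",").join(" ");
    var indirizzo = String(rec.Indirizzo).split(",").join(" ");
    // Trailing blank columns from your original WriteLine are omitted here for brevity
    outFile.WriteLine(numerofacce + "," + numerofogli + "," + (i + 1) + "," +
        intestatario + "," + indirizzo + "," + rec.CAP + "," + rec.Localita + "," +
        rec.Provincia + "," + "Airc.ps");
}
outFile.Close();

Note that this only covers the CSV export part of your script; truncating and writing back the Localita, Nominativo and Indirizzo values would still have to be handled elsewhere (for example in the DataMapper).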

Hello Cristian77,

I studied your script and I think I might be able to help.

  1. Replace your for() loop with an Enumerator.
  2. Reuse your metaDocument variable inside the loop rather than retrieving the document from the group on every iteration.
  3. Use an in-memory counter for progressivoNumerico instead of retrieving it from the metadata.

With these changes, you should see a significant improvement in speed.

For your benefit, here is the middle section that I modified according to my suggestions. Please note that I made the changes in Notepad and did not run them in Workflow, so there might be slight syntax mistakes to fix. But it should be enough to give you an idea of the required changes.

var progressivoNumerico = 0;
for(var docEnum = new Enumerator(metaGroup); !docEnum.atEnd(); docEnum.moveNext()) {
    progressivoNumerico++;
    metaDocument = docEnum.item();
    metaDocument.Fields.Add('_vger_fld_PROGRESSIVONUMERICO', progressivoNumerico);

    numerofacce = metaDocument.FieldByName("_vger_fld_NUMEROFACCE");
    numerofogli = parseInt(numerofacce) / 2;
    IndirizzoIntestatario = metaDocument.FieldByName("_vger_fld_Indirizzo");
    IndirizzoIntestatario = IndirizzoIntestatario.split(",").join(" ");
    Intestatario = metaDocument.FieldByName("_vger_fld_Nominativo");
    Intestatario = Intestatario.split(",").join(" ");
    outFile.WriteLine(numerofacce + "," + numerofogli + "," + progressivoNumerico + "," + Intestatario + "," + IndirizzoIntestatario + "," +metaDocument.FieldByName("_vger_fld_CAP") + "," + metaDocument.FieldByName("_vger_fld_Localita") + "," + metaDocument.FieldByName("_vger_fld_Provincia") + "," + "Airc.ps" + "," + " " + "," + " " + "," + " " + "," + " " + "," + " " + "," + " " + "," + " " );

    locintera = metaDocument.FieldByName("_vger_fld_Localita");
    Nominativo = metaDocument.FieldByName("_vger_fld_Nominativo");
    indirizzo = metaDocument.FieldByName("_vger_fld_Indirizzo");

    if(locintera.length > 30)
    {
            locintera = locintera.substring(0, 30);
            metaDocument.Fields.Add('_vger_fld_Localita', locintera);
    }
    if(Nominativo.length > 40)
    {
            Nominativo = Nominativo.substring(0, 40);
            metaDocument.Fields.Add('_vger_fld_Nominativo', Nominativo);
    }
    if(indirizzo.length > 40)
    {
            indirizzo = indirizzo.substring(0, 40);
            metaDocument.Fields.Add('_vger_fld_Indirizzo', indirizzo);
    }

}

Hi,
Perfect, the processing time has gone from about 12 hours to 40 seconds. Thank you.

For your benefit (and for others who might be reading), here’s why it makes such a difference.

The metadata is optimized for sequential access. A lot of care has been put into making sequential traversal of the tree, from start to end, as fast as possible. It is also optimized for adding new nodes and fields at the end. The design trade-off is that random access is costly, and the cost grows with the size of the collection. Random access here means accessing nodes as Parent.Item(i).
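
As a rough side-by-side sketch of the two patterns (simplified from the scripts above):

// Slow: random access; the cost of Item(i) grows with the size of the collection
for (var i = 0; i < metaGroup.Count; i++) {
    var slowDoc = metaGroup.Item(i);
    // ... work on slowDoc ...
}

// Fast: sequential access through an Enumerator
for (var docEnum = new Enumerator(metaGroup); !docEnum.atEnd(); docEnum.moveNext()) {
    var fastDoc = docEnum.item();
    // ... work on fastDoc ...
}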

Once you know that, it’s easier to spot the flaws in the original script.

  1. The loop was done using a for() loop followed by a metaDocument = metaGroup.Item(i); assignment, i.e. a random-access operation. The optimized way of doing this is to use an Enumerator, which provides sequential iteration through its moveNext() method.
  2. Even though one of the first lines inside the loop was the assignment above, the code later in the loop still referred to the current document through metaGroup.Document(i), causing seven more random-access operations on every iteration. The fix is to reuse the metaDocument variable instead, thus paying the access penalty only once per document; and thanks to fix #1, that penalty itself drops close to zero.

Hope this can help you and others.
