How to sort a CSV file and split it into several CSV files as output?

We have a monthly process where we produce tens of thousands of invoices from CSV files. Using PlanetPress to generate the letters has been immensely helpful; however, in order to keep track of the letters being produced, and also to cut down on processing time per batch, our print room operative currently sorts and splits our initial data manually.

In other words, we’ll have an initial data set of 40-80 thousand lines of data, and our operative will:

  1. Split the data by invoice type (the type is indicated in the data)
  2. Sort each invoice type by account reference (the reference is indicated in the data)
  3. Split the data again into separate CSVs of no more than 1000 records each.

I understand that PlanetPress can sort the data in this way between the datamapper and the print output; however, there are two problems with simply processing the original 40-80 thousand line CSV in one go and letting PlanetPress sort it ‘behind the scenes’:

  • Firstly, it takes PlanetPress hours to process the data in this way. That would be okay if we were able to process it overnight, but PlanetPress often crashes if we give it too much data to process at once, so we can end up coming in in the morning with no work ready to print.
  • Secondly, if a job of 1000 gets mangled in the printer (rare, but it happens), there’s no way for us to guarantee which 1000 records were in that job and therefore need to be re-processed, even though we could make an educated guess at which 1000 they were by trying to mimic PlanetPress’ data sorting.

I was wondering, therefore, if there is a way to use PlanetPress to split the CSV before it is processed, so that we can identify each batch of 1000 invoices separately without our operative having to manually sort and split the original data. I vaguely understand that if I set up a process in Workflow for this, I need to mark each field as a local variable, turn that into metadata, and create a file at the end of the process. I’ve tried playing with this idea, but I haven’t been able to get it to do what I want.

If I can get this working, it would be ideal if PlanetPress could name each split CSV file with a batch number, and also include the batch number in an additional column within the split data, so that we can add it to the print job name for easy reconciliation.

Would someone be able to point me in the right direction to get this idea to work? Assuming that it’s something PlanetPress is able to do?

Hi Lisa,

Would you be able to post a sample of your data? If it is sensitive, I’d suggest zipping it with a password and sending me the password directly by private message. Alternatively, if you could provide a CSV with the same structure, but fake data, that would also be helpful.

In short, there are a lot of ways that we could do this, but it is going to depend on the structure of the data. It would be helpful if I could see it first.

Hi, Albert, thanks for your reply.

Not sure if there’s a way to put a file into a post but here’s a WeTransfer link to a fake data version of the csv (or if you have an alternative suggestion of how best to post it I’d be open to that). https://wetransfer.com/downloads/7d421b2a663c3730cbe005613817803b20170321121001/20917fdfd1f24ca89fda90f9ba2b6e6b20170321121001/820fe0

There are 45 columns, plus an additional 46th column (header ‘Days’) that we just use as a manual check; only the first 45 columns are actually used in the merge. (We might add more fields in the future as we have more ideas of things to change on our templates; I assume that at that point I’d need to modify this splitting process as well.)

Column A, ‘Tenant demsuite’, indicates the invoice type to be used for that transaction. Our operator splits the data by column A first and then sorts it by column B (‘Tenant Reference’) before splitting it again into the final 1000-row-max CSVs, so this is the process we’d like PlanetPress to replicate.

I should probably also mention that for the invoice merge process we have, we need to delete the header row before processing the csv to go to print, so we would also want the final split csv files to be headerless if possible.

Ok then. To get the CSV split up as you would like, the easiest way I can find without resorting to scripts is this:

First, open the CSV using the Database Query plugin found under Actions.

Next, we need to create a DSN so the plugin can ‘talk’ to the CSV file. Click ‘Connect to ODBC data source’ and create a new one with the Microsoft Access Text Driver. During creation you will have to point it to the folder where the CSV file will be.

This generates a connection string for the plugin.
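For reference, the generated string ends up looking something like the one below (the folder and DSN paths here are only placeholders from a test setup; yours will be whatever you picked during DSN creation):

DBQ=C:\SOURCE\CSVFOLDER;DefaultDir=C:\SOURCE\CSVFOLDER;Driver={Microsoft Access Text Driver (*.txt, *.csv)};DriverId=27;Extensions=txt,csv,tab,asc;FIL=text;FILEDSN=C:\DSN\mycsv.dsn;MaxBufferSize=2048;MaxScanRows=25;PageTimeout=5;SafeTransactions=0;Threads=3;UID=admin;UserCommitSync=Yes;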

Next, tell it what to get and how to sort it. You’ll do this via the Edit SQL button, giving it something like this:

SELECT * FROM [ExampleFile.csv]
ORDER BY [Tenant demsuite], [Tenant Reference]

This should sort first by Tenant demsuite, and then by Tenant Reference.
Under Output File Emulation, select CSV and be sure the Add a Header option is disabled.

This will take in your CSV file, sort it, and output a new CSV into the workflow. Now you can begin splitting it. You may also want to consider setting this up using Dynamic SQL, but this is going to require a bit more work. It would give you better control over what file is being used though, as the above method is always going to point to the same filename in the same folder.
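Just to illustrate what Dynamic SQL would buy you (this is only a sketch: %{CsvName} is a hypothetical local variable you would have to set earlier in the process with the current file’s name, so verify the exact variable syntax against your own Workflow version):

SELECT * FROM [%{CsvName}]
ORDER BY [Tenant demsuite], [Tenant Reference]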

You’ll need to use two Emulated Data Splitters located in the Data Splitters plugins. The first one will split on ‘a word change’. Be sure to select the area where the Tenant Demsuite value will be in the resulting CSV.

This will create one datafile for each group of Tenant demsuites.

The second Emulated Data Splitter will split on ‘a word’ where the word is InvoiceType. Be sure to select the appropriate area in the datafile so it knows where to look for that string. Tell it to split after the word is found 1000 times.

Now output all of this to a folder.

The above process runs in about 1 minute on the datafile you provided, and about 45 minutes to print to PDF.

Now, if we go about this a little differently, we can do this much faster.

First, create a standard datamapper and design template.

Next, create a Job Preset. You will want to set the Job Segment Grouping to COLUMN1 and the Document Set Grouping to COLUMN2 (i.e. the invoice type and the tenant reference). This tells Connect to create groups on these two fields that we can then exploit later on. It also sorts descending.

Next, create an Output Preset and choose Separation in your production options. Be sure to print to PDF here as well, and define your output location (this is a temporary output and should NOT be the same as your final output). In your Separation options, you’re going to tell it to separate on Job Segment. This will create 1 PDF for each group of invoice types.

Next, in Workflow, you’ll need to add another Folder Capture to grab those new PDFs. This can be in the same process, in line with the Create Output task.

Finally, use the PDF Splitter located in Data Splitters. You’ll use ‘Split PDF file on a word’, check ‘On region content change’, and select the area with the Get button. Note that you’ll need to do the selection using one of the resulting PDFs from above, so you’ll want to debug the process once to get those files created.

Then tell it to split after the condition is found 1000 times.

Output everything to your final output folder and you’re done.

This ran in about 45 minutes for me, so even if it crashes in the middle, you’re not out a lot of time to re-run the batch.

So the output to PDF comes into play to help resolve some of your other issues. If you were to name the PDFs using the current invoice type and the first tenant reference number in the file, like “InvoiceType1-10023213.PDF”, and save them in an archive, you’ll be able to come back and reprint them very easily: just look at the last one that printed correctly, find the next one in the series, and send that PDF back to the printer.

Naturally the workflow would be set to print the PDFs as well. Just be sure that you put a copy (via a branch) in a safe location.

In either case, if you update the layout of the CSV, you should only have to change the datamapper, assuming all you do is add columns. Just don’t rename them, or the SQL statement in the first example will need to be modified as well. Overall, the second example is probably easier to set up and maintain in the long run.

Note that the resulting times for each process were about the same. However, the Connect datamapper, output creation, etc., are much more efficient when running on one large file rather than on many small ones. If this were to scale up, the second solution should give better performance.

Note that all code examples and parameters above are provided as is and may need to be modified further to work in your environment. This is just a rough guide on how to set this up.

Thanks for this, Alberts. Although the split PDF process is simpler, we don’t actually output to PDF; instead we send the file directly to the printer spooling software. The main benefits of this are that 1. the file is already rendered by the time we’re ready to print, so there’s no further waiting time, and 2. we have a record of which batches have been printed and at what time, which was a feature we required for the satisfaction of our clients. But this is part of the issue, as it means we can’t just open a PDF file and look at it to see what was in a batch (the spooling software does offer a preview function, which is handy for smaller jobs, but for the 1000-document jobs the print files are so big it crashes the software).

Regarding the first method, then, I think I can get around the need for Dynamic SQL by just getting a hot folder to send and rename the file as needed. However, I seem to be having a problem getting the Database Query action set up. I think I’ve followed all of your instructions, but when I finish editing the properties and click ‘OK’, an error window pops up that says ‘Invalid connection string’ and it doesn’t let me exit or complete the creation of the action. The Location is:

DefaultDir=\Emsv1\Test\Split batches\Test;Driver={Microsoft Access Text Driver (*.txt, *.csv)};DriverId=27;Extensions=txt,csv,tab,asc;FIL=text;FILEDSN=\Emsv1\Test\Split batches\Test\Test.dsn;MaxBufferSize=2048;MaxScanRows=25;PageTimeout=5;SafeTransactions=0;Threads=3;UID=admin;UserCommitSync=Yes;

Does it look like the issue might be here somewhere? I wonder if it might be having a problem because I am pointing it to a different server from the one that Workflow sits on, but all of the hot folders we use to trigger Workflow processes, and many of our output folders, are on Emsv1, and we’ve never had a problem with it reading from and outputting to that separate server before.

Whoops, looks like the Location went off the screen there, let me try again:

DefaultDir=\Emsv1\Test\Split batches\Test;
Driver={Microsoft Access Text Driver (*.txt, *.csv)};
DriverId=27;
Extensions=txt,csv,tab,asc;
FIL=text;
FILEDSN=\Emsv1\Test\Split batches\Test\Test.dsn;
MaxBufferSize=2048;
MaxScanRows=25;
PageTimeout=5;
SafeTransactions=0;
Threads=3;
UID=admin;
UserCommitSync=Yes;

One thing of note is that there is no DBQ at the start of your connection string, which is what tells the DSN what folder to look in for the CSV. After you create/select your DSN, you should be prompted to choose a directory to look in for the database (csv).

In the Database Connection, under Table/Query, do you see your CSV file listed?

My complete string looks like this (minus the line breaks):

DBQ=C:\SOURCE\20K SPLIT CSV;
DefaultDir=C:\SOURCE\20K SPLIT CSV;
Driver={Microsoft Access Text Driver (*.txt, *.csv)};
DriverId=27;
Extensions=txt,csv,tab,asc;
FIL=text;
FILEDSN=C:\DSN\newcsvtest.dsn;
MaxBufferSize=2048;
MaxScanRows=25;
PageTimeout=5;
SafeTransactions=0;
Threads=3;
UID=admin;
UserCommitSync=Yes;

You could try a local location as well. I’ve never actually tried pointing to a network path but I don’t see any reason why it wouldn’t work, so long as the user account has access to it. If it doesn’t work on the network, even with a DBQ setting, try saving the DSN to the local drive (the default location is fine) and then when it asks you what folder to look in, also use a local drive (preferably not the same folder as the DSN to avoid confusion down the line). As you say, you can always use workflow to bring the CSV file into that local folder (just run a folder capture on the network location and output to folder in your local folder with whatever name you need).

If you are allowed to use third-party software on the server, there may be a slightly different approach than per-row processing.

GNU CoreUtils contains a ‘sort’ program; you can get the coreutils package from http://gnuwin32.sourceforge.net/packages/coreutils.htm

Steps in the workflow could look like:

  1. Capture CSV
  2. Add/Remove text to get rid of header line
  3. Use External Program to run sort.exe. The parameters should be (-k1,2 -t, -o "%F.out" "%F") without the parentheses: -k selects columns 1 and 2, -t sets the delimiter to a comma, and -o names the output file (see the full example command below the list). The sample file you’ve sent should be done in less than a second.
  4. Folder Capture “%F.out”
  5. Set emulation
  6. Use splitter as already mentioned to split on invoice type
  7. Follow with another splitter to split into 1k chunks
  8. Set a unique job id and store it in a variable
  9. Use either Search and Replace, or Advanced Search and Replace, to append the job id to your csv file (e.g. target \n and replace with ,%{jobid}\n). I’ll let you work it out :wink:
  10. Use rename plugin to rename the file to %{jobid}.csv

From there it’s just a normal job.
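To make step 3 concrete, the full External Program call would be along these lines (the install path is just an example of where the GnuWin32 CoreUtils package might live; -o tells sort where to write its output, which is the file the Folder Capture in step 4 then picks up):

Program: C:\Program Files (x86)\GnuWin32\bin\sort.exe
Parameters: -t, -k1,2 -o "%F.out" "%F"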

Ah, I’d missed the prompt to choose the directory - it splits it perfectly now (and is very quick!), thank you. I’ve put the Tenant demsuite in as a variable so that the split CSV batches can be named with and identified by invoice type as well as a batch number (%i+1 so it doesn’t start at zero). I’m hoping I can just use the name of the CSV as the resulting print job name, but I haven’t tested this yet.

Currently I just have one problem, which I could probably work around but would prefer to stop from happening altogether: in the split CSV batches the dates are re-formatted into a dd/mm/yyyy format. The initial data comes out of our system in a dd-MMM-yy format, and I would prefer it to stay this way in the split data so that I can use the same datamapper for smaller runs that don’t require splitting, or if we want to make copy invoices with the original data. Is there any way I can prevent the re-formatting, or re-format the dates back in the end product?

Neat, glad you’ve got it working. Interesting that it changes the date format. I had not noticed it doing that for me, but it certainly is.

It appears to be taking the format defined in the Regional Settings on my server. I don’t recommend modifying that, however, as it may impact other things.

You can format the date directly in the SQL select string so that when the CSV is output, it has the correct format. Documentation can be found here: https://www.w3schools.com/sql/func_date_format.asp
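To give an idea of what that could look like with the Access Text Driver (which, as far as I recall, uses the Jet/Access Format() function rather than MySQL’s DATE_FORMAT), you would list the columns explicitly instead of using SELECT * and wrap the date column(s). The sketch below is untested, and [Invoice Date] is a made-up column name, so substitute your real date columns and list the remaining columns as well:

SELECT [Tenant demsuite], [Tenant Reference], Format([Invoice Date], 'dd-mmm-yy') AS [Invoice Date]
FROM [ExampleFile.csv]
ORDER BY [Tenant demsuite], [Tenant Reference]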

However, probably the easiest would be to simply modify this in Connect. Since you know the date format will be coming out of the split in a given format, you can also modify your datamapping and template to work with the new date format input.

So in the extraction step for each of the dates in the CSV, specify that they are of Type DATE, and then tell it what format to expect in the field below. For example, mine are formatted as “30-Jan-17”, so in my date format I specify “dd-MMM-yy”. You can find the specifications for that here: PlanetPress Connect 1.6.1 User Guide

Once this is done, Connect knows that those fields are Dates and you can now re-format them however you like by using the date format function. Details on that here: PlanetPress Connect 1.6.1 User Guide

This is probably the most flexible approach, since it allows you to change the format however you like in the design template without modifying the base data any further.

Alternatively, I can recommend PowerShell Core. With the following example, a CSV can be sorted on several columns (hierarchically).
# Input and output file paths are passed in as arguments
$csv_in = $args[0]
$csv_out = $args[1]
# Sort keys, in priority order; replace the column names with your own headers
$prop1 = @{Expression='column3'; Ascending=$true}
$prop2 = @{Expression='column4'; Ascending=$true}
# Adjust -Delimiter to match your file (e.g. "," for a comma-separated CSV)
Import-Csv -Path $csv_in -Delimiter ";" | Sort-Object $prop1, $prop2 | Export-Csv -Path $csv_out -Delimiter ";" -NoTypeInformation -UseQuotes Never

Please note: PowerShell Core is the newer version of Windows PowerShell; it has several advantages and is also multi-platform capable!
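For what it’s worth, calling the script from Workflow’s External Program plugin (or a command prompt) would look roughly like this, assuming you save it as Sort-Csv.ps1 (the script name and path are just examples); the two arguments end up in $args[0] and $args[1] inside the script:

pwsh.exe -File "C:\Scripts\Sort-Csv.ps1" "%F" "%F.sorted.csv"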

I can’t view your fake dataset. Can you share the fake data through another service, like Dropbox?