How to process CSV data files

I have a data file named “data.txt” that contains the following lines:

HDR;Albert;Aal;Ammengasse 1;11111;Amelinghausen;1
POS;Product 1;11,11
POS;Product 2;22,22
POS;Product 3;33,33
HDR;Bernd;Brecher;Brunnentalweg 2;22222;Berneburg;2
POS;Product 4;44,44
POS;Product 5;55,55

There are records of type HDR that carry some header information for each document and that are followed by records of type POS containing information about products and their prices.

I created a new DataMapper configuration and selected the file “data.txt” (after switching the filter in the file open dialog to “Text file (*.txt, *.dat, *.prn)”).

The field separator is set to “;”.

When I switch to the “Steps”-panel no data is shown; the messages panel contains a long list of entries (saved as a text file it is about 42 KB).

Q1: what is wrong with the file?

Q2: shall I insert the messages as text here or is there a way to upload such files?

Edit 1: after renaming the file to “data.csv” I got a message:

ERROR [12 Feb 2015 12:51:13,916][main] com.objectiflune.datamining.ui.model.RefreshBoundariesJob$2.run(?:?) Error while refreshing document boundaries: Line contains wrong number of columns: 2 1 (DME000048)
com.objectiflune.datamining.DocumentManager$DocumentInputException: java.sql.SQLException: Line contains wrong number of columns: 2 1
at com.objectiflune.datamining.tabular.internal.handlers.TabularDocumentManager.nextBoundaries(Unknown Source)
at com.objectiflune.datamining.DocumentManager.refresh(Unknown Source)
at com.objectiflune.datamining.PluginInput.refresh(Unknown Source)
at com.objectiflune.datamining.ui.model.RefreshBoundariesJob.run(Unknown Source)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)
Caused by: java.sql.SQLException: Line contains wrong number of columns: 2 1
at org.relique.jdbc.csv.CsvReader.getEnvironment(Unknown Source)
at org.relique.jdbc.csv.CsvResultSet.next(Unknown Source)
at com.objectiflune.datamining.tabular.internal.documentdata.DocumentHandler.nextRecord(Unknown Source)
… 5 more

That sounds as if it is required to have the same number of fields in every record.

If so: what is the “correct” way to read a file such as the one described above?

Hi Thomas,

The problem here is that what you have isn’t a CSV file. A CSV has the same number of columns on each line. This is why you get “Line contains wrong number of columns”: because, well, it does.

There are two ways you could go about this, and both require scripting.

First, you could use a pre-processor script which would “append” the first line of a record to each of the other lines so that you have the same number of columns, something like:

POS;Product 1;11,11;HDR;Albert;Aal;Ammengasse 1;11111;Amelinghausen;1
POS;Product 2;22,22;HDR;Albert;Aal;Ammengasse 1;11111;Amelinghausen;1
POS;Product 3;33,33;HDR;Albert;Aal;Ammengasse 1;11111;Amelinghausen;1
POS;Product 4;44,44;HDR;Bernd;Brecher;Brunnentalweg 2;22222;Berneburg;2
POS;Product 5;55,55;HDR;Bernd;Brecher;Brunnentalweg 2;22222;Berneburg;2

Preprocessor scripts run once for the whole data file, so they’re generally used to “stabilize” or “massage” data into submission.

Here’s a preprocessor script you can use on this data file in particular. To add it, click the Preprocessor Step in the Steps pane, click Add in the Preprocessor section, then copy the script contents into the Preprocessor Definition expression below. Once you’ve added the script, you need to tick the checkmark to the left of Preprocessor1 in the Preprocessor section, and then on the right, click the green checkmark, which “applies” the preprocessors.

// Read the original data file and write a "flattened" copy to a temp file.
var reader = openTextReader(data.filename);
var tmpFile = createTmpFile();
var writer = openTextWriter(tmpFile.getPath());

var header_line = "";
var line = "";

while ((line = reader.readLine()) != null) {
    if (line.split(";").length > 4) {
        // HDR lines have 7 fields: remember the current header
        header_line = line;
    } else {
        // POS lines have 3 fields: append the current header to each one
        writer.write(line + ";" + header_line);
        writer.newLine();
    }
}
writer.close();
reader.close();

// Replace the original data file with the flattened version.
deleteFile(data.filename);
tmpFile.move(data.filename);
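The functions above (openTextReader, createTmpFile, deleteFile and so on) are specific to the DataMapper scripting environment, so you can’t run that script as-is outside of it. If you want to sanity-check the logic first, here’s a hypothetical standalone version of the same transformation that just works on an array of lines in plain JavaScript:

```javascript
// Append the most recent HDR line to every POS line, mimicking the
// preprocessor above. Lines with more than 4 ";"-separated fields are
// treated as headers; all others are detail (POS) lines.
function flattenRecords(lines) {
    var headerLine = "";
    var out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (line.split(";").length > 4) {
            headerLine = line; // remember the current HDR record
        } else {
            out.push(line + ";" + headerLine); // POS record + its header
        }
    }
    return out;
}
```

For example, feeding it the two-record sample file from the question yields five POS lines, each carrying its HDR fields at the end, exactly like the flattened listing shown earlier.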

The other solution would be to read the file as a Text data source instead of a CSV. That, however, requires both a script to define the boundaries (because you still need to check for something like “the number of ; on the line”) and scripts to extract each field on the first line and then loop through the other lines (3 in your example), extracting each field, again with scripts. I’d rather not get into this kind of example on the Q&A, this answer is long enough as it is :wink:
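Just to illustrate the boundary check itself (not the actual DataMapper boundary-script API, which I’m leaving out here), the test is nothing more than counting the semicolon-separated fields on a line, as a sketch:

```javascript
// A new document starts on each HDR line. In this data, HDR lines have
// 7 ";"-separated fields while POS lines only have 3, so the field count
// is enough to tell them apart.
function isBoundary(line) {
    return line.split(";").length > 4;
}
```

Everything beyond that (positioning on lines, extracting each field) is what makes the text-based approach so much more work than the preprocessor.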

Oh, and before I forget, I did save the configuration so you can open it directly:
http://help.objectiflune.com/files/connect-samples/Sample-Preprocessor-CSV-With-Different-Column-Count.OL-datamapper

Hi Evie,
thanks for the file and the explanations.
Unfortunately, customers who are going to test PlanetPress Connect still need to know how to read a file similar to the example.
As there is no documentation available about the JavaScript API, we’d be glad to see a small example that deals with different record types with different numbers of fields, even if that means reading them as text files.
E.g. what does the logic look like, and how are fields stored in the “record tables”?
If you could provide just an example file, that would be OK - then there is no need for longer explanations here in the forum.

I’m not sure what you mean by storing fields in the record table. Once the preprocessor has run, you have a perfectly usable CSV file in the DataMapper, and a simple drag & drop is all that’s necessary to extract the data.

If you mean my “second” example: I mention using text-based data mapping and scripts to extract each field of the data, but that’s just not the right way to do it - it’s so much more complex than it needs to be. Doing it the way I provided will save a lot of grief and a lot of time.

But the starting point here is that this is not a CSV file. It’s a nonstandard format that requires nontrivial processing, and each such type of file needs to be analysed on a case-by-case basis; there’s no one-size-fits-all solution here.

I’d rather spend my time starting to write the API documentation than making this sort of example.

Hi,

I solved it by opening the file with Excel and saving it as CSV (Excel inserts the missing columns).