Split Suburb, State & Postcode using extract script

marrd · August 2, 2018, 12:31am

Hi there, hoping someone can help a novice…

I am extracting address data from a PDF and splitting it into multi-line fields, which include the suburb, state and postcode. I’m then moving the suburb, state and postcode into separate individual fields (see the script below).

The data is mostly in the format below and the script works well when it is:

1 Macquarie Street
SYDNEY NSW 2000

However, I’m running into “undefined script results” with some records because the data has some anomalies (see the examples below). Some are missing states or postcodes, some are split over two lines, some only have an email address as the address, and some are international addresses:

2 Macquarie Street
SYDNEY 2000
---------------------------------
3 Macquarie Street
SYDNEY
2000
---------------------------------
person@email.com
---------------------------------
2 New York Street
New York NY 40302
USA

Can anyone help with my script as to the best way to split these out so that I don’t get the “undefined script results” when processing?

Script to split suburb, state and postcode:

for (field in record.fields) {
    if (record.fields[field].trim().indexOf(" NSW ") > -1) {
        record.fields.postcode = record.fields[field].trim().match(/\d+$/)[0];
        record.fields.locality = record.fields[field].trim().replace(" NSW ", "").replace(record.fields.postcode, "");
        record.fields[field] = "";
        "NSW";
        break;
    }
}

Phil · August 2, 2018, 12:48am

This would probably be doable if you only had Australian addresses because I would presume there’s a limited number of ways of writing them. But the main problem is with international addresses: there could be so many different formats out there that it’s almost impossible to write a script that could handle every single variation.

Can’t you extract the entire address block as an HTML string and use that on your template? Or do you absolutely need each element of the address to be stored in its own specific field?

marrd · August 2, 2018, 12:52am

Hi Phil, thanks for your response. For the Australian addresses, splitting is essential. The international and email-only addresses can remain joined, but they still need to be accessible.

Phil · August 2, 2018, 8:37am

Hi marrd,

I am attaching my very imperfect attempt at matching addresses using the formats you provided above: https://learn.objectiflune.com/qa-blobs/6614794466864697988.ol-datamapper

This configuration first extracts the entire address block and stores it in a field named, unsurprisingly, AddressBlock.

All other fields use the content of the newly extracted AddressBlock field to try and determine what they contain. So the next field in line is Email, which simply (and stupidly, I admit!) searches for an @ character and when found, stores the result in the field.

Then the AU_State field tries to match [space]State_Abbrev[space] for all common Australian state and territory abbreviations. Finally, the AU_PostalCode field tries to find the AU_State abbreviation followed by a numerical value. If a match is found, that numerical value is stored in the field.

Whenever there is no match for any of these fields, the value “NONE” is stored in that field.
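
For anyone reading without the attachment, the matching logic described above could be sketched in plain JavaScript roughly as follows. This is only an approximation: the function name parseAddressBlock, the AU_STATES list and the exact patterns are my assumptions for illustration, not the contents of the configuration file, and the code works on an ordinary string rather than on DataMapper fields.

```javascript
// Illustrative sketch only; names and patterns are assumptions,
// not copied from the attached configuration.
var AU_STATES = ["NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"];

function parseAddressBlock(addressBlock) {
    var result = { Email: "NONE", AU_State: "NONE", AU_PostalCode: "NONE" };

    // Email: any run of non-space characters containing an @.
    var email = addressBlock.match(/\S+@\S+/);
    if (email) {
        result.Email = email[0];
    }

    // AU_State: an abbreviation with whitespace on both sides.
    // \s also matches a line break.
    var state = addressBlock.match(new RegExp("\\s(" + AU_STATES.join("|") + ")\\s"));
    if (state) {
        result.AU_State = state[1];

        // AU_PostalCode: the digits that follow the state abbreviation.
        var postcode = addressBlock.match(new RegExp("\\s" + state[1] + "\\s+(\\d+)"));
        if (postcode) {
            result.AU_PostalCode = postcode[1];
        }
    }
    return result;
}
```

Calling parseAddressBlock("1 Macquarie Street\nSYDNEY NSW 2000") gives AU_State "NSW" and AU_PostalCode "2000", while the email-only and USA examples leave both of those fields at "NONE".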

Note that this process will not correctly identify the following Australian address as a proper one:

1 Macquarie Street
SYDNEY 2000

because there is no AU_State abbreviation in the address, and therefore this could be any kind of international address.
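
If those state-less records still need splitting, one possible fallback is to look for a line that ends in exactly four digits and treat the text before it (or the line above it, when the digits sit alone) as the locality. The sketch below is plain JavaScript with a hypothetical function name, guessLocalityAndPostcode, and it is not part of the attached configuration. It is meant only for blocks where no state abbreviation was found, and it will happily misread any foreign address that happens to use four-digit postal codes, so treat its result as a guess.

```javascript
// Hypothetical fallback, not part of the attached configuration.
function guessLocalityAndPostcode(addressBlock) {
    var lines = addressBlock.split(/\r?\n/)
        .map(function (line) { return line.trim(); })
        .filter(function (line) { return line !== ""; });

    // Start at 1 so the first (street) line is never read as the locality line.
    for (var i = 1; i < lines.length; i++) {
        // Optional text, then exactly four digits at the end of the line.
        var m = lines[i].match(/^(?:(.*\S)\s+)?(\d{4})$/);
        if (!m) {
            continue;
        }
        if (m[1] !== undefined) {
            return { locality: m[1], postcode: m[2] };
        }
        if (i >= 2) {
            // Postcode alone on its line: take the line above it as the locality.
            return { locality: lines[i - 1], postcode: m[2] };
        }
    }
    return null; // nothing that looks like a four-digit postcode
}
```

With the sample data from the question, both "SYDNEY 2000" on one line and "SYDNEY" followed by "2000" on the next line come back as locality "SYDNEY" and postcode "2000", while the email-only and USA blocks return null.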

Note also that the entire logic makes heavy use of Regular Expressions. So you may have to brush up on this very powerful feature if you are not familiar with it.

Hopefully, this will get you started in the right direction.

marrd · August 2, 2018, 8:55pm

Thanks Phil. I’ll give this a go