Regex to split address block

In the DataMapper I am trying to create a regex that splits address blocks (Central European addresses only) into

  • street
  • (countrycode) ZIP City
  • country

but I have failed so far.

Examples of the possible formats (the captured lines are already separated by |):

Sir|Alfred Testman |c/o Watermelon 2245 Inc|Brownstreet 13|6280 Hochdorf|SCHWEIZ
Alfred Testman |Testman 2245|Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ
Boris Checkman |c/o Bavarian 5555|Morninglane 13|CH-6280 Hochdorf
Peter Pan|Oststrasse 13|99999 Simcity

The goal is to identify the first 4-5 digit ZIP code (with the appropriate country code).

record.fields.AdrBlock.match(/(\d{4,5}\s)(.[^|]+)/gi).slice(-1)[0].trim(); // works, but strips the country code

Prerequisites: the country may be empty, the ZIP may be preceded by a country code, the ZIP is 4-5 digits, and the street is always the line before the ZIP/city line.

So the result should look like this:

Line 1:
Street: Brownstreet 13
ZipCity: 6280 Hochdorf
Country: SCHWEIZ

Line 2:
Street: Yellowlane 13
ZipCity: CH-6280 Hochdorf
Country: SCHWEIZ

Line 3:
Street: Morninglane 13
ZipCity: CH-6280 Hochdorf
Country: null

Line 4:
Street: Oststrasse 13
ZipCity: 99999 Simcity
Country: null

I tried hundreds of regex combinations, but none worked for all of these cases.

appreciate any help,

Ralf.

I believe the following should work:

.match(/(([A-Za-z]{2}-)?\d{4,5}\s)(.[^|]+)/gi).slice(-1)[0].trim();

The trick is to make the country code (with dash) optional, so it may or may not be there. That's what the ([A-Za-z]{2}-)? part of the RegEx does.
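To see the optional group in action outside the DataMapper, here is a minimal sketch that runs the expression against the four sample lines from the question. The helper name `lastZipCity` and the `samples` array are made up for illustration; only the regular expression itself comes from the post above.

```javascript
// Hypothetical helper: returns the last "ZIP City" match in a |-separated
// address block, or null when nothing matches.
function lastZipCity(block) {
  // ([A-Za-z]{2}-)? is the optional two-letter country code plus dash.
  const matches = block.match(/(([A-Za-z]{2}-)?\d{4,5}\s)(.[^|]+)/gi);
  return matches ? matches.slice(-1)[0].trim() : null;
}

// The four sample blocks from the question.
const samples = [
  "Sir|Alfred Testman |c/o Watermelon 2245 Inc|Brownstreet 13|6280 Hochdorf|SCHWEIZ",
  "Alfred Testman |Testman 2245|Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ",
  "Boris Checkman |c/o Bavarian 5555|Morninglane 13|CH-6280 Hochdorf",
  "Peter Pan|Oststrasse 13|99999 Simcity"
];

const results = samples.map(lastZipCity);
console.log(results);
// [ '6280 Hochdorf', 'CH-6280 Hochdorf', 'CH-6280 Hochdorf', '99999 Simcity' ]
```

Note that the first sample also produces an earlier match ("2245 Inc"), which is why taking the last match with slice(-1) matters here.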

Phil, thx!
I changed it to /(?<=\|)(([A-Za-z]{2}[- ])?\d{4,5}\s)(.[^|]+)/gi
to eliminate the wrong match in Line 1. After allowing a possible space between country code and ZIP and checking for a leading "|", ZIP+City now works (sometimes I can't see the wood for the trees).

And: any idea how to capture the lines before and after that line (i.e. the previous and following text delimited by "|")?

Ralf

works in regex101 but not in DM:
record.fields.AdrBlock.match(/(?<=\|)(([A-Za-z]{2}[- ])?\d{4,5}\s)(.[^|]+)/gi)[0];

Hi @RalfG, I assume that the Data Mapper cannot handle the following part of your Regular Expression: "(?<=\|)", because without it the Regular Expression seems to work fine.

The expression (?<=\|) triggers the RegEx engine's lookbehind functionality (i.e. the full regular expression is a match if, and only if, the preceding character is a |). That functionality was added in the ECMAScript 2018 specification, but the DataMapper's JavaScript engine implements the ECMAScript 2016 spec, so lookbehind is not supported.
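If you want to check this from a script rather than by trial and error, a small probe like the one below can help. It is only a sketch: the function name `supportsLookbehind` is made up, and it assumes the engine reports an unsupported pattern by throwing when the expression is compiled, which is the standard behavior for invalid patterns.

```javascript
// Hypothetical probe: returns true when the engine can compile a lookbehind.
// The pattern is passed as a string so the script itself still parses on
// engines that reject lookbehind in regex literals.
function supportsLookbehind() {
  try {
    new RegExp("(?<=\\|)x");
    return true;
  } catch (e) {
    // Engines without lookbehind treat "(?<" as a syntax error.
    return false;
  }
}

console.log(supportsLookbehind());
```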

But in your case, you don't have to use lookbehind. You can adjust your RegEx to look for the | character without capturing it:

record.fields.AdrBlock.match(/(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i)[1]

Notice the /i flag (without g) and the [1] index at the end of the statement. Without the g flag, match() returns the capturing groups along with the full match, and [1] retrieves the content of the first capturing group instead of the fully matched expression.
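The difference between index [0] and index [1] is easiest to see on a plain string. The following is a minimal sketch using one of the sample lines from the question; the wrapper `zipCity` is a made-up name, and the null check is an addition, since match() returns null when nothing matches and indexing it directly would throw.

```javascript
// Same expression as above: the leading | is matched but not captured.
const re = /(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i;

const sample = "Alfred Testman |Testman 2245|Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ";
const m = sample.match(re);

console.log(m[0]); // "|CH-6280 Hochdorf"  (full match, includes the |)
console.log(m[1]); // "CH-6280 Hochdorf"   (first capturing group only)

// Hypothetical wrapper that guards against blocks without a ZIP line.
function zipCity(block) {
  const found = block.match(re);
  return found ? found[1] : null;
}
```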

I didn't think about ECMA... and you're absolutely right, your solution fits best!!!

To improve DM speed (the regex shouldn't run on every extract field), I changed the following steps:

  • added 2 global properties (AdrBlock object, international int)
  • inserted an action step to capture the whole address block into one object
  • inserted another action step that reverses the address block array and tests whether the (now simplified) regex matches:

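sourceRecord.properties.Adressblock = sourceRecord.properties.Adressblock.split("<br />").reverse();
// After reversing, index 0 holds the last line of the address block.
if (/((?:[A-Za-z]{2}[- ])?\d{4,5}\s.+)/i.test(sourceRecord.properties.Adressblock[0])) {
    sourceRecord.properties.international = 0; // last line is ZIP/city, no country line
} else {
    sourceRecord.properties.international = 1; // last line is the country
}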

Then, in the extraction steps, I just pull the array elements, adding the international value to the array index:
e.g. Street:
sourceRecord.properties.Adressblock[1+sourceRecord.properties.international];

e.g. City:
sourceRecord.properties.Adressblock[(0+sourceRecord.properties.international)];

and: it works!
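For anyone who wants to try the reverse-and-offset idea outside the DataMapper, here is a minimal sketch on plain strings. The function name `splitAddress`, the returned object shape and the trimming of each line are my own additions for illustration; the separator is passed in because the sample lines in the question use "|", while the action step above splits on "<br />".

```javascript
// Hypothetical stand-alone version of the two action steps above.
function splitAddress(block, separator) {
  // Reverse so that index 0 is the last line of the address block.
  // Trimming is an addition; some sample lines carry trailing spaces.
  const lines = block.split(separator).map(s => s.trim()).reverse();

  // Same simplified test as in the action step: does the last line
  // look like "(CC-)ZIP City"?
  const lastIsZipCity = /((?:[A-Za-z]{2}[- ])?\d{4,5}\s.+)/i.test(lines[0]);
  const international = lastIsZipCity ? 0 : 1;

  return {
    street: lines[1 + international],
    zipCity: lines[international],
    country: international ? lines[0] : null
  };
}

console.log(splitAddress(
  "Sir|Alfred Testman |c/o Watermelon 2245 Inc|Brownstreet 13|6280 Hochdorf|SCHWEIZ", "|"));
// { street: 'Brownstreet 13', zipCity: '6280 Hochdorf', country: 'SCHWEIZ' }

console.log(splitAddress("Peter Pan|Oststrasse 13|99999 Simcity", "|"));
// { street: 'Oststrasse 13', zipCity: '99999 Simcity', country: null }
```

Because the block is already split into lines, the street and country fall out of the array positions, so no second regex is needed to capture the lines before and after the ZIP line.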

thx again!

Ralf.