Regex to split address block

RalfG · January 8, 2022, 12:18pm

In the DataMapper I am trying to create a regex that splits an address block (Central European addresses only) into:

street
(countrycode) ZIP City
country

but so far I have failed.

Examples of the possible formats (the captured lines are already separated by |)

The goal is to identify the first 4-5 digit ZIP code (with the appropriate country code).

record.fields.AdrBlock.match(/(\d{4,5}\s)(.[^|]+)/gi).slice(-1)[0].trim(); // works, but strips the country code

Prerequisites: the country may be empty, the ZIP may be preceded by a country code, the ZIP is 4-5 digits, and the street is always the line before the ZIP/city line.

So the result should look like this:

Line 1:
Street: Brownstreet 13
ZipCity: 6280 Hochdorf
Country: SCHWEIZ

Line 2:
Street: Yellowlane 13
ZipCity: CH-6280 Hochdorf
Country: SCHWEIZ

Line 3:
Street: Morninglane 13
ZipCity: CH-6280 Hochdorf
Country: null

Line 4:
Street: Oststrasse 13
ZipCity: 99999 Simcity
Country: null

I tried hundreds of regex combinations, but none worked for all of these cases.

appreciate any help,

Ralf.

Phil · January 8, 2022, 1:22pm

I believe the following should work:

.match(/(([A-Za-z]{2}-)?\d{4,5}\s)(.[^|]+)/gi).slice(-1)[0].trim();

The trick is to create a conditional country code (with dash) that may or may not be there. That’s what the ([A-Za-z]{2}-)? part of the RegEx does.
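
To illustrate the optional group, here is a minimal sketch. The helper name lastZipCity and the sample blocks are assumptions reconstructed from the expected results in the question; they are not part of the DataMapper API.

```javascript
// Hypothetical helper wrapping the pattern above.
// With the g flag, match() returns an array of full matches (or null),
// so slice(-1)[0] picks the last ZIP/city candidate in the block.
function lastZipCity(block) {
  var matches = block.match(/(([A-Za-z]{2}-)?\d{4,5}\s)(.[^|]+)/gi);
  return matches ? matches.slice(-1)[0].trim() : null;
}

// Assumed sample blocks, with lines already separated by "|".
console.log(lastZipCity("Brownstreet 13|6280 Hochdorf|SCHWEIZ"));   // 6280 Hochdorf
console.log(lastZipCity("Yellowlane 13|CH-6280 Hochdorf|SCHWEIZ")); // CH-6280 Hochdorf
console.log(lastZipCity("Oststrasse 13|99999 Simcity"));            // 99999 Simcity
```

The null check is an addition: without it, a block that contains no ZIP would make .slice() throw, because match() returns null when nothing matches.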

RalfG · January 8, 2022, 8:04pm

Phil, thanks!
I changed it to /(?<=\|)(([A-Za-z]{2}[- ])?\d{4,5}\s)(.[^|]+)/gi
to eliminate the wrong match (Line 1): it now allows a space between the country code and the ZIP and checks for a leading "|". ZIP + City now work (sometimes I can't see the wood for the trees).

One more question: any idea how to capture the lines before and after that line (i.e. the preceding and following text delimited by "|")?

Ralf

RalfG · January 10, 2022, 12:03pm

This works in regex101 but not in the DataMapper:
record.fields.AdrBlock.match(/(?<=\|)(([A-Za-z]{2}[- ])?\d{4,5}\s)(.[^|]+)/gi)[0];

Marten · January 10, 2022, 12:57pm

Hi @RalfG, I assume the DataMapper cannot handle the following part of your regular expression: "(?<=\|)", because without it the regular expression seems to work fine.

Phil · January 10, 2022, 2:23pm

The expression (?<=\|) uses the regex engine's lookbehind feature (i.e. the rest of the pattern matches if, and only if, the preceding character is a |). That feature was added in the ECMAScript 2018 specification, but the DataMapper's JavaScript engine implements the ECMAScript 2016 spec, so lookbehind is not supported.
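
As a rough way to check this at run time, the sketch below tries to compile a lookbehind pattern and reports whether that succeeded. The function name supportsLookbehind is hypothetical, and the sketch assumes that an engine without lookbehind support throws when it compiles such a pattern.

```javascript
// Hypothetical feature check. The pattern is passed as a string to the
// RegExp constructor so that the script itself still parses on engines
// that would reject a lookbehind inside a regex literal.
function supportsLookbehind() {
  try {
    new RegExp("(?<=\\|)x");
    return true;
  } catch (e) {
    // Assumption: engines without lookbehind throw a syntax error here.
    return false;
  }
}

console.log(supportsLookbehind());
```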

But in your case, you don’t have to use lookbehind. You can adjust your RegEx to look for the | character without capturing it:

record.fields.AdrBlock.match(/(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i)[1]

Notice the /i flag (without g) and the [1] index at the end of the statement: without the global flag, match() also returns the capturing groups, and [1] retrieves the content of the first capturing group instead of the fully matched expression.
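
A minimal sketch of the difference between the full match and the first capturing group follows. The helper name zipCityAfterPipe and the sample blocks are assumptions for illustration only.

```javascript
// Hypothetical helper around the lookbehind-free pattern.
// Without the g flag, match() returns [fullMatch, group1, ...] or null.
function zipCityAfterPipe(block) {
  var m = block.match(/(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i);
  return m ? m[1] : null; // m[0] would still include the leading "|"
}

var sample = "Brownstreet 13|6280 Hochdorf|SCHWEIZ"; // assumed sample block
var full = sample.match(/(?:\|)((?:[A-Za-z]{2}[- ])?\d{4,5}\s(?:.[^|]+))/i)[0];

console.log(full);                                               // |6280 Hochdorf
console.log(zipCityAfterPipe(sample));                           // 6280 Hochdorf
console.log(zipCityAfterPipe("Morninglane 13|CH 6280 Hochdorf")); // CH 6280 Hochdorf
```

Returning null when nothing matches is an addition here; the one-line statement above throws if the block contains no "|" followed by a ZIP.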

RalfG · January 10, 2022, 5:37pm

I didn't think about ECMA... and you're absolutely right, your solution fits best!

To improve DM speed (the regex shouldn't run on every extracted field), I changed the following steps:

added 2 global properties (AdrBlock object, international int)
inserted an action step to capture the whole address block into one object
inserted another action step that reverses the address block array and tests whether the (now simplified) regex matches:

sourceRecord.properties.Adressblock = sourceRecord.properties.Adressblock.split("<br />").reverse();
// After reversing, element 0 is the last line: the ZIP/city line if there is no country line.
if (/((?:[A-Za-z]{2}[- ])?\d{4,5}\s.+)/i.test(sourceRecord.properties.Adressblock[0])) {
    sourceRecord.properties.international = 0;
} else {
    sourceRecord.properties.international = 1;
}

Then, in the extraction, I just pull the array elements, adding the international value to the array index:
e.g. Street:
sourceRecord.properties.Adressblock[1+sourceRecord.properties.international];

e.g. City:
sourceRecord.properties.Adressblock[(0+sourceRecord.properties.international)];
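
The same idea can be sketched outside the DataMapper as a plain function. The name splitAddressBlock, the returned field names, and the sample inputs are assumptions for illustration; the regex and the index arithmetic follow the steps above.

```javascript
// Hypothetical stand-alone version of the reverse-and-offset approach.
function splitAddressBlock(block, separator) {
  // Reverse so that the last line of the block becomes element 0.
  var lines = block.split(separator).reverse();
  // If the last line looks like "ZIP City", there is no country line.
  var isZipCity = /((?:[A-Za-z]{2}[- ])?\d{4,5}\s.+)/i.test(lines[0]);
  var international = isZipCity ? 0 : 1;
  return {
    street: lines[1 + international] || null,
    zipCity: lines[international] || null,
    country: international === 1 ? lines[0] : null
  };
}

// Assumed sample inputs using the "<br />" separator from the action step.
console.log(splitAddressBlock("Brownstreet 13<br />6280 Hochdorf<br />SCHWEIZ", "<br />"));
console.log(splitAddressBlock("Oststrasse 13<br />99999 Simcity", "<br />"));
```

The regex runs once per block rather than once per extracted field, which is the point of the change described above.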

and: it works!

thx again!

Ralf.