Converting HTML emailbody to plain text

Vidar_S · September 23, 2021, 8:36am

I am trying to set up a process where I capture email and then use datamapper to extract text from the email-body. Everyting works if it is a plain text email, but my challenge is that it is a html email and then when i use send to folder and save the email-body as a txt file for datamapper to process, it will contain all the html formatting as plain text. So my challenge is that all the divs and font information etc… is mixed with the text information i need to extract.

Do anyone have an idea for a solution for this?

Phil · September 24, 2021, 11:55am

The Email Input task only retrieves the default format from the server, and that default format is usually HTML when it is available.

Instead, you can use the Secure Email Input task, which retrieves both the Text and the HTML content.

When you look at the Body.html file it generates, you’ll see that the text content of the email is at the very top of the file, with the html content appended after the text. The HTML portion starts with a <html marker.

To filter out the entire HTML content, you can use an Advanced Search & Replace task and specify the following Regular Expression:

<html[\s\S]*?</html>

Here’s what the rest of the settings look like:

richardTabram · March 14, 2022, 7:20pm

Hi there

I have the same requirement but my ‘body.html’ only contains HTML, is there a setting to enable just the text?

Phil · March 14, 2022, 7:58pm

That should be a setting on the server side, not on Workflow’s side.

klaus.soerensen · August 23, 2022, 1:32pm

Hi Phil

Have a customer where I just switched form IMAP input to MS 365 Email input
And correct now the body file is in html format, where it was text before.
Tried your trick with the replace, but it does not do anything for me.
I have attached the html file, hope you have the knolege to crack it, so I can get the text in the body file.010E4VVW8DED3F2.zip (1.2 KB)

Phil · August 23, 2022, 3:46pm

Hmmm… that’s a bit of a challenge…

First off, the file does not contain a text version of the email: it is pure HTML. That means the file must be parsed to remove all HTML tags and attributes so that only the text content remains.

Second, the file is encoded in UTF-16, which Workflow can’t handle natively (and the Translator task cannot be used since it doesn’t know about UTF-16 either). Which means we’ll have to convert it via a script.

Therefore, after you receive the file, you can use the following script to convert it from UTF-16 to UTF-8:

var oFSO   = new ActiveXObject("Scripting.FileSystemObject");
var job    = Watch.GetJobFileName();
var jobNew = job+'.tmp';

var oInFile  = new ActiveXObject("ADODB.Stream");
var oOutFile = new ActiveXObject("ADODB.Stream");

oInFile.Open;
oInFile.Type = 2;
oInFile.Charset = "utf-16le";
oInFile.LoadFromFile(job);
oInFile.Position = 0;

oOutFile.Open;
oOutFile.Type = 2;
oOutFile.Charset = "utf-8";
oOutFile.WriteText(oInFile.ReadText());
oOutFile.Position = 0;
oOutFile.SaveToFile(jobNew, 2);

oOutFile.Close()
oInFile.Close();

oFSO.DeleteFile(job);
oFSO.MoveFile(jobNew,job);

Then, immediately after that script, you can use an Advanced Search and Replace task to remove all HTML tags. The string to search for is:

<[^\<]*>

(make sure to tick the Treat as regular expressions option)

and the string to replace with is

\n

At that stage, you should have a pure text file… that contains a lot of white space.

NOTE: I’m not quite sure how generic that solution is. It’s a bit too convoluted for my tastes, but it does work for that one file you provided. You may run into different issues with other files. So you will have to test extensively before using that logic in a production environment.

In other words: use at your own risks!

klaus.soerensen · August 24, 2022, 5:15am

Will try it later.
Html is what the new MS 365 email plugin get from Office365 for the body part, when the mail i retrieved.
IMAP retrieved text files.
But if I look at the html files, in source code in a browser they are all in UTF-8 format.
010E50C5XSP1E81 - UTF8.zip (328 Bytes)