Obtaining data from HTML file

Hi There

I have an HTML document from which I need to extract name and address details. I was wondering whether the best way of doing this is with a data mapper or with Workflow.

Thanks

Well, it sort of depends.

If this is an active web page, such as a form that users will input data into, you can intercept it with Workflow by using the HTTP Server Input plugin.

If this is simply an HTML document generated by some other application or process, you will want to use the Data Mapper and treat it as any other text file. Doing it this way, you’ll just want to watch out for a few gotchas:

  1. Be sure the data is always located on the same lines in the HTML code. If your output is generated automatically, this probably won’t be an issue. If it is, though, remember that you can always do full-text searches for the tags, classes, or IDs listed in your HTML. You’ll take a hit to processing speed by doing this, so be sure there is no other way first.
  2. Since you’ll probably be dealing with variable-length data surrounded by HTML tags, you will most likely have to grab the entire string, tags and all, then use regular expressions or scripting to trim off the tags.
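
To illustrate both points, here is a rough standalone sketch in Python. It is not Data Mapper script, just the general idea. The sample markup, the class names (name, address), and the helper names strip_tags and find_by_marker are all made up for the example, so swap in whatever your HTML actually contains. The regular expression itself should carry over to whatever scripting you end up using.

```python
import html
import re

# Matches any single HTML tag, e.g. <span class="name"> or </span> or <br>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(raw: str) -> str:
    """Remove HTML tags, decode entities, and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw)      # replace each tag with a space
    text = html.unescape(text)       # turn &amp; into &, and so on
    return " ".join(text.split())    # collapse whitespace and trim the ends


def find_by_marker(lines, marker):
    """Full-text search: cleaned text of the first line containing marker.

    Returns None when no line contains the marker.
    """
    for line in lines:
        if marker in line:
            return strip_tags(line)
    return None


# Hypothetical sample input; real documents will differ.
sample = [
    '<div class="customer">',
    '  <span class="name">Jane &amp; John Doe</span>',
    '  <span class="address">12 Example Street<br>Springfield</span>',
    '</div>',
]

name = find_by_marker(sample, 'class="name"')
address = find_by_marker(sample, 'class="address"')
print(name)
print(address)
```

Keep in mind that a simple pattern like this assumes each value sits on a single line and that no attribute value contains a ">" character. If your HTML is messier than that, a proper HTML parser is the safer route.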