The E.T. in ETL

Thanks to JD Hancock for the base image! CC BY 2.0

Anyone who’s ever done ETL knows it can get seriously funky. When I first started working on ETL, I was parsing data for a real estate company. Every once in a while, roofing data would appear in the pool field. “Shingles” isn’t a compelling feature for swimming pools. Go figure.

Thankfully, Node.js gives us a lingua franca for sharing cool solutions. A search for data validation shows there are more than a few options. For ETL, let’s take a look at just one of those options.

Very Valid Validity

I’m a fan of code that’s easy to read. I hope you are too.

I like the data-validate library for data validation since its syntax makes it easy to follow. In the npm repo’s words, it’s a “simple chainable data validation library.”

Sounds good to me. Let’s install and give it a whirl:

npm init

npm install --save-dev data-validate

touch index.js

chmod u+x index.js

Field by Field

Email is a great test case for any normalization or validation solution. A basic email check in data-validate looks pretty simple. Here’s what I did in index.js:
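As a rough sketch of what those checks look like (this is plain JavaScript with a hypothetical `cleanEmail` helper, not data-validate’s actual chainable API):

```javascript
// Hypothetical helper illustrating the two checks discussed here:
// basic address shape, plus stripping "+alias" suffixes from the local part.
// Plain JavaScript stand-in; data-validate's real API may differ.
const BASIC_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function cleanEmail(raw) {
  const email = String(raw).trim().toLowerCase();
  if (!BASIC_EMAIL.test(email)) {
    return null; // reject anything that isn't local@domain.tld
  }
  // Remove aliases: "pam+alias@example.com" -> "pam@example.com"
  return email.replace(/\+[^@]*@/, '@');
}

console.log(cleanEmail('Pam+newsletter@Example.com')); // 'pam@example.com'
console.log(cleanEmail('not-an-email'));               // null
```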

Email isn’t that easy, though. There are some ugly addresses that are technically “valid”:

  • $A12345@example.com
  • !def!xyz%abc@example.com
  • _somename@example.com

The point I’m making is that the rules we want to employ are likely to get quite complicated. In the sample above, I’m just filtering for basic address validity and removing aliases (“pam+alias@”).

Being able to read and update rules easily is pretty important, especially if I’m not interested in email addresses with domains like localhost or [IPv6:2001:db8::1].
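A rule like that can stay readable as a plain blocklist (sketch only; the domains are just the examples mentioned above):

```javascript
// Reject otherwise-valid addresses whose domain we don't want,
// e.g. localhost or IPv6 literals. The list is easy to read and extend.
const BLOCKED_DOMAINS = ['localhost', '[ipv6:2001:db8::1]'];

function hasAllowedDomain(email) {
  const domain = email.slice(email.lastIndexOf('@') + 1).toLowerCase();
  return !BLOCKED_DOMAINS.includes(domain);
}

console.log(hasAllowedDomain('pam@example.com')); // true
console.log(hasAllowedDomain('root@localhost'));  // false
```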

Staying Regular

Even as complexity grows, I like chaining validation rules together because it preserves clarity. Sadly, there are some tempting ways to muddle that clarity. At some point, someone on your team will recommend the Lovecraftian nightmare that is regex.

Say you want to validate a field for website URLs. You might Google around and land on a page recommending a regex like:

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

It’s clear as mud. If we must regex, then let’s try to keep the dream alive. How can we make regex maintainable? Break it up.

In other words, take the time to split the pattern into its components. Done well, the variables you store each regex chunk in become self-documenting. Yay, readability! Breaking it up just means adding an assembly step: new RegExp becomes the last step.
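A sketch of that assembly step, using a much-simplified subset of the URL pattern above (the chunk names are my own):

```javascript
// Build one URL regex from small, self-documenting pieces.
// A deliberately simplified subset of the monster pattern above.
const scheme   = '(?:https?|ftp)://';
const userinfo = '(?:\\S+(?::\\S*)?@)?';
const hostname = '(?:[a-z0-9]+(?:-[a-z0-9]+)*)(?:\\.(?:[a-z0-9]+(?:-[a-z0-9]+)*))*';
const tld      = '(?:\\.[a-z]{2,})';
const port     = '(?::\\d{2,5})?';
const path     = '(?:/[^\\s]*)?';

// Assembly is the last step: new RegExp stitches the chunks together.
const URL_RE = new RegExp(`^${scheme}${userinfo}${hostname}${tld}${port}${path}$`, 'i');

console.log(URL_RE.test('https://example.com/path'));   // true
console.log(URL_RE.test('ftp://files.example.org:21')); // true
console.log(URL_RE.test('not a url'));                  // false
```

Each chunk can now be tested and tweaked on its own, which is the whole point.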

Free Text Fields are Awful

If politicians took a hardline stance on free text fields, they just might get my vote. What do you do when your data is a blob of free flowing writing?

Making sense of the data requires natural language processing, or NLP. Unfortunately, NLP is an incredibly complex topic. I’m a big fan of SaaS-y solutions like Wit.ai as a way to tame the complexity.

Getting going just requires the request library and a quick trip to their docs.
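Roughly, a call looks like this. The endpoint and auth header are my reading of Wit.ai’s docs and the token is a placeholder, so verify both before relying on this; the network call itself is left commented out:

```javascript
// Sketch of querying Wit.ai's /message endpoint via the request library.
// Endpoint, query param, and Bearer-token header are assumptions from
// Wit.ai's docs; WIT_TOKEN is a placeholder.
function buildWitOptions(text) {
  return {
    url: 'https://api.wit.ai/message',
    qs: { q: text },
    headers: { Authorization: `Bearer ${process.env.WIT_TOKEN || 'WIT_TOKEN'}` },
    json: true,
  };
}

const opts = buildWitOptions('Shingles is not a pool feature');
console.log(opts.qs.q);

// With the request library installed, the actual call would be:
// const request = require('request');
// request(opts, (err, res, body) => {
//   if (err) throw err;
//   console.log(body); // Wit.ai's parsed intents/entities
// });
```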

Clarity First

I hope I’ve been clear [sic].

For a good ETL workflow, clarity should be valued highly. If you find it’s difficult to read and update your data scrubbing rules, it might be time to consider a new solution.