E is for Event: A Fresh Take on ETL


As a follow-up to my previous post, The Workloads of the Internet of Things, I wanted to walk through a real-world example that captures the principles of event-driven computing laid out there.

Let’s set the stage first… imagine we operate a windmill farm and want to continually track optimal weather conditions to maximize energy output. What basic steps need to be taken?

  1. Sensors capture surrounding weather conditions
  2. Captured data is delivered to a backend service
  3. The service calculates the expected power generation
  4. Calculated data is delivered to an analytics system
  5. Data is presented in a variety of charts and maps

This process flow resembles the common Extract, Transform, Load (ETL) pattern. The distinction is that data is pushed from the source to the backend service instead of being pulled, which means we need to update our pipeline to be more reactive.
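To make the push model a little more concrete, here is a minimal in-process sketch in Ruby. The `EventRouter` class and the `reading` event name are hypothetical names I made up for illustration; a real deployment would receive events over a webhook or queue rather than a method call.

```ruby
# Maps event names to independent task callables and runs each triggered task
# in its own thread, so one event does not wait on another.
class EventRouter
  def initialize
    @handlers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Registers a task (a block) to run whenever `event` is emitted.
  def on(event, &task)
    @handlers[event] << task
  end

  # Pushes a payload to every task registered for `event`.
  # Returns the threads so the caller can join them or collect results.
  def emit(event, payload)
    @handlers[event].map { |task| Thread.new { task.call(payload) } }
  end
end

router = EventRouter.new
router.on('reading') { |payload| payload['wind_speed'] * 2 }
results = router.emit('reading', { 'wind_speed' => 8 }).map(&:value)
```

The source decides when work happens by calling `emit`; nothing polls for new data.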

Event, Transform, Load

Data sources come in a variety of flavors these days, from mobile devices to industrial machines, forcing modern applications to account for changing environments on the fly through more event-driven patterns. These workflows tend to have a different lifecycle than traditional applications, where triggers automatically kick off independent tasks for on-demand execution as shown below.


Task-centric Transformation

For this example, let’s first look at the Transform phase, as that is where our core business logic lies. With IronWorker as the development environment, we can build an independent task to calculate the expected power of a windmill. Using the Forecast API, a third-party weather API service, we can get the current temperature, humidity, air pressure, wind speed, and wind bearing at a specific coordinate, all of which are data points needed for a more accurate calculation.

All we’re doing with our task code is taking the coordinates as a payload, hitting the Forecast API, performing some basic calculations, and delivering the data to a queue. (To keep things simple, we’re just going to use wind speed for the calculations.)
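A rough sketch of that task in Ruby might look like the following. It is not the original worker code: the constants, the `lat`/`lon` payload keys, the `windSpeed` field name, and the assumption that wind speed arrives in metres per second are all mine, and the weather lookup and queue delivery are passed in as callables so the calculation can run without network access. The power estimate uses the standard wind power formula P = 0.5 · ρ · A · v³ · Cp.

```ruby
require 'json'

# Assumed constants for illustration; not taken from the original worker.
AIR_DENSITY = 1.225       # kg/m^3, standard sea-level air density
ROTOR_RADIUS = 40.0       # metres, hypothetical rotor size
POWER_COEFFICIENT = 0.4   # hypothetical efficiency, below the Betz limit

# Expected power in watts for a wind speed in m/s: P = 0.5 * rho * A * v^3 * Cp
def expected_power(wind_speed)
  swept_area = Math::PI * ROTOR_RADIUS**2
  0.5 * AIR_DENSITY * swept_area * wind_speed**3 * POWER_COEFFICIENT
end

# Runs one task: coordinates in, calculated result out.
# `fetch_weather` and `deliver` are injected callables standing in for the
# weather API request and the message queue post.
def run_task(payload, fetch_weather:, deliver:)
  lat = payload.fetch('lat')
  lon = payload.fetch('lon')
  wind_speed = fetch_weather.call(lat, lon).fetch('windSpeed')
  result = {
    'lat' => lat,
    'lon' => lon,
    'wind_speed' => wind_speed,
    'expected_power_w' => expected_power(wind_speed).round(1)
  }
  deliver.call(JSON.generate(result))
  result
end
```

In the real worker, `fetch_weather` would wrap the HTTP call to the weather service and `deliver` would post to the queue; here they can be replaced with simple lambdas for local testing.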

Docker-based Development

A key distinction we make with a task-centric development environment is that code is packaged and uploaded instead of deployed. Then the task can be triggered to run on-demand via an event – webhook, schedule, push queue, or API call.

For this example, let’s walk through the new Iron.io Docker-based workflow to package the worker. With Docker, you can be sure your task compute environment is consistent from dev to production. While local Docker-based development may be a point of contention these days, we do believe it is the future of continuous delivery, and it makes perfect sense for IronWorker today given that we’re dealing with loosely coupled, stateless tasks.

Step 1: Build Docker image

First, we set our dependencies and runtime commands. We’re only loading a few gems in a Ruby environment, so this will be a very lightweight Docker image.

To test this locally, make sure you’re running Docker and use the following commands to vendor the dependencies and then execute the image in a container. Place a single lat/lon coordinate in a JSON file as a sample payload.

$ docker run --rm -v "$(pwd)":/worker -w /worker iron/images:ruby-2.1 sh -c 'bundle install --standalone --clean'
$ docker run --rm -v "$(pwd)":/worker -w /worker iron/images:ruby-2.1 sh -c 'ruby windmill.rb -payload windmill.payload.json -id 123'
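The second command passes the payload file and a task id as `-payload` and `-id` arguments. One way a script could pick those up is sketched below; the helper names are hypothetical, the `lat`/`lon` keys are an assumed payload shape, and the original worker may well use a helper library for this instead.

```ruby
require 'json'

# Returns the value following a flag such as "-payload" in an argument list,
# or nil when the flag is absent.
def flag_value(args, flag)
  index = args.index(flag)
  index && args[index + 1]
end

# Reads and parses the JSON payload file named after -payload.
def load_payload(args)
  path = flag_value(args, '-payload')
  raise ArgumentError, 'missing -payload <file>' if path.nil?

  JSON.parse(File.read(path))
end
```

Inside the worker this would be called as `load_payload(ARGV)`, with a payload file along the lines of `{"lat": 37.77, "lon": -122.42}`.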


Step 2: Upload to Iron.io

Once you’ve run the task locally, you can package and upload it to Iron.io.

$ zip -r windmill.zip .
$ iron worker upload --name windmill --zip windmill.zip iron/images:ruby-2.1 ruby windmill.rb


Step 3: Queue a Task

Once uploaded, queue a single task so you can see it running in the Iron.io environment. The --wait parameter waits for the task to execute and prints the log.

$ iron worker queue --payload-file windmill.payload.json --wait windmill

[Image: queue task]

Respond to Event Triggers

With our worker uploaded to Iron.io and ready for execution on-demand, we can set our triggers. Considering that I don’t actually own and operate a windmill farm, we’re going to fake it using a simple Ruby script that randomly generates a set of coordinates with the Faker gem.
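A trigger script along those lines could be sketched as follows. To keep it dependency-free, this version swaps the Faker gem for Ruby's built-in `Random`, and the actual queueing call is passed in as an `enqueue` callable rather than written against a specific client library; both choices, along with the `lat`/`lon` payload keys, are assumptions made for illustration.

```ruby
# Generates `count` random coordinates. Passing a seeded Random makes a run
# repeatable, which is handy for testing.
def random_coordinates(count, rng: Random.new)
  Array.new(count) do
    { 'lat' => rng.rand(-90.0..90.0).round(4),
      'lon' => rng.rand(-180.0..180.0).round(4) }
  end
end

# Queues one windmill task per coordinate via the injected `enqueue` callable,
# which receives the worker name and the payload for that task.
def trigger(count, enqueue:, rng: Random.new)
  random_coordinates(count, rng: rng).each do |coordinate|
    enqueue.call('windmill', coordinate)
  end
end
```

In a real script, `enqueue` would wrap whatever call creates a task in IronWorker, and `count` would come from the command line.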

To demonstrate the concurrent processing capabilities of IronWorker, we can trigger any number of windmill tasks at once.

$ ruby trigger.rb 100

This will queue up 100 tasks to run concurrently. Once complete, we can inspect a single run task to see the output. Our fake windmill spun up, ran the task code, and delivered the data to IronMQ.

[Image: windmill task log]

Scheduled Event

In the real world, we would want to trigger this worker on a set schedule so we can build better time-series charts. IronWorker has a built-in job scheduler, so we’ll upload our trigger script as another worker following the same steps as before. Using our dashboard, we’ll set this worker to run every hour.

[Image: windmill schedule]

Iron.io for Modern ETL

This example demonstrates a few key points about the general nature of ETL workloads and why they are well suited to a task-centric processing environment such as the one provided by Iron.io.

  • Capturing the event, running the code, and delivering the results happens asynchronously
  • The task is loosely coupled, with only the dependencies needed to run the process
  • The task is stateless, with the required data input taken in as a payload
  • The compute environment is ephemeral, only needed for the duration of the process
  • The lifecycle of the pipeline needs to be tracked from source to destination, with failure mechanisms in place for each step along the way
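The last point, tracking each step and handling failures along the way, can be illustrated with a small sketch. This is a generic retry-and-log pattern written for this post's example, not how Iron.io implements it; `run_pipeline` and the step names are hypothetical.

```ruby
# Runs each named step in order, feeding one step's output into the next.
# A failing step is retried up to `max_attempts` times, and every outcome is
# recorded so the pipeline can be traced from source to destination.
def run_pipeline(steps, input, max_attempts: 3)
  log = []
  value = input
  steps.each do |name, step|
    attempts = 0
    begin
      attempts += 1
      value = step.call(value)
      log << { step: name, status: :ok, attempts: attempts }
    rescue StandardError => e
      retry if attempts < max_attempts
      log << { step: name, status: :failed, attempts: attempts, error: e.message }
      return { value: nil, log: log }
    end
  end
  { value: value, log: log }
end
```

Each step is a callable keyed by name, for example `run_pipeline({ transform: calculate, load: deliver }, payload)`, and the returned log shows which step failed and after how many attempts.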