Batch Processing: A Tutorial on Workers, Queueing and Gelato

Worker_Queue_Gelato_V2

Batch processing is one of the earliest ways of data processing, utilized by Herman Hollerith’s Tabulating Machine in 1890. Batch processing was developed to take advantage of scarce computing resources: it avoids idling these expensive resources by queueing instructions to process data without manual user intervention, and can shift workload to times when resources are less scarce1.

Today, we can leverage modern architectural patterns like worker systems, message queues and the cloud to level-up these advantages and simplify our code. Let’s look at an example of queueing and workers using a calorie-dense metaphor: gelato.

gelato2

Using our favorite local gelato shop as an example, we explore how architectural concepts like queueing and workers can affect a given task. We chose gelato because:

  • Each order takes time to set up. You must examine the menu and display, choose and order.
  • Each order takes time to process. The time spent preparing an order can vary based on the order size and complexity, just as job size can vary in a worker system.
  • Adding additional workers helps. The queue of customers will be processed faster, in the same way that adding more workers to a particular batch processing job will make the queue shrink faster.
  • Like your infrastructure, gelato must be kept cold, and is delicious when consumed2.

Check it out:

1 https://en.wikipedia.org/wiki/Batch_processing

2This is not true. Iron.io assumes no responsibility if you eat your infrastructure.

Docker + Iron.io = Super Easy Batch Processing

lotsocontainers

There is a ton of use cases for batch processing and every business is probably doing it in some way or another. Unfortunately, many of these approaches take much longer than need be. In a world of ever increasing data, the old way can now hinder the flow of business. Some examples of where you’ll see batch processing used are:

  • Image/video processing
  • ETL – Extract Transform Load
  • Genome processing
  • Big data processing
  • Billing (create and send bills to customers)
  • Report generation
  • Notifications – mobile, email, etc.

We’ve all seen something that was created during a batch process (whether we like it or not).

Now, I’m going to show you how to take a process that would typically take a long time to run, and break it down into small pieces so it can be distributed and processed in parallel. Doing so will turn a really long process into a very quick one.

We’ll use Docker to package our worker and Iron.io to do the actual processing. Why Docker? Because, we can package our code and all our dependencies into an image for easy distribution. Why Iron.io?  It’s the easiest way to do batch processing. I don’t need to think about servers or deal with distributing my tasks among a bunch of servers.

Alright, so let’s go through how to do our own batch processing.

Continue reading “Docker + Iron.io = Super Easy Batch Processing”