There are a ton of use cases for batch processing, and nearly every business is probably doing it in one way or another. Unfortunately, many of these approaches take much longer than they need to. In a world of ever-increasing data, the old way can now hinder the flow of business. Some examples of where you’ll see batch processing used are:
- Image/video processing
- ETL – Extract Transform Load
- Genome processing
- Big data processing
- Billing (create and send bills to customers)
- Report generation
- Notifications – mobile, email, etc.
We’ve all seen something that was created during a batch process (whether we like it or not).
Now, I’m going to show you how to take a process that would typically take a long time to run, and break it down into small pieces so it can be distributed and processed in parallel. Doing so will turn a really long process into a very quick one.
We’ll use Docker to package our worker and Iron.io to do the actual processing. Why Docker? Because we can package our code and all of its dependencies into an image for easy distribution. Why Iron.io? It’s the easiest way I know of to do batch processing. I don’t need to think about servers or deal with distributing my tasks among a bunch of them.
Alright, so let’s go through how to do our own batch processing.
What is a Worker?
A worker is a small program that does a specific task. From the examples above, a worker could process images, crunch some big data, or generate bills.
There is a good way to define your workers and a bad way
Let’s talk about billing, because there’s nothing more exciting than getting a bill in your inbox, right?
One way to do this would be for your worker to query all your users and iterate through all of them, generate the bill for each user and send it via email.
That would work, but depending on how many users you have it could take a long time. Imagine you had 50,000 users and it took 5 seconds to generate and send each bill.
50,000 users * 5 seconds = 250,000 seconds, or about 2.9 days!
That would take about 3 days to finish! Also, if something goes wrong while generating the bill for user 15,001, then what? How do you know which user it errored out on? How do you know which users’ bills have already been generated and which ones haven’t? How do you restart from that user and generate only the bills that haven’t been generated yet? You’ll have to spend a lot of time digging in to figure it out. You may even need to create additional tools to help troubleshoot.
A better way
A better way to do this is to break the problem apart. Since there is no state that needs to be shared between users’ bills, we can create one worker that generates and sends a single user’s bill. Then, instead of generating all the bills in one worker, we run a separate worker process for every user. Doing it this way means we can distribute those processes and run them in parallel. If we had enough capacity to run all of these processes at once, we could run the entire billing process in 5 seconds! Now, you probably don’t have 50,000 machines sitting around to do this all at once, but even 2 machines would cut the time in half. And what if you had 100 machines? 50,000 * 5 / 100 = 2,500 seconds, or about 42 minutes. You get the idea.
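The arithmetic above is easy to check with a few lines of Ruby. This is only a back-of-the-envelope sketch: it assumes every bill takes exactly 5 seconds and that the work splits evenly across machines, and the `batch_seconds` helper is a name made up for this illustration.

```ruby
# Back-of-the-envelope timing for the billing example.
USERS = 50_000
SECONDS_PER_BILL = 5

# Wall-clock seconds when the bills are spread evenly over `machines` machines.
def batch_seconds(users, seconds_per_bill, machines)
  # Each machine works through ceil(users / machines) bills, one after another.
  (users.to_f / machines).ceil * seconds_per_bill
end

sequential  = batch_seconds(USERS, SECONDS_PER_BILL, 1)
hundred     = batch_seconds(USERS, SECONDS_PER_BILL, 100)
all_at_once = batch_seconds(USERS, SECONDS_PER_BILL, USERS)

puts format("1 machine:       %.1f days", sequential / 86_400.0)
puts format("100 machines:    %.1f minutes", hundred / 60.0)
puts format("50,000 machines: %d seconds", all_at_once)
```

Real runs will be slower than this, since queueing and startup overhead are ignored, but the shape of the speedup is the point.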
In addition, you don’t have the problems of figuring out where a process errors out and continuing the process, because you can know exactly which job failed and therefore which user it failed on. Then you can debug and rerun the process for that one user quite easily.
A worker should do one thing and do it well. (Unix Philosophy)
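To make the one-job-per-user idea concrete, here is a minimal sketch. It uses local threads as a stand-in for distributed workers, and `bill_user` is a hypothetical placeholder (with a simulated failure for user 3), not real billing code. Because each job records its own outcome, a failure points straight at the user it belongs to.

```ruby
# Hypothetical stand-in for the real "generate and send one bill" worker.
def bill_user(user_id)
  raise "no billing address on file" if user_id == 3 # simulated failure
  "bill sent to user #{user_id}"
end

# Run one job per user and record each outcome separately, so a single
# failure never hides the results of the other jobs.
def run_batch(user_ids)
  threads = user_ids.map do |id|
    Thread.new do
      begin
        { user_id: id, status: :ok, result: bill_user(id) }
      rescue StandardError => e
        { user_id: id, status: :failed, error: e.message }
      end
    end
  end
  threads.map(&:value) # wait for every job and collect its result hash
end

results = run_batch([1, 2, 3, 4, 5])
failed  = results.select { |r| r[:status] == :failed }
failed.each { |r| puts "user #{r[:user_id]} failed: #{r[:error]}" }
```

Rerunning the batch for just the failed users is then a matter of passing `failed.map { |r| r[:user_id] }` back into `run_batch`.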
What is a Docker Worker?
A Docker Worker is your worker program embedded in a Docker image.
Why build a worker as a Docker image?
- Docker packages your code and all its dependencies
- All you need is Docker installed to run all your workers
- Easy distribution via a Docker registry (Docker Hub)
Writing your worker is no different whether you use Docker or not. Docker just makes it better for deployment and distribution.
Enough talk, let’s get down to business.
Batch Image Processing
For this demo, we’re not going to do billing, since it’s about as exciting as watching paint dry. Instead, we’re going to do image processing, since it’s easy to see the results. We’ll follow the same pattern as above: one worker (call it worker A) will transform a single image, and a second worker will fetch a list of images and create a new task for worker A for each image.
You can find source code for the project here: https://github.com/treeder/image_processing_worker
Create an Image Processing Worker
Let’s start with the input to the worker (the payload), because the code will be based off the input:
As you can see, we pass in an image URL, an array of operations to perform on the image, and some AWS credentials to store the image in S3.
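For illustration, a payload with that shape might look like the sketch below. Every key name and value here is an assumption made up for this example (the real field names are defined in the linked repository), and `parse_payload` is a hypothetical helper, not part of the project.

```ruby
require "json"

# Hypothetical payload: every key name below is an assumption for illustration.
payload_json = <<~JSON
  {
    "image_url": "https://example.com/photo.jpg",
    "operations": [
      { "op": "resize", "width": 200, "height": 200 },
      { "op": "sketch" }
    ],
    "aws": {
      "access_key": "YOUR_ACCESS_KEY",
      "secret_key": "YOUR_SECRET_KEY",
      "s3_bucket_name": "YOUR_BUCKET"
    }
  }
JSON

REQUIRED_KEYS = %w[image_url operations aws].freeze

# Parse the payload and fail fast if a top-level key is missing.
def parse_payload(json)
  payload = JSON.parse(json)
  missing = REQUIRED_KEYS - payload.keys
  raise ArgumentError, "payload is missing: #{missing.join(', ')}" unless missing.empty?
  payload
end

payload = parse_payload(payload_json)
puts "image: #{payload['image_url']}"
puts "operations: #{payload['operations'].map { |op| op['op'] }.join(', ')}"
```

Checking the payload up front means a malformed task fails immediately with a clear message instead of halfway through an image download.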
Now let’s write some code that can transform the image. I’m going to use Ruby for this example (because it’s easy) and ImageMagick for the actual image transformations. This is the main part of the code (full code here):
If you step through the code above, you’ll see we get the payload, download the image from the URL, loop through the operations to perform on the image, and upload the transformed image to S3.
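The flow described above can be sketched without any image library at all. This is not the author's code: the "image" is modeled as a plain list of applied steps, the `TRANSFORMS` table and payload keys are hypothetical, and the download and upload steps are passed in as stand-in lambdas where real code would fetch the file, call ImageMagick, and write to S3.

```ruby
# Stand-in transformations: each takes the "image" plus the operation's options
# and returns the new image. Real code would call ImageMagick here.
TRANSFORMS = {
  "resize" => ->(image, op) { image + ["resize #{op['width']}x#{op['height']}"] },
  "sketch" => ->(image, _op) { image + ["sketch"] }
}.freeze

# Download the image, apply each operation in order, then upload the result.
def process(payload, download:, upload:)
  image = download.call(payload["image_url"])
  payload["operations"].each do |op|
    transform = TRANSFORMS.fetch(op["op"]) do
      raise ArgumentError, "unknown operation: #{op['op']}"
    end
    image = transform.call(image, op)
  end
  upload.call(image)
end

# Hypothetical payload and stand-in I/O steps for a dry run.
payload = {
  "image_url"  => "https://example.com/photo.jpg",
  "operations" => [{ "op" => "resize", "width" => 200, "height" => 200 },
                   { "op" => "sketch" }]
}
download = ->(url) { ["downloaded #{url}"] }
upload   = ->(image) { image + ["uploaded"] }

steps = process(payload, download: download, upload: upload)
puts steps
```

Keeping the operations in a lookup table means adding a new transformation is one more entry, and an unknown operation fails loudly rather than being silently skipped.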
Let’s test this worker. Clone this repo and go through the setup part of the README so you can follow along.
1. Vendor dependencies (only need to do this once):
2. Test it
The treeder/ruby-imagemagick image has Ruby and ImageMagick installed.
3. Build Docker image
Replace USERNAME in all the following commands with your Docker Hub username.
4. Test the Docker image
Now that we’ve built a Docker image with our worker in it, let’s test the image.
5. Push it to Docker Hub
Alright, so we have our worker that can process a single image and it’s packaged up and available on Docker Hub.
Now we want to get a big list of images and queue up tasks for our image processing worker to process all the images in parallel. But how?
You could “do it yourself” with a message queue, a bunch of servers, and a script running on those servers to check the queue for tasks and execute them. That’s a lot easier said than done.
You could try to use a container framework like Mesos, but that will take a lot of work to set up and get running.
IronWorker is job processing as a service with full Docker support. We can get this batch process running in a few minutes. So yeah, let’s do that.
Run a Batch of Jobs on Iron.io
We’re going to get a list of about 860 images from the awesome Unsplash.it service and process all of them quickly on Iron.io. First we need to let Iron.io know about our Image Processing Docker Worker:
iron register USERNAME/image_processor
Now we can start queuing up a ton of jobs for our worker!
Here’s a Ruby script to get the image list and queue up tasks for our worker on Iron.io:
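The core of such a script is a fan-out: take the shared settings, swap in one image URL per task, and hand each payload to the queue. The sketch below shows only that shape. The `enqueue` lambda is a stand-in that collects payloads in memory (it is not the Iron.io client API), the image URLs are placeholders rather than a real Unsplash.it listing, and the payload keys are assumptions.

```ruby
require "json"

# Build one task payload per image, reusing the shared settings.
def build_payloads(image_urls, base_payload)
  image_urls.map { |url| base_payload.merge("image_url" => url) }
end

# Hand every payload to the queue and return whatever the queue gives back.
def queue_batch(image_urls, base_payload, enqueue)
  build_payloads(image_urls, base_payload).map do |payload|
    enqueue.call(JSON.generate(payload))
  end
end

# Stand-in queue: collects payloads in memory and returns a fake task number.
queued  = []
enqueue = lambda do |json|
  queued << json
  queued.size
end

# Hypothetical shared settings and placeholder image URLs.
base_payload = { "operations" => [{ "op" => "sketch" }], "aws" => { "s3_bucket_name" => "YOUR_BUCKET" } }
image_urls   = (1..3).map { |n| "https://example.com/images/#{n}.jpg" }

task_ids = queue_batch(image_urls, base_payload, enqueue)
puts "queued #{task_ids.size} tasks"
```

In a real batch script, the stand-in lambda would be replaced by a call that creates a task for the registered worker image.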
Now let’s run the script with the same payload, so it can get our AWS credentials, and with our IRON_TOKEN and IRON_PROJECT_ID environment variables set:
That’s it. You can see the jobs running in the Iron.io console and you’ll see a bunch of new images show up in your S3 bucket!
Scheduling your Batch Process
Now to wrap it all up in a nice bow, you can Dockerize the batch.rb script and schedule that to run on some set interval (daily, weekly, monthly). Using Iron.io, scheduling is built in so you would just have to register your batch image (containing batch.rb) and then schedule it with the Iron.io scheduling API.
I’ll leave implementing that part up to you, the reader, but if you did the stuff above, it should be a piece of cake. Now get batching!