Super Easy Batch Processing with Docker and IronWorker
Batch processing means getting a lot of one type of task done all in one go, saving time, effort and money. There are many ways to do this, but businesses are always hunting for ways to do more without over-complicating their apps or business tools. Today, we’re going to look at how a user-friendly worker platform and Docker containers can help automate batch processing and make it easier than ever.
Batch Processing: When Will You Use It?
Examples of where you’ll see batch processing used include:
- Image/video processing
- ETL – Extract Transform Load
- Genome processing
- Big data processing
- Billing (creating and sending bills to customers)
- Processing reports
- Dealing with notifications (mobile, email, etc.)
Every business has tasks that need doing over and over, and it’s just not practical to handle these types of tasks manually. That’s where batch processing comes in.
Why Docker for Batch Processing?
Docker allows tasks to be completed in isolation within containers so that they don’t interfere with each other. That makes it possible to automate background tasks and run them in great numbers. A Docker container holds only the code and dependencies for the app or service it runs, so it starts fast and runs efficiently.
In this article we’re going to take a process that would typically take a long time to run, and break it down into small pieces so it can be distributed and processed in parallel. Doing so turns a really long process into a very quick one.
We’ll use Docker to package our worker and IronWorker to do the actual processing.
What is a Worker?
A worker is a small program that does a specific task. From the examples above, a worker could process images, crunch some big data, or generate bills.
Let’s take the above example of creating bills for customers.
One method is to have a single worker query all your users, then generate and email each user’s bill one at a time.
That works, but depending on how many users/customers you have, it could take a long time. Imagine you had 50,000 users and it took 5 seconds to generate and send each bill.
50,000 users × 5 seconds = 250,000 seconds ≈ 2.9 days
So that’s around 3 days to do the work, with errors that are almost impossible to track unless someone monitors the process the entire time.
> Unix Philosophy: Write programs that do one thing and do it well.
A better way to do this is to break the problem apart. Since there is no state that needs to be shared between each user’s bill, we can create one worker that generates and sends a single user’s bill. Then, instead of generating all the bills in one worker, we run a separate worker process for every user. Doing it this way means we can distribute those processes and run them in parallel.

Now, you probably don’t have 50,000 machines sitting around to do this all at once, but even 2 machines would cut the time in half. And what if you had 100 machines? 50,000 × 5 / 100 = 2,500 seconds ≈ 42 minutes. You get the idea.
If a job fails, you know exactly which job it was and therefore which user it failed on. You can debug and rerun the process for that one user quite easily.
What is a Docker Worker?
A Docker Worker is your worker program embedded in a Docker image. When run, a Docker image becomes a Docker container holding all the code and dependencies purely for the task at hand. This makes containers incredibly efficient and ensures no other tasks are affected. It’s possible to run several containers at once, allowing huge numbers of tasks to be completed simultaneously. Writing your worker is no different whether you use Docker or not; Docker just makes deployment and distribution better.
Why build a worker as a Docker image?
- Docker packages your code and all its dependencies
- All you need is Docker installed to run any of your workers
- Easy distribution via a Docker registry (Docker Hub)
Create an Image Processing Worker
It’s time to look at a real-life example of batch processing. As images are so commonly used in modern app development and deployment, we’re going to see how we might use the Docker container environment and IronWorker to batch process some images.
We are going to follow the same pattern as above: one worker (worker A) will perform a transformation on a single image, and another worker will fetch a list of images and create a new task for worker A for every image.
You can find the source code for the project here: https://github.com/treeder/image_processing_worker
Let’s start with the input to the worker (the payload) because the code will be based on the input:
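The payload itself isn’t reproduced here, but based on the description that follows, it might look something like this (all field names are illustrative, not taken from the repo):

```json
{
  "image_url": "https://unsplash.it/800/600?image=100",
  "operations": [
    { "op": "resize", "width": 200, "height": 200 },
    { "op": "rotate", "degrees": 90 }
  ],
  "aws": {
    "access_key": "YOUR_ACCESS_KEY",
    "secret_key": "YOUR_SECRET_KEY",
    "region": "us-east-1",
    "s3_bucket_name": "YOUR_BUCKET"
  }
}
```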
As you can see, we pass in an image URL, an array of operations to perform on the image, and some AWS credentials to store the image in S3.
Now let’s write some code that can transform the image. I’m going to use Ruby for this example (because it’s easy) and ImageMagick for the actual image transformations. This is the main part of the code (full code in the repo linked above):
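The snippet isn’t reproduced here, so the following is a minimal sketch of that flow, assuming the payload shape shown earlier (the aws-sdk-s3 gem and the operation names are assumptions, not lifted from the repo):

```ruby
# image_processor.rb -- a sketch of the worker's main flow.
require 'json'
require 'open-uri'
require 'securerandom'
require 'aws-sdk-s3' # assumed gem; the repo may use a different AWS SDK

# IronWorker puts the path to the payload in the PAYLOAD_FILE env var
payload = JSON.parse(File.read(ENV['PAYLOAD_FILE']))

# 1. Download the source image (one image per task, so a fixed name is fine)
local = 'image.jpg'
File.binwrite(local, URI.open(payload['image_url']).read)

# 2. Apply each operation by shelling out to ImageMagick's convert
payload['operations'].each do |op|
  case op['op']
  when 'resize'
    system('convert', local, '-resize', "#{op['width']}x#{op['height']}", local)
  when 'rotate'
    system('convert', local, '-rotate', op['degrees'].to_s, local)
  end
end

# 3. Upload the transformed image to S3
aws = payload['aws']
s3 = Aws::S3::Resource.new(
  access_key_id:     aws['access_key'],
  secret_access_key: aws['secret_key'],
  region:            aws.fetch('region', 'us-east-1')
)
key = "processed/#{SecureRandom.hex(8)}.jpg"
s3.bucket(aws['s3_bucket_name']).object(key).upload_file(local)
puts "Uploaded #{key}"
```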
If you step through the code above, you’ll see we get the payload, download the image from the URL, loop through the operations to perform on the image, and upload the transformed image to S3.
Let’s test this worker. Clone this repo and go through the setup part of the README so you can follow along.
1. Vendor dependencies (only need to do this once):
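The exact command is in the repo’s README; assuming a standard Gemfile, vendoring the gems inside the same image you’ll run in would look something like:

```bash
docker run --rm -v "$PWD":/worker -w /worker treeder/ruby-imagemagick \
  bundle install --standalone --clean
```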
2. Test it
The treeder/ruby-imagemagick image has Ruby and ImageMagick installed.
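So a local test run (payload.json and image_processor.rb being the hypothetical file names from above) might be:

```bash
docker run --rm -e PAYLOAD_FILE=payload.json \
  -v "$PWD":/worker -w /worker \
  treeder/ruby-imagemagick ruby image_processor.rb
```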
3. Build Docker image
Replace USERNAME in all the following commands with your Docker Hub username.
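Assuming a Dockerfile at the repo root, this is the standard build:

```bash
docker build -t USERNAME/image_processor:latest .
```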
4. Test the Docker image
Now that we’ve built a Docker image with our worker in it, let’s test the image.
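Assuming the Dockerfile sets /worker as its working directory and runs the script as its default command, mounting in a payload is enough:

```bash
docker run --rm -e PAYLOAD_FILE=payload.json \
  -v "$PWD/payload.json":/worker/payload.json \
  USERNAME/image_processor
```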
5. Push it to Docker Hub
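That’s the standard push:

```bash
docker push USERNAME/image_processor:latest
```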
Alright, so we have our worker that can process a single image and it’s packaged up and available on Docker Hub.
IronWorker and Batch Processing
Now we want to get a big list of images and queue up tasks for our image processing worker to process all the images in parallel. But how?
DIY
You could “do it yourself” with a message queue, a bunch of servers, and a script running on those servers to check the queue for tasks and execute them. That’s easier said than done.
Frameworks
You could try to use a container framework like Mesos, but that takes a lot of work to set up and get running.
Iron.io's IronWorker
IronWorker is job processing as a service with full Docker support. We can get this batch process running in a few minutes. So yeah, let’s do that.
Run a Batch of Jobs on Iron.io
We’re going to get a list of ~860 images from the awesome Unsplash.it service and process all of them quickly on Iron.io. First, we need to let Iron.io know about our image processing Docker worker:
```bash
iron register USERNAME/image_processor
```
Now we can start queuing up a ton of jobs for our worker!
Here’s a Ruby script to get the image list and queue up tasks for our worker on Iron.io:
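The script isn’t reproduced here, so this is a sketch using the iron_worker_ng gem; the Unsplash.it /list endpoint and the file names are assumptions:

```ruby
# batch.rb -- queue one IronWorker task per image (a sketch).
require 'json'
require 'open-uri'
require 'iron_worker_ng' # reads IRON_TOKEN and IRON_PROJECT_ID from the environment

# Shared parts of the payload (AWS credentials, operations) come from a local file
base_payload = JSON.parse(File.read(ARGV[0] || 'payload.json'))

# Unsplash.it publishes a JSON list of its images (~860 of them)
images = JSON.parse(URI.open('https://unsplash.it/list').read)

client = IronWorkerNG::Client.new
images.each do |img|
  payload = base_payload.merge(
    'image_url' => "https://unsplash.it/800/600?image=#{img['id']}"
  )
  # One task per image; IronWorker runs these in parallel
  client.tasks.create('USERNAME/image_processor', payload.to_json)
end
puts "Queued #{images.size} tasks"
```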
Now let’s run the script with the same payload, so it can pick up our AWS credentials, using our IRON_TOKEN and IRON_PROJECT_ID environment variables:
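With the hypothetical file names from above, that’s:

```bash
IRON_TOKEN=YOUR_TOKEN IRON_PROJECT_ID=YOUR_PROJECT_ID ruby batch.rb payload.json
```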
That’s it. You can see the jobs running in the Iron.io console and you’ll see a bunch of new images show up in your S3 bucket!
Scheduling your Batch Process
Now to wrap it all up in a nice bow, you can Dockerize the batch.rb script and schedule that to run on some set interval (daily, weekly, monthly). Using Iron.io, scheduling is built in so you would just have to register your batch image (containing batch.rb) and then schedule it with the Iron.io scheduling API.
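As a sketch, the iron_worker_ng gem exposes this through client.schedules.create; the worker name here is illustrative:

```ruby
require 'iron_worker_ng'

client = IronWorkerNG::Client.new # reads IRON_TOKEN / IRON_PROJECT_ID
# Run the registered batch worker once a day; name and payload are illustrative
client.schedules.create('USERNAME/image_batcher',
                        File.read('payload.json'),
                        run_every: 24 * 60 * 60)
```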
I’ll leave implementing that part up to you, the reader, but if you did the stuff above, it should be a piece of cake. Now get batching!