A good worker pattern, based on what we've seen and done, is to chunk the work (and create task-level workers) at runtime rather than in advance. In other words, use a master/slave setup that is event- or time-driven, and do the majority of the work using concurrent processes. When the master comes off the queue, it can slice up the work/data space into granular tasks/chunks and queue up slave worker jobs that each handle a collection of discrete tasks and data. (If the work or data space is especially large or complicated, the slicing can be distributed across a set of master tasks.)
The reason for waiting until runtime to create the slave workers is that viewing jobs in the schedule is much easier at a coarse-grained level, where the scheduled jobs correspond to the units you're tracking (such as webpages, user profiles, or blocks of data from users, sensors, or other streaming input devices). At this level, you can more easily monitor and inspect the collection of scheduled jobs because they correspond to your key metrics and inputs. Notifications have a better signal-to-noise ratio, and status indicators are much more meaningful.
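To make the runtime slicing concrete, here is a minimal sketch in Python. It assumes an in-memory `deque` stands in for a real job queue, and the names `chunk`, `run_master`, and the `update_profiles` task label are hypothetical, made up for illustration rather than taken from any particular worker system.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_master(items: List[str], queue: Deque[Dict], batch_size: int = 100) -> int:
    """Slice the work space at runtime and queue one worker job per slice.

    Returns the number of worker jobs that were queued.
    """
    queued = 0
    for batch in chunk(items, batch_size):
        # Each job carries a whole collection of items, not a single item.
        queue.append({"task": "update_profiles", "items": batch})
        queued += 1
    return queued


# The schedule only holds the coarse-grained master job; the fine-grained
# worker jobs appear when the master actually runs.
queue: Deque[Dict] = deque()
profiles = [f"user-{n}" for n in range(250)]
jobs = run_master(profiles, queue, batch_size=100)
batch_sizes = [len(job["items"]) for job in queue]
print(jobs, batch_sizes)
```

With 250 items and a batch size of 100, the master queues three worker jobs, the last one holding the 50 leftover items.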
Fine-Grained Data | Coarse-Grained Workers
The slicing of the work often gets done at a fine-grained level, meaning that atomic pieces of data are created or made available for your task-level routines to do their work (such as checking whether a Klout score needs to be updated, or adding explicit likes or clickstream data to a user's existing preference information).
Instead of creating a worker for each data element, however, it's better to have each worker work on a reasonable collection of tasks or data items. We've found that having each worker process multiple tasks (20-1000 data items for example) provides a good balance between:
- optimizing worker setup (establishing a database connection for example)
- providing good introspection into the jobs
- making retries and exception handling more manageable
The number of items per worker will depend on the type and length of the task. The idea is to have workers execute in minutes, as opposed to seconds or hours. That gives you greater visibility into worker performance and means a retry only affects a limited part of the work space.
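One way to turn the "minutes, not seconds or hours" rule of thumb into a batch size is sketched below. The function name `items_per_worker` and the five-minute default target are our own assumptions for illustration; the 20 and 1000 bounds simply echo the example range given above.

```python
def items_per_worker(seconds_per_item: float,
                     target_minutes: float = 5.0,
                     low: int = 20,
                     high: int = 1000) -> int:
    """Pick a batch size so one worker runs for roughly `target_minutes`.

    The result is clamped to the [low, high] range so that very fast or
    very slow tasks still get a sensible number of items.
    """
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    ideal = int(target_minutes * 60 / seconds_per_item)
    return max(low, min(high, ideal))


# A task that takes about two seconds per item fits 150 items in five minutes.
medium = items_per_worker(2.0)
# A very fast task is capped at the upper bound.
fast = items_per_worker(0.05)
# A very slow task still gets the lower bound, so the worker will run longer
# than the target; that is a signal the task itself may need splitting.
slow = items_per_worker(60.0)
print(medium, fast, slow)
```

The per-item timing has to be measured for your own tasks; the point of the sketch is only that the batch size follows from the run time you are aiming for.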
Making use of S3 to hold the large data blocks and then using a NoSQL solution (esp. database-as-a-service ones like MongoHQ or MongoLabs) to hold the data slices can make it easy to keep track of and manage the data slicing and task-level work.
A key part of creating any set of worker processes is to do so in a way that makes it independent of your application environment. This means writing each worker so it can run in an independent app environment, and using callbacks, database flags, and other asynchronous approaches to communicate between the application and the workers.
Doing it this way gives you much greater agility: workers can be modified, or new ones created, without worrying about application dependencies. This approach also allows the work to run asynchronously and be distributed over an elastic worker system, which is really where you need to go if you have even a modest amount of work.
Just as applications are being written to run on elastic (and disposable) infrastructure, workers also need to be written to run in elastic environments, ones that will increasingly be separate from your application environment.
Update: We're seeing the need for this pattern more and more, which should explain the anti-pattern (and corresponding blog post) that we came up with: