
Tuesday, June 14, 2011

Post Twitter Updates to HipChat using SimpleWorker - No Server Required

We use HipChat for internal communications at SimpleWorker, and one of its great features is a nice API that lets you do cool things like post messages to a chat room programmatically. I wanted to be notified whenever someone mentioned SimpleWorker on Twitter, and what better way to do that than to schedule a worker to check Twitter every day (or hour) and post the results to our HipChat room?

So 20 minutes later, I was done and running in the cloud. The beauty of this is how easy it was to write the worker to do what I wanted and then schedule it to run every day using SimpleWorker's built-in scheduling.
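The worker code embedded in the original post isn't reproduced here, but a minimal sketch of what such a worker might look like follows. The class, method, and attribute names are assumptions for illustration, not the original source (which is in the linked repo):

```ruby
# Sketch of a Twitter-to-HipChat worker. Names here are illustrative;
# the real worker lives in the repo linked below.
class TwitterToHipChat
  attr_accessor :search_term, :hipchat_token, :room_id

  # Build the message body to post to the HipChat room from a list of
  # tweet hashes (shaped like Twitter's search API results).
  def format_message(tweets)
    tweets.map { |t| "@#{t['from_user']}: #{t['text']}" }.join("\n")
  end

  # In the real worker's run method: fetch tweets mentioning
  # search_term from the Twitter search API, then POST the formatted
  # message to HipChat's rooms/message endpoint using hipchat_token
  # and room_id.
  def run
    # network calls omitted in this sketch
  end
end
```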



Then to schedule it to run every day, I simply call:

tw.schedule(:start_at=>1.days.from_now.change(:hour=>3), :run_every=>24*60*60) # to have it recur every day

You can get the full source code here. Just clone it, change the config.yml and run enqueue_worker.rb to run it for yourself.

So simple. So elegant. No servers required. And did I mention how simple it is?

Monday, June 13, 2011

How to Build a Serverless Search Engine with SimpleWorker and IndexTank

Our friends over at IndexTank recently posted an article on how to create a search page using IndexTank and SimpleWorker: SimpleWorker gathers data from a remote service, in this case Plixi, and IndexTank indexes that data for searching. SimpleWorker is great for collecting and keeping data up to date because you can schedule a worker to run every X minutes to retrieve data and update your IndexTank index.

But even more interesting in this example is the fact that the search page is static HTML and runs without an application or server! You can create a search page with live, up-to-date data without running a server anywhere. You could put this page on a cheap GoDaddy hosting account, Amazon S3, a file on your intranet, or even on your local file system. Interesting stuff.


Try it out here


This really got me thinking about what other things you could build without an application server and hopefully it will get your creative juices flowing too. Feel free to share any ideas in the comments below, we'd love to hear them.

Wednesday, June 8, 2011

Pattern: Creating Task-Level Workers at Runtime

One type of question appearing more and more frequently on StackOverflow and Quora concerns general architecture and app design. We came across one the other day on approaches to queuing and scheduling workers.


A good worker pattern – based on what we've seen and done – is to chunk the work (and create task-level workers) at runtime rather than ahead of time. In other words, use a master/slave setup that is event- or time-driven and do the majority of the work in concurrent processes. When the master comes off the queue, it can slice the work/data space into granular tasks/chunks and queue up slave worker jobs, each handling a collection of discrete tasks and data. (If the work or data space is especially large or complicated, the slicing itself can be distributed across a set of master tasks.)
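The master's slicing step can be sketched as follows. `MasterWorker`, `SlaveWorker`, and `CHUNK_SIZE` are illustrative names, and the queueing call is stubbed out; substitute whatever enqueue method your worker system provides:

```ruby
# Sketch of runtime chunking: the master slices the work space into
# chunks and queues one slave worker job per chunk.
class MasterWorker
  CHUNK_SIZE = 100 # data items handed to each slave worker

  # Slice the full work space into granular chunks at runtime.
  def slice(items)
    items.each_slice(CHUNK_SIZE).to_a
  end

  # Queue one slave worker job per chunk.
  def run(items)
    slice(items).each do |chunk|
      # In a real system: SlaveWorker.new(:items => chunk).queue
    end
  end
end
```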

The reason for waiting until runtime to create the slave workers is that viewing jobs in the schedule is much easier at a coarse-grained level: the scheduled jobs correspond to the units you're tracking (such as webpages, user profiles, or blocks of data from users, sensors, or other streaming input devices). At this level, you can more easily monitor and inspect the collection of scheduled jobs because they correspond to your key metrics and inputs. Notifications have a better signal-to-noise ratio, and status indicators are much more meaningful.

Fine-Grained Data | Coarse-Grained Workers

The slicing of the work often gets done at a fine-grained level, meaning that atomic pieces of data are created or made available for your task-level routines to do their work (such as checking whether a Klout score needs to be updated, or adding explicit likes or clickstream data to a user's existing preference information).

Instead of creating a worker for each data element, however, it's better to have each worker handle a reasonable collection of tasks or data items. We've found that having each worker process multiple tasks (20-1000 data items, for example) provides a good balance between:
  • optimizing worker setup (establishing a database connection for example)
  • providing good introspection into the jobs
  • making retries and exception handling more manageable
The number per worker will depend on the type and length of the task. The idea is to have workers execute in minutes rather than seconds or hours, so that you have greater visibility into worker performance and retries affect only a limited portion of the work space.
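The retry point above can be sketched with a slave worker that rescues per item rather than per batch, so one bad record doesn't fail the whole chunk and a retry only touches the items that actually failed. Again, the names here are illustrative:

```ruby
# Sketch of a slave worker processing its chunk with per-item rescue,
# so a retry affects only the items that failed.
class SlaveWorker
  attr_accessor :items

  # Process each item, collecting failures for a targeted retry.
  def run
    failed = []
    (items || []).each do |item|
      begin
        process(item)
      rescue
        failed << item # requeue or log just this item
      end
    end
    failed
  end

  # Stand-in for real per-item work (e.g. updating a Klout score).
  def process(item)
    raise 'bad item' if item.nil?
  end
end
```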

Using S3 to hold the large data blocks, and a NoSQL solution (especially database-as-a-service offerings like MongoHQ or MongoLab) to hold the data slices, can make it easy to keep track of and manage the data slicing and task-level work.

Worker Independence

A key part of creating any worker process set is to do so in a way that makes it independent of your application environment. This means writing each worker so it can run in a standalone app environment, and using callbacks, database flags, and other asynchronous approaches to communicate between the application and the workers.

Doing it this way gives you much greater agility: workers can be modified, or new ones created, without worrying about application dependencies. This approach also allows the work to run asynchronously and be distributed over an elastic worker system, which is really where you need to go if you have even a modest amount of work.
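The database-flag approach mentioned above can be sketched in a few lines. A plain hash stands in here for a database row, and the class name is an assumption for illustration; the point is that the application polls (or gets a callback on) the flag instead of holding any direct reference to the worker:

```ruby
# Sketch of asynchronous app/worker communication via a status flag.
# The worker writes its status to a shared store; it has no
# dependencies on application code.
class IndependentWorker
  def initialize(store)
    @store = store # shared key/value store (e.g. a DB table)
  end

  def run(job_id)
    @store[job_id] = 'running'
    # ... do the actual work, with no app-code dependencies ...
    @store[job_id] = 'complete'
  end
end
```

The application then checks the flag for `job_id` whenever it needs the result, rather than waiting on the worker directly.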

Just as applications are being written to run on elastic (and disposable) infrastructure, workers also need to be written to run in elastic environments, ones that will increasingly be separate from your application environment.

Update: We're seeing the need for this pattern more and more, which should explain the anti-pattern (and corresponding blog post) that we came up with: