Blog

Wednesday, August 22, 2012

Web Crawling at Scale with Nokogiri and IronWorker (Part 1)

This is a two-part post focusing on using Nokogiri and IronWorker to crawl websites and process web pages. Part two can be seen here. For other resources on web crawling, see our solutions page as well as this in-depth article on using IronWorker with PhantomJS.

Web crawling is at the core of many web businesses. Whether you call it web crawling, web or screen scraping, or mining the web, it involves going to sites on the web, grabbing pages, and parsing them to pull out links, images, text, prices, titles, email addresses, and numerous other page attributes.

But extracting items from pages is just

Tuesday, August 14, 2012

Elastic, Highly Available Backend for Delayed Job: IronMQ

Want an easy, scalable, highly available, zero-maintenance Delayed Job backend? Look no further. Here's a Delayed Job (DJ) backend gem that uses IronMQ. You can set it up in minutes and never have to worry about it again:

Delayed::Job IronMQ

Getting Started

Get Your Iron.io Credentials

Sign up for an Iron.io account if you don't already have one and add your credentials to an iron.json file or environment variables. You can read more about that here.

Add gem to Gemfile:


    gem 'delayed_job_ironmq'

That's It, Now Use It as You Normally Would

I did say it was easy, right? Here's a simple example of how to use it:

Start worker process:

    rake jobs:work

In your model:

    class User
      def background_stuff
        puts "I run in the background"
      end
    end

Now to run the background_stuff method in the background:

    user = User.new
    user.delay.background_stuff


More Information

For more details and configuration options, check out the gem README. And here's a demo Rails application that uses it so you can clone it and try it out.

Monday, August 13, 2012

Iron.io @ Box Hackathon (#redefiningwork)


The Iron.io community team just got back from the #RedefiningWork | Box Hackathon this weekend. It was a fantastic event with more than 40 teams spread across a large room, and it featured a rich set of API sponsors including SendGrid, Twilio, Parse, TokBox, Firebase, Mashery, and Iron.io.

Most of the teams used the Box APIs to show off the versatility of having cloud-based document storage in conjunction with a rich set of hosted services. Pairing a rich platform like Box with cloud services in a hackathon makes sense, because nobody wants to stand up servers and manage infrastructure; everyone wants to just get to the coding. (It makes even more sense for apps in the real world, but we talk about that in other places.)
Box Hackathon - Presentation of Top 15

Sunday, August 12, 2012

Iron.io @ eCommerce Hack Day

Last weekend the DevEx team at Iron.io was at the eCommerce Hack Day in New York City. It was hosted by Dwolla with help from Etsy and featured lots of great API sponsors and a terrific set of applications. Here's an article from @betabeat that gives you a glimpse of the energy at the event.

The top hacks that used Iron.io were Post My Trip and FriendlyFeast. Post My Trip is an app that generates a postcard of images by first pulling photos from nearby foursquare locations, using IronWorker to put together a good-looking collage, and then sending the postcard using Sincerely. They took home the other Iron.io prize: a GoPro Hero2 camera along with some service credits.

Post My Trip: Postcards of Foursquare Images

Monday, August 6, 2012

IronWorker's Most Requested Feature is Here: Max Concurrency

Too many lemmings and you've got a problem.
Being able to limit how many worker tasks can run at once has been one of the most requested features of IronWorker, and it is available now! There are many reasons this feature is so important. Here are a few of the more common use cases for it:
  1. Limit load on your database. If you know your database can handle X tasks at a time, set max_concurrency to X and then queue up as many tasks as you want.
  2. Limit requests on third-party APIs, such as the Facebook API and Twitter API, to stay under rate limits. The Twitter API, for instance, has a limit of 150 requests per hour.
  3. Limit requests on your own APIs. This is essentially the same as #1.
  4. Limit requests on third-party sites if you are web crawling or scraping so you don't get blacklisted.
  5. Eliminate issues related to spiky behavior. For instance, say you got TechCrunched and your system wasn't ready for it: your app or site would go down, your customers would be upset, and you wouldn't be able to take orders or do whatever it is your site does. If you had offloaded your work to IronWorker to begin with and set a max_concurrency on the tasks, then when you got a spike from being TechCrunched or perhaps the Colbert bump, you would be perfectly fine. Smooth sailing.
As you can see, it opens the door to a lot of new use cases.
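
To make the idea concrete, here is a minimal local sketch of what a concurrency cap does. It is not the IronWorker API and says nothing about how IronWorker is implemented; the method name run_with_cap and the sleeping tasks are hypothetical, chosen only for illustration. A fixed number of threads pull tasks off a queue, so no more than that number can ever run at once, no matter how many tasks are queued.

```ruby
require 'thread'

# Local illustration only: a fixed-size pool of threads stands in for a
# max_concurrency cap. This is NOT the IronWorker API or implementation.
#
# Runs every callable in `tasks` with at most `max_concurrency` running at
# the same time. Returns the highest number of tasks observed running at once.
def run_with_cap(tasks, max_concurrency)
  queue = Queue.new
  tasks.each { |task| queue << task }

  lock = Mutex.new
  running = 0
  peak = 0

  workers = Array.new(max_concurrency) do
    Thread.new do
      loop do
        task = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break # nothing left to do, so this worker thread exits
        end

        lock.synchronize do
          running += 1
          peak = running if running > peak
        end
        begin
          task.call
        ensure
          lock.synchronize { running -= 1 }
        end
      end
    end
  end

  workers.each(&:join)
  peak
end

# Queue up 20 short tasks but allow only 5 to run at any moment.
tasks = Array.new(20) { -> { sleep 0.01 } }
peak = run_with_cap(tasks, 5)
puts "peak concurrency: #{peak}"
```

However many tasks are queued, the reported peak never exceeds the cap, which is the same guarantee max_concurrency gives you for load on a database or a rate-limited API.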

How to Use max_concurrency


If you're using the IronWorker CLI, it's very easy: simply add the --max-concurrency flag when uploading your worker code. So if your worker is called max_worker and you have a .worker file called max_worker.worker, you would upload it like this:

    > iron_worker upload max_worker --max-concurrency 100

Now just queue up as many tasks as you want for max_worker and you can rest easy knowing that a maximum of 100 of them will be running at any point in time. 

This worker has a max_concurrency of 250; you can see that only 250 of 1000 tasks are running.

Thursday, August 2, 2012

IronMQ + EngineYard


We're pleased to announce IronMQ is now a preferred add-on on the Engine Yard platform.

IronMQ joins IronWorker on Engine Yard to let developers do even bigger things with their applications by providing industrial-strength ways to scale out processing and pass messages between independent processes.

Engine Yard is one of the leading cloud platforms and largely defined the Platform as a Service category. Engine Yard provides a complete and commercial grade solution that gives developers the freedom to focus on building great applications rather than managing an intricate platform.

IronMQ and IronWorker are two hosted and elastic services from Iron.io that provide simple ways to supercharge applications by adding massive distributed processing and asynchronous message- and event-handling.

IronMQ HUD - Queues