This is a two-part post focusing on using Nokogiri and IronWorker to crawl websites and process web pages. Part two can be seen here.
For other resources on web crawling, see our solutions page as well as this in-depth article on using IronWorker with PhantomJS.
Web crawling is at the core of many web businesses. Whether you call it web crawling, web or screen scraping, or mining the web, it involves going to sites on the web, grabbing pages, and parsing them to pull out links, images, text, prices, titles, email addresses, and numerous other page attributes.
But extracting items from pages is just
Wednesday, August 22, 2012
Tuesday, August 14, 2012
Elastic, Highly Available Backend for Delayed Job: IronMQ
Want an easy, scalable, highly available, zero maintenance Delayed Job backend? Look no further. Here's a Delayed Job (DJ) backend gem that uses IronMQ. You can set it up in minutes and never have to worry about it again:
Delayed::Job IronMQ
Getting Started
Get Your Iron.io Credentials
Sign up for an Iron.io account if you don't already have one and add your credentials to an iron.json file or environment variables. You can read more about that here.
Add gem to Gemfile:
gem 'delayed_job_ironmq'
That's It, Now Use It as You Normally Would
I did say it was easy, right? Here's a simple example of how to use it:
Start worker process:
rake jobs:work
In your model:
class User
  def background_stuff
    puts "I run in the background"
  end
end
Now to run the background_stuff method in the background:
user = User.new
user.delay.background_stuff
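To make the idea behind `delay` more concrete, here is a minimal plain-Ruby sketch of the general pattern: a proxy records the method call as a job, and a separate step runs it later. This is not Delayed Job's actual implementation. `JOB_QUEUE`, `DelayProxy`, `Delayable`, and `work_off` are hypothetical names used only for illustration, the queue is an in-memory array rather than IronMQ, and `background_stuff` returns its string here so the result is easy to inspect.

```ruby
# Hypothetical in-memory stand-in for a job queue. With the gem above,
# jobs would be stored in IronMQ instead of a Ruby array.
JOB_QUEUE = []

# Records any method call made on it as a job instead of running it.
class DelayProxy < BasicObject
  def initialize(target)
    @target = target
  end

  def method_missing(name, *args)
    ::JOB_QUEUE << [@target, name, args]
    nil
  end
end

# Adds a `delay` method that wraps the receiver in the recording proxy.
module Delayable
  def delay
    DelayProxy.new(self)
  end
end

class User
  include Delayable

  def background_stuff
    "I run in the background"
  end
end

# Stand-in for the worker process: runs queued jobs in order and
# returns their results.
def work_off(queue)
  results = []
  until queue.empty?
    target, name, args = queue.shift
    results << target.public_send(name, *args)
  end
  results
end

user = User.new
user.delay.background_stuff        # queued, not executed yet
puts "queued jobs: #{JOB_QUEUE.size}"
puts work_off(JOB_QUEUE).inspect   # the job runs only now
```

The point of the sketch is the separation between enqueueing a call and executing it, which is what lets a separate worker process pick the job up.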
More Information
For more details and configuration options, check out the gem README. And here's a demo Rails application that uses it so you can clone it and try it out.
Labels: delayed job, IronMQ, Workers
Monday, August 13, 2012
Iron.io @ Box Hackathon (#redefiningwork)
The Iron.io community team just got back from the #RedefiningWork | Box Hackathon this weekend. It was a fantastic event with more than 40 teams spread across a large room, and it featured a rich set of API sponsors including SendGrid, Twilio, Parse, TokBox, Firebase, Mashery, and Iron.io.
Most of the teams used the Box APIs to show off the versatility of having cloud-based document storage in conjunction with a rich set of hosted services. Pairing a rich platform like Box with cloud services in a hackathon makes sense, because nobody wants to stand up servers and manage infrastructure; everyone wants to just get to the coding. (It makes even more sense for apps in the real world, but we talk about that in other places.)
Image: Box Hackathon - Presentation of Top 15
Sunday, August 12, 2012
Iron.io @ eCommerce Hack Day
Last weekend the DevEx team at Iron.io was at the eCommerce Hack Day in New York City. It was hosted by Dwolla with help from Etsy and featured lots of great API sponsors and a terrific set of applications. Here's an article from @betabeat that gives you a glimpse of the energy at the event.
The top hacks that used Iron.io were Post My Trip and FriendlyFeast. Post My Trip is an app that generates a postcard of images by first pulling photos from nearby Foursquare locations, using IronWorker to put together a good-looking collage, and then sending the postcard using Sincerely. They took home the other Iron.io prize: a GoPro Hero2 camera along with some service credits.
Image: Post My Trip: Postcards of Foursquare Images
Monday, August 6, 2012
IronWorker's Most Requested Feature is Here: Max Concurrency
Image: Too many lemmings and you've got a problem.
- Limit load on your database. If you know your database can handle X tasks at a time, set max_concurrency to X and then queue up as many tasks as you want.
- Limit requests to third-party APIs, such as the Facebook API and Twitter API, to stay under their rate limits. The Twitter API, for instance, has a limit of 150 requests per hour.
- Limit requests on your own APIs. This is essentially the same as the first item.
- Limit requests on third-party sites if you are web crawling or scraping so you don't get blacklisted.
- Eliminate issues related to spiky behavior. For instance, let's say you got TechCrunched and your system wasn't ready for it. Your app or site would go down, your customers would be upset, and you wouldn't be able to take orders or do whatever it is your site does. If you had offloaded your work to IronWorker to begin with and set a max_concurrency on the tasks, then when you got a spike from being TechCrunched or perhaps the Colbert bump, you would be perfectly fine. Smooth sailing.
As you can see, it opens the door to a lot of new use cases.
How to Use max_concurrency
If you're using the IronWorker CLI, it's very easy: simply add the --max-concurrency flag when uploading your worker code. So if your worker is called max_worker and you have a .worker file called max_worker.worker, you would upload it like this:
> iron_worker upload max_worker --max-concurrency 100
Now just queue up as many tasks as you want for max_worker and you can rest easy knowing that a maximum of 100 of them will be running at any point in time.
Image: This worker has a max_concurrency of 250; you can see that only 250 of 1000 tasks are running.
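To illustrate what a concurrency cap does, here is a small local sketch in plain Ruby, assuming a fixed pool of threads pulling tasks from a queue. It only demonstrates the concept and says nothing about how IronWorker enforces max_concurrency internally. `run_with_cap` is a hypothetical helper, and the task count and sleep time are arbitrary.

```ruby
# Runs every task in `tasks` (an array of callables) while never letting
# more than `max_concurrency` of them execute at the same time.
# Returns the highest number of tasks observed running at once.
def run_with_cap(tasks, max_concurrency)
  queue = Queue.new
  tasks.each { |task| queue << task }

  lock = Mutex.new
  running = 0
  peak = 0

  # The cap comes from the pool size: only this many threads exist.
  workers = Array.new(max_concurrency) do
    Thread.new do
      loop do
        task = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end

        lock.synchronize do
          running += 1
          peak = running if running > peak
        end

        begin
          task.call
        ensure
          lock.synchronize { running -= 1 }
        end
      end
    end
  end

  workers.each(&:join)
  peak
end

# Queue up 20 tasks but allow at most 5 to run at any moment.
completed = Queue.new
tasks = Array.new(20) { |i| -> { sleep 0.01; completed << i } }
peak = run_with_cap(tasks, 5)
puts "tasks completed: #{completed.size}, peak concurrency: #{peak}"
```

All queued tasks still finish; the cap only bounds how many are in flight together, which is the property that protects a database or a rate-limited API.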
Labels: concurrency, IronWorker
Thursday, August 2, 2012
IronMQ + EngineYard
We're pleased to announce IronMQ is now a preferred add-on on the Engine Yard platform.
IronMQ joins IronWorker on Engine Yard to let developers do even bigger things with their applications by providing industrial-strength ways to scale out processing and pass messages between independent processes.
Engine Yard is one of the leading cloud platforms and largely defined the Platform as a Service category. Engine Yard provides a complete and commercial grade solution that gives developers the freedom to focus on building great applications rather than managing an intricate platform.
IronMQ and IronWorker are two hosted and elastic services from Iron.io that provide simple ways to supercharge applications by adding massive distributed processing and asynchronous message- and event-handling.
![]() |
| IronMQ HUD - Queues |