Ruby Web Crawler

Posted by Scott on April 19, 2007

Here’s a web crawler I wrote a while back. It’s pretty simple, but it does the job. If you need something more, you might try rdig or Nutch.

You can run this as a stand-alone script and just pass in the URL to crawl as an argument.

[ruby]
require 'net/http'
require 'uri'

class SiteCrawler
  def initialize(url)
    @site_uri = URI.parse(url)
    @site_uri.path = "/" if @site_uri.path == ""
    @visited = Hash.new
    @queue = Array.new
    addPath(@site_uri.path)
    puts "Initialized site crawl for #{@site_uri}"
  end

  # Queue a path and mark it as seen but not yet fetched.
  def addPath(path)
    @queue.push path
    @visited[path] = false
  end

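  # Fetch one path on the site; returns "" if the request fails.
  def getPage(path)
    begin
      # Resolve the path against the site URI rather than concatenating
      # strings, which turned "/" + "/about" into "//about".
      uri = @site_uri.merge(path)
      puts "getting #{uri}"
      response = Net::HTTP.get_response(uri)
    rescue StandardError => e
      puts "Error: #{e}"
      return ""
    end
    return response.body.to_s
  end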

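  def queueLocalLinks(html)
    # Assumed link pattern: captures the href value of each <a> tag.
    html.scan(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/i) { |(w)|
      begin
        uri = URI.parse(w)
      rescue URI::InvalidURIError
        next # skip hrefs that are not valid URIs
      end
      if !@visited.has_key?(uri.path) and
         (uri.relative? or uri.host == @site_uri.host)
        addPath(uri.path)
      end
    }
  end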

  # Work through the queue, yielding each path and its HTML to the caller.
  def crawlSite()
    while (!@queue.empty?)
      uri = @queue.shift
      page = getPage(uri)
      yield uri, page
      queueLocalLinks(page)
      @visited[uri] = true
    end
  end
end

sc = SiteCrawler.new(ARGV[0])
@pages = Array.new
sc.crawlSite { |url, page_text|
  @pages << url
  # SITE_INDEX << { :url => url, :context => page_text }
}
[/ruby]
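
The same-host check that queueLocalLinks relies on can be tried on its own, without touching the network. This is only a sketch: the local_path helper and the example.com URLs are made up for illustration and are not part of the class above.

```ruby
require 'uri'

# Hypothetical helper (not in the original class): returns the path of
# +href+ if it is relative or on the same host as +site_uri+, else nil.
def local_path(site_uri, href)
  uri = URI.parse(href)
  return nil unless uri.relative? || uri.host == site_uri.host
  uri.path
rescue URI::InvalidURIError
  nil # hrefs that are not valid URIs are simply ignored
end

site = URI.parse("http://example.com/")
hrefs = ["/about", "http://example.com/contact", "http://other.example.org/x"]

# Keep only the links that stay on the site being crawled.
paths = hrefs.map { |h| local_path(site, h) }.compact
puts paths.inspect
```

Only the first two links survive the filter, since the third points at a different host.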

6 Responses

  1. webug
    May 27, 2007

    Dear sir:
    I am a Ruby beginner. I wonder, in your program above, where do you store the downloaded HTML pages?
    Thanks


  2. scott
    May 27, 2007

    You could store it wherever you like. In this example, the contents of the page are in the page_text variable. You can see an updated version at RubyForge where I’m storing the results in a hashmap.
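
As a rough sketch of the hash idea (this is not the RubyForge version, just one way it might look), the block passed to crawlSite can file each page under its path. FakeCrawler and its canned pages are hypothetical stand-ins so the example runs without a network connection; with the real class you would call SiteCrawler.new(url).crawlSite the same way.

```ruby
# Stand-in for SiteCrawler so the sketch runs offline.
# FakeCrawler and its canned pages are hypothetical.
class FakeCrawler
  PAGES = {
    "/"      => "<a href=\"/about\">About</a>",
    "/about" => "About us"
  }

  # Yields each path and its HTML, like SiteCrawler#crawlSite does.
  def crawlSite
    PAGES.each { |path, html| yield path, html }
  end
end

pages = {}
FakeCrawler.new.crawlSite { |url, page_text|
  pages[url] = page_text # keep each page body, keyed by its path
}

puts pages.keys.inspect
```

From there the hash could be written to disk or fed to an indexer, depending on what you need.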


  3. Shig
    September 13, 2007

    Scott - I’m currently in the process of learning Ruby and RoR. Is this something that would integrate easily into an RoR app? I’m wondering how this would fit into the MVC architecture.


  4. scott
    September 13, 2007

    While it would be easy to add this to a controller, it’s probably not the best place for this. One possibility would be to create an Observer to handle it; however, I would be more inclined to create a task (in lib/tasks) to handle the crawling and set up cron to run it periodically. That would ensure a constant level of resource usage and make it easier to process the requests serially.
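
A task along those lines might look roughly like the sketch below. The file name, the crawler:crawl task name, the SITE_URL variable, and the crontab paths are all assumptions made for illustration, and the task expects the SiteCrawler class from the post to be loaded already.

```ruby
# lib/tasks/crawl.rake (file, task, and variable names are hypothetical)
require 'rake'

namespace :crawler do
  desc "Crawl the site named in SITE_URL, one page at a time"
  task :crawl do
    url = ENV.fetch("SITE_URL")
    # Assumes SiteCrawler has been loaded, e.g. with require_relative.
    # Inside a Rails app the task would usually depend on :environment too.
    SiteCrawler.new(url).crawlSite { |path, page_text|
      puts "#{path}: #{page_text.bytesize} bytes"
    }
  end
end

# Example crontab entry (assumed paths) that runs the task hourly:
#   0 * * * * cd /path/to/app && SITE_URL=http://example.com/ rake crawler:crawl
```

Because cron starts a fresh process for each run, the crawl stays out of the web request cycle and pages are fetched one after another.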


  5. Shig
    September 13, 2007

    Sounds good. Thanks for the response.


  6. Carsten
    January 23, 2008

    Great code,

    Will have to try it out.

    Thanks man!



Who Are These Guys?

Netphase, LLC. is a Charlotte-based web application development company specializing in really rapid application development on the Ruby on Rails framework.

Netphase was founded by Scott Nedderman and Chris Beck in 2007. Both Scott and Chris have been active Ruby on Rails developers for more than 3 years, and each has spent more than a decade designing, developing and deploying web applications.

We have worked on projects of all sizes, and delivered successful sites to production for some of the biggest names in the business. For all of the gory details, check out our about page.
