Ruby Web Crawler
Here’s a web crawler I wrote a while back. It’s pretty simple, but it does the job. If you need something more, you might try rdig or Nutch.
You can run this as a stand-alone script and just pass in the URL to crawl as an argument.
[ruby]
require 'net/http'
require 'uri'

class SiteCrawler
  def initialize(url)
    @site_uri = URI.parse(url)
    @site_uri.path = "/" if @site_uri.path == ""
    @visited = Hash.new
    @queue = Array.new
    addPath(@site_uri.path)
    puts "Initialized site crawl for #{@site_uri}"
  end

  # Queue a path and record it as seen but not yet crawled.
  def addPath(path)
    @queue.push path
    @visited[path] = false
  end

  # Fetch one path from the site. Returns "" if the request fails.
  def getPage(path)
    begin
      # merge resolves both "/absolute" and "relative" paths against the site URL
      uri = @site_uri.merge(path)
      puts "getting #{uri}"
      response = Net::HTTP.get_response(uri)
    rescue StandardError
      puts "Error: #{$!}"
      return ""
    end
    return response.body.to_s
  end

  # Queue every link on the page that stays on this site and is not yet known.
  def queueLocalLinks(html)
    # Capture the quoted href value of each anchor tag.
    html.scan(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/i) { |(w)|
      uri = URI.parse(w) rescue nil  # skip hrefs that are not valid URIs
      next if uri.nil? || uri.path.to_s.empty?
      if !@visited.has_key?(uri.path) and
         (uri.relative? or uri.host == @site_uri.host)
        addPath(uri.path)
      end
    }
  end

  # Work through the queue, yielding each path and its HTML to the caller.
  def crawlSite()
    while (!@queue.empty?)
      uri = @queue.shift
      page = getPage(uri)
      yield uri, page
      queueLocalLinks(page)
      @visited[uri] = true
    end
  end
end

sc = SiteCrawler.new(ARGV[0])
@pages = Array.new
sc.crawlSite { |url, page_text|
  @pages << url
  # SITE_INDEX << { :url => url, :context => page_text }
}
[/ruby]
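The link-queuing step can also be tried on its own, without touching the network. The following is a minimal sketch, not part of the crawler: `local_links` is a hypothetical helper name, the regular expression only handles quoted `href` attributes, and the example URLs are made up. Unlike the crawler, it resolves relative links against the page they appear on rather than against the site root.

```ruby
require 'uri'

# Hypothetical helper (not part of SiteCrawler): returns the absolute URLs of
# all same-host links found in `html`, resolved against `page_url`.
def local_links(html, page_url)
  base = URI.parse(page_url)
  hrefs = html.scan(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/i).flatten
  uris = hrefs.map { |href|
    begin
      base.merge(href) # resolve "/abs" and "relative" hrefs against the page
    rescue URI::Error
      nil # skip hrefs that cannot be parsed
    end
  }
  uris.compact.select { |uri| uri.host == base.host }.map { |uri| uri.to_s }.uniq
end

html = '<a href="/about">About</a> <a href="team.html">Team</a> ' \
       '<a href="http://other.example.org/">Elsewhere</a>'
links = local_links(html, "http://example.com/docs/index.html")
puts links
```

A regular expression is enough for simple pages, but it will miss unquoted attributes and can be confused by markup inside comments or scripts; an HTML parser would be the sturdier choice for anything larger.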





May 27, 2007
Dear sir:
I am a Ruby beginner. In your program above, where do you store the downloaded HTML pages?
Thanks
May 27, 2007
You can store them wherever you like. In this example, the contents of each page are available in the page_text variable. You can see an updated version at RubyForge where I’m storing the results in a hash.
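As a rough illustration of keeping the pages in a hash, here is a minimal sketch. It is not the RubyForge version: `collect_pages` and `CannedCrawler` are hypothetical names, and the canned crawler only stands in for `SiteCrawler` so the example runs without network access.

```ruby
# Hypothetical helper: collects every crawled page into a Hash keyed by path.
# `crawler` is anything that responds to crawlSite and yields (path, html),
# such as the SiteCrawler class from the post.
def collect_pages(crawler)
  pages = {}
  crawler.crawlSite do |path, page_text|
    pages[path] = page_text
  end
  pages
end

# Stand-in crawler (an assumption for illustration only). For a real crawl,
# pass SiteCrawler.new(ARGV[0]) to collect_pages instead.
class CannedCrawler
  def initialize(pages)
    @pages = pages
  end

  def crawlSite
    @pages.each { |path, html| yield path, html }
  end
end

canned = CannedCrawler.new({ "/" => "<h1>Home</h1>", "/about" => "<h1>About</h1>" })
pages = collect_pages(canned)
pages.each { |path, html| puts "#{path}: #{html.length} characters" }
```

Holding every page in memory is fine for a small site; for a larger crawl it would make more sense to write each page to disk or a database inside the block.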
September 13, 2007
Scott - I’m currently in the process of learning Ruby and RoR. Is this something that would integrate easily into an RoR app? I’m wondering how this would fit into the MVC architecture.
September 13, 2007
While it would be easy to add this to a controller, that’s probably not the best place for it. One possibility would be to create an Observer to handle it; however, I would be more inclined to create a Rake task (in lib/tasks) to handle the crawling and set up cron to run it periodically. That would ensure a constant level of resource usage and make it easier to process the requests serially.
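A Rake task along those lines might look roughly like the sketch below. Several things here are assumptions: the file name (say, lib/tasks/crawl.rake), the `crawl:site` task name, the crawler class being saved as lib/site_crawler.rb, and the path in the crontab comment.

```ruby
require 'rake' # not needed inside a Rails app; lets this file load on its own

namespace :crawl do
  desc "Crawl a site and print each path visited, e.g. rake crawl:site[http://example.com/]"
  # :environment is the standard Rails task that loads the application.
  task :site, [:url] => :environment do |_task, args|
    require 'site_crawler' # hypothetical: SiteCrawler saved as lib/site_crawler.rb
    abort "usage: rake crawl:site[URL]" if args[:url].nil?

    SiteCrawler.new(args[:url]).crawlSite do |path, page_text|
      # Replace this with whatever storage the app needs (a model, a file, ...).
      puts "#{path}: #{page_text.length} characters"
    end
  end
end

# Example crontab entry (hypothetical path), crawling every night at 03:00:
#   0 3 * * * cd /path/to/app && RAILS_ENV=production bundle exec rake "crawl:site[http://example.com/]"
```

Because cron starts one process per run and the crawler works through its queue one page at a time, the requests stay serial and the load on the server is predictable.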
September 13, 2007
Sounds good. Thanks for the response.
January 23, 2008
Great code,
Will have to try it out.
Thanks man!