Ruby Web Crawler

Here’s a web crawler I wrote a while back. It’s pretty simple, but it does the job. If you need something more, you might try rdig or Nutch.

You can run this as a stand-alone script and just pass in the URL to crawl as an argument.

[ruby]
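require 'net/http'
require 'uri'

class SiteCrawler
  def initialize(url)
    @site_uri = URI.parse(url)
    @site_uri.path = "/" if @site_uri.path == ""
    @visited = Hash.new
    @queue = Array.new
    addPath(@site_uri.path)
    puts "Initialized site crawl for #{@site_uri}"
  end

  # Queue a path and record it as seen but not yet crawled.
  def addPath(path)
    @queue.push path
    @visited[path] = false
  end

  # Fetch one page and return its body, or "" if anything goes wrong.
  def getPage(path)
    begin
      # Resolve the path against the site URI rather than concatenating
      # strings, which produced paths such as "//about".
      uri = @site_uri.merge(path)
      puts "getting #{uri}"
      response = Net::HTTP.get_response(uri)
    rescue StandardError => e
      puts "Error: #{e.message}"
      return ""
    end
    return response.body
  end

  # Queue every link in the page that stays on the same site.
  def queueLocalLinks(html)
    # NOTE: the link pattern was lost from the post; this href regex is a
    # reconstruction and only an approximation of the original.
    html.scan(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/i) { |(w)|
      begin
        uri = URI.parse(w)
      rescue URI::InvalidURIError
        next # skip hrefs that are not valid URIs
      end
      next if uri.path.nil? || uri.path.empty?
      if !@visited.has_key?(uri.path) and
          (uri.relative? or uri.host == @site_uri.host)
        addPath(uri.path)
      end
    }
  end

  # Crawl until the queue is empty, yielding each path and its page text.
  def crawlSite()
    while (!@queue.empty?)
      uri = @queue.shift
      page = getPage(uri)
      yield uri, page
      queueLocalLinks(page)
      @visited[uri] = true
    end
  end
end

sc = SiteCrawler.new(ARGV[0])
@pages = Array.new
sc.crawlSite { |url, page_text|
  @pages << url
  # SITE_INDEX << { :url => url, :context => page_text }
}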
[/ruby]
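
To see the queue-and-visited bookkeeping in isolation, here is a rough sketch that crawls an in-memory hash instead of the network. Everything in it is made up for illustration: the `local_paths` and `crawl` helpers, the `fake_site` data, the `example.com` host, and the href pattern, which is an assumption rather than the pattern from the script above.

```ruby
require 'uri'

# Return the paths of links in +html+ that stay on +host+.
# Relative links count as local; invalid or path-less hrefs are skipped.
def local_paths(html, host)
  # Assumed pattern: captures the href value of each anchor tag.
  href_pattern = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/i
  html.scan(href_pattern).flatten.map { |href|
    begin
      uri = URI.parse(href)
    rescue URI::InvalidURIError
      next nil
    end
    (uri.relative? || uri.host == host) ? uri.path : nil
  }.compact.reject { |path| path.empty? }
end

# Breadth-first crawl over +pages+ (a hash of path => HTML), using a queue of
# pending paths and a hash of paths already seen. Returns path => HTML.
def crawl(pages, host, start = "/")
  queue = [start]
  seen = { start => true }
  results = {}
  until queue.empty?
    path = queue.shift
    html = pages.fetch(path, "")
    results[path] = html
    local_paths(html, host).each do |link|
      next if seen[link]
      seen[link] = true
      queue.push(link)
    end
  end
  results
end

# Hypothetical three-page site with one external link.
fake_site = {
  "/"        => '<a href="/about">About</a> <a href="http://other.example/x">Ext</a>',
  "/about"   => '<a href="/">Home</a> <a href="/contact">Contact</a>',
  "/contact" => '<a href="/about">About</a>'
}

pages = crawl(fake_site, "example.com")
puts pages.keys.inspect
# prints ["/", "/about", "/contact"]
```

Marking a path as seen at the moment it is queued, rather than after it is fetched, is what keeps a page that is linked from several places from being queued more than once.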

9 Comments to “Ruby Web Crawler”

  1. webug 27 May 2007 at 3:02 am #

    dear sir:
    I am a Ruby beginner. In your program above, where do you store the downloaded HTML pages?
    thanks

  2. scott 27 May 2007 at 8:15 pm #

    You could store it wherever you like. In this example, the contents of the page are in the page_text variable. You can see an updated version at RubyForge where I’m storing the results in a hashmap.

  3. Shig 13 September 2007 at 3:46 am #

    Scott – I’m currently in the process of learning Ruby and RoR. Is this something that would integrate easily into an RoR app? I’m wondering how this would fit into the MVC architecture.

  4. scott 13 September 2007 at 3:09 pm #

    While it would be easy to add this to a controller, it’s probably not the best place for this. One possibility would be to create an Observer to handle it; however, I would be more inclined to create a task (in lib/tasks) to handle the crawling and set up cron to run it periodically. That would ensure a constant level of resource usage and make it easier to process the requests serially.

  5. Shig 13 September 2007 at 3:16 pm #

    Sounds good. Thanks for the response.

  6. Carsten 23 January 2008 at 11:04 am #

    Great code,

    Will have to try it out.

    Thanks man!

  7. Manjunath 5 May 2009 at 5:14 am #

    Hi, I am new to Ruby and would like some help with this.

    I ran this code in SciTE, and when I execute it I get this error:

    updateversionofsitecrawl.rb:6:in `initialize': uninitialized constant WebCrawler::URL (NameError)
    from updateversionofsitecrawl.rb:53:in `new'
    from updateversionofsitecrawl.rb:53

    Can you please help me out with this?

  8. Nik 30 June 2009 at 3:02 pm #

    Hello, thanks for this script. I have been wanting to experiment with making my own web indexes for a few websites.

    Do you happen to know, now that we have the crawler, how we would come up with a list of websites for it to crawl if we wanted to index the whole World Wide Web? Should we run through all IPs from 0.0.0.0 to 255.255.255.255?

    Or is the idea that you give it one seed website, a rather big one, and let the spider follow link after link until it hopefully includes all web pages? And if that fails somewhere, try to find another seed website?

    Thanks!

  9. Andre Durao 21 January 2010 at 8:34 am #

    thanks for that!
    Helped a lot.

