Ruby Web Crawler
Here’s a web crawler I wrote a while back. It’s pretty simple, but it does the job. If you need something more, you might try RDig or Nutch.
You can run this as a stand-alone script and just pass in the URL to crawl as an argument.
[ruby]
require 'net/http'
require 'uri'

class SiteCrawler
  def initialize(url)
    @site_uri = URI.parse(url)
    @site_uri.path = "/" if @site_uri.path == ""
    @visited = Hash.new
    @queue = Array.new
    addPath(@site_uri.path)
    puts "Initialized site crawl for #{@site_uri}"
  end

  def addPath(path)
    @queue.push path
    @visited[path] = false
  end

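  def getPage(path)
    begin
      # Resolve the queued path against the site URL; this handles both
      # absolute ("/about") and relative ("about.html") paths.
      uri = URI.join(@site_uri.to_s, path)
      puts "getting #{uri}"
      response = Net::HTTP.get_response(uri)
    rescue StandardError
      puts "Error: #{$!}"
      return ""
    end
    return response.body.to_s
  end

  def queueLocalLinks(html)
    # NOTE: the original pattern was lost when this post was published
    # (the listing breaks off after "html.scan(/"). This href matcher is
    # a reconstruction, not the author's regex.
    html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i) { |w|
      begin
        uri = URI.join(@site_uri.to_s, w[0])
      rescue URI::Error
        next # skip hrefs that are not valid URIs
      end
      if uri.host == @site_uri.host and !uri.path.empty? and
         !@visited.has_key?(uri.path)
        addPath(uri.path)
      end
    }
  end
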
  def crawlSite()
    while (!@queue.empty?)
      uri = @queue.shift
      page = getPage(uri)
      yield uri, page
      queueLocalLinks(page)
      @visited[uri] = true
    end
  end
end

# Exit with a usage message instead of a stack trace when no URL is given.
abort "usage: ruby #{$0} URL" if ARGV.empty?
sc = SiteCrawler.new(ARGV[0])
@pages = Array.new
sc.crawlSite { |url, page_text|
  @pages << url
  # SITE_INDEX << { :url => url, :context => page_text }
}
[/ruby]
9 Comments to “Ruby Web Crawler”

Dear sir:
I am a Ruby beginner. In your program above, where do you store the downloaded HTML pages?
thanks
You could store it wherever you like. In this example, the contents of the page are in the page_text variable. You can see an updated version at RubyForge where I’m storing the results in a hashmap.
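As a rough sketch of the hash-based storage mentioned above (this is not the RubyForge version, which isn't shown here), the block passed to crawlSite could write each page into a Hash keyed by path. The `PageStore` class, its method names, and the file-naming scheme in `save_to` are all made up for illustration.

```ruby
require 'fileutils'

# Hypothetical helper: keeps crawled pages in memory, keyed by path.
class PageStore
  def initialize
    @pages = {}
  end

  # Record the HTML for a path; adding the same path again overwrites it.
  def add(path, html)
    @pages[path] = html
  end

  def [](path)
    @pages[path]
  end

  def paths
    @pages.keys
  end

  # Write each page to its own file under dir. Replacing every character
  # that is not a word character, dot, or hyphen with "_" is an arbitrary
  # choice, and two different paths can end up with the same file name.
  def save_to(dir)
    FileUtils.mkdir_p(dir)
    @pages.each do |path, html|
      name = path.gsub(/[^\w.-]/, '_') + '.html'
      File.write(File.join(dir, name), html)
    end
  end
end

store = PageStore.new
# With the crawler from the post this would be:
#   sc.crawlSite { |path, page_text| store.add(path, page_text) }
store.add('/', '<html>home</html>')
store.add('/about', '<html>about</html>')
puts store.paths.inspect
```

Keeping everything in memory is fine for a small site; for anything larger you would probably want to write pages out as they arrive rather than at the end.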
Scott – I’m currently in the process of learning Ruby and RoR. Is this something that would integrate easily into an RoR app? I’m wondering how this would fit into the MVC architecture.
While it would be easy to add this to a controller, it’s probably not the best place for this. One possibility would be to create an Observer to handle it; however, I would be more inclined to create a task (in lib/tasks) to handle the crawling and set up cron to run it periodically. That would ensure a constant level of resource usage and make it easier to process the requests serially.
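A minimal sketch of what such a task might look like, assuming the SiteCrawler class from the post is loadable from the app. The file name, the `crawl:site` task name, and the crontab paths are placeholders, not anything from an actual project.

```ruby
# lib/tasks/crawl.rake (hypothetical file name)
require 'rake' # not needed inside a real .rake file; keeps this sketch standalone

namespace :crawl do
  desc 'Crawl one site and report the size of each page'
  # In a Rails app this would usually be `task :site, [:url] => :environment`
  # so that models are loaded; that dependency is left out of this sketch.
  task :site, [:url] do |_task, args|
    url = args[:url] or abort 'usage: rake "crawl:site[http://example.com/]"'
    crawler = SiteCrawler.new(url) # the class from the post
    crawler.crawlSite do |path, page_text|
      # Stand-in for real indexing work.
      puts "#{path}: #{page_text.bytesize} bytes"
    end
  end
end

# Example crontab entry (paths are placeholders), run nightly at 03:00:
#   0 3 * * * cd /path/to/app && bundle exec rake "crawl:site[http://example.com/]"
```

Because cron starts one process per run and the crawler works through its queue one page at a time, requests stay serial without any extra locking inside the app.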
Sounds good. Thanks for the response.
Great code,
Will have to try it out.
Thanks man!
Hi, I am new to Ruby.
I ran this code in SciTE, and when I execute it I get this error:
updateversionofsitecrawl.rb:6:in `initialize': uninitialized constant WebCrawler::URL (NameError)
from updateversionofsitecrawl.rb:53:in `new'
from updateversionofsitecrawl.rb:53
Can you please help me out with this?
Hello, thanks for this script. I have been wanting to experiment with making my own web indexes for a few websites.
Now that we have the crawler, do you happen to know how we would come up with a list of websites for it to crawl if we wanted to index the whole World Wide Web? Should we run through all IPs from 0.0.0.0 to 255.255.255.255?
Or is the idea that you give it one seed website, a rather big one, and let the spider follow link after link until it hopefully includes all web pages? And if it fails somewhere, try to find another seed website?
Thanks!
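The seed idea in the question above can be sketched as a breadth-first walk over a frontier queue. This is only an illustration: `crawl_from_seeds`, the `fetch_links` callable, and the tiny link graph are invented so the example runs without network access, and a real crawler would fetch and parse pages where the lambda is.

```ruby
require 'set'

# Breadth-first crawl starting from a list of seed URLs.
# fetch_links is any callable that takes a URL string and returns the URLs
# linked from that page; here it stands in for real HTTP plus parsing.
def crawl_from_seeds(seeds, max_pages, fetch_links)
  frontier = seeds.dup       # URLs waiting to be visited, in order
  seen = Set.new(seeds)      # everything ever queued, to avoid repeats
  visited = []

  until frontier.empty? || visited.size >= max_pages
    url = frontier.shift
    visited << url
    fetch_links.call(url).each do |link|
      next if seen.include?(link)
      seen << link
      frontier << link
    end
  end
  visited
end

# Made-up link graph; e.example is not linked from anywhere.
web = {
  'http://a.example/' => ['http://b.example/', 'http://c.example/'],
  'http://b.example/' => ['http://a.example/', 'http://d.example/'],
  'http://c.example/' => [],
  'http://d.example/' => ['http://c.example/'],
  'http://e.example/' => []
}

order = crawl_from_seeds(['http://a.example/'], 10, ->(u) { web.fetch(u, []) })
puts order
```

Note that e.example is never reached from the single seed, which is why a crawl that aims for broad coverage starts from many seeds rather than one.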
thanks for that!
Helped a lot.