Web Crawler

Example Code

from crawler.crawler import Crawler

mycrawler = Crawler()

seeds = ['http://www.example.com/']  # list of seed URLs
mycrawler.add_seeds(seeds)

# Crawling rules: a dictionary in which each key is a regular expression
# for a page URL, and each value is a list of regular expressions for the
# URLs to follow from pages that match the key.
rules = {r'^(http://.+example\.com)(.+)$': [r'^(http://.+example\.com)(.+)$']}
mycrawler.add_rules(rules)

mycrawler.start()  # start crawling
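
The rules dictionary can be easier to reason about with a small standalone sketch. The code below is not taken from the crawler's source. It is one plausible reading of the description above, using only the standard `re` module, and the helper name `links_to_follow` is hypothetical.

```python
import re


def links_to_follow(page_url, links, rules):
    """Return the links found on page_url that the rules allow following.

    Assumed semantics (not confirmed by the crawler's source): a link is
    followed when page_url matches a key pattern and the link matches at
    least one pattern in that key's list.
    """
    allowed = []
    for page_pattern, follow_patterns in rules.items():
        if not re.match(page_pattern, page_url):
            continue  # this rule does not apply to the current page
        for link in links:
            if link in allowed:
                continue  # keep each link only once
            if any(re.match(p, link) for p in follow_patterns):
                allowed.append(link)
    return allowed


rules = {r'^(http://.+example\.com)(.+)$': [r'^(http://.+example\.com)(.+)$']}
links = ['http://www.example.com/about', 'http://other.org/page']
print(links_to_follow('http://www.example.com/', links, rules))
# prints ['http://www.example.com/about']
```

Note that the example pattern requires at least one character between `http://` and `example`, and at least one character after `.com`, so a URL such as `http://example.com/` or `http://www.example.com` (no trailing slash) would not match it.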

Data files

Three Berkeley DB database files are generated:

  • queue.db
  • webpage.db
  • duplcheck.db

Windows installation howto:

Ubuntu installation howto:

  • apt-get install python-lxml
  • apt-get install python-bsddb3
  • install python-crawler (run from its source directory): python setup.py install
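
After installation, a quick import check can confirm that the dependencies are visible to Python. This is a minimal sketch and not part of the project. The helper name `missing_modules` is hypothetical. The module names `lxml` and `bsddb3` are the import names provided by the two packages above, and `crawler.crawler` is the module imported in the example code.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == '__main__':
    required = ['lxml', 'bsddb3', 'crawler.crawler']
    missing = missing_modules(required)
    if missing:
        print('Missing modules: ' + ', '.join(missing))
    else:
        print('All required modules are importable.')
```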