WritingCustomCrawlers  
Writing Custom Crawlers by subscribing to HarvestMan Events
Updated Feb 10, 2010 by abpil...@gmail.com

How to use HarvestMan events API to write custom crawlers

HarvestMan events API

HarvestMan provides a well-defined events API that developers can use to write custom crawlers suited to a specific crawling or data-mining task.

Events

Events are implemented using a callback mechanism. At different times during program execution, HarvestMan raises events with specific names. You hook custom functions into these events by subscribing to them and defining functions that process the state supplied along with each event.
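The callback mechanism can be pictured with a minimal sketch. The EventRegistry class, its raise_event method, and the on_save function below are hypothetical names made up for illustration; this is not HarvestMan's implementation, only the general shape of "subscribe by name, get called back with state".

```python
# Illustrative sketch only; this is not HarvestMan's implementation.
from collections import defaultdict


class EventRegistry:
    """Hypothetical registry mapping event names to subscriber callbacks."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, name, handler):
        """Subscribe handler to the event called name."""
        self._handlers[name].append(handler)

    def raise_event(self, name, state, *args, **kwargs):
        """Call each subscriber with the state and return their results."""
        return [h(state, *args, **kwargs) for h in self._handlers.get(name, [])]


seen = []


def on_save(state, *args, **kwargs):
    """Example subscriber: records the URL it was handed."""
    seen.append(state["url"])
    return True


registry = EventRegistry()
registry.register("save_url_data", on_save)
results = registry.raise_event("save_url_data", {"url": "http://example.com/a.png"})
```

Raising an event that nobody subscribed to simply calls no handlers in this sketch.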

Events are mainly of two types: post events are raised after an action is performed, and pre (before) events are raised before an action is performed. In HarvestMan, pre events are more useful for controlling program flow, since their return values are checked for True/False to decide whether the rest of the processing is done. For more information, read on.

Illustration

Let us say that you want to write a custom crawler which saves only images which are larger than 4K to the disk (a practical example of this would be a crawler which ignores thumbnail images, since thumbnails are typically of size 2K-4K). This is how you would do this by subscribing to the save_url_data event.

First, define a custom crawler class that subclasses the HarvestMan class.

from harvestman.apps.spider import HarvestMan
from harvestman.lib.common.macros import *

class MyCustomCrawler(HarvestMan):
    """ A custom crawler """

    size_threshold = 4096

    def save_this_url(self, event, *args, **kwargs):
        """ Custom callback function which modifies behaviour
            of saving URLs to disk """

        # Get the url object
        url = event.url
        # If not image, save always
        if not url.is_image():
            return True
        else:
            # If image, check for content-length > 4K
            size = url.clength
            return (size>self.size_threshold)

# Set up the custom crawler
if __name__ == "__main__":
    crawler = MyCustomCrawler()
    crawler.initialize()
    # Get the configuration object
    config = crawler.get_config()
    # Register for 'save_url_data' event which will be called
    # back just before a URL is saved to disk
    crawler.register('save_url_data', crawler.save_this_url)
    # Run
    crawler.main()

You can run the program just as you would run HarvestMan. For example, if you save this code to a file named customcrawler.py, you would run it as follows.

$ python customcrawler.py [URL]

Here is a sample crawl of a site containing images.

$ python customcrawler.py http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
/usr/local/lib/python2.6/dist-packages/HarvestMan-2.0.4betadev_r210-py2.6.egg/harvestman/lib/crawler.py:53: DeprecationWarning: the sha module is deprecated; use the hashlib module instead
  import sha
/usr/local/lib/python2.6/dist-packages/HarvestMan-2.0.4betadev_r210-py2.6.egg/harvestman/lib/urlparser.py:50: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Loading user configuration... 
Starting HarvestMan 2.0 beta 5... 
Copyright (C) 2004, Anand B Pillai 
  
[2010-02-10 19:21:51,052] *** Log Started ***
 
[2010-02-10 19:21:51,052] Starting project www.tcm.phy.cam.ac.uk ...
[2010-02-10 19:21:51,052] Writing Project Files... 
[2010-02-10 19:21:51,191] Starting download of url http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html ...
[2010-02-10 19:21:51,250] Reading Project Cache... 
[2010-02-10 19:21:51,253] Project cache not found 
[2010-02-10 19:21:51,256] Downloading http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
[2010-02-10 19:21:52,211] Saved /home/anand/work/harvestman/HarvestMan-lite/harvestman/apps/samples/www.tcm.phy.cam.ac.uk/www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
[2010-02-10 19:21:52,299] Fetching links http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
[2010-02-10 19:21:52,730] Downloading http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/index.html
[2010-02-10 19:21:52,731] Downloading http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0002.html
...

Diving deep

Let us dissect the custom crawler application we have built to understand the events API.

Here are the steps involved.

  • Create a custom crawler class inheriting from the HarvestMan class.
  • Create a custom function which hooks into a specific event.
  • In the __main__ section, configure the crawler to subscribe to the event by using the register method. This method takes the event name as its first argument and the event handler as its second argument.

Let us take a deeper look at the function save_this_url.

    def save_this_url(self, event, *args, **kwargs):
        """ Custom callback function which modifies behaviour
            of saving URLs to disk """

        # Get the url object
        url = event.url
        # If not image, save always
        if not url.is_image():
            return True
        else:
            # If image, check for content-length > 4K
            size = url.clength
            return (size>self.size_threshold)

The function simply tells the crawler to always save non-image URLs, by returning True. For image URLs, it checks the size and returns True only if the size is greater than the threshold size; otherwise it returns False.
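To see the decision logic in isolation, the handler can be exercised without running a crawl. The sketch below restates the same logic as a plain function and feeds it stand-in objects; FakeUrl and FakeEvent are hypothetical stubs that only mimic the is_image() method and the clength and url attributes used above.

```python
SIZE_THRESHOLD = 4096  # same value as size_threshold in the crawler class


class FakeUrl:
    """Hypothetical stand-in for the URL object (not HarvestManUrl)."""

    def __init__(self, image, clength):
        self._image = image
        self.clength = clength

    def is_image(self):
        return self._image


class FakeEvent:
    """Hypothetical stand-in for the event object; only carries url."""

    def __init__(self, url):
        self.url = url


def save_this_url(event, *args, **kwargs):
    """Same logic as the method above, written as a plain function."""
    url = event.url
    if not url.is_image():
        return True
    return url.clength > SIZE_THRESHOLD


decisions = {
    "web page": save_this_url(FakeEvent(FakeUrl(False, 100))),
    "thumbnail": save_this_url(FakeEvent(FakeUrl(True, 2048))),
    "exactly 4K": save_this_url(FakeEvent(FakeUrl(True, 4096))),
    "photo": save_this_url(FakeEvent(FakeUrl(True, 50000))),
}
```

Because the comparison is strict, an image of exactly 4096 bytes is rejected along with the smaller thumbnails.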

The main object of interest here is the event object. This object contains all the state the programmer needs to write the custom behavior. The event object is of type Event, a class defined in the module harvestman.lib.event.

The Event class is defined as follows.

class Event(object):
    """ Event class for HarvestMan """

    def __init__(self):
        self.name = ''
        self.config = objects.config
        self.url = None
        self.document = None

The attributes of the class are name, config, url and document. Of these, the attributes of primary interest to the developer are url and document.

The url attribute

The url attribute contains the current URL object which is being processed. The URL object is of type HarvestManUrl (in module harvestman.lib.urlparser). It keeps all the state of the current URL under processing.

The document attribute

The document attribute holds information on the current web page being crawled. The document object is of type HarvestManDocument (module harvestman.lib.document). This object holds information on the URL as a document, i.e. its content, etag, last-modified time, etc. The document object is useful for URLs which represent web pages or documents such as PDF files.

The config attribute

This attribute binds to the global configuration object. Instead of having to call objects.config every time, you can get the global configuration in the event handler by accessing the event.config attribute.

The name attribute

This will contain the name of the event. For example in the above code, this would be save_url_data.

How to use attributes of event object

For most events, the url attribute is present and is required to do any meaningful processing. The document attribute is present only for events that deal with a web page with parseable content. For some events that are related to program stages (such as the start or end of a project), these attributes may not be set, i.e. they will be None.
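Since either attribute can be None, a handler shared across several events is safer if it checks before use. A minimal sketch, assuming stub events built with types.SimpleNamespace in place of real Event objects; log_event and log are hypothetical names.

```python
from types import SimpleNamespace

log = []


def log_event(event, *args, **kwargs):
    """Record which of url/document are actually set for this event."""
    present = [attr for attr in ("url", "document")
               if getattr(event, attr, None) is not None]
    log.append((event.name, present))
    return True  # let processing continue


# Stub events standing in for what the crawler would pass in.
log_event(SimpleNamespace(name="post_crawl_complete", url=None, document=None))
log_event(SimpleNamespace(name="before_download_url",
                          url="http://example.com/", document=None))
log_event(SimpleNamespace(name="before_parse_url",
                          url="http://example.com/", document="<html></html>"))
```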

Additional arguments

Specific events may pass additional arguments to the event handler. Positional arguments appear in *args and keyword arguments in **kwargs. For example, the before_tag_parse event passes in the current HTML tag and its attributes as positional arguments.
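As an illustration, a before_tag_parse handler might read the tag name from the first positional argument. This is a sketch under assumptions: the set of skipped tags is an arbitrary choice, the keyword name "tag" used as a fallback is a guess, and the stub event is not a real Event object.

```python
from types import SimpleNamespace

SKIPPED_TAGS = {"script", "style"}  # hypothetical choice for this example


def skip_unwanted_tags(event, *args, **kwargs):
    """Return False for tags we do not want parsed, True otherwise."""
    # Positional form: (tag, attrs). The keyword name "tag" is an assumption.
    tag = args[0] if args else kwargs.get("tag")
    if tag is None:
        return True  # nothing to inspect, do not interfere
    return tag.lower() not in SKIPPED_TAGS


event = SimpleNamespace(name="before_tag_parse", url="http://example.com/")
parse_script = skip_unwanted_tags(event, "SCRIPT", {"src": "a.js"})
parse_anchor = skip_unwanted_tags(event, "a", {"href": "/next.html"})
parse_unknown = skip_unwanted_tags(event)
```

With the real crawler such a handler would be subscribed the same way as in the illustration, using register('before_tag_parse', handler).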

Table of Events

The following table lists the main events raised by HarvestMan. For each event it gives the point in the program flow at which the event is raised, the attributes that are filled in, any additional positional arguments, the module which raises the event, and comments.

| Event | Raised when | Attributes | Positional arguments | Module | Comments |
|---|---|---|---|---|---|
| before_start_project | Before starting a project | url | None | harvestman.apps.spider | url is the starting URL |
| post_start_project | After starting a project | url | None | harvestman.apps.spider | url is the starting URL |
| before_finish_project | Before finishing a project | url | None | harvestman.apps.spider | url is the starting URL |
| after_finish_project | After finishing a project | url | None | harvestman.apps.spider | url is the starting URL |
| before_crawl_url | Before a URL is crawled | url, document | None | harvestman.lib.crawler | crawled here means the function crawl_url |
| post_crawl_url | After a URL is crawled | url, document | None | harvestman.lib.crawler | crawled here means the function crawl_url |
| before_download_url | Before a URL is downloaded | url | None | harvestman.lib.crawler | |
| before_parse_url | Before a URL is parsed | url, document | None | harvestman.lib.crawler | This always comes after the before_download_url hook |
| post_parse_url | After a URL is parsed | url, document | links | harvestman.lib.crawler | links stands for the child links of this URL |
| before_url_connect | Before connection for a URL is done | url | last_modified, etag | harvestman.lib.connector | last_modified and etag are valid (not None) only if there is a cache for the URL |
| post_url_connect | After connection for a URL is done | url | None | harvestman.lib.connector | |
| save_url_data | Before saving data for a URL to disk | url | data | harvestman.lib.connector | data is the content of the URL |
| post_crawl_complete | After the crawl is completed | None | None | harvestman.lib.datamgr | |
| before_tag_parse | Before an HTML tag is parsed | url | tag, attrs | harvestman.lib.pageparser | tag is the tag name and attrs the attributes dictionary |
| before_tag_data | Before CDATA of an HTML tag is parsed | url | tag, cdata | harvestman.lib.pageparser | tag is the tag name and cdata is its CDATA |
| include_this_url | Before a URL is checked for rules | url | None | harvestman.lib.rules | This comes before a URL is crawled |
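As one way of reading a row of the table, here is a sketch of a handler for post_parse_url, which receives the child links as a positional argument. It assumes links is an iterable; the stub event, the string URLs, and the names collect_links and collected are hypothetical.

```python
from types import SimpleNamespace

collected = {}


def collect_links(event, *args, **kwargs):
    """Remember the child links reported for each parsed URL."""
    # Assumption: links arrives as the first positional argument and is iterable.
    links = list(args[0]) if args else []
    collected[event.url] = links
    # post_parse_url is a post event, so the return value is not checked.


parent = "http://example.com/index.html"
event = SimpleNamespace(name="post_parse_url", url=parent, document=None)
collect_links(event, ["http://example.com/a.html", "http://example.com/b.html"])
```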

Programming using Events

The key to programming with events is that the programmer can control the program flow by binding to any before event and returning True or False depending on the desired logic.

For every event that is raised before a certain action is performed, the return value of the event processing is checked. If the return value is False, the rest of the processing in the function which raised the event is NOT done. If the return value is True, the function continues processing as if nothing happened.

This can be exploited to write custom crawlers that perform specific actions. For example, in the sample code illustrated previously, we return True if the URL is not an image, and False if the URL is an image whose size does not exceed the given threshold. This way we modify the behavior of the function which raised the event, causing the program to not save images at or below that size.
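The effect of a pre event on the raising function can be sketched as follows. This is a simplified model, not HarvestMan's code: save_url and reject_small are hypothetical, the size check uses the length of the data rather than the URL's content length, and the sketch assumes that only an explicit False vetoes the action.

```python
from types import SimpleNamespace


def save_url(url, data, handlers, disk):
    """Hypothetical action that raises a pre event before doing its work."""
    event = SimpleNamespace(name="save_url_data", url=url)
    for handler in handlers:
        # Assumption: only an explicit False stops the rest of the function.
        if handler(event, data) is False:
            return False
    disk[url] = data  # the "rest of the processing"
    return True


def reject_small(event, data):
    """Veto payloads of 4096 bytes or fewer."""
    return len(data) > 4096


disk = {}
small_saved = save_url("thumb.png", b"x" * 2048, [reject_small], disk)
large_saved = save_url("photo.png", b"x" * 8192, [reject_small], disk)
```

Only photo.png ends up in disk; the handler's False return prevented the write for thumb.png.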

Post events (events raised after an action, check table) are also useful, but since their return values are not checked in code, they are much less useful in controlling program flow when compared to pre events.

NOTE: In the table, any event for which Raised when says before is a pre-event. Any event for which it says post or after is a post-event.

More Reading

For more information, check out the sample custom crawler applications in the folder harvestman/apps/samples in the code base. Also read the HOWTO at doc/events.HOWTO.

Caveat

In the earlier released versions, the register method is not present. Instead, the method is named bind_event and takes the same arguments. Also, in those earlier released versions, positional arguments are not supported, and additional arguments are always passed in as keyword arguments. This document conforms to the most recent release (HarvestMan 2.0.5 beta) and the current trunk code under the HarvestMan-lite branch.
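Code that has to work with both method names could pick whichever one exists at run time. A minimal sketch: subscribe is a hypothetical helper, and NewCrawler and OldCrawler are stubs standing in for the two generations of the crawler class, relying only on the statement above that both methods take the same arguments.

```python
def subscribe(crawler, event_name, handler):
    """Use register() when available, otherwise fall back to bind_event()."""
    method = getattr(crawler, "register", None)
    if method is None:
        method = crawler.bind_event  # AttributeError if neither exists
    return method(event_name, handler)


class NewCrawler:
    """Stub with the newer method name."""

    def __init__(self):
        self.calls = []

    def register(self, name, handler):
        self.calls.append(("register", name))


class OldCrawler:
    """Stub with the older method name."""

    def __init__(self):
        self.calls = []

    def bind_event(self, name, handler):
        self.calls.append(("bind_event", name))


def handler(event, *args, **kwargs):
    return True


new, old = NewCrawler(), OldCrawler()
subscribe(new, "save_url_data", handler)
subscribe(old, "save_url_data", handler)
```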

Comment by remarkability@gmail.com, Oct 11, 2010

After digging around, I found none of these events seem to "bind" properly, but writeurl does...

spider.bind_event('writeurl', spider.save_url_data)

def save_url_data(self, event, args, kwargs):
    # data = args[0]?
    # print len(data)
    # absurl = event.url.absurl
    # url = absurl
    logconsole( "EVENT:", event.name )
    logconsole( "EVENT URL:", event.url)
    # logconsole( "EVENT:", event.config )
    return False

