WritingCustomCrawlers
Writing Custom Crawlers by subscribing to HarvestMan Events
How to use the HarvestMan events API to write custom crawlers

HarvestMan events API

HarvestMan provides a well-defined events API which developers can use to write custom crawlers suited to a specific crawling or data-mining task.

Events

Events are implemented using a callback mechanism. At different points during program execution, HarvestMan raises events with specific names. You hook into an event by subscribing to it with a function of your own, which processes the state supplied along with the event.

Events are mainly of two types: post events are raised after an action is performed, and pre (before) events are raised prior to performing an action. In HarvestMan, pre events are more useful for controlling program flow, since their return values are checked for True/False to decide whether the rest of the processing happens. For more information, read on.

Illustration

Let us say that you want to write a custom crawler which saves only images larger than 4K to disk (a practical example would be a crawler which ignores thumbnail images, since thumbnails are typically 2K-4K in size). This is how you would do it by subscribing to the save_url_data event.

First you need to define a custom crawler class which subclasses the HarvestMan class.

from harvestman.apps.spider import HarvestMan
from harvestman.lib.common.macros import *

class MyCustomCrawler(HarvestMan):
    """ A custom crawler """

    size_threshold = 4096

    def save_this_url(self, event, *args, **kwargs):
        """ Custom callback function which modifies behaviour
        of saving URLs to disk """

        # Get the url object
        url = event.url
        # If not image, save always
        if not url.is_image():
            return True
        else:
            # If image, check for content-length > 4K
            size = url.clength
            return (size>self.size_threshold)

# Set up the custom crawler
if __name__ == "__main__":
    crawler = MyCustomCrawler()
    crawler.initialize()
    # Get the configuration object
    config = crawler.get_config()
    # Register for 'save_url_data' event which will be called
    # back just before a URL is saved to disk
    crawler.register('save_url_data', crawler.save_this_url)
    # Run
    crawler.main()

You can run the program just as you would run HarvestMan. For example, if you save this code to a file named customcrawler.py, you would run it as:

$ python customcrawler.py [URL]

Here is a sample crawl of a site containing images.

$ python customcrawler.py http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
/usr/local/lib/python2.6/dist-packages/HarvestMan-2.0.4betadev_r210-py2.6.egg/harvestman/lib/crawler.py:53: DeprecationWarning: the sha module is deprecated; use the hashlib module instead
  import sha
/usr/local/lib/python2.6/dist-packages/HarvestMan-2.0.4betadev_r210-py2.6.egg/harvestman/lib/urlparser.py:50: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Loading user configuration...
Starting HarvestMan 2.0 beta 5...
Copyright (C) 2004, Anand B Pillai
[2010-02-10 19:21:51,052] *** Log Started ***
[2010-02-10 19:21:51,052] Starting project www.tcm.phy.cam.ac.uk ...
[2010-02-10 19:21:51,052] Writing Project Files...
[2010-02-10 19:21:51,191] Starting download of url http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html ...
[2010-02-10 19:21:51,250] Reading Project Cache...
[2010-02-10 19:21:51,253] Project cache not found
[2010-02-10 19:21:51,256] Downloading http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
[2010-02-10 19:21:52,211] Saved /home/anand/work/harvestman/HarvestMan-lite/harvestman/apps/samples/www.tcm.phy.cam.ac.uk/www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
[2010-02-10 19:21:52,299] Fetching links http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0001.html
[2010-02-10 19:21:52,730] Downloading http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/index.html
[2010-02-10 19:21:52,731] Downloading http://www.tcm.phy.cam.ac.uk/~pdh1001/Photo_Album/Kilimanjaro/pic0002.html
...

Diving deep

Let us dissect the custom crawler application we have built to understand the events API.
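Here are the steps involved.

1. Define a custom crawler class, MyCustomCrawler, which subclasses HarvestMan.
2. Define a callback method, save_this_url, which accepts the event object along with *args and **kwargs.
3. Create an instance of the custom crawler and call initialize() on it.
4. Subscribe the callback to the save_url_data event by calling register().
5. Call main() to start the crawl.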
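As a rough mental model of why the return value of the callback matters, the sketch below mimics a pre-event hook in plain Python. It is not HarvestMan's implementation: MiniCrawler, FakeEvent, FakeUrl and the save method are hypothetical stand-ins so that the example runs without HarvestMan installed. Only the register(name, callback) call shape and the save_url_data event name are taken from the sample above, and treating only an explicit False as "skip" is an assumption.

```python
class FakeUrl(object):
    """Hypothetical stand-in for a URL object (not HarvestManUrl)."""

    def __init__(self, name, clength, image):
        self.name = name
        self.clength = clength
        self._image = image

    def is_image(self):
        return self._image


class FakeEvent(object):
    """Hypothetical stand-in carrying the state passed to callbacks."""

    def __init__(self, name, url):
        self.name = name
        self.url = url


class MiniCrawler(object):
    """Toy crawler showing how a pre-event can gate an action."""

    def __init__(self):
        self.callbacks = {}
        self.saved = []

    def register(self, name, callback):
        self.callbacks[name] = callback

    def save(self, url):
        callback = self.callbacks.get('save_url_data')
        if callback is not None:
            event = FakeEvent('save_url_data', url)
            # Assumption: an explicit False skips the rest of the action
            if callback(event) is False:
                return False
        self.saved.append(url.name)
        return True


def skip_small_images(event, *args, **kwargs):
    """Same decision logic as save_this_url in the sample."""
    url = event.url
    if not url.is_image():
        return True
    return url.clength > 4096


crawler = MiniCrawler()
crawler.register('save_url_data', skip_small_images)
for url in (FakeUrl('index.html', 1200, False),
            FakeUrl('thumb.jpg', 2048, True),
            FakeUrl('photo.jpg', 90000, True)):
    crawler.save(url)
print(crawler.saved)
```

With these three made-up URLs, the page and the large photo are saved and the 2K thumbnail is skipped, so the script prints ['index.html', 'photo.jpg'].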
Let us take a deeper look at the function save_this_url.

def save_this_url(self, event, *args, **kwargs):
    """ Custom callback function which modifies behaviour
    of saving URLs to disk """

    # Get the url object
    url = event.url
    # If not image, save always
    if not url.is_image():
        return True
    else:
        # If image, check for content-length > 4K
        size = url.clength
        return (size>self.size_threshold)

The function simply tells the crawler to always save non-image URLs by returning True. For image URLs, it checks the size and returns True only if the size is greater than the threshold, else False.

The main object of interest here is the event object. This object contains all the state the programmer needs to write the custom behavior. The event object is of type Event, a class defined in the module harvestman.lib.event. The Event class is defined as follows.

class Event(object):
    """ Event class for HarvestMan """

    def __init__(self):
        self.name = ''
        self.config = objects.config
        self.url = None
        self.document = None

The attributes of the class are name, config, url and document. Of these, the attributes of primary interest to the developer are url and document.

The url attribute

The url attribute contains the URL object currently being processed. The URL object is of type HarvestManUrl (in module harvestman.lib.urlparser). It keeps all the state of the URL under processing.

The document attribute

The document attribute holds information on the web page currently being crawled. The document object is of type HarvestManDocument (module harvestman.lib.document). This object holds information on the URL as a document, i.e. its content, etag, last modified time etc. The document object is useful for URLs which represent web pages or documents such as PDF files.

The config attribute

This attribute binds to the global configuration object. Instead of having to access objects.config every time, you can get the global configuration in the event handler through the event.config attribute.

The name attribute

This contains the name of the event. For example, in the above code, this would be save_url_data.

How to use attributes of event object

For most events, the url attribute is present and is required to do any meaningful processing. The document attribute is present only for events which deal with a web page with parseable content. For events related to program stages (such as the start or end of a project), neither attribute is set, i.e. both will be None.

Additional arguments

Additional arguments may be passed to the event handler by specific events. Positional arguments appear in *args and keyword arguments in **kwargs. For example, the before_tag_parse event passes in the current HTML tag and its attributes as positional arguments.
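Since url and document may be None for some events, and extra data may arrive either positionally or by keyword, a handler may want to guard its attribute access. The following is a minimal sketch of that pattern, not code from HarvestMan: StubEvent is a hypothetical stand-in with the same attribute names as Event, the event name hypothetical_stage_event is made up, and the (tag, attributes) order shown for before_tag_parse is an assumption.

```python
class StubEvent(object):
    """Hypothetical stand-in with the same attribute names as Event."""

    def __init__(self, name, url=None, document=None):
        self.name = name
        self.config = None
        self.url = url
        self.document = document


def describe(event, args, kwargs):
    """Build a one-line summary of the state that reached a handler."""
    parts = [event.name]
    # url and document are None for events tied to program stages
    if event.url is not None:
        parts.append('url=%s' % event.url)
    if event.document is not None:
        parts.append('document=%s' % event.document)
    # Extra data may arrive positionally or by keyword
    if args:
        parts.append('args=%r' % (args,))
    for key in sorted(kwargs):
        parts.append('%s=%r' % (key, kwargs[key]))
    return ' '.join(parts)


seen = []


def log_event(event, *args, **kwargs):
    """Handler that records what it received and never blocks processing."""
    seen.append(describe(event, args, kwargs))
    return True


stage = StubEvent('hypothetical_stage_event')
tag = StubEvent('before_tag_parse', url='http://example.com/')

log_event(stage)                          # no url, no document
log_event(tag, 'a', {'href': 'x.html'})   # extra data as positional arguments
log_event(tag, tag='a')                   # extra data as keyword arguments
for line in seen:
    print(line)
```

Accepting both *args and **kwargs in every handler, as the sample crawler does, keeps the same function usable whichever way an event chooses to pass its extra data.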
Table of Events

The following table lists the main events raised by HarvestMan along with, for each event, the attributes that are filled in, any additional arguments, the point in program flow at which the event is raised, and the module which raises it.
Programming using Events

The key to programming with events is that the programmer can control program flow by binding to any before event and returning True or False depending upon their logic. Every event raised before an action is performed has the return value of its handler checked. If the return value is False, the rest of the processing in the function which raised the event is NOT done. If the return value is True, the function continues processing as if nothing happened. This can be exploited to write custom crawlers that perform specific actions.

For example, in the sample code illustrated previously, we return True if the URL is not an image, and False if the URL is an image whose size does not exceed the given threshold. This way we modify the behaviour of the function which raised the event, causing the program not to save image URLs below a certain size.

Post events (events raised after an action; check the table) are also useful, but since their return values are not checked in code, they are much less useful for controlling program flow than pre events.

NOTE: In the table, any event for which Raised when says before is a pre-event. Any event for which it says post or after is a post-event.

More Reading

For more information, check out the sample custom crawler applications in the folder harvestman/apps/samples in the code-base. Also read the HOWTO at doc/events.HOWTO.

Caveat

In the earlier released versions, the register method is not present. Instead the method is named bind_event and takes the same arguments. Also, in released versions positional arguments are not supported, and additional arguments are always passed in as keyword arguments. This document conforms to the most recent release (HarvestMan 2.0.5 beta) and the current trunk code under the HarvestMan-lite branch.
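Given the caveat above, a script that has to run against both an older release and the current trunk could pick whichever subscription method is available. The sketch below shows one way to do that. The subscribe helper is hypothetical, and NewCrawler and OldCrawler are stand-ins used only so the example runs without HarvestMan installed; the only details taken from this page are the two method names and that they take the same arguments.

```python
def subscribe(crawler, event_name, callback):
    """Subscribe using register() if present, else fall back to bind_event()."""
    method = getattr(crawler, 'register', None)
    if method is None:
        method = getattr(crawler, 'bind_event', None)
    if method is None:
        raise AttributeError('crawler has neither register nor bind_event')
    return method(event_name, callback)


class NewCrawler(object):
    """Stand-in for a crawler that exposes register()."""

    def __init__(self):
        self.calls = []

    def register(self, name, callback):
        self.calls.append(('register', name))


class OldCrawler(object):
    """Stand-in for a crawler that only exposes bind_event()."""

    def __init__(self):
        self.calls = []

    def bind_event(self, name, callback):
        self.calls.append(('bind_event', name))


def handler(event, *args, **kwargs):
    return True


new, old = NewCrawler(), OldCrawler()
subscribe(new, 'save_url_data', handler)
subscribe(old, 'save_url_data', handler)
print(new.calls)
print(old.calls)
```

In a real script the first argument would be the custom crawler instance, for example subscribe(crawler, 'save_url_data', crawler.save_this_url) in place of the direct crawler.register call.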
After digging around, I found none of these events seem to "bind" properly, but writeurl does...