spidey


Easy to use yet powerful Perl wrapper of HtmlUnit for web scraping

News:
* Waiting for my company to give me permission to release this as open source. Fingers crossed!
* ~~Spidey now available on CPAN~~.
* Take the Spidey quick tutorial!

Looking for co-developers: if interested please contact me.

Simplicity is a difficult thing to achieve - Charlie Chaplin

Spidey (in Italian, "ragnetto") is a very easy-to-use library that provides a browser object, letting the developer interact with a website much as a real browser would. In particular, it provides subcommands to open pages, follow links, change form data, and submit forms.

Spidey sits on top of the CPAN module WWW::HtmlUnit but departs from any Mechanize-ish syntax.

It is for those who need adequate JavaScript support but don't like Java, and who want to develop scalable web crawlers in a scripting language like Perl. You work at a very abstract level, almost as if using a browser manually, while taking advantage of all the Perl goodness for data conversion, some of which is even provided by the library.

In fact, Spidey is much faster and more robust than driving a real browser, as with WWW::Mechanize::Firefox, or than any other screen-scraping solution.

Image: Arvind Balaraman / FreeDigitalPhotos.net

Project Information

The project was created on Feb 20, 2011.

Labels:
web scraping harvesting perl library framework crawler browser automation