spidey - QuickTutorial.wiki


Introduction

You need to know some basic HTML and Perl. If you don't know Perl, here's a free modern book about it.

We are going to scrape some cheap flight dates and prices from Ryanair. I actually use this script to save money on my trips back to Italy.

If you are not familiar with the Ryanair booking form, go here and try a search.

Requiring the Spidey library

Usually you will want to write your spiders into separate Perl modules (e.g. MySpiders::Ryanair), but to keep things simple we will use a main program.

First you require the Spidey library:

use WWW::HtmlUnit::Spidey;

By default this imports some Tcl-ish commands (subs) such as browser, file, node and table into your namespace, along with a %Conf hash of configuration parameters.

Should some names clash with other libraries or your code, you can either use:

use WWW::HtmlUnit::Spidey qw( :NOTCL );

to get non Tcl-ish commands (e.g. browser_ini rather than browser 'ini') or:

use WWW::HtmlUnit::Spidey ();

to import nothing at all. You will then have to use a lot of long namespace prefixes, e.g. WWW::HtmlUnit::Spidey::browser.

Some date manipulation code

This is not related to web scraping, but it is needed for this example: we are going to search for flights from tomorrow up to one week from now, so we need to set future dates by their constituent day, month and year.

A way to find the date N days from now is by using:

use Date::Calc (':all');

There are a lot of other date manipulation modules on CPAN, we simply chose one.

    sub next_date {
        my ($yy, $mm, $dd) = Today();
        ($yy, $mm, $dd) = Add_Delta_Days($yy, $mm, $dd, $_[0]);
        $dd = sprintf('%02u', $dd);
        $mm = sprintf('%02u', $mm);
        # These values are in exactly the same format as option values on site.
        return ($dd, "$mm$yy");
    }

E.g. if today is the 28th of February 2011, next_date(1) will give you ('01', '032011') meaning the 1st of March 2011.
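The worked example can be checked without waiting for a particular day. The following is a minimal sketch, not part of Spidey or of the tutorial script: date_after is a hypothetical variant of next_date that takes the starting date as arguments, and it uses the core modules Time::Local and POSIX instead of Date::Calc.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Hypothetical helper (not in the original script): same output format as
# next_date, but the starting date is passed in instead of being "today".
sub date_after {
    my ($yy, $mm, $dd, $days) = @_;

    # Work at noon UTC so that adding whole days never crosses a DST change.
    my $t   = timegm(0, 0, 12, $dd, $mm - 1, $yy) + $days * 86_400;
    my @utc = gmtime($t);

    # Day as 'DD', month and year as 'MMYYYY', like the form's option values.
    return (strftime('%d', @utc), strftime('%m%Y', @utc));
}

my ($d, $m) = date_after(2011, 2, 28, 1);
print "$d $m\n";    # prints "01 032011"
```

Passing the base date explicitly also makes the month and year rollovers easy to test.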

Let the dance begin

We need two variables, one for the browser and another for the current page:

my ($b, $p);

Then we initialize a browser with the default options:

$b = browser 'ini';

And point it directly to the Ryanair booking form:

$p = browser 'get', $b, 'http://www.ryanair.com/en/booking/form';

For debugging purposes you may want to save a copy of this page that you can open and inspect with any browser (including the text-only ones):

file 'slide', path => 'booking_form.html', obj => $p;

Look for the saved copy in /tmp/spidey/main/.

You can prevent Spidey from fetching some of the resources linked from an HTML page. By default external CSS style sheets are not fetched, but all external JavaScript files are. To skip those as well, do this before fetching pages:

browser 'ex', $b, '\.js$';

'ex' stands for exclude, and the last argument is a regular expression; this one matches anything ending in .js.
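To see what a pattern like this catches, you can try it on a few URLs in plain Perl. This is only a sketch: the URLs are made up, and it assumes that Spidey tests the pattern against the complete URL of each resource, which this tutorial does not spell out.

```perl
use strict;
use warnings;

# Same pattern as the one passed to browser 'ex'.
my $exclude = qr/\.js$/;

# Made-up URLs, for illustration only.
my @urls = (
    'http://www.example.com/static/menu.js',
    'http://www.example.com/static/menu.js?v=3',
    'http://www.example.com/static/style.css',
    'http://www.example.com/news.json',
);

for my $url (@urls) {
    printf "%-45s %s\n", $url, ($url =~ $exclude ? 'excluded' : 'fetched');
}
```

Under that assumption only the first URL is excluded: the query string in the second one defeats the $ anchor. A pattern such as '\.js(\?|$)' would cover both.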

Filling in a form

Just as when you fill in a form by hand, you first have to locate the controls you want to type into or click. In Spidey you do that using unique NAME or ID attributes, CSS selectors and, in more complex cases, XPath expressions. By default matching is on IDs, and that is all we need:

```
# Departing from Edinburgh...
node(action => 'set', value => 'aEDI', match => 'sector1_o', page => $p);
```

This means: in page $p, match the element with ID sector1_o, which is the departing airport, and set its value to aEDI (the default action is get). We have to use the value attribute of an HTML OPTION tag, not its label, such as Edinburgh (EDI).

Should upstream change something and your adaptive code break, node will print and log some useful debugging info like this:

    WWW::HtmlUnit::Spidey::logdie('Node not found') called at Spidey.pm line 671
    eval {...} called at Spidey.pm line 726
    Spidey::node('action', 'set', 'value', 'aEDI', 'match', 'sector1_o', 'page', 'WWW::HtmlUnit::com::gargoylesoftware::htmlunit::html::HtmlPag...') called at ./Ryanair.pl line 67

Ditto for the other controls:

```
# ...to Rome.
node(action => 'set', value => 'CIA', match => 'sector1_d', page => $p);

# Depart tomorrow...
my ($d, $m) = next_date(1);
node(action => 'set', value => $d, match => 'sector_1_d', page => $p);
node(action => 'set', value => $m, match => 'sector_1_m', page => $p);

# ...return in 1 week.
($d, $m) = next_date(7);
node(action => 'set', value => $d, match => 'sector_2_d', page => $p);
node(action => 'set', value => $m, match => 'sector_2_m', page => $p);
```

The form won't submit if we don't accept the terms of use. This needs a slight variation: to tick a checkbox you have to click on it, because changing its value is not what you want:

node(action => 'click', match => 'acceptTerms', page => $p);

Then we slide the filled-in form to check that our code did what we wanted:

file 'slide', path => 'filled-in_booking_form.html', obj => $p;

And finally we submit the query by clicking on a button (the only BUTTON tag present).

$p = node(action => 'click', by => 'tag', match => 'button', page => $p);

Note that when the page changes, node returns a new page that we need to save somewhere to take further actions. We use the same variable $p because we are not interested in the previous form page anymore.

Reading tables

We had better slide the results page to make sure that it really is one:

file 'slide', path => '2-results.html', obj => $p;

Then let's reap the fruit of our labor. The departure dates and respective ticket prices are all found in a 1-row table easily located by ID:

my $t = node(match => 'ttable1', page => $p);

Here is an HTML code sample of a placeholder cell for a day with no flights:

<th class="noFl"> <b> Sun </b> , 27 Feb 11 <br/> <div class="planeNoFlights" title="No Flights"> </div> </th>

and one with actual data:

<th class="on"> <a id="tab_1_2011_2_28" href="javascript:setTabIndex('1', '1')"> <b> Mon </b> , 28 Feb 11 <br/> from <br/> <b> 59.99 </b> GBP </a> </th>

We could use:

print $t->asText;

and try to parse some text like this:

Sun, 27 Feb 11 Mon, 28 Feb 11 from 59.99 GBP Tue, 1 Mar 11 Wed, 2 Mar 11 from 39.99 GBP Thu, 3 Mar 11 Fri, 4 Mar 11 from 26.99 GBP Sat, 5 Mar 11 from 39.99 GBP

but it is easier to use the DOM and get each cell's content separately. Therefore we iterate over each cell of the first row (0) and, if it does not contain a DIV tag with class="planeNoFlights", we extract and print the inner text:

```
for (@{table 'cells', $t, 0 }) {
    eval {
        $_->getOneHtmlElementByAttribute('div', 'class', 'planeNoFlights');
        1;
    } or do {
        # Current cell does not contain a "No Flights" logo.
        my $f = $_->asText;
        $f =~ s/\n/ /g;
        print "$f\n";
    };
}
```

eval { ... } or do { ... } is the Perl equivalent of try { ... } catch( ... ) { ... } in C++.
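A minimal, Spidey-free sketch of that idiom follows. risky and attempt are hypothetical subs made up for the illustration. The trailing 1 makes the eval block return a true value on success, so the do block runs only when an exception was thrown.

```perl
use strict;
use warnings;

# Hypothetical sub that throws (dies) for odd numbers.
sub risky {
    my ($n) = @_;
    die "odd number: $n\n" if $n % 2;
    return $n;
}

# Returns 'ok' if risky() succeeded and 'caught' if it threw.
sub attempt {
    my ($n) = @_;
    my $outcome = 'ok';
    eval {
        risky($n);
        1;    # reached only if nothing died
    } or do {
        # The exception message is available in $@ here.
        $outcome = 'caught';
    };
    return $outcome;
}

printf "%d: %s\n", $_, attempt($_) for 1 .. 4;
```

In the table loop the roles are inverted on purpose: the exception thrown when no "No Flights" DIV is found is the case we want to act on.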

For more structured tables, when there is a table header and you only want to extract some fields, Spidey provides a facility: see the table 'read' command. The web site of HtmlUnit, the library Spidey is built on, also has a short howto about tables.

You can generally fall back on HtmlUnit methods for features that Spidey does not already wrap in a sweeter syntax. E.g. earlier we used getOneHtmlElementByAttribute, an HtmlUnit Java method that the class HtmlTableCell inherits from HtmlElement, but you do not need to care about class hierarchies or data types here :)

Scraping dates of return flights is done the very same way except that the table ID is ttable2.

Sample output

For the 27th of Feb 2011:

    DEPARTURES
    Mon, 28 Feb 11 from 199.99 GBP
    Wed, 2 Mar 11 from 39.99 GBP
    Fri, 4 Mar 11 from 26.99 GBP
    Sat, 5 Mar 11 from 39.99 GBP
    RETURNS
    Fri, 4 Mar 11 from 79.99 GBP
    Sat, 5 Mar 11 from 124.99 GBP
    Mon, 7 Mar 11 from 26.99 GBP
    Wed, 9 Mar 11 from 26.99 GBP
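Since saving money is the point of the script, a natural next step is to pick the cheapest date out of the lines it prints. The sketch below is not part of the original script: cheapest is a hypothetical helper, and it assumes one flight per line in the "date from price currency" layout of the sample output.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (date, price) of the cheapest flight among
# lines such as "Fri, 4 Mar 11 from 26.99 GBP". Lines that do not match
# that layout are ignored; on a tie the earliest line wins.
sub cheapest {
    my @lines = @_;
    my ($best_date, $best_price);
    for my $line (@lines) {
        next unless $line =~ /^(\w{3}, \d{1,2} \w{3} \d{2}) from (\d+\.\d{2}) [A-Z]{3}$/;
        ($best_date, $best_price) = ($1, $2)
            if !defined($best_price) || $2 < $best_price;
    }
    return ($best_date, $best_price);
}

# Departure lines copied from the sample output.
my @departures = (
    'Mon, 28 Feb 11 from 199.99 GBP',
    'Wed, 2 Mar 11 from 39.99 GBP',
    'Fri, 4 Mar 11 from 26.99 GBP',
    'Sat, 5 Mar 11 from 39.99 GBP',
);

my ($date, $price) = cheapest(@departures);
print "Cheapest departure: $date at $price GBP\n";
```

With the sample data this reports Fri, 4 Mar 11 at 26.99 GBP. The helper compares bare numbers, so it only makes sense when all prices are in the same currency.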

Full source code for the example

It is found in the SVN repository.

More examples

See eg/Slashdot.pl and t/01-google.t, both found in the Spidey tarball. The latter is a test script that is executed during Spidey installation.