Overview
SiteScraper extracts the data you want from webpages. No programming or HTML knowledge is required.
For an in-depth analysis of how it works, have a look at this paper.
Example
Here is a simple example to show how SiteScraper works:
>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition",
"Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)",
"Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming",
"Linux Pocket Guide",
"Linux in a Nutshell (In a Nutshell (O'Reilly))",
'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)',
'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
Explanation
As you may have figured out, this example extracts the book titles from an Amazon search. All SiteScraper requires is the titles from one example search and then it can build a model to extract the titles from future Amazon searches.
(To run this example, first update the titles to match what Amazon currently returns. For more robust results, provide several example cases.)
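To give a feel for how a model could be built from a single example, here is a minimal sketch. It is not SiteScraper's actual algorithm, only an illustration under simple assumptions: the page, the helper names paths_for and generalize, and the "drop the trailing index" rule are all invented here. It finds the elements whose text matches the example strings, asks lxml for their paths, and generalizes those paths so they also match items that were not in the example.

```python
import re
from lxml import html

# Hypothetical search results page, invented for illustration only.
PAGE = (
    "<html><body><h1>Results</h1><ul>"
    "<li>Learning Python</li>"
    "<li>Python in a Nutshell</li>"
    "<li>Programming in Python 3</li>"
    "</ul></body></html>"
)

def paths_for(tree, wanted):
    """Return the path of every element whose own text is one of the wanted strings."""
    root = tree.getroottree()
    paths = []
    for el in tree.iter():
        # Skip comments and processing instructions, whose tag is not a string.
        if isinstance(el.tag, str) and (el.text or "").strip() in wanted:
            paths.append(root.getpath(el))
    return paths

def generalize(paths):
    """Drop the trailing index from each path so sibling elements match too."""
    return sorted({re.sub(r"\[\d+\]$", "", p) for p in paths})

tree = html.document_fromstring(PAGE)
# Only two of the three titles are given as the example.
example_paths = paths_for(tree, {"Learning Python", "Python in a Nutshell"})
model = generalize(example_paths)
titles = [el.text_content() for xp in model for el in tree.xpath(xp)]
print(model)
print(titles)
```

The generalized path picks up the third title even though it was not in the example. A real page would need much more care than this (nested markup, partial text matches, several example pages), which is why supplying more than one example case helps.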
Model format
Internally, SiteScraper models the data in a webpage using XPaths. Here is an example XPath:
/html[1]/body[1]/div[position()>1]/ul[@class='big']/li
This XPath matches every list item (li) within an unordered list of class 'big' within the second and following divs within the body of the HTML document. (XPath positions start from 1, so position()>1 skips only the first div.)
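The XPath above can be tried directly with lxml. This is a small sketch on a made-up page (the HTML and the item names are invented for illustration) and shows which items are selected and which are skipped.

```python
from lxml import html

# Hypothetical page, invented for illustration only.
PAGE = """
<html><body>
  <div><ul class="big"><li>skipped: first div</li></ul></div>
  <div><ul class="big"><li>alpha</li><li>beta</li></ul></div>
  <div><ul class="small"><li>skipped: wrong class</li></ul></div>
  <div><ul class="big"><li>gamma</li></ul></div>
</body></html>
"""

XPATH = "/html[1]/body[1]/div[position()>1]/ul[@class='big']/li"

tree = html.document_fromstring(PAGE)
# Collect the text of every matched list item, in document order.
matches = [li.text_content().strip() for li in tree.xpath(XPATH)]
print(matches)  # ['alpha', 'beta', 'gamma']
```

The first div is excluded by position()>1 and the third by the class test, so only the items in the second and fourth divs are returned.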
Install
SiteScraper is written in pure Python but depends on version 2 of lxml (for its HTML module). Many Linux repositories currently provide only the old version 1, so you may need to build lxml from source. For example, Ubuntu up to version 8.04 shipped version 1, while 8.10 onwards ships version 2. This dependency is a pain, but lxml is a very useful library.
Regression tests
Included with SiteScraper is a set of regression test cases (in testdata/) that successfully extract data from stock sites, news sites, weather sites, web forums, and search engines.
Each regression test has:
- a few example webpages from a website
- the desired data from each webpage
- and a list of the xpaths that will locate the desired data
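The three parts of a test case fit together roughly as in the sketch below. The actual layout of testdata/ is not described here, so the dictionary format, the extract helper, and the sample weather page are all assumptions made for illustration.

```python
from lxml import html

def extract(page_source, xpaths):
    """Return the text found at each XPath, one list of strings per XPath."""
    tree = html.document_fromstring(page_source)
    return [[el.text_content().strip() for el in tree.xpath(xp)] for xp in xpaths]

# Hypothetical test case: one example page, its XPaths, and the desired data.
case = {
    "page": (
        "<html><head><title>Weather: Paris</title></head>"
        "<body><ul class='temps'><li>12C</li><li>15C</li></ul></body></html>"
    ),
    "xpaths": ["/html/head/title", "/html/body/ul[@class='temps']/li"],
    "expected": [["Weather: Paris"], ["12C", "15C"]],
}

result = extract(case["page"], case["xpaths"])
# The test passes when the XPaths locate exactly the desired data.
assert result == case["expected"], result
print("ok")
```

A check of this kind only verifies the XPaths against saved pages. Testing the model building itself would mean generating the XPaths from the desired data and comparing them with the stored list.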
License
SiteScraper is licensed under the LGPL license, which means you are free to use it in any project (including commercial) but must publish modifications you make to SiteScraper.
Contact
You can ask questions or tell me how you have used SiteScraper by emailing me at richard sitescraper net. I am particularly interested to hear from you if you have a webpage that fails and ideas on how to improve SiteScraper so that it scrapes the page correctly.
Thank you!