SiteScraper extracts the data you want from webpages. No programming or HTML knowledge is required.
For an in-depth analysis of how it works, see this unpublished technical report.
Here is a simple example to show how SiteScraper works:
>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition", "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)", "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming", "Linux Pocket Guide", "Linux in a Nutshell (In a Nutshell (O'Reilly))", 'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)', 'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]
As you may have figured out, this example extracts the book titles from an Amazon search. All SiteScraper requires is the titles from one example search and then it can build a model to extract the titles from future Amazon searches.
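To make the idea of learning from one example more concrete, here is a minimal sketch using only the standard library. It is not SiteScraper's actual algorithm: it assumes well-formed markup, exact text matches, and pages that share an identical layout. The helper names `find_path` and `follow_path` and the two tiny pages are hypothetical, made up for this illustration.

```python
import xml.etree.ElementTree as ET


def find_path(node, wanted_text, path=()):
    """Return the (tag, child index) steps leading to the first element
    whose text equals wanted_text, or None if there is no such element."""
    if (node.text or "").strip() == wanted_text:
        return path
    for index, child in enumerate(node):
        found = find_path(child, wanted_text, path + ((child.tag, index),))
        if found is not None:
            return found
    return None


def follow_path(root, path):
    """Replay the recorded steps on another page and return the text found there."""
    node = root
    for tag, index in path:
        children = list(node)
        if index >= len(children) or children[index].tag != tag:
            return None  # the new page does not have the same layout
        node = children[index]
    return (node.text or "").strip()


# Hypothetical pages standing in for two searches on the same site.
TRAINING_PAGE = ("<html><head><title>Search: python</title></head>"
                 "<body><h1>Results</h1><ul><li>Learning Python</li></ul></body></html>")
NEW_PAGE = ("<html><head><title>Search: linux</title></head>"
            "<body><h1>Results</h1><ul><li>Linux Pocket Guide</li></ul></body></html>")

# "Train" on the example values, then apply the recorded paths to the new page.
examples = ["Search: python", "Learning Python"]
model = [find_path(ET.fromstring(TRAINING_PAGE), text) for text in examples]
result = [follow_path(ET.fromstring(NEW_PAGE), path) for path in model]
print(result)  # ['Search: linux', 'Linux Pocket Guide']
```

Real pages are rarely this regular, which is why several training examples help: the more examples there are, the easier it is to tell which parts of a path are stable and which vary between pages.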
Internally, SiteScraper models the data in a webpage using XPaths. Here is an example XPath:
This XPath will match all list elements within an unordered list of class 'big' within the second and following divs (index starts from 0) within the body of the HTML document.
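As an illustration of that description, the sketch below evaluates one XPath that fits it against a small made-up page. The expression is an assumption reconstructed from the wording above, not necessarily the one SiteScraper generates. In standard XPath, `position()` counts from 1, so "second and following" is written as `position()>1` here; SiteScraper's own notation may differ. If lxml is not installed, the same selection is emulated with the standard library, whose XPath subset has no position comparisons.

```python
import xml.etree.ElementTree as ET

# Assumed expression matching the description; not taken from SiteScraper itself.
XPATH = "/html/body/div[position()>1]/ul[@class='big']/li"

# Hypothetical page: only the 'big' lists in the second and third divs should match.
PAGE = """<html><body>
<div><ul class="big"><li>x</li></ul></div>
<div><ul class="big"><li>a</li><li>b</li></ul><ul class="small"><li>y</li></ul></div>
<div><ul class="big"><li>c</li></ul></div>
</body></html>"""


def select_items(page):
    """Return the text of every <li> selected by XPATH."""
    try:
        from lxml import html
    except ImportError:
        # Fallback without lxml: skip the first div by slicing, then use
        # ElementTree's limited XPath support for the rest of the path.
        root = ET.fromstring(page)
        divs = root.findall("./body/div")[1:]
        return [li.text for div in divs
                for li in div.findall("./ul[@class='big']/li")]
    return [li.text for li in html.fromstring(page).xpath(XPATH)]


print(select_items(PAGE))  # ['a', 'b', 'c']
```

The item in the first div and the item in the list of class 'small' are both left out, which is what the description above requires.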
SiteScraper is written in pure Python but depends on version 2 of lxml (for the HTML module). Currently many Linux repositories provide the old version 1, which means you may need to build lxml from source. For example, Ubuntu up to version 8.04 shipped version 1, but 8.10 onwards ships version 2. This dependency is a pain, but lxml is a very useful library.
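A quick way to see which lxml version is installed is to read `etree.LXML_VERSION`, which lxml exposes as a tuple of integers. The helper `lxml_version_ok` below is a hypothetical name used only for this sketch; SiteScraper itself may not perform such a check.

```python
def lxml_version_ok(version, minimum=(2, 0)):
    """Return True if a version tuple such as (2, 1, 5, 0) meets the minimum."""
    return tuple(version[:2]) >= minimum


try:
    from lxml import etree
    installed = etree.LXML_VERSION  # tuple of four integers
except ImportError:
    installed = None

if installed is None:
    print("lxml is not installed")
elif lxml_version_ok(installed):
    print("lxml %s is new enough" % ".".join(map(str, installed)))
else:
    print("lxml %s is too old; version 2 or later is needed" % ".".join(map(str, installed)))
```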
A zip file is available for download, but it is usually out of date. It is better to check out the SVN repository.
Some goals for the future:
Each regression test has:
SiteScraper is licensed under the LGPL, which means you are free to use it in any project (including commercial ones) but must publish any modifications you make to SiteScraper.
You can ask questions or tell me how you have used SiteScraper by emailing me at firstname.lastname@example.org. I am particularly interested to hear from you if you have a webpage that fails and ideas on how to improve SiteScraper to scrape it correctly.