Overview

SiteScraper extracts the data you want from webpages. No programming or HTML knowledge is required. For an in-depth analysis of how it works, browse the unpublished technical report.

Example

Here is a simple example to show how SiteScraper works:

>>> from sitescraper import sitescraper
>>> ss = sitescraper()
>>> url = 'http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=python&x=0&y=0'
>>> data = ["Amazon.com: python", ["Learning Python, 3rd Edition",
... "Programming in Python 3: A Complete Introduction to the Python Language (Developer's Library)",
... "Python in a Nutshell, Second Edition (In a Nutshell (O'Reilly))"]]
>>> ss.add(url, data)
>>> # we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)
>>> # ss.add(url2, data2)
>>> ss.scrape('http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=linux&x=0&y=0')
["Amazon.com: linux", ["A Practical Guide to Linux(R) Commands, Editors, and Shell Programming",
"Linux Pocket Guide",
"Linux in a Nutshell (In a Nutshell (O'Reilly))",
'Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)',
'Linux Bible, 2008 Edition: Boot up to Ubuntu, Fedora, KNOPPIX, Debian, openSUSE, and 11 Other Distributions']]

Explanation

As you may have figured out, this example extracts the book titles from an Amazon search. All SiteScraper requires is the titles from one example search, and it can then build a model to extract the titles from future Amazon searches.
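To make the idea of a model more concrete: SiteScraper stores it as XPaths (see Model format), such as /html[1]/body[1]/div[position()>1]/ul[@class='big']/li. The sketch below shows roughly what that path selects on a small page. It is not SiteScraper's own code. The page, the PAGE constant, and the match_big_list_items helper are all invented for illustration. It uses the standard library's ElementTree rather than lxml, and because ElementTree's XPath subset has no position() function, the position()>1 predicate is emulated with a list slice.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, invented for illustration only.
PAGE = """
<html>
  <body>
    <div><ul class="big"><li>skipped: first div</li></ul></div>
    <div><ul class="big"><li>alpha</li><li>beta</li></ul></div>
    <div><ul class="small"><li>skipped: wrong class</li></ul></div>
    <div><ul class="big"><li>gamma</li></ul></div>
  </body>
</html>
"""


def match_big_list_items(page):
    """Emulate /html[1]/body[1]/div[position()>1]/ul[@class='big']/li."""
    root = ET.fromstring(page)   # the <html> element
    body = root.find("body")     # first <body> child of <html>
    matches = []
    # XPath positions are 1-based, so position()>1 skips only the first div.
    for div in body.findall("div")[1:]:
        # Within each remaining div, take <li> children of <ul class="big">.
        matches.extend(li.text for li in div.findall("ul[@class='big']/li"))
    return matches


print(match_big_list_items(PAGE))  # ['alpha', 'beta', 'gamma']
```

Because the path describes where the data lives, not what it says, the same path can be applied to another page with the same layout, which is what lets one example search generalise to later searches.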
Model format

Internally SiteScraper models the data in a webpage using XPaths. Here is an example XPath:

/html[1]/body[1]/div[position()>1]/ul[@class='big']/li

This XPath matches every list item (li) directly inside an unordered list of class 'big' that sits directly inside the second or any later div of the document body. XPath positions start from 1, so position()>1 skips only the first div.

Install

SiteScraper is written in pure Python but depends on version 2 of lxml (for the HTML module). Many Linux repositories currently provide only the old version 1, in which case you may need to build lxml from source. For example, Ubuntu up to version 8.04 used version 1, but 8.10 onwards uses version 2. This dependency is a pain, but lxml is a very useful library.

A zip file is available for download, but it is usually out of date. It is better to check out the SVN repository.

Future goals

Some goals for the future:
Regression tests

Included with SiteScraper is a set of regression test cases (in testdata/) that successfully extract data from stock sites, news sites, weather sites, web forums, and search engines. Each regression test has:
License

SiteScraper is licensed under the LGPL, which means you are free to use it in any project (including commercial) but must publish modifications you make to SiteScraper.

Contact

You can ask questions or tell me how you have used SiteScraper by emailing me at richard@sitescraper.net. I am particularly interested to hear from you if you have a webpage that fails and ideas on how to improve SiteScraper to scrape it correctly.