WikipediaFetch  
Fetching the text of Wikipedia.
Updated Mar 22, 2010 by scottaki...@gmail.com

Introduction

Wikipedia is a great source of text for my key-train program.

  • It has interesting articles.
  • It's almost universally well written.
  • It's highly structured.
  • It comes in several popular languages.

fetch_article.py --help

fetch_article.py will fetch the article given by *link*. ex. "Albert_Einstein"

Options:
  -h, --help  show this help message and exit
  -o OUTFILE  Name of the file to save the article.
  -l LANG     Wikipedia Language (ex. en, pt) defaults to "en"

The article is fetched and stored in a data structure which holds the title, url, sections, and the paragraphs in each section. I store the data in YAML format and gzip the files as well. See Computing for example.
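
A rough sketch of such a data structure, and of writing it out compressed. The class and field names here are my guesses for illustration, not the ones in the script. To stay within the standard library, this version serialises as JSON rather than YAML; with PyYAML installed, yaml.safe_dump and yaml.safe_load could be swapped in for the json calls.

```python
import gzip
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Section:
    heading: str
    paragraphs: list = field(default_factory=list)


@dataclass
class Article:
    title: str
    url: str
    sections: list = field(default_factory=list)


def save_article(article, path):
    """Serialise the article and gzip it in one step (JSON stands in for YAML)."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(asdict(article), fh, ensure_ascii=False, indent=2)


def load_article(path):
    """Read a gzipped article back into the dataclasses above."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    sections = [Section(**s) for s in data["sections"]]
    return Article(data["title"], data["url"], sections)
```

Opening the file in text mode with an explicit UTF-8 encoding keeps non-English articles intact through the round trip.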

Details

  • Used lxml to parse the html.
  • Retrieved the page as "print=true".
  • Skipped all tables.
  • Skipped divs with class="thumb..." in them (these are images).
  • Skipped all <sup> sections (which are references).
  • I pay attention to <p> to separate into paragraphs.
  • I pay attention to <h1>, <h2>, etc. which separate sections.
  • Skipped all <script> tags.
  • Normalized special quotation marks to " or '.
  • Normalized special hyphens to just a plain hyphen.
  • Accept unicode characters (UTF-8).
  • From the original text I convert &nbsp; and &reg; to their equivalents.
  • Skipped certain sections that are normally just links or otherwise uninteresting, ex. See Also, Sister Cities, etc. This list is different for each language.
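
The rules above can be sketched roughly as follows. This is not the script itself: it uses the standard library's html.parser instead of lxml so that it runs without extra packages, and the names ArticleParser, parse_article, normalize, and the SKIP_SECTIONS entries are made up for illustration.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"table", "sup", "script"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_SECTIONS = {"See also", "External links"}  # hypothetical, per language

# Typographic characters mapped to plain equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2010": "-", "\u2013": "-", "\u2014": "-", "\u00a0": " ",
}


def normalize(text):
    """Replace special quotes, hyphens and no-break spaces; collapse whitespace."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return re.sub(r"\s+", " ", text).strip()


class ArticleParser(HTMLParser):
    def __init__(self):
        super().__init__()  # the default convert_charrefs=True decodes &nbsp;, &reg;
        self.sections = [("", [])]  # list of (heading, paragraphs)
        self._skip_tag = None       # tag name of the subtree being skipped
        self._skip_depth = 0
        self._capture = None        # "p" or a heading tag while inside one
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._skip_tag:
            if tag == self._skip_tag:
                self._skip_depth += 1  # nested tag of the same kind
            return
        css = dict(attrs).get("class") or ""
        if tag in SKIP_TAGS or (tag == "div" and "thumb" in css):
            self._skip_tag, self._skip_depth = tag, 1
        elif tag in HEADINGS or tag == "p":
            self._capture, self._buffer = tag, []

    def handle_endtag(self, tag):
        if self._skip_tag:
            if tag == self._skip_tag:
                self._skip_depth -= 1
                if self._skip_depth == 0:
                    self._skip_tag = None
            return
        if tag != self._capture:
            return
        text = normalize("".join(self._buffer))
        if tag == "p":
            if text:
                self.sections[-1][1].append(text)
        else:
            self.sections.append((text, []))  # a heading starts a new section
        self._capture = None

    def handle_data(self, data):
        if not self._skip_tag and self._capture:
            self._buffer.append(data)


def parse_article(html):
    """Return [(heading, [paragraph, ...]), ...] without empty or skipped sections."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return [(h, ps) for h, ps in parser.sections
            if ps and h not in SKIP_SECTIONS]
```

Counting nested tags of the same name is what lets a thumb div that contains further divs be skipped as a whole.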