Introduction
Wikipedia is a great source of text for my key-train program.
- It has interesting articles.
- It's almost universally well written.
- It's highly structured.
- It comes in several popular languages.
fetch_article.py --help

fetch_article.py will fetch the article given by *link*, e.g. "Albert_Einstein".

Options:
  -h, --help  show this help message and exit
  -o OUTFILE  name of the file to save the article
  -l LANG     Wikipedia language (e.g. en, pt); defaults to "en"
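A minimal sketch of how that command-line interface could be wired up with Python's argparse (the function and variable names here are my own, not taken from the script):

```python
# Hypothetical sketch of fetch_article.py's option parsing using argparse.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="fetch_article.py",
        description='Fetch the article given by LINK, e.g. "Albert_Einstein".',
    )
    parser.add_argument("link", help="article title as it appears in the URL")
    parser.add_argument("-o", dest="outfile", metavar="OUTFILE",
                        help="name of the file to save the article")
    parser.add_argument("-l", dest="lang", metavar="LANG", default="en",
                        help='Wikipedia language (e.g. en, pt); defaults to "en"')
    return parser

# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["Albert_Einstein", "-l", "pt"])
```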
The article is fetched and stored in a data structure that holds the title, URL, sections, and the paragraphs within each section. I store the data in YAML format and gzip it as well. See Computing for an example.
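The storage scheme described above could be sketched like this. The field names and the sample content are assumptions of mine; the code requires the third-party PyYAML package:

```python
# Sketch of the on-disk format: a structure holding title, url, and sections
# (each with its paragraphs), serialized as YAML inside a gzip archive.
# Requires PyYAML; field names here are illustrative assumptions.
import gzip
import yaml

article = {
    "title": "Computing",
    "url": "https://en.wikipedia.org/wiki/Computing",
    "sections": [
        {"heading": "Introduction",
         "paragraphs": ["Computing is any goal-oriented activity ..."]},
    ],
}

def save_article(article, path):
    # Opening the gzip file in text mode lets yaml write straight into it.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        yaml.safe_dump(article, f, allow_unicode=True)

def load_article(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return yaml.safe_load(f)

save_article(article, "Computing.yaml.gz")
```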
Details
- Used lxml to parse the HTML.
- Retrieved the page with "print=true".
- Skipped all tables.
- Skipped divs with class="thumb..." (these are images).
- Skipped all <sup> sections (which are references).
- Used <p> tags to separate paragraphs.
- Used <h1>, <h2>, etc. to separate sections.
- Skipped all <script> tags.
- Normalized special quotation marks to " or '.
- Normalized special hyphens to a plain hyphen.
- Accepted Unicode characters (UTF-8).
- Converted HTML entities (e.g. ®) to their character equivalents.
- Skipped certain sections that normally contain only links or are otherwise uninteresting, e.g. See Also, Sister Cities, etc. This list differs for each language.
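The skip and normalization rules above could be sketched roughly as follows. The script used lxml; to keep this sketch self-contained I use Python's stdlib html.parser as a stand-in, and I only cover the skip rules, paragraph extraction, and quote/hyphen normalization (section headings and the per-language skip lists are omitted):

```python
# Stand-in sketch of the filtering rules: skip tables, <sup> references,
# <script> tags, and divs whose class starts with "thumb"; collect text
# from <p> tags; normalize curly quotes and special hyphens.
from html.parser import HTMLParser

SKIP_TAGS = {"table", "sup", "script"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # no closing tag
# Map curly quotes and en/em dashes to plain ASCII equivalents.
NORMALIZE = str.maketrans({"\u201c": '"', "\u201d": '"',
                           "\u2018": "'", "\u2019": "'",
                           "\u2013": "-", "\u2014": "-"})

class ArticleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, is_skipped) for each open element
        self.skip_depth = 0    # > 0 while inside a skipped subtree
        self.in_p = False
        self.current = []      # text chunks of the open <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        a = dict(attrs)
        skip = tag in SKIP_TAGS or (
            tag == "div" and a.get("class", "").startswith("thumb"))
        self.stack.append((tag, skip))
        if skip:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if not self.stack:
            return
        t, skip = self.stack.pop()
        if skip:
            self.skip_depth -= 1
        elif t == "p" and self.in_p:
            self.paragraphs.append("".join(self.current).strip())
            self.current = []
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that is inside a <p> and not inside a skipped subtree
        # (so a <sup> reference inside a paragraph is dropped cleanly).
        if self.in_p and self.skip_depth == 0:
            self.current.append(data.translate(NORMALIZE))
```

Note the per-element stack: a bare counter would miscount when a non-skipped tag such as an inner div closes inside a skipped table.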