jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora.

Installation

    wget http://justext.googlecode.com/files/justext-1.2.tar.gz
    tar xzvf justext-1.2.tar.gz
    cd justext-1.2/
    python setup.py install

Quick start

    wget -O page.html http://planet.python.org/
    justext -s English page.html > cleaned-page.txt

For usage information see:

    justext --help

Python API

    import urllib2
    import justext

    # Download the page and classify its paragraphs using the English stoplist.
    page = urllib2.urlopen('http://planet.python.org/').read()
    paragraphs = justext.justext(page, justext.get_stoplist('English'))

    # Print only the paragraphs that were not classified as boilerplate.
    for paragraph in paragraphs:
        if paragraph['class'] == 'good':
            print paragraph['text']

Online demo

http://nlp.fi.muni.cz/projects/justext/

Acknowledgements

This software is developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to the author's PhD research.
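
As a follow-up to the Python API example, the sketch below shows one way the kept paragraphs might be gathered into a single string, for instance before writing a page to a corpus file. It assumes only what the example shows: each paragraph is a dict with 'class' and 'text' keys. The helper name `collect_good_text` and the sample data are hypothetical and are not part of jusText.

```python
# Minimal sketch, not part of jusText: `collect_good_text` is a hypothetical
# helper that assumes paragraphs are dicts with 'class' and 'text' keys,
# as in the Python API example.

def collect_good_text(paragraphs, separator="\n\n"):
    """Join the text of all paragraphs whose class is 'good'."""
    return separator.join(
        p["text"] for p in paragraphs if p["class"] == "good"
    )

# Hand-written stand-in data for illustration only (not real jusText output).
# Any paragraph whose class is not 'good' is skipped.
sample_paragraphs = [
    {"class": "bad", "text": "Home | About | Contact"},
    {"class": "good", "text": "This is a full sentence from the article body."},
    {"class": "bad", "text": "Copyright 2011"},
    {"class": "good", "text": "Another complete sentence follows it."},
]

cleaned = collect_good_text(sample_paragraphs)
print(cleaned)
```

With real input, the list returned by justext.justext(...) would take the place of `sample_paragraphs`.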