
chared
Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that correspondent language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
What's new
Installation
- Make sure you have Python 2.6 or later and lxml library version 2.2.4 or later installed.
- Download the sources:
wget http://chared.googlecode.com/files/chared-1.2.tar.gz
- Extract the downloaded file:
tar xzvf chared-1.2.tar.gz
- Install the package (you may need sudo or a root shell for the latter command):
cd chared-1.2/ python setup.py install
Quick start
Detect the character encoding for a file or URL:
chared -m czech http://nlp.fi.muni.cz/cs/nlplab
Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):
chared-learn -o swahili.edm swahili_pages/*.html
... or if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering on the input HTML files (recommended):
chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html
For usage information see:
chared --help
chared-learn --help
Python API
```
import urllib2 import chared.detector page = urllib2.urlopen('http://nlp.fi.muni.cz/cs/nlplab').read() cz_model_path = chared.detector.get_model_path('czech') cz_model = chared.detector.EncodingDetector.load(cz_model_path) cz_model.classify(page) ['utf_8'] ```
Acknowledgements
This software is developed at the Natural Language Processing Centre of Masaryk University in Brno with a financial support from PRESEMT and Lexical Computing Ltd, a corpus tool company.
See also
Unicode over 60 percent of the web at Google blog
Project Information
The project was created on May 24, 2011.
- License: New BSD License
- 3 stars
- svn-based source control