"""
Sam Huston 2007

This script simulates the experiment described in the article
"Evaluation of a language identification system for mono- and multilingual text documents"
by Artemenko, O.; Mandl, T.; Shramko, M.; Womser-Hacker, C.,
presented at Applied Computing 2006, 21st Annual ACM Symposium on Applied Computing, 23-27 April 2006.

This implementation is intended for monolingual documents only;
however, it is performed over a much larger range of languages.
Additionally, three supervised classification methods are explored:
cosine distance, naive Bayes, and Spearman's rho.

"""

from nltk_contrib import classify
from nltk import detect
from nltk.corpus import udhr
import string

def run(classifier, training_data, gold_data):
    # Train on the training split, then count how many gold samples
    # are assigned their own language label.
    classifier.train(training_data)
    correct = 0
    for lang in gold_data:
        cls = classifier.get_class(gold_data[lang])
        if cls == lang:
            correct += 1
    print correct, "in", len(gold_data), "correct"

# features: character bigrams
fd = detect.feature({"char-bigrams" : lambda t: [string.join(t)[n:n+2] for n in range(len(t)-1)]})

training_data = udhr.langs(['English-Latin1', 'French_Francais-Latin1', 'Indonesian-Latin1', 'Zapoteco-Latin1'])
gold_data = {}
for lang in training_data:
    # Hold out the first 50 items as gold data; train on items 100-199.
    gold_data[lang] = training_data[lang][:50]
    training_data[lang] = training_data[lang][100:200]

print "Cosine classifier: ",
run(classify.Cosine(fd), training_data, gold_data)

print "Naivebayes classifier: ",
run(classify.NaiveBayes(fd), training_data, gold_data)

print "Spearman classifier: ",
run(classify.Spearman(fd), training_data, gold_data)
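
The script above relies on the 2007-era `nltk_contrib.classify` and `nltk.detect` modules, whose internals are not shown here. As a rough illustration of the idea behind the cosine run, the following is a minimal standalone sketch in modern Python: it builds one character-bigram profile per language and labels a text with the most similar profile. The function names (`char_bigrams`, `cosine_similarity`, `train`, `classify`) and the two short sample sentences are assumptions made for this example; they are not the NLTK API and not the UDHR corpus reader.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text):
    """Return a Counter of overlapping character bigrams in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples):
    """Build one bigram profile per language from a {lang: text} mapping."""
    return {lang: char_bigrams(text.lower()) for lang, text in samples.items()}


def classify(profiles, text):
    """Return the language whose profile is most similar to text."""
    query = char_bigrams(text.lower())
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Illustrative training text only (hypothetical stand-in for a real corpus).
samples = {
    "english": "all human beings are born free and equal in dignity and rights",
    "indonesian": "semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama",
}
profiles = train(samples)
print(classify(profiles, "human beings are born free"))
print(classify(profiles, "orang dilahirkan merdeka"))
```

With profiles this small the result is only indicative; a real evaluation would train on much longer text per language and score held-out samples, as `run` does above.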
