/*
Copyright 2008 Flaptor (flaptor.com)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package com.flaptor.util;

import de.spieleck.app.cngram.NGramProfiles;

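/**
 * Language identifier backed by CNgramJ.
 *
 * @author dbuthay
 */
public class NgramJLanguageIdentifier {

    // Language profiles, loaded once in the constructor and reused by every call.
    private final NGramProfiles nps;

    /**
     * Loads the n-gram profiles through the no-argument NGramProfiles constructor.
     *
     * @throws java.io.IOException if the profiles cannot be read
     */
    public NgramJLanguageIdentifier() throws java.io.IOException {
        this.nps = new NGramProfiles();
    }

    /**
     * Identifies the language of the given text.
     *
     * @param text the text to classify
     * @return the name of the first profile in the rank result
     */
    public String identify(String text) {
        // A fresh ranker per call, so counts from earlier texts do not carry over.
        NGramProfiles.Ranker ranker = nps.getRanker();
        ranker.account(text);
        NGramProfiles.RankResult res = ranker.getRankResult();
        // Position 0 is taken here to be the best-ranked profile.
        return res.getName(0);
    }

}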
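
The class above delegates all of the work to CNgramJ. As a rough illustration of the general idea behind n-gram language identification, here is a minimal, self-contained sketch. It is not CNgramJ's algorithm and does not use its API: the class name TrigramSketch, its methods, and the scoring rule (counting shared character trigrams) are all assumptions made for this example, and real identifiers use trained profiles rather than one sample sentence per language.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a toy character-trigram matcher, not CNgramJ's algorithm.
class TrigramSketch {

    // Count every overlapping 3-character window in the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Score: how many trigram occurrences of the text also appear in the sample.
    static int overlap(Map<String, Integer> text, Map<String, Integer> sample) {
        int score = 0;
        for (Map.Entry<String, Integer> e : text.entrySet()) {
            if (sample.containsKey(e.getKey())) {
                score += e.getValue();
            }
        }
        return score;
    }

    // Return the key of the best-scoring sample, or null when nothing overlaps.
    // samplesByLanguage maps a hypothetical language code to a sample text.
    static String identify(String text, Map<String, String> samplesByLanguage) {
        Map<String, Integer> textProfile = profile(text);
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, String> e : samplesByLanguage.entrySet()) {
            int score = overlap(textProfile, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Calling TrigramSketch.identify with a map of sample texts returns the key of the sample that shares the most trigrams with the input. Unlike the sketch, which returns null when nothing overlaps, the wrapper above returns whatever name CNgramJ ranks first.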

Change log

r43 by dbuthay on Feb 25, 2008
Added cngram-trunk.jar, a language identifier better than nutch's.
Modified LangUtils to use cngram.
Added DocumentParserTest, to check that return values are null when a document is not parseable.
Added NgramJLanguageIdentifier, that uses cngram to perform identification.

File info

Size: 1142 bytes, 42 lines