GitHub - gchrupala/morfette: Supervised learning of morphology

INTRODUCTION
============

Morfette website: https://code.google.com/p/morfette/

Morfette is a tool for supervised learning of inflectional
morphology. Given a corpus of sentences annotated with lemmas 
and morphological labels, and optionally a lexicon, morfette 
learns how to morphologically analyze new sentences. 

In the learning stage Morfette fits two separate logistic regression
models: one for morphological tagging and one for lemmatization. The
predictions of the models are combined dynamically and produce a 
globally plausible sequence of morphological-tag - lemma pairs for 
a sentence.
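
One common way to combine per-token predictions into a globally plausible
sequence is beam search, which the --beam option of the predict command
suggests is used here. The following Python sketch only illustrates the
general technique and is not Morfette's implementation. The function
beam_search, the candidates callback and its scores are hypothetical names
invented for this example.

```python
import heapq

def beam_search(tokens, candidates, beam_size=3):
    """Keep the beam_size best (tag, lemma) sequences after each token.

    candidates(token, history) is a hypothetical scoring callback that
    returns (tag, lemma, log_score) triples for the next token, given the
    (tag, lemma) pairs chosen so far.
    """
    beam = [(0.0, [])]  # (cumulative log score, list of (tag, lemma) pairs)
    for token in tokens:
        expanded = []
        for score, history in beam:
            for tag, lemma, log_score in candidates(token, history):
                expanded.append((score + log_score, history + [(tag, lemma)]))
        # Prune to the best hypotheses before moving to the next token.
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return beam

def toy_candidates(token, history):
    # Toy scores, made up for illustration only.
    previous_tag = history[-1][0] if history else None
    if token == "la":
        return [("da", "el", -0.2), ("pp", "lo", -1.5)]
    if previous_tag == "da":
        return [("nc", "propuesta", -0.1), ("vm", "proponer", -2.0)]
    return [("vm", "proponer", -0.5), ("nc", "propuesta", -1.0)]

best_score, best_sequence = beam_search(["la", "propuesta"], toy_candidates)[0]
print(best_sequence)
```

Because each hypothesis carries its history, a scorer can make the tag or
lemma of one token depend on the choices made for earlier tokens.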

In Morfette, lemmatization is cast as a classification task in which a
lemmatization class corresponds to the specification of the edit
operations needed to transform the inflected word form into the
corresponding lemma.
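
The idea of using an edit specification as a class label can be sketched as
follows. This is a simplified, edit-tree-style illustration in Python, not
Morfette's Haskell implementation. The function names and the tuple
encoding of the tree are assumptions made for this example.

```python
from difflib import SequenceMatcher

def build_edit_tree(form, lemma):
    """Derive an edit tree that rewrites form into lemma."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    match = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if match.size == 0:
        # Nothing in common: replace the whole substring.
        return ("replace", form, lemma)
    end = match.a + match.size
    left = build_edit_tree(form[:match.a], lemma[:match.b])
    right = build_edit_tree(form[end:], lemma[match.b + match.size:])
    # Store only the lengths of the prefix and suffix around the shared
    # part, so the tree generalizes to other words.
    return ("split", match.a, len(form) - end, left, right)

def apply_edit_tree(tree, word):
    """Apply an edit tree to word; return None if it does not fit."""
    if tree[0] == "replace":
        _, source, target = tree
        return target if word == source else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(word):
        return None
    split = len(word) - suffix_len
    head = apply_edit_tree(left, word[:prefix_len])
    tail = apply_edit_tree(right, word[split:])
    if head is None or tail is None:
        return None
    return head + word[prefix_len:split] + tail

tree = build_edit_tree("sostiene", "sostener")
print(apply_edit_tree(tree, "mantiene"))
```

Because the tree records only lengths and the substrings that change, the
class induced from sostiene/sostener also maps mantiene to mantener, while
it does not apply to a word such as propuesta.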

The basic approach is described in Chrupała et al. (2008) [1] and
Chrupała (2008) [2]. The current version of Morfette uses an averaged
perceptron to fit the models, rather than Maximum Entropy training. The
lemmatization classes are based on edit trees, as described in [2].
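
For readers unfamiliar with the averaged perceptron, the sketch below shows
the general training scheme in Python. It is a minimal illustration of the
technique and not Morfette's code. The feature strings, labels and function
names are hypothetical.

```python
from collections import defaultdict

def predict(weights, features, labels):
    """Return the label with the highest score (first label wins ties)."""
    def score(label):
        return sum(weights.get((f, label), 0.0) for f in features)
    return max(labels, key=score)

def train_averaged_perceptron(examples, labels, epochs=5):
    """examples is a list of (features, gold_label) pairs."""
    labels = sorted(labels)
    weights = defaultdict(float)  # current weights
    totals = defaultdict(float)   # running sum of weights over all steps
    steps = 0
    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(weights, features, labels)
            if guess != gold:
                # Standard perceptron update on a mistake.
                for f in features:
                    weights[(f, gold)] += 1.0
                    weights[(f, guess)] -= 1.0
            steps += 1
            # Naive averaging: accumulate the weights after every example.
            for key, value in weights.items():
                totals[key] += value
    return {key: value / steps for key, value in totals.items()}

toy_examples = [(["suffix=ar"], "verb"), (["suffix=a"], "noun")]
model = train_averaged_perceptron(toy_examples, {"noun", "verb"})
print(predict(model, ["suffix=ar"], ["noun", "verb"]))
```

Averaging the weights over all training steps usually gives a more stable
model than the final weight vector alone. Accumulating the totals after
every example is simple but slow; practical implementations typically
update the averages lazily.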

LICENSE
=======

The source code in the src directory is licensed under
the BSD license.

INSTALLATION
============

The easiest way to install Morfette is to first install the Haskell
Platform <http://www.haskell.org/platform/> then execute the following
commands from within the morfette directory:

> cabal update
> cabal install --bindir=$HOME/bin

This will compile Morfette and install the executable in $HOME/bin. 

Cabal can also download Morfette from Hackage, the Haskell package
repository:

> cabal install morfette --bindir=$HOME/bin --datadir=$HOME/share

This will download Morfette, compile it, install the executable in
$HOME/bin, and install the data files in $HOME/share.

There are also pre-built binaries available from the project website. 

USAGE
=====

Usage: morfette command [OPTION...] [ARG...]
train:    train models
train [OPTION...] TRAIN-FILE MODEL-DIR 
    --dict-file=PATH                      path to optional dictionary
    --language-configuration=es|pl|tr|..  language configuration
    --iter-pos=NUM                        iterations for POS model
    --iter-lemma=NUM                      iterations for Lemma model

extract-features:
          extract features
extract-features [OPTION...] MODEL-DIR 
    --dict-file=PATH      path to optional dictionary
    --model-id=pos|lemma  model id (`pos' or `lemma')

predict:  predict postags and lemmas using saved model data
predict [OPTION...] MODEL-DIR 
    --beam=+INT   beam size to use
    --tokenize    tokenize input
    --multi=+INT  n-best output

eval:     evaluate morpho-tagging and lemmatization results
eval [OPTION...] TRAIN-FILE GOLD-FILE TEST-FILE 
    --ignore-case            ignore case for evaluation
    --baseline-file=PATH     path to baseline results
    --dict-file=PATH         path to optional dictionary
    --ignore-punctuation     ignore punctuation for evaluation
    --ignore-pos=POS-prefix  ignore POS starting with POS-prefix for evaluation

version:  show version
version [OPTION...] 

EXAMPLE USAGE
=============

To train a new model:

> morfette train --dict-file=DICT TRAINING-FILE MODEL-DIR 

To use the model in MODEL-DIR to analyze new data:

> morfette predict MODEL-DIR < TEST-DATA > ANALYZED-TEST-DATA
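
The predict command can also be driven from a script. The Python sketch
below pipes text through morfette predict using only the arguments listed
in the usage message above. It assumes that the morfette executable is on
the PATH and that the beam size is passed as --beam=N. The helper names are
hypothetical.

```python
import subprocess

def build_predict_command(model_dir, beam=None, executable="morfette"):
    """Assemble the argument list for `morfette predict`."""
    command = [executable, "predict"]
    if beam is not None:
        command.append("--beam={}".format(beam))
    command.append(model_dir)
    return command

def run_predict(model_dir, text, beam=None, executable="morfette"):
    """Send UTF-8 text to morfette on stdin and return its stdout."""
    result = subprocess.run(
        build_predict_command(model_dir, beam, executable),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")

print(build_predict_command("MODEL-DIR", beam=3))
```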

PRETRAINED MODELS
=================

Pretrained models for Spanish and French are available in the data
directory: data/es/model and data/fr/model. For example you can use
the Spanish model like this:

> morfette predict data/es/model < TEST-DATA > ANALYZED-TEST-DATA

DATA FORMAT
===========

Morfette expects both training and testing data to be tokenized and
split into sentences. The format of training data looks like this:

Gómez Gómez np0000p
sostiene sostener vmip3s0
que que cs
la el da0fs0
propuesta propuesta ncfs000
no no rn
cambiará cambiar vmif3s0
. . Fp

La el da0fs0
propuesta propuesta ncfs000
será ser vsif3s0
la el da0fs0
misma mismo pi0fs000


There is one token per line, with three columns separated by spaces or
tabs. The columns contain word form, lemma and morphological tag
respectively. Sentences are separated by an empty line. Text should be
encoded in UTF-8.
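
A small reader for this format can help to validate a corpus before
training. The Python sketch below follows the description above. The
function name read_sentences is hypothetical and the code is not part of
Morfette.

```python
def read_sentences(lines):
    """Yield sentences as lists of (form, lemma, tag) tuples."""
    sentence = []
    for number, line in enumerate(lines, 1):
        fields = line.split()  # columns are separated by spaces or tabs
        if not fields:
            # An empty line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(fields) != 3:
            raise ValueError(
                "line {}: expected 3 columns, got {}".format(number, len(fields)))
        sentence.append(tuple(fields))
    if sentence:
        yield sentence

sample = [
    "La el da0fs0",
    "propuesta propuesta ncfs000",
    "",
    "será ser vsif3s0",
]
for sentence in read_sentences(sample):
    print(sentence)
```

When reading from disk, open the file with encoding="utf-8" so that
accented characters are decoded correctly.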

Test data format is similar, except only the first column is needed:

Gómez
sostiene
que
la
propuesta
no
cambiará
.

La
propuesta
será
la
misma
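
Annotated data can be reduced to the test format by keeping only the first
column, for example to produce input for predict from a gold-standard file.
The following Python helper is a sketch written for this README. The name
to_test_format is hypothetical.

```python
def to_test_format(lines):
    """Keep the word form of each token line and preserve empty lines."""
    for line in lines:
        fields = line.split()
        # An empty line marks a sentence boundary and is passed through.
        yield fields[0] if fields else ""

annotated = [
    "La el da0fs0",
    "propuesta propuesta ncfs000",
    "",
    "será ser vsif3s0",
]
print("\n".join(to_test_format(annotated)))
```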

Optionally, Morfette can also use vector representations of words for
training and prediction, in addition to character strings, as
described in [3]. In this case, the second column is used for the
vectors. Within this field, vector components are separated by commas:

Gómez 0.1,0.0,0.9 Gómez np0000p
sostiene 0.0,0.0,0.2 sostener vmip3s0

For prediction the format is similar, but only the first two columns
are used:

Gómez 0.1,0.0,0.9
sostiene 0.0,0.0,0.2

Each vector should have the same number of components.
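
The requirement that all vectors have the same number of components can be
checked before running Morfette. The Python sketch below is an illustration
only. The function names are hypothetical, and the code assumes that the
vector is always the second whitespace-separated column.

```python
def parse_vector(field):
    """Convert a comma-separated field such as '0.1,0.0,0.9' to floats."""
    return [float(component) for component in field.split(",")]

def check_vector_dimensions(lines):
    """Return the common vector length, or raise ValueError on a mismatch."""
    dimension = None
    for number, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # sentence boundary
        if len(fields) < 2:
            raise ValueError("line {}: missing vector column".format(number))
        vector = parse_vector(fields[1])
        if dimension is None:
            dimension = len(vector)
        elif len(vector) != dimension:
            raise ValueError(
                "line {}: expected {} components, got {}".format(
                    number, dimension, len(vector)))
    return dimension

print(check_vector_dimensions([
    "Gómez 0.1,0.0,0.9 Gómez np0000p",
    "sostiene 0.0,0.0,0.2 sostener vmip3s0",
]))
```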


Additionally, Morfette can use a dictionary with mappings between
word forms, lemmas and morphological tags. The tags do NOT need to be
of the same kind as the tags used in the training data. The dictionary
file contains one record per line, with the following fields:

FORM LEMMA1 POS1 LEMMA2 POS2 ...

Example from the Spanish dictionary:

sientes sentar VMSP2S0 sentir VMIP2S0
siento sentar VMIP1S0 sentir VMIP1S0
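
A dictionary record can be split into its form and its list of lemma/tag
pairs as in the Python sketch below. It follows the record layout shown
above. The function name parse_dictionary_line is hypothetical and the code
is not part of Morfette.

```python
def parse_dictionary_line(line):
    """Return (form, [(lemma, tag), ...]) for one dictionary record."""
    fields = line.split()
    if len(fields) < 3 or len(fields) % 2 == 0:
        # A record needs a form followed by complete lemma/tag pairs.
        raise ValueError("malformed dictionary record: {!r}".format(line))
    form, rest = fields[0], fields[1:]
    analyses = list(zip(rest[0::2], rest[1::2]))
    return form, analyses

print(parse_dictionary_line("sientes sentar VMSP2S0 sentir VMIP2S0"))
```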


References
==========

[1] Grzegorz Chrupała, Georgiana Dinu and Josef van Genabith. 2008.
    Learning Morphology with Morfette. In Proceedings of LREC 2008.
    http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf

[2] Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture
    for Lexical Functional Grammar Parsing. Chapter 6. PhD
    dissertation, Dublin City University.
    http://grzegorz.chrupala.me/papers/phd.pdf

[3] Grzegorz Chrupała. 2011. Efficient induction of probabilistic word
    classes with LDA. In Proceedings of IJCNLP 2011.
    http://grzegorz.chrupala.me/papers/ijcnlp-2011.pdf