Tutorial  
This tutorial walks you through creating a new machine learning component with ClearTK: a part-of-speech tagger trained on the Penn Treebank corpus.
Updated Jan 19, 2012 by phi...@ogren.info

Introduction

This tutorial walks you through creating a new machine learning component with ClearTK: a part-of-speech tagger trained on the Penn Treebank corpus. It assumes you have already installed ClearTK and Eclipse and that your environment is set up as described in DeveloperSetup. At the end of this tutorial, you should understand:

  • How a CleartkSequenceAnnotator object extracts features and creates a sequence of instances suitable for model training or classification.
  • How to create training data using a SequenceDataWriter and train a model using one of ClearTK's supported sequential classifiers.
  • How to use a model for part-of-speech tagging using a SequenceClassifier.

Writing CleartkSequenceAnnotator classes

Extensions of CleartkSequenceAnnotator (and CleartkAnnotator) are at the core of many machine learning components provided by ClearTK and are the recommended construct for creating new components using ClearTK. CleartkSequenceAnnotators understand how to take a document (represented by a JCas) and create machine learning features for a particular task. They also understand how to take the predictions of a classifier and convert these into annotations over the document (the JCas). Thus, CleartkSequenceAnnotator objects serve as the interface between the UIMA annotations and machine learning classifiers.

All of the code discussed here can be found in the org.cleartk.example package. We'll start with a simple CleartkSequenceAnnotator designed to create features for a part-of-speech tagging task. In Eclipse, create a new Java class (File -> New -> Class), set the Name: to ExamplePOSAnnotator. Set the superclass to be org.cleartk.classifier.CleartkSequenceAnnotator<String>. This should generate a class like:

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.cleartk.classifier.CleartkSequenceAnnotator;

public class ExamplePOSAnnotator extends CleartkSequenceAnnotator<String> {

	@Override
	public void process(JCas jCas) throws AnalysisEngineProcessException {

	}
}

Note that the generic type OUTCOME_TYPE defined in CleartkSequenceAnnotator is parameterized with String here. This is because the outcomes of the classifier will be strings corresponding to part-of-speech tags. The process method defines how features and labels are extracted from the annotations of a JCas, and how classifier predictions are used to create new JCas annotations. We will also override the initialize method, which is typically used to initialize feature extractors, reading parameters as necessary from the UimaContext.

Processing a JCas with a CleartkSequenceAnnotator

Let's start by working on the process method of our CleartkSequenceAnnotator. This method defines how features are generated from the Annotation objects in a JCas. We want to label part-of-speech tags, which in this example are attributes of Token annotations. We are going to do this in a "sequential" fashion by having the classifier tag, all at once, the sequence of tokens corresponding to one sentence. For many other tasks, classification will be performed on one item at a time. In such cases, CleartkAnnotator is the appropriate superclass to use for your component. For each sequence of tokens our CleartkSequenceAnnotator needs to know how to do two things:

  • When training, extract features and part-of-speech tags from the Annotations in the JCas, and pass them to a SequenceDataWriter.
  • When predicting, extract features, pass them to a SequenceClassifier, and use the resulting classifications to add/update Annotations to the JCas.

The process method will perform both of these tasks as they usually share a lot of code. For our part-of-speech tagging task, we can define the process method like this:

  private List<SimpleFeatureExtractor> tokenFeatureExtractors;

  private List<ContextExtractor<Token>> contextFeatureExtractors;

  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // generate a list of training instances for each sentence in the
    // document
    for (Sentence sentence : JCasUtil.select(jCas, Sentence.class)) {
      List<Instance<String>> instances = new ArrayList<Instance<String>>();
      List<Token> tokens = JCasUtil.selectCovered(jCas, Token.class, sentence);

      // for each token, extract all feature values and the label
      for (Token token : tokens) {
        Instance<String> instance = new Instance<String>();

        // extract all features that require only the token
        // annotation
        for (SimpleFeatureExtractor extractor : this.tokenFeatureExtractors) {
          instance.addAll(extractor.extract(jCas, token));
        }

        // extract all features that require the token and sentence annotations
        for (ContextExtractor<Token> extractor : this.contextFeatureExtractors) {
          instance.addAll(extractor.extractWithin(jCas, token, sentence));
        }

        // set the instance label from the token's part of speech
        if (this.isTraining()) {
          instance.setOutcome(token.getPos());
        }

        // add the instance to the list
        instances.add(instance);
      }

      // for training, write instances to the data writer
      if (this.isTraining()) {
        this.dataWriter.write(instances);
      }

      // for classification, set the labels as the token POS labels
      else {
        Iterator<Token> tokensIter = tokens.iterator();
        for (String label : this.classify(instances)) {
          tokensIter.next().setPos(label);
        }
      }
    }
  }

So, for each Sentence in the document we examine a "sequence" of Tokens. For each Token we create a new classification instance that contains some features and, when training, the part-of-speech tag label. (For the moment, we're ignoring exactly what kind of features are generated; these are discussed in the next section.) In training mode, we then pass the instances to a SequenceDataWriter. Otherwise, we classify the sequence of instances (by passing them to a SequenceClassifier via the classify method), and the returned outcomes are interpreted directly as the part-of-speech tags to assign to each token in the sequence.

To make sense of this code, it is helpful to realize that:

  • When training:
    • this.dataWriter will be instantiated and ready to write sequences of instances
    • calling token.getPos() on any token in the sequence will return a non-null part-of-speech tag which will be used to set the expected outcome of the classification instance. That is, in training mode the data provides the expected answers.
  • When predicting:
    • this.classifier will be instantiated and ready to classify sequences of instances. Here we used the method classify to call the SequenceClassifier.
    • calling token.getPos() on any token in the sequence will (usually) return a null part-of-speech tag. We will use the classifier to help us fill in these "missing" part-of-speech tags.
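The two modes can be pictured with a small plain-Java sketch. It does not use ClearTK at all: the Tok class, the processSentence method, the trainingData list, and the toy classifier are hypothetical stand-ins for Token, process, the data writer, and the SequenceClassifier, and the "word/tag" strings merely stand in for instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java stand-in for the two modes of process(); not ClearTK code.
class TrainOrPredictSketch {

  // Hypothetical token: its text plus a part-of-speech tag that may be null.
  static final class Tok {
    final String text;
    String pos;

    Tok(String text, String pos) {
      this.text = text;
      this.pos = pos;
    }
  }

  // Handles one sentence. In training mode the gold tags are read from the
  // tokens and appended to trainingData; otherwise the classifier labels the
  // whole sequence and the labels are copied back onto the tokens in order.
  static void processSentence(
      List<Tok> tokens,
      boolean training,
      List<List<String>> trainingData,
      Function<List<String>, List<String>> classifier) {
    if (training) {
      List<String> labelled = new ArrayList<>();
      for (Tok token : tokens) {
        labelled.add(token.text + "/" + token.pos); // gold tag is available
      }
      trainingData.add(labelled);
    } else {
      List<String> words = new ArrayList<>();
      for (Tok token : tokens) {
        words.add(token.text);
      }
      List<String> labels = classifier.apply(words);
      if (labels.size() != tokens.size()) {
        throw new IllegalStateException("expected one label per token");
      }
      Iterator<Tok> tokensIter = tokens.iterator();
      for (String label : labels) {
        tokensIter.next().pos = label; // fill in the missing tag
      }
    }
  }

  public static void main(String[] args) {
    List<Tok> raw = new ArrayList<>();
    raw.add(new Tok("The", null));
    raw.add(new Tok("dog", null));
    // Toy classifier (an assumption for the demo): tags every word as a noun.
    processSentence(raw, false, new ArrayList<>(),
        words -> Collections.nCopies(words.size(), "NN"));
    for (Tok token : raw) {
      System.out.println(token.text + "/" + token.pos);
    }
  }
}
```

The sketch relies on the same ordering assumption as the real process method: the classifier returns exactly one label per instance, in the order in which the instances were passed in.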

Initializing feature extractors in a CleartkSequenceAnnotator

Now that we understand how our extension of CleartkSequenceAnnotator will be converting the JCas to classification instances, we can introduce some features that will be useful for our task. Feature extractors are typically created in the initialize method, which is invoked before the process method is ever called. Since we're building a part-of-speech tagger, some useful features are:

  • the word's stem
  • the word itself
  • the word lowercased
  • a categorical label describing the capitalization of the word - e.g. "ALL_UPPERCASE", "INITIAL_UPPERCASE", "ALL_LOWERCASE", "MIXED_CASE"
  • a categorical label describing the use of numbers in the word - e.g. "DIGITS", "YEAR_DIGITS", "ALPHANUMERIC", etc.
  • character bigram suffix of the word
  • character trigram suffix of the word
  • the two word stems to the left and right of the word

This is by no means an exhaustive set of features that can be found in part-of-speech taggers - but these are representative of the kinds of features that are extracted in part-of-speech taggers and are what we will use here for this example tagger. Here's how we create these feature extractors in our initialize method:

  public void initialize(UimaContext context) throws ResourceInitializationException {
    super.initialize(context);
    // alias for NGram feature parameters
    int fromRight = CharacterNGramProliferator.RIGHT_TO_LEFT;

    // a list of feature extractors that require only the token:
    // the stem of the word, the text of the word itself, plus
    // features created from the word text like character ngrams
    this.tokenFeatureExtractors = Arrays.asList(
        new TypePathExtractor(Token.class, "stem"),
        new ProliferatingExtractor(
            new SpannedTextExtractor(),
            new LowerCaseProliferator(),
            new CapitalTypeProliferator(),
            new NumericTypeProliferator(),
            new CharacterNGramProliferator(fromRight, 0, 2),
            new CharacterNGramProliferator(fromRight, 0, 3)));

    // a list of feature extractors that require the token and the sentence
    this.contextFeatureExtractors = new ArrayList<ContextExtractor<Token>>();
    this.contextFeatureExtractors.add(new ContextExtractor<Token>(
        Token.class,
        new TypePathExtractor(Token.class, "stem"),
        new Preceding(2),
        new Following(2)));

  }

We start with a TypePathExtractor which will extract the stem of the Token annotation, and a SpannedTextExtractor, which simply takes an annotation and returns the text that it covers. We create a ProliferatingExtractor that wraps this SpannedTextExtractor and introduces some feature proliferators for extracting a number of features of the word including the capitalization, numeric description, and character suffixes of the word.

Note the difference between feature extractors and feature proliferators here. Feature extractors take an Annotation from the JCas and extract features from it. Feature proliferators take the features produced by a feature extractor and generate new features from the old ones. Since feature proliferators don't need to look up information in the JCas, they may be more efficient than feature extractors. So in our initialize method, the CharacterNGramProliferators simply extract suffixes from the text returned by the SpannedTextExtractor. The ProliferatingExtractor, which extracts both the word and the "proliferated" features (e.g. its suffixes), is placed in the list of token feature extractors used in our process method.
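To make the idea of proliferation concrete, here is a rough plain-Java sketch that derives several features from a single word string, without touching a JCas. It is not ClearTK's implementation: the class and method names, the "name=value" feature strings, and the exact capitalization rules are assumptions made for illustration, and the numeric-type feature is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of "proliferating" one word feature into several.
class ProliferationSketch {

  // Returns a capitalization category, or null if the word has no cased letters.
  // The category names follow the tutorial; the rules are an assumption.
  static String capitalType(String word) {
    boolean hasUpper = false;
    boolean hasLower = false;
    for (char c : word.toCharArray()) {
      if (Character.isUpperCase(c)) hasUpper = true;
      if (Character.isLowerCase(c)) hasLower = true;
    }
    if (!hasUpper && !hasLower) return null;
    if (!hasLower) return "ALL_UPPERCASE";
    if (!hasUpper) return "ALL_LOWERCASE";
    String rest = word.substring(1);
    boolean restIsLower = rest.equals(rest.toLowerCase(Locale.ROOT));
    if (Character.isUpperCase(word.charAt(0)) && restIsLower) {
      return "INITIAL_UPPERCASE";
    }
    return "MIXED_CASE";
  }

  // Derives new features from the word text alone (hypothetical feature names).
  static List<String> proliferate(String word) {
    List<String> features = new ArrayList<>();
    features.add("Word=" + word);
    features.add("LowerCase=" + word.toLowerCase(Locale.ROOT));
    String capitalType = capitalType(word);
    if (capitalType != null) {
      features.add("CapitalType=" + capitalType);
    }
    // character bigram and trigram suffixes, skipped for words that are too short
    for (int n = 2; n <= 3; n++) {
      if (word.length() >= n) {
        features.add("Suffix" + n + "=" + word.substring(word.length() - n));
      }
    }
    return features;
  }

  public static void main(String[] args) {
    System.out.println(proliferate("Sichuan"));
  }
}
```

Every derived feature is computed from the string produced by the first step, which is why this kind of work needs no further lookups in the document.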

Finally, we create a ContextExtractor which will create features from the surrounding context of a token. In this case, we are going to retrieve the two word stems before and after a token. Note that we will not create features from previous part-of-speech labels: because we pass the classifier an entire sequence of instances at once (one per token in the sentence), no previous part-of-speech labels are available at the time feature extraction is performed on each token. Previous part-of-speech labels are very useful features and should be handled internally by the SequenceClassifier that is being used (as is done by e.g. MalletCRFClassifier and ViterbiClassifier).

And that's it for our first pass of our ExamplePOSAnnotator - there is no more code to write. At this point it is a matter of understanding the other components and how to configure and run them. Now we are ready to learn about training and using machine learning models.
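As an aside on the context window configured above (two stems before and two stems after a token, bounded by the sentence), the following plain-Java sketch shows one way such a window could be computed over a list of stems. It is only an illustration: the feature names and the "OOB" placeholder for positions outside the sentence are assumptions, not the output format of ContextExtractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a bounded context window over word stems.
class ContextWindowSketch {

  // Collects the stems up to `window` positions before and after `index`.
  // Positions outside the sentence get the hypothetical placeholder "OOB".
  static List<String> contextFeatures(List<String> stems, int index, int window) {
    List<String> features = new ArrayList<>();
    for (int offset = -window; offset <= window; offset++) {
      if (offset == 0) {
        continue; // the focus token itself is handled by other extractors
      }
      int position = index + offset;
      boolean inside = position >= 0 && position < stems.size();
      String value = inside ? stems.get(position) : "OOB";
      String sign = offset > 0 ? "+" : "";
      features.add("Stem[" + sign + offset + "]=" + value);
    }
    return features;
  }

  public static void main(String[] args) {
    List<String> stems = Arrays.asList("the", "quick", "fox", "jump");
    // context of "quick": one position to the left falls outside the sentence
    System.out.println(contextFeatures(stems, 1, 2));
  }
}
```

Keeping a placeholder for out-of-sentence positions gives every token the same number of context features; simply skipping those positions would be an equally reasonable choice for a sketch like this.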

Building a part-of-speech model

This section details how to create a model with one of the supported machine learning libraries. The CleartkSequenceAnnotator we created in the previous steps, ExamplePOSAnnotator, will be employed to create our training data by extracting features from the CAS and writing instances to a file in the format suitable for the machine learning library of choice. A model will then be trained using the selected machine learning library and packaged into a jar file suitable for use by ClearTK. The following provides the sequence of steps to accomplish this. The code that performs these steps can be found in org.cleartk.example.pos.BuildTestExamplePosModel. Please refer to this code while reading the following:

  • org.cleartk.util.FilesCollectionReader - read ".tree" files in PennTreebank format into a separate view "TreebankView".
  • org.cleartk.syntax.treebank.TreebankGoldAnnotator - parses the PennTreebank data found in the "TreebankView" and creates annotations corresponding to e.g. tokens, sentences, part-of-speech tags, syntactic structure, etc. in the default initial view. If you were to view the data in a visualization tool you would see that the contents of the TreebankView look like PTB-style parse trees. The default view will contain plain text but will be annotated with tokens, part-of-speech tags, sentences, etc. as derived from the contents of the TreebankView by the TreebankGoldAnnotator.
  • org.cleartk.token.snowball.SnowballStemmer - performs stemming on each token and posts the result into a feature of each token called "stem".
  • org.cleartk.classifier.ExamplePOSAnnotator - this is the component we created above. In this case we will use the ViterbiDataWriter which will delegate to the MaxentDataWriter which will write training data in a format suitable for the OpenNLP MaxEnt classifier. The classes in the viterbi package (e.g. ViterbiDataWriter) simplify the use of non-sequential classifiers such as maxent for sequential tagging tasks. For example, they handle adding features corresponding to previous classifications in the sequence (i.e. the part-of-speech tags that have already been determined). All of the training data will be written to an output directory. In this case it will go to "example/model".
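The idea of feeding previous classifications back in as features can be sketched in plain Java. The sketch below is a deliberate simplification: it tags greedily from left to right, whereas the viterbi package performs a proper search over candidate sequences, and every name in it (tagGreedily, toyClassifier, the "PreviousOutcome" feature, the "START" marker) is hypothetical rather than part of ClearTK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Greedy left-to-right tagging with a non-sequential classifier; a simplified
// stand-in for what a Viterbi-style wrapper does, not ClearTK code.
class PreviousOutcomeSketch {

  // Classifies each item in turn, adding the previous outcome as a feature.
  static List<String> tagGreedily(
      List<List<String>> sequence, Function<List<String>, String> classifier) {
    List<String> outcomes = new ArrayList<>();
    String previous = "START"; // hypothetical marker for "no previous outcome"
    for (List<String> features : sequence) {
      List<String> augmented = new ArrayList<>(features);
      augmented.add("PreviousOutcome=" + previous);
      String outcome = classifier.apply(augmented);
      outcomes.add(outcome);
      previous = outcome;
    }
    return outcomes;
  }

  // Toy rule-based classifier, used only to make the example runnable.
  static String toyClassifier(List<String> features) {
    if (features.contains("PreviousOutcome=DT")) return "NN";
    if (features.contains("Word=the")) return "DT";
    return "VB";
  }

  public static void main(String[] args) {
    List<List<String>> sequence = Arrays.asList(
        Arrays.asList("Word=the"),
        Arrays.asList("Word=run"),
        Arrays.asList("Word=run"));
    // prints [DT, NN, VB]: the same word gets different tags depending on
    // the outcome assigned to the word before it
    System.out.println(tagGreedily(sequence, PreviousOutcomeSketch::toyClassifier));
  }
}
```

Because the annotator itself never sees these previous-outcome features, the same feature extraction code works unchanged whichever sequential classifier is plugged in.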

Running this sequence of ClearTK components will result in a number of files written to the output directory "example/model". Generally speaking, these files are not human readable and are not very interesting. However, in this example we can create (somewhat) human readable training data by changing the value for DefaultMaxentDataWriterFactory.PARAM_COMPRESS to false in the method org.cleartk.example.pos.ExamplePOSAnnotator.getWriterDescription(String) - but this exercise can be safely skipped.

Now we need to train our model. This can be done in one of two ways:

  • Call org.cleartk.classifier.Train.main("example/model") as is done in BuildTestExamplePosModel. This is the preferred way to train a model in ClearTK. This main method takes as arguments the output directory used by the data writer and an arbitrary number of additional command line arguments specific to the given learner. In this case, the OpenNLP Maxent library provides reasonable default arguments and so we will just pass in the output directory. Train.main does two things: 1) it invokes the learner and trains a model specific to that learner, and 2) it creates a jar file that ClearTK knows how to load at classification time (see next section) that contains the model created by the learner and additional information that ClearTK needs to make use of it.
  • Directly train a model using the learner that the training data was written for. Every learner that ClearTK supports provides a command line interface for directly training a model. This may be useful in a research setting where you are going to fine-tune the learning parameters specific to the model. However, to create a jar file that ClearTK can use you will need to invoke the appropriate ClassifierBuilder.buildJar method. See Train.main for an example of how to do this.

For this tutorial, we suggest that you run org.cleartk.example.pos.BuildTestExamplePosModel exactly as it is. This will run the pipeline of ClearTK components which results in a training data file, train the model, and create the file "example/model/model.jar" which can be used for part-of-speech tagging as described below.

Running the part-of-speech tagger

Now that we have created a model for part-of-speech tagging, we are ready to tag some parts of speech! This step is rather similar to the step above, "Building a part-of-speech model", except that instead of creating training data from gold-standard data we will take raw text and apply our part-of-speech tagging model to assign part-of-speech tags to tokens. The model training step required sentences, tokens, and word stems (in addition to the correct part-of-speech tags), so we will need to use annotators that create these. The code that performs these steps can be found in org.cleartk.example.pos.RunExamplePOSAnnotator. Please refer to this code while reading along.

For this example, we will run the following:

  • org.cleartk.util.FilesCollectionReader - reads in plain text from a directory.
  • org.cleartk.sentence.opennlp.OpenNLPSentenceSegmenter - a simple wrapper around the OpenNLP sentence segmenter that provides Sentence annotations.
  • org.cleartk.token.TokenAnnotator - a PennTreebank styled tokenizer.
  • org.cleartk.token.snowball.SnowballStemmer - performs stemming on each token and posts the result into a feature of each token called "stem".
  • org.cleartk.classifier.ExamplePOSAnnotator - here is our part-of-speech tagger (can you feel it!) In this case we will use the ViterbiClassifier which will delegate to the MaxentClassifier which will classify each instance passed to it. Based on the returned results, the part-of-speech tags will be assigned to each token as described above.
  • org.cleartk.example.ExamplePOSPlainTextWriter - prints out the part-of-speech tags in a simple word/tag format.

For this tutorial, we suggest that you run org.cleartk.example.pos.RunExamplePOSAnnotator exactly as it is. This will run the pipeline of ClearTK components described above and results in the tagged file "/ClearTK/example/data/2008_Sichuan_earthquake.txt.pos". Please open this file and observe the results.
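For readers unfamiliar with the word/tag convention mentioned above, the following plain-Java sketch shows how a tagged sentence might be rendered in that style. The class and method names are hypothetical, and the exact layout written by ExamplePOSPlainTextWriter (spacing, line breaks) may differ from what is shown here.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a simple word/tag rendering; not ClearTK code.
class WordTagFormatSketch {

  // Joins each word with its tag using "/" and separates pairs with spaces.
  static String format(List<String> words, List<String> tags) {
    if (words.size() != tags.size()) {
      throw new IllegalArgumentException("need exactly one tag per word");
    }
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < words.size(); i++) {
      if (i > 0) {
        line.append(' ');
      }
      line.append(words.get(i)).append('/').append(tags.get(i));
    }
    return line.toString();
  }

  public static void main(String[] args) {
    // prints: The/DT earthquake/NN struck/VBD
    System.out.println(format(
        Arrays.asList("The", "earthquake", "struck"),
        Arrays.asList("DT", "NN", "VBD")));
  }
}
```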

You have successfully created a part-of-speech tagger and run it to tag text!

Comment by renaud.r...@gmail.com, Jan 19, 2012

Not sure I understand correctly: the text says "Here we used the convenience method classifySequence to call the SequenceClassifier", but in the code you call classify: for (String label : this.classify(instances)). Could you please explain? Thanks.

Comment by project member phi...@ogren.info, Jan 19, 2012

Thank you for pointing out the mistake in the documentation. I believe the method SequenceClassifier.classify used to be called classifySequence. I have changed the text.

Comment by renaud.r...@gmail.com, Jan 20, 2012

Thank you Philip!

