Tutorial
Introduction
This tutorial will walk you through the process of creating a new machine learning component using ClearTK: a part-of-speech tagger trained on the Penn Treebank corpus. It assumes you have already installed ClearTK and Eclipse and that your environment is set up as described in DeveloperSetup. At the end of this tutorial, you should understand:

Writing CleartkSequenceAnnotator classes
Extensions of CleartkSequenceAnnotator (and CleartkAnnotator) are at the core of many machine learning components provided by ClearTK and are the recommended construct for creating new components using ClearTK. CleartkSequenceAnnotators understand how to take a document (represented by a JCas) and create machine learning features for a particular task. They also understand how to take the predictions of a classifier and convert these into annotations over the document (the JCas). Thus, CleartkSequenceAnnotator objects serve as the interface between the UIMA annotations and machine learning classifiers. All of the code discussed here can be found in the org.cleartk.example package.

We'll start with a simple CleartkSequenceAnnotator designed to create features for a part-of-speech tagging task. In Eclipse, create a new Java class (File -> New -> Class), set the Name: to ExamplePOSAnnotator. Set the superclass to be org.cleartk.classifier.CleartkSequenceAnnotator<String>. This should generate a class like:

    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;
    import org.cleartk.classifier.CleartkSequenceAnnotator;

    public class ExamplePOSAnnotator extends CleartkSequenceAnnotator<String> {

      @Override
      public void process(JCas jCas) throws AnalysisEngineProcessException {
      }
    }

Note that the generic type OUTCOME_TYPE defined in CleartkSequenceAnnotator is parameterized with String here. This is because the outcomes of the classifier will be strings corresponding to part-of-speech tags. The process method defines how features and labels are extracted from the annotations of a JCas, and how classifier predictions are used to create new JCas annotations. We will also override the initialize method, which is typically used to initialize feature extractors, reading parameters as necessary from the UimaContext.

Processing a JCas with a CleartkSequenceAnnotator
Let's start out by working on the process method of our CleartkSequenceAnnotator. This method defines how features are generated from the Annotation objects in a JCas. We want to label part-of-speech tags, which in this example are attributes of Token annotations. We are going to do this in a "sequential" fashion by having the classifier tag, all at once, the sequence of tokens corresponding to one sentence. For many other tasks, classification will be performed on one item at a time. In such cases, CleartkAnnotator is the appropriate superclass to use for your component. For each sequence of tokens our CleartkSequenceAnnotator needs to know how to do two things:
The process method will perform both of these tasks as they usually share a lot of code. For our part-of-speech tagging task, we can define the process method like this:

    private List<SimpleFeatureExtractor> tokenFeatureExtractors;

    private List<ContextExtractor<Token>> contextFeatureExtractors;

    public void process(JCas jCas) throws AnalysisEngineProcessException {
      // generate a list of training instances for each sentence in the
      // document
      for (Sentence sentence : JCasUtil.select(jCas, Sentence.class)) {
        List<Instance<String>> instances = new ArrayList<Instance<String>>();
        List<Token> tokens = JCasUtil.selectCovered(jCas, Token.class, sentence);

        // for each token, extract all feature values and the label
        for (Token token : tokens) {
          Instance<String> instance = new Instance<String>();

          // extract all features that require only the token
          // annotation
          for (SimpleFeatureExtractor extractor : this.tokenFeatureExtractors) {
            instance.addAll(extractor.extract(jCas, token));
          }

          // extract all features that require the token and sentence annotations
          for (ContextExtractor<Token> extractor : this.contextFeatureExtractors) {
            instance.addAll(extractor.extractWithin(jCas, token, sentence));
          }

          // set the instance label from the token's part of speech
          if (this.isTraining()) {
            instance.setOutcome(token.getPos());
          }

          // add the instance to the list
          instances.add(instance);
        }

        // for training, write instances to the data writer
        if (this.isTraining()) {
          this.dataWriter.write(instances);
        }

        // for classification, set the labels as the token POS labels
        else {
          Iterator<Token> tokensIter = tokens.iterator();
          for (String label : this.classify(instances)) {
            tokensIter.next().setPos(label);
          }
        }
      }
    }

So, for each Sentence in the document we will examine a "sequence" of Tokens. For each Token we create a new classification instance that contains some features and, during training, the part-of-speech tag label. (For the moment, we're ignoring exactly what kind of features are generated - these will be discussed further in the next section.) We then pass the instances off to a SequenceDataWriter if we are in training mode; otherwise we classify the sequence of instances (by passing them to a SequenceClassifier via the classify method), and the returned outcomes are interpreted directly as the part-of-speech tags to assign to each token in the sequence. To make sense of this code, it is helpful to realize that:

Initializing feature extractors in a CleartkSequenceAnnotator
Now that we understand how our extension of CleartkSequenceAnnotator will be converting the JCas to classification instances, we can introduce some features that will be useful for our task. Feature extractors are typically created in the initialize method, which is invoked before the process method is ever called. Since we're building a part-of-speech tagger, some useful features are:
This is by no means an exhaustive set of the features found in part-of-speech taggers, but they are representative of the kinds of features such taggers extract, and they are what we will use for this example tagger. Here's how we create these feature extractors in our initialize method:

    public void initialize(UimaContext context) throws ResourceInitializationException {
      super.initialize(context);

      // alias for NGram feature parameters
      int fromRight = CharacterNGramProliferator.RIGHT_TO_LEFT;

      // a list of feature extractors that require only the token:
      // the stem of the word, the text of the word itself, plus
      // features created from the word text like character ngrams
      this.tokenFeatureExtractors = Arrays.asList(
          new TypePathExtractor(Token.class, "stem"),
          new ProliferatingExtractor(
              new SpannedTextExtractor(),
              new LowerCaseProliferator(),
              new CapitalTypeProliferator(),
              new NumericTypeProliferator(),
              new CharacterNGramProliferator(fromRight, 0, 2),
              new CharacterNGramProliferator(fromRight, 0, 3)));

      // a list of feature extractors that require the token and the sentence
      this.contextFeatureExtractors = new ArrayList<ContextExtractor<Token>>();
      this.contextFeatureExtractors.add(new ContextExtractor<Token>(
          Token.class,
          new TypePathExtractor(Token.class, "stem"),
          new Preceding(2),
          new Following(2)));
    }

We start with a TypePathExtractor, which will extract the stem of the Token annotation, and a SpannedTextExtractor, which simply takes an annotation and returns the text that it covers. We create a ProliferatingExtractor that wraps this SpannedTextExtractor and introduces some feature proliferators for extracting a number of features of the word, including its capitalization, numeric description, and character suffixes.

Note the difference between feature extractors and feature proliferators here. Feature extractors take an Annotation from the JCas and extract features from it. Feature proliferators take the features produced by a feature extractor and generate new features from the old ones. Since feature proliferators don't need to look up information in the JCas, they may be more efficient than feature extractors. So in our initialize method, the CharacterNGramProliferators simply extract suffixes from the text returned by the SpannedTextExtractor. The ProliferatingExtractor, which extracts both the word and the "proliferated" features (e.g. its suffixes), goes into the list of token feature extractors used in our process method.

Finally, we create a ContextExtractor, which will create features from the surrounding context of a token. In this case, we are going to retrieve the two word stems before and the two word stems after a token. Note that we will not create features from previous part-of-speech labels. No previous part-of-speech labels are available at the time feature extraction is performed on each token, because we pass the classifier an entire sequence of instances, corresponding to the tokens in the sentence, at once. Previous part-of-speech labels are very useful features and should be handled internally by the SequenceClassifier that is being used (as is done by e.g. MalletCRFClassifier and ViterbiClassifier).

And that's it for our first pass of our ExamplePOSAnnotator - there is no more code to write.
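
To make the extractor/proliferator distinction and the context window more concrete, here is a minimal standalone sketch in plain Java. It is not part of ExamplePOSAnnotator and does not use any ClearTK classes: the class name FeatureSketch, its method names, and the feature name strings (such as "Suffix2" or "Preceding_1") are all hypothetical and chosen only for illustration, and the features ClearTK actually generates may be named and structured differently. The sketch also assumes that context positions falling outside the sentence are simply skipped, which may not match ClearTK's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from ClearTK.
final class FeatureSketch {

    // Proliferator-style step: classify the capitalization of the word text.
    static String capitalType(String word) {
        String upper = word.toUpperCase(Locale.ROOT);
        String lower = word.toLowerCase(Locale.ROOT);
        if (word.equals(upper) && !word.equals(lower)) {
            return "ALL_UPPERCASE";
        }
        if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
            return "INITIAL_UPPERCASE";
        }
        return "OTHER";
    }

    // Proliferator-style step: derive suffixes of the given lengths from the word text.
    static List<String> suffixFeatures(String word, int... lengths) {
        List<String> features = new ArrayList<>();
        for (int n : lengths) {
            if (word.length() >= n) {
                features.add("Suffix" + n + "=" + word.substring(word.length() - n));
            }
        }
        return features;
    }

    // The "extracted" feature is the word text itself; every other feature is
    // computed from that text alone, without going back to the document.
    static List<String> tokenFeatures(String word) {
        List<String> features = new ArrayList<>();
        features.add("Word=" + word);
        features.add("Lower=" + word.toLowerCase(Locale.ROOT));
        features.add("CapitalType=" + capitalType(word));
        if (word.chars().anyMatch(Character::isDigit)) {
            features.add("ContainsDigit=true");
        }
        features.addAll(suffixFeatures(word, 2, 3));
        return features;
    }

    // Context-style step: stems up to `window` positions before and after the
    // token at `index`, restricted to the sentence (positions outside are skipped).
    static List<String> contextFeatures(List<String> sentenceStems, int index, int window) {
        List<String> features = new ArrayList<>();
        for (int offset = 1; offset <= window; offset++) {
            int before = index - offset;
            if (before >= 0) {
                features.add("Preceding_" + offset + "=" + sentenceStems.get(before));
            }
            int after = index + offset;
            if (after < sentenceStems.size()) {
                features.add("Following_" + offset + "=" + sentenceStems.get(after));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // prints [Word=Tagged, Lower=tagged, CapitalType=INITIAL_UPPERCASE, Suffix2=ed, Suffix3=ged]
        System.out.println(tokenFeatures("Tagged"));
        // prints [Preceding_1=the, Following_1=struck, Following_2=sichuan]
        System.out.println(contextFeatures(
            Arrays.asList("the", "earthquak", "struck", "sichuan"), 1, 2));
    }
}
```

Note that tokenFeatures touches only the word string, mirroring how proliferators work from an already extracted feature, while contextFeatures needs the whole sentence, mirroring why the ContextExtractor is given both the token and the sentence.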
At this point it is a matter of understanding the other components and how to configure and run them. Now we are ready to learn about training and using machine learning models.

Building a part-of-speech model
This section details how to create a model with one of the supported machine learning libraries. The CleartkSequenceAnnotator we created in the previous steps, ExamplePOSAnnotator, will be employed to create our training data by extracting features from the CAS and writing instances to a file in the format suitable for the machine learning library of choice. A model will then be trained using the selected machine learning library and packaged into a jar file suitable for use by ClearTK. The following provides the sequence of steps to accomplish this. The code that performs these steps can be found in org.cleartk.example.pos.BuildTestExamplePosModel. Please refer to this code while reading the following:
Running this sequence of ClearTK components will result in a number of files written to the output directory "example/model". Generally speaking, these files are not human readable and are not very interesting. However, in this example we can create (somewhat) human readable training data by changing the value for DefaultMaxentDataWriterFactory.PARAM_COMPRESS to false in the method org.cleartk.example.pos.ExamplePOSAnnotator.getWriterDescription(String) - but this exercise can be safely skipped.

Now we need to train our model. This can be done in one of two ways:
For this tutorial, we suggest that you run org.cleartk.example.pos.BuildTestExamplePosModel exactly as it is. This will run the pipeline of ClearTK components that produces a training data file, train the model, and create the file "example/model/model.jar", which can be used for part-of-speech tagging as described below.

Running the part-of-speech tagger
Now that we have created a model for part-of-speech tagging, we are ready to tag some parts-of-speech! This step is rather similar to the step above, "Building a part-of-speech model", except that instead of creating training data from gold-standard data we will be taking raw text and applying our part-of-speech tagging model to assign part-of-speech tags to tokens. The model training step required sentences, tokens, and word stems (in addition to the correct part-of-speech tags), so we will need to use annotators that create these. The code that performs these steps can be found in org.cleartk.example.pos.RunExamplePOSAnnotator. Please refer to this code while reading the following. For this example, we will run the following:
For this tutorial, we suggest that you run org.cleartk.example.pos.RunExamplePOSAnnotator exactly as it is. This will run the pipeline of ClearTK components described above and produce the tagged file "/ClearTK/example/data/2008_Sichuan_earthquake.txt.pos". Please open this file and observe the results. You have successfully created a part-of-speech tagger and run it to tag text!
Not sure I understand correctly: Here we used the convenience method classifySequence to call the SequenceClassifier, but in the code you call classify: for (String label : this.classify(instances)). Could you please explain? Thanks.
Thank you for pointing out the mistake in the documentation. I believe the method SequenceClassifier.classify used to be called classifySequence. I have changed the text.
Thank you Philip!