|
Training
How to train OCRopus and perform book-level adaptation.
Getting Started

To get started with training, try the following:
The rest will be fully automatic. You'll end up with a model file my.model that you can immediately use. Given the small amount of training data, it won't be very good, but you can still give it a try:
Explore the lines directory and its subdirectories to see how the various files look and relate to each other.

More Information

For both training and adaptive recognition, OCRopus requires just two additional commands. The first trains a text line recognizer and is used for all character shape training, both during initial training and book adaptive recognition:
The trainseg subcommand takes as input images of text lines (binary, grayscale, or color), cseg.png images representing the segmentation of the input lines into characters, and txt files containing the Unicode codepoints corresponding to the character segments indicated in the cseg.png files.

The trainseg subcommand conceptually performs two functions. First, for each actual character indicated in cseg.png, it trains a classifier to output the corresponding character from the txt file. Second, for each character hypothesis returned by the recognizer's segmenter and grouper that does not correspond to an actual output character, it trains a segmentation model to recognize those hypotheses as potential mis-segmentations. Internally, these two functions may be represented in different ways; for example, trainseg might use a mixture density model similar to those used in speech recognition, obviating the need to train a separate character/non-character model. Alternatively, the character/non-character decision may be made by an explicit classifier.

The second command-line tool used for training is the align subcommand:
The align subcommand takes the raw segmentation files (rseg.png), the recognition lattice files (fst), and a ground-truth transcription (gt.txt), and computes a character segmentation file (cseg.png) that aligns the ground-truth text with the input image.

Combinations of these commands allow a wide variety of training scenarios:
All these training and processing steps are closely analogous to training methods commonly used in speech recognition; they can generally be justified as expectation-maximization (EM) training algorithms.

More on Classifiers

OCRopus divides the recognition process into several (mostly) independent steps: document cleanup, layout analysis, text line recognition, and linguistic post-processing. The text line recognizer is given isolated text lines as input images, and its task is either to produce a Unicode string containing the corresponding text or to return a recognition lattice (similar to the output of a Hidden Markov Model recognizer) representing alternative interpretations of the input text line.

The new line recognizer is intended for source documents scanned at around 200 dpi or above, usually for alphabetic languages; for such documents, approaching text line recognition as segmentation followed by character recognition works well in our experience. For lower-resolution documents or other scripts, this text line recognizer may not work well, and other kinds of text line recognizers may be preferable, including segmentation-free convolutional neural networks and Hidden Markov Models (efforts to implement such additional recognizers for OCRopus are underway).

Base Classifiers

Perhaps the most important component of a text line recognizer is the character classifier itself. OCRopus provides a number of such classifiers. Classifiers in OCRopus are abstracted as components that, during training, receive a stream of feature vectors and corresponding classes, and are then switched into a recognition mode, during which they assign classes and corresponding posterior probabilities to feature vectors. Classifiers can be trained incrementally (e.g., using stochastic gradient descent) or in batch mode. For batch mode training, OCRopus can automatically group data into batches that are then handed to the training algorithm.
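The classifier abstraction described above can be illustrated with a small sketch. This is not the OCRopus interface: the class name, the method names (add_sample, start_recognition, posteriors, classify) and the batch_size parameter are all hypothetical, and the distance-weighted scores are only a stand-in for real posterior probabilities. The sketch shows the two phases (a training phase that buffers samples into batches, then a recognition mode) using a toy nearest neighbor model.

```python
import math
from collections import defaultdict


class NearestNeighborClassifier:
    """Toy classifier (hypothetical API): train first, then recognize."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size  # samples buffered before a batch is handed over
        self._buffer = []             # pending (vector, label) pairs
        self._prototypes = []         # stored (vector, label) pairs
        self._training = True
        self.batches_seen = 0

    def add_sample(self, vector, label):
        """Training mode: accept one feature vector and its class."""
        if not self._training:
            raise RuntimeError("classifier is already in recognition mode")
        self._buffer.append((list(vector), label))
        if len(self._buffer) >= self.batch_size:
            self._train_batch()

    def _train_batch(self):
        """Hand the buffered batch to the (trivial) training step."""
        if self._buffer:
            self._prototypes.extend(self._buffer)
            self._buffer = []
            self.batches_seen += 1

    def start_recognition(self):
        """Flush any partial batch and switch to recognition mode."""
        self._train_batch()
        if not self._prototypes:
            raise ValueError("no training samples were provided")
        self._training = False

    def posteriors(self, vector):
        """Recognition mode: return {class: score}, with scores summing to 1."""
        if self._training:
            raise RuntimeError("call start_recognition() first")
        scores = defaultdict(float)
        for proto, label in self._prototypes:
            # Closer prototypes contribute more to their class score.
            scores[label] += math.exp(-math.dist(vector, proto))
        total = sum(scores.values())
        return {label: s / total for label, s in scores.items()}

    def classify(self, vector):
        """Return the class with the highest score."""
        post = self.posteriors(vector)
        return max(post, key=post.get)


clf = NearestNeighborClassifier(batch_size=2)
for vec, label in [([0.0, 0.0], "a"), ([0.1, 0.0], "a"), ([1.0, 1.0], "b")]:
    clf.add_sample(vec, label)
clf.start_recognition()
print(clf.classify([0.9, 1.1]), clf.posteriors([0.9, 1.1]))
```

Separating the two phases lets a batch-trained model see complete batches before it is asked to classify anything, and makes an empty training set an explicit error instead of a silent failure.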
In order to allow larger batches to be buffered, OCRopus can internally compress feature vectors using small fixed-point representations.

Since the ability for end users to retrain OCRopus for new fonts, scripts, and languages is important, the interface to the classifiers encourages training to be fully automated and self-contained. However, developers can “plug in” classifiers that require manual intervention during training.

The primary classifiers in OCRopus right now are a nearest neighbor classifier, a highly optimized binary nearest neighbor classifier, and a standard multilayer perceptron (MLP) trained using stochastic gradient descent. We have also experimentally integrated a support vector machine classifier based on the standard libsvm (SVM; 5), but have not been able to make this classifier scale well to the size of training sets and number of classes that occur in OCR problems.

more on parameters etc

Automatic Parameter Selection and Cross-Validation

Considerable effort has gone into the development of the MLP classifier inside OCRopus, in particular its training methods. Traditionally, MLP classifiers have a reputation for being susceptible to local minima during training and for requiring careful manual choice of the training parameters. We have developed and implemented a number of new training techniques that completely automate the choice of learning rates and number of hidden units. As a result, the OCRopus MLP classifier yields quite consistent performance for each training set. When applied to standard datasets, such as the MNIST data, the training algorithm used by OCRopus automatically yields the best reported error rates on those datasets, without any additional tuning or parameter selection. The OCRopus MLP training algorithm is reminiscent of genetic algorithms, as well as cross-entropy-based optimization methods.
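The text does not spell out the selection procedure, so the following is only a generic sketch of a population-based search over learning rate and hidden-unit count, in the spirit of genetic algorithms. It is not the OCRopus training algorithm. All names are hypothetical, and validation_error is a synthetic stand-in for "train an MLP with these parameters and measure its error on held-out data", so that the example runs on its own.

```python
import math
import random


def validation_error(learning_rate, hidden_units):
    """Hypothetical stand-in for 'train an MLP, return held-out error'.

    A synthetic bowl with its minimum near learning_rate=0.01 and
    hidden_units=40; a real system would train and cross-validate here.
    """
    lr_term = (math.log10(learning_rate) + 2.0) ** 2
    hu_term = (math.log2(hidden_units) - math.log2(40)) ** 2
    return 0.02 + 0.05 * lr_term + 0.01 * hu_term


def mutate(candidate, rng):
    """Perturb both parameters multiplicatively (log-normal steps)."""
    lr, hidden = candidate
    lr = lr * math.exp(rng.gauss(0.0, 0.5))
    hidden = max(1, round(hidden * math.exp(rng.gauss(0.0, 0.3))))
    return (lr, hidden)


def population_search(evaluate, population, generations=20, keep=4, seed=0):
    """Keep the best `keep` candidates each round and refill by mutation."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: evaluate(*c))
        survivors = ranked[:keep]  # the best candidates are never discarded
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(size - keep)]
        population = survivors + children
    return min(population, key=lambda c: evaluate(*c))


# Deliberately poor starting guesses for (learning_rate, hidden_units).
initial = [(1.0, 4), (0.5, 8), (0.1, 16), (0.3, 2),
           (1e-4, 200), (1e-3, 400), (0.2, 100), (1e-5, 10)]
best = population_search(validation_error, initial)
print(best, validation_error(*best))
```

Because the survivors are carried over unchanged, the best held-out error never gets worse from one generation to the next. Each candidate is also evaluated independently of the others, so the evaluations can run in parallel.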
In order to speed up training, each member of the ensemble is assigned to a separate core on a multi-core processor. We have found this approach to be robust and reliable for a wide range of inputs, data sets, and feature vectors. When applied to the MNIST data, the number of hidden units chosen by this algorithm tends to be significantly below that reported in the literature, at similar error rates.

parameters controlling training and cross validation

Compound Classifiers

Although individual MLP recognizers yield reasonable performance on character recognition tasks, results can be considerably improved using classifier combination. A number of classifier combination methods have been implemented within OCRopus, including AdaBoost, rescored AdaBoost, and cascaded MLPs. The rescored AdaBoost classifier uses the standard AdaBoost algorithm to obtain a series of component classifiers, but then uses a least-squares fitting procedure to adjust the component classifier weights based on the training set. The cascaded MLP classifier is similar to a cascade correlation classifier, but is trained at the network level; that is, a first MLP is trained as usual, then a second MLP is trained that receives the output of the first MLP together with the original feature vector. In addition to these methods, we have experimented with (but not yet incorporated) different forms of bagging, local learning, and mixtures of experts.

The OCRopus component framework, together with the fully automated training procedures, makes the training of such combined classifiers fully automatic; that is, a user can simply choose a rescored AdaBoost classifier, apply OCRopus to a training data set, and obtain a fully trained and cross-validated classifier that can be used interchangeably with a simple MLP classifier.

more on compound classifiers

Feature Maps

An important aspect of classifier design is designing the feature vectors that go into the classifier.
Several different philosophies exist. Some approaches (e.g., many SVM-based classifiers, as shown in the MNIST results) perform no feature extraction at all and instead modify the classifier itself. Other approaches (e.g., convolutional neural networks) attempt to learn the feature extractors as part of the overall classification task. “Traditional” pattern recognition designs feature maps manually and may perform feature selection.

In OCRopus, feature extraction is abstracted in a feature map class. The feature map class is initialized with the entire text line image and then returns feature vectors for arbitrary subrectangles of the input text line. This design allows the feature map class to perform global convolutions on the input, as well as to encapsulate different approaches to rescaling and normalizing the input.

more on feature maps and their parameters

Datasets

memory usage, parameters, etc. |
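Returning to the feature map abstraction described above, here is a minimal sketch of the idea: an object built once from the whole text line that then serves feature vectors for arbitrary subrectangles. The class and method names (FeatureMap, extract) and the grid parameter are hypothetical and do not reflect the OCRopus class. As a simple stand-in for a global operation on the line, the sketch precomputes a summed-area table, and it rescales every subrectangle to a fixed grid of mean intensities.

```python
class FeatureMap:
    """Toy feature map: built once per text line, queried per subrectangle."""

    def __init__(self, line_image, grid=4):
        # line_image: list of rows, each a list of pixel values in [0, 1]
        self.height = len(line_image)
        self.width = len(line_image[0])
        self.grid = grid
        # Global precomputation over the whole line: a summed-area table,
        # so the sum over any rectangle can be read off in constant time.
        self._sat = [[0.0] * (self.width + 1) for _ in range(self.height + 1)]
        for y in range(self.height):
            row_sum = 0.0
            for x in range(self.width):
                row_sum += line_image[y][x]
                self._sat[y + 1][x + 1] = self._sat[y][x + 1] + row_sum

    def _box_sum(self, x0, y0, x1, y1):
        """Sum of pixels with x0 <= x < x1 and y0 <= y < y1."""
        s = self._sat
        return s[y1][x1] - s[y0][x1] - s[y1][x0] + s[y0][x0]

    def extract(self, x0, y0, x1, y1):
        """Return a grid*grid vector of cell means for the subrectangle."""
        if not (0 <= x0 < x1 <= self.width and 0 <= y0 < y1 <= self.height):
            raise ValueError("subrectangle lies outside the text line")
        features = []
        for gy in range(self.grid):
            # Integer cell boundaries map a box of any size onto a fixed grid.
            cy0 = y0 + (y1 - y0) * gy // self.grid
            cy1 = y0 + (y1 - y0) * (gy + 1) // self.grid
            for gx in range(self.grid):
                cx0 = x0 + (x1 - x0) * gx // self.grid
                cx1 = x0 + (x1 - x0) * (gx + 1) // self.grid
                area = (cx1 - cx0) * (cy1 - cy0)
                # Cells that collapse to zero area (tiny boxes) yield 0.0.
                mean = self._box_sum(cx0, cy0, cx1, cy1) / area if area else 0.0
                features.append(mean)
        return features


# A 4x8 "text line": left half blank (0.0), right half ink (1.0).
line = [[0.0] * 4 + [1.0] * 4 for _ in range(4)]
fmap = FeatureMap(line, grid=2)
print(fmap.extract(0, 0, 4, 4))  # box over the blank half
print(fmap.extract(2, 0, 6, 4))  # box straddling the boundary
```

Keeping the precomputation inside the object means the caller only ever passes rectangle coordinates, so a different rescaling or normalization scheme can be swapped in without touching the code that proposes character boxes.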
Could anyone please upload a simple example with simple files? Maybe a step-by-step guide on how to use it on a real-world example?
I got the provided lines model and it works perfectly for ASCII text from scanned book images, but I get zero hits on screenshots, my intended use.
What I do, and get nothing right, is:
All I get is:
0001 010002: transcript doesn't agree with cseg (transcript 11, cseg 10) FIXME
info? training content classifier
FATAL: CHECK include/glclass.h:254 ds.nsamples()>0
I must note that the line recognition always works perfectly. It gets all the small text samples on the screen, and only one or two occasional nonsense images.
I got around the "transcript does not agree with cseg" error by setting old_csegs=1.
I have problems with training OCRopus. Using ocropus 0.4, I made a simple image of a string of numbers in GIMP. I did all the things one is supposed to do: ocropus book2pages, pages2lines, lines2fsts, and fsts2text. OCRopus recognizes the string perfectly. I wanted to train a language model for this type of data. ocropus align gave some problems because it looks for -9?.gt.txt data, and not only 0-9?.txt, which is the normal output of lines2fsts, so I changed that manually. Same story with the char segmentation: I had to change the names and add "gt". Now it starts to run, but I get the following error:
chepeadan@linux:~/project/training$ debug=info,transcript ocropus trainseg model1.model test
info? test/0-9?0-9?0-9?0-9?/0-9?0-9?0-9?0-9?.cseg.gt.png
transcript? test/0001/0001.gt.txt (0) 6, 6 000789
info? training content classifier
info? 58 to 4 classes?
FATAL: CHECK ocr-line/glclass.cc:1151 ds.nsamples()>=10 && ds.nsamples()<100000000
Can anyone help?