Using
A short introduction to using and training OCRopus.
Note: this user manual contains text blocks from papers submitted to the MOCR '09 and ACM DocEng '09 workshops.

Getting Started
The DevInstall page contains an example of page recognition and of book-level recognition. Also see the Book-Level Recognition section below. There is a short example for doing training on the Training page.

Components
Different languages and document types have different requirements in terms of preprocessing, cleanup, layout analysis, OCR, and language modeling. These requirements are addressed by a growing set of different components. We had originally hoped to be able to handle this through a top-level scripting interface, but found that the ability to instantiate, load, and save components was needed in many different places within the OCRopus system. Existing, portable component models (such as Mozilla’s) were both heavy-weight and did not address the needs of OCRopus very well. We therefore introduced a simple component model for OCRopus.

The model is implemented by deriving from the IComponent base class. This provides a standard set of interfaces as well as some common component functionality. Most importantly, the component model allows components to be saved to a stream, loaded from a stream, and instantiated by name.

In addition, many document analysis algorithms depend on parameters. Adapting and tuning those algorithms involves extensive experimentation with different parameter values. The new component model allows parameters to be set, both on a per-component basis and globally (via the environment). Parameter settings for a particular component are saved and loaded along with the component, and they can be inspected easily from the command line.

Finally, each component can optionally accept commands via a simple method that takes an array of string arguments and returns a string result.
This command interface is useful for sending component-specific commands (e.g., for debugging) without having to define a complex class hierarchy with many rarely used methods. The command method provides a simple form of duck typing and dynamic object-oriented programming.

At the C++ level, a "component" is a C++ object that implements the IComponent interface. There are several command line tools for dealing with OCRopus components:
The parameters that ocropus params lists can be set from the environment. Parameters set from the environment apply to all instances of that component (this is usually not a problem). When a component has been saved, you can obtain information on it with the ocropus cinfo command. Note that some older "character shape models" (instances of IRecognizeLine) are not saved as components, so cinfo does not work on them. To obtain information about language models, use the OpenFST tools, e.g., fstprint.

Book-Level Recognition
Although OCRopus can recognize individual pages (or even text lines or isolated characters), a major design goal in OCRopus has been to support book-level recognition. Book-level recognition can reduce OCR error rates by taking into account regularities in font, degradation, style, and layout across an entire book.

To support book-level processing, starting with OCRopus 0.4, OCRopus has a standard representation for entire books. This book-level representation is a directory containing various files and subdirectories representing inputs, intermediate processing results, and the final output of recognition stages. This representation is created and manipulated using a number of command line tools that are executed in sequence in order to achieve book-level recognition and adaptation:
These steps reflect the strictly no-backtracking architecture of OCRopus; that is, each processing step outputs a complete representation of all possible interpretations of the input to the next processing step. The directory tree dir contains the intermediate processing results and represents the only information passed between different processing stages. The directory tree is a two-level hierarchy: the first level represents the pages of the input document, while the second level represents individual text lines and figures. The data relevant to line 17 on (physical) page 5 is stored in the following locations:
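One layout consistent with the dir/????/????.png pattern used for text line images in this section can be sketched as follows. The helper name line_path and the four-digit, zero-padded decimal numbering are assumptions made for illustration, not confirmed OCRopus behavior; only the .png and .txt files named in this section are shown.

```shell
#!/bin/sh
# Hypothetical helper: build the path of a per-line file inside the book directory.
# ASSUMPTION: pages and lines are numbered with four zero-padded decimal digits,
# inferred from the dir/????/????.png glob; OCRopus itself may number differently.
line_path() {
    book_dir=$1   # top-level book directory
    page=$2       # physical page number (plain decimal, no leading zeros)
    line=$3       # text line number within the page
    ext=$4        # file extension, e.g. png or txt
    printf '%s/%04d/%04d.%s\n' "$book_dir" "$page" "$line" "$ext"
}

# Files for line 17 on physical page 5: line image and transcription
line_path dir 5 17 png   # prints dir/0005/0017.png
line_path dir 5 17 txt   # prints dir/0005/0017.txt
```

Pass page and line numbers without leading zeros, since printf would otherwise try to read them as octal.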
Note that there is only a small number of data types (PNG images, Unicode strings, recognition lattices). The recognition lattices themselves are stored in OpenFST format (www.openfst.org) and can be manipulated with standard OpenFST tools.

This intermediate representation makes it easy to write command line tools implementing various processing stages. For example, an alternative text line recognizer can be substituted for the standard OCRopus recognizer simply by running it over each of the segmented text lines:

    # write the output of my-recognizer for each line image to a matching .txt file
    for line in dir/????/????.png; do
        my-recognizer "$line" > \
            "$(echo "$line" | sed 's/png$/txt/')"
    done

The representation also makes it easy to write visualization, ground truthing, and correction tools. For example, a text line transcription tool consists of performing the layout analysis step and computing the text line images, then presenting the user with each text line and asking for a transcription to be input.

Although this representation involves reading and writing many files during processing, we have observed that the overhead of this is small compared to the actual recognition process, at least if the directory tree is stored on a local disk.

The overall output from the system is in hOCR format, a format that is fully (X)HTML-compliant, but also embeds geometric and other OCR-related information invisibly inside SPAN and DIV tags. The design of the hOCR format provides full, standard support for all common typographic phenomena through the existing HTML and CSS standards, is compatible with existing indexing and search engines, and yet also includes all the information typically required for OCR processing and post-processing.
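To illustrate how the geometric information embedded in hOCR can be recovered with ordinary text tools, here is a minimal sketch. It assumes bounding boxes appear in title attributes as "bbox x0 y0 x1 y1" with unsigned integer coordinates, as the hOCR format describes; the function name hocr_bboxes and the sample markup are hypothetical, and grep -o is a widely available extension rather than a POSIX feature.

```shell
#!/bin/sh
# Print "x0 y0 x1 y1" for every bounding box found in hOCR read from stdin.
# ASSUMPTION: each bbox is written as four unsigned integers after the word "bbox".
hocr_bboxes() {
    grep -o 'bbox [0-9][0-9]* [0-9][0-9]* [0-9][0-9]* [0-9][0-9]*' | sed 's/^bbox //'
}

# Hypothetical hOCR fragment, made up for illustration only
sample="<div class='ocr_page' title='bbox 0 0 2480 3508'>
<span class='ocr_line' title='bbox 210 305 1890 362'>An example line</span>
</div>"

# One line of output per element carrying a bbox
printf '%s\n' "$sample" | hocr_bboxes
```

Because the geometry lives in attributes rather than in visible text, the same file still renders as plain HTML in a browser.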