Getting Started

Often, the easiest way of customizing OCRopus is not by writing new C++ tools, but by reconfiguring it via parameter settings or choosing different components, by shell scripting, or by scripting in Python or Lua. You need to write C++ code if you want to implement compute-intensive new processing steps, or if you really need a self-contained executable.

Using OCRopus Components

Let's start by writing a small program that illustrates the major aspects of OCRopus C++ programming.

// ocrobin.cc -- binarize an input file
// usage: ocrobin input.png output.png
// use the standard include files for colib, iulib, and ocropus
// (colib ships as part of iulib)
#include <colib/colib.h>
#include <iulib/iulib.h>
#include <ocropus/ocropus.h>
using namespace colib;
using namespace iulib;
using namespace ocropus;
// get a parameter from the environment, with a default value
param_string method("method","BinarizeByOtsu","binarization method");
int main(int argc,char **argv) {
    try {
        if(argc!=3) throw "wrong # arguments";
        // register all the internal OCRopus components so that make_component works
        init_ocropus_components();
        // instantiate the binarization component
        autodel<IBinarize> binarizer;
        make_component(method,binarizer);
        // read an input image
        bytearray image;
        read_image_gray(image,argv[1]);
        // apply the binarizer
        bytearray output;
        binarizer->binarize(output,image);
        // write the result
        write_image_gray(argv[2],output);
    } catch(const char *message) {
        fprintf(stderr,"error: %s\n",message);
        exit(1);
    }
}

Put this into a file called ocrobin.cc, then compile it with the command:

g++ ocrobin.cc -o ocrobin -locropus -liulib -llept -lpng -ljpeg -lgif -ltiff -fopenmp

The -o ocrobin option names the resulting executable. The ocropus and iulib libraries are libraries of the OCRopus project. The lept library is the Leptonica image processing library. The png, jpeg, gif, and tiff libraries are used for image I/O. The -fopenmp flag tells the compiler to compile in, and link with, multicore support.

Note the following points:

- Unless you're fixing bugs, you should work in your own directory, creating your own little project. That is, you should work outside the iulib and ocropus repositories. Think of your code as a separate "add-on" project. If you want to submit it as part of OCRopus later, that makes it much easier for us than if you send patches.
- The include paths have a prefix of colib/..., iulib/..., and ocropus/... for using the installed versions of colib, iulib, and ocropus. Source files that are part of the OCRopus distribution do not use those prefixes (since they do not use the installed versions of the include files).
- There are a number of tools for getting and setting parameters, including the param... declarations and the parameter settings for OCRopus components.
- You can pass parameter values in the environment easily by prefixing the command with the parameter values; e.g., method=foo ./ocrobin input.jpg output.jpg will set the method parameter to foo.
- The binarization algorithm is defined in a class called BinarizeByOtsu; this class conforms to the IBinarize interface.
- Although we could instantiate BinarizeByOtsu using new, instead we instantiate it using make_component and a name passed in the environment. This allows us to try new algorithms quickly. In order for make_component to work, we first need to register all the components; the built-in components are registered using init_ocropus_components.
- Images in OCRopus are represented as arrays. The type bytearray is short for narray<unsigned char>. Other common array types are shortarray, intarray, and floatarray (see below).
- The function read_image_gray reads an image into a bytearray. Note that output arguments always go on the left (just like assignment; this is the standard UNIX C convention, but it differs from Google's conventions).
- Most exceptions in OCRopus are simply of type const char *; this is used for all user-visible errors, and you can just catch those and report them.
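The points above about parameters, name-based instantiation, and const char * exceptions can be illustrated without OCRopus installed. The following is only a rough, self-contained sketch of the general idea; it is not how OCRopus implements make_component or param_string, and every name in it (ISketchComponent, register_sketch, make_sketch, env_param) is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a component interface such as IBinarize.
struct ISketchComponent {
    virtual ~ISketchComponent() {}
    virtual const char *name() = 0;
};

struct OtsuSketch : ISketchComponent {
    const char *name() { return "otsusketch"; }
};

struct MeanSketch : ISketchComponent {
    const char *name() { return "meansketch"; }
};

typedef std::function<std::unique_ptr<ISketchComponent>()> Factory;

// Table mapping component names to factories (an assumption made for
// illustration; the real registry is internal to OCRopus).
static std::map<std::string, Factory> &registry() {
    static std::map<std::string, Factory> table;
    return table;
}

// Make a component type available under a name.
template <class T>
void register_sketch(const char *name) {
    registry()[name] = []() { return std::unique_ptr<ISketchComponent>(new T()); };
}

// Instantiate a component by name; the output argument comes first.
// User-visible errors are reported as const char * exceptions.
void make_sketch(std::unique_ptr<ISketchComponent> &out, const std::string &name) {
    std::map<std::string, Factory>::iterator it = registry().find(name);
    if (it == registry().end()) throw "unknown component";
    out = it->second();
    assert(out);
}

// Read a parameter from the environment, falling back to a default value.
std::string env_param(const char *name, const char *dflt) {
    const char *value = std::getenv(name);
    return value ? std::string(value) : std::string(dflt);
}
```

A caller would register the available types once, then write something like make_sketch(c, env_param("method", "OtsuSketch")) inside a try block that catches const char *, so that running the program with method=MeanSketch in the environment selects a different implementation without recompiling.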
Defining New OCRopus Components

Here is an example of code that defines a (not very good) image binarization component and makes it available to the rest of OCRopus (this example is in extras/sample-extension):

#include <colib/colib.h>
#include <iulib/iulib.h>
#include <ocropus/ocropus.h>
using namespace colib;
using namespace iulib;
using namespace ocropus;
namespace ocropus { int main_ocropus(int,char **); }
struct MyThresholder : IBinarize {
    const char *name() { return "mythresholder"; }
    const char *description() { return "performs thresholding based on the mean"; }
    MyThresholder() {
        pdef("factor",1.0,"threshold is factor * mean");
    }
    void binarize(bytearray &out,floatarray &in) {
        float factor = pgetf("factor");
        int n = in.length();
        float mean = sum(in)/n;
        // the threshold actually applied is factor * mean
        float threshold = factor*mean;
        debugf("info","threshold=%g\n",threshold);
        out.makelike(in);
        for(int i=0;i<n;i++)
            out[i] = 255 * (in[i]>=threshold);
    }
};
extern "C" {
    void ocropus_init_dl();
}
void ocropus_init_dl() {
    component_register<MyThresholder>("MyThresholder");
}
int main(int argc,char **argv) {
    component_register<MyThresholder>("MyThresholder");
    return main_ocropus(argc,argv);
}

Compile it with:

g++ ocrothresh.cc -locropus -liulib -ljpeg -lpng -lgif -ltiff -fopenmp -llept -lSDL -lSDL_gfx -lgsl -lblas

Afterwards, you can use the MyThresholder component anywhere in OCRopus. For example, you can access it from the command line and list its parameters:

$ ./a.out params MyThresholder
param default mythresholder_factor=2 1 threshold is factor * mean
name=MyThresholder
description=performs thresholding based on the mean
$ binarizer=MyThresholder ./a.out threshold test.jpg out.png

You can also use the component dynamically from within OCRopus (if your build supports dynamic loading):

$ g++ -fPIC -shared -g -o ocrothresh.so ocrothresh.cc -locropus -liulib -ljpeg -lpng -lgif -ltiff -fopenmp -llept -lSDL -lSDL_gfx -lgsl -lblas
$ extension=./ocrothresh.so binarizer=MyThresholder ocropus threshold test.jpg out.png
[info] using mythresholder
[info] threshold=234.447
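
The thresholding rule used by MyThresholder (a pixel becomes white when it is at least factor times the mean gray value) can be tried out without OCRopus. The sketch below restates it on std::vector instead of bytearray and floatarray; the function name threshold_by_mean is hypothetical and is not part of any OCRopus library.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Binarize `in` against factor * mean(in).
// The output argument comes first, as in the OCRopus conventions.
void threshold_by_mean(std::vector<unsigned char> &out,
                       const std::vector<float> &in,
                       float factor) {
    assert(!in.empty());
    float mean = std::accumulate(in.begin(), in.end(), 0.0f) / in.size();
    float threshold = factor * mean;
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = in[i] >= threshold ? 255 : 0;
}
```

For the input values 0, 100, 200, 255 the mean is 138.75, so a factor of 1 maps the first two pixels to 0 and the last two to 255. This also shows why the component is described as not very good: a single global threshold ignores local variations in illumination.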
Coding Conventions

Please have a look at the Conventions (http://docs.google.com/Doc?id=dfxcv4vc_508vv9g6khd); all contributions should follow them. The most important parts of the conventions are:

- Function callers own all storage. The only exceptions are small constructor-like functions.
- Don't use pointers anywhere; use references, smart pointers or arrays instead.
- Output arguments come before input arguments.
- Your code must be exception safe. Exception safe code does not leak resources even if unexpected exceptions are thrown. Use smart pointers and similar classes to ensure cleanup.
- Write unit tests. Bind your code to Lua and write unit tests in Lua.
- Use only the small "approved" set of data types in interfaces visible outside your compilation unit.
- Stick to the "approved" libraries (library dependencies are the biggest headache when porting and upgrading).
- Use global variables only for debugging.
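
Several of these conventions can be seen together in a small, self-contained sketch. It is written against the standard library rather than colib, and the names Scratch and scale_nonnegative are hypothetical. The caller owns the output storage, the output argument comes first, and a smart pointer releases the temporary buffer even when an exception is thrown halfway through.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Temporary working storage. The counter is a global used only for
// debugging, so that leaks would show up in a test.
struct Scratch {
    static int live;
    std::vector<float> data;
    explicit Scratch(std::size_t n) : data(n) { live++; }
    ~Scratch() { live--; }
};
int Scratch::live = 0;

// Output argument first; the caller owns `out`.
// Throws a const char * if any input value is negative.
void scale_nonnegative(std::vector<float> &out,
                       const std::vector<float> &in,
                       float factor) {
    // Owned by a smart pointer: released on every exit path.
    std::unique_ptr<Scratch> tmp(new Scratch(in.size()));
    for (std::size_t i = 0; i < in.size(); i++) {
        if (in[i] < 0) throw "negative input value";
        tmp->data[i] = in[i] * factor;
    }
    // Only touch the caller's storage once nothing can fail anymore.
    out.swap(tmp->data);
}
```

Because the result is swapped into out only at the very end, a failed call leaves the caller's array unchanged, and Scratch::live returns to zero whether or not the call throws.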
General programming principles are:

- Use assertions and tests liberally (use the ones defined in colib/checks.h)
- Get it working first, then do execution profiling, and only then optimize.
- Don't use pointers for optimization; they rarely help, and they often hurt performance.
- Avoid aliasing and avoid circular pointer structures; they are rarely needed.

Don't write messy, complicated code just because you are guessing it's going to be faster; only optimize after you have data from execution profiling.

In terms of formatting, please observe:

- Generally, follow K&R code layout and K&R/Stroustrup capitalization.
- Don't use tabs; indent with spaces only, and indent by 4.

If you see significant violations of these coding conventions that don't come with a justification, please submit an issue report.

FIXME describe scripts in utilities/ that check for violations of some of these.

The Array Data Type

The most important compound data type in OCRopus is an array class that can represent rank 1-4 arrays, as well as stacks and lists. The constructors look like this:

narray<T>();
narray<T>(int d0);
narray<T>(int d0,int d1);
narray<T>(int d0,int d1,int d2);
narray<T>(int d0,int d1,int d2,int d3);

Rather than writing out all these overloads, let's just abbreviate them as narray<T>(int d0,...).

The copy constructor and assignment operator are intentionally disabled; you cannot pass an array by value, and you cannot return it from a function/method. That's because if you did so accidentally, it would have an unacceptable performance penalty. Instead of returning arrays, just follow the coding conventions. That is, instead of:

floatarray f(double x); // DO NOT DO THIS
floatarray a = f(x);

write

void f(floatarray &a,double x);
floatarray a;
f(a,x);

This is a little more tedious, but it avoids a whole range of memory management issues and makes the code easy to bind to other programming languages.

Memory management for arrays is handled by these methods:

void resize(int d0,...);
void renew(int d0,...);
void reshape(int d0,...);
void dealloc();

The difference between these is that resize may destroy all the data previously held by the array, renew guarantees that it will allocate and initialize new underlying storage, and reshape never allocates new storage and must retain the same total number of elements. dealloc simply deallocates all storage associated with the array, returning it to the state it was in right after being declared (that is, a.dealloc(); a.resize(10,10); a(0,0) = 99; is valid and common).

Accessing the properties and individual elements is handled using these methods:

int rank() const;
int dim(int i) const;
T &at(int i0,...);
T &operator()(int i0,...);

Even arrays of rank >1 can always be treated as arrays of rank 1 (with elements in C order):

int length1d() const;
T &at1d(int i) const;
T &operator[](int i) const;

1D arrays can also be treated as stacks (similar to Python lists); the accessors are the same. The following methods implement the remaining stack/list operations:

int length();
void push(T &value);
T &pop();
T &last();
void clear();
void reserve(int n);
void grow_to(int n);

The OCR Interfaces

The following are OCR interfaces that the rest of the system understands. If you write to these interfaces, chances are that your algorithm can be used as a drop-in replacement in the system:

- IComponent A top-level interface to all character recognition components that permits querying information such as the name of the component and getting/setting parameters in a generic way.
- ICleanupGray An interface to algorithms that clean up gray scale images.
- ICleanupBinary An interface to algorithms that clean up binary images.
- ITextImageClass An interface to text/image segmentation algorithms.
- IBinarize An interface to binarization methods.
- ISegmentPage An interface to physical page layout analysis methods (i.e., methods that divide page images into columns, paragraphs, and text lines).
- ISegmentLine An interface to algorithms that segment text lines into character parts.
- ICharLattice FIXME An interface to weighted finite state transducers (used for representing language models, hypothesis graphs, etc.)
- ICharacterClassifier An interface to isolated character classifiers. This is rarely used for recognizing body text since it has no provisions for recognizing touching characters.
- IRecognizeLine An interface to text line recognizers. This is the most common interface for recognizing characters.

Invoking the Line Recognizer

Here is a longer example showing how to invoke the line recognizer. Usage is a.out character.model image.png.

#include "colib/colib.h"
#include "iulib/iulib.h"
#include "ocropus/ocropus.h"
#include "ocropus/glinerec.h"
using namespace iulib;
using namespace colib;
using namespace ocropus;
using namespace narray_ops;
using namespace glinerec;
int main(int argc,char **argv) {
    init_ocropus_components();
    init_glclass();
    init_glfmaps();
    init_linerec();
    autodel<IRecognizeLine> linerec;
    make_component(linerec,"linerec");
    stdio model(argv[1],"r");
    linerec->load(model);
    bytearray image;
    read_image_gray(image,argv[2]);
    autodel<IGenericFst> result;
    make_component(result,"OcroFST");
    linerec->recognizeLine(*result,image);
    nustring str;
    str.clear();
    // should be using a language model here
    result->bestpath(str);
    narray<char> s;
    str.utf8Encode(s);
    s.push(0);
    printf("%s\n",&s[0]);
    return 0;
}
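
The last steps of the example convert the recognized string to UTF-8 and append a terminating 0 before printing. What such an encoding step involves can be sketched independently of OCRopus; the function below is a hypothetical stand-in written against the standard library and says nothing about how nustring::utf8Encode is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encode Unicode code points as UTF-8. The output argument comes first.
// Surrogate code points are not rejected in this sketch.
void utf8_encode_sketch(std::string &out, const std::vector<unsigned> &codepoints) {
    out.clear();
    for (std::size_t i = 0; i < codepoints.size(); i++) {
        unsigned c = codepoints[i];
        assert(c <= 0x10FFFF);
        if (c < 0x80) {
            // 1 byte: 0xxxxxxx
            out += char(c);
        } else if (c < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += char(0xC0 | (c >> 6));
            out += char(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += char(0xE0 | (c >> 12));
            out += char(0x80 | ((c >> 6) & 0x3F));
            out += char(0x80 | (c & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += char(0xF0 | (c >> 18));
            out += char(0x80 | ((c >> 12) & 0x3F));
            out += char(0x80 | ((c >> 6) & 0x3F));
            out += char(0x80 | (c & 0x3F));
        }
    }
}
```

A std::string keeps its own terminator, so out.c_str() can be passed to printf directly; with a plain character array, as in the example above, the 0 has to be pushed explicitly.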