About
OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the U.S. Census Bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
Releases
The current release is ocropus-0.4.3; it is still an alpha release, so don't expect stability or high performance yet. We will not be providing new tarballs until the beta release. To obtain ocropus-0.4.3 and install it, please use something like the following commands:
    mkdir ~/build
    cd ~/build
    hg clone https://iulib.googlecode.com/hg/ iulib
    cd iulib
    hg update -r ocropus-0.4.3
    scons
    sudo scons install
    cd ~/build
    hg clone https://ocropus.googlecode.com/hg/ ocropus
    cd ocropus
    hg update -r ocropus-0.4.3
    scons
    sudo scons install
That should work on Ubuntu 9.04 if you have all the necessary packages installed; if not, have a look at the DevInstall page or the Google Group Pages.
Resources
- OCRopus Mailing List (subscribe / contribute)
- OCRopus Group Pages (add your contributions here)
- User-contributed links and resources (add links here)
Related Projects
- iulib Library (you need to install this)
- hOCR Tools -- tools for manipulating OCR output
- DECAPOD -- camera-based document capture and tagged PDF generation
- PyOpenFST -- Python bindings for OpenFST (for language modeling)
Documentation
The following is the most important documentation:
- Release Notes -- summary information about releases
- Development Install -- how to install the development version of OCRopus
- Using -- some information about how to use OCRopus
- Training -- how to train OCRopus
- Publications -- information about algorithms
If you want to contribute to the primary documentation, please clone the wiki repository with hg clone https://wiki.ocropus.googlecode.com/hg and submit patches against the documentation.
Additional links you may find useful are here:
- C++ Programming -- extending OCRopus in C++
- C++ Coding Conventions -- memory management, pointers, naming, formatting
- File Formats -- file formats used by OCRopus
- Book-Level Representation -- directory layout for whole book recognition
- hOCR Output Format -- (X)HTML-compatible OCR output format
Bugs and Contributions
Please use the "Issues" tab above to submit bugs, feature requests, etc.
When submitting bug reports, please keep the following in mind:
- include the OCRopus version or hg changeset, the OS version, and the compiler version
- attach sample images that fail (tag with SampleImage if you attach an image)
- include a stack trace from GDB if you can get one
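As a rough sketch of how the version details listed above might be gathered, the following script writes them to a text file you can paste into a bug report. The output file name and the ~/build/ocropus checkout location are assumptions (the latter taken from the install commands on this page), and g++ is assumed to be the compiler; adjust all three to your setup.

```shell
#!/bin/sh
# Collect OS, compiler, and OCRopus changeset information for a bug report.
# Assumptions: the file name below is arbitrary, the checkout lives in
# ~/build/ocropus, and g++ is the compiler that was used for the build.
report=ocropus-bug-info.txt
src="$HOME/build/ocropus"

{
    echo "OS: $(uname -a)"

    if command -v g++ >/dev/null 2>&1; then
        # The first line of the version output names the compiler release.
        echo "Compiler: $(g++ --version | head -n 1)"
    else
        echo "Compiler: g++ not found"
    fi

    if command -v hg >/dev/null 2>&1 && [ -d "$src/.hg" ]; then
        # "hg id" prints the changeset hash and any tags of the working copy.
        echo "OCRopus changeset: $(hg -R "$src" id 2>/dev/null)"
    else
        echo "OCRopus changeset: unknown"
    fi
} > "$report"

cat "$report"
```

The script only reports what it can find: if hg or the checkout is missing, it records "unknown" instead of failing, so fill in that line by hand before submitting.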
If you have patches or other contributions, please supply them as a Mercurial bundle (preferred) or patch. Please tag with FixBundle or FixPatch, respectively.
Acknowledgements
The system combines the work of many contributors and previous projects. The core developers work at the IUPR research group at the DFKI and gratefully acknowledge funding by Google.