About
OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
Releases
We're preparing for the next release. The release consists of the following components:
- iulib -- basic image processing
- ocropus -- OCR-specific functionality (libraries and some command line programs)
- ocroswig -- bindings of iulib and ocropus to Python
- ocropy -- Python library and command line tools
- pyopenfst -- Python bindings of the OpenFST library
Please see the InstallTranscript for installation instructions.
There is plenty of new functionality:
- all recognition can now be carried out from Python
- there are top-level commands for recognition and training written in Python
- classifiers can now cope with large character sets
- there are tools for clustering and correcting character shapes
- there is support for ligatures
- there are numerous bug fixes
- training is possible on very large datasets (many millions of samples)
We will be calling this release 0.4.4.
There is still some functionality missing for what we want to call 0.5:
- the Python tools do not yet do a good job at upper/lower case modeling
- the language models need to be tested and improved
- we need to integrate the book-adaptive recognition tools into the Python code
- Unicode support needs to be integrated into the Python loops
- the main loop of the RAST layout analysis will be rewritten in Python
- there will be some new layout analysis that works for distorted pages
- we need to integrate our orientation detection and text/image segmentation code
Resources
- OCRopus Mailing List (subscribe / contribute)
- OCRopus Group Pages (add your contributions here)
- User-contributed links and resources (add links here)
Related Projects
- iulib Library (you need to install this)
- hOCR Tools -- tools for manipulating OCR output
- DECAPOD -- camera-based document capture and tagged PDF generation
- PyOpenFST -- Python bindings for OpenFST (for language modeling)
Documentation
The following is the most important documentation:
- Release Notes -- summary information about releases
- Development Install -- how to install the development version of OCRopus
- Using -- some information about how to use OCRopus
- Training -- how to train OCRopus
- Publications -- information about algorithms
If you want to contribute to the primary documentation, please check out the wiki repository (hg clone https://wiki.ocropus.googlecode.com/hg) and submit patches against the documentation.
Additional links you may find useful are here:
- C++ Programming -- extending OCRopus in C++
- C++ Coding Conventions -- memory management, pointers, naming, formatting
- File Formats -- file formats used by OCRopus
- Book-Level Representation -- directory layout for whole book recognition
- hOCR Output Format -- (X)HTML-compatible OCR output format
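As a rough illustration of what "(X)HTML-compatible" means in practice, the sketch below reads line bounding boxes out of a small hOCR fragment using only the Python standard library. It relies on the hOCR convention that a text line is an element with class "ocr_line" whose title attribute carries a "bbox x0 y0 x1 y1" property. The names HocrLineParser, parse_bbox, extract_lines and the SAMPLE markup are hypothetical and written for this example; none of them are part of OCRopus or the hOCR tools.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Hypothetical helper: collect (bbox, text) for each ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished (bbox, text) tuples
        self._bbox = None    # bbox of the line currently being read
        self._depth = 0      # nesting depth inside the current line
        self._text = []      # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # Already inside a line: track nested elements such as <b>.
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            bbox = parse_bbox(attrs.get("title") or "")
            if bbox is not None:
                self._bbox = bbox
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # The ocr_line element itself has closed.
            self.lines.append((self._bbox, "".join(self._text).strip()))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def extract_lines(hocr):
    """Parse an hOCR string and return a list of (bbox, text) tuples."""
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Made-up sample markup, not output from an actual OCRopus run.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 40">Hello <b>world</b></span>
  <span class="ocr_line" title="bbox 10 50 280 70">second line</span>
</div>
"""

if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

Because hOCR is ordinary (X)HTML, a generic HTML parser is enough for this kind of extraction. The hOCR Tools listed under Related Projects are the better choice for real work.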
Bugs / Issues / Enhancements
Please use the "Issues" tab above to submit bugs, feature requests, etc.
When submitting bug reports, please keep the following in mind:
- include the OCRopus version or hg changeset, the OS version, and the compiler version
- attach sample images that fail (tag the issue with SampleImage if you attach an image)
- include a stack trace from GDB if you can get one
Until the beta release (version 0.5) we mainly care about "big stuff" bug reports and failing documents; minor compile issues or cross-platform issues don't matter that much right now. For the time being, please also report recognition failures only on fairly clean scanned documents.
Contributing
If you want to contribute code to OCRopus, or if you have a patched version or variant, please use Google's Server Side Clone Support for Mercurial. You can maintain your own variant, add experimental features, etc., and share your patches/changes easily with others even if we haven't incorporated them into the main branch yet.
Acknowledgements
The system combines the work of many contributors and previous projects. The core developers work at the IUPR research group at the DFKI and gratefully acknowledge funding by Google.