OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
Documentation and Resources
- DocumentationIndex -- "official" documentation
- OcropusWiki -- user contributed documentation, language support, etc.
- and Release Notes -- features of past and upcoming releases
- Talks and Slides about OCRopus
- ContributingToOcropus -- how you can contribute
- OCRopus Group -- discussion group and mailing list
Please have a look at the FrequentlyAskedQuestions.
Installation
You can install the 0.1.1 (alpha) release: GettingStartedWithAlpha
You can also install the Subversion release: GettingStartedWithBleedingEdge
Reporting Bugs
When submitting bug reports, please keep the following in mind:
- include information about your system configuration in your bug report:
  - operating system name, distribution, and version
  - compiler name and version
  - any other potentially relevant version information (e.g., if you have an image I/O problem, the versions of libjpeg, libpng, and libtiff you're using)
- if you're reporting poor recognition rates or errors on some page image, please include the page image itself, preferably with a text or hOCR file containing the correct output
- submit Tesseract bugs to Tesseract
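The system configuration details requested above can be gathered with a small script. The following is a minimal sketch, not part of OCRopus: the helper names `first_line` and `system_report` are hypothetical, and the compiler list (`gcc`, `g++`, `clang`) is an assumption that should be adjusted to whatever compiler was actually used for the build. It does not report the distribution name or image library versions, so those still need to be added by hand.

```python
import platform
import shutil
import subprocess


def first_line(cmd):
    """Return the first output line of a command, or None if it cannot be run."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def system_report(compilers=("gcc", "g++", "clang")):
    """Collect OS details and compiler versions (compiler names are assumptions)."""
    report = {
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "machine": platform.machine(),
    }
    for name in compilers:
        report[name] = first_line([name, "--version"]) or "not found"
    return report


if __name__ == "__main__":
    # Print one "key: value" line per item, ready to paste into a bug report.
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Running the script prints one line per item, which can be pasted at the top of a bug report.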
Background
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google, and the system is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
Related Standards and Projects
- You can find information about many of the algorithms used by the system at the IUPR Publication Server
- Tesseract is currently used as the character recognition engine (additional engines are in development)
- Output of the system is in HTML format, with embedded OCR-specific information
- Coding conventions for the project are here.
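To illustrate what "HTML with embedded OCR-specific information" can look like in practice, here is a minimal sketch that reads text lines and their bounding boxes from hOCR-style markup using only the Python standard library. It assumes the common hOCR convention of marking lines with `class="ocr_line"` and storing `bbox x0 y0 x1 y1` in the `title` attribute; the exact markup produced by a given OCRopus release may differ. The sample document and the names `HocrLineParser`, `parse_bbox`, and `extract_lines` are made up for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from a title such as 'bbox 10 20 300 60; ...'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) pairs for elements whose class includes 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (bbox, text) tuples
        self._tag = None   # tag name of the line element currently open
        self._depth = 0    # nesting depth of same-named tags inside the line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._tag is not None:
            # Inside a line: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            self._tag, self._depth = tag, 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if self._tag is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            # Normalize whitespace and store the completed line.
            text = " ".join("".join(self._text).split())
            self.lines.append((self._bbox, text))
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def extract_lines(hocr):
    """Return a list of (bbox, text) tuples, one per recognized line."""
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hypothetical sample input, not actual OCRopus output.
SAMPLE = """<html><body><div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 60'>Hello <b>OCR</b> world</span>
<span class='ocr_line' title='bbox 10 70 280 110'>second line</span>
</div></body></html>"""

if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

Because the output is ordinary HTML, it can be viewed in a browser as well as processed by scripts like this one.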
Acknowledgements
The system combines the work of many contributors and previous projects. The core developers work at the IUPR research group at the DFKI and gratefully acknowledge funding by Google.
