About

OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel high-performance layout analysis methods.

News

2011-10 Refactoring is nearly complete, with OCRopus now divided into a number of well-defined native code modules (ocrorast, ocrolseg, ocrofst) and high-level Python code (ocropy).

The top-level repository (ocropus) can now be checked out on its own and should contain everything needed to build OCRopus.

2011-05 There has been significant refactoring and cleanup over the last year.

  • OCRopus is now effectively usable as a NumPy library with native NumPy arrays
  • most of the APIs are documented through the Python interfaces
  • Unicode and ligature support largely works
  • all recognition can now be carried out from Python
  • there are top-level commands for recognition and training written in Python
  • classifiers can now cope with large character sets
  • there are tools for clustering and correcting character shapes
  • there is support for ligatures
  • there are numerous bug fixes
  • training is possible on very large datasets (many millions of samples)

Plans

What remains to be done before the next official release:

  • remove a lot of unused C++ code and consolidate iulib and ocropus C++ code (DONE)
  • factor out some C++ and Python libraries into separate projects (DONE)
  • retraining of all the models
  • the language models need to be tested and improved

Next steps:

  • integration of book-adaptive recognition tools into the Python code
  • integration of new layout analysis that works for distorted pages
  • integration of new orientation detection
  • integration of new text/image segmentation code

Resources

Related Projects

  • iulib Library (you need to install this)
  • hOCR Tools -- tools for manipulating OCR output
  • DECAPOD -- camera-based document capture and tagged PDF generation
  • PyOpenFST -- Python bindings for OpenFST (for language modeling)

Documentation

The following is the most important documentation:

  • Release Notes -- summary information about releases
  • Development Install -- how to install the development version of OCRopus
  • Using -- some information about how to use OCRopus
  • Training -- how to train OCRopus
  • Publications -- information about algorithms

If you want to contribute to the primary documentation, please check out the wiki repository (hg clone https://wiki.ocropus.googlecode.com/hg) and submit patches against the documentation.

Bugs / Issues / Enhancements

Please use the project's issue tracker (the "Issues" tab) to submit bugs, feature requests, etc.

When submitting bug reports, please keep the following in mind:

  • include the OCRopus version/hg changeset, OS version, and compiler version
  • attach sample images that fail (tag with SampleImage if you attach an image)
  • include a stack trace from GDB if you can get one
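
The environment details in the list above can be gathered with a short script. The following is only a sketch and not part of OCRopus: the function names `run_or_unknown` and `collect_report_info` are made up for illustration, and it assumes that `hg` is on your PATH and that `g++` is the compiler you built with (substitute your own if not).

```python
import platform
import subprocess


def run_or_unknown(cmd):
    """Return the first line of a command's output, or 'unknown' if it fails."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        # Command not installed, not executable, or timed out.
        return "unknown"
    lines = out.stdout.strip().splitlines()
    return lines[0] if out.returncode == 0 and lines else "unknown"


def collect_report_info(repo_dir="."):
    """Gather the environment details a bug report should mention."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        # `hg id -i` prints the changeset hash of the working directory;
        # -R selects the repository to inspect.
        "hg_changeset": run_or_unknown(["hg", "id", "-i", "-R", repo_dir]),
        # Assumption: g++ is the compiler in use.
        "compiler": run_or_unknown(["g++", "--version"]),
    }


if __name__ == "__main__":
    for key, value in collect_report_info().items():
        print(f"{key}: {value}")
```

Run it from inside your OCRopus checkout and paste the output into the issue. Any value that cannot be determined is reported as "unknown" rather than raising an error.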

Until the beta release (version 0.5) we mainly care about "big stuff" bug reports and failing documents; minor compile issues or cross-platform issues don't matter that much right now. For the time being, please also report only recognition failures on fairly clean scanned documents.

Contributing

If you want to contribute code to OCRopus, or if you have a patched version or variant, please use Google's Server Side Clone Support for Mercurial. You can maintain your own variant, add experimental features, etc., and share your patches/changes easily with others even if we haven't incorporated them into the main branch yet.

Acknowledgements

The system combines the work of many contributors and previous projects. The core developers work at the IUPR research group at the DFKI and gratefully acknowledge funding by Google and the BMBF TextGrid project.
