Background
The Tesseract OCR engine was one of the top three engines in the 1995 UNLV Accuracy test. Little work was done on it between 1995 and 2006, but it is probably still one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A built-in TIFF reader handles uncompressed TIFF images, and libtiff can be added to read compressed images.
Important Change!
The data files are now separate from the code! See ReadMe or ReleaseNotes wikis for more information.
Supported Platforms
The developers are regularly testing on the following platforms:
- Ubuntu 6.06 (x86/32, x86/64)
- Ubuntu 6.10 (x86/32, x86/64)
- Windows (x86/32)
Additionally, we believe the code should run on these other platforms, but we don't have the resources to test on them regularly:
- recent Linux distributions (x86/32, x86/64)
- Mac OS X (x86, PPC)
If you're interested in supporting other platforms or languages, please get in touch with Ray Smith.
Roadmap
Version 2.00 is now available and contains the following new features:
- Support for English, French, Italian, German, Spanish, Dutch
- Scripts to test accuracy against the original 1995 tests run by UNLV (see TestingTesseract)
- Ability to train in other languages and scripts (see TrainingTesseract)
Please read the ReleaseNotes before going to Downloads, because you now need more than one file.
We are considering the following features for upcoming releases:
- ground truth data release
- integration with OCRopus, to support layout analysis
- integration with Leptonica, to support layout analysis and more image formats
- support for even more languages
- high-resolution character shape modeling for improved recognition rates
- a GUI frontend (again, probably shared with OCRopus)
Core Developers
The core developer on the project is Ray Smith (theraysmith).
Thomas Breuel (tmbdev) and Ilya Mezhirov (mezhirov) work on the OCRopus project, for which Tesseract is one of the pluggable OCR engines; OCRopus also provides layout analysis and statistical language modeling.
Most of the work on Tesseract is sponsored by Google.
Migration
As you have probably noticed, the Tesseract project has migrated from SourceForge to Google hosting. We were actually happy with SourceForge hosting, but since we needed to move from CVS to Subversion anyway, it seemed to make sense to move to Google hosting at the same time. We had planned on announcing the migration first and spending some time on it, but it turned out to be so quick and easy that we were done the same day.
If you have questions or concerns about this migration, please contact Ray Smith.
Google hosting is functionally similar to SourceForge. The major difference is that there is no discussion forum, so we have set up a Google group for discussion at http://groups.google.com/group/tesseract-ocr. Please report bugs through the Issues tab above rather than in the group.
