Background
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. It saw little development between 1995 and 2006, but it is probably still one of the most accurate open source OCR engines available. The engine reads a binary, grey or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images; libtiff can be added to read compressed images.
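As a rough illustration of the image-in, text-out flow described above, here is a minimal Python sketch that drives the command-line program. It assumes the tesseract executable is on the PATH, that the English language data is installed, and that the usual "tesseract imagename outputbase -l lang" invocation applies. The helper names build_command and ocr_image and the file name page.tif are hypothetical and not part of the project.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="eng"):
    """Return the argv list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognized text.

    Assumes the executable writes its result to <output_base>.txt.
    Raises FileNotFoundError if tesseract is not found on PATH.
    """
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With an uncompressed TIFF named page.tif in the current directory, ocr_image("page.tif", "page") would write page.txt and return its contents. Building the argument list separately keeps the command easy to inspect without running the engine.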
Important Download Information:
The language data files are separate from the code!
See the ReadMe wiki for installation and usage information!
Additional installation and usage information can be found in the FAQ wiki.
Important License Note
All of the code is licensed under the Apache 2.0 License EXCEPT tesseractTrainer.py, which is licensed under the GPL.
Supported Platforms
The developers regularly test on the following platforms:
- Ubuntu 6.06 (x86/32, x86/64)
- Ubuntu 6.10 (x86/32, x86/64)
- Windows (x86/32) with Visual C++ Express 2008
Additionally, we believe the code should run on these other platforms, but we don't have the resources to test on them regularly:
- recent Linux distributions (x86/32, x86/64)
- Mac OS X (x86, PPC)
People have reported success with Cygwin on Windows, but this is not a tested platform.
If you're interested in supporting other platforms or languages, please get in touch with Ray Smith.
Roadmap
Version 2.04 is now available for download and includes the following changes:
- Many reported issues fixed, especially portability issues:
1, 63, 67, 71, 76, 79, 81, 82, 84, 106, 108, 111, 112, 128, 129, 130, 133, 135, 142, 143, 145, 146, 147, 153, 154, 160, 165, 169, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209.
- Compiles in VC++ Express 2008 "out of the box"
- Java debugging viewer builds and runs correctly.
The release candidate will be available from the downloads page soon, after further testing.
Please check out the ReadMe before going to Downloads, as you need more than one file. Even the Windows executables tarball is incomplete on its own, because the language files are also required.
The upcoming 3.00 release will probably include:
- Page layout analysis.
- Automatic page orientation and script detection capability.
- Special modes for single column, line, word and even character.
- Improved API ready for thread-safety.
- Many more languages, including Chinese.
Core Developers
The core developer on the project is Ray Smith (theraysmith).
Thomas Breuel (tmbdev) and Ilya Mezhirov (mezhirov) work on the OCRopus project, for which Tesseract is one of the pluggable OCR engines; OCRopus also provides layout analysis and statistical language modeling.
Most of the work on Tesseract is sponsored by Google.
Migration
As you have probably noticed, the Tesseract project has migrated from SourceForge to Google hosting. We were actually happy with SourceForge hosting, but since we needed to move from CVS to Subversion anyway, it seemed to make sense to move to Google hosting at the same time. We had planned on announcing the migration first and spending some time on it, but it turned out to be so quick and easy that we were done the same day.
If you have questions or concerns about this migration, please contact Ray Smith.
Google hosting is functionally similar to SourceForge. The major difference is that there is no discussion forum, so we have set up a Google group for discussion at http://groups.google.com/group/tesseract-ocr. Please report bugs through the Issues tab rather than the group.