Code license: Apache License 2.0
Labels: OCR, Utility, CPlusPlus, Google

Background

The Tesseract OCR engine was one of the top three engines in the 1995 UNLV Accuracy test. It saw little development between 1995 and 2006, but it is probably one of the most accurate open source OCR engines available. The engine reads a binary, grey, or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images; libtiff can be added to read compressed ones.
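Because only uncompressed TIFF input works without libtiff, it can help to check a file's compression before passing it to the engine. The following is a minimal sketch, not part of Tesseract: it reads the standard TIFF Compression tag (259) from the first image directory of a classic TIFF file. The function names tiff_compression and needs_libtiff are hypothetical, and the sketch assumes that "uncompressed" corresponds to a tag value of 1; the built-in reader may have other limits that this check does not cover.

```python
import struct


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag (259) of the first image in a classic TIFF.

    A value of 1 means uncompressed. TIFF 6.0 defines 1 as the default,
    so 1 is also returned when the tag is absent.
    """
    if len(data) < 8:
        raise ValueError("too short to be a TIFF")
    order = data[:2]
    if order == b"II":        # little-endian ("Intel") byte order
        endian = "<"
    elif order == b"MM":      # big-endian ("Motorola") byte order
        endian = ">"
    else:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Header: 2-byte magic number (42), then 4-byte offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: bad magic number")

    # The IFD starts with a 2-byte entry count, followed by 12-byte entries:
    # tag (2 bytes), field type (2), value count (4), value or offset (4).
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i
        tag, _field_type, _count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == 259:
            # A single SHORT is stored left-justified in the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1


def needs_libtiff(data: bytes) -> bool:
    """Hypothetical helper: True if the image uses any compression scheme."""
    return tiff_compression(data) != 1
```

For a file on disk, the bytes can be read with open(path, "rb").read() and passed to needs_libtiff; a True result suggests building with libtiff or converting the image to uncompressed TIFF first.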

Important Download Information:

The language data files are separate from the code!

See the ReadMe wiki for installation and usage information!

Additional installation and usage information can be found in the FAQ wiki.

Important License Note

All of the code is licensed under the Apache 2.0 License except tesseractTrainer.py, which is licensed under the GPL.

Supported Platforms

The developers are regularly testing on the following platforms:

Additionally, we believe the code should run on these other platforms, but we don't have the resources to test on them regularly:

People have reported success with Cygwin on Windows, but this is not a tested platform.

If you're interested in supporting other platforms or languages, please get in touch with Ray Smith.

Roadmap

Version 2.04 release is now available for download and contains the following new features:

1, 63, 67, 71, 76, 79, 81, 82, 84, 106, 108, 111, 112, 128, 129, 130, 133, 135, 142, 143, 145, 146, 147, 153, 154, 160, 165, 169, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209.

The release candidate will be available from the downloads page soon, after further testing.

Please read the ReadMe before going to Downloads, as you need more than one file. Even the Windows executables tarball is incomplete, because language files are required.

The upcoming 3.00 release will probably include:

Core Developers

The core developer on the project is Ray Smith (theraysmith).

Thomas Breuel (tmbdev) and Ilya Mezhirov (mezhirov) work on the OCRopus project, for which Tesseract is one of the pluggable OCR engines; OCRopus also provides layout analysis and statistical language modeling.

Most of the work on Tesseract is sponsored by Google.

Migration

As you have probably noticed, the Tesseract project has migrated from SourceForge to Google hosting. We were actually happy with SourceForge hosting, but since we needed to move from CVS to Subversion anyway, it seemed to make sense to move to Google hosting at the same time. We had planned on announcing the migration first and spending some time on it, but it turned out to be so quick and easy that we were done the same day.

If you have questions or concerns about this migration, please contact Ray Smith.

Google hosting is functionally similar to SourceForge. The major difference is that there is no discussion forum, so we have set up a Google group for discussion at http://groups.google.com/group/tesseract-ocr. Please report bugs through the Issues tab above, not through the group.