|
ReadMe
Important information all Tesseract users need to know.
Introduction
This package contains the Tesseract Open Source OCR Engine. Originally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado, all the code in this distribution is now licensed under the Apache License:

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Other Dependencies and Licenses
Installing and Running Tesseract
All Users Do NOT Ignore!
The tarballs are split into pieces.

tesseract-2.04.tar.gz contains all the source code.

tesseract-2.01.<lang>.tar.gz contains the language data files for <lang>. You need at least one of these or tesseract will not work.

Note that tesseract-2.04.tar.gz unpacks to the tesseract-2.04 directory. tesseract-2.01.<lang>.tar.gz unpacks to the tessdata directory, which belongs inside your tesseract-2.04 directory. It is therefore best to download them into your tesseract-2.04 directory, so you can use "unpack here" or equivalent. You can unpack as many of the language packs as you care to, as they all contain different files. Note that if you are using make install you should unpack your language data to your source tree before you run make install. If you unpack them as root to the destination directory of make install, then the user ids and access permissions might be messed up.

boxtiff-2.01.<lang>.tar.gz contains data that was used in training, for those that want to do their own training. Most users should NOT download these files.

Instructions for using the training tools are documented separately at TrainingTesseract and for testing at TestingTesseract.

Without Additional Libraries, Image Format Support is Limited!
Without additional libraries, Tesseract can only read uncompressed TIFF (and some versions of BMP). Up to version 2.04, you can add libtiff-dev. See the FAQ question on compressed TIFF for installation instructions. Version 3.00 will support additional formats via Leptonica, but requires more libraries to be added.

Windows:
There is no Windows installer! (Still looking for volunteers to create one.) There are Windows executables: tesseract-2.04.exe.tar.gz (It is not for the 'exe' language.) They are built with VC++ Express 2008 and come with absolutely no warranty. If they work for you then great, otherwise get Visual C++ Express 2008 with service pack 1 and build from the source.
You can also try tesseract-2.01.exe.tar.gz, which is built with VC++6, and may work better if your Windows is old, but note that this is an older version of Tesseract.

If you are building from the sources, there are still (up to v2.04) .dsw and .dsp files for VC++6, but the recommended build platform is now VC++ Express 2008. There are also .sln and .vcproj files for VC++ Express 2008, but these files are not backward compatible with any previous version, not even VC++ Express 2005. Note that the executables produced with the newer compiler are smaller, faster, and, believe it or not, more accurate. (See TestingTesseract.)

New with 2.04: the executables are built with static linking, so they stand more chance of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata directory. (The Visual Studio projects build the release executable directly to the correct place!) The command line is:

    tesseract <image.tif> <output> [-l <langid>]

For interfacing to other applications, there is a DLL included with the executables, but you may be better off building it yourself. The DLL is NOT built for static C-Runtime, so you will probably need VC++ Express 2008 to run it. The DLL has been updated to allow input of non-binary images. (Thanks to Glen of Jetsoft.)

Non-Windows (or Cygwin):
You have to tell Tesseract through a standard unix mechanism where to find its data directory. You must either run:

    ./configure
    make
    make install

to move the data files to the standard place, or:

    export TESSDATA_PREFIX="directory in which your tessdata resides/"

In either case the command line is:

    tesseract <image.tif> <output> [-l <langid>]

New: there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for the help.) It might work with your OS if you know how to do that.

If you are linking to the libraries, as Ocropus does, there is now a single master library called libtesseract_full.a.
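As an illustration of the TESSDATA_PREFIX option above, here is a minimal shell sketch. The helper name tessdata_prefix_for and the example path are assumptions made for this illustration, not part of Tesseract. It only checks that a tessdata directory exists under the given parent and prints the parent with a trailing slash, in the form the example above uses.

```shell
# Sketch: compute a TESSDATA_PREFIX value for Tesseract 2.04, i.e. the
# directory that CONTAINS tessdata, ending in "/".
# tessdata_prefix_for is a hypothetical helper, not part of Tesseract.
tessdata_prefix_for() {
    parent=${1%/}                      # drop one trailing slash, if any
    if [ ! -d "$parent/tessdata" ]; then
        echo "no tessdata directory under $parent" >&2
        return 1
    fi
    printf '%s/\n' "$parent"           # value ends in "/" as in the example above
}

# Usage (the path is an example only):
#   TESSDATA_PREFIX=$(tessdata_prefix_for "$HOME/tesseract-2.04") && export TESSDATA_PREFIX
#   tesseract image.tif output -l eng
```

Failing early when tessdata is missing is a design choice of the sketch; Tesseract itself only reports the problem when it tries to load the language files.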
Libtiff support should now be properly working via configure, but note that you need libtiff-dev, as that contains the header files required to compile the code that uses it.

Installation Notes - 3.00 Prerelease
General
BIG CAVEAT
It is 23:00 on Friday and I have finished uploading the source to svn, but not yet had a chance to check that I got it right. This code compiled on my Linux and Windows systems before I checked it in to svn, but the svn code itself is untested!

IMPORTANT: 3.00 is not backwards compatible with 2.04. The data files are different. (Single file per language among other things.) You therefore need to make sure you connect your new executable with the new data files. A lot of these are checked in to svn, but due to the large download size the Chinese, Japanese, and Korean ones are not there. (They were there briefly at rev 309.)

The command line is the same as it was before:

    tesseract <image> <outputbasename> [-l lang] [configs]

with the one change that "old" and "new" configs files may now be mixed arbitrarily, since the old configs are no more.

In the executable, page layout analysis is enabled by default. You may need to turn it off to process small images. No command-line control for this yet. Sorry. See tesseractmain.cpp.

The training process is not yet complete. A new executable is needed to repackage the 8 data files into a single file. Called combine_tessdata, there is no vcproj nor Makefile.am entry for it, but the source is included.

The dll isn't properly working either. The BaseAPI is equipped with a dllexport for Windows. I strongly recommend all new dll use to go through the BaseAPI where possible, as this is most likely to keep working in future versions as we move towards thread-safety.

Linux
If they are not already installed, you need the following libraries:

    sudo apt-get install libpng12-dev
    sudo apt-get install libjpeg62-dev
    sudo apt-get install libtiff4-dev
    sudo apt-get install zlib1g-dev

You also need to install leptonica.
There is an apt-get package (name unknown), or the sources are at http://www.leptonica.org/. The instructions at http://www.leptonica.org/source/README.html are clear, but basically it is the usual:

    ./configure
    make
    sudo make install

Now back to Tesseract. Download the source from svn:

    svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only

The same build process as usual applies:

    ./configure
    make

Between configure and make, you can check that everything has worked by looking at config_auto.h. It should contain

    #define HAVE_LIBLEPT 1

and also HAVE_LIBPNG, HAVE_LIBTIFF, HAVE_LIBJPEG and HAVE_ZLIB.

If you aren't doing a make install (this is an alpha release), you will probably need to use:

    export TESSDATA_PREFIX=/some/path/tessdata

to point to your tessdata directory.

The command line is the same as it was before:

    tesseract <image> <outputbasename> [-l lang] [configs]

with the one change that "old" and "new" configs files may now be mixed arbitrarily, since the old configs are no more.

In the executable, page layout analysis is enabled by default. You may need to turn it off to process small images. No command-line control for this yet. Sorry.

Windows
Download the source from svn:

    svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only

The same build process as usual applies: open tesseract.sln with VC++ Express 2008 and build all (or just Tesseract). It should compile (in at least release mode) without having to install anything further. The dll dependencies and Leptonica are included. (When the release is final, all the Windows-specific parts will be in their own download.)

With the full svn download, it should just run immediately after building:

    tesseract <image> <outputbasename> [-l lang] [configs]

For debug mode, you will have to copy the tessdata directory and all the dlls in the top-level directory (except tessdll.dll) to bin.dbg. There are no separate debug versions of these dlls.
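For the Linux build described above, the check of config_auto.h between configure and make could be scripted roughly as follows. This is only a sketch: the function name missing_config_macros is invented for the illustration, and it assumes each macro appears on a line of the form "#define NAME 1", as in the HAVE_LIBLEPT example.

```shell
# Sketch: report which HAVE_* macros are not defined to 1 in config_auto.h.
# missing_config_macros is a hypothetical helper name.
missing_config_macros() {
    header=$1
    for macro in HAVE_LIBLEPT HAVE_LIBPNG HAVE_LIBTIFF HAVE_LIBJPEG HAVE_ZLIB; do
        # Match "#define NAME 1" with any amount of whitespace between fields.
        if ! grep -Eq "^#define[[:space:]]+${macro}[[:space:]]+1" "$header"; then
            echo "$macro"
        fi
    done
}

# Usage, run from the source tree between ./configure and make:
#   missing_config_macros config_auto.h
# No output means all five macros were found.
```

Any macro name it prints points at a library that configure did not detect.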
History:
The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and under Windows with VC++6 and VC++Express.

The C++ code makes heavy use of a list system using macros. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to debug. Another "feature" of the C/C++ split is that the C++ data structures get converted to C data structures to call the low-level C code. This is ugly, and the C++izing of the C code is a step towards eliminating the conversion, but it has not happened yet.

The most recent change is that Tesseract can now recognize 6 languages, is fully UTF8 capable, and is fully trainable. See TrainingTesseract for more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. See http://www.isri.unlv.edu/downloads/AT-1995.pdf. With Tesseract 2.00, scripts are now included to allow anyone to reproduce some of these tests. See TestingTesseract for more details.

Directory Structure (ordered by dependency):
ccmain Top-level code. The main program resides in tesseractmain.cpp.
training Top-level code for training tools.
testing Set of testing scripts and root of hierarchy of results and error reports.
display An "editor" to view and operate on the internal structures.
(Requires a working viewer - batteries not included.)
wordrec The word-level recognizer.
textord The module that organizes(orders) text into lines and words.
classify The low-level character classifiers.
ccstruct Classes to hold information about a page as it is being processed.
viewer The client side of a client server viewing system.
Unfortunately, at this time, the server side is not available.
image Image class and processing functions.
dict Language model code.
cutil Code for file I/O, lists, heaps etc, from the old C code.
ccutil Somewhat newer code for lists, memory allocation etc from the
old C++ code.

About the Engine
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT FORMATTING, and NO UI. It can only process an image of a single column and create text from it. It can detect fixed pitch vs proportional text. Having said that, in 1995, this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Training code IS included in the open source release, however, for those willing to try. |
i get an error message: Could not open file, -1. my command is: tesseract test1.gif output -1. i have the eng.<files> in a folder called tesserdata in the folder that contains the exe.
when i run the command in the tsserdata folder i get an error: Unable to load unicharset file C:/contract/visumatic/FreeOCR/tessdata/tessdata/eng.unicharset
can i get some help?
i get the same error as above on Ubuntu 7.04:
/usr/local/bin/tesseract test.tif out.txt
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
You need to download the language charset that you wish to use from the download button above. Extract it to the directory in which your tesseract executable resides and you shall have that error no longer.
Help please, I'm currently trying to link tessdll to a c#-project. In C# I declared
static public extern int TessDllBeginPageUprightBPP(UInt32 xsize, UInt32 ysize, ref byte b, string lang, uint bpp);
When I call tessdll my application simply closes. Do I need to initialize tessdll somehow?
Many Thanks!
To mcour...@mindspring.com and tom.garvin:
your command is wrong. the option is not the number one (-1). the right thing to type is the lowercase letter L (-l).
On windows I found this really easy to use, here are the steps with the Nov 07 version:
1) download tesseract-2.01.exe.tar.gz and tesseract-2.00.eng.tar.gz
2) extract these files into the same folder (7-zip or whatever expanding software you prefer)
3) open a command window for this folder, where the tesseract.exe file is located.
4) prep a tiff image, in my case I took a digital picture of a book, tweaked it in photoshop and saved as a tiff with no compression. You could do the same with the Gimp.
5) now I put the tiff image into the same folder and then in the command window invoke the operation 'tesseract.exe MyImage.tif MyImageConverted -l eng'
6) the process runs in the background for a few seconds and then a new text-file appears with the name 'MyImageConverted.txt'.
On windows, using VC++Express, when enabling libtiff, I had to take two additional steps:
1) add HAVE_CONFIG_H to the preprocessor definitions
2) create an empty config_auto.h file
This is because HAVE_LIBTIFF is between HAVE_CONFIG_H in file tesseractmain.cpp. For the rest it works fine.
I have some JPG and BMP files. What utilities can I use to convert these files to TIF files that TESSERACT will recognize? I used the Paint program provided with my Windows XP, but the TIF file it created was not recognized by TESSERACT.
Here is the log file:
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
Tessedit:Error:Read of file failed:number.tif
Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
I extracted some English characters and numbers from a scanned document, but the recognition results were not very good. The accuracy was only about 50%. In fact, these characters were very easy for a human to recognize. So I think there must be some problem.
1. Should I normalize the characters to a specific size before recognition? What's the best width and height of a character for recognition?
2. Will the space between characters affect the recognition performance?
To: samlal...@yahoo.com, MS Paint uses LZW compression. Try IrfanView (google it), you can save a file to TIFF and choose the compression. Choose no compression and it will work with Tess. You also may need to reduce the color depth, you can do that in IrfanView as well.
It's important to have a big image of the text (in my case, a character size of 20x20 pixels works right); otherwise the result will be blank. With an image that was too small, I resized it with photoshop (to 300%), and the text was recognized without problems.
The readme says tesseract scans only a single column, but is that limited to only a single page with one column? If not, it's not working. If so, then I may have something useful for others. I wrote a script to scan multipage TIFFs using tesseract. If anyone wants a copy, just e-mail me.
While on the topic, I have a script (intended to run as a cron job) for finding all .tif files and tesseractizing them to help in later content-based searching.
Regards,
Jim
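A batch run of the kind Jim describes might look roughly like the sketch below. It is not Jim's script: the names ocr_dir and OCR_CMD are invented here, and it assumes every page is already a separate single-page .tif file in one directory. OCR_CMD defaults to tesseract and exists only so a different command can be substituted.

```shell
# Sketch: run Tesseract on every .tif file in one directory.
# ocr_dir and OCR_CMD are hypothetical names; OCR_CMD defaults to "tesseract".
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue          # the directory had no .tif files
        base=${img%.tif}
        # Tesseract appends .txt to the output base name itself.
        "${OCR_CMD:-tesseract}" "$img" "$base" -l "$lang" || echo "failed: $img" >&2
    done
}

# Usage (the directory is an example only):
#   ocr_dir /home/jim/scans eng
```

Each page.tif ends up with a page.txt next to it, which keeps later content-based searching simple.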
On Ubuntu, it won't find the data files unless you do this:
That said, there are still problems, e.g. many variables in box.config are not found.
This is good stuff! Thanks, The Ray Smith!
A question: it seems like unrecognized characters get replaced by spaces in the output ascii. If this is true, is there a simple way to use some other character, like ~ ?
what is the script for doing multipage tiffs?
I've added my experiences of using Tesseract here:
http://www.scribd.com/doc/2589070/how-to-scan-books-to-text-files
It is from a very non-expert Windows perspective, so might be of use to some people... Please feel free to add any part of it to the documentation or wiki etc.
you must remove alpha channel from TIFF !!
for all the people getting the error:
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
or some variation thereof, you'll find that the eng.unicharset and most of the other files in the tessdata directory have size 0 kb. confusingly, they seem to have been put there as place holders. you need to download and install the various language packs separately.
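A quick way to confirm the zero-size placeholder files described above is sketched below. The function name empty_lang_files is made up for this illustration; the directory in the usage line is the one from the error message.

```shell
# Sketch: list zero-byte files for one language in a tessdata directory.
# empty_lang_files is a hypothetical helper name.
empty_lang_files() {
    dir=$1
    lang=${2:-eng}
    for f in "$dir/$lang".*; do
        [ -f "$f" ] || continue        # no files for this language at all
        [ -s "$f" ] || echo "$f"       # -s is true only for non-empty files
    done
}

# Usage:
#   empty_lang_files /usr/local/share/tessdata eng
```

If it prints any file names, the language pack for that language still needs to be downloaded and installed.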
How can i add libTIFF support under Linux environment?
barendgehrels is right. For installing libtiff - follow the original instructions and then do the following:
1) add HAVE_CONFIG_H to the preprocessor definitions
2) create an empty config_auto.h file
I have a bunch of documents, all the same size, with a field that is overwritten with a pattern - (perhaps to foil ocr..). How do I go about removing that pattern before attempting ocr? I can send a sample of the field.
I can not load the executable libtiff from http://gnuwin32.sourceforge.net/packages/tiff.htm
thanks
Can you add the comments from jhearn and m4rtin.m to the main documentation ?
I see that eng.unicharset is not included in the latest zip file (again). I grabbed one from an older zip file, but it appears to not be compatible.
Okay, i didn't have the lang file in the directory before i did make install, what should i do?
I am trying to train tesseract to recognize a different language (Malayalam, an Indian language). Among the 8 files needed, I couldn't generate the freq-dawg and word-dawg files. The following error is displayed:
Building DAWG from word list in file, '/home/ind/tesseract/mal/tessdata/malfreq'
Compacting the DAWG
Segmentation fault
Can anyone please suggest a solution to fix this error?
got mine to work after i changed the extension from .tiff to .tif ! doh!
still the output was not good enough to be recognizable. tesseract got confused by lines that were not aligned (because they belonged to a different article on the same page). It read the lines on the left of the page but not the lines on the right-hand side of the page (because they belonged to a different article and therefore slightly offset).
I'm also having problems with the charset files:
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I downloaded them from an older release (they were not included in this one) but no luck :(
Wikipedia says "Please note that the website at www.libtiff.org is a hijacked domain and while it now points to the real site for current development at www.remotesensing.org, the libtiff.org site still shows the latest version as 3.6.1, which is not correct. It also has an incorrect address for the Libtiff mailing list."
If that's the truth, it might be better to link to remotesensing instead of libtiff.org.
same problem here:
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I am trying to get tessnet2 working but in vain
same Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset problem
downloaded the eng pack but still no luck
I was able to build a tesseract executable using Visual Studio 2005. I extracted a page from a PDF file using .NET and created a TIF file to contain the image. I have installed Libtiff support and built my tesseract executable.
Tesseract appears to run without error when I run it against my Tiff file. Although, the output file contains the following characters "S.¤,SQ,Vi(< G u¤,¤n.<<d 6". These are not the text found in the Tiff file I created.
Does anyone know what I am doing wrong?
how do i get this to run on a macbook?
Could somebody not work on a UI for this?
for a UI, you can go through this thread. it is a C#.NET wrapper on tesseract-ocr that is working fine. http://groups.google.com/group/tesseract-ocr/browse_thread/thread/d80a3989c5c0931f#
I need help: how can I use it for a language other than English?
Does anyone have a work around/fix for the Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset issue that people have posted earlier?? I just installed it yesterday and am unable to move ahead at all....
I downloaded each of the eng. files from http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/ and the problem disappeared. If you list the eng. files in /usr/local/share/tessdata with ls -s, you will probably find that they are all zero size (same in /usr/src, so they are probably not part of the .tar.gz).
Installation on a mac (ppc, 10.4.11) with english language ocr in mind:
a) download tesseract-2.03.tar.gz and tesseract-2.00.eng.tar.gz from the downloads page
b) open a terminal, cd to wherever you downloaded the above files, then do:
tar xvfz tesseract-2.03.tar.gz
cd tesseract-2.03
./configure
make
sudo make install
cd ..
tar xvfz tesseract-2.00.eng.tar.gz
sudo mv tessdata/ /usr/local/share/tessdata
sudo chown root:yourusername /usr/local/share/tessdata/
rm -rf tessdata
rm -rf tesseract-2.03
rm tesseract-2.03.tar.gz
rm tesseract-2.00.eng.tar.gz
Note in the above replace yourusername with your short username. eg for me it's sudo chown root:pete /usr/local/share/tessdata/
If you don't know what your short username is, type:
whoami
and the response is your short username.
c) you now have a working install of tesseract set up to do ocr on english language documents. To do other language documents, download the relevant language file, then repeat all steps from "tar xvfz tesseract-2.00.language?.tar.gz" above.
d) NOTE: for tesseract to work, the tiff file you're running it on needs to be renamed to end in .tif (not .tiff) AND it needs to be an image without an alpha channel. If you've renamed the file and tesseract is still barfing, this is probably the problem. Use an image conversion utility with the ability to remove alpha channels to re-save your image. For bulk image conversion I recommend Imagemagick (it's gpl and runs well on the mac).
e) finally, just so everything is on one post, to ocr your tiff image, do:
tesseract inputimage.tif outputtext -l eng
and you should get a file called outputtext.txt.
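The image preparation in step d) could be sketched with ImageMagick as follows. The helper names tif_name and to_plain_tif are invented for this illustration, and the sketch assumes ImageMagick's convert command is installed and that the input name has an extension. The -alpha off and -compress none options are ImageMagick settings for dropping the alpha channel and writing an uncompressed file.

```shell
# Sketch: re-save an image as an uncompressed, alpha-free TIFF named *.tif.
# tif_name and to_plain_tif are hypothetical helpers; "convert" is
# ImageMagick's command-line tool and must be installed separately.
tif_name() {
    # scan.tiff -> scan.tif, photo.jpg -> photo.tif
    printf '%s.tif\n' "${1%.*}"
}

to_plain_tif() {
    src=$1
    dst=$(tif_name "$src")
    # Write to a temporary name first so an input that is already named
    # *.tif is only replaced after a successful conversion.
    convert "$src" -alpha off -compress none "$dst.tmp.tif" && mv "$dst.tmp.tif" "$dst"
}

# Usage:
#   to_plain_tif inputimage.tiff
#   tesseract inputimage.tif outputtext -l eng
```

For bulk conversion, the same function can be called in a loop over the files to convert.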
I can compile the project with Visual Studio 2005, but when I run the tesseract.exe app, I get "The ordinal 166 could not be located in the dynamic link library libtiff3.dll". Libtiff3.dll is from the latest version of LibTIFF, and it is definitely in my path. Anyone else come across this?
We are experiencing the same error with Visual Studio 2008. Is there another mailing list to submit this to?
I am getting the following error when I issue the command: tesseract.exe out.tif out
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
Tessedit:Error:Read of file failed:out.tif
Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
Can someone please help me in this regard. I have already installed libtiff and set up the path etc as specified above but still I am getting the same error.
I am able to get tesseract2.03 running on OSX 10.5 just fine. However, when I compile with libtiff, the output becomes garbage. I tried both libtiff manual build or from Darwinports... any idea how I can go about this?
Here too, 2.03 after running the command from above: tesseract inputimage.tif outputtext -l eng
I get:
Tesseract Open Source OCR Engine
Image has 24 bits per pixel and size (450,24) Resolution=72
and then the outputtext.txt is created, but it is empty. This is for vanilla text copied from the screen, not handwriting. Any experience with non-error causing blank output?
i am unable to work with this software..anyone plz give ur email id..i will send the image..plz plz check and tell me the result whether it works..plz help frnds
Using Instructions from Comment by caitifty, Feb 28, 2009 - Installed 2.03 on MAC 10.5.7 Intel and it compiled and tested with phototest.tif successfully.. starhari86 let me know if u still need some help testing your image.
john-davids-macbook-pro:tesseract-2.03 thinkdunson$ make
-bash: make: command not found
john-davids-macbook-pro:tesseract-2.03 thinkdunson$ sudo make install
Password:
sudo: make: command not found
now what? i have no experience with terminal… anyone, please help.
hey barendgehrels, i was just wondering in which directory should i create the empty config_auto.h file?
Problem:
Unable to load unicharset file /usr/local/share/tessdata/spa.unicharset
Solution:
1. Download tesseract-2.00.eng.tar.gz from http://code.google.com/p/tesseract-ocr/downloads/list
2. Extract
3. Copy all files to /usr/local/share/tessdata/
;)
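The extract-and-copy steps in the solution above could be scripted along these lines. This is a sketch only: install_lang_pack is an invented name, and it assumes the tarball unpacks to a top-level tessdata directory, as the ReadMe describes for the language packs.

```shell
# Sketch: unpack a language tarball and copy its files into a tessdata directory.
# install_lang_pack is a hypothetical helper; it assumes the tarball unpacks
# to a top-level tessdata/ directory.
install_lang_pack() {
    tarball=$1
    dest=$2
    work=$(mktemp -d) || return 1               # scratch directory for unpacking
    tar -xzf "$tarball" -C "$work" || return 1
    mkdir -p "$dest" || return 1
    cp "$work"/tessdata/* "$dest"/ || return 1
    rm -rf "$work"
}

# Usage (needs root if the destination is system-owned):
#   install_lang_pack tesseract-2.00.eng.tar.gz /usr/local/share/tessdata
```

Unpacking into a scratch directory first avoids the ownership problems the ReadMe warns about when unpacking as root directly into the install location.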