FAQ
Frequently Asked Questions
Featured Tesseract FAQs

A collection of frequently asked questions and their answers, or pointers to them. Wiki comments are for commenting on the wiki, not for troubleshooting. If you have found a bug, please create an issue. If you have a question, put it to the tesseract user or developer forum. Questions in comments are not answered by developers.

Windows: tesseract closes automatically right after launching
tesseract is a command line program, so you need to run it from the command line. If you need a program with a graphical interface, please have a look at the AddOns wiki.

libtesseract.so.3: cannot open shared object file
Run 'sudo ldconfig' after 'sudo make install'. See issue 621.

Error in pixReadStream:
If you see this error, then you have a problem with your leptonica installation. Please check issues 340, 391 and 443.

Can't open eng.unicharset?
Read the ReadMe wiki page.

leptonica library missing
If you get this error message when you run ./configure and your leptonica header files are located in /usr/local/include (e.g. you installed leptonica to /usr/local), then run:
LIBLEPT_HEADERSDIR=/usr/local/include ./configure
or:
CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure

Can't read compressed Tiff files
I get this error message: read_tif_image:Error:Illegal image format:Compression

Windows (Visual C++): Libtiff support can be added in either VC++6 or VC++Express with the following. Go to the Windows download for libtiff and follow these steps:
1. Download and run the setup program.
2. Add the paths for include and library files in tools/options/directories.
3. Add HAVE_LIBTIFF to the preprocessor definitions.
4. Add libtiff.lib to the list of libraries.
5. Rebuild.
6. Make sure libtiff3.dll is in your path somewhere. This is done by going to control panel/system/advanced/environment variables and adding c:/program files/gnuwin32/bin to PATH.
7. Keep your fingers crossed...

Non-Windows (and Cygwin): Install libtiff-dev.
The procedure differs from OS to OS, but on many systems something like sudo apt-get install libtiff-dev, or some variant thereof, should do the trick before running configure.

Can I use tesseract for barcode recognition?
No. Tesseract is for text recognition.

No output with color images
There have been several bug reports of blank or garbage output with color images, both with and without libtiff. Here is the most up-to-date information (last update 23 Sep 2008):
Without libtiff, Tesseract only reads uncompressed tiff files. Even then it won't read 32 bit tiff files correctly. Will be fixed in 2.04. (Meaning that it will correctly handle most image depths (except 16 bit) with libtiff.)
With libtiff, Tesseract reads compressed tiff files, but can't handle any color: 24 or 32 bit. It can only read 1 bit binary images or 8 bit greyscale. (No color maps!) Fixed in 2.04.
The API (TessBaseAPI) should be OK with 1, 8, 24 or 32 bit images.

Does it support multi-page tiff files?
Only with 2.03 and later, and only if you have libtiff installed. See Compressed Tiff above.

Why doesn't viewer/svutil.cpp compile?
This file is the single greatest cause of portability issues, because it is the interface to a viewer running in an external process. If you can get it to compile on your system, please report an issue logging what you had to change, but please only for the current version. If you can't get it to compile, you can define GRAPHICS_DISABLED in your compiler (for all the source) and it will comment out all the hard-to-compile code and disable the viewer functionality, which most people don't use anyway. On Unix-like systems, the configure script can be instructed to disable graphics like this:
configure --disable-graphics

How do I edit Box files used in training?
Use bbtesseract http://code.google.com/p/bbtesseract/ or another similar program.
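The compressed-TIFF and color-depth answers above depend on two properties of the input file: its compression scheme and its bit depth. As a rough illustration, the following Python sketch reads both from the first image directory of a TIFF using only the standard library. The function name tiff_summary is hypothetical and not part of Tesseract; the tag numbers (258 BitsPerSample, 259 Compression, 277 SamplesPerPixel) come from the TIFF 6.0 specification, and the sketch ignores BigTIFF and any pages after the first.

```python
import struct

# TIFF 6.0 tag numbers used below.
BITS_PER_SAMPLE, COMPRESSION, SAMPLES_PER_PIXEL = 258, 259, 277


def tiff_summary(data):
    """Return (compression_code, bits_per_pixel) for the first image in a TIFF.

    Hypothetical helper for illustration; it is not part of Tesseract.
    """
    # The first two bytes give the byte order, the next two the magic number 42.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF file")
    (ifd,) = struct.unpack(order + "I", data[4:8])
    (count,) = struct.unpack(order + "H", data[ifd:ifd + 2])

    tags = {}
    for i in range(count):
        start = ifd + 2 + 12 * i  # each directory entry is 12 bytes
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        if ftype != 3:  # only SHORT fields are needed here
            continue
        if n <= 2:      # up to two SHORT values are stored inline
            raw = data[start + 8:start + 8 + 2 * n]
        else:           # otherwise the field holds an offset to the values
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + 2 * n]
        tags[tag] = struct.unpack(order + "H" * n, raw)

    compression = tags.get(COMPRESSION, (1,))[0]  # 1 means uncompressed
    samples = tags.get(SAMPLES_PER_PIXEL, (1,))[0]
    bits = tags.get(BITS_PER_SAMPLE, (1,))
    bits_per_pixel = sum(bits) if len(bits) == samples else bits[0] * samples
    return compression, bits_per_pixel
```

Called on the bytes of a file, a result of (1, 8) would indicate an uncompressed 8 bit greyscale image, which the answers above suggest is among the safest inputs; a compression code other than 1 means libtiff support is needed.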
How do I recognize only digits?
In 2.03 and above: Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.

How do I add just one character or one font to my favourite language, without having to retrain from scratch?
See the TrainingTesseract wiki entry on "New! Tif/Box pairs provided!"

Is there a Minimum Text Size? (It won't read screen text!)
There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".

Output without result or bad output
You can try Scantailor, an interactive post-processing tool for scanned pages.

How do I generate the language data files?
Read the TrainingTesseract wiki.

How do I provide my own dictionary?
Easy: Replace tessdata/eng.user-words with your own word list, in the same format: UTF8 text, one word per line.
More difficult, but better for a large dictionary: Replace tessdata/eng.word-dawg with one created from your own word list, using wordlist2dawg. See the TrainingTesseract wiki page for details.

wordlist2dawg doesn't work!
There is a memory problem with the 2.03 wordlist2dawg. If you don't have more than 1GB of memory, your system grinds to a halt and it runs very slowly. Reduce both max_num_edges and reserved_edges by a factor of 10 at lines 39-40 of training/wordlist2dawg.cpp and rebuild. If you successfully create a new dawg, and then it doesn't load, due to the error:

How to increase the trust in/strength of the dictionary?
Try raising NON_WERD and GARBAGE_STRING in dict/permute.cpp to maybe 3 or even 5. If the text fonts you are recognizing are significantly different from your training data, and you don't mind a slow-down, you could also try lowering ClassPrunerThreshold in classify/intmatcher.cpp to about 200 from 229. These measures should all improve the power of the dictionary to distinguish words from non-words. Of course, any change that increases the power of the dictionary also increases its ability to hallucinate dictionary words. If this is a problem, keep short words out of your dictionary, and don't add a vast list of rarely used words if they increase the number of ambiguities with more frequent words.

What are configs and how can I have more?
Config is an overloaded word in tesseract. One meaning is a file of control parameters used for debugging or modifying its behaviour, such as tessdata/configs/segdemo.
The other meaning is used in training and in the classifier: A config represents a (potentially) different shape of a character from a different font. The MAX_NUM_CONFIGS limit applies to the number of different files on the command line of mftraining containing samples of any one character, as each file is assumed to represent a different font. There is currently (2.03) a limit of 32 configs. You can get away with more than 32 files on the mftraining command line if not all the files contain all the characters. Other ways to fix the problem:
If files contain very similar looking samples, then you can cat them together to make a single file to reduce the total number of files. DON'T do this if the characters in two files look very different.
Increase MAX_NUM_CONFIGS (in classify/intproto.h). There are consequences: you will make inttemp files generated with a different value of MAX_NUM_CONFIGS unreadable. We are working towards overcoming this weakness for version 3.0. It will not be in 2.04 though. Also, classification will be slower and use more memory.

Where is the documentation?
There isn't much. We are concentrating on features at the moment. There is some documentation at http://tesseract-ocr.repairfaq.org/ and more at this forum thread: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/3ef5dd674cef3746/68b5f07bff0b54b2?lnk=gst&q=icdar#68b5f07bff0b54b2

How can I try the next version?
Periodically, stable versions go to the downloads page. Between releases, and in particular just before a new release, the latest code is available from svn. You can find the source here: http://code.google.com/p/tesseract-ocr/source/checkout where you can check it out either from the command line, or by following the link to the howto on using various client programs and plugins.
actual_tessdata_num_entries <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file ..\ccutil\tessdatamanager.cpp, line 55
If you get this error while running tesseract, please check that you are using the correct version of traineddata (e.g. 3.00 with 3.01). You cannot use 3.01 traineddata with tesseract 3.00.

last_char == '\n':Error:Assert failed:in file ..\ccutil\tessdatamanager.cpp, line 95
If you get this error while running combine_tessdata, it indicates that your lang.unicharambigs is missing an empty line at the end of the file.

Error: X classes in inttemp while unicharset contains Y unichars.
(Where Y != X) There are 2 possibilities:
X ~= Y, usually with X < Y: Usually caused by a failed training process. Look for FATALITY messages from the tr file generation process. It looks like the training process failed to pick up some samples of some characters, and they didn't make it into the inttemp file (in mftraining) because there was no entry in the tr file. There are bad problems with applybox that make this a problem for a lot of people. The character samples need to be spaced out.
X a wild number (very large + or -) and Y a sane number between 100 and a few thousand, depending on the language: Corrupt inttemp file, or (if you have NOT trained yourself) your hardware has a funny endian architecture that is not automatically detected. Big-endian or Little-endian 32 and 64 bit SHOULD work, but mixed endian (0x12345678 -> 0x56 0x78 0x12 0x34 or similar) will NOT work. Get a sensible hardware architecture, or retrain yourself. Then your inttemp will match the hardware.

How can I make the error messages go to tesseract.log instead of stderr?
To restore the old behaviour of writing to tesseract.log instead of writing to the console window, you need a text file that contains this:
debug_file tesseract.log
Call the file 'logfile' and put it in tessdata/configs/. Then add logfile to the end of your command line.
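The logfile recipe above can be scripted. Below is a minimal Python sketch, assuming you pass in whatever tessdata directory your installation uses. The helper names write_logfile_config and command_with_logfile are hypothetical; the file name 'logfile', the line 'debug_file tesseract.log' and the placement under tessdata/configs/ are taken from the answer above.

```python
import os


def write_logfile_config(tessdata_dir):
    """Create tessdata/configs/logfile so messages go to tesseract.log.

    Hypothetical helper; tessdata_dir is the tessdata directory of
    your own installation.
    """
    configs_dir = os.path.join(tessdata_dir, "configs")
    os.makedirs(configs_dir, exist_ok=True)
    path = os.path.join(configs_dir, "logfile")
    with open(path, "w") as f:
        f.write("debug_file tesseract.log\n")
    return path


def command_with_logfile(image, outputbase):
    """Build the command line with the config name added at the end."""
    return ["tesseract", image, outputbase, "logfile"]
```

The list returned by command_with_logfile can be handed to subprocess.call, assuming the tesseract executable is on your PATH.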
My question isn't in here!
Try searching the forum: http://groups.google.com/group/tesseract-ocr as your question may have come up before even if it is not listed here.
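The rule of thumb in the minimum text size answer can be turned into simple arithmetic. The Python sketch below is only an estimate: the function names are hypothetical, and the x-height ratio of 0.48 is an assumption chosen so that 10pt at 300dpi gives the "about 20 pixels" quoted in the FAQ. Real fonts vary, so counting pixels in the actual image remains the better check.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch

# Assumed ratio of x-height to point size, fitted to the FAQ's
# "10pt x 300dpi is about 20 pixels" figure; real fonts differ.
X_HEIGHT_RATIO = 0.48


def x_height_pixels(point_size, dpi, ratio=X_HEIGHT_RATIO):
    """Estimate the x-height in pixels for a point size and resolution."""
    return point_size * dpi / POINTS_PER_INCH * ratio


def accuracy_outlook(x_height):
    """Map an x-height in pixels to the rough bands described in the FAQ."""
    if x_height < 8:
        return "mostly removed as noise"
    if x_height < 10:
        return "very little chance of accurate results"
    if x_height < 20:
        return "accuracy drops off"
    return "reasonable accuracy possible"
```

For example, 12pt text on a screen at 96dpi (an assumed typical value) comes out at roughly 7.7 pixels, which falls in the lowest band and is consistent with the FAQ's remark that screen text often won't read.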
For recognizing only digits, I did the mentioned task but in the log I received:
read_variables_file:variable not found: tessedit_char_whitelist
Tesseract Open Source OCR Engine
I'm currently testing the executables.
Hi, I have downloaded the source and languages, then decided it would be easier to download 2.0...exe6.tar.gz. I see executables but no installer, and when I run tesseract.exe it doesn't do anything.
Can you help me?
I want to know how can I create the following 8 files
- tessdata/eng.freq-dawg
- tessdata/eng.word-dawg
- tessdata/eng.user-words
- tessdata/eng.inttemp
- tessdata/eng.normproto
- tessdata/eng.pffmtable
- tessdata/eng.unicharset
- tessdata/eng.DangAmbigs?
Thanks a lot
1. Where is the documentation? Is it downloadable? 2. What image formats are supported (JPG, GIF, BMP, PNG)? 3. Command line arguments? Help?
Is tesseract able to recognize 7-segment digits (on a clock radio, for example)?
Thanks
I am using tesseract successfully to ocr tiff files. At this particular stage in my own project, it would be convenient if I could OCR directly from a memory string rather than a file. This should be easily possible (I do have the memory).
Is there a command to read/ocr directly from a memory string rather than a filename input?
With pytesser you can OCR from PIL Images. But I'm having difficulty getting pytesser to work for me at all. It doesn't help that it's been at Version .0.0.1 since last year.
Addendum to the above. It looks like the pytesser package doesn't come with the eng.unicharset file, nor do any of the other files have the eng prefix, so it kept throwing a file not found exception. Took forever (not really, I just started this afternoon) to track it down.
what are the command line options for tesseract?
There is a python wrapper for tesseract-ocr at http://wiki.github.com/hoffstaetter/python-tesseract . This may be useful for anyone trying to do character recognition in python.
Hi,
I am working with tesseract-1.03. It works well, but some tiff files are still not converted to a text document. Running it on one image resulted in an empty text document. Please give me some suggestions for getting a proper text document.
thanks in advance
Re your instructions: "Windows (Visual C++): Libtiff support can be added in either VC++6 or VC++Express with the following:"
>> This is not clear enough - my comments below:
Goto the Windows download for libtiff and follow these steps:
Download and run the setup program. Add the paths for include and library files in tools/options/directories
>> WHAT paths and include ???
>>
>> I assume you mean for bin, in VC++ Express,
>> add: C:\Program Files\GnuWin32\bin
>> in Tools->Options->VC++ Directories
>> (NOTE: with 'Show directories for: Executable files' selected)
>>
>> AND
>>
>> for include, in VC++ Express,
>> add: C:\Program Files\GnuWin32\include
>> in Tools->Options->VC++ Directories
>> (NOTE: with 'Show directories for: include files' selected)
>>
>> Correct?
Add HAVE_LIBTIFF to the preprocessor definitions.
>> WHERE in the Microsoft Visual C++ Express tool do I set this?
>> There is no section, tab or otherwise for adding pre-processor definitions.
>> Where are they found in this tool?
Add libtiff.lib to the list of libraries.
>> in VC++ Express,
>> add: C:\Program Files\GnuWin32\lib\libtiff.lib
>> in Tools->Options->VC++ Directories
>> (NOTE: with 'Show directories for: library files' selected)
>>
>> Correct?
Rebuild.
Make libtiff3.dll be in your path somewhere. This is done by control panel/system/advanced/environment variables and adding c:/program files/gnuwin32/bin to PATH.
>> OK - that was one step that was clear
Keep your fingers crossed...
@arjaydavis - I'm having some trouble myself but I think I can help with some of this.
Add HAVE_LIBTIFF to the preprocessor definitions.
>>> Project -> Properties -> Configuration Properties -> C/C++ -> Preprocessor (add it to the Preprocessor Definitions line).
Add libtiff.lib to the list of libraries.
>>> Project -> Properties -> Configuration Properties -> Linker -> Command Line (add to the Additional Options box)
WHAT paths and include ???
>>> I added the lib folder to 'Show directories for: library files'. Same as you for the Include.
This still isn't working for me with a G4 Tiff so maybe it should be a 'what not to do'
Does anyone have a simple example of how to use tesseract in C or C++? I'm looking into using this for my next project, but since I'm a C newbie I have no clue where to start. Where can I find the API docs?
Since the above wiki documentation is unclear to anyone not versed in Visual C++ incantations (if you're just a managed environment .NET developer like me you're basically screwed), I've fluffed it up a bit here. I took the following steps to compile tesseract with compressed/multipage TIFF support under a Windows 7 64 bit system.
1. Download tesseract 2.04. Unpack it. In this example I've unpacked to C:\projects\tesseract-2.04. Windows 7 still doesn't understand .tar.gz out of the box. My recommendation is to get a copy of 7-Zip.
2. Download your required language files. I need German and English. I unpack these to the tessdata subdirectory, C:\projects\tesseract-2.04\tessdata.
3. Install libtiff. On my (64 bit) system the suggested install directory is C:\Program Files (x86)\GnuWin32. Underneath this directory are a bunch of subdirectories containing files we'll need to compile tesseract with tiff support, namely include, bin and lib.
4. Add C:\Program Files (x86)\GnuWin32\bin to your PATH environment variable so that the output tesseract.exe can find the libtiff dll. Restart.
5. Open the vc solution (tesseract.sln)
6. Change the solution configuration to "Release" mode. Note that if you later change back to Debug mode, you'll need to set up all the following again...
7. In the solution explorer right click the solution node (Solution 'tesseract') and click "Properties". Change to "Configuration Properties" and select the "Release" configuration from the dropdown at the top of the window. Navigate to:
Tools -> Options -> Projects and Solutions -> VC++ Directories
Here we'll be adding the full paths for the subdirectories lib and include from the libtiff install so that VC can find the required header (.h) and static library (.lib) files. In this example they are:
$(ProgramFiles)\GnuWin32\include
$(ProgramFiles)\GnuWin32\lib
as I'm using an environment variable. I could however just have written them out in full, e.g. C:\Program Files (x86)\GnuWin32\include.
Change the "Show Directories For" dropdown to "Include files". Add the following: $(ProgramFiles)\GnuWin32\include
Now change the "Show Directories For" dropdown to "Library files". Add the following: $(ProgramFiles)\GnuWin32\lib
8. Now open the project properties window for the tesseract project (not the solution). In the solution explorer right click the tesseract project and click properties. Navigate the horrendous list of options to Configuration Properties -> C/C++ -> Preprocessor and add HAVE_LIBTIFF to the list of Preprocessor Definitions. This causes a bunch of #includes to be enabled in the code.
9. You also want to add an "Additional dependency". Go to the "Additional Dependencies" section of the project properties and add libtiff.lib.
10. Build the solution. Watch the error list. If you get a bunch of LNK2019 errors, that means the linker can't find something tesseract references. You're missing a reference to one of the paths from libtiff. If you get an error mentioning mt.exe, you've possibly encountered a bug in the SDK. Just try building again. See http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=106634 for more info.
If/when the solution builds successfully, you'll have a tesseract.exe file in the same directory as the tesseract solution file. Drag your multipage compressed tiff here and try running tesseract. For example, if your tiff is called in.tif, you want to output text to out.txt, and the document's language is German, then your command line would look like:
tesseract.exe in.tif out -l deu
The output file will have .txt appended to it by tesseract. If you're just processing English text then you can leave off the -l option, as tesseract assumes "eng" if you don't specify anything. If your tif file has the file extension .tiff, then tesseract will crap itself thusly:
C:\projects\tesseract-2.04>tesseract.exe in.tiff out -l deu
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:in.tiff
IMAGE::read_header:Error:Can't read this image type:in.tiff
tesseract.exe:Error:Read of file failed:in.tiff
Hopefully (fingers crossed, heh) you've now got an OCR'd out.txt file sitting in C:\projects\tesseract-2.04.
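The command line and the .tiff pitfall described in this comment could be wrapped in a small script. Here is a Python sketch, assuming tesseract.exe is on the PATH and behaves as described above for 2.04; the helper names normalized_tif_path and build_command are hypothetical. It copies a .tiff input to a .tif name before building the command, since the comment reports that the .tiff extension is rejected.

```python
import os
import shutil


def normalized_tif_path(image_path):
    """Return a path ending in .tif, copying the file if it ends in .tiff."""
    root, ext = os.path.splitext(image_path)
    if ext.lower() == ".tiff":
        tif_path = root + ".tif"
        shutil.copyfile(image_path, tif_path)
        return tif_path
    return image_path


def build_command(image_path, output_base, lang=None):
    """Build the argument list; without lang, -l is left off entirely."""
    cmd = ["tesseract.exe", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd
```

The resulting list can be passed to subprocess.call, for example with build_command(normalized_tif_path("in.tiff"), "out", "deu") for the German example above.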
Could someone please update the "Digits only" part of this FAQ? The right command to enter is:
C:>tesseract.exe nine.tif out tessdata/configs/nobatch tessdata/configs/digits
nobatch is a dummy (but existing) file and digits is a real file containing the string:
tessedit_char_whitelist 0123456789 (as above)
Regards.
From : http://markmail.org/message/yhcsecjgn5752nps#query:tesseract%20could%20not%20open%20file%20nobatch+page:1+mid:trfu4ykvsgguprp5+state:results
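A Python sketch of the digits-only setup described here and in the FAQ answer: it writes the whitelist config file and assembles the command line. The helper names are hypothetical; the config line and the nobatch and digits arguments are taken from the text above. Whether the config names must be given with the tessdata/configs/ prefix appears to depend on the setup, as this comment suggests, so the prefix is left as a parameter.

```python
import os


def write_digits_config(tessdata_dir):
    """Write tessdata/configs/digits containing the digit whitelist."""
    configs_dir = os.path.join(tessdata_dir, "configs")
    os.makedirs(configs_dir, exist_ok=True)
    path = os.path.join(configs_dir, "digits")
    with open(path, "w") as f:
        f.write("tessedit_char_whitelist 0123456789\n")
    return path


def digits_command(image, outputbase, config_prefix=""):
    """Build the digits-only command line.

    config_prefix="" gives the form in the FAQ answer;
    config_prefix="tessdata/configs/" gives the form in this comment.
    """
    return ["tesseract", image, outputbase,
            config_prefix + "nobatch", config_prefix + "digits"]
```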
While training tesseract 2.04 for Bengali script I'm getting an error during execution of cntraining: "cnTraining.exe has encountered a problem and needs to close. We are sorry for the inconvenience.----Please tell Microsoft about this problem." Can anyone help me? Thanks & Regards
Does it support Chinese? I'm Chinese, and I think this project is very good. Well done.
fatal error C1010: unexpected end of file while looking for precompiled header directive. this error is shown while i'm adding a new class in tesseract2.04 & building it. Please help.
Regarding comments by matt.m.j...@gmail.com, Jan 06, 2010
These instructions were 100% perfect - thanks a lot.
Hello All,
Can ANYONE get TESSDLL.DLL to produce useful results? I have tried the pre-compiled tessdll.dll, and it just exits while initializing the base class. I have also tried compiling the 2.04 source to tessdll.dll using VS2005, and I have traced through in the debugger, and it chews the image data I provided (300 dpi monochrome, 8.5 x 11 in) for a few minutes and then produces a list of gibberish results. I have the english-language TESSDATA folder, which the DLL seems to find OK. Not sure what I'm doing wrong... Any suggestions?
I used ./configure --disable-graphics. These are the last lines when I run make:
I use Ubuntu 10.04. How do I make it work?
it works !
This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.
Hello. I have recently started using tesseract on Fedora 11. I'm having trouble producing output for .jpg, .png, .gif etc. files. I have the libjpeg-devel and libpng-devel files installed on my OS, so there should be no problem. But it says:
name_to_image_type:Error:Unrecognized image type:doc_image.jpg
IMAGE::read_header:Error:Can't read this image type:doc_image.jpg
tesseract:Error:Read of file failed:doc_image.jpg
Segmentation fault
How can I resolve this issue?
Hi, I am trying to use pytesser to read text from pdf files that also contain images. Is pytesser capable of doing this? If yes, what are the steps to make it work for me?
@pietchaki: I added #include <stdio.h> right above #include <iostream> in viewer/svutil.cpp, and that fixed it.
I was able to cross compile tesseract 2.04 for the iPhone and run Pocket OCR, using a new tessdata file trained to identify numbers. When I run tesseract on Mac OS X with this tessdata it gives correct output, but when I run it on the iPhone I get an extra number and some spaces added to the result. If I ignore the spaces and the extra number added at the end, my output is identical to the Mac OS X result. Can anyone tell me what the problem is?
can tesseract provide me also with the position of each word it converts? kind of metadata
Re: matt.m.j...@gmail.com and the very clear instructions.
This instruction: Quote: 9. You also want to add an "Additional dependency". Go to the "Additional Dependencies" section of the project properties and add libtiff.lib.
This is done under: Configuration Properties -> C/C++ -> Linker -> Input
Source: http://stackoverflow.com/questions/1512467/setting-up-dependencies-on-visual-c-2008-express
Note, i've still yet to get this working with Libtiff.
http://hocrtopdf.codeplex.com/documentation I wanted to share my project with the Tesseract community. It's a .NET library (it should work fine on Mono) that converts hocr files produced by Tesseract into searchable pdfs. It also has a class to compress the images into jbig2 so that you can create highly compressed searchable pdfs as well.
I forgot to mention, grab the latest source because there are currently no releases. The source contains the binaries as well.
Hi all, I want to use Tesseract in my project, but the images from my project contain some noise. How can I skip this noise? The characters are bigger than 16 x 24, and the noise is smaller than 8 x 8.
FYI: The quick brown dog jumped over the lazy fox.
is incorrect. It's missing 's'. Use:
The quick brown fox jumped over the lazy dogs.
Hello,
Would anybody know if/where there are instructions for just getting the word regions/blobs? i.e. I don't need to recognize characters yet, but would like to have words segmented in rectangles or something similar.
Thanks in advance, Dev
@bovor: or rather, "The quick brown fox jumps over the lazy dog."
Hi, Is there any method which gives me rectangle in the image for specific word found ? I want to get x,y coordinate for given word if it exists in image.
Can the use of OCR be detected in MS Word? I'm actually doing office work, and I'm asking just in case it gets noticed that my work was not done the right way... please help, mail me if you can: tchiang11@gmail.com
i want to run ocr on arabic pdf file and also get the color for fonts too, is it possible to get colors too?
color arabic pdf file: http://versebyversequran.com/data/pdf/Tajweed_Quran_Daralmarifa_full.pdf
How can i get the coordinates (x,y) of a word present in an image?
How to read the text in tif file using tesseract.exe through command prompt
I'm developing a Web Service that uses the OCR Tesseract Wrapper for Java, but I have a problem with an error I cannot resolve myself... I'm desperate: I can't load tessdll on Axis2. HELP please
What do you think about my program SunnyPage 1.0? It uses tesseract 3.02 alpha. You can use it for training Tesseract. It is free for training! www.sunnypage.ge