|
FAQ
Frequently Asked Questions
Tesseract FAQs

A collection of frequently asked questions and their answers, or pointers to them.

Can't open eng.unicharset?
Read the ReadMe wiki page.

Can't read compressed Tiff files
I get this error message: read_tif_image:Error:Illegal image format:Compression

Windows (Visual C++): Libtiff support can be added in either VC++6 or VC++Express. Go to the Windows download for libtiff and follow these steps:
- Download and run the setup program.
- Add the paths for include and library files in tools/options/directories.
- Add HAVE_LIBTIFF to the preprocessor definitions.
- Add libtiff.lib to the list of libraries.
- Rebuild.
- Make sure libtiff3.dll is somewhere in your path. This is done by control panel/system/advanced/environment variables and adding c:/program files/gnuwin32/bin to PATH.
- Keep your fingers crossed...

Non-Windows (and Cygwin): Install libtiff-dev before running configure. The procedure differs from OS to OS, but on many systems something like sudo apt-get install libtiff-dev or some variant thereof should do the trick.

No output with color images
There have been several bug reports of blank or garbage output with color images, both with and without libtiff. Here is the most up-to-date information (last update 23 Sep 2008):
- Without libtiff, Tesseract only reads uncompressed tiff files. Even then it won't read 32 bit tiff files correctly. Will be fixed in 2.04. (Meaning that it will correctly handle most image depths (except 16 bit) with libtiff.)
- With libtiff, Tesseract reads compressed tiff files, but can't handle any color: 24 or 32 bit. It can only read 1 bit binary images or 8 bit greyscale. (No color maps!) Fixed in 2.04.
- The API (TessBaseAPI) should be OK with 1, 8, 24 or 32 bit images.

Does it support multi-page tiff files?
Only with 2.03 and later, and only if you have libtiff installed. See Compressed Tiff above.
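
For the TIFF questions above, it can help to know whether a given file is compressed and how many bits per pixel it uses before blaming Tesseract. The following is a minimal sketch, not part of Tesseract: the function name tiff_summary is hypothetical, the tag numbers are taken from the TIFF 6.0 specification, and only the first page and SHORT-valued tags are examined.

```python
import struct

# TIFF tag numbers from the TIFF 6.0 specification.
TAG_BITS_PER_SAMPLE = 258
TAG_COMPRESSION = 259
TAG_SAMPLES_PER_PIXEL = 277


def tiff_summary(data):
    """Return (compression, bits_per_pixel) for the first page of a TIFF.

    compression == 1 means uncompressed. Only SHORT-valued tags are
    handled, which is how these three tags are normally stored.
    """
    if data[:2] == b"II":
        endian = "<"   # little-endian file
    elif data[:2] == b"MM":
        endian = ">"   # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")

    # Defaults the specification defines when a tag is absent.
    compression, bits_per_sample, samples_per_pixel = 1, [1], 1

    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry)
        if field_type != 3:   # 3 = SHORT
            continue
        # Up to two SHORTs are stored inline; more are stored at an offset.
        if count <= 2:
            values_at = entry + 8
        else:
            (values_at,) = struct.unpack_from(endian + "I", data, entry + 8)
        values = struct.unpack_from(endian + "%dH" % count, data, values_at)
        if tag == TAG_COMPRESSION:
            compression = values[0]
        elif tag == TAG_BITS_PER_SAMPLE:
            bits_per_sample = list(values)
        elif tag == TAG_SAMPLES_PER_PIXEL:
            samples_per_pixel = values[0]

    if len(bits_per_sample) == samples_per_pixel:
        bits_per_pixel = sum(bits_per_sample)
    else:
        bits_per_pixel = bits_per_sample[0] * samples_per_pixel
    return compression, bits_per_pixel
```

To use it, read a file in binary mode and pass the bytes, for example tiff_summary(open("image.tif", "rb").read()). A compression value other than 1 means the file needs libtiff, and a bits-per-pixel value of 24 or 32 points to the color image problems described above.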
Why doesn't viewer/svutil.cpp compile?
This file is the single greatest cause of portability issues, because it is the interface to a viewer running in an external process. If you can get it to compile on your system, please report an issue logging what you had to change, but please only for the current version. If you can't get it to compile, you can define GRAPHICS_DISABLED in your compiler (for all the source) and it will comment out all the hard-to-compile code and disable the viewer functionality, which most people don't use anyway.

How do I edit box files used in training?
Use bbtesseract. See http://code.google.com/p/bbtesseract/ and http://groups.google.com/group/bbtesseract
Alternatively, http://code.google.com/p/wx-tetra/ has another application for editing box files.

How do I recognize only digits?
In 2.03 and above: Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.

How do I add just one character or one font to my favourite language, without having to retrain from scratch?
See the TrainingTesseract wiki entry on "New! Tif/Box pairs provided!"

Is there a Minimum Text Size? (It won't read screen text!)
There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font.
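
The arithmetic behind the 20 pixel figure can be sketched as follows. This is only a rough estimate: a point is 1/72 inch, and the x-height ratio of 0.48 is an assumption chosen because it reproduces the number quoted above; real fonts vary. The function name is hypothetical.

```python
POINTS_PER_INCH = 72.0

# Assumption: x-height is about 0.48 of the point size. This ratio is
# font dependent and is used here only because it reproduces the
# "about 20 pixels at 10pt x 300dpi" figure from the FAQ.
ASSUMED_X_HEIGHT_RATIO = 0.48


def x_height_pixels(point_size, dpi, ratio=ASSUMED_X_HEIGHT_RATIO):
    """Estimate the x-height in pixels of text at a given size and resolution."""
    em_pixels = point_size * dpi / POINTS_PER_INCH
    return em_pixels * ratio


for points, dpi in [(10, 300), (8, 300), (10, 96)]:
    print("%dpt at %ddpi: about %.0f pixels"
          % (points, dpi, x_height_pixels(points, dpi)))
```

Under these assumptions, 10pt text captured at a 96dpi screen resolution comes out at roughly 6 pixels of x-height, which is one way to see why screen text reads so poorly.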
Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".

How do I generate the language data files?
Read the TrainingTesseract wiki.

How do I provide my own dictionary?
- Easy: Replace tessdata/eng.user-words with your own word list, in the same format: UTF8 text, one word per line.
- More difficult, but better for a large dictionary: Replace tessdata/eng.word-dawg with one created from your own word list, using wordlist2dawg. See the TrainingTesseract wiki page for details.

wordlist2dawg doesn't work!
There is a memory problem with the 2.03 wordlist2dawg. Unless you have more than 1GB of memory, your system grinds to a halt and it runs very slowly. Reduce both max_num_edges and reserved_edges by a factor of 10 at lines 39-40 of training/wordlist2dawg.cpp and rebuild. If you successfully create a new dawg, and then it doesn't load, due to the error:

How to increase the trust in/strength of the dictionary?
Try upping NON_WERD and GARBAGE_STRING in dict/permute.cpp to maybe 3 or even 5. If the text fonts you are recognizing are significantly different from your training data, and you don't mind a slow-down, you could also try lowering ClassPrunerThreshold in classify/intmatcher.cpp to about 200 from 229. These measures should all improve the power of the dictionary to resolve words from non-words. Of course, any changes that up the power of the dictionary also up the ability to hallucinate dictionary words. If this is a problem, keep short words out of your dictionary, and don't add a vast list of rarely used words if they increase the number of ambiguities with more frequent words.

What are configs and how can I have more?
Config is an overloaded word in tesseract. One meaning is a file of control parameters used for debugging or modifying its behaviour, such as tessdata/configs/segdemo.
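
As a rough illustration of this first meaning, the sketch below writes a config file of control parameters. It assumes such files are plain text with one "variable value" pair per line, as in the digits and logfile examples in this FAQ. The helper name write_config is hypothetical, and a temporary directory stands in for tessdata/configs so the sketch does not touch a real installation.

```python
import os
import tempfile


def write_config(configs_dir, name, variables):
    """Write a config file with one 'variable value' pair per line."""
    path = os.path.join(configs_dir, name)
    with open(path, "w") as handle:
        for variable, value in variables:
            handle.write("%s %s\n" % (variable, value))
    return path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # stand-in for tessdata/configs
    digits = write_config(demo_dir, "digits",
                          [("tessedit_char_whitelist", "0123456789")])
    logfile = write_config(demo_dir, "logfile",
                           [("debug_file", "tesseract.log")])
    for path in (digits, logfile):
        with open(path) as handle:
            print(path, "->", handle.read().strip())
```

The name of the file is what goes at the end of the tesseract command line, so the two files written here correspond to the digits and logfile arguments.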
The other meaning is used in training and in the classifier: A config represents a (potentially) different shape of a character from a different font. The MAX_NUM_CONFIGS limit applies to the number of different files on the command line of mftraining containing samples of any one character, as each file is assumed to represent a different font. There is currently (2.03) a limit of 32 configs. You can get away with more than 32 files on the mftraining command line if not all the files contain all the characters. Other ways to fix the problem:
- If files contain very similar looking samples, then you can cat them together to make a single file to reduce the total number of files. DON'T do this if the characters in two files look very different.
- Increase MAX_NUM_CONFIGS (in classify/intproto.h). There are consequences. You will make inttemp files generated with a different value of MAX_NUM_CONFIGS unreadable. We are working towards overcoming this weakness for version 3.0. Will not be in 2.04 though. Also, classification will be slower and use more memory.

Where is the documentation?
There isn't much. We are concentrating on features at the moment. There is some documentation at http://tesseract-ocr.repairfaq.org/ and more at this forum thread: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/3ef5dd674cef3746/68b5f07bff0b54b2?lnk=gst&q=icdar#68b5f07bff0b54b2

How can I try the next version?
Periodically stable versions go to the downloads page. Between releases, and in particular just before a new release, the latest code is available from svn. You can find the source here: http://code.google.com/p/tesseract-ocr/source/checkout where you can check it out either by command line, or by following the link to the howto on using various client programs and plugins.

Error: X classes in inttemp while unicharset contains Y unichars.
(Where Y != X) There are 2 possibilities:
- X ~= Y, usually with X < Y: Usually caused by a failed training process.
Look for FATALITY messages from the tr file generation process. It looks like the training process failed to pick up some samples of some characters, and they didn't make it into the inttemp file (in mftraining) because there was no entry in the tr file. There are bad problems with applybox that make this a problem for a lot of people. The character samples need to be spaced out.
- X a wild number (very large + or -) and Y a sane number between 100 and a few thousand, depending on the language: Corrupt inttemp file, or (if you have NOT trained yourself) your hardware has a funny endian architecture that is not automatically detected. Big-endian or Little-endian 32 and 64 bit SHOULD work, but mixed endian (0x12345678 -> 0x56 0x78 0x12 0x34 or similar) will NOT work. Get a sensible hardware architecture, or retrain yourself. Then your inttemp will match the hardware.

How can I make the error messages go to tesseract.log instead of stderr?
To restore the old behaviour of writing to tesseract.log instead of writing to the console window, you need a text file that contains this:
debug_file tesseract.log
Call the file 'logfile' and put it in tessdata/configs/. Then add logfile to the end of your command line.

My question isn't in here!
Try searching the forum: http://groups.google.com/group/tesseract-ocr as your question may have come up before even if it is not listed here. |
For recognizing only digits, I did the mentioned task but in the log I received:
read_variables_file:variable not found: tessedit_char_whitelist
Tesseract Open Source OCR Engine
I'm currently testing the executables.
Hi, I have downloaded the source and languages, then decided it would be easier to download 2.0...exe6.tar.gz. I see executables but no installer, and when I run tesseract.exe it doesn't do anything.
Can you help me?
I want to know how I can create the following 8 files
- tessdata/eng.freq-dawg
- tessdata/eng.word-dawg
- tessdata/eng.user-words
- tessdata/eng.inttemp
- tessdata/eng.normproto
- tessdata/eng.pffmtable
- tessdata/eng.unicharset
- tessdata/eng.DangAmbigs?
Thanks a lot
1. Where is the documentation? Is it downloadable?
2. What image formats are supported (JPG, GIF, BMP, PNG)?
3. Command line arguments? help?
Is tesseract able to recognize 7-segment digits (on a clock radio, for example)?
Thanks
I am using tesseract successfully to ocr tiff files. At this particular stage in my own project, it would be convenient if I could OCR directly from a memory string rather than a file. This should be easily possible (I do have the memory).
Is there a command to read/ocr directly from a memory string rather than a filename input?
With pytesser you can OCR from PIL Images. But I'm having difficulty getting pytesser to work for me at all. It doesn't help that it's been at Version .0.0.1 since last year.
Addendum to the above. It looks like the pytesser package doesn't come with the eng.unicharset file, nor do any of the other files have the eng prefix, so it kept throwing a file not found exception. Took forever (not really, I just started this afternoon) to track it down.
what are the command line options for tesseract?
There is a python wrapper for tesseract-ocr at http://wiki.github.com/hoffstaetter/python-tesseract . This may be useful for anyone trying to do character recognition in python.
Hi,
I am working with tesseract-1.03. It works well, but some tiff files are not converted to a text document: running it on such an image results in an empty text document. Please give me some suggestions for getting a proper text document.
thanks in advance
Re your instructions: "Windows (Visual C++): Libtiff support can be added in either VC++6 or VC++Express with the following:"
>> This is not clear enough - my comments below:
Goto the Windows download for libtiff and follow these steps:
Download and run the setup program. Add the paths for include and library files in tools/options/directories
>> WHAT paths and include ???
>>
>> I assume you mean for bin, in VC++ Express,
>> add: C:\Program Files\GnuWin32\bin
>> in Tools->Options->VC++ Directories
>> (NOTE: with 'Show directories for: Executable files' selected)
>>
>> AND
>>
>> for include, in VC++ Express,
>> add: C:\Program Files\GnuWin32\include
>> in Tools->Options->VC++ Directories
>> (NOTE: with 'Show directories for: include files' selected)
>>
>> Correct?
Add HAVE_LIBTIFF to the preprocessor definitions.
>> WHERE in the Microsoft Visual C++ Express tool do I set this?
>> There is no section, tab or otherwise for adding pre-processor definitions.
>> Where are they found in this tool?
Add libtiff.lib to the list of libraries.
>> in VC++ Express,
>> add: C:\Program Files\GnuWin32\lib\libtiff.lib
>> in Tools->Options->VC++ Directories
>> (NOTE: with 'Show directories for: library files' selected)
>>
>> Correct?
Rebuild.
Make libtiff3.dll be in your path somewhere. This is done by control panel/system/advanced/environment variables and adding c:/program files/gnuwin32/bin to PATH.
>> OK - that was one step that was clear
Keep your fingers crossed...