ReadMe
Important information all Tesseract users need to know.
Introduction
This package contains the Tesseract Open Source OCR Engine. Originally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado, all the code in this distribution is now licensed under the Apache License:

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Other Dependencies
Installing and Running Tesseract

Distribution packages
Tesseract is split into several packages:
Note that tesseract-x.xx.tar.gz unpacks to the tesseract-x.xx directory. tesseract-x.xx.<lang>.tar.gz unpacks to the tessdata directory, which belongs inside your tesseract-x.xx directory. It is therefore best to download the language packs into your tesseract-x.xx directory, so you can use "unpack here" or equivalent. You can unpack as many of the language packs as you care to, as they all contain different files. If you unpack them as root to the destination directory of make install, then the user ids and access permissions might be messed up. Similarly, <lang>.traineddata.gz must be unpacked to the tessdata directory of the tesseract-x.xx installation.

boxtiff-2.01.<lang>.tar.gz contains data that was used in training, for those who want to do their own training. Most users should NOT download these files. Instructions for using the training tools are documented separately at TrainingTesseract3 and for testing at TestingTesseract.

Installation Notes - Tesseract 3.01

General
IMPORTANT: 3.01 is not backwards compatible with 2.04. The data files are different. (Single file per language, among other things.) You therefore need to make sure you connect your new executable with the new data files.
Another important change is that you should really be using TessBaseAPI if you are linking with another program. In Linux (non-Windows) the main library is now libtesseract_api.a instead of the old libtesseract_full.a.
The command line is:
    tesseract <image> <outputbasename> [-l lang] [configs]
In the executable, page layout analysis is enabled by default. You may need to turn it off to process small images. There is no command-line control for this yet. Sorry. See tesseractmain.cpp.
The training process is described on a separate wiki page.
Use the most recently available language files for the languages that you want.
Linux
If they are not already installed, you need the following libraries (Ubuntu):
    sudo apt-get install autoconf automake libtool
    sudo apt-get install libpng12-dev
    sudo apt-get install libjpeg62-dev
    sudo apt-get install libtiff4-dev
    sudo apt-get install zlib1g-dev
You also need to install Leptonica. There is an apt-get package libleptonica-dev, but if you are using an oldish version of Linux, the Leptonica version may be too old, so you will need to build from source. 3.01 requires at least v1.67 of Leptonica. The sources are at http://www.leptonica.org/. The instructions in the Leptonica README are clear, but basically it is the usual:
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig
Now back to Tesseract. Download the source from svn:
    svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
or download the package tesseract-3.01.tar.gz from the download page. The same build process as usual applies:
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig
On some systems autotools does not create the m4 directory automatically (you get the error "configure: error: cannot find macro directory 'm4'"). In this case you must create the m4 directory yourself before running ./configure:
    mkdir -p m4
Between configure and make, you can check that everything has worked by looking at config_auto.h. It should contain #define HAVE_LIBLEPT 1.
You can also use:
    export TESSDATA_PREFIX=/some/path/to/tessdata
to tell Tesseract where your tessdata directory lives. The value is the directory that contains tessdata (example: if your tessdata path is '/usr/local/share/tessdata' you have to use export TESSDATA_PREFIX='/usr/local/share/').
The command line for running tesseract is:
    tesseract <image> <outputbasename> [-l lang] [configs]
Install language data:
Windows
There is a Windows installer for Tesseract-OCR 3.01 that includes English language data. Other language data can be downloaded and installed from the installer. The installer adapts the PATH environment variable of the current user (i.e. the user that installed Tesseract) and sets up the TESSDATA_PREFIX environment variable for the current user.
If you have problems running it, please check that you have installed the Microsoft Visual C++ 2008 SP1 Redistributable Package (x86).
The dll isn't supported in Tesseract-OCR 3.00/3.01.

Installation from source
Download the source from svn:
    svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
Windows-relevant files are located in the vs2008 directory. The same build process as usual applies: open tesseract.sln with VC++ Express 2008 and build all (or just Tesseract). It should compile (in at least release mode) without having to install anything further. The dll dependencies and Leptonica are included. With the full svn download, it should just run immediately after building.
    tesseract <image> <outputbasename> [-l lang] [configs]

Support
If you need support, please search and use the tesseract user forum or the tesseract developer forum. It is good to read the wiki pages before posting on the forum.

Installation Notes - Tesseract 2.04

Linux
The installation process is the same as for version 3.00; just use the correct source and language data for Tesseract 2.0x.

Windows
There is no Windows installer! There are Windows executables: tesseract-2.04.exe.tar.gz (It is not for the 'exe' language.) They are built with VC++ Express 2008 and come with absolutely no warranty. If they work for you then great, otherwise get Visual C++ Express 2008 with service pack 1 and build from the source. You can also try tesseract-2.01.exe.tar.gz, which is built with VC++6, and may work better if your Windows is old, but note that this is an older version of Tesseract.
If you are building from the sources, there are still (up to v2.04) .dsw and .dsp files for VC++6, but the recommended build platform is now VC++ Express 2008. There are also .sln and .vcproj files for VC++ Express 2008, but these files are not backward compatible with any previous version, not even VC++ Express 2005. Note that the executables produced with the newer compiler are smaller, faster, and, believe it or not, more accurate. (See TestingTesseract.)
New with 2.04: the executables are built with static linking, so they stand more chance of working out of the box on more Windows systems.
The executable must reside in the same directory as the tessdata directory. (The Visual Studio projects build the release executable directly to the correct place!)
The command line is:
    tesseract <image.tif> <output> [-l <langid>]
For interfacing to other applications, there is a DLL included with the executables, but you may be better off building it yourself. The DLL is NOT built for static C-Runtime, so you will probably need VC++ Express 2008 to run it. The dll has been updated to allow input of non-binary images. (Thanks to Glen of Jetsoft.)

Non-Windows (or Cygwin)
You have to tell Tesseract through a standard unix mechanism where to find its data directory. You must either:
    ./configure
    make
    make install
to move the data files to the standard place, or:
    export TESSDATA_PREFIX="directory in which your tessdata resides/"
In either case the command line is:
    tesseract <image.tif> <output> [-l <langid>]
New: there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for the help.) It might work with your OS if you know how to do that.
If you are linking to the libraries, as Ocropus does, there is now a single master library called libtesseract_full.a.
Libtiff support should now be properly working via configure, but note that you need libtiff-dev, as that contains the header files required to compile the code that uses it.
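The TESSDATA_PREFIX examples above all use the directory that contains tessdata, written with a trailing slash. As a minimal sketch of that rule, the helper below derives the value from a tessdata path. The function name tessdata_prefix_for is hypothetical and not part of Tesseract.

```shell
#!/bin/sh
# tessdata_prefix_for is a hypothetical helper, not part of Tesseract.
# Given the path of a tessdata directory, print the value to use for
# TESSDATA_PREFIX: the parent directory, with a trailing slash.
tessdata_prefix_for() {
    tessdata_dir=${1%/}    # drop one trailing slash, if present
    case $tessdata_dir in
        */tessdata)
            printf '%s/\n' "${tessdata_dir%/tessdata}"
            ;;
        *)
            echo "expected a path ending in /tessdata: $1" >&2
            return 1
            ;;
    esac
}
```

It could then be used as: export TESSDATA_PREFIX="$(tessdata_prefix_for /usr/local/share/tessdata)"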
History
The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and under Windows with VC++2008 Express. The C++ code makes heavy use of a list system using macros. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to debug. Another "feature" of the C/C++ split is that the C++ data structures get converted to C data structures to call the low-level C code. This is ugly, and the C++izing of the C code is a step towards eliminating the conversion, but it has not happened yet.
The most recent change is that Tesseract can now recognize 33 languages, is fully UTF8 capable, and is fully trainable. See TrainingTesseract for more information on training.
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. See http://www.isri.unlv.edu/downloads/AT-1995.pdf. With Tesseract 2.00, scripts are now included to allow anyone to reproduce some of these tests. See TestingTesseract for more details.

About the Engine
This code is a raw OCR engine. It has NO OUTPUT FORMATTING, and NO UI. It can detect fixed pitch vs proportional text. Having said that, in 1995, this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Training code IS included in the open source release, however, for those willing to try.
I get an error message: Could not open file, -1. My command is: tesseract test1.gif output -1. I have the eng.<files> in a folder called tesserdata inside the folder that contains the exe.
When I run the command in the tesserdata folder I get an error: Unable to load unicharset file C:/contract/visumatic/FreeOCR/tessdata/tessdata/eng.unicharset
Can I get some help?
i get the same error as above on Ubuntu 7.04:
/usr/local/bin/tesseract test.tif out.txt Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
You need to download the language charset that you wish to use from the download button above. Extract it to the directory in which your tesseract executable resides and you shall have that error no longer.
Help please, I'm currently trying to link tessdll to a c#-project. In C# I declared
static public extern int TessDllBeginPageUprightBPP(UInt32 xsize, UInt32 ysize, ref byte b, string lang, uint bpp);
When I call tessdll my application simply closes. Do I need to initialize tessdll somehow?
Many Thanks!
To mcour...@mindspring.com and tom.garvin:
Your command is wrong. It is not the number one (-1); the right thing to type is the letter L (-l).
On windows I found this really easy to use, here are the steps with the Nov 07 version:
1) download tesseract-2.01.exe.tar.gz and tesseract-2.00.eng.tar.gz
2) extract these files into the same folder (7-zip or whatever expanding software you prefer)
3) open a command window for this folder, where the tesseract.exe file is located.
4) prep a tiff image, in my case I took a digital picture of a book, tweaked it in photoshop and saved as a tiff with no compression. You could do the same with the Gimp.
5) now I put the tiff image into the same folder and then in the command window invoke the operation 'tesseract.exe MyImage.tif MyImageConverted -l eng'
6) the process runs in the background for a few seconds and then a new text-file appears with the name 'MyImageConverted.txt'.
On windows, using VC++Express, when enabling libtiff, I had to take two additional steps:
1) add HAVE_CONFIG_H to the preprocessor definitions
2) create an empty config_auto.h file
This is because the HAVE_LIBTIFF code in tesseractmain.cpp sits inside a HAVE_CONFIG_H block. For the rest it works fine.
I have some JPG and BMP files. What utilities can I use to convert these files to TIF files that TESSERACT will recognize? I used the Paint program provided with my Windows XP, but the TIF file it created was not recognized by TESSERACT.
Here is the log file:
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
Tessedit:Error:Read of file failed:number.tif
Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
I extracted some English characters and numbers from a scanned document, but the recognition results were not very good. The accuracy was only about 50%. In fact, these characters were very easy for a human to recognize. So I think there must be some problem.
1. Should I normalize the characters to a specific size before recognition? What is the best width and height of a character for recognition?
2. Will the space between characters affect the recognition performance?
To: samlal...@yahoo.com, MS Paint uses LZW compression. Try IrfanView (google it), you can save a file to TIFF and choose the compression. Choose no compression and it will work with Tess. You also may need to reduce the color depth, you can do that in IrfanView as well.
NA ABillionBillion.com Document Management for Everyone
It's important to have a big image of the text (in my case, a character size of 20x20 pixels works right); otherwise the result will be blank. With an image that was too small, I resized it with photoshop (to 300%), and the text was recognized without problems.
The readme says tesseract scans only a single column, but is that limited to only a single page with one column? If not, it's not working. If so, then I may have something useful for others. I wrote a script to scan multipage TIFFs using tesseract. If anyone wants a copy, just e-mail me.
While on the topic, I have a script (intended to run as a cron job) for finding all .tif files and tesseractizing them to help in later content-based searching.
Regards,
Jim
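For readers asking what such a batch script might look like, here is a minimal sketch, not the script mentioned above. It assumes tesseract is on the PATH and takes the command line form given in the ReadMe; the name ocr_tree and the choice to skip files that already have a .txt next to them are assumptions made for illustration.

```shell
#!/bin/sh
# ocr_tree is a hypothetical helper, not part of Tesseract.
# OCR every .tif file under a directory, writing NAME.txt next to NAME.tif.
# Usage: ocr_tree [directory] [lang]
ocr_tree() {
    root=${1:-.}
    lang=${2:-eng}
    # Note: file names containing newlines are not handled by this loop.
    find "$root" -type f -name '*.tif' | while IFS= read -r image; do
        base=${image%.tif}
        # Skip images that already have a text file, so repeated
        # (e.g. cron) runs only process new scans.
        [ -e "$base.txt" ] && continue
        # stdin is redirected so tesseract cannot consume the file list.
        tesseract "$image" "$base" -l "$lang" </dev/null ||
            echo "failed: $image" >&2
    done
}
```

A cron entry could then call something like ocr_tree /path/to/scans eng, where the path is a placeholder.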
On Ubuntu, it won't find the data files unless you do this:
That said, there are still problems, e.g. many variables in box.config are not found.
This is good stuff! Thanks, The Ray Smith!
A question: it seems like unrecognized characters get replaced by spaces in the output ascii. If this is true, is there a simple way to use some other character, like ~ ?
what is the script for doing multipage tiffs?
I've added my experiences of using Tesseract here:
http://www.scribd.com/doc/2589070/how-to-scan-books-to-text-files
It is from a very non-expert Windows perspective, so might be of use to some people... Please feel free to add any part of it to the documentation or wiki etc.
you must remove alpha channel from TIFF !!
for all the people getting the error:
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
or some variation thereof, you'll find that the eng.unicharset and most of the other files in the tessdata directory have size 0 kb. confusingly, they seem to have been put there as place holders. you need to download and install the various language packs separately.
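A quick way to check for the placeholder files described above is to list the zero-byte files in the tessdata directory. This is only a sketch: the function name empty_tessdata_files is hypothetical, and the default path is the one from the error message, so adjust it if your install lives elsewhere.

```shell
#!/bin/sh
# empty_tessdata_files is a hypothetical helper, not part of Tesseract.
# Print every zero-byte regular file directly inside a tessdata directory.
# Usage: empty_tessdata_files [tessdata-directory]
empty_tessdata_files() {
    # -size 0 matches only files that occupy zero blocks, i.e. empty files.
    find "${1:-/usr/local/share/tessdata}" -maxdepth 1 -type f -size 0
}
```

If this prints eng.unicharset and its neighbours, the language pack still needs to be downloaded and unpacked over them.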
How can i add libTIFF support under Linux environment?
barendgehrels is right. For installing libtiff - follow the original instructions and then do the following:
1) add HAVE_CONFIG_H to the preprocessor definitions
2) create an empty config_auto.h file
I have a bunch of documents, all the same size, with a field that is overwritten with a pattern - (perhaps to foil ocr..). How do I go about removing that pattern before attempting ocr? I can send a sample of the field.
I can not load the executable libtiff from http://gnuwin32.sourceforge.net/packages/tiff.htm
thanks
Can you add the comments from jhearn and m4rtin.m to the main documentation ?
I see that eng.unicharset is not included in the latest zip file (again). I grabbed one from an older zip file, but it appears to not be compatible.
Okay, i didn't have the lang file in the directory before i did make install, what should i do?
got mine to work after i changed the extension from .tiff to .tif ! doh!
still the output was not good enough to be recognizable. tesseract got confused by lines that were not aligned (because they belonged to a different article on the same page). It read the lines on the left of the page but not the lines on the right-hand side of the page (because they belonged to a different article and therefore slightly offset).
I'm also having problems with the charset files:
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I downloaded them from an older release (they were not included in this one) but no luck :(
Wikipedia says "Please note that the website at www.libtiff.org is a hijacked domain and while it now points to the real site for current development at www.remotesensing.org, the libtiff.org site still shows the latest version as 3.6.1, which is not correct. It also has an incorrect address for the Libtiff mailing list."
If that's the truth, it might be better to link to remotesensing instead of libtiff.org.
same problem here:
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I am trying to get tessnet2 working but in vain
same Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset problem
downloaded the eng pack but still no luck
I was able to build a tesseract executable using Visual Studio 2005. I extracted a page from a PDF file using .NET and created a TIF file to contain the image. I have installed Libtiff support and built my tesseract executable.
Tesseract appears to run without error when I run it against my Tiff file. Although, the output file contains the following characters "S.¤,SQ,Vi(< G u¤,¤n.<<d 6". These are not the text found in the Tiff file I created.
Does anyone know what I am doing wrong?
how do i get this to run on a macbook?
Could somebody not work on a UI for this?
For a UI, you can go through this thread. It is working fine as a C#.NET wrapper on tesseract-ocr. http://groups.google.com/group/tesseract-ocr/browse_thread/thread/d80a3989c5c0931f#
I need help: how can I use it for a language other than English?
Does anyone have a work around/fix for the Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset issue that people have posted earlier?? I just installed it yesterday and am unable to move ahead at all....
I downloaded each of the eng.* files from http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/ and the problem disappeared. You will find if you ln -s /usr/local/share/tessdata/eng.* all the files are probably zero.. (same in the /usr/src so probably not part of the .tar.gz)
Installation on a mac (ppc, 10.4.11) with english language ocr in mind:
a) download tesseract-2.03.tar.gz and tesseract-2.00.eng.tar.gz from the downloads page
b) open a terminal, cd to wherever you downloaded the above files, then do:
tar xvfz tesseract-2.03.tar.gz
cd tesseract-2.03
./configure
make
sudo make install
cd ..
tar xvfz tesseract-2.00.eng.tar.gz
sudo mv tessdata/ /usr/local/share/tessdata
sudo chown root:yourusername /usr/local/share/tessdata/
rm -rf tessdata
rm -rf tesseract-2.03
rm tesseract-2.03.tar.gz
rm tesseract-2.00.eng.tar.gz
Note in the above replace yourusername with your short username. eg for me it's sudo chown root:pete /usr/local/share/tessdata/
If you don't know what your short username is, type:
whoami
and the response is your short username.
c) you now have a working install of tesseract set up to do ocr on english language documents. To do other language documents, download the relevant language file, then repeat all steps from "tar xvfz tesseract-2.00.language.tar.gz" above.
d) NOTE: for tesseract to work, the tiff file you're running it on needs to be renamed to end in .tif (not .tiff) AND it needs to be an image without an alpha channel. If you've renamed the file and tesseract is still barfing, this is probably the problem. Use an image conversion utility with the ability to remove alpha channels to re-save your image. For bulk image conversion I recommend Imagemagick (it's gpl and runs well on the mac).
e) finally, just so everything is on one post, to ocr your tiff image, do:
tesseract inputimage.tif outputtext -l eng
and you should get a file called outputtext.txt.
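Step d) can be scripted with ImageMagick, which the post recommends for bulk conversion. The sketch below wraps ImageMagick's convert command with its -alpha off and -compress None options to write an uncompressed TIFF without an alpha channel, named with the .tif extension. The function name to_plain_tif is hypothetical, and it assumes ImageMagick is installed (recent releases name the tool magick rather than convert).

```shell
#!/bin/sh
# to_plain_tif is a hypothetical wrapper around ImageMagick's `convert`.
# It writes an uncompressed, alpha-free copy of an image with a .tif
# extension and prints the name of the file it created.
# Usage: to_plain_tif scan.tiff
to_plain_tif() {
    input=$1
    # Strip the extension only if the file name itself contains a dot.
    case ${input##*/} in
        *.*) stem=${input%.*} ;;
        *)   stem=$input ;;
    esac
    output=$stem.tif
    # Never overwrite the input when it already ends in .tif.
    if [ "$output" = "$input" ]; then
        output=$stem-plain.tif
    fi
    convert "$input" -alpha off -compress None "$output" &&
        printf '%s\n' "$output"
}
```

The printed name can be passed straight to the tesseract command in step e).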
I can compile the project with Visual Studio 2005, but when I run the tesseract.exe app, I get "The ordinal 166 could not be located in the dynamic link library libtiff3.dll". Libtiff3.dll is from the latest version of LibTIFF, and it is definitely in my path. Anyone else come across this?
We are experiencing the same error with Visual Studio 2008. Is there another mailing list to submit this too?
I am getting the following error when I issue the command: tesseract.exe out.tif out
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
Tessedit:Error:Read of file failed:out.tif
Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
Can someone please help me in this regard. I have already installed libtiff and set up the path etc as specified above but still I am getting the same error.
I am able to get tesseract2.03 running on OSX 10.5 just fine. However, when I compile with libtiff, the output becomes garbage. I tried both libtiff manual build or from Darwinports... any idea how I can go about this?
Here too, 2.03 after running the command from above: tesseract inputimage.tif outputtext -l eng
I get: Tesseract Open Source OCR Engine Image has 24 bits per pixel and size (450,24) Resolution=72
and then the outputtext.txt is created, but it is empty. This is for vanilla text copied from the screen, not handwriting. Any experience with non-error causing blank output?
i am unable to work with this software..anyone plz give ur email id..i will send the image..plz plz check and tell me the result whether it works..plz help frnds
Using Instructions from Comment by caitifty, Feb 28, 2009 - Installed 2.03 on MAC 10.5.7 Intel and it compiled and tested with phototest.tif successfully.. starhari86 let me know if u still need some help testing your image.
john-davids-macbook-pro:tesseract-2.03 thinkdunson$ make
-bash: make: command not found
john-davids-macbook-pro:tesseract-2.03 thinkdunson$ sudo make install
Password:
sudo: make: command not found
now what? i have no experience with terminal… anyone, please help.
hey barendgehrels, i was just wondering in which directory should i create the empty config_auto.h file?
Problem:
Unable to load unicharset file /usr/local/share/tessdata/spa.unicharset
Solution:
1. Download tesseract-2.00.eng.tar.gz from http://code.google.com/p/tesseract-ocr/downloads/list
2. Extract
3. Copy all files to /usr/local/share/tessdata/
;)
I use tesseract 2.04 on Windows Vista, and when I use the program I get a document full of nonsense. I believe it's because it sets a wrong charset. The text document uses utf8.
libleptonica-dev
The tesseract 3.0 build for Windows compiles out of the box, but the resulting executable doesn't run because leptonlib.dll is built against libpng12.dll, but libpng13.dll is what's included in SVN.
You can get the appropriate binary for the missing DLL from http://gnuwin32.sourceforge.net/packages/libpng.htm
Where can i download version 3.0?
i was able to get this working in both windows and cygwin. i found the recognition to be far superior in cygwin with the same language files.
after several unsuccessful attempts, i found rtfm to be the best approach.
I've posted my experience/first interactions with tesseract on Ubuntu linux at http://triviaatwork.blogspot.com/2009/08/first-interactions-with-tesseract-ocr.html
On the ReadMe page, under "Installation Notes - 3.00 Prerelease, General", it notes that Japanese language data files were available for version 2.04. Where can I get a hold of those? I have an older version of tesseract.
Is there a way to get or to visualize the page layout, the coordinates of the blocks found by the process? Is ScrollView a solution and, if yes, how to do it?
I am using tesseract 2.04. I know there is no layout analysis of tesseract, but is there any information other than the output text? Like line number or character (or word) number for a specific character/word, or anything else. It is OK if these information could be obtained in an intermediate step.
Thanks.
I have found a build error in tesseract-2.04 under Kubuntu 9.10 and g++ 4.4.1.
The file viewer/svutil.cpp does not compile because snprintf is not declared. Exact message:
g++ -DHAVE_CONFIG_H -I. -I.. -I/usr/local/include/liblept -g -O2 -MT svutil.o -MD -MP -MF .deps/svutil.Tpo -c -o svutil.o svutil.cpp
svutil.cpp: In constructor ‘SVNetwork::SVNetwork(const char, int)’:
svutil.cpp:323: error: ‘snprintf’ was not declared in this scope
It can be easily fixed by including the cstdio header in the file.
For the latest version on SVN (3.0), under OSX Snow Leopard with all required libraries installed via MacPorts, I get this error after make:
make all-recursive
Making all in ccstruct
source='blobbox.cpp' object='blobbox.o' libtool=no \
/bin/sh: ../config/depcomp: No such file or directory
make[3]: [blobbox.o] Error 127
make[2]: [all-recursive] Error 1
make[1]: [all-recursive] Error 1
make: [all] Error 2
Fixed the error by running this: ./runautoconf
possible for someone to upload tesseract-ocr 3.0 window bins?
Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset. How come I only get this in the Windows command prompt and not the cygwin one? The files are there. Has anyone been able to solve this problem yet? I have it installed on more than one box and only one box is giving me this error.
Hi Tesseract OCR Team and All Forum Members,
I am facing the following issue:
Thanks
Windows Binaries and a GUI is posted there in http://code.google.com/p/lime-ocr/
At present, there is only English language pack. You need to get optional language packs from http://tesseract-ocr.googlecode.com/svn/trunk/tessdata/
Hello,
I am very interested in this utility. I would like to be able to give it other input file formats like JPG or, maybe better for OCR, PNG. Uncompressed TIFF files are too heavy. Is there any place to download such a version? When will the next version be released?
sambanik > To recognize figure '0' instead of letter 'O' you can force this with the tool to recognize only figures and not letters (cf. Wiki FAQ)
raygos: You can find the Japanese data here: http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/jpn.traineddata?spec=svn309&r=309
Tesseract 2.03 installed on Ubuntu 9.10 through Synaptic package manager. 2.03 is the standard version for Ubuntu 9.10. I have the deu and eng languages installed too, no GUI front ends. I am trying to OCR a 131-page TIFF file, nearly 6mb in size. Tesseract churns through it until about p8 then throws a Segmentation Fault. I would really like to hear anyone's ideas how I go forward?
Is there any way I can make tesseract to work with .tiff files and write his output to standard output?
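One possible workaround, given that the command line described in the ReadMe writes to <outputbasename>.txt and that posts above report needing a .tif extension, is a small wrapper that copies the image to a temporary .tif, runs tesseract there, and prints the resulting text file. This is a sketch under those assumptions; ocr_to_stdout is a hypothetical name, and it assumes tesseract is on the PATH.

```shell
#!/bin/sh
# ocr_to_stdout is a hypothetical wrapper, not part of Tesseract.
# Run tesseract on an image (any extension) and print the text to stdout.
# Usage: ocr_to_stdout image.tiff [lang]
ocr_to_stdout() {
    ocr_tmp=$(mktemp -d) || return 1
    # Work on a copy named .tif, since the .tiff extension is reported
    # above not to be accepted.
    if cp "$1" "$ocr_tmp/page.tif" &&
       tesseract "$ocr_tmp/page.tif" "$ocr_tmp/out" -l "${2:-eng}" >/dev/null 2>&1
    then
        # tesseract appends .txt to the output base name.
        cat "$ocr_tmp/out.txt"
        ocr_status=$?
    else
        ocr_status=1
    fi
    rm -rf "$ocr_tmp"
    return "$ocr_status"
}
```

For example, ocr_to_stdout scan.tiff eng | grep -i invoice would search the recognized text without leaving files behind.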
I had to add the following to the configure file to make Tesseract 2.04 compile on Solaris 10 x86:
{ echo "$as_me:$LINENO: checking for Solaris 10 OS (if so, use -lrt -lsocket -lnsl)" >&5
echo $ECHO_N "checking for Solaris 10 OS (if so, use -lrt -lsocket -lnsl) $ECHO_C" >&6; }
if [ -n "`uname -a | grep SunOS | grep 5.10 `" ]
then
else
fi
Probably a hack, but it got it to compile without missing symbols ;)
anything better? tried on windows ok-ish results but dealing with raw tiffs is too heavy
The debian packages for leptonica are
>libleptonica and libleptonica-dev
I am using tesseract. I have taken an image that contains the characters "space". I used the command below:
tesseract space.tif result
In result.txt, only one character is present, 'a', instead of "space". Please help me solve this problem, and kindly share a script if you have one.
I tried all the steps to install tesseract on a Mac and all went well except:
sudo chown root:wael /usr/local/share/tessdata/
wael is my machine short name. I got:
chown: wael: Invalid argument
When I skip this step and continue, then try to run, I get this error: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
I'm very impressed with Tesseract.
Is it possible to tell it to only read a portion of an image - using an x,y origin and a width,height somehow?
Or do I need to chop up my image? (since I know where the text parts will always be)
thanks!
Hi guys. How can I make tesseract recognize only letters, not including numbers? Thanks.
Everyone,
I have 3 job openings requiring programmers with Tesseract experience. This is in Roanoke VA and it is long term. If you are interested please email me at carl@aptonet.com
Hi guys,
Then again I downloaded the tesseract 3.00 prerelease (checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only) and ran ./configure and make. I did not download the language files because they are already there. After installing it I ran the above script.
(Note: the following change was made in the script: LIBFILE=ccmain/libtesseract_full to LIBFILE=/usr/local/lib/libtesseract_api.) When I ran the script I got the error below:
"usr/bin/lipo: specifed architecture type (arm) for file (lnsout/libtesseract_api.a.arm) does not match it's cputype (16777223) and cpusubtype (3) (should be cputype (12) and cpusubtype (0))"
can any one help me to figure out this.
The above mentioned build script for iOS has been updated to work with tesseract v3 pre-release: http://robertcarlsen.net/2010/09/24/compiling-tesseract-v3-for-iphone-1299
-r
Hi all,
I am running Ubuntu 9.04 netbook on an Acer Aspire. I had managed to configure-make-make install the liblept and tesseract after I installed the other graphics -dev libraries. All seemed fine but when I try to run I get:
dalexy@mobile7:/$ tesseract
tesseract: error while loading shared libraries: libtesseract_api.so.3: cannot open shared object file: No such file or directory
The link and file of that name do exist. I have tried putting a tessdata link in /usr/local/bin, but the same message is delivered.
Any suggestions?
David Young
I've compiled tesseract but I don't know how to use the language files from here https://code.google.com/p/tesseract-ocr/downloads/list
I've unpacked language files into /usr/local/share/tessdata/ but I get the error message "Error openning data file /usr/local/share/tessdata/english.traineddata" (or any other language) if I use the -l option, even for English. I've tried different language files and the message was the same (of course, with different names). If I do not use the -l option it works (as English). So how can I choose the languages?
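The error message shows that the value given to -l is used as the data file name, so -l english looks for english.traineddata. The -l examples earlier on this page use the short code instead (eng for eng.traineddata). As a small sketch, the hypothetical helper below lists the codes available in a tessdata directory by stripping the .traineddata suffix; the default path is the one from this post.

```shell
#!/bin/sh
# list_tessdata_langs is a hypothetical helper, not part of Tesseract.
# Print the language codes usable with -l, one per line, based on the
# *.traineddata files found in a tessdata directory.
# Usage: list_tessdata_langs [tessdata-directory]
list_tessdata_langs() {
    for data_file in "${1:-/usr/local/share/tessdata}"/*.traineddata; do
        [ -e "$data_file" ] || continue    # the glob matched nothing
        name=${data_file##*/}              # strip the directory part
        printf '%s\n' "${name%.traineddata}"
    done
}
```

Any code it prints can be passed as the -l argument, for example tesseract page.tif out -l deu if deu is listed.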
Hi, I get this error in 3.0 but it works in 2.04:
futureha@ubuntu:~$ TESSDATA_PREFIX=/home/futureha/tesseract-3.00/ mftraining combine.tr
Failed to load unicharset from file unicharset
Building unicharset for mftraining from scratch...
Reading combine.tr ...
combine has no defined properties.
Error: Unable to open combine.tr!
Fatal error: No error trap defined! Signal_termination_handler called with signal 3000
Does anyone know what happened? Thanks.
Why does tesseract add itself to the Windows startup? The registry key is: HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Run: Tesseract-OCR
I hate it when programs do this for no good reason as it adds to startup clutter and it looks suspicious; I disabled it in msconfig. By the way, I used the Windows installer for tesseract 3.0: tesseract-ocr-setup-3.00.exe.
For fedora 10 the library names are libpng-devel , libjpeg-devel , libtiff-devel.
zlibg is not an available library when trying to install this via apt-get in ubuntu (at least with the default sources). You may want to update the name if it's been changed or add in info as to where one can get it.
What are the changes in the API? The only note I seem to be able to find is "important change is that you should really be using TessBaseAPI if you are linking with another program" which tells me nothing. I'm getting an error 'TessBaseAPI' has not been declared for a simple test program (which used to work in version 2):
Since there is no dll in the latest version, how should I link in order to be able to use tesseract API?
When I run make on tesseract 2.04 I get an error message. Leptonica 1.67 is installed. Any ideas?
g++ -DHAVE_CONFIG_H -I. -I.. -I../ccutil -I../ccstruct -I../image -I../textord -I../viewer -I../ccmain -I/usr/local/include/leptonica -g -O2 -MT leptonica_pageseg.o -MD -MP -MF .deps/leptonica_pageseg.Tpo -c -o leptonica_pageseg.o leptonica_pageseg.cpp
leptonica_pageseg.cpp: In static member function ‘static bool LeptonicaPageSeg::GetHalftoneMask(Pix, Pix, Boxa, Pixa, bool)’:
leptonica_pageseg.cpp:69:3: error: ‘int32’ was not declared in this scope
leptonica_pageseg.cpp:69:9: error: expected ‘;’ before ‘debug’
leptonica_pageseg.cpp:73:25: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp: In static member function ‘static bool LeptonicaPageSeg::GetTextlineMask(Pix, Pix, Pix, Boxa, Pixa, bool)’:
leptonica_pageseg.cpp:139:3: error: ‘int32’ was not declared in this scope
leptonica_pageseg.cpp:139:9: error: expected ‘;’ before ‘debug’
leptonica_pageseg.cpp:143:25: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp: In static member function ‘static bool LeptonicaPageSeg::GetTextblockMask(Pix, Pix, Boxa, Pixa, bool)’:
leptonica_pageseg.cpp:211:3: error: ‘int32’ was not declared in this scope
leptonica_pageseg.cpp:211:9: error: expected ‘;’ before ‘debug’
leptonica_pageseg.cpp:220:53: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp: In static member function ‘static bool LeptonicaPageSeg::GetAllRegions(Pix, Pix, Pix, Pix, bool)’:
leptonica_pageseg.cpp:273:3: error: ‘int32’ was not declared in this scope
leptonica_pageseg.cpp:273:9: error: expected ‘;’ before ‘w’
leptonica_pageseg.cpp:274:27: error: ‘w’ was not declared in this scope
leptonica_pageseg.cpp:274:31: error: ‘h’ was not declared in this scope
leptonica_pageseg.cpp:275:9: error: expected ‘;’ before ‘debug’
leptonica_pageseg.cpp:288:7: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp:293:7: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp:298:7: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp:302:7: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp:311:7: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp:320:7: error: ‘debug’ was not declared in this scope
leptonica_pageseg.cpp:322:58: error: too few arguments to function ‘PIX pixRenderRandomCmapPtaa(PIX, PTAA, l_int32, l_int32, l_int32)’
/usr/local/include/leptonica/leptprotos.h:634:23: note: declared here
leptonica_pageseg.cpp:332:7: error: ‘debug’ was not declared in this scope
make[3]: *** [leptonica_pageseg.o] Error 1
make[3]: Leaving directory `/root/tesseract-2.04/pageseg'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/tesseract-2.04/pageseg'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/tesseract-2.04'
make: *** [all] Error 2
g++ -DHAVE_CONFIG_H -I. -I.. -g -O2 -MT svutil.o -MD -MP -MF .deps/svutil.Tpo -c -o svutil.o svutil.cpp
svutil.cpp: In constructor ‘SVNetwork::SVNetwork(const char, int)’:
svutil.cpp:323: error: ‘snprintf’ was not declared in this scope
After getting this error I searched the comments. There I found a suggestion to include cstdio.h, but I could not find where cstdio.h is on my system. Please help.
Adding tesseract as a static library.
If you found, when upgrading to revision 552, that DLL support was suddenly no longer available for your Visual C++ projects, you could include tesseract as a static library instead, if that is an option. I will just outline the basics.
First copy the tesseract project file and rename it to tesslib. Add the tesslib project to the tesseract.sln solution and remove the tesseract.cpp and .h files. Then in the project properties select lib instead of application. You can also go to the Librarian option and select Link Library Dependencies; this will save you some time when including the lib. Build the project.
Now open a new project or your existing one. You will have to add this to the linker command line:
tesseract\tesslib.lib /ignore:4099 tesseract\leptonlib-static-mtdll.lib tesseract\libjpeg-static-mtdll.lib tesseract\libpng-static-mtdll.lib tesseract\libtiff-static-mtdll.lib
(Make sure the files exist.) Also, in Linker/Input Additional Dependencies add WSock32.Lib, which is needed for viewer.lib.
Now add these header files: apitypes.h baseapi.h publictypes.h thresholder.h unichar.h
Your project should now build as before when you include baseapi.h.
I am trying to learn the 3.0 API by comparing it to the 2.04 API.
The phrase "// Now run the main recognition" appears 3 times in each case. So I am trying to find the 3.0 equivalents of the 2.04 calls. (Why? Because there are more examples of how to do it in 2.04.)
Question (1)
In 3.0 API method:
int TessBaseAPI::RecognizeText(ETEXT_DESC monitor)
What steps are needed before calling this method? (E.g., is it sufficient to call SetImage and set the language, and then pass in the initialized ETEXT_DESC monitor?)
Question (2)
In the same method: page_res = new PAGE_RES(block_list, &tesseract->prev_word_best_choice);
What is "&tesseract->prev_word_best_choice"?
Where can I find out more?
FYI:
The equivalent in api 2.04 is:
// Low-level function to recognize the current global image to a string.
char TessBaseAPI::RecognizeToString() {
}
Comparing with the 2.04 API helps me move one step further in getting familiar with the 3.0 API.
E.g., now I can understand the purpose of the "RecognizeText" method, which is "// Low-level function to recognize the current global image to a string." The next question is where to go to get the text string out, i.e. what is the equivalent of "TesseractToText" in API 3.0?
Thanks for the continued support. 3.0 truly has lots of improvements. I believe that with enough feedback there will soon be blogs from others that explain things in enough detail to help new users like me use the API.
@jim: I suggest you read this (ReadMe) page once again and pay attention to this: If you need support please try to search and use tesseract user forum or tesseract developer forum.
Anyone with the error message "tesseract: error while loading shared libraries: libtesseract_api.so.3: cannot open shared object file: No such file or directory ", be sure to follow the instructions in this readme, specifically,
sudo ldconfig
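To see whether the library is visible after running ldconfig, one option is a quick check from Python. This is only a sketch: ctypes.util.find_library uses its own search (on Linux it consults the ldconfig cache among other things), which is close to, but not guaranteed identical to, what the runtime loader does. The helper name library_visible is hypothetical, and the library name "tesseract_api" is taken from the file name in the error message.

```python
from ctypes.util import find_library

def library_visible(name):
    """Return True if a shared library called lib<name> can be located.

    Hypothetical helper around ctypes.util.find_library, which returns
    None when nothing is found.
    """
    return find_library(name) is not None

if __name__ == "__main__":
    # "tesseract_api" corresponds to libtesseract_api.so.* from the error above.
    print("libtesseract_api visible:", library_visible("tesseract_api"))
```

If this prints False after sudo ldconfig, the directory holding the library (often /usr/local/lib for a source build) may not be in the loader's configured search path.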
actual_tessdata_num_entries <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
It's not zlibg -- the correct apt-get name is zlib1g-dev
One more requirement needs to be added to the instructions for building under Linux (Ubuntu 10.10): sudo apt-get install libtool
The apt-get package name is: libleptonica-dev
So far no one has posted a solution to the error "Unable to load unicharset file" Could someone please give a detailed list of steps to solving this?
So much work for nothing... Why not use the KIS method? Are we in the DOS era? I can't understand... I expected much better...
Has anyone tried the 3.0 version with Android?
Very cool software! Thank you guys, but I didn't find anything about batch conversion. Is there any documentation about batch, nobatch.chop and this kind of config?
cheers
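This does not answer the question about the batch and nobatch config files, but for converting many images one simple approach is to loop over a directory and run the documented command line, tesseract <image> <outputbasename> [-l lang], once per file. A minimal Python sketch, assuming the tesseract executable is on the PATH; the function names build_commands and run_batch and the list of extensions are hypothetical.

```python
import os
import subprocess

# Assumed set of input extensions; adjust to what your Leptonica build can read.
IMAGE_EXTENSIONS = (".tif", ".tiff", ".png", ".jpg")

def build_commands(image_dir, out_dir, lang="eng"):
    """Build one tesseract command line per image file found in image_dir."""
    commands = []
    for name in sorted(os.listdir(image_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS:
            commands.append([
                "tesseract",
                os.path.join(image_dir, name),  # <image>
                os.path.join(out_dir, base),    # <outputbasename>
                "-l", lang,
            ])
    return commands

def run_batch(image_dir, out_dir, lang="eng"):
    """Run the commands one after another and return the exit codes."""
    return [subprocess.call(cmd) for cmd in build_commands(image_dir, out_dir, lang)]
```

Keeping the command construction separate from the execution makes it easy to print the commands first and check them before running anything.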
There is an Android app, 'OCR Test' by Robert Theis. Has anyone tried it? Has the source code been modified to suit Android's small footprint?
I was unable to get a working build using the Tesseract 3.0 source under AIX 5.1/gcc 3.3.3. I moved back to Tesseract 2.04, and it built and ran with only a single change. In config/config.h.in, the line "#undef LARGE_FILES" needs to be commented out, so that the 64 bit file I/O operations will link correctly.
Also, the OCR results on marginal text (rubber stamps with very small text, in images scanned from old microfilm) were noticeably better with 2.04 on AIX than with the pre-built Windows version 3.00.
Hello,
I'm using the options "-l eng" and a config file with "tessedit_create_hocr 1". Can anyone explain why I get different text layout interpretations when building from svn on linux (debian) vs. using the 3.0 installer on Windows? I replaced the tessdata on linux with the Windows one (because it interpreted columns properly) and it still works differently.
If it is due to a revision that has happened in svn but not in the installer, how can I find out what revision to revert to in order to make them the same?
Thanks in advance, Dev
The installation process worked like a charm on OpenSuSE 11.4 - x86_64. The only negative I have is that I had to copy the man pages manually to the man directory. Great software, by the way.
Hi friends,
"leptonica library missing"
For those on Ubuntu Natty 11.04, I had to add these to the top of my configure file:
CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib"
Actually, I haven't found the #define HAVE_ZLIB 1 statement in config_auto.h after executing ./configure as suggested in the Linux Installation section of this Readme. All the needed libraries are installed on my Ubuntu Lucid. I found #define HAVE_LIBZ 1 instead. Maybe this is the option I was looking for? When I look through tesseract-3.00/vs2008/include/leptonica/environ.h I see the definitions for the image I/O libraries, plus zlib, and #define HAVE_LIBZ 1 is mentioned there. Maybe there is a mistake in the Readme?
I am new to OCR, and I have run into a problem.
I have a program-generated TIFF image (white background without noise, and a not very regular font). The program generates the characters ö and ü, but a lot of the time tesseract recognizes them as o and u. And a lot of the time the o and u characters are recognized as ö and ü.
Maybe tesseract thinks that the dots at the top of the character are noise, but they aren't. There is no noise in the picture at all!
Can anybody help me?
Hello, I am running a C# version. On the first day there was no problem, but on the second day it started failing: as soon as the program reaches the Init() line it exits on its own, with no error message.
Bitmap image = new Bitmap("eurotext.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
ocr.Init(Application.StartupPath + @"\tessdata", "eng", false); // To use correct tessdata
List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
foreach (tessnet2.Word word in result) Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
This is very frustrating. The same program runs without problems on someone else's computer, which is puzzling. I hope someone can help.
@barnabas You might want to take a look at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 to learn to train Tesseract for additional fonts.
Is there any way to avoid the coordinate numbers in brackets that appear after every word?
I've prepared a bash script for building Tesseract 3.01 for iOS SDK 5 using Clang and fat files for universal binaries: http://tinsuke.wordpress.com/2011/11/01/how-to-compile-and-use-tesseract-3-01-on-ios-sdk-5/
The download section now contains, for example, these files:
tesseract-ocr-3.01.eng.tar.gz: English language data for Tesseract 3.01
eng.traineddata.gz: English language data for Tesseract (3.00 and up)
Should the first be used for Tesseract 3.01 instead of the second?
The archive tesseract-ocr-3.01.eng.tar.gz contains an eng.traineddata file that is much bigger than the one in the eng.traineddata.gz archive, plus a lot of other files.
@hoogli - thanks for the advice, except "sudo ldconfig" is mentioned in this README only for leptonica, not for the tesseract installation itself. But yes, it works better after :-)
I get a failure when trying to build Tesseract for iPhone with the script build_fat.sh from this well-known link: http://robertcarlsen.net/2010/09/24/compiling-tesseract-v3-for-iphone-1299.
But there isn't any problem building Tesseract on my Mac! I used Tesseract revision 640. My Mac OS X version is 10.6.7 and the iPhone SDK is 4.3 (I fixed the SDK version in the script).
Maybe I'm making a simple mistake, but please help.
I get the following error when I try to run a sample:
Tesseract Open Source OCR Engine v3.01 with Leptonica
Error in pixReadStreamJpeg: function not present
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Unsupported image type.
I have installed leptonica-1.67 and leptonica-1.68 and tried with both, and both failed. I tried on Mac and Fedora too, and neither worked.
I have the same problem as glah...@gmail.com
$ tesseract pictures/ppkrock1_278445a.jpg out.txt -l swe
Tesseract Open Source OCR Engine v3.01 with Leptonica
Error in pixReadStreamJpeg: function not present
Error in pixReadStream: jpeg: no pix returned
Error in pixRead: pix not read
Unsupported image type.
"Error in pixReadStreamJpeg: function not present" - simply means you didn't have libjpeg support in leptonica so jpeg files can't be read.
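Since the file extension alone does not prove what a file contains, it can help to confirm the real image type before deciding which image library Leptonica needs to have been built with. A small Python sketch that looks at the leading bytes of the file; the function name sniff_image_type is hypothetical and only JPEG, PNG and TIFF signatures are checked.

```python
def sniff_image_type(path):
    """Guess the image type from the file's leading signature bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"  # reading this needs libjpeg support in Leptonica
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"  # little-endian or big-endian TIFF
    return "unknown"
```

If the result is "jpeg" and the error above appears, the likely remedy is to install the libjpeg development package and then rebuild Leptonica so that it picks up JPEG support.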