TrainingTesseract
How to use the tools provided to train Tesseract for a new language.
For Tesseract 2, see TrainingTesseract2; for Tesseract 3, see TrainingTesseract3.
I have discovered empirically that all of the characters used in the second field of DangAmbigs need to be listed in unicharset.
What is the "best" font for text recognition by Tesseract? I can choose how I print these out, but I'm finding that some fonts work better than others.
The best font is the one closest to the text you want to recognize.
"The training data currently needs to fit on a single page." Is there a limit on page size, and if so, what is it?
Anybody training Tesseract to recognize Russian?
Is anyone using Tesseract to recognize handwritten text (in particular, numbers)?
Hello
Will there be any work to support Arabic (along with Hebrew, Persian, Urdu, and the others)?
What about these specialized algorithms, are there any plans to implement them? I think the problem is similar to cursive English, so maybe if it can handle cursive English, it could be modified to handle Arabic?
I'm thinking of trying to train it to use Arabic anyway.
Thanks
How much does the training process depend on the "realism" of the input? I.e., does it necessarily have to be a scanned image?
Is it possible to train tesseract just on the computer rendered image of some page with known position of symbols and their bounding boxes (one can produce something like this using e.g. LaTeX + some post-processing of dvi file)? If this is possible it would save lots of manual work.
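A minimal sketch of the rendered-page idea, assuming the renderer can report each glyph's rectangle measured from the top-left corner of the page, and assuming the box file uses `char left bottom right top` lines measured from the bottom-left corner. The `Glyph` type and `to_box_lines` function are hypothetical names for illustration, not part of Tesseract.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """One rendered character and its rectangle, measured from the top-left corner."""
    char: str
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int


def to_box_lines(glyphs: List[Glyph], image_height: int) -> List[str]:
    """Convert top-left-origin rectangles to 'char left bottom right top' lines.

    Assumption: the box file origin is the bottom-left corner of the image,
    so the y axis has to be flipped.
    """
    lines = []
    for g in glyphs:
        left = g.x
        right = g.x + g.width
        top = image_height - g.y                  # height of the top edge above the image bottom
        bottom = image_height - (g.y + g.height)  # height of the bottom edge above the image bottom
        lines.append(f"{g.char} {left} {bottom} {right} {top}")
    return lines
```

Writing the returned lines to `fontfile.box` next to the rendered image would replace the manual box editing step, provided the rectangles are tight enough for the trainer to accept.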
Is it possible to train Tesseract to recognise Chinese? If I only train it on the ~3000 most frequent Chinese characters, how slow will it be?
Is there a way to just limit the characters without completely retraining? I.e., I have an application where I only need to scan numbers, so I only need 0-9 and a decimal point.
Thanks.
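One thing that may be worth trying before retraining is a config file that sets the `tessedit_char_whitelist` variable. The sketch below only builds the config line and the command line, it does not run anything. It assumes your build honours that variable and accepts a config file name as a trailing argument (the same way `box.train` is passed in the training commands on this page); the function names and the `digits_only` file name are made up for illustration.

```python
from pathlib import Path
from typing import List


def whitelist_config_line(allowed: str) -> str:
    """Return a one-line config that restricts recognition to `allowed`."""
    return f"tessedit_char_whitelist {allowed}\n"


def write_whitelist_config(config_path: str, allowed: str) -> None:
    """Write the config line to a file (hypothetical location chosen by the caller)."""
    Path(config_path).write_text(whitelist_config_line(allowed), encoding="utf-8")


def build_command(image: str, output_base: str, config_name: str) -> List[str]:
    """Build (but do not run) a command line that names the config file last.

    Assumption: config names go after the output base; some 2.x builds may
    also expect 'nobatch' before the config name.
    """
    return ["tesseract", image, output_base, config_name]
```

For digits and a decimal point, `whitelist_config_line("0123456789.")` gives the line to put in the config file.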
Does anybody have the following files
There is something seriously wrong here...I'm trying to use Tesseract 2.01 on WinXP (first time). When I try to follow the 'Tesseract for Training' procedures, I execute this command: tesseract eng.arial.tif junk nobatch box.train
The log file tells me it's unable to open tessdata/eng.inttemp. The documentation tells me this file is created by running mftraining on the .tr files, but it's the above command that creates the .tr files! I get the same problem when trying to use the provided tif/box files. Help! How do I start??
Also, the Windows .exe package was missing the batch.nochop, makebox, nobatch, box.train files - I had to pull these out of the source files instead.
I completed the training on my data set, generated all eight files, and transferred them to the tessdata directory. When I try to run it, I get the following error:
$ tesseract textImg.tif textImgOut.txt -l myLang
Error: Illegal malloc request size!
Fatal error: No error trap defined! Signal_termination_handler called with signal 2001 Signal_exit 30 SIGNAL ABORT. LocCode: 3 SignalCode: 3
Note that if I run without specifying a language, thus using the default settings, tesseract works fine. What am I doing wrong?
Thanks.
I have the same problem as jj-j...@hotmail.com.
Any help, please?
Regards,
Duy.
can anybody tell me how to do the training for English from scratch, step by step?
Regards,
Duy.
kbwiley, your stock eng.freq-dawg is empty. You must replace it with something else, but right now I don't know how to make it.
Look at the download section. The names are confusing: the program is tesseract-2.01.tar.gz, the data files for English are tesseract-2.00.eng.tar.gz, for Italian tesseract-2.00.ita.tar.gz, and so on. Unpack them in the data dir. It worked for me.
Are the source word lists for the eng.dawg files available anyplace? Thanks!
Try this script. It can generate a good picture from a box file and a training page image. Example:
$ ./boxes.sh fontfile.box trainingpage.tif result.bmp
http://pastebin.ca/891649
It needs bash, grep, and imagemagick. It is good for finding mistakes in fontfile.box, splitting merged letters, and so on.
Anybody training Tesseract to recognize Portuguese?
Has anybody ever used Tesseract to recognize text containing subscripts and superscripts? Text like this:
thx
Is there a straightforward way to tell tesseract that all characters it will encounter are numbers? Is there a command line switch or must I train it on a numbers-only training file?
Is "tesseract-21?.00.eng.tar.gz" trained with "boxtiff-2.01.eng.tar.gz"?
I'm trying to train Tesseract for a new language, but it doesn't work. I have Windows XP, and it works until the step "Run Tesseract for Training". I have the files tdata.tif and tdata.box (edited). If I now start the program with "tesseract tdata.tif junk nobatch box.train", my CPU usage rises to 100% and it never stops. The tesseract.log is empty. What can I do now? I need help.
I have the same problem as jj-j...@hotmail.com. I'm now compiling tesseract-2.03. Maybe it will work.
You have to use Linux for this program to be stable.
All commands worked, allowing me to generate the training files for my new "language." When I finished, I tried running tesseract with -l MyLanguage and received:
Error: Illegal malloc request size!
Fatal error: No error trap defined! Signal_termination_handler called with signal 2001 Signal_exit 30 SIGNAL ABORT. LocCode: 3 SignalCode: 3
This was on Mac OS X 10.5.3
please help me with the error T_T
"Error: 48 classes in inttemp while unicharset contains 49 unichars."
If anyone understands the problem, please email me (bestwish2u1025@yahoo.com)
many thanks.
Best wish to all _
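For a class-count mismatch like the one quoted above, one thing worth checking (a guess, not a confirmed cause) is whether the characters in the box files match the entries in unicharset. The sketch below compares the two, assuming box lines start with the character followed by at least four coordinates, and assuming unicharset starts with a count line followed by one entry per line with the character in the first column. All function names are hypothetical.

```python
from typing import Dict, List, Set


def box_characters(box_text: str) -> Set[str]:
    """Collect the distinct characters named in the first field of each box line."""
    chars = set()
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:          # character plus four coordinates
            chars.add(fields[0])
    return chars


def unicharset_characters(unicharset_text: str) -> List[str]:
    """Return the first column of a unicharset, skipping the leading count line."""
    lines = [ln for ln in unicharset_text.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]


def compare(box_text: str, unicharset_text: str) -> Dict[str, List[str]]:
    """Report characters present in one file but not in the other."""
    in_box = box_characters(box_text)
    in_set = set(unicharset_characters(unicharset_text))
    return {
        "only_in_box": sorted(in_box - in_set),
        "only_in_unicharset": sorted(in_set - in_box),
    }
```

An entry that appears only in unicharset (apart from any placeholder entry for the space character) would be a candidate for the extra unichar.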
The Portuguese Tesseract package I downloaded from the Ubuntu repository comes with just one word. How can we contribute officially to the Tesseract dictionaries?
Is anybody training on digits only? I am doing this, but I need 100% accuracy.
Can Tesseract achieve that? Thanks.
Like Mr Mohamed m.k., I am interested in creating support for the Arabic, Urdu, and Persian languages.
I need someone's help.
Mr Mohamed, if you can, please contact me at ashrafirafique@gmail.com
Hi, I have created Russian (rsl.) tessdata files and they work! How can I check in my files? Best regards, Rumen
When I tried to train it for a single letter to test the system, the box file was empty. Other times when I tried to train it for a full charset of 125 symbols (ascii codes + a few diacritical characters) it yielded only 50 lines in the box file (50 from 125). What is happening?
More to the point, can you provide help in training from UTF characters? Like, the one you used in your example, the "ü". Or even better, if I could list the character set and it would immediately optimize for those characters. The whole TIFF - BOX process is cumbersome for 99% of the applications.
Usually, how many words do you put in frequent_words_list and words_list?
Mr Mohamed, if you can, please contact me at nouri.mohammadreza@gmail.com
Anyone looking for a list of words so they can generate their word-dawg files should be able to find some comprehensive word lists here http://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html
I need OCR to get a machine readable version of a translation of an unknown language. I am going to try to build the lexicon, so at present, I have no dictionary. Is it possible to bootstrap a dictionary during the training process, or, alternatively, is there a way to turn off the top down processing, so that only individual segments are analyzed? Any help would be very welcome. Thank you.
I have been reading the code for a month, but I still don't understand the principle of the recognition. It does not seem to extract the characters from the words. If so, what does it rely on to recognize a character? Thank you very much.
You really want to use tesseractTrainer.py for editing box files.
You really don't want to touch box files with a barge-pole. If you are editing box files you are not taking it seriously.
Generate the TIFF and the Box file together from the given font and text.
I don't know what libraries you are using but on .Net it would be something like:
This will allow automated training of the engine for any given font and language. Even the silly curly fonts.
I say it will be "something like" this primarily because the rectangles generated by MeasureTextRanges and GetBounds may not correspond exactly to the rectangles produced by Tesseract, so you may need a two-pass solution where Tesseract first does its best, then you match your rectangles to Tesseract's rectangles in order to correct the box file.
But either way, there is no good reason to edit a box file by hand.
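The second pass described above could be sketched as follows, in Python rather than .NET. This is only one possible matching rule (largest overlapping area), chosen for illustration; it assumes both sets of rectangles are `(left, bottom, right, top)` tuples in the same coordinate system, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, bottom, right, top)


def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two rectangles, or 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0


def match_boxes(
    rendered: List[Tuple[str, Box]], detected: List[Box]
) -> List[Tuple[str, Optional[Box]]]:
    """For each known character, pick the detected box that overlaps it most.

    Characters with no overlapping detected box are paired with None so they
    can be reviewed separately.
    """
    result = []
    for char, rect in rendered:
        best = max(detected, key=lambda d: overlap_area(rect, d), default=None)
        if best is not None and overlap_area(rect, best) == 0:
            best = None
        result.append((char, best))
    return result
```

The matched pairs give the known character from the rendering step together with the rectangle Tesseract itself found, which is what the corrected box file would need.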
If you're on unix/linux, you can also try editing boxfiles with tessboxes. It has some simple logic for cropping characters automatically.
I've uploaded a copy to the files area http://tesseract-ocr.googlegroups.com/web/tessboxes-0.5.tar.gz or you can try to download it from here http://www.lbreyer.com/tessboxes.html
Can I request training of Tesseract for Hindi, or other languages such as Marathi and Gujarati that are written in Devanagari script?
Are there plans to develop the "Source training data" for Portuguese?
How do I make the freq-dawg, word-dawg, and user-words files? I already have a dictionary file saved as UTF-8. Please give some Linux commands to make them.
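A rough sketch of preparing the two plain-text word lists, assuming the input is UTF-8 text with one word per line. Note that a frequent-words list needs frequency information, so this assumes the input contains repeated words (for example, a tokenised corpus); a plain dictionary with each word once cannot be ranked this way. The function names are hypothetical, and the resulting lists would still have to be converted with wordlist2dawg as described in that section of the page.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def clean_words(lines: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace and drop blank entries."""
    return [w for w in (line.strip() for line in lines) if w]


def build_word_lists(lines: Iterable[str], frequent_count: int) -> Tuple[List[str], List[str]]:
    """Return (frequent_words, all_words).

    frequent_words: the `frequent_count` most common entries, most common first.
    all_words: every distinct entry, sorted.
    """
    counts = Counter(clean_words(lines))
    frequent = [word for word, _ in counts.most_common(frequent_count)]
    return frequent, sorted(counts)
```

Each returned list can be written out one word per line in UTF-8 to serve as the frequent-words and full word list inputs.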
Under the wordlist2dawg section, please add a link to the wordlist2dawg memory fault entry in the FAQs. It could have saved me some time.
Thanks for this article, anyhow.
First, thank you for your effort. This looks like a promising technology and I can't wait to get it to work.
I can't train properly following the sequence as stated on this page. The command line under "Run Tesseract for Training" (tesseract fontfile.tif junk nobatch box.train) appears to require eng.unicharset and inttemp--which appear to require products of the training session.
I've installed tesseract-2.01.exe.tar.gz and boxtiff-2.01.eng.tar.gz on Windows XP. I left all the box files alone. Configurations appear to be okay and in the right place.
Suggestions would be greatly appreciated.
Apparently there's a circular dependency in creating these files from scratch.
In case anybody else has this problem, the solution is to install the eight core files which are found in tesseract-2.00.eng.tar.gz . Put these in your tessdata folder and then you can begin to train your own batch.
--Ray
People, I've done all the steps and made all 8 files. I run the command tesseract image.tif output -l geo (geo for Georgian), and the tesseract.log file tells me that it was unable to load the geo.unicharset file. Why? It is in the tessdata folder with the other 7 geo. files. Please help.
Or maybe you could tell me where in the code the unicharset file is loaded, so I could understand the reason for the error.
Hello All,
I am trying out Tesseract. I downloaded tesseract2.01 (and extracted it to a folder) and also boxtiff-2.01.eng.tar.gz, which I extracted to an eng folder. I am using MS Windows Vista.
Q1. Do I need anything else to use Tesseract for English?
Q2. Do I need to train Tesseract for English? Since I already have .tif and .box files in the eng folder, I guess it is not necessary. Can anyone confirm, please? If I need to train, can someone give some steps, please?
Q3. Does anyone have an answer to an earlier question posted in 2007 -
Thx
Please note that I moved tessboxer from http://www.ospilka.com/dl/tessboxer.zip to http://sites.google.com/site/spilkaondrej/
I have the same problem as krazilek. I try to train fonts for Khmer language. Please help.
Please look at this tutorial: http://vannait.blogspot.com/2009/06/how-to-train-tesseract-ocr.html
Does anyone know if I can train Tesseract to recognize whole symbols? I mean, reading a box containing more than one character? Like:
Handwritten "thousand" -> boxfile: 1000 x0 y0 x1 y1
or something like that?
Alright, here's my rant:
But Tesseract itself has shown improvement (or "learning"). Thanks for your work!
Hi, is there a keyword to set a size limit for the box? The engine tries to pick up very small features, e.g., a speck of dust, and assigns them a character. Since the language I am trying to train the engine with has a clear character size, it would be wonderful to let the engine just ignore features that are too small. Thanks!
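This does not answer whether the engine has such a setting, but as a workaround on the training side, the box file itself can be filtered before training so that dust-sized boxes are dropped. A small sketch, assuming `char left bottom right top` lines; the function name and thresholds are made up for illustration.

```python
from typing import List


def drop_small_boxes(box_lines: List[str], min_width: int, min_height: int) -> List[str]:
    """Keep only box lines whose rectangle is at least min_width x min_height pixels."""
    kept = []
    for line in box_lines:
        fields = line.split()
        if len(fields) < 5:
            continue                      # not a 'char left bottom right top' line
        left, bottom, right, top = (int(v) for v in fields[1:5])
        if right - left >= min_width and top - bottom >= min_height:
            kept.append(line)
    return kept
```

Thresholds should stay below the size of the smallest real mark in the script (periods, diacritics), otherwise those would be removed along with the dust.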
G'day all - I am trying to apply a template to the extracted image text. I currently slice an image up into smaller segments, and then perform the OCR on each one. The trouble is that the alignment of the template doesn't always match the image due to manual scanning errors. Is it possible to extract the x/y coordinates of a text conversion from an image? Alternatively, is there a different way to perform this type of registration process other than hunting and slicing throughout a document?
Hi all. For people who are training Tesseract on new languages under Windows, I've written a useful program which is like bbTesseract, but with more functionality and more stable. The instructions and program will be here soon.
for more information write me krazilek@gmail.com
Anybody training tesseract to read Afrikaans? Contact me at jacobuspdebeer@gmail, to team up with all this coding. Regards, jaco
My ABBYY FineReader has no problem with Afrikaans. Is it necessary to develop a new program?
Has anyone tried using Tesseract to read Saba script - Amharic, Tigrinya, Ge'ez? The characters are all stand-alone, so it would not be as difficult as Arabic.
Hi cheers,
I succeeded in embedding Tesseract into an embedded Linux on an ARM platform.
Now I'm looking for a library which can recognize the OCR-A and OCR-B fonts.
I tried it with the French and English libraries, but the result is not very good...
tesseractTrainer.py is released as a standalone utility for Windows under the lime-ocr project. It requires no Python installation and runs out of the box. You can download it from http://lime-ocr.googlecode.com/files/Tesseract-Trainer-1.5.0.1.exe
Thanks, Mr nishad! I tried the Linux tools to train Tesseract and they work well. I obtained very good results ^^
Like Mr Mohamed m.k. and ashrafirafique, I am interested in creating support for the Arabic, Urdu, and Persian languages.
Please help me.
Contact me at feyzollahi.a@gmail.com
I'm Chinese. Can it automatically convert training text to an image? And can it train automatically from an existing image and training text?
Will Tesseract support Tamil fonts?
I was having trouble getting a .tiff image with only 8 bits or fewer per pixel. I was getting error messages like:
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:16
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32
In using Gimp 2.6.3, I found the following would work to create a .tiff image with the right bits per pixel.
1. Create your image
2. Choose the menu option Image->Mode->Indexed...
3. In the dialog that appears:
4. Now save your image (choose a .tif/.tiff extension)
5. In the Save as TIFF dialog, choose None for compression.
The resulting .tiff image can be read by Tesseract.
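To check a file before handing it to Tesseract, the bit depth can be read from the TIFF header directly. The sketch below reads the BitsPerSample tag (258) of the first image directory using only the standard library. It assumes a classic (non-BigTIFF) file, and it is my assumption that the "bpp" figure in the error message corresponds to the sum of the returned values. The function name is hypothetical.

```python
import struct
from typing import List


def tiff_bits_per_sample(data: bytes) -> List[int]:
    """Read the BitsPerSample tag (258) from the first image directory of a TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])   # byte-order marker
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i              # each directory entry is 12 bytes
        tag, field_type, count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag != 258 or field_type != 3:            # field type 3 = SHORT
            continue
        if count <= 2:                               # values stored inline in the entry
            raw = data[start + 8:start + 8 + 2 * count]
        else:                                        # entry holds an offset to the values
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + 2 * count]
        return list(struct.unpack(order + f"{count}H", raw))
    return [1]                                       # TIFF default when the tag is absent
```

For example, `sum(tiff_bits_per_sample(open("page.tif", "rb").read()))` would give 8 for an 8-bit indexed or greyscale image and 24 for 8-bit RGB (`page.tif` being a placeholder file name).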
In order to get Tesseract-Trainer (available from http://lime-ocr.googlecode.com/files/Tesseract-Trainer-1.5.0.1.exe ) to be able to open a tiff file, I had to download and install libtiff from http://gnuwin32.sourceforge.net/packages/tiff.htm.
I then had to add the path to libtiff3.dll to my Path environment variable. For me (on Vista 64bit), that path was: C:\Program Files (x86)\GnuWin32\bin
To set the environment variable on Windows Vista:
1. Right click on My Computer on the desktop and choose the Properties item
2. Click on Advanced System Settings
3. Click the Environment Variables... button
4. In the System variables area, scroll down till you find the "Path" entry and click on it
5. Click Edit...
6. Add ;C:\Program Files (x86)\GnuWin32\bin (semicolon plus whatever the path is to the directory that holds your libtiff3.dll)
7. Click OK all the way out.
Hi, guys. I wrote a quick and easy box creator for Windows. I hope you like it. http://code.google.com/p/owlboxer/
Wow, http://code.google.com/p/owlboxer/ is the best boxfile editor!!!
Hi people, I also have the problem described by "jeremy.r.brown": when converting a png file to tiff and trying to read it, Tesseract says:
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:16
Segmentation fault
How can I fix this? Please help
Oh, I tried the convert command with this option: convert -type Truecolor .. ..
and it works.
thanks anyway to all for this project first of all!!!
What about Tesseract 3? Please update this tutorial.
Hello all,
@rsallar: I just had a look at the .traineddata files, and to me it looks like they are just containers for all the previous configuration files. I'll try to investigate a bit and find how they are ordered inside, and whether there is a descriptor of any kind with offsets.
Pierre.
Hello again,
For people interested in the new undocumented training data format, I've just tried to understand how it works. I used eng.traineddata and found the following, which I verified on other files.
Header: Always begins with 0A00 0000 FFFF FFFF FFFF FFFF. Maybe it's a version marker? Then, the header is composed of offsets (I count 9 of them).
Note that the last 4 blocks had a lot of similarities. I guess those are the same "kind" of data. I'll try to write a little packer once I have figured out whether each block contains relative or absolute offsets. I'll also have to look at the source code to make sure there is no mistake here.
Hope that helps, Pierre.
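One reading that is consistent with the bytes reported above (an interpretation, not checked against the source code here): the first four bytes, 0A00 0000, are a little-endian entry count of 10, and what follows is a table of ten 8-byte offsets, where FFFF FFFF FFFF FFFF (that is, -1) marks an entry that is absent. That would give the nine real offsets counted above plus one missing entry. A small sketch under that assumption, with a hypothetical function name:

```python
import struct
from typing import List


def read_offset_table(data: bytes) -> List[int]:
    """Parse a leading entry count followed by that many 64-bit offsets.

    Assumption: little-endian layout, a 4-byte signed count, then `count`
    8-byte signed offsets, with -1 marking an entry that is not present.
    """
    (count,) = struct.unpack_from("<i", data, 0)
    return list(struct.unpack_from(f"<{count}q", data, 4))
```

If this reading is right, the offsets would be absolute positions in the file, and the end of each block would be the next larger offset (or the end of the file).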
No need to write a packer. The training files can be combined with combine_tessdata, which is included with the source files.
In order to run tesseractTrainer.py on OS X you first need to install X11, the Xcode tools, and MacPorts. Then use MacPorts to install python, pygtk and gnome themes. For details on the latter see http://www.php-architect.com/blog/2009/02/25/installing-python-pygtk-on-mac-osx/
As best I can tell, the training step above that looks like this:
Should really look like this:
And this line:
Should look like this:
tesseractTrainer.py is out-of-date
This documentation is pretty bad. Both unicharset_extractor and the mftraining commands don't work as shown here.
A couple of notes: this page has been updated to reflect how to train with the SVN version of the code, not the 2.0 series. In addition, it is a little vague about what to do with the mftraining and cntraining output files.
Additionally the output files of mftraining/cntraining need to be renamed to lang.<filename> before trying to do the combine_tessdata. combine_tessdata expects lang.<filename> files when trying to build the trainingdata file.
After running mftraining and cntraining rename the output files inttemp, normproto and pffmtable to lang.<filename> then run "combine_tessdata lang." You should now have a lang.traineddata that you can use tesseract with. After copying lang.traineddata to your tessdata folder and creating a lang.user-words file (empty or not), run tesseract fontfile.tif output -l lang
Hope this helps someone save a few hours of frustration.
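The renaming step described above can be scripted. A small sketch, assuming the three output files named in the comment (inttemp, normproto, pffmtable) sit in one working directory; the function names are hypothetical, and running combine_tessdata afterwards is left to the user.

```python
from pathlib import Path
from typing import List, Tuple

# Output files named in the comment above.
OUTPUT_NAMES = ("inttemp", "normproto", "pffmtable")


def rename_plan(directory: str, lang: str) -> List[Tuple[Path, Path]]:
    """List (source, destination) pairs that add the 'lang.' prefix."""
    base = Path(directory)
    return [(base / name, base / f"{lang}.{name}") for name in OUTPUT_NAMES]


def apply_plan(plan: List[Tuple[Path, Path]]) -> None:
    """Rename each source file that exists; silently skip missing ones."""
    for src, dst in plan:
        if src.exists():
            src.rename(dst)
```

Calling `apply_plan(rename_plan(".", "lang"))` in the training directory would leave lang.inttemp, lang.normproto and lang.pffmtable ready for the "combine_tessdata lang." step.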
The mftraining and cntraining from 2.04 don't work; I keep getting an error 1000 code.
The tesseract.exe generated from the latest source crashes when I try to generate the box file so I can't seem to get it working with either.
I am new to Tesseract. I would like to train Tesseract to recognize old Latin text. I have looked at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 but I am still not sure how to do it. Are there any other tutorials or docs I can read to get started? Thanks.
Is there a tutorial somewhere which actually shows us how to train Tesseract? I have some fairly clean images which I am OCRing with Tesseract and the results are abysmal, so I thought I had better figure out how to train it on the particular font that I am using.
Has anyone used Tesseract to train Devanagari fonts?
I use the tessnet2 application on Windows, but ocr.Init gives an error and stops the application. What can I do? Thanks.
I'm trying to train Tesseract 3.0, but the last step (putting it all together) is unclear to me.
If I have 1 font (which works), the last step is:
..\tesseract nld.font1.exp0.tif output -l eng
But what is the last step if I have several tif files? If I repeat the last step with another tif file, the traineddata file doesn't seem to be affected.
The training procedure description speaks of an image.tif; how can I feed multiple files?
What is the most probable font style that can be used in Tesseract 3.0?
I have tried to convert text in the font styles Times New Roman, Calibri and Verdana, and it produces wrong output. For example, "b" becomes "h".
Thanks!
I am new to Tesseract. I would like to train Tesseract to recognize Khmer Unicode text. I have looked at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 but I am still not sure how to do it. Are there any other tutorials or docs I can read to get started? Thanks.
Hi, is there a way to recognize multiple languages in a single tif? And what about italics and bold?