TrainingTesseract  
How to use the tools provided to train Tesseract for a new language.
Updated Sep 5, 2010 by jore...@gmail.com

For Tesseract 2, see TrainingTesseract2; for Tesseract 3, see TrainingTesseract3.

Comment by piggy.ya...@gmail.com, Aug 6, 2007

I have discovered empirically that all of the characters used in the second field of DangAmbigs need to be listed in unicharset.

Comment by leeann...@gmail.com, Aug 22, 2007

What is the "best" font for text recognition by tesseract? I can choose how I print these out, but I'm finding that some fonts work better than others.

Comment by project member theraysm...@gmail.com, Aug 22, 2007

The best font is the one closest to the text you want to recognize.

Comment by gtrw...@gmail.com, Aug 26, 2007

"The training data currently needs to fit on a single page." Is there a limit on page size, and if so, what is it?

Comment by podrug...@gmail.com, Aug 29, 2007

Anybody training Tesseract to recognize Russian?

Comment by slg1234...@gmail.com, Aug 31, 2007

Is anyone using Tesseract to recognize handwritten text (in particular, numbers)?

Comment by mohamed....@gmail.com, Sep 18, 2007

Hello

Will there be any work to support Arabic (along with Hebrew, Persian, Urdu and the others)?

What about these specialized algorithms, any plans to implement them? I think it is a problem similar to cursive English, so maybe if it can handle cursive English, it could be modified to handle Arabic?

I'm thinking of trying to train it to use Arabic anyway.

Thanks

Comment by begemotv...@gmail.com, Oct 13, 2007

How much does the training process depend on the "realism" of the input? That is, does it have to be a scanned image?

Is it possible to train tesseract just on a computer-rendered image of some page with known symbol positions and bounding boxes (one can produce something like this using e.g. LaTeX + some post-processing of the dvi file)? If this is possible, it would save lots of manual work.

Comment by oldh...@gmail.com, Oct 14, 2007

Is it possible to train tesseract to recognise Chinese? If I only train it on the most frequent ~3000 characters in Chinese, how slow will it be?

Comment by Mountain...@gmail.com, Oct 27, 2007

Is there a way to just limit the characters without completely retraining? For example, I have an application where I only need to scan numbers, so I only need 0-9 and a decimal point.

Thanks.

Comment by abishaik...@gmail.com, Nov 16, 2007

Does anybody have the following files

eng.freq-dawg eng.word-dawg eng.user-words eng.inttemp eng.normproto eng.pffmtable eng.unicharset eng.DangAmbigs
which will recognize capital letters (A to Z) and digits (0 to 9)? If yes, kindly mail them to abishaik786@gmail.com

Comment by jj-j...@hotmail.com, Nov 21, 2007

There is something seriously wrong here...I'm trying to use Tesseract 2.01 on WinXP (first time). When I try to follow the 'Tesseract for Training' procedures, I execute this command: tesseract eng.arial.tif junk nobatch box.train

The log file tells me it's unable to open tessdata/eng.inttemp. The documentation tells me this file is created by running mftraining on the .tr files, but it's the above command that creates the .tr files! I get the same problem when trying to use the provided tif/box files. Help! How do I start??

Also, the Windows .exe package was missing the batch.nochop, makebox, nobatch, box.train files - I had to pull these out of the source files instead.

Comment by kbwi...@gmail.com, Dec 12, 2007

I completed the training on my data set, generated all eight files, and transferred them to the tessdata directory. When I try to run it, I get the following error:

$ tesseract textImg.tif textImgOut.txt -l myLang

Error: Illegal malloc request size!

Fatal error: No error trap defined! Signal_termination_handler called with signal 2001 Signal_exit 30 SIGNAL ABORT. LocCode: 3 SignalCode: 3

Note that if I run without specifying a language, thus using the default settings, tesseract works fine. What am I doing wrong?

Thanks.

Comment by nguyen.v...@gmail.com, Jan 3, 2008

I have the same problem as jj-j...@hotmail.com.

Any help, please?

Regards,

Duy.

Comment by nguyen.v...@gmail.com, Jan 3, 2008

Can anybody tell me how to do the training for English from scratch, step by step?

Regards,

Duy.

Comment by sher...@gmail.com, Jan 31, 2008

kbwiley, you have the stock eng.freq-dawg, which is empty. You must replace it with something else, but right now I don't know how to make it.

Comment by sher...@gmail.com, Jan 31, 2008

Look at the download section. The names are confusing: the program is tesseract-2.01.tar.gz, the data files for English are tesseract-2.00.eng.tar.gz, for Italian tesseract-2.00.ita.tar.gz, and so on. Unpack them in the data dir; it worked for me.

Comment by olliejo...@gmail.com, Feb 2, 2008

Are the source word lists for the eng.dawg files available anywhere? Thanks!

Comment by m4rti...@gmail.com, Feb 4, 2008

Try this script. It can generate a good picture from a box file and a training page image. Example:

$ ./boxes.sh fontfile.box trainingpage.tif result.bmp

http://pastebin.ca/891649

It needs bash, grep and imagemagick. It is good for finding mistakes in fontfile.box, splitting merged letters and so on.

Comment by carlossn...@gmail.com, Feb 27, 2008

Is anybody training Tesseract to recognize Portuguese?

Comment by n.sad...@gmail.com, Mar 30, 2008

Has anybody ever used tesseract to recognize text containing subscripts and superscripts? Text like this: thx

Comment by rpe...@gmail.com, Apr 2, 2008

Is there a straightforward way to tell tesseract that all characters it will encounter are numbers? Is there a command line switch or must I train it on a numbers-only training file?

Comment by wwl...@sina.com, Apr 18, 2008

Is "tesseract-2.00.eng.tar.gz" trained with "boxtiff-2.01.eng.tar.gz"?

Comment by alex28...@gmail.com, Apr 18, 2008

I'm trying to train tesseract for a new language, but it doesn't work. I have Windows XP, and it works up to the step "Run Tesseract for Training". I have the files tdata.tif and tdata.box (revised). If I now start the program with "tesseract tdata.tif junk nobatch box.train", my CPU usage rises to 100% and it never stops. The tesseract.log is empty. What can I do now? I need help.

Comment by zhuheng...@gmail.com, May 2, 2008

I have the same problem as jj-j...@hotmail.com. I'm now compiling tesseract-2.03; maybe it will work.

Comment by spammer...@mail.ru, May 21, 2008

You have to use Linux for this program to be stable.

Comment by tomwin...@gmail.com, May 31, 2008

All commands worked, allowing me to generate the training files for my new "language." When I finished, I tried running tesseract with -l MyLanguage and received:

Error: Illegal malloc request size!

Fatal error: No error trap defined! Signal_termination_handler called with signal 2001 Signal_exit 30 SIGNAL ABORT. LocCode: 3 SignalCode: 3

This was on Mac OS X 10.5.3

Comment by bestwish...@yahoo.com, Jun 9, 2008

please help me with the error T_T

"Error: 48 classes in inttemp while unicharset contains 49 unichars."

If anyone understands the problem, please email me (bestwish2u1025@yahoo.com)

Many thanks.

Best wishes to all

Comment by nitrofur...@gmail.com, Jun 19, 2008

The Portuguese tesseract package I downloaded from the Ubuntu repository comes with just one word. How can we contribute officially to the tesseract dictionaries?

Comment by beigua...@gmail.com, Jun 29, 2008

Is anybody training on digits only? I am doing this, but I need 100% accuracy.

Can tesseract achieve that? Thanks.

Comment by ashrafir...@gmail.com, Aug 8, 2008

Like Mr Mohamed m.k., I am thinking about creating support for the Arabic, Urdu and Persian languages.

I need someone's help.

Mr Mohamed, if you can, please contact me at ashrafirafique@gmail.com

Comment by tstsign...@yahoo.de, Aug 13, 2008

Hi, I have created Russian (rsl.) tessdata files and they work!!! How can I check in my files? BR, Rumen

Comment by horia.cr...@gmail.com, Aug 13, 2008

When I tried to train it for a single letter to test the system, the box file was empty. Other times when I tried to train it for a full charset of 125 symbols (ascii codes + a few diacritical characters) it yielded only 50 lines in the box file (50 from 125). What is happening?

More to the point, can you provide help in training from UTF characters? Like, the one you used in your example, the "ü". Or even better, if I could list the character set and it would immediately optimize for those characters. The whole TIFF - BOX process is cumbersome for 99% of the applications.

Comment by horia.cr...@gmail.com, Aug 14, 2008

Usually, how many words do you put in frequent_words_list and words_list?

Comment by nouri.mo...@gmail.com, Aug 16, 2008

Mr Mohamed, if you can, please contact me at nouri.mohammadreza@gmail.com

Comment by kris240...@gmail.com, Aug 27, 2008

Anyone looking for a list of words to generate their word-dawg files should be able to find some comprehensive word lists here: http://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html

Comment by jkornbl...@gmail.com, Aug 28, 2008

I need OCR to get a machine readable version of a translation of an unknown language. I am going to try to build the lexicon, so at present, I have no dictionary. Is it possible to bootstrap a dictionary during the training process, or, alternatively, is there a way to turn off the top down processing, so that only individual segments are analyzed? Any help would be very welcome. Thank you.

Comment by beigua...@gmail.com, Sep 7, 2008

I have been reading the code for a month, but I still don't understand the principle of the recognition. It seems that it doesn't extract the characters from the words. In that case, what does it rely on to recognize a character? Thank you very much.

Comment by dwayneba...@gmail.com, Sep 11, 2008

You really want to use tesseractTrainer.py for editing box files.

Comment by ben%lidd...@gtempaccount.com, Oct 5, 2008

You really don't want to touch box files with a barge-pole. If you are editing box files you are not taking it seriously.

Generate the TIFF and the Box file together from the given font and text.

I don't know what libraries you are using but on .Net it would be something like:

  1. Create a Bitmap object, which we will later save as a TIFF
  2. Create a Graphics object on the bitmap ( = device context, essentially a thing wot you can use to draw with).
  3. Set the Graphics measurement units to Pixels (to avoid conversions. Other units, inches, points, can be used in theory).
  4. Create a StringFormat object. GenericTypographic will do.
  5. Take each line of text one by one.
  6. Create a CharacterRange for each character in the string, and call SetMeasurableCharacterRanges on the StringFormat.
  7. Call MeasureCharacterRanges to get a set of Region objects which describe the location of each character.
  8. For each character, call Region.GetBounds to get the bounding rectangle of the CharacterRange (each being one character) and output these to the box file.
  9. Call DrawString with the same StringFormat to draw the characters onto the bitmap.
  10. Once that is done with each line, output the Bitmap as a TIFF.
  11. Train Tesseract against the given box file.

This will allow automated training of the engine for any given font and language. Even the silly curly fonts.

I say it will be "something like" this primarily because the rectangles generated by MeasureCharacterRanges and GetBounds may not correspond exactly to the rectangles produced by Tesseract, so you may need a two-pass solution where Tesseract first does its best, then you match your rectangles to Tesseract's rectangles in order to correct the box file.

But either way, there is no good reason to edit a box file by hand.
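
One detail of step 8 is easy to get wrong: drawing APIs usually measure y downwards from the top of the image, while Tesseract box files measure upwards from the bottom-left corner. The sketch below shows only that conversion, in Python rather than .Net. It assumes each glyph rectangle is already known as (x, y, width, height) in top-left pixel coordinates and writes the five-field "char left bottom right top" line; the function name box_line and the sample numbers are hypothetical.

```python
def box_line(char, x, y, width, height, image_height):
    """Convert a top-left-origin glyph rectangle to a box-file line.

    Assumption: box coordinates are measured from the bottom-left
    corner of the image, so the y axis has to be flipped.
    """
    left = x
    right = x + width
    top = image_height - y               # distance of the top edge from the bottom
    bottom = image_height - (y + height)  # distance of the bottom edge from the bottom
    return f"{char} {left} {bottom} {right} {top}"


if __name__ == "__main__":
    # Hypothetical glyph: 30x40 pixels at (10, 20) on a 100-pixel-high page.
    print(box_line("A", 10, 20, 30, 40, 100))
```

Writing one such line per measured character, in reading order, gives a box file that matches the rendered image by construction.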

Comment by la...@lbreyer.com, Nov 13, 2008

If you're on unix/linux, you can also try editing boxfiles with tessboxes. It has some simple logic for cropping characters automatically.

I've uploaded a copy to the files area http://tesseract-ocr.googlegroups.com/web/tessboxes-0.5.tar.gz or you can try to download it from here http://www.lbreyer.com/tessboxes.html

Comment by prakash....@gmail.com, Dec 21, 2008

Can I request that Tesseract be trained for Hindi or other languages, like Marathi and Gujarati, that are written in Devanagari script?

Comment by mour...@gmail.com, Jan 12, 2009

Are there plans to develop the "Source training data" for Portuguese?

Comment by ishwor.c...@gmail.com, Jan 24, 2009

How do I make the freq-dawg, word-dawg and user-words files? I already have a dictionary file that I saved as UTF-8. Please give me some Linux commands to make them.

Comment by jonathan...@gmail.com, Mar 7, 2009

Under the wordlist2dawg section, please add a link to the wordlist2dawg memory fault in the FAQs. It could have saved me some time.

Thanks for this article, anyhow.

Comment by ray.robo...@gmail.com, Mar 24, 2009

First, thank you for your effort. This looks like a promising technology and I can't wait to get it to work.

I can't train properly following the sequence as stated on this page. The command line under "Run Tesseract for Training" (tesseract fontfile.tif junk nobatch box.train) appears to require eng.unicharset and inttemp--which appear to require products of the training session.

I've installed tesseract-2.01.exe.tar.gz and boxtiff-2.01.eng.tar.gz on Windows XP. I left all the box files alone. Configurations appear to be okay and in the right place.

Suggestions would be greatly appreciated.

Comment by ray.robo...@gmail.com, Mar 24, 2009

Apparently there's a circular dependency in creating these files from scratch.

In case anybody else has this problem, the solution is to install the eight core files which are found in tesseract-2.00.eng.tar.gz . Put these in your tessdata folder and then you can begin to train your own batch.

--Ray
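
A quick way to see whether the eight core files are all in place before starting is to check for them by name. This is a minimal Python sketch, not part of Tesseract; it assumes the eight files are the ones listed earlier in this thread (freq-dawg, word-dawg, user-words, inttemp, normproto, pffmtable, unicharset, DangAmbigs), and the function name missing_core_files is hypothetical.

```python
from pathlib import Path

# Assumption: the eight per-language files named earlier in this thread.
CORE_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def missing_core_files(tessdata_dir, lang):
    """Return the names of core files for `lang` that are not in tessdata_dir."""
    root = Path(tessdata_dir)
    return [
        f"{lang}.{suffix}"
        for suffix in CORE_SUFFIXES
        if not (root / f"{lang}.{suffix}").is_file()
    ]


if __name__ == "__main__":
    # Prints the missing file names, or an empty list if all eight are present.
    print(missing_core_files("tessdata", "eng"))
```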

Comment by krazi...@gmail.com, May 4, 2009

People, I've done all the steps and made all 8 files. I'm running the command tesseract image.tif output -l geo (geo for Georgian), and the tesseract.log file tells me that it was unable to load the geo.unicharset file. WHY? It is in the tessdata folder with the other 7 geo. files. Please help.

Comment by krazi...@gmail.com, May 4, 2009

Or maybe you could tell me where in the code the unicharset file is loaded, so I could understand the reason for the error.

Comment by writesr...@gmail.com, Jun 2, 2009

Hello All,

I am trying out tesseract. I downloaded tesseract2.01 (and extracted it to a folder) and also boxtiff-2.01.eng.tar.gz, which I extracted to the eng folder. I am using MS Windows Vista.

Q1. Do I need anything else to use tesseract for English?

Q2. Do I need to train tesseract for English? Since I already have .tif and .box files in the eng folder, I guess it is not necessary. Can anyone confirm, please? If I need to train, can someone give some steps, please?

Q3. Does anyone have answers to an earlier question posted in 2007:

Comment by slg1234590, Aug 31, 2007
Is anyone using Tesseract to recognize handwritten text (in particular, numbers)?
If so, how to.

Thx

Comment by khemsoch...@gmail.com, Jun 23, 2009

I have the same problem as krazilek. I am trying to train fonts for the Khmer language. Please help.

Comment by pablo.c...@gmail.com, Jul 25, 2009

Does anyone know if I can train tesseract to recognize whole symbols? I mean, reading a box containing more than one character? Like:

Handwritten "thousand" -> boxfile: 1000 x0 y0 x1 y1

or something like that?

Comment by frenzy...@gmail.com, Jul 31, 2009

Alright, here's my rant:

  • You really should have mentioned the DangAmbigs file before instructing us to save our changes to the box files because I can't remember what all of these misinterpreted pairs of letters were misinterpreted as.
  • The unicode instructions were a little unclear, too. (I interpreted them as telling me to change the four sets of numbers.)
  • Also, why do you mention the Python Tesseract Box Editor after these instructions? It would be better to mention them at the same time before going into detail, so we can get onto the easy tool immediately.
  • You also could have mentioned that the dawg files would take five hours and that the process would render my computer mostly unusable as it took up most of my memory (or at least with an old computer like mine)

But tesseract itself has shown improvement (or "learning"). Thanks for your work!

Comment by microfl...@gmail.com, Aug 6, 2009

Hi, is there a keyword to set a size limit for the box? The engine tries to pick up very small features, e.g. a speck of dust, and assign them a character. Since the language I am trying to train the engine on has a clear character size, it would be wonderful to let the engine just ignore features that are too small. Thanks!!

Comment by wademcdo...@gmail.com, Aug 6, 2009

G'day all - I am trying to apply a template to the extracted image text. I currently slice an image up into smaller segments, and then perform the OCR on each one. The trouble is that the alignment of the template doesn't always match the image due to manual scanning errors. Is it possible to extract the x/y coordinates of a text conversion from an image? Alternatively, is there a different way to perform this type of registration process other than hunting and slicing throughout a document?

Comment by krazi...@gmail.com, Aug 24, 2009

Hi all. For people who are training tesseract on new languages under Windows: I've written a useful program, which is like bbTesseract but with more functionality and more stability. Instructions and the program will be here soon.

For more information, write to me at krazilek@gmail.com

Comment by jacobusp...@gmail.com, Sep 15, 2009

Anybody training tesseract to read Afrikaans? Contact me at jacobuspdebeer@gmail, to team up with all this coding. Regards, jaco

Comment by bernard....@gmail.com, Oct 17, 2009

My ABBYY fine reader has no problem with Afrikaans. Is it necessary to develop a new program?

Has anyone tried using tesseract to read Saba script - Amharic, Tigrinya, Ge'ez? The characters are all stand alone so it would not be as difficult as Arabic.

Comment by errakhao...@gmail.com, Dec 3, 2009

Hi, cheers,

I succeeded in embedding tesseract into an embedded Linux on an ARM platform.

Now I'm looking for a library which is able to recognize OCR-A and OCR-B fonts.

I tried to do it with the French and English libraries, but the result is not very good...

Comment by nishad...@gmail.com, Dec 22, 2009

tesseractTrainer.py is released as a standalone utility for Windows under the lime-ocr project. This requires no Python installation and can run out of the box. You can download it from http://lime-ocr.googlecode.com/files/Tesseract-Trainer-1.5.0.1.exe

Comment by errakhao...@gmail.com, Dec 28, 2009

Thanks Mr nishad! I tried using the Linux tools to train tesseract and they work well. I obtained very good results^^

Comment by Feyzolla...@gmail.com, Dec 30, 2009

Like Mr Mohamed m.k. and ashrafirafique, I am thinking about creating support for the Arabic, Urdu and Persian languages.

Please help me.

Contact me at feyzollahi.a@gmail.com

Comment by mishung....@gmail.com, Jan 1, 2010

I'm Chinese. Can it automatically convert training text to an image? And automatically train from an existing image and training text?

Comment by durgana...@gmail.com, Jan 16, 2010

Will Tesseract support Tamil fonts?

Comment by jeremy.r.brown, Feb 16, 2010

I was having trouble getting a .tiff image with only 8bits or less per pixel. I was getting error messages like:

check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:16
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:32

In using Gimp 2.6.3, I found the following would work to create a .tiff image with the right bits per pixel.

  1. Create your image.
  2. Choose the menu option Image->Mode->Indexed...
  3. In the dialog that appears:
     a. choose Generate Optimum palette
     b. set the Maximum number of colors to 256
     c. click Convert
  4. Now save your image (choose a .tif/.tiff extension).
  5. In the Save as TIFF dialog, choose None for compression.

The resulting .tiff image can be read by Tesseract.
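
To check an image before handing it to Tesseract, the bits per pixel can be read straight from the TIFF header. The following is a rough Python sketch using only the standard library. It assumes a classic (non-BigTIFF) file, looks only at the first image directory, and takes "bits per pixel" to be the sum of the BitsPerSample values (tag 258); whether Tesseract computes the number in exactly this way is an assumption. The function name tiff_bits_per_pixel is hypothetical.

```python
import struct

# Values taken from the error message quoted above.
SUPPORTED_BPP = {1, 2, 4, 5, 6, 8}


def tiff_bits_per_pixel(data: bytes) -> int:
    """Sum the BitsPerSample values of the first image in a classic TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF file")
    (ifd,) = struct.unpack(order + "I", data[4:8])
    (entries,) = struct.unpack(order + "H", data[ifd:ifd + 2])
    bits = [1]  # TIFF default when the BitsPerSample tag is absent
    for i in range(entries):
        entry = data[ifd + 2 + 12 * i: ifd + 14 + 12 * i]
        tag, field_type, count = struct.unpack(order + "HHI", entry[:8])
        if tag == 258 and field_type == 3:  # BitsPerSample, stored as SHORTs
            if count <= 2:
                raw = entry[8:8 + 2 * count]  # values fit inside the entry
            else:
                (offset,) = struct.unpack(order + "I", entry[8:12])
                raw = data[offset:offset + 2 * count]
            bits = list(struct.unpack(f"{order}{count}H", raw))
    return sum(bits)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        bpp = tiff_bits_per_pixel(f.read())
    print(bpp, "ok" if bpp in SUPPORTED_BPP else "not in the supported list")
```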

Comment by jeremy.r.brown, Feb 16, 2010

In order to get Tesseract-Trainer (available from http://lime-ocr.googlecode.com/files/Tesseract-Trainer-1.5.0.1.exe ) to be able to open a tiff file, I had to download and install libtiff from http://gnuwin32.sourceforge.net/packages/tiff.htm.

I then had to add the path to libtiff3.dll to my Path environment variable. For me (on Vista 64bit), that path was: C:\Program Files (x86)\GnuWin32\bin

To set the environment variable on Windows Vista:

  1. Right click on My Computer on the desktop and choose the Properties item.
  2. Click on Advanced System Settings.
  3. Click the Environment Variables... button.
  4. In the System variables area, scroll down till you find the "Path" entry and click on it.
  5. Click Edit...
  6. Add ;C:\Program Files (x86)\GnuWin32\bin (semicolon plus whatever the path is to the directory that holds your libtiff3.dll).
  7. Click OK all the way out.

Comment by 194145, Mar 3, 2010

Hi, guys. I wrote a quick and easy box creator for Windows. I hope you like it. http://code.google.com/p/owlboxer/

Comment by sux...@gmail.com, Mar 20, 2010

Wow, http://code.google.com/p/owlboxer/ is the best boxfile editor!!!

Comment by sux...@gmail.com, Mar 20, 2010

Hi people, I also have the problem described by "jeremy.r.brown": when converting a png file to tiff and trying to read it, tesseract says:

check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:16
Segmentation fault

How can I fix this? Please help

Comment by sux...@gmail.com, Mar 20, 2010

Oh, I tried the convert command with this option:

convert -type Truecolor .. ..

and it works.

thanks anyway to all for this project first of all!!!

Comment by rsallar, Mar 24, 2010

What about Tesseract 3???? Please update this tutorial.

Comment by hicksc...@gmail.com, Apr 4, 2010

Hello all,

@rsallar: I just had a look at the .traineddata files, and it looks to me like they are just containers for all the previous configuration files. I'll try to investigate a bit and find out how they are ordered inside, and whether there is a descriptor of any kind with offsets.

Pierre.

Comment by hicksc...@gmail.com, Apr 4, 2010

Hello again,

For people interested in the new undocumented training data format, i've just tried to understand how it works. i used the eng.traineddata, and found the following, which is verified on other formats.

Header: Always begins with 0A00 0000 FFFF FFFF FFFF FFFF. Maybe it's a version marker? Then, the header is composed of offsets (i count 9 of them).

Header addr   Points to...   Remark
@0x000c       Unicharset     In all training data, was 0x0054 since the Unicharset was always the first element after header.
@0x0014       Dang Ambigs    None
@0x001c       Int Temp       None
@0x0024       PFFM Table     None
@0x002c       Norm. Proto    None
@0x0034       Unknown        See below.
@0x003c       Unknown        See below.
@0x0044       Unknown        See below.
@0x004c       Unknown        See below.

Note that the last 4 blocks had a lot of similarities. I guess those are the same "kind" of data. I'll try to write a little packer once I have figured out whether each block contains relative or absolute offsets. Also, I'll have to look at the source code to make sure there is no mistake here.

Hope that helps, Pierre.
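
The observations above can be read as a small parser. The Python sketch below rests on an interpretation that the comment does not confirm: the first four bytes are a little-endian entry count (0x0A = 10), followed by that many 8-byte little-endian signed offsets, where -1 (FFFF FFFF FFFF FFFF) marks an absent entry. Under that reading the table ends at 4 + 10 * 8 = 0x54, which agrees with the Unicharset offset reported above. The function name read_offset_table and the sample offsets after 0x54 are made up for illustration.

```python
import struct


def read_offset_table(data: bytes) -> list:
    """Return the offsets stored at the start of a .traineddata blob.

    Assumed layout: int32 count, then `count` int64 offsets, little-endian.
    """
    (count,) = struct.unpack_from("<i", data, 0)
    return list(struct.unpack_from(f"<{count}q", data, 4))


# Synthetic header: first entry absent (-1), Unicharset at 0x54,
# followed by eight invented offsets for the remaining entries.
sample_header = struct.pack("<i", 10) + struct.pack(
    "<10q", -1, 0x54, *range(0x100, 0x900, 0x100)
)

if __name__ == "__main__":
    for index, offset in enumerate(read_offset_table(sample_header)):
        print(index, "absent" if offset < 0 else hex(offset))
```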

Comment by johan.wi...@gmail.com, Apr 6, 2010

No need to write a packer. The training files can be combined with combine_tessdata, which is included with the source files.

Comment by karl.wet...@gmail.com, Apr 22, 2010

In order to run tesseractTrainer.py on OS X you first need to install X11 and xCode tools and MacPorts. Then use MacPorts to install python, pygtk and gnome themes. For details on the latter see http://www.php-architect.com/blog/2009/02/25/installing-python-pygtk-on-mac-osx/

Comment by the.ange...@gmail.com, Jul 13, 2010

As best I can tell, the training step above that looks like this:

tesseract fontfile.tif junk nobatch box.train.stderr

Should really look like this:

tesseract fontfile.tif fontfile nobatch box.train.stderr

And this line:

wordlist2dawg words_list word-dawg

Should look like this:

wordlist2dawg words_list word-dawg unicharset

Comment by huangdon...@gmail.com, Jul 19, 2010

tesseractTrainer.py is out-of-date

Comment by dro...@gmail.com, Jul 23, 2010

This documentation is pretty bad. Both unicharset_extractor and the mftraining commands don't work as shown here.

Comment by mwhah...@gmail.com, Sep 2, 2010

A couple of notes: this page has been updated to reflect how to train with the svn version of the code, not the 2.0 series. In addition, it is a little vague about what to do with the mftraining and cntraining output files.

Additionally, the output files of mftraining/cntraining need to be renamed to lang.<filename> before running combine_tessdata. combine_tessdata expects lang.<filename> files when building the traineddata file.

After running mftraining and cntraining, rename the output files inttemp, normproto and pffmtable to lang.<filename>, then run "combine_tessdata lang." You should now have a lang.traineddata that you can use tesseract with. After copying lang.traineddata to your tessdata folder and creating a lang.user-words file (empty or not), run tesseract fontfile.tif output -l lang

Hope this helps someone save a few hours of frustration.
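
The renaming step is easy to script. Here is a minimal Python sketch, assuming the three output files sit in one working directory under exactly the names given above; the function name prefix_training_outputs is hypothetical, and combine_tessdata itself is not run here.

```python
from pathlib import Path


def prefix_training_outputs(workdir, lang,
                            names=("inttemp", "normproto", "pffmtable")):
    """Rename each training output file to lang.<filename>.

    Files that are not present are skipped; the new names are returned.
    """
    renamed = []
    for name in names:
        src = Path(workdir) / name
        if src.is_file():
            dst = src.with_name(f"{lang}.{name}")
            src.rename(dst)
            renamed.append(dst.name)
    return renamed


if __name__ == "__main__":
    # After this, run "combine_tessdata lang." in the same directory,
    # as described in the comment above.
    print(prefix_training_outputs(".", "lang"))
```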

Comment by philing...@typhoonsoftware.com, Sep 9, 2010

The mftraining and cntraining from 2.04 don't work; I keep getting an error 1000 code.

The tesseract.exe generated from the latest source crashes when I try to generate the box file so I can't seem to get it working with either.

Comment by eibcandl...@gmail.com, Dec 20, 2010

I am new to Tesseract. I would like to train Tesseract to recognize old Latin text. I have looked at this http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 but I am still not sure how to do it. Are there any other tutorials or docs I can read to get started? Thanks.

Comment by andrewhu...@gmail.com, Jan 18, 2011

Is there a tutorial somewhere which actually shows us how to train tesseract? I have some fairly clean images which I am OCRing with tesseract, and the results are abysmal, so I thought that I had better figure out how to train it on the particular font that I am using.

Comment by paliwal....@gmail.com, May 25, 2011

Has anyone used tesseract to train Devanagari fonts?

Comment by buseli_t...@hotmail.com, Jun 28, 2011

I use the tessnet2 application on Windows, but ocr.Init gives an error and stops the application. What can I do? Thanks.

Comment by er...@arlanet.com, Jul 2, 2011

I am trying to train Tesseract 3.0, but the last step (putting it all together) is unclear to me.

If I have 1 font (works), the last step is:

..\tesseract nld.font1.exp0.tif output -l eng

But what is the last step if I have several tif files? If I repeat the last step with another tif file, the traineddata file doesn't seem to be affected.

The training procedure description speaks of an image.tif; how can I feed multiple files?

Comment by reinjavi...@gmail.com, Aug 31, 2011

Which font style is most likely to work with Tesseract 3.0?

I have tried to convert the font styles Times New Roman, Calibri and Verdana, and it produces wrong output. For example, "b" becomes "h".

Thanks!

Comment by kul.bu...@gmail.com, Dec 13, 2011

I am new to Tesseract. I would like to train Tesseract to recognize Khmer unicode text. I have looked at this http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 but I am still not sure how to do it. Are there any other tutorials or docs I can read to get started? Thanks.

Comment by prema...@gmail.com, Dec 17, 2011

Hi, is there a way to recognize multiple languages in a single tif? And what about italics and bold?

