TrainingTesseract
How to use the tools provided to train Tesseract for a new language.
Introduction
Tesseract 2.0 is fully trainable. This page describes the training process, gives some guidelines on its applicability to various languages, and explains what to expect from the results.

Background and Limitations
Tesseract was originally designed to recognize English text only. The engine and its training system have since been modified to handle other languages and UTF-8 characters. Tesseract 2.0 can handle any Unicode character (coded in UTF-8), but there are limits to the range of languages it will handle successfully, so please take this section into account before building up your hopes that it will work well on your particular language!

Tesseract can only handle left-to-right languages. While you can get something out with a right-to-left language, the output file will be ordered as if the text were left-to-right. Top-to-bottom languages will currently be hopeless.

Tesseract is unlikely to be able to handle connected scripts like Arabic. It will take some specialized algorithms to handle this case, and right now it doesn't have them.

Tesseract is likely to be so slow with large character set languages (like Chinese) that it is probably not going to be useful. Some code changes are also still needed to accommodate languages with more than 256 characters.

Any language that has different punctuation and numbers is going to be disadvantaged by some of the hard-coded algorithms that assume ASCII punctuation and digits.

Data files required
To train for another language, you have to create 8 data files in the tessdata subdirectory. The naming convention is languagecode.file_name. Language codes follow the ISO 639-3 standard. The 8 files used for English are:
    tessdata/eng.freq-dawg
    tessdata/eng.word-dawg
    tessdata/eng.user-words
    tessdata/eng.inttemp
    tessdata/eng.normproto
    tessdata/eng.pffmtable
    tessdata/eng.unicharset
    tessdata/eng.DangAmbigs
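The naming convention can be checked mechanically. The following is a minimal sketch, not part of the Tesseract tools: the function names are hypothetical, and the eight suffixes are the file names mentioned on this page.

```python
from pathlib import Path

# Suffixes of the 8 data files named on this page; each is prefixed
# with the language code and a dot, e.g. "eng.inttemp".
DATA_FILE_SUFFIXES = (
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
)


def expected_data_files(lang):
    """Return the 8 file names expected for a language code such as 'eng'."""
    return [f"{lang}.{suffix}" for suffix in DATA_FILE_SUFFIXES]


def missing_data_files(tessdata_dir, lang):
    """Return the expected file names that are not present in tessdata_dir."""
    base = Path(tessdata_dir)
    return [name for name in expected_data_files(lang)
            if not (base / name).is_file()]
```

Calling missing_data_files("tessdata", "eng") before running Tesseract gives a quick list of anything still to be generated.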
How little can you get away with?
You must create inttemp, normproto, pffmtable, freq-dawg, word-dawg and unicharset using the procedure described below. If you are only trying to recognize a limited range of fonts (a single font, for instance), then a single training page might be enough. DangAmbigs and user-words may be empty files. The dictionary files freq-dawg and word-dawg don't need many words if you don't have a wordlist to hand, but accuracy will be lower than with a decent-sized dictionary (tens of thousands of words for English, say). For 2.04 and below at least, however, empty dawg files and dawgs with no words are NOT allowed.

Training Procedure
Some of the procedure is inevitably manual. As much automated help as possible is provided. More automated tools may appear in the future. The tools referenced below are all built in the training subdirectory.

Generate Training Images
The first step is to determine the full character set to be used, and prepare a text or word processor file containing a set of examples. The most important points to bear in mind when creating a training file are:
Next print and scan (or use some electronic rendering method) to create an image of your training page. Up to 32 training images can be used. It is best to create pages in a mix of fonts and styles, including italic and bold.

NOTE: training from real images is actually quite hard, due to the spacing-out requirements. This will be improved in a future release. For now it is much easier if you can print/scan your own training text.

You will also need to save your training text as a UTF-8 text file for use in the next step, where you have to insert the codes into another file.

Clarification for large amounts of training data
The 32 images limit is for the number of FONTS. Each font may be put in a single multi-page tiff (only if you are using libtiff!) and the box file can be modified to specify the page number for each character after the coordinates. Thus an arbitrarily large amount of training data may be created for any given font, allowing training for large character-set languages. An alternative to multi-page tiffs is to create many single-page tiffs for a single font, and then cat together the tr files for each font into several single-font tr files. In any case, each input tr file given to mftraining must contain a single font, and the order in which the files are given to mftraining must match the order in which they are given to unicharset_extractor.

Make Box Files
For the next step below, Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around each character. Tesseract 2.0 has a mode in which it will output a text file of the required format, but if the character set is different from its current training, it will naturally get the text wrong. So the key process here is to manually edit the file to put the correct characters in it.
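For scripted checks it can help to load a box file into a simple structure. The sketch below is an illustration only: the Box type and function names are hypothetical, and it assumes the field order character, left, bottom, right, top, which is consistent with the example lines on this page. Any extra field after the coordinates (such as a page number) is ignored.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str    # the character(s) this box describes
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one box-file line such as 's 734 491 751 516'."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected a character and 4 integers: {line!r}")
    left, bottom, right, top = (int(value) for value in fields[1:5])
    return Box(fields[0], left, bottom, right, top)


def read_box_file(path):
    """Read a UTF-8 box file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]
```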
Run Tesseract on each of your training images using this command line:

    tesseract fontfile.tif fontfile batch.nochop makebox

You then have to rename fontfile.txt to fontfile.box. Now the hard part. You have to edit the file fontfile.box and put the UTF-8 codes for each character in the file at the start of each line, in place of the incorrect character put there by Tesseract.

Example: The distribution includes an image eurotext.tif. Running the above command produces a text file that includes the following lines (lines 142-155):

    s 734 491 751 516
    p 753 483 776 515
    r 779 492 796 516
    i 799 492 810 525
    n 814 492 837 516
    g 839 483 862 516
    t 865 491 878 520
    u 101 452 122 483
    b 126 453 146 486
    e 149 452 168 477
    r 172 453 187 476
    d 211 450 232 483
    e 236 450 255 474
    n 259 451 281 474

Since Tesseract was run in English mode, it does not correctly recognize the umlaut. This character needs to be corrected using a suitable editor. An editor that understands UTF-8 should be used for this purpose. HTML editors are usually a good choice. (Mozilla on Linux allows you to edit UTF-8 text files directly from the browser. Firefox and IE do not let you do this. MS Word is very good at handling different text encodings, and Notepad++ is another editor that understands UTF-8.) Linux and Windows both have a character map that can be used for copying characters that cannot be typed. In this case the u needs to be changed to ü.

In theory, each line in the box file should represent one of the characters from your training file, but if you have a horizontally broken character, such as the lower double quote „ it will probably have 2 boxes that need to be merged!
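Example: lines 117-130:

    D 101 503 131 534
    e 135 501 154 527
    r 158 503 173 526
    , 197 496 205 507
    , 206 496 214 508
    s 220 499 236 524
    c 239 499 258 523
    h 262 500 284 532
    n 288 500 310 524
    e 313 499 332 523
    l 336 500 347 533
    l 352 500 363 532
    e 367 499 386 524
    " 389 520 407 532

As you can see, the low double quote character has been expressed as two single commas. The bounding boxes must be merged as follows:
    First number (left): take the minimum of the two lines (197)
    Second number (bottom): take the minimum of the two lines (496)
    Third number (right): take the maximum of the two lines (214)
    Fourth number (top): take the maximum of the two lines (508)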
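The merge can also be done in a few lines of code. This is a minimal sketch with a hypothetical function name, assuming the merged box is the smallest rectangle that contains both original boxes (minimum of the left and bottom values, maximum of the right and top values).

```python
def merge_boxes(first, second, text):
    """Merge two (left, bottom, right, top) boxes into one box-file line.

    The result is the smallest box containing both inputs, labelled
    with the character(s) given in `text`.
    """
    left = min(first[0], second[0])
    bottom = min(first[1], second[1])
    right = max(first[2], second[2])
    top = max(first[3], second[3])
    return f"{text} {left} {bottom} {right} {top}"


# The two comma boxes from the example, relabelled as a low double quote.
merged = merge_boxes((197, 496, 205, 507), (206, 496, 214, 508), "„")
```

For the two comma boxes above, merged holds the single line „ 197 496 214 508.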
This gives:

    D 101 503 131 534
    e 135 501 154 527
    r 158 503 173 526
    „ 197 496 214 508
    s 220 499 236 524
    c 239 499 258 523
    h 262 500 284 532
    n 288 500 310 524
    e 313 499 332 523
    l 336 500 347 533
    l 352 500 363 532
    e 367 499 386 524
    " 389 520 407 532

If you didn't successfully space out the characters on the training image, some may have been joined into a single box. In this case, you can either remake the images with better spacing and start again, or if the pair is common, put both characters at the start of the line, leaving the bounding box to represent them both. (As of 2.04, there is a limit of 24 bytes for the description of a "character". This will allow you between 6 and 24 Unicode characters to describe the character, depending on where your codes sit in the Unicode set. If anyone hits this limit, please file an issue describing your situation.)

Note that the coordinate system used in the box file has (0,0) at the bottom-left.

If you have an editor that understands UTF-8, this process will be a lot easier than if it doesn't, as each UTF-8 character takes up to 4 bytes to code, and dumb editors will show you all the bytes separately.

There is a Visual Basic tool (Windows only) that you can use to make box file creation much easier. See http://groups.google.com/group/tesseract-ocr/files and look for bbtesseract. You can also check out this thread: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2321deb561450e76/554c7a8cec11c073#554c7a8cec11c073 in the forum for more information. Thanks to unkowner for contributing this.
There is also a .NET boxer here: http://www.ospilka.com/dl/tessboxer.zip

If you have PHP, there is a PHP box viewer and editor on the downloads page: http://tesseract-ocr.googlecode.com/files/boxfilereader.php

For Linux and other systems that have Python and the needed modules, one can use the excellent GUI tool tesseractTrainer.py, contributed in November 2007 by Catalin Francu. It is available in the files section of the tesseract-ocr Google group: http://groups.google.com/group/tesseract-ocr/files and also on the downloads page: http://code.google.com/p/tesseract-ocr/downloads/list
Bootstrapping a new character set
If you are trying to train a new character set, it is a good idea to put in the effort on a single font to get one good box file, run the rest of the training process, and then use Tesseract in your new language to make the rest of the box files as follows:

    tesseract fontfile.tif fontfile -l yournewlanguage batch.nochop makebox

This should make the 2nd box file easier to make, as there is a good chance that Tesseract will recognize most of the text correctly. You can always iterate this sequence, adding more fonts to the training set (i.e. to the command line of mftraining and cntraining below) as you make them, but note that there is no incremental training mode that allows you to add new training data to existing sets. This means that each time you run mftraining and cntraining you are making new data files from scratch from the tr files you give on the command line, and these programs cannot take an existing inttemp/pffmtable/normproto and add to them directly.

New! Tif/Box pairs provided!
The Tif/Box file pairs are on the downloads page. (Note that the tiff files are G4 compressed to save space, so you will have to have libtiff or uncompress them first.) You could use the following process to make better training data for your own language or subset of an existing language:
Run Tesseract for Training
For each of your training image/box file pairs, run Tesseract in training mode:

    tesseract fontfile.tif junk nobatch box.train

OR

    tesseract fontfile.tif junk nobatch box.train.stderr

The first form sends all the errors to tesseract.log (on all platforms), as Windows versions 2.03 and below did. With box.train.stderr, all errors are sent to stderr on all platforms, as non-Windows versions 2.03 and below did. Note that the box filename must match the tif filename, including the path, or Tesseract won't find it. The output of this step is fontfile.tr, which contains the features of each character of the training page. Note also that the output name is derived from the input image name, not the normal output name, shown here as junk. junk.txt will also be written with a single newline and no text.

Important: check for errors in the output from apply_box. If there are FATALITIES reported, then there is no point continuing with the training process until you fix the box file. The new box.train.stderr config file makes it easier to choose the location of the output. A FATALITY usually indicates that this step failed to find any training samples of one of the characters listed in your box file. Either the coordinates are wrong, or there is something wrong with the image of the character concerned. If there is no workable sample of a character, it can't be recognized, and the generated inttemp file won't match the unicharset file later and Tesseract will abort.

Another error that is also fatal and needs attention is "Box file format error on line n". If it is preceded by "Bad utf-8 char..." then the UTF-8 codes are incorrect and need to be fixed. The error "utf-8 string too long..." indicates that you have exceeded the 8 (v2.01) byte limit on a character description. If you need a description longer than 8 bytes, please file an issue.
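Some of these problems can be caught before running Tesseract. The sketch below is a hypothetical pre-check, not part of the training tools: it flags character descriptions longer than a byte limit (8 by default, matching the v2.01 figure above; pass 24 for 2.04) and lines whose four coordinates are not integers. Blank lines are skipped.

```python
def check_box_text(box_text, max_bytes=8):
    """Return (line_number, message) pairs for likely box-file problems."""
    problems = []
    for number, line in enumerate(box_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # blank line: nothing to check
        # Length of the character description once encoded as UTF-8.
        size = len(fields[0].encode("utf-8"))
        if size > max_bytes:
            problems.append(
                (number, f"description is {size} bytes, limit is {max_bytes}"))
        try:
            coordinates = [int(value) for value in fields[1:5]]
        except ValueError:
            coordinates = []
        if len(coordinates) != 4:
            problems.append((number, "expected 4 integer coordinates"))
    return problems
```

It takes the text of the box file, for example the result of open("fontfile.box", encoding="utf-8").read(), and returns an empty list when nothing suspicious is found.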
Box file format errors without either of the above errors indicate either something wrong with the bounding box integers, or possibly a blank line in the box file. Blank lines are actually harmless, and the error can be ignored in this case. The code could ignore them, but it doesn't, in case they indicate something unintentionally wrong with the box file.

There is no need to edit the content of the fontfile.tr file. The font name inside it need not be set. For the curious, here is some information on the format. Every character in the box file has a corresponding set of entries in the .tr file (in order) like this:

    UnknownFont <utf8 code(s)> 2
    mf <number of features>
    x y length dir 0 0
    ... (there are a set of these determined by <number of features> above)
    cn 1
    ypos length x2ndmoment y2ndmoment

The mf features are polygon segments of the outline normalized to the 1st and 2nd moments.
    x = x position [-0.5, 0.5]
    y = y position [-0.25, 0.75]
    length is the length of the polygon segment [0, 1.0]
    dir is the direction of the segment [0, 1.0]
The cn feature is to correct for the moment normalization to distinguish position and size (e.g. c vs C and , vs ').

Clustering
When the character features of all the training pages have been extracted, we need to cluster them to create the prototypes. The character shape features can be clustered using the mftraining and cntraining programs:

    mftraining fontfile_1.tr fontfile_2.tr ...

This will output two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character). (A third file called Microfeat is also written by this program, but it is not used.)

    cntraining fontfile_1.tr fontfile_2.tr ...

This will output the normproto data file (the character normalization sensitivity prototypes).

Compute the Character Set
Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor program on the same training page box files as used for clustering:

    unicharset_extractor fontfile_1.box fontfile_2.box ...

Tesseract needs to have access to the character properties isalpha, isdigit, isupper, islower. This data must be encoded in the unicharset data file. Each line of this file corresponds to one character. The character in UTF-8 is followed by a hexadecimal number representing a binary mask that encodes the properties. Each bit corresponds to a property. If the bit is set to 1, it means that the property is true. The bit ordering is (from least significant bit to most significant bit): isalpha, islower, isupper, isdigit. Example:
    ; 0
    b 3
    W 5
    7 8

If your system supports the wctype functions, these values will be set automatically by unicharset_extractor and there is no need to edit the unicharset file. On some older systems (e.g. Windows 95), the unicharset file must be edited by hand to add these property description codes.

NOTE: The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are generated (i.e. they must all be recreated when the box file is changed) as they have to be in sync. The lines in unicharset must be in the correct order, as inttemp stores an index into unicharset and the actual characters returned by the classifier come from unicharset at the given index.

Dictionary Data
Tesseract uses 3 dictionary files for each language. Two of the files are coded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file. To make the DAWG dictionary files, you first need a wordlist for your language. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into two sets: the frequent words, and the rest of the words, and then use wordlist2dawg to make the DAWG files:

    wordlist2dawg frequent_words_list freq-dawg
    wordlist2dawg words_list word-dawg

NOTE: wordlists must contain at least one word! Empty files and dictionaries with no words are not currently supported! (Surely you know at least one word to be recognized.)

If words always have some punctuation in them, like google.com, then it is a good idea to include them in the dictionary.

The third dictionary file is called user-words and is usually empty.

The last file
The final data file that Tesseract uses is called DangAmbigs. It represents the intrinsic ambiguity between characters or sets of characters, and is currently entirely manually generated. To understand the file format, look at the following example:

    1    m    2    r n
    3    i i i    1    m

The first field is the number of characters in the second field. The 3rd field is the number of characters in the 4th field.
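The four-field layout can be read with a short parser. This is a sketch only, with a hypothetical function name; it assumes the fields and the characters within a field are separated by whitespace, and uses the two counts to decide where each character list ends.

```python
def parse_ambig_line(line):
    """Split a DangAmbigs line into its two character lists.

    Layout: <count> <characters...> <count> <characters...>
    """
    tokens = line.split()
    first_count = int(tokens[0])
    first = tokens[1:1 + first_count]
    second_count = int(tokens[1 + first_count])
    second_start = 2 + first_count
    second = tokens[second_start:second_start + second_count]
    # Reject lines whose counts do not match the characters present.
    if len(first) != first_count or len(second) != second_count:
        raise ValueError(f"counts do not match characters: {line!r}")
    if len(tokens) != second_start + second_count:
        raise ValueError(f"unexpected extra fields: {line!r}")
    return first, second
```

For the two example lines this yields (['m'], ['r', 'n']) and (['i', 'i', 'i'], ['m']).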
As with the other files, this is a UTF-8 format file, and therefore each character may be represented by multiple bytes. The first line shows that the pair 'rn' may sometimes be recognized incorrectly as 'm'. The second line shows that the character 'm' may sometimes be recognized incorrectly as the sequence 'iii'. Note that the characters on both sides should occur in unicharset. This file cannot be used to translate characters from one set to another. The DangAmbigs file may also be empty.

Putting it all together
That is all there is to it! All you need to do now is collect together all 8 files and rename them with a lang. prefix, where lang is the 3-letter code for your language taken from http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes and put them in your tessdata directory. Tesseract can then recognize text in your language (in theory) with the following:

    tesseract image.tif output -l lang

(Actually, you can use any string you like for the language code, but if you want anybody else to be able to use it easily, ISO 639 is the way to go.)
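The collect-and-rename step can be scripted. The following is a minimal sketch rather than an official tool: the function name is hypothetical, and it assumes all 8 unprefixed files sit together in one working directory.

```python
import shutil
from pathlib import Path

# The 8 data files described on this page, without a language prefix.
DATA_FILES = (
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
)


def install_language(build_dir, tessdata_dir, lang):
    """Copy each data file from build_dir to tessdata_dir as <lang>.<name>."""
    source = Path(build_dir)
    target = Path(tessdata_dir)
    target.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in DATA_FILES:
        destination = target / f"{lang}.{name}"
        # copyfile raises FileNotFoundError if a data file is missing.
        shutil.copyfile(source / name, destination)
        installed.append(destination.name)
    return installed
```

The files are copied rather than moved, so the working directory stays intact if training has to be repeated.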
Comments
I have discovered empirically that all of the characters used in the second field of DangAmbigs need to be listed in unicharset.
What is the "best" font for text recognition by tesseract? I can choose how I print these out, but I'm finding that some fonts work better than others.
The best font is the one closest to the text you want to recognize.
"The training data currently needs to fit on a single page." Is there a limit on page size, and if so, what is it?
Anybody training Tesseract to recognize Russian?
Is anyone using Tesseract to recognize handwritten text (in particular, numbers)?
Hello
Will there be any work to support Arabic (along with Hebrew, Persian, Urdu and the others)?
What about these specialized algorithms, are there any plans to implement them? I think it is a problem similar to cursive English, so maybe if it can handle cursive English, it could be modified to handle Arabic?
I'm thinking of trying to train it to use Arabic anyway.
Thanks
How much does the training process depend on the "realism" of the input? I.e., does it have to be a scanned image?
Is it possible to train tesseract just on the computer rendered image of some page with known position of symbols and their bounding boxes (one can produce something like this using e.g. LaTeX + some post-processing of dvi file)? If this is possible it would save lots of manual work.
is it possible to train tesseract to recognise Chinese? if I only train it to the most frequent ~3000 characters in Chinese, how slow will it be?
Is there a way to just limit the characters without completely retraining. i.e. I have an application where I need to only scan numbers so I only need 0-9 and a decimal point.
Thanks.
Does any body have the following files
There is something seriously wrong here...I'm trying to use Tesseract 2.01 on WinXP (first time). When I try to follow the 'Tesseract for Training' procedures, I execute this command: tesseract eng.arial.tif junk nobatch box.train
The log file tells me it's unable to open tessdata/eng.inttemp. The documentation tells me this file is created by running mftraining on the .tr files, but it's the above command that creates the .tr files! I get the same problem when trying to use the provided tif/box files. Help! How do I start??
Also, the Windows .exe package was missing the batch.nochop, makebox, nobatch, box.train files - I had to pull these out of the source files instead.
I completed the training on my data set, generated all eight files, and transferred them to the tessdata directory. When I try to run it, I get the following error:
$ tesseract textImg.tif textImgOut.txt -l myLang
Error: Illegal malloc request size!
Fatal error: No error trap defined! Signal_termination_handler called with signal 2001 Signal_exit 30 SIGNAL ABORT. LocCode: 3 SignalCode: 3
Note that if I run without specifying a language, thus using the default settings, tesseract works fine. What am I doing wrong?
Thanks.
I got the same problem as jj-j...@hotmail.com
Any help, please?
Regards,
Duy.
can anybody tell me how to do the training for English from scratch, step by step?
Regards,
Duy.
kbwiley, your stock eng.freq-dawg is empty. You must replace it with something else, but right now I don't know how to make it.
Look at the download section. The names are confusing: the program is tesseract-2.01.tar.gz, and the data files are tesseract-2.00.eng.tar.gz for English, tesseract-2.00.ita.tar.gz for Italian, and so on. Unpack them in the data dir. It worked for me.
are the source word lists for the eng.dawg files available anyplace? Thanks!
Try this script. It can generate a good picture from a box file and a training page image. Example:
$ ./boxes.sh fontfile.box trainingpage.tif result.bmp
http://pastebin.ca/891649
It needs bash, grep and imagemagick. It is good for finding mistakes in fontfile.box, splitting merged letters and so on.
anybody training Tesseract to recognize Portuguese?
has any body ever used tesseract to recognize text containing subscripts and superscripts? text like this:
thx
Is there a straightforward way to tell tesseract that all characters it will encounter are numbers? Is there a command line switch or must I train it on a numbers-only training file?
is "tesseract-21?.00.eng.tar.gz"trained with "boxtiff-2.01.eng.tar.gz"?
I'm trying to train tesseract for a new language, but it doesn't work :\ I have Windows XP and it works until the step "Run Tesseract for Training". I have the files tdata.tif and tdata.box (edited). If I now start the program with "tesseract tdata.tif junk nobatch box.train", my CPU usage rises to 100% and it never stops... the tesseract.log is empty. What can I do now? Need help.
I have the same problem as jj-j...@hotmail.com. I'm now compiling tesseract-2.03. Maybe it will work.
You have to use Linux for this program to be stable.
All commands worked, allowing me to generate the training files for my new "language." When I finished, I tried running tesseract with -l MyLanguage and received:
Error: Illegal malloc request size!
Fatal error: No error trap defined! Signal_termination_handler called with signal 2001 Signal_exit 30 SIGNAL ABORT. LocCode: 3 SignalCode: 3
This was on Mac OS X 10.5.3
please help me with the error T_T
"Error: 48 classes in inttemp while unicharset contains 49 unichars."
If anyone understands about the problem, please email me (bestwish2u1025@yahoo.com)
many thanks.
Best wish to all _
The Portuguese tesseract package I downloaded from the Ubuntu repository comes with just one word. How can we contribute officially to the tesseract dictionaries?
Is anybody training on digits only? I am doing this, but I need 100% accuracy.
Can tesseract manage it? Thanks
I think similar to Mr Mohamed m.k.in creating support for Arabic,Urdu,Persian languages
need help of someone
Please Mr Mohamed if u can contact me on ashrafirafique@gmail.com
Hi, I have created Russian (rsl.) tessdata files and it works!!! How can I check in my files? br rumen
When I tried to train it for a single letter to test the system, the box file was empty. Other times when I tried to train it for a full charset of 125 symbols (ascii codes + a few diacritical characters) it yielded only 50 lines in the box file (50 from 125). What is happening?
More to the point, can you provide help in training from UTF characters? Like, the one you used in your example, the "ü". Or even better, if I could list the character set and it would immediately optimize for those characters. The whole TIFF - BOX process is cumbersome for 99% of the applications.
Usually, how many words do you put in frequent_words_list and words_list?
Please Mr Mohamed if u can contact me on nouri.mohammadreza@gmail.com
Anyone looking for a list of words so they can generate their word-dawg files should be able to find some comprehensive word lists here http://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html
I need OCR to get a machine readable version of a translation of an unknown language. I am going to try to build the lexicon, so at present, I have no dictionary. Is it possible to bootstrap a dictionary during the training process, or, alternatively, is there a way to turn off the top down processing, so that only individual segments are analyzed? Any help would be very welcome. Thank you.
I have been reading the code for a month, but I still don't understand the principle of the recognition. It doesn't seem to extract the features of the characters. If so, what does it rely on to recognize a character? Thank you very much
You really want to use tesseractTrainer.py for editing box files.
You really don't want to touch box files with a barge-pole. If you are editing box files you are not taking it seriously.
Generate the TIFF and the Box file together from the given font and text.
I don't know what libraries you are using, but on .NET it would be something like:
This will allow automated training of the engine for any given font and language. Even the silly curly fonts.
I say it will be "something like" this primarily because the rectangles generated by MeasureTextRanges and GetBounds may not correspond exactly to the rectangles produced by Tesseract, so you may need a two-pass solution where Tesseract first does its best, then you match your rectangles to Tesseract's rectangles in order to correct the box file.
But either way, there is no good reason to edit a box file by hand.
If you're on unix/linux, you can also try editing boxfiles with tessboxes. It has some simple logic for cropping characters automatically.
I've uploaded a copy to the files area http://tesseract-ocr.googlegroups.com/web/tessboxes-0.5.tar.gz or you can try to download it from here http://www.lbreyer.com/tessboxes.html
Can I request that Tesseract be trained for Hindi or other languages like Marathi and Gujarati that are written in Devanagari script?
Are there plans to provide the "Source training data" for Portuguese?
How do I make the freq-dawg, word-dawg and user-words files? I already have a dictionary file saved as UTF-8. Please give a sample Linux command to make them.
Under the wordlist2dawg section, please add a link to the wordlist2dawg memory fault entry in the FAQs. It could have saved me some time.
Thanks for this article, anyhow.
First, thank you for your effort. This looks like a promising technology and I can't wait to get it to work.
I can't train properly following the sequence as stated on this page. The command line under "Run Tesseract for Training" (tesseract fontfile.tif junk nobatch box.train) appears to require eng.unicharset and inttemp--which appear to require products of the training session.
I've installed tesseract-2.01.exe.tar.gz and boxtiff-2.01.eng.tar.gz on Windows XP. I left all the box files alone. Configurations appear to be okay and in the right place.
Suggestions would be greatly appreciated.
Apparently there's a circular dependency in creating these files from scratch.
In case anybody else has this problem, the solution is to install the eight core files which are found in tesseract-2.00.eng.tar.gz . Put these in your tessdata folder and then you can begin to train your own batch.
--Ray
People, I've done all the steps and made all 8 files. I run the command tesseract image.tif output -l geo (geo for Georgian) and the tesseract.log file tells me that it was unable to load the geo.unicharset file. WHY? It is in the tessdata folder, with the other 7 geo. files. Please help.
Or maybe you could tell me where in the code the unicharset file is loaded, so I could understand the reason for the error.
Hello All,
I am trying out tesseract. I downloaded tesseract 2.01 (and extracted it to a folder) and also boxtiff-2.01.eng.tar.gz, which I extracted to the eng folder. I am using MS Windows Vista.
Q1. Do I need anything else to use tesseract for English?
Q2. Do I need to train tesseract for English? Since I already have the .tif and .box files in the eng folder, I guess it is not necessary. Can anyone confirm, please? If I need to train, can someone give me some steps, please?
Q3. Anyone has answers to an earlier question posted in 2007 -
Thx
Please note that I moved tessboxer from http://www.ospilka.com/dl/tessboxer.zip to http://sites.google.com/site/spilkaondrej/
I have the same problem as krazilek. I am trying to train fonts for the Khmer language. Please help.
Please look at this tutorial: http://vannait.blogspot.com/2009/06/how-to-train-tesseract-ocr.html