Project Proposal
Project Description
The aim of this project is to add Indic script support to the Tesseract OCR engine, which currently does not support connected scripts such as Devanagari. This involves adding routines to the existing code base, training the engine with sample images, and then testing for accuracy so that the algorithms can be debugged and refined.
Project Features
Training will be done for 2 Bengali fonts
De-skewing routines will be implemented to straighten a tilted image
De-italicising routines will be implemented to deal with italicised text
Will be the first freely available OCR engine to support Indic script
Will use existing character segmentation, feature extraction and word level recognition routines of the Tesseract engine
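The proposal does not say how the de-skewing routine will work. As a rough illustration only, the sketch below uses one common approach: try several corrective slopes, shear the image columns vertically by each one, and keep the slope that gives the sharpest row projection profile. A vertical shear is only a small-angle approximation of rotation, all function names are hypothetical, and none of this is Tesseract code. Images are lists of rows holding 0 (background) or 1 (ink).

```python
def shear_columns(image, slope):
    """Shift column c down by round(c * slope) rows, padding with background."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for c in range(width):
        shift = round(c * slope)
        for r in range(height):
            src = r - shift
            if 0 <= src < height:
                out[r][c] = image[src][c]
    return out


def profile_sharpness(image):
    """Sum of squared row ink counts; larger when ink sits in few rows."""
    return sum(sum(row) ** 2 for row in image)


def estimate_skew_slope(image, candidates):
    """Return the candidate corrective slope with the sharpest row profile."""
    return max(candidates,
               key=lambda s: profile_sharpness(shear_columns(image, s)))


def deskew(image, candidates):
    """Apply the best corrective shear found among the candidates."""
    return shear_columns(image, estimate_skew_slope(image, candidates))


def make_tilted_line(width=12, height=8, slope=0.25, top=2):
    """Synthetic input: a one-pixel line drifting downward to the right."""
    image = [[0] * width for _ in range(height)]
    for c in range(width):
        image[top + round(c * slope)][c] = 1
    return image


if __name__ == "__main__":
    tilted = make_tilted_line()
    slopes = [-0.5, -0.25, 0.0, 0.25, 0.5]
    for row in deskew(tilted, slopes):
        print("".join("#" if p else "." for p in row))
```

The same idea, applied as a horizontal shear of rows instead of a vertical shear of columns, would be a plausible starting point for de-italicising, though the proposal does not specify a method for that either.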
Tools and Software Used
Tesseract OCR engine 2.03 http://code.google.com/p/tesseract-ocr/
Gimp 2.2.17 http://www.gimp.org/
bbtesseract (GUI for editing training data, such as box files) 0.5.34 http://code.google.com/p/bbtesseract/
Project Plan
Take the input image and manipulate it so that it is fit to be processed by the Tesseract OCR engine. For Devanagari scripts, this means clipping the maatra (shironaam), the headline stroke that joins the characters of a word, between successive characters so that they are no longer connected.
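The clipping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual routine: the image is a binary bitmap of a single word, the headline is taken to be the row containing the most ink, and it is erased only in columns that have no ink below it, which are assumed to be the gaps between characters. All function names and the `thickness` parameter are hypothetical.

```python
def to_bitmap(lines):
    """Convert strings of '#' (ink) and '.' (background) to rows of 1/0."""
    return [[1 if ch == "#" else 0 for ch in line] for line in lines]


def find_headline_row(image):
    """Return the index of the first row with the most ink pixels."""
    return max(range(len(image)), key=lambda r: sum(image[r]))


def clip_headline(image, thickness=1):
    """Return a copy with the headline erased where nothing hangs below it."""
    top = find_headline_row(image)
    bottom = min(top + thickness, len(image))
    result = [row[:] for row in image]
    for col in range(len(image[0])):
        ink_below = any(image[r][col] for r in range(bottom, len(image)))
        if not ink_below:
            for r in range(top, bottom):
                result[r][col] = 0
    return result


def ink_column_spans(image):
    """Return (start, end) column ranges that contain at least one ink pixel."""
    spans, start = [], None
    width = len(image[0])
    for col in range(width):
        has_ink = any(row[col] for row in image)
        if has_ink and start is None:
            start = col
        elif not has_ink and start is not None:
            spans.append((start, col - 1))
            start = None
    if start is not None:
        spans.append((start, width - 1))
    return spans


# Two toy glyphs joined only by the headline in row 0.
WORD = to_bitmap([
    "########",
    "#.#..#.#",
    "###..###",
    "#.#..#.#",
])

if __name__ == "__main__":
    print(ink_column_spans(WORD))
    print(ink_column_spans(clip_headline(WORD)))
```

Before clipping, the toy word forms one connected column span; after clipping, each glyph occupies its own span, which is the property a character segmenter that expects unconnected characters relies on. A real routine would also have to handle characters whose strokes only touch the headline from above, which this sketch ignores.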
Online Documentation
http://code.google.com/p/tesseract-ocr/wiki/TesseractProjects
http://tesseract-ocr.repairfaq.org/
http://debayanin.googlepages.com/hackingtesseract