ocropus
The OCRopus(tm) open source document analysis and OCR system
Updated May 11, 2008 by tmbdev
OcropusWiki  
Quick Index to User Contributed Pages

Please contribute to these pages. To edit them, you need to join the OCRopus Group.

Compilation

OCRopus is developed on current versions of Ubuntu Linux, but it can also be used on other platforms. The following pages describe how to use it on those platforms:

Output Formats

By default, OCRopus uses the hOCR output format. hOCR can be converted into other formats, or you can write your own output format generator in Lua or C++. Here is a list of commonly used output formats:
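As a rough illustration of converting hOCR into another format, the sketch below pulls the text and bounding box of each line out of an hOCR document using only the Python standard library. It assumes lines are marked with class `ocr_line` and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, as hOCR describes; the sample document, the `HocrLineExtractor` class, and the `hocr_to_lines` helper are hypothetical names made up for this example and are not part of OCRopus.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input, not real OCRopus output.
SAMPLE_HOCR = """<html><body>
<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>Hello world</span>
<span class='ocr_line' title='bbox 10 60 320 90'>second line</span>
</div></body></html>"""

# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "wbr"}


class HocrLineExtractor(HTMLParser):
    """Collects (bbox, text) pairs for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished (bbox, text) pairs
        self._depth = 0     # nesting depth inside the current line, 0 = outside
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1    # nested element inside a line
            return
        attrs = dict(attrs)
        if "ocr_line" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                              attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:    # the ocr_line element itself just closed
            text = " ".join("".join(self._chunks).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def hocr_to_lines(hocr):
    """Return a list of (bbox, text) tuples, one per recognized line."""
    parser = HocrLineExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in hocr_to_lines(SAMPLE_HOCR):
        print(bbox, text)
```

Plain text can then be produced by joining the extracted line texts with newlines; the bounding boxes are kept so that a layout-preserving format could use them instead.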

Alphabetic Languages

These languages all share similar recognition and language modeling strategies and should be fairly easy for both the Tesseract recognizer and the default OCRopus recognizers to handle. Common phenomena include diacritics, descenders/ascenders, word spacing, variable spacing, variable width characters, hyphenation, and upper/lower case.

Roman characters with diacritics:

Similar character sets:

Small alphabet, diacritics, right-to-left:

Roman alphabet, multiple diacritics:

Large alphabet, 2D arrangement:

Arabic Scripts

Touching characters, diacritics:

Many ligatures, no baseline:

CJK

CJK scripts have large character sets and multiple writing directions (even within the same document), and are frequently mixed with Roman characters.

Indic Scripts

Indic scripts may be visually quite distinct, but they share a number of common characteristics, such as large character sets, diacritics that can occur anywhere around a character (including before and after), and large numbers of ligatures.
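To make the points about diacritic placement and ligatures concrete, the sketch below inspects two short Devanagari strings with Python's standard `unicodedata` module. Devanagari is used here only as an illustrative assumption (the text above does not single out a script), and `describe` is a hypothetical helper. In the first string the vowel sign is stored after the consonant even though it is drawn to its left; in the second, three code points joined by a virama are normally rendered as a single conjunct glyph.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


# "ki": consonant KA followed in memory by VOWEL SIGN I,
# which is displayed to the left of the consonant.
KI = "\u0915\u093F"

# Conjunct: KA + VIRAMA + SSA, usually rendered as one ligature glyph.
KSSA = "\u0915\u094D\u0937"

if __name__ == "__main__":
    for label, text in (("ki", KI), ("kssa", KSSA)):
        print(label)
        for code_point, name, category in describe(text):
            print("  ", code_point, name, category)
```

Categories beginning with `M` mark combining signs, so the logical order of code points an OCR system has to emit can differ from the left-to-right order of the shapes it sees on the page.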

Other Pages

A complete list of pages can be found here: OCRopus Group Wiki.

