Please contribute to these pages. To edit them, you need to join the OCRopus Group.
Compilation
OCRopus is developed on current versions of Ubuntu Linux, but it can also be used on other platforms. The following pages describe how to build and install it elsewhere:
- Compiling OCRopus on Mac OS X
- Compiling OCRopus on Other Platforms Than Linux
- Compiling OCRopus on Solaris
- Compiling OCRopus on Windows
- Installing OCRopus with GNU Autotools
Output Formats
By default, OCRopus uses the hOCR output format. hOCR can be converted into other formats, or you can write your own output format generator in Lua or C++. Here is a list of commonly used output formats:
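Plain text is one such target. As a rough sketch of what an hOCR conversion can look like, the Python snippet below extracts one line of text per hOCR text line. It is not part of OCRopus and does not use the OCRopus Lua or C++ interfaces; the names `HocrLineExtractor` and `hocr_to_text` are hypothetical, and the only assumption about the input is the hOCR convention that each text line is an element whose `class` attribute includes `ocr_line`.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag; they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HocrLineExtractor(HTMLParser):
    """Collect the text of every element whose class includes 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines of text
        self._depth = 0    # nesting depth inside the current ocr_line element
        self._parts = []   # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Already inside a line: track nested elements such as word spans.
            self._depth += 1
        elif "ocr_line" in classes:
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Line element closed: join fragments and normalize whitespace.
            self.lines.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def hocr_to_text(hocr):
    """Return the text of an hOCR document, one output line per ocr_line."""
    parser = HocrLineExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    sample = (
        "<div class='ocr_page' title='bbox 0 0 100 100'>"
        "<span class='ocr_line' title='bbox 10 10 90 20'>"
        "<span class='ocrx_word'>Hello</span> "
        "<span class='ocrx_word'>world</span></span>"
        "</div>"
    )
    print(hocr_to_text(sample))  # prints "Hello world"
```

The sketch discards the layout information that hOCR stores in `title` attributes (for example bounding boxes); a converter for a layout-preserving format would need to parse those as well.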
Alphabetic Languages
These languages share similar recognition and language modeling strategies, so both the Tesseract recognizer and the default OCRopus recognizers should handle them fairly easily. Common phenomena include diacritics, descenders/ascenders, word spacing, variable spacing, variable-width characters, hyphenation, and upper/lower case.
Roman characters with diacritics:
Similar character sets:
- Fraktur
- OCRopus for Russian
- Greek
Small alphabet, diacritics, right-to-left:
- Hebrew
Roman alphabet, multiple diacritics:
- Vietnamese
Large alphabet, 2D arrangement:
Arabic Scripts
Touching characters, diacritics:
Many ligatures, no baseline:
- OCRopus for Urdu
- Persian
CJK
CJK scripts have large character sets and multiple writing directions (even within the same document), and they are frequently mixed with Roman characters.
- OCRopus for Japanese
- OCRopus for Simplified Chinese
- Traditional Chinese
- Korean
Indic Scripts
Indic scripts may be visually quite distinct, but they share a number of common characteristics, such as large character sets, diacritics that can occur anywhere around a character (including before and after), and large numbers of ligatures.
Other Pages
A complete list of pages can be found here: OCRopus Group Wiki.
