Since Sep 8, 2008 / Last update: Oct. 16, 2009
Introduction
NHocr is a command-line OCR (Optical Character Recognition) program for the Japanese language. It is designed to recognize machine-printed Japanese characters and some ASCII characters and symbols in an image. NHocr is probably the first open source Japanese OCR software (offline, machine-printed), apart from some experimental, partial code released to academic communities.
You can also use NHocr through the WeOCR service at:
The program is highly experimental, and its character recognition performance is limited. (If you need high-performance OCR, a commercial product will likely serve you better.)
The character feature used in NHocr is based on the Peripheral Local Moment (P-LM) proposed by Hori et al. in the late 1990s.
NHocr was originally a product of the author's weekend programming, so development may be rather slow.
Limitations of the current version
- The current NHocr can handle only text block images, since it does not yet have a page layout analysis engine.
- Recognition accuracy may deteriorate when wide and narrow characters are mixed or when proportional fonts are used.
- Character segmentation performance is limited, since the current version uses a very simple segmentation algorithm.
- Recognition accuracy for ASCII characters may be poor. Another OCR, such as tesseract, is recommended for European languages.
- No linguistic post-processing is included yet.
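To illustrate why simple segmentation struggles with mixed character widths, here is a minimal sketch of one common approach: splitting a text line at blank columns of its vertical projection profile. This is an assumption for illustration only. The algorithm NHocr actually uses is not described here, and the function name `segment_columns` and the list-of-lists image format are hypothetical.

```python
def segment_columns(image):
    """Split a binary text-line image into inked column spans.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    Returns a list of (start, end) column ranges, end exclusive.
    NOTE: illustrative sketch only, not NHocr's actual algorithm.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x                  # entering an inked run
        elif ink == 0 and start is not None:
            spans.append((start, x))   # leaving an inked run
            start = None
    if start is not None:
        spans.append((start, width))   # run touches the right edge
    return spans


# A tiny 3-row "line" containing three separated blobs.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
]
print(segment_columns(line))  # [(0, 2), (4, 5), (6, 9)]
```

A split based purely on gaps cannot tell whether two adjacent spans are two narrow characters or the separated strokes of one wide character (い, for example, has a gap between its strokes). Resolving that ambiguity requires assumptions about character pitch, which is why mixed widths and proportional fonts are hard for a simple segmenter.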
Supported platforms and requirements
Solaris (SPARC/x86) and Linux are officially supported. NHocr should also work on other UNIX-like platforms and MS-Windows.
NHocr depends on the O2-tools package, available at:
NHocr uses FreeType 2, available at:
Supported languages
The current version of NHocr supports Japanese only.
The author is interested in supporting other East Asian languages such as Chinese. Each additional language requires a character code table (cctable-xxx). Contributions are welcome.
Code availability
Source code distribution was scheduled for the second quarter of 2009.
The first source code package, version 0.16, has been available since May 2009.
License
Apache License 2.0 applies to newer versions.
A derivative of MIT-X applies to version 1.5e-32 and older.