About

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and the OCR-related information coexist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

There is a Public Specification for the hOCR Format.
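
To make the idea of invisibly embedded information concrete, here is a minimal sketch of how bounding boxes might be read back out of an hOCR file. It assumes the common hOCR convention of elements whose class begins with "ocr" and whose title attribute holds semicolon-separated properties such as "bbox x0 y0 x1 y1". The names HocrBoxParser and extract_boxes and the sample markup are hypothetical, made up for illustration; they are not part of the tools listed below.

```python
from html.parser import HTMLParser


class HocrBoxParser(HTMLParser):
    """Collect (class, bbox) pairs from elements that carry hOCR properties."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class") or ""
        title = attrs.get("title") or ""
        # Assumption: hOCR elements are identified by a class starting with "ocr".
        if not css_class.startswith("ocr"):
            return
        # Properties in the title attribute are separated by semicolons.
        for prop in title.split(";"):
            fields = prop.split()
            if len(fields) == 5 and fields[0] == "bbox":
                self.boxes.append((css_class, tuple(int(v) for v in fields[1:])))


def extract_boxes(hocr_text):
    """Return a list of (class, (x0, y0, x1, y1)) tuples in document order."""
    parser = HocrBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.boxes


# Hypothetical sample input, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 1000'>
<span class='ocr_line' title='bbox 50 60 700 90'>Hello hOCR</span>
</div>"""

boxes = extract_boxes(SAMPLE)
for css_class, bbox in boxes:
    print(css_class, bbox)
```

Because the properties live in ordinary HTML attributes, a browser renders only the recognized text, while a script like this one can still recover the layout data.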

Available Programs

Included command line programs:

  • hocr-check -- check the hOCR file for errors
  • hocr-combine -- combine pages in multiple hOCR files into a single document
  • hocr-eval -- compute number of segmentation and OCR errors
  • hocr-eval-geom -- compute over, under, and mis-segmentations
  • hocr-eval-lines -- compute OCR errors of hOCR output relative to text ground truth
  • hocr-split -- split an hOCR file into individual pages
  • hocr-merge-dc -- merge Dublin Core metadata into the hOCR HTML header

See the CommandLine Wiki page for more information.

Planned Programs

  • hocr-merge-text -- merge ground truth text into hOCR output (linewise or unaligned)

Possible Programs

  • hocr-eval-pagenos -- determine accuracy of logical page number labels
  • hocr-generate-cuts -- given an hOCR file and a binary image, generate reasonable cuts
  • hocr-generate-xboxes -- given an hOCR file containing cuts and a binary image, generate bounding boxes
  • hocr-as-no-html -- remove all HTML markup from an hOCR file
  • hocr-as-absolute-html -- generate HTML with absolute positioning from hOCR output
  • hocr-as-xytable -- generate an XY-table layout from hOCR output
  • hocr-as-simple-html -- generate simple, logically marked-up HTML from hOCR output
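
As an illustration of what a tool like hocr-as-no-html might do, the sketch below strips all markup from an hOCR document and keeps only the recognized text. This is not the project's implementation, since the program is only listed as a possibility. The names TextOnly and strip_markup are hypothetical, and skipping the contents of title, script, and style elements is an assumption about what "text" should mean here.

```python
from html.parser import HTMLParser

# Assumption: content of these elements is not recognized text and is dropped.
SKIPPED_TAGS = ("title", "script", "style")


class TextOnly(HTMLParser):
    """Accumulate character data, ignoring tags and skipped elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_markup(hocr_text):
    """Return the text content of an hOCR document with all tags removed."""
    parser = TextOnly()
    parser.feed(hocr_text)
    parser.close()  # flush any buffered character data
    return "".join(parser.chunks)


# Hypothetical sample input, for illustration only.
SAMPLE_DOC = (
    "<html><head><title>scan</title></head><body>"
    "<span class='ocr_line' title='bbox 1 2 3 4'>Hello &amp; bye</span>"
    "</body></html>"
)

print(strip_markup(SAMPLE_DOC))
```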

Planned Converters

Please let us know if you want to help with these.

  • unz2hocr -- UNLV zone file to hOCR converter
  • icdar2hocr -- convert ICDAR competition format to hOCR
  • xdoc2hocr -- convert XDOC format to hOCR
  • dafs2hocr -- convert DAFS to hOCR
  • djvu2hocr -- convert DjVu XML to hOCR
  • lura2hocr -- convert Luratech Abbyy XML to hOCR