Issue 263: patch to enable hOCR output
Status:  Fixed
Owner:  ----
Closed:  May 2010

Reported by, Nov 22, 2009

I propose a patch which implements support for hOCR output format (with
page, line and word bounding boxes).

Additionally this patch causes tesseract to recognize the 'tiff' extension,
even if compiled without leptonica.
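
For readers unfamiliar with the format: hOCR is HTML in which each element's title attribute carries a "bbox x0 y0 x1 y1" property. The following is only a minimal sketch of reading such a property in Python; the function name and the sample title string are made up for illustration and are not taken from the patch.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property found in an hOCR title attribute.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


# Hypothetical title value, invented for this example.
sample_title = "bbox 120 45 310 80"
print(parse_bbox(sample_title))
```

The same function would apply to page, line and word elements alike, since they all use the same bbox property.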

Nov 28, 2009
Thanks for this. hOCR seems to be a good "standard" format that Cuneiform and some
commercial packages support, so I'd rather write code to parse and work with it than
with any one-off custom output format.

But... I can't get it to compile after applying your patch.

I grabbed a 3.00 SVN copy of the tesseract code and got it to build earlier. Then I
downloaded your patch, applied it, and ran "make clean" followed by another "make",
which this time did not complete cleanly.

Thoughts? I'm on Fedora 12, x86_64.


make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
Making all in testing
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/testing'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/testing'
Making all in java
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/java'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/java'
Making all in api
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/api'
make[3]: Entering directory `/tmp/tesseract-ocr-read-only/api'
g++ -DHAVE_CONFIG_H -I. -I..  -I../ccutil -I../ccstruct -I../image -I../viewer
-I../ccops -I../dict -I../classify -I../ccmain -I../wordrec -I../cutil -I../textord
-I/usr/local/include/liblept  -g -O2 -MT baseapi.o -MD -MP -MF .deps/baseapi.Tpo -c
-o baseapi.o baseapi.cpp
baseapi.cpp: In function ‘int tesseract::IsParagraphBreak(TBOX, TBOX, int, int)’:
baseapi.cpp:712: error: expected ‘;’ before ‘)’ token
make[3]: *** [baseapi.o] Error 1
make[3]: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/tesseract-ocr-read-only'
make: *** [all] Error 2

Nov 29, 2009
Oops, you are right. Line 712 in baseapi.cpp was completely irrelevant, and I
wonder why it was there. Anyway, here's the corrected version of the patch.
Nov 29, 2009
Thanks for the fast reply. Now, how does one activate this feature?
Following the FAQ example for setting a variable, I did this:

I created /usr/share/tesseract/tessdata/configs/hocr
with contents:
tessedit_create_hocr T

and called it like this:
tesseract image.tif outputbase nobatch hocr

to no avail, though:
read_variables_file: Can't open hocr

So... Any pointers?

Nov 29, 2009
I think this should work (and it actually does work for me). However, since tesseract
can't find the file, I assume you placed it in the wrong location. Are you
sure your tessdata directory is /usr/share/tesseract/tessdata/ (and not just
/usr/share/tessdata or /usr/local/share/tessdata/)?
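
One way to check this is to look for configs/hocr under each candidate directory. The snippet below is a rough sketch, assuming Python is available; the function name is hypothetical, the directory list simply repeats the paths mentioned above, and it only reports where the file exists, not which directory your tesseract build actually reads.

```python
from pathlib import Path

# Candidate tessdata directories mentioned in this thread; other systems may differ.
CANDIDATES = [
    "/usr/share/tesseract/tessdata",
    "/usr/share/tessdata",
    "/usr/local/share/tessdata",
]


def find_config(name, tessdata_dirs):
    """Return the directories from tessdata_dirs that contain configs/<name>."""
    return [d for d in tessdata_dirs if (Path(d) / "configs" / name).is_file()]


if __name__ == "__main__":
    hits = find_config("hocr", CANDIDATES)
    print(hits if hits else "no configs/hocr found in the candidate directories")
```

If the file turns up in a directory other than the one tesseract uses, moving or copying it is the likely fix.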
Feb 14, 2010
In my test with hocr2pdf I wound up with decent horizontal placement but inverted
vertical placement. Output from Cuneiform produced a correct-looking PDF with
hocr2pdf, which makes me believe that this is a bug in this patch. Is there a program
that this output is known to work well with?
Feb 15, 2010
Ah, you are right. The problem is that in hOCR we should count coordinates from the
top left corner, while tesseract puts the coordinate origin at the bottom of the
page. So please test this version of the patch.
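
The conversion amounts to subtracting each y value from the page height and swapping the two y edges. Here is a minimal sketch of that arithmetic in Python; it is not the code from the patch, and the function name, parameter names and numbers are invented for illustration.

```python
def flip_bbox(left, bottom, right, top, page_height):
    """Convert a box measured from the bottom of the page to hOCR order.

    Input y values grow upward from the bottom edge; the returned
    (x0, y0, x1, y1) has y growing downward from the top edge.
    """
    return (left, page_height - top, right, page_height - bottom)


# Hypothetical numbers: a page 1000 pixels tall, with a box spanning
# y = 900..950 as measured from the bottom edge.
print(flip_bbox(100, 900, 300, 950, 1000))  # (100, 50, 300, 100)
```

The x values pass through unchanged, which matches the earlier report of correct horizontal placement and inverted vertical placement.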
Feb 15, 2010
Results are good! Here's my test PDF file. It was created with the SVN version of
tesseract patched with your bbox patch and hocr2pdf, from a page scanned at 300 dpi.
May 19, 2010
Project Member #8
Applied. Had to remove STL, as it is incompatible with Android.
Status: Fixed
Nov 26, 2010
I am using the latest version of tesseract on Ubuntu and running it like this:

tesseract image.tif outputbase nobatch hocr

but get:

cordoval@cordoval-laptop:~/Downloads$ tesseract luis1.jpg luis.txt hocr
read_variables_file: Can't open hocr
Tesseract Open Source OCR Engine with Leptonica
cordoval@cordoval-laptop:~/Downloads$ less luis.txt.txt 

Nov 27, 2010
Project Member #10
read_variables_file: Can't open hocr -> you do not have an hocr config file in your tessdata/configs directory.
Feb 15, 2011
How do I apply a patch? I only downloaded the file tesseract-hocr-fixed-bbox.patch and I don't know what to do with it. Could you help me, please? Regards
Feb 15, 2011
Project Member #12
Use the 'patch' utility. A web search will turn up instructions.
Mar 1, 2013
I installed the book-scanning software "Homer" on Windows and got the "read_variables_file: Can't open hocr" message in the Tesseract log file. Solution: check the path variables in the system settings for duplicate Tesseract installations.
Feb 7, 2014

I need to add the Arabic Sakkal Majalla font to tessdata.
How can I do that? Can anyone help me, please?