Issue 263: patch to enable hOCR output
Hi, I propose a patch which implements support for hOCR output format (with page, line and word bounding boxes). Additionally this patch causes tesseract to recognize the 'tiff' extension, even if compiled without leptonica.
Nov 28, 2009
Thanks for this. HOCR seems to be a good "standard" format which Cuneiform and some commercial packages support. So I'd rather write code to parse and work with it than any one-off custom output formats...

But... I can't get it to compile after applying your patch. I grabbed a 3.00 SVN copy of the tesseract code and got it to build earlier. Then I downloaded your patch and applied it followed by a "make clean" and another "make"... Which this time does not complete cleanly. Thoughts?

Fedora12 X86_64

Thanks...

..snip..
make: Nothing to be done for `all-am'.
make: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
make: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
Making all in testing
make: Entering directory `/tmp/tesseract-ocr-read-only/testing'
make: Nothing to be done for `all'.
make: Leaving directory `/tmp/tesseract-ocr-read-only/testing'
Making all in java
make: Entering directory `/tmp/tesseract-ocr-read-only/java'
make: Nothing to be done for `all'.
make: Leaving directory `/tmp/tesseract-ocr-read-only/java'
Making all in api
make: Entering directory `/tmp/tesseract-ocr-read-only/api'
make: Entering directory `/tmp/tesseract-ocr-read-only/api'
g++ -DHAVE_CONFIG_H -I. -I.. -I../ccutil -I../ccstruct -I../image -I../viewer -I../ccops -I../dict -I../classify -I../ccmain -I../wordrec -I../cutil -I../textord -I/usr/local/include/liblept -g -O2 -MT baseapi.o -MD -MP -MF .deps/baseapi.Tpo -c -o baseapi.o baseapi.cpp
baseapi.cpp: In function ‘int tesseract::IsParagraphBreak(TBOX, TBOX, int, int)’:
baseapi.cpp:712: error: expected ‘;’ before ‘)’ token
make: *** [baseapi.o] Error 1
make: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make: *** [all-recursive] Error 1
make: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make: *** [all-recursive] Error 1
make: Leaving directory `/tmp/tesseract-ocr-read-only'
make: *** [all] Error 2
Nov 29, 2009
Oops, you are right. Line 712 in baseapi.cpp was completely irrelevant, and I wonder why it was there. Anyway, here's the corrected version of the patch.
Nov 29, 2009
Thanks for the fast reply... Now though... Hmm... How does one activate this feature? Following the example from the FAQ of setting a variable I did this:

I created /usr/share/tesseract/tessdata/configs/hocr with contents:
tessedit_create_hocr T

and called it like this:
tesseract image.tif outputbase nobatch hocr

to no avail though...
read_variables_file: Can't open hocr

So... Any pointers? Thanks...
Nov 29, 2009
I think this should work (and it actually does work for me). However, since tesseract can't find the file, I assume you have placed it in the wrong location. Are you sure your tessdata directory is /usr/share/tesseract/tessdata/ (and not just /usr/share/tessdata or /usr/local/share/tessdata/)?
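For anyone hitting the same "Can't open hocr" message, here is a rough Python sketch that checks which of the directories mentioned above actually contains a configs/hocr file. The helper name find_hocr_configs and the candidate list are only assumptions for illustration; your installation may use a different prefix, and this does not query tesseract itself.

```python
from pathlib import Path

# Candidate tessdata directories taken from this thread; these are
# assumptions, other installations may use a different prefix entirely.
CANDIDATE_TESSDATA_DIRS = [
    "/usr/share/tesseract/tessdata",
    "/usr/share/tessdata",
    "/usr/local/share/tessdata",
]


def find_hocr_configs(tessdata_dirs=CANDIDATE_TESSDATA_DIRS):
    """Return every <tessdata>/configs/hocr file that exists (hypothetical helper)."""
    found = []
    for directory in tessdata_dirs:
        config = Path(directory) / "configs" / "hocr"
        if config.is_file():
            found.append(config)
    return found


if __name__ == "__main__":
    matches = find_hocr_configs()
    if matches:
        for path in matches:
            print(f"found: {path}")
    else:
        print("no configs/hocr file in any candidate directory")
```

If nothing is found, the config file is probably sitting under a tessdata directory that this tesseract binary does not read.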
Feb 14, 2010
In my test with hocr2pdf I wound up with decent horizontal placement, but inverted vertical placement. Output from Cuneiform produced a correct looking pdf with hocr2pdf, which makes me believe that this is a bug in this patch. Is there a program that this output is known to work well with?
Feb 15, 2010
Ah, you are right. The problem is that in hOCR coordinates should be counted from the top left corner, while tesseract puts the coordinate origin at the bottom of the page. So please test this version of the patch.
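To illustrate the flip, here is a minimal Python sketch, not the code from the patch: it assumes a box given as (left, bottom, right, top) with the origin at the bottom left of the page and converts it to hOCR's x0 y0 x1 y1 order measured from the top left. The function names and the sample numbers are made up for the example.

```python
def to_hocr_bbox(left, bottom, right, top, page_height):
    """Convert a bottom-left-origin box to top-left-origin (x0, y0, x1, y1).

    Horizontal values are unchanged; each vertical value is measured from
    the top instead, so the old top edge becomes the smaller y value.
    """
    return (left, page_height - top, right, page_height - bottom)


def format_bbox(box):
    """Render the box the way it appears in an hOCR title attribute."""
    return "bbox {} {} {} {}".format(*box)


if __name__ == "__main__":
    # Hypothetical word box on a page 1000 pixels tall.
    box = to_hocr_bbox(left=100, bottom=200, right=300, top=250, page_height=1000)
    print(format_bbox(box))  # prints "bbox 100 750 300 800"
```

Skipping this conversion leaves the horizontal placement correct but mirrors the vertical placement, which matches the symptom reported with hocr2pdf.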
Feb 15, 2010
Results are good! Here's my test PDF file. It was created with the SVN version of tesseract patched with your bbox patch and hocr2pdf from a page scanned at 300dpi.
May 19, 2010
Applied. Had to remove STL, as it is incompatible with Android. Thanks.
Nov 26, 2010
I am using the latest version of tesseract on Ubuntu and running it like this:
tesseract image.tif outputbase nobatch hocr

but get:
cordoval@cordoval-laptop:~/Downloads$ tesseract luis1.jpg luis.txt hocr
read_variables_file: Can't open hocr
Tesseract Open Source OCR Engine with Leptonica
cordoval@cordoval-laptop:~/Downloads$ less luis.txt.txt
Nov 27, 2010
"read_variables_file: Can't open hocr" means you do not have the hocr config file.
Feb 15, 2011
How do I apply a patch? I only downloaded the file tesseract-hocr-fixed-bbox.patch and I don't know what to do with it. Could you help me, please? Regards
Feb 15, 2011
With the 'patch' program/utility. Try searching Google.
Mar 1, 2013
I installed the book-scanning software "Homer" on Windows and got the message "read_variables_file: Can't open hocr" in the Tesseract log file. Solution: check the path variables in the system settings for duplicate tesseract installations.
Feb 7, 2014
Hi, I need to add an Arabic font, Sakkal Majalla, to tessdata. How can I do that? Can anyone help me, please?