ocropus-gpageseg: Defective line splitting #210

wrznr · 2017-05-02T10:46:06Z

Expected Behavior

Simple running text should be consistently split into lines.

Current Behavior

Currently working on data from the Grenzboten project together with @uvius. For some images, line splitting fails. It is not clear why, since very similar images are split correctly.

Steps to Reproduce (for bugs)

Download test files:

Run ocropus-gpageseg on testfile(s).
Inspect results.

Test files have been created with ocropus-nlbin. Tested various command line parameter settings without success.

Your Environment

Python version: Python 2.7.3
Git revision of ocropy: commit 49c7f9e
Operating System and version: Linux lal 3.2.0-4-amd64 #1 SMP Debian 3.2.73-2+deb7u2 x86_64 GNU/Linux

zuphilip · 2017-05-02T17:36:08Z

Okay, I looked at the debug output with --debug and it seems that the detected scale is too small (approximately half of the correct size):

The disconnected (red) components are then creating the different lines.

If you increase that value by hand by setting the --scale parameter:

ocropus-gpageseg grenzboten.bin.png -n --debug --scale 30

then the output looks good:

(Don't forget to remove all old images from the directory containing the lines.)
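For rerunning this on several pages, the two steps (cleaning out old line images, then calling ocropus-gpageseg with a manual scale) could be scripted roughly as follows. This is only a sketch: the helper names are made up for illustration, and it assumes the old line images are PNG files in a directory you pass in yourself. Only the flags already used in the command above are passed.

```python
import glob
import os
import subprocess


def clear_old_lines(line_dir):
    """Delete PNG files left over from a previous run; return how many were removed.

    Assumption: line_dir contains nothing but old line images.
    """
    removed = 0
    for path in glob.glob(os.path.join(line_dir, "*.png")):
        os.remove(path)
        removed += 1
    return removed


def gpageseg_command(image, scale=None, debug=False):
    """Build the argument list for ocropus-gpageseg (hypothetical helper)."""
    cmd = ["ocropus-gpageseg", image, "-n"]
    if debug:
        cmd.append("--debug")
    if scale is not None:
        # Override the automatically estimated scale.
        cmd += ["--scale", str(scale)]
    return cmd


def resegment(image, line_dir, scale):
    """Remove stale line images, then rerun the segmentation with a fixed scale."""
    clear_old_lines(line_dir)
    subprocess.check_call(gpageseg_command(image, scale=scale))
```

Calling `gpageseg_command("grenzboten.bin.png", scale=30, debug=True)` reproduces the command line shown above as an argument list.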

wrznr · 2017-05-16T08:27:07Z

Alright, thanks. That fixes the issue for the specific image (and many others). But if I set this parameter globally for the whole (pre-segmented) book, new problems arise with smaller (e.g. on-line images). Is there a known bug in the scale detection?

zuphilip · 2017-05-16T14:58:53Z

> new problems arise with smaller (e.g. on-line images)

I don't know what exactly you mean by "on-line images", but in general, when you have to deal with font sizes that vary a lot (header vs. body text vs. footnote text), ocropus has some problems and you might need some other steps.

> Is there a known bug in the scale detection?

Nothing I am aware of, but in the example you provide, the guess ocropus makes for the scale parameter does not look optimal. My guess is that for your test image the binarization produces characters that are split into several connected components, and this influences the estimation of the scale parameter. I tried another binarization method here, and then the result seems also okay.

wrznr · 2017-05-17T10:07:27Z

Sorry @zuphilip. This is a typo and should be "one-line images" (i.e., images which cover only a single line). So it's not the varying font size but rather varying clipping sizes from the whole page image which cause the issues.

> I tried another binarization method here, and then the result seems also okay.

This is a great idea. I used ocropus-nlbin, which seemed the most obvious choice. In my experience, the tesseract line splitting is far superior to ocropus-gpageseg, but this probably boils down to binarization.

Many thanks for your ongoing support!

amitdo · 2017-05-17T10:20:15Z

https://github.com/tmbdev/ocropy/blob/master/OLD/ocropus-sauvola

zuphilip · 2017-05-17T13:02:15Z

The scale estimation in ocropus for your example will produce this scalemap

As far as I understand, the following happens then: for each of these boxes, the algorithm calculates the area and takes its square root (i.e. the geometric mean of width and height). The median of these numbers (without outliers) is then taken as the scale. Maybe in your example there are too many small connected components and/or the font is too narrow...

( The corresponding Jupyter notebook is here: https://gist.github.com/zuphilip/e551ba6b733b5094749799651e4fbd3e )
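To make the effect of fragmented characters concrete, here is a simplified pure-Python sketch of the estimation as described above. It is not the ocropy code itself: the function names, the 4-connectivity and the outlier bounds (3 and 100) are assumptions chosen for illustration. In practice one would likely use a library routine such as scipy.ndimage.label instead of the hand-written flood fill.

```python
from collections import deque
from math import sqrt
from statistics import median


def component_boxes(binary):
    """Return (top, left, bottom, right) for each 4-connected component of ink."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Flood-fill one component and track its bounding box.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def estimate_scale(binary, min_size=3, max_size=100):
    """Median of sqrt(box area) over all components, ignoring outliers.

    min_size and max_size are illustrative bounds, not values from ocropy.
    """
    sizes = []
    for top, left, bottom, right in component_boxes(binary):
        size = sqrt((bottom - top + 1) * (right - left + 1))
        if min_size < size < max_size:
            sizes.append(size)
    return median(sizes) if sizes else None


def draw_blocks(height, width, blocks):
    """Build a test image; each block is (top, left, block_height, block_width)."""
    image = [[0] * width for _ in range(height)]
    for top, left, h, w in blocks:
        for y in range(top, top + h):
            for x in range(left, left + w):
                image[y][x] = 1
    return image


# Three intact 10x10 "glyphs" versus the same glyphs broken by a blank row.
intact = draw_blocks(12, 40, [(1, x, 10, 10) for x in (1, 14, 27)])
broken = draw_blocks(12, 40, [(1, x, 4, 10) for x in (1, 14, 27)]
                     + [(6, x, 5, 10) for x in (1, 14, 27)])
print(estimate_scale(intact), estimate_scale(broken))
```

In this toy example the broken glyphs yield twice as many components with smaller boxes, so the estimated scale drops below the value for the intact glyphs. That is the same direction of error as in the debug output for the Grenzboten image.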

Sauvola is one possibility, and ocropus-nlbin has more parameters to try out. Moreover, it should be possible to mix some of the steps of Tesseract with some of the steps of Ocropus.
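For reference, the idea behind Sauvola binarization is a per-pixel threshold T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the gray values in a local window. The following is a deliberately slow pure-Python sketch of that formula, not the code from the linked ocropus-sauvola script. The function name and the default values for window, k and r are assumptions picked for illustration.

```python
from math import sqrt


def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Binarize a grayscale image given as rows of values in 0..255.

    Returns rows of 1 (ink) and 0 (background). The defaults for window,
    k and r are illustrative and will usually need tuning per collection.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gray values in the local window, clipped at the image border.
            values = [gray[j][i]
                      for j in range(max(0, y - half), min(height, y + half + 1))
                      for i in range(max(0, x - half), min(width, x + half + 1))]
            mean = sum(values) / len(values)
            std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            threshold = mean * (1 + k * (std / r - 1))
            row.append(1 if gray[y][x] < threshold else 0)
        result.append(row)
    return result


# A bright 5x5 page with one dark vertical stroke in the middle column.
page = [[200, 200, 20, 200, 200] for _ in range(5)]
for line in sauvola_binarize(page, window=3):
    print(line)
```

Because the threshold adapts to local contrast, thin strokes are less likely to break apart than with a single global threshold, which is what matters for the connected components used in the scale estimation.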

@wrznr Thank you for asking interesting questions!

wrznr · 2017-05-30T12:44:29Z

Indeed, using e.g. scantailor for binarization results in an almost error-free line splitting! Only small one-line segments like page numbers and signature marks (which is probably to be expected) are not correctly processed. Significant step forward!

While this is great for me, does it point to a problem in ocropus (i.e. in the combination of nlbin and gpageseg)?

amitdo mentioned this issue Dec 13, 2017

Delete unused lru.py #273

Merged
