ocropus-gpageseg: Defective line splitting #210

Open
wrznr opened this issue May 2, 2017 · 7 comments

Comments

@wrznr

wrznr commented May 2, 2017

Expected Behavior

Simple running text should be consistently split into lines.

Current Behavior

Currently working on data from the Grenzboten project together with @uvius. For some images, line splitting does not work. It is not clear why because very similar images are split correctly.

Steps to Reproduce (for bugs)

  1. Download test files:

179411_01 nrm
179411_01 bin

  2. Run ocropus-gpageseg on testfile(s).
  3. Inspect results.

Test files have been created with ocropus-nlbin. Tested various command line parameter settings without success.

Your Environment

@zuphilip
Collaborator

zuphilip commented May 2, 2017

Okay, I looked at the debug output with --debug and it seems that the detected scale is too small (approximately half of the correct size):

[Image: scale-default_lineseeds]

The disconnected (red) components are then creating the different lines.

If you increase that value by hand by setting the --scale parameter:

ocropus-gpageseg grenzboten.bin.png -n --debug --scale 30

then the output looks good:

[Image: scale-30_lineseeds]

(Don't forget to remove all old images from the directory containing the lines.)
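If the override is only needed for some images, a small wrapper can build the command per file. The following is a minimal sketch, not part of ocropus: `gpageseg_command`, `segment` and `scale_overrides` are hypothetical names, and only the flags from the command above (`-n`, `--debug`, `--scale`) are used.

```python
import subprocess


def gpageseg_command(image, scale=None, debug=False):
    """Build the argument list for one ocropus-gpageseg call."""
    cmd = ["ocropus-gpageseg", image, "-n"]
    if debug:
        cmd.append("--debug")
    if scale is not None:
        # Only pass --scale when a manual value is wanted.
        cmd += ["--scale", str(scale)]
    return cmd


# Hypothetical per-image overrides; images that are not listed
# keep the automatic scale estimate.
scale_overrides = {"grenzboten.bin.png": 30}


def segment(image, run=subprocess.run):
    """Run the segmentation for one image (requires ocropus on the PATH)."""
    return run(gpageseg_command(image, scale_overrides.get(image)), check=True)
```

Calling `gpageseg_command("grenzboten.bin.png", scale=30, debug=True)` yields the same arguments as the command line shown above.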

@wrznr
Author

wrznr commented May 16, 2017

Alright, thanks. That fixes the issue for the specific image (and many others). But if I set this parameter globally for the whole (pre-segmented) book, new problems arise with smaller (e.g. on-line images). Is there a known bug in the scale detection?

@zuphilip
Collaborator

> new problems arise with smaller (e.g. on-line images)

I don't know exactly what you mean by "on-line images", but in general, when the font sizes on a page vary a lot (header vs. body text vs. footnote text), ocropus has some problems and you might need some additional steps.

> Is there a known bug in the scale detection?

None that I am aware of, but in the example you provide, ocropus's guess for the scale parameter does not look optimal. My guess is that for your test image the binarization produces characters that are split into several connected components, and this influences the estimation of the scale parameter. I tried another binarization method here, and then the result also seems okay.

@wrznr
Author

wrznr commented May 17, 2017

Sorry @zuphilip. This is a typo and should be "one-line images" (i.e., images which cover only a single line). So it's not the varying font size but rather varying clipping sizes from the whole page image which cause the issues.

> I tried another binarization method here, and then the result seems also okay.

This is a great idea. I used ocropus-nlbin, which seemed the most obvious choice. In my experience, the Tesseract line splitting is far superior to ocropus-gpageseg, but this probably boils down to binarization.

Many thanks for your ongoing support!

@amitdo
Contributor

amitdo commented May 17, 2017

@zuphilip
Collaborator

The scale estimation in ocropus for your example will produce this scalemap

[Image: grenzboten-scalemap]

As far as I understand, the following happens next: for each of these boxes, the algorithm calculates the area and then takes the square root (i.e. the geometric mean of width and height). The median of these numbers (without outliers) is then taken as the scale. Maybe in your example there are too many small connected components and/or the font is too narrow...

( The corresponding Jupyter notebook is here: https://gist.github.com/zuphilip/e551ba6b733b5094749799651e4fbd3e )
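The estimate described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the ocropus code: it takes one value per bounding box instead of building a scale map, uses 8-connectivity, and the outlier bounds `lo` and `hi` are illustrative defaults. All function names are hypothetical.

```python
from collections import deque
from math import sqrt
from statistics import median


def bounding_boxes(binary):
    """Return (height, width) of the bounding box of each 8-connected component."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill, tracking the extent of the component.
            seen[r][c] = True
            queue = deque([(r, c)])
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r1 - r0 + 1, c1 - c0 + 1))
    return boxes


def estimate_scale_sketch(binary, lo=3.0, hi=100.0):
    """Median of sqrt(box area) over all boxes, ignoring values outside (lo, hi)."""
    sizes = [sqrt(h * w) for h, w in bounding_boxes(binary)]
    kept = [s for s in sizes if lo < s < hi]
    return median(kept) if kept else None


def make_page(inked_rows):
    """Synthetic page with five 20-pixel-wide glyphs inked on the given rows."""
    page = [[0] * 150 for _ in range(40)]
    for i in range(5):
        left = 10 + 28 * i
        for y in inked_rows:
            for x in range(left, left + 20):
                page[y][x] = 1
    return page


whole = make_page(range(10, 30))  # intact 20x20 glyphs
# Same glyphs, each broken into two parts by a 2-pixel horizontal gap.
broken = make_page([y for y in range(10, 30) if y not in (19, 20)])

print(estimate_scale_sketch(whole))
print(estimate_scale_sketch(broken))
```

On this synthetic page the intact glyphs give a scale of 20, while the broken glyphs give a clearly smaller value (about 13.4), which illustrates how characters that fall apart during binarization could pull the estimate down.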

Sauvola is one possibility, and ocropus-nlbin also has more parameters to try out. Moreover, it should be possible to mix some of the steps of Tesseract with some of the steps of Ocropus.

@wrznr Thank you for asking interesting questions!

@wrznr
Author

wrznr commented May 30, 2017

Indeed, using e.g. scantailor for binarization results in an almost error-free line splitting! Only small one-line segments like page numbers and signature marks (which is probably to be expected) are not correctly processed. Significant step forward!

While this is great for me, does it point to a problem in ocropus (i.e. in the combination of nlbin and gpageseg)?
