ocropus-gpageseg: Defective line splitting #210
Comments
Okay, I looked at the debug output with The disconnected (red) components are then creating the different lines. If you increase that value by hand by setting the
then the output looks good: (Don't forget to remove all old images from the directory containing the lines.)
Alright, thanks. That fixes the issue for the specific image (and many others). But if I set this parameter globally for the whole (pre-segmented) book, new problems arise with smaller (e.g. on-line images). Is there a known bug in the scale detection?
I don't know exactly what you mean by "on-line images", but in general, when you have to deal with font sizes that vary widely (header vs. body text vs. footnote text), ocropus has some problems and you might need some other steps.
Nothing I am aware of, but in the example you provide, the guess ocropus makes for the scale parameter does not look optimal. My guess is that for your test image the binarization produces characters that are split into several connected components, and this influences the estimation of the scale parameter. I tried another binarization method here, and then the result also seems okay.
Sorry @zuphilip. This is a typo and should be "one-line images" (i.e., images which cover only a single line). So it's not the varying font size but rather varying clipping sizes from the whole page image which cause the issues.
This is a great idea. I used Many thanks for your ongoing support!
The scale estimation in ocropus for your example will produce this scalemap As far as I understand, the following then happens: For each of these boxes the algorithm calculates the area and then takes the square root (i.e. the geometric mean of width and height). Overall, the median of these numbers (without outliers) is then taken. Maybe in your example there are too many small connected components and/or the font is too narrow... (The corresponding Jupyter notebook is here: https://gist.github.com/zuphilip/e551ba6b733b5094749799651e4fbd3e ) Sauvola is one possibility, and I think ocropus-nlbin has more parameters to try out. Moreover, it should be possible to mix some of the steps of Tesseract with some of the steps of Ocropus. @wrznr Thank you for asking interesting questions!
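The estimation described in the comment above can be sketched roughly as follows. This is a minimal illustration of the description only, not the actual ocropus code: the function name `estimate_scale_sketch` and the outlier bounds `min_size` and `max_size` are assumptions made for this example, and it takes the median over components, which may differ from how ocropus weights them.

```python
import numpy as np
from scipy import ndimage


def estimate_scale_sketch(binary, min_size=3.0, max_size=100.0):
    """Rough scale estimate from connected-component bounding boxes.

    `binary` is a 2-D array where foreground (ink) pixels are non-zero.
    `min_size` and `max_size` are hypothetical outlier bounds chosen for
    this sketch; they are not taken from the thread.
    """
    # Label connected components and get one bounding box per component.
    labels, _ = ndimage.label(binary)
    boxes = ndimage.find_objects(labels)

    # Square root of the bounding-box area, i.e. the geometric mean of
    # the box height and width.
    sizes = np.array(
        [np.sqrt((b[0].stop - b[0].start) * (b[1].stop - b[1].start))
         for b in boxes],
        dtype=float,
    )
    if sizes.size == 0:
        return 0.0

    # Drop outliers, then take the median of what remains.
    kept = sizes[(sizes > min_size) & (sizes < max_size)]
    if kept.size == 0:
        return float(np.median(sizes))
    return float(np.median(kept))
```

With this sketch, a page whose characters break into many small fragments yields a lower median than the same page with intact characters, which matches the suspicion that the binarization affects the scale estimate.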
Indeed, using e.g. scantailor for binarization results in an almost error-free line splitting! Only small one-line segments like page numbers and signature marks (which is probably to be expected) are not correctly processed. Significant step forward! While this is great for me, is it a problem for ocropus (i.e. problems in the combination of
Expected Behavior
Simple running text should be consistently split into lines.
Current Behavior
Currently working on data from the Grenzboten project together with @uvius. For some images, line splitting does not work. It is not clear why because very similar images are split correctly.
Steps to Reproduce (for bugs)
ocropus-gpageseg on testfile(s). Test files have been created with ocropus-nlbin. Tested various command line parameter settings without success.
Your Environment