READ-ONLY: This project has been archived. For more information see this post.
Issue 59: Corrupt jbig2 pages in output PDF
Status:  Started
Owner:  ----


 
Reported by fdnc...@gmail.com, Jun 26, 2012
What steps will reproduce the problem?
1. Run pdfsizeopt.py Pages1-7.pdf on Windows with the default options and you'll get the problem.


What is the expected output? What do you see instead?
I expect the pages to be viewable and compressed.  The attached PDF is what I see instead: blank pages with an error stating "Insufficient data for an image".

What version of the product are you using? On what operating system?
Latest from svn.  Windows 7 32-bit.

Please provide any additional information below.
The attached log is the output of the run.  I'm also attaching the PDF file before compression and the PDF file after compression.  I also found another viewer (STDU Viewer) that partially decodes the output PDF file, so I'm attaching a screenshot of what it looks like.  Finally, I'm attaching my jbig2.exe, statically compiled with VS2010 from Adam Langley's source on GitHub.

Thanks,
Darren
Pages1-7.pdf
251 KB   Download
Pages1-7.psom.pdf
183 KB   Download
image_optimze_error.txt
23.9 KB   View   Download
STDU_Viewer_Partial_Image.png
246 KB   View   Download
jbig2.exe
852 KB   Download
Jun 26, 2012
Project Member #1 pts...@gmail.com
Thank you for the detailed bug report. Based on the files image_optimze_error.txt and Pages1-7.psom.pdf you have uploaded, I could figure out what's going wrong. I'm almost sure that I've identified an easy-to-fix bug in your jbig2.exe. Once you fix the bug, recompile jbig2.exe, and rerun pdfsizeopt, it should be fine.

On Windows, a file can be opened in either ASCII (text) mode or binary mode. ASCII is the default; to get binary mode, pass ...|O_BINARY as the 2nd argument of open(), or pass a string containing "b" (e.g. "rb" instead of "r"; "wb" instead of "w") as the 2nd argument of fopen(), or call setmode(1, O_BINARY) to put stdout into binary mode. If a file is opened in ASCII mode, then all writes (e.g. write(...), putchar(...), fwrite(...), fprintf(...)) of "\n" (10) actually write "\r\n" (13, 10) to the file.
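The effect described above can be simulated without a Windows machine. The following is only a minimal Python sketch of the translation, not code from jbig2.exe or pdfsizeopt; the function names and the sample payload are made up for illustration.

```python
def write_in_text_mode(data: bytes) -> bytes:
    """Simulate a Windows text-mode write: every LF (10) becomes CR LF (13, 10)."""
    return data.replace(b"\n", b"\r\n")


def count_inserted_cr(data: bytes) -> int:
    """Number of extra CR bytes a text-mode write would add to `data`."""
    return data.count(b"\n")


# Hypothetical binary payload in which the byte 0x0A happens to occur twice.
payload = bytes([0x00, 0x0A, 0x97, 0x4A, 0x0A, 0xFF])
on_disk = write_in_text_mode(payload)

print(on_disk.hex(" "))
print("extra bytes:", len(on_disk) - len(payload))
```

For compressed image data the byte 10 can occur anywhere, so each occurrence silently grows the output by one byte and shifts everything after it.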

In our case, jbig2.exe writes the JBIG2-compressed image to its stdout, e.g. see the line

info: executing image optimizer jbig2: jbig2 -p pso.conv-3.sam2p-pr.png >pso.conv-3.jbig2

in the image_optimze_error.txt you have uploaded. The bug is that jbig2.exe writes to stdout in ASCII mode, whereas binary mode is needed for image data. It's easy to fix: please add setmode(1, O_BINARY) to the beginning of the main() function of jbig2.exe, recompile jbig2.exe, and rerun the optimization like this:

$ pdfsizeopt.py --use-pngout=no Pages1-7.pdf

Now Pages1-7.psom.pdf should be correct, and the JBIG2 file should be a few bytes shorter, as indicated on the console output. Old, incorrect:

info: optimized image XObject 3 file_name=pso.conv-3.jbig2 size=2109 (58%) methods=jbig2:2109,#orig:3637,pngout:6793,sam2p_np:7011,sam2p_pr:8586,gs:11056

New, correct:

info: optimized image XObject 3 file_name=pso.conv-3.jbig2 size=2102 (58%) methods=jbig2:2102,#orig:3637,sam2p_np:7011,sam2p_pr:8586,gs:11050

(Please note the difference between 2109 and 2102 bytes.)
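The size difference can also be read off mechanically. Here is a small Python sketch that parses the methods= field of the two log lines quoted above; the parser is an illustration written against just these two lines and is not part of pdfsizeopt.

```python
import re


def parse_methods(log_line: str) -> dict:
    """Extract the `methods=name:size,...` field of an 'optimized image' log line."""
    match = re.search(r"methods=(\S+)", log_line)
    if match is None:
        raise ValueError("no methods= field in log line")
    sizes = {}
    for item in match.group(1).split(","):
        name, _, size = item.rpartition(":")
        sizes[name] = int(size)
    return sizes


OLD = ("info: optimized image XObject 3 file_name=pso.conv-3.jbig2 size=2109 (58%) "
       "methods=jbig2:2109,#orig:3637,pngout:6793,sam2p_np:7011,sam2p_pr:8586,gs:11056")
NEW = ("info: optimized image XObject 3 file_name=pso.conv-3.jbig2 size=2102 (58%) "
       "methods=jbig2:2102,#orig:3637,sam2p_np:7011,sam2p_pr:8586,gs:11050")

# Bytes by which the old jbig2 output was longer than the new one.
extra_bytes = parse_methods(OLD)["jbig2"] - parse_methods(NEW)["jbig2"]
print(extra_bytes)
```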

If this O_BINARY change doesn't fix the problem, then please upload the entire directory (containing the pso.* temporary files) ZIPped as an attachment to this issue. Also include the recompiled jbig2.exe you use, and the console output of pdfsizeopt.

To illustrate my point, I've modified a few bytes of Pages1-7.psom.pdf: I've removed the 7 extra \r characters (and added some padding after the obj to make the file size the same). This effectively fixed the image of page 2. So if you make jbig2.exe not emit the \r characters, most probably the whole PDF would be fixed.
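The manual repair can be sketched in Python as well. This is only an illustration with a hypothetical function name, and it assumes that the whole byte string went through an ASCII-mode write: in that case every "\n" in the corrupted data is preceded by exactly one inserted "\r", so removing that one "\r" restores the original bytes.

```python
def strip_inserted_cr(corrupted: bytes) -> bytes:
    """Undo an ASCII-mode write by turning every CR LF back into a single LF.

    Assumption: all of `corrupted` was written in ASCII mode, so the CR in
    front of each LF was inserted by the C runtime and is not image data.
    """
    return corrupted.replace(b"\r\n", b"\n")


# Hypothetical corrupted bytes: the original data was 00 0A 97 0D 0A FF.
corrupted = b"\x00\r\n\x97\r\r\n\xff"
repaired = strip_inserted_cr(corrupted)
print(repaired.hex(" "))
```

This applies to a standalone file such as a pso.*.jbig2 temporary file. Inside a finished PDF the stream becomes shorter, so the bytes after it would move unless padding is added, as was done for the attached file. Fixing jbig2.exe itself remains the proper solution.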

If you manage to fix jbig2.exe, please upload it as an attachment to this issue, so others would also benefit.
Pages1-7.psom.fix1.pdf
183 KB   Download
Jun 27, 2012
#2 fdnc...@gmail.com
That fixed it.  Thanks for all your help!!!

Attached is my vs2010 compiled jbig2.exe and all the source code in case someone else wants to compile it.
jbig2enc_20120627.zip
535 KB   Download
Jun 27, 2012
Project Member #3 pts...@gmail.com
Thank you for sharing your jbig2.exe and your source tree.

jbig2.exe was one of the missing dependencies of pdfsizeopt on Windows. Today I compiled the remaining few dependencies, so now pdfsizeopt is officially available on Windows, and it's easier to install than ever. If you're interested, please check out the new installation page at https://code.google.com/p/pdfsizeopt/wiki/InstallationInstructionWindows .

It would be very useful if you could upload all the library dependencies of jbig2enc_20120627.zip, including the URLs you downloaded them from, and a .cmd file which compiles all the dependencies from scratch. That way we could tell a future developer to install Visual Studio, download and extract a .zip file, run a .cmd file, and wait for jbig2.exe to be built automatically.
Status: Fixed
Jul 9, 2012
#4 fdnc...@gmail.com
Hey, glad I could help.

I followed the instructions here
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/README.html#building-on-windows
to compile Leptonica (http://leptonica.com/) and download the dependencies.

I think you can just download the dependencies
(http://leptonica.org/source/leptonica-1.68-win32-lib-include-dirs.zip)
and put everything in the right place to compile the jbig2 encoder.  I may have done that.  I can't remember. ;)

Darren
Jul 9, 2012
#5 fdnc...@gmail.com
This is what I get when I run your new windows version.

C:\Users\x991808\Desktop\pdfsizeopt_win32bin>pdfsizeopt.exe 000000.PDF
info: This is pdfsizeopt.py rUNKNOWN size=309327.
info: loading PDF from: 000000.PDF
info: loaded PDF of 515655 bytes
info: separated to 26 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 0 Type1C fonts loaded
info: eliminated 2 unused objs in 2 classes
info: saving PDF with 24 objs with Multivalent to: 000000.psom.pdf
info: writing Multivalent input PDF: pso.conv.mi.tmp.pdf
info: generated object stream of 529 bytes in 21 objects (14%)
info: written 513629 bytes to Multivalent input PDF: pso.conv.mi.tmp.pdf
error: Multivalent.jar not found. Make sure it is on the $PATH, or it is
one of the files on the $CLASSPATH.
Traceback (most recent call last):
  File ".\pdfsizeopt.py", line 7698, in <module>
    main(sys.argv)
  File ".\pdfsizeopt.py", line 7694, in main
    may_obj_heads_contain_comments=may_obj_heads_contain_comments)
  File ".\pdfsizeopt.py", line 7425, in Save
    may_obj_heads_contain_comments=may_obj_heads_contain_comments)
  File ".\pdfsizeopt.py", line 7322, in _RunMultivalent
    assert 0, 'Multivalent.jar not found, see above'
AssertionError: Multivalent.jar not found, see above
Jul 9, 2012
Project Member #6 pts...@gmail.com
AssertionError: Multivalent.jar not found, see above

Did you follow the installation instructions? Did you download the newest pdfsizeopt.py (its size is 313571)? If that still doesn't fix the problem, please copy-paste the output of

  dir /s C:\Users\x991808\Desktop\pdfsizeopt_win32bin
Jul 10, 2012
#7 fdnc...@gmail.com
Yes, I followed the instructions but I tried again this morning (re-doing all the instructions) and everything is working fine now.  Running a massive PDF to test at the moment.  So far so good.  I just wish there was a way to speed up pngout.  That thing takes forever.
Jul 10, 2012
#8 fdnc...@gmail.com
One last thing you should add is the msvcr100.dll since I compiled jbig2.exe with vs2010.  Here's mine.
Jul 10, 2012
Project Member #9 pts...@gmail.com
About pngout: you can use --use-pngout=no . There is a speed vs size tradeoff here. pngout is slow, but its output is small.
Jul 10, 2012
Project Member #10 pts...@gmail.com
Based on the information you have provided, I managed to compile a jbig2.exe (see it attached) suitable for use with pdfsizeopt. I compiled it using MinGW (cross-compiling on Linux), so it doesn't need msvcr100.dll . (I also removed the attached msvcr100.dll to avoid copyright issues in the future.)

In the near future, I'll release this new jbig2.exe so it will be used by default with pdfsizeopt on Windows.

FYI My jbig2.exe is noticeably smaller than yours, because I removed many unnecessary functions from the leptonica library (editing .c files by hand), and I also removed a few command-line flags which pdfsizeopt doesn't need.

Thank you very much for your help in providing patches and compilation instructions. It helped me a lot in understanding jbig2 on Windows and preparing my own version.
jbig2.exe
355 KB   Download
Status: Started
Jul 11, 2012
#11 fdnc...@gmail.com
Excellent!  Glad to hear you were able to get it compiled.  It wasn't trivial in VS2010 for me, but MinGW is probably the easier choice, especially if you're used to Linux/gcc.  Sorry I wasn't able to provide the batch file you requested.  Just too much going on right now to mess with it.

You might want to try out this alternate version of JBIG2Enc: https://github.com/zdenop/jbig2enc/tree/R.Hatlapatka .  It's supposed to have better autothresholding, which I interpret to mean better compression on some images, assuming the thresholding works.  I haven't tried it yet.

BTW - I tried the --use-pngout=no on my 146MB PDF file.  It took 20 minutes instead of 2.5 hours and the file sizes were identical.  So pngout doesn't seem to help unless you have color images.  My test file was all CCITTFaxDecode, so maybe if you see that filter (which is always bitonal) you shouldn't call pngout?  Just an idea to save time.
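A rough Python sketch of that idea follows. It is not taken from pdfsizeopt, whose internals are not shown in this thread: the function name, the set name, and the way the image's /Filter names are passed in are all hypothetical.

```python
# Filters that imply a bitonal image. Only CCITTFaxDecode is named in this
# thread; treating it as the sole trigger is an assumption of this sketch.
SKIP_PNGOUT_FILTERS = {"CCITTFaxDecode"}


def should_try_pngout(filter_names) -> bool:
    """Return False if the image's existing PDF filters suggest pngout won't help.

    `filter_names` is a hypothetical list of the image XObject's /Filter
    names, with or without the leading slash.
    """
    names = {name.lstrip("/") for name in filter_names}
    return not (names & SKIP_PNGOUT_FILTERS)


print(should_try_pngout(["/CCITTFaxDecode"]))
print(should_try_pngout(["/FlateDecode"]))
```

Whether skipping pngout is always safe for such images would need measuring on more files than the single 146MB test described above.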
