Issue 71: pdfsizeopt gets __main__.PdfXrefStreamError: duplicate obj 5
Status:  Fixed
Owner:
Closed:  Oct 2012
Reported by william.bader@gmail.com, Oct 12, 2012
Someone sent me a PDF that fails with the sequence below.  I pulled pdfsizeopt.py from svn today, 12 Oct 2012.  From other debug code, the multiply defined object seems to be /ID.  The file also seems to have a stream with no objects.  I promised not to post the file.  I have attached patches that might help if anyone else has this problem.
William

$ python pdfsizeopt.py 17MB.pdf 
info: This is pdfsizeopt.py- rUNKNOWN size=315256.
info: using Java for Multivalent: /usr/bin/java
info: loading PDF from: 17MB.pdf
info: loaded PDF of 16900425 bytes
info: using Ghostscript gs: GPL Ghostscript 9.06 (2012-08-08)
info: decompressing 40 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 5/Predictor 12>>
info: decompressing 9536 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 6/Predictor 12>>
Traceback (most recent call last):
  File "pdfsizeopt.py-", line 7831, in <module>
    main(sys.argv)
  File "pdfsizeopt.py-", line 7793, in main
    ).Load(file_name)
  File "pdfsizeopt.py-", line 3463, in Load
    data, do_ignore_generation_numbers=self.do_ignore_generation_numbers)
  File "pdfsizeopt.py-", line 3805, in ParseUsingXref
    xref_ofs, xref_obj_num, xref_generation)
  File "pdfsizeopt.py-", line 3640, in ParseUsingXrefStream
    raise PdfXrefStreamError('duplicate obj %d' % obj_num)
__main__.PdfXrefStreamError: duplicate obj 5

pdfsizeopt-12oct12.pat
4.2 KB
Oct 12, 2012
Project Member #1 pts...@gmail.com
Thank you for the bug report and the patch.

I'm hesitant to accept the patch because it makes pdfsizeopt too permissive, and I don't want pdfsizeopt to accept certain kinds of incorrect PDFs. It would help a lot if you could post an example PDF which you think pdfsizeopt should accept.
Labels: -Priority-High Priority-Medium
Oct 12, 2012
#2 william.bader@gmail.com
Thanks for looking at the patch.  The person who sent me the file saw my name in some patches.  I suggested that he send the file to you.  In any case, I am attaching a new patch that is more careful.  In one of the places, instead of allowing any duplicate, it permits only /ID.  In the other places, instead of continuing silently, it prints a warning to stderr similar to the message that it used to raise.
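For readers without the attachment, the behaviour described above might look roughly like the sketch below. It is not the patch itself: the helper names record_xref_entry and merge_trailer are hypothetical, keeping the first entry on a duplicate is an assumption, and so is the idea that the /ID check applies to trailer-like dictionary keys. Only the exception name and the warning texts come from the logs in this report.

```python
import sys


class PdfXrefStreamError(Exception):
    """Raised for xref stream contents that are not tolerated."""


def record_xref_entry(obj_starts, obj_num, offset, warn=sys.stderr):
    """Record obj_num -> offset (hypothetical helper).

    On a duplicate, print a warning instead of raising and keep the
    entry that was seen first (an assumption, not taken from the patch).
    """
    if obj_num in obj_starts:
        warn.write('warning: duplicate obj %d in xref stream\n' % obj_num)
        return False
    obj_starts[obj_num] = offset
    return True


def merge_trailer(trailer, new_trailer, warn=sys.stderr):
    """Merge new_trailer into trailer (hypothetical helper).

    Only a duplicate /ID is tolerated; any other duplicate key still raises.
    """
    for key, value in new_trailer.items():
        if key in trailer:
            if key != '/ID':
                raise PdfXrefStreamError(
                    'duplicate key %s in xref streams' % key)
            warn.write('warning: duplicate /ID in xref streams\n')
            continue
        trailer[key] = value
```

The point of the restriction is that everything unexpected except the one known-harmless case remains a hard error, while the tolerated cases stay visible on stderr.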
I have a log below that shows the warnings.  If you want, if you send me a patch that prints more information, I can run it and let you know what happens.
The file has Creator "Adobe Acrobat 8.1 Combine Files", Producer "Acrobat 9.3.1", Optimized "no", PDF version "1.6".

Object 5 starts <</ArtBox[42.5197 42.5197 496.063 722.834]
Object 6 starts <</Filter/FlateDecode/Length 619>>stream
Object 3251 starts <</Length 3645/Subtype/XML/Type/Metadata>>stream endstream
Object 11017 starts <</Author(Client1)/CreationDate(D:20120910153803+02'00')/Creator(Adobe Acrobat 8.1 Combine Files)
and another object has stream with /Info 11017 0 R.

info: This is pdfsizeopt.py rUNKNOWN size=315564.
info: using Java for Multivalent: /usr/bin/java
info: loading PDF from: 17MB.pdf
info: loaded PDF of 16900425 bytes
info: using Ghostscript gs: GPL Ghostscript 9.06 (2012-08-08)
info: decompressing 40 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 5/Predictor 12>>
info: decompressing 9536 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 6/Predictor 12>>
warning: duplicate obj 5 in xref stream
warning: duplicate obj 6 in xref stream
warning: duplicate obj 3251 in xref stream
warning: duplicate obj 11017 in xref stream
warning: duplicate /ID in xref streams
info: found 11039 obj offsets and 364 obj streams in xref stream
warning: missing offset for xref stream obj 11408
warning: missing xref obj stream 11406
warning: missing xref obj stream 11407
info: separated to 10676 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 34 Type1C fonts loaded
info: writing Type1CParser (73664 font bytes) to: pso.conv.parse.tmp.ps
info: executing Type1CParser with Ghostscript: gs -q -dNOPAUSE -dBATCH -sDEVICE=nullpage -sDataFile=pso.conv.parsedata.tmp.ps -f pso.conv.parse.tmp.ps
Type1CParser: using interpreter GPL Ghostscript 906 20120808
Type1CParser: all OK

pdfsizeopt-20121012.pat
1.9 KB
Oct 13, 2012
Project Member #3 pts...@gmail.com
Thank you very much for the modified and restricted patch.

Without an example PDF I don't have enough information to decide whether the patch is an improvement in the general case. (It's definitely an improvement for this specific PDF.) So if you can't attach an example PDF, I'm ready to apply your patch, but the functionality could be enabled by a command-line flag (--do-permissive-obj-parsing) disabled by default. Would this work for you?
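As an illustration only, an opt-in flag that is disabled by default could be wired up as in the sketch below. The flag name is the one proposed above. Everything else is an assumption: argparse is used here for brevity and may not match how pdfsizeopt actually parses its command line, and build_arg_parser is a hypothetical helper.

```python
import argparse


def build_arg_parser():
    """Build a parser with the proposed opt-in flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog='pdfsizeopt.py')
    parser.add_argument(
        '--do-permissive-obj-parsing',
        action='store_true',
        default=False,  # strict parsing unless the user asks otherwise
        help='warn about duplicate objs in xref streams instead of failing')
    parser.add_argument('input_pdf')
    return parser


if __name__ == '__main__':
    args = build_arg_parser().parse_args()
    mode = 'permissive' if args.do_permissive_obj_parsing else 'strict'
    print('info: %s obj parsing for %s' % (mode, args.input_pdf))
```

With this arrangement the default behaviour stays strict, and a user who hits a file like this one has to opt in explicitly.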
Oct 13, 2012
#4 william.bader@gmail.com
It is a file that someone sent me.  I do not need it to work, and I have asked him to send the file to you.  Since he apparently made the PDF with a recent Adobe product, I suspect that other people will have the same problem.  Maybe it is better to wait until someone else who is willing to send a PDF has the problem.
Oct 14, 2012
#5 william.bader@gmail.com
I have permission to send you the PDF privately for the purpose of checking the patches.  Is that OK?
William
Oct 14, 2012
Project Member #6 pts...@gmail.com
Thank you very much for the detailed bug report, the follow-up information and the several helpful patches.

Based on the provided example PDF I diagnosed the problem, identified several bugs in the xref stream parsing code of pdfsizeopt, and fixed them in r220. Please download the latest pdfsizeopt.py and check whether it works correctly. (It works for me.)

It turned out that the example PDF was correct, but pdfsizeopt was parsing it incorrectly when both xref streams and /Prev references were involved. I've read the relevant sections (3.4.5 and 3.4.7) of the PDF 1.7 reference again, and modified pdfsizeopt so that now it works according to the specification.
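To illustrate why such a file can be correct: when cross-reference sections are chained through /Prev, the same object number may appear in more than one section, and the entry in the most recent section takes precedence. The sketch below shows that rule on simplified data. It is not the code from r220; merge_xref_sections and its input layout are hypothetical.

```python
def merge_xref_sections(sections, start_ofs):
    """Merge xref sections along a /Prev chain, newest section first.

    sections maps a section offset to (entries, prev_ofs), where entries
    maps obj_num -> location and prev_ofs is the /Prev value or None.
    This layout is a simplification made up for the sketch.
    """
    merged = {}
    seen = set()
    ofs = start_ofs
    while ofs is not None:
        if ofs in seen:
            raise ValueError('loop in /Prev chain at offset %d' % ofs)
        seen.add(ofs)
        entries, prev_ofs = sections[ofs]
        for obj_num, location in entries.items():
            # An obj already recorded from a newer section is not a
            # duplicate error; the older entry is simply superseded.
            merged.setdefault(obj_num, location)
        ofs = prev_ofs
    return merged


if __name__ == '__main__':
    sections = {
        300: ({5: 'new obj 5', 7: 'obj 7'}, 100),  # most recent section
        100: ({5: 'old obj 5', 6: 'obj 6'}, None),  # original section
    }
    print(merge_xref_sections(sections, 300))
```

Under this rule, an object number that shows up again in an older section is expected and silently ignored, so no duplicate error is needed for it.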
Status: Fixed
Oct 14, 2012
#7 william.bader@gmail.com
Thanks, r220 works for me.
Regards, William