Issue 71: pdfsizeopt gets __main__.PdfXrefStreamError: duplicate obj 5
Someone sent me a PDF that fails with the sequence below. I pulled pdfsizeopt.py from svn today, 12 Oct 2012. From other debug code, the multiply defined object seems to be /ID. The file also seems to have a stream with no objects. I promised not to post the file. I have attached patches that might help if anyone else has this problem.
William
$ python pdfsizeopt.py 17MB.pdf
info: This is pdfsizeopt.py- rUNKNOWN size=315256.
info: using Java for Multivalent: /usr/bin/java
info: loading PDF from: 17MB.pdf
info: loaded PDF of 16900425 bytes
info: using Ghostscript gs: GPL Ghostscript 9.06 (2012-08-08)
info: decompressing 40 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 5/Predictor 12>>
info: decompressing 9536 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 6/Predictor 12>>
Traceback (most recent call last):
  File "pdfsizeopt.py-", line 7831, in <module>
    main(sys.argv)
  File "pdfsizeopt.py-", line 7793, in main
    ).Load(file_name)
  File "pdfsizeopt.py-", line 3463, in Load
    data, do_ignore_generation_numbers=self.do_ignore_generation_numbers)
  File "pdfsizeopt.py-", line 3805, in ParseUsingXref
    xref_ofs, xref_obj_num, xref_generation)
  File "pdfsizeopt.py-", line 3640, in ParseUsingXrefStream
    raise PdfXrefStreamError('duplicate obj %d' % obj_num)
__main__.PdfXrefStreamError: duplicate obj 5
Oct 12, 2012
Project Member
#1
pts...@gmail.com
Labels:
-Priority-High Priority-Medium
Oct 12, 2012
Thanks for looking at the patch. The person who sent me the file saw my name in some patches. I suggested that he send the file to you. In any case, I am attaching a new patch that is more careful. In one of the places, instead of allowing any duplicate, it permits only /ID. In the other places, instead of continuing silently, it prints a warning to stderr similar to the message that it used to raise. I have a log below that shows the warnings. If you send me a patch that prints more information, I can run it and let you know what happens.
The file has Creator "Adobe Acrobat 8.1 Combine Files", Producer "Acrobat 9.3.1", Optimized "no", PDF version "1.6".
Object 5 starts <</ArtBox[42.5197 42.5197 496.063 722.834]
Object 6 starts <</Filter/FlateDecode/Length 619>>stream
Object 3251 starts <</Length 3645/Subtype/XML/Type/Metadata>>stream endstream
Object 11017 starts <</Author(Client1)/CreationDate(D:20120910153803+02'00')/Creator(Adobe Acrobat 8.1 Combine Files)
Another object has a stream with /Info 11017 0 R.
info: This is pdfsizeopt.py rUNKNOWN size=315564.
info: using Java for Multivalent: /usr/bin/java
info: loading PDF from: 17MB.pdf
info: loaded PDF of 16900425 bytes
info: using Ghostscript gs: GPL Ghostscript 9.06 (2012-08-08)
info: decompressing 40 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 5/Predictor 12>>
info: decompressing 9536 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Columns 6/Predictor 12>>
warning: duplicate obj 5 in xref stream
warning: duplicate obj 6 in xref stream
warning: duplicate obj 3251 in xref stream
warning: duplicate obj 11017 in xref stream
warning: duplicate /ID in xref streams
info: found 11039 obj offsets and 364 obj streams in xref stream
warning: missing offset for xref stream obj 11408
warning: missing xref obj stream 11406
warning: missing xref obj stream 11407
info: separated to 10676 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 34 Type1C fonts loaded
info: writing Type1CParser (73664 font bytes) to: pso.conv.parse.tmp.ps
info: executing Type1CParser with Ghostscript: gs -q -dNOPAUSE -dBATCH -sDEVICE=nullpage -sDataFile=pso.conv.parsedata.tmp.ps -f pso.conv.parse.tmp.ps
Type1CParser: using interpreter GPL Ghostscript 906 20120808
Type1CParser: all OK
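For readers without the attachment, the warn-instead-of-raise behaviour described above might look roughly like the sketch below. This is not the attached patch: the function name `record_xref_entry`, its arguments, and the choice to keep the first entry seen are all assumptions made for illustration. Only the warning text is taken from the log.

```python
import sys


def record_xref_entry(obj_offsets, obj_num, offset, out=sys.stderr):
    """Store an xref entry, warning on a duplicate instead of raising.

    Hypothetical helper, not pdfsizeopt's code. Assumption: the entry
    seen first is kept and the later duplicate is ignored.
    Returns True if the entry was stored, False if it was a duplicate.
    """
    if obj_num in obj_offsets:
        # Same wording as the warnings in the log above.
        out.write('warning: duplicate obj %d in xref stream\n' % obj_num)
        return False
    obj_offsets[obj_num] = offset
    return True


if __name__ == '__main__':
    offsets = {}
    record_xref_entry(offsets, 5, 100)
    record_xref_entry(offsets, 5, 200)  # Warns on stderr, keeps offset 100.
```

Passing the output stream as a parameter keeps the helper easy to exercise without capturing stderr.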
Oct 13, 2012
Thank you very much for the modified and restricted patch. Without an example PDF I don't have enough information to decide whether the patch is an improvement in the general case. (It's definitely an improvement for this specific PDF.) So if you can't attach an example PDF, I'm ready to apply your patch, but the functionality could be enabled by a command-line flag (--do-permissive-obj-parsing) disabled by default. Would this work for you?
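A flag that is disabled by default could be wired up roughly as follows. This is only a sketch: the flag name comes from the comment above, but the use of `argparse`, the `build_arg_parser` helper, and the `input_pdf` positional are assumptions, and pdfsizeopt's real option handling may differ.

```python
import argparse


def build_arg_parser():
    """Build a hypothetical command-line parser with the proposed flag."""
    parser = argparse.ArgumentParser(prog='pdfsizeopt.py')
    parser.add_argument('input_pdf', help='PDF file to optimize')
    # store_true makes the option False unless it is given explicitly,
    # so permissive parsing stays off by default.
    parser.add_argument(
        '--do-permissive-obj-parsing',
        action='store_true',
        help='warn about duplicate objs in xref streams instead of failing')
    return parser


if __name__ == '__main__':
    args = build_arg_parser().parse_args()
    print(args.do_permissive_obj_parsing)
```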
Oct 13, 2012
It is a file that someone sent me. I do not need it to work, and I have asked him to send the file to you. Since he apparently made the PDF with a recent Adobe product, I suspect that other people will have the same problem. Maybe it is better to wait until someone else who is willing to send a PDF has the problem.
Oct 14, 2012
I have permission to send you the PDF privately for the purpose of checking the patches. Is that OK? William
Oct 14, 2012
Thank you very much for the detailed bug report, the follow-up information and the several helpful patches. Based on the provided example PDF I diagnosed the problem, identified several bugs in the xref stream parsing code of pdfsizeopt, and fixed them in r220. Please download the latest pdfsizeopt.py and check if it works correctly. (It works for me.) It turned out that the example PDF was correct, but pdfsizeopt was parsing it incorrectly when both xref streams and /Prev references were involved. I've read the relevant sections (3.4.5 and 3.4.7) of the PDF 1.7 reference again, and modified pdfsizeopt so that now it works according to the specification.
Status:
Fixed
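To illustrate why the same object number can legitimately appear more than once when /Prev is involved: each xref section in the chain may redefine objects from older sections, and the newest definition wins. The sketch below shows that rule on plain dictionaries. It is not the r220 code; `merge_xref_chain` and the `sections` layout (section offset mapped to its entries and its /Prev offset) are made up for illustration.

```python
def merge_xref_chain(sections, start_ofs):
    """Merge xref sections linked by /Prev, newest section first.

    sections: {section_ofs: {'entries': {obj_num: obj_ofs},
                             'prev': older_section_ofs or None}}
    (hypothetical in-memory layout, not pdfsizeopt's data structure).
    An obj_num already taken from a newer section is not overwritten,
    so a repeat across sections is an update, not an error.
    """
    merged = {}
    seen = set()
    ofs = start_ofs
    while ofs is not None:
        if ofs in seen:
            raise ValueError('loop in /Prev chain at offset %d' % ofs)
        seen.add(ofs)
        section = sections[ofs]
        for obj_num, obj_ofs in section['entries'].items():
            merged.setdefault(obj_num, obj_ofs)  # Newest definition wins.
        ofs = section.get('prev')
    return merged


if __name__ == '__main__':
    sections = {
        900: {'entries': {5: 800, 7: 850}, 'prev': 400},  # Newest section.
        400: {'entries': {5: 100, 6: 200}, 'prev': None},  # Original section.
    }
    print(merge_xref_chain(sections, 900))
```

In this made-up example obj 5 appears in both sections, and the merged table keeps offset 800 from the newer one.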
Oct 14, 2012
Thanks, 220 works for me. Regards, William