READ-ONLY: This project has been archived. For more information see this post.
Issue 77: Remove the info fields
Reported by rbr...@gmail.com, Feb 26, 2013
I am trying to use pdfsizeopt to produce a "normalized" or "canonical" representation of PDF files, both for deduplication during backups and for the sake of privacy. It would be nice if pdfsizeopt could also remove metadata, such as the user-customizable fields that pdfinfo prints for a PDF file, for example:

Title:          The bytefield package
Subject:        Protocol diagrams for LaTeX
Keywords:       bits, bytes, bit fields, communication, network protocol diagrams, LaTeX2e, memory maps
Author:         Scott Pakin <scott+bf@pakin.org>
Creator:        LaTeX with hyperref package
Producer:       pdfTeX-1.40.10
CreationDate:   Sun Sep  2 13:50:50 2012
ModDate:        Sun Sep  2 13:50:50 2012
Tagged:         no
Pages:          48
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      724524 bytes
Optimized:      yes
PDF version:    1.4

The dates matter most. When would this "normalization" be desirable?

For instance, I frequently find PS files and download them, sometimes on multiple computers when I have to stop reading and switch to another machine. I then convert them to PDF, since not every environment I use has an adequate PS reader. Otherwise identical copies can then end up differing only in fields such as the dates.

When I back things up, it would be nice to be able to run a program like hardlink, fdupes, rdfind, or duff to choose which copies I keep and which I don't.

It would also make it easier for deduplicating backup tools (like obnam or bup) to save space in such circumstances.
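
To illustrate why the dates matter here, the following is a minimal sketch of content-hash deduplication, roughly the idea behind tools of this kind (it is not the code of any of them). The function name `group_by_digest` and the byte strings standing in for PDF files are made up for the example.

```python
import hashlib
from collections import defaultdict


def group_by_digest(blobs):
    """Return groups of names whose contents are byte-for-byte identical.

    `blobs` maps a (hypothetical) file name to its raw bytes.
    """
    groups = defaultdict(list)
    for name, data in blobs.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Only groups with more than one member are duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Stand-ins for two conversions of the same PS file: only the date differs.
a = b"<< /Title (x) /CreationDate (D:20120902135050) >>"
b = b"<< /Title (x) /CreationDate (D:20120902135107) >>"
c = a  # an exact copy of the first conversion

duplicates = group_by_digest({"a.pdf": a, "b.pdf": b, "c.pdf": c})
print(duplicates)
```

Only the exact copy is detected; the second conversion is treated as a different file because a single date field changes its digest.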


Regards,

Rogério Brito.
Feb 27, 2013
Project Member #1 pts...@gmail.com
Yes, it would be a nice and simple new pdfsizeopt feature to remove the info fields Title, Subject, Keywords, Author, Creator, Producer, CreationDate and ModDate.
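
As a rough sketch of what the removal amounts to (this is not pdfsizeopt code), the example below filters the listed keys out of a plain Python dict that stands in for a parsed Info dictionary. The name `strip_info_fields`, the dict representation, and the sample values are assumptions made for illustration only.

```python
# The info fields named in this issue.
INFO_FIELDS = (
    "Title", "Subject", "Keywords", "Author",
    "Creator", "Producer", "CreationDate", "ModDate",
)


def strip_info_fields(info):
    """Return a copy of `info` without the fields listed in INFO_FIELDS.

    `info` is a plain dict standing in for a parsed PDF Info dictionary;
    keys outside INFO_FIELDS are kept unchanged.
    """
    return {key: value for key, value in info.items() if key not in INFO_FIELDS}


# Hypothetical sample, loosely modeled on the pdfinfo output in the report.
info = {
    "Title": "The bytefield package",
    "Producer": "pdfTeX-1.40.10",
    "CreationDate": "D:20120902135050",
    "ModDate": "D:20120902135050",
    "Trapped": "/False",
}

stripped = strip_info_fields(info)
print(stripped)
```

Whether the emptied dictionary itself should also be dropped from the file is a separate design question that this sketch does not address.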

pdfsizeopt makes no attempt to generate a normalized or canonical output representation. In my opinion, this feature would be very complicated to implement, maybe close to impossible to implement in a usable way, so I have no such plans. From now on, this issue will track only the removal of the info fields.
Summary: Remove the info fields (was: Feature request: Remove as much metadata from optimized files as possible)
Labels: -Type-Defect -Priority-High Type-Enhancement Priority-Medium
Feb 27, 2013
#2 rbr...@gmail.com
That's OK with me.

If you want to split this issue in two for tracking purposes (removing the info fields now and, perhaps in the future, trying to produce something canonical), feel free. I can do it myself if you prefer, but I think I lack the privileges.

