I am trying to use pdfsizeopt to produce a "normalized" or "canonical" representation of PDF files, so that copies can be deduplicated during backups (or simply for the sake of privacy). It would be nice if pdfsizeopt could also remove metadata, such as the user-customizable fields that pdfinfo prints for a PDF file, for example:
Title: The bytefield package
Subject: Protocol diagrams for LaTeX
Keywords: bits, bytes, bit fields, communication, network protocol diagrams, LaTeX2e, memory maps
Author: Scott Pakin <scott+bf@pakin.org>
Creator: LaTeX with hyperref package
Producer: pdfTeX-1.40.10
CreationDate: Sun Sep 2 13:50:50 2012
ModDate: Sun Sep 2 13:50:50 2012
Tagged: no
Pages: 48
Encrypted: no
Page size: 612 x 792 pts (letter)
File size: 724524 bytes
Optimized: yes
PDF version: 1.4
The dates (CreationDate and ModDate) are the most important ones to remove. When would this "normalization" be desirable?
For instance, I frequently find PS files and download them (perhaps on multiple computers, when I have to stop reading and switch to another computer). I then convert them to PDF, since not all environments that I use have an adequate PS reader.
When I want to back things up, it would be nice to be able to run a program like hardlink, fdupes, rdfind, or duff to choose which copies to keep and which to discard.
It would also make it easier for deduplicating backup tools (like obnam or bup) to save space in such circumstances.
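To illustrate why the metadata matters here: as far as I understand, tools of this kind treat two files as duplicates only when their contents are byte-for-byte identical, so a single differing CreationDate is enough to keep two otherwise equal PDFs apart. The following is only a minimal sketch of that kind of content-based comparison, not how any of the tools above is actually implemented; the names sha256_of and find_duplicates are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group regular files under root by content hash.

    Hypothetical helper: returns only the groups that contain
    more than one file, i.e. the byte-identical copies.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

With such a comparison, two conversions of the same PS file that differ only in their embedded dates end up in different groups, which is why a normalized output from pdfsizeopt would help.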
Regards,
Rogério Brito.
Labels: -Type-Defect -Priority-High Type-Enhancement Priority-Medium