Command Line Tool and Library To Eliminate Duplicates and Facilitate Intelligent Merging of Data Structures.
LITEN IS IN VERY ACTIVE DEVELOPMENT
Warning: Be extremely cautious when using this tool. Although it deletes duplicate files, you could still potentially delete a file that your operating system needs to run. Please do not just run this on the root level of your hard drive with --delete. You may delete something you need. First run a report, then look at the report, and decide what to delete based on that report, perhaps on a directory by directory basis.
News
2008-12-26 Liten 0.1.5 is released
This release fixes bug that required three duplicates to make a match and introduces new feature that allows specifying several directories to search.
Roadmap
Liten will eventually become part of a larger suite of filesystem tools developed in pure Python, although the liten command line tool will always remain an easy to install script, or package.
Installation
If you are on Python2.5 you can just do: easy_install liten
You can also download the script, and just run it. If you do not develop with Python this may be the best choice until I build new packages for it.
http://liten.googlecode.com/files/liten-0.1.5-script.py
The previous version in an OS X Leopard Package is here:
Over 1,100 downloads in less than 30 days since release
http://liten.googlecode.com/files/liten-0.1.4.2-macosx-10.5-intel.dmg
Documentation
The most up to date documentation can be found here:
http://pypi.python.org/pypi/liten
The latest documentation as html can be checked out of subversion here:
http://code.google.com/p/liten/source/browse/trunk/docs/liten_documentation.html
Latest Version: 0.1.5
Liten: A deduplication command line tool and library
:Author: Noah Gift :Version: $Revision: 0.1.5 $ :Copyright: This document has been placed in the public domain.
Summary
A deduplication command line tool and library. A relatively efficient algorithm based on searching like sized files, and then performing a full md5 checksum, is used to determine duplicate files/file objects. Files can be deleted upon discovery, and pattern matching can be used to limit search results. Finally, configuration file use is supported, and there is a developing API that lends itself to customization via an ActionsMixin class.
.. contents::
Example CLI Usage:
Size:
Search by size using --size or -s option::
liten.py -s 1 /mnt/raid is equal to liten.py -s 1MB /mnt/raid
liten.py -s 1bytes /mnt/raid
liten.py -s 1KB /mnt/raid
liten.py -s 1MB /mnt/raid
liten.py -s 1GB /mnt/raid
liten.py c:\in d:\ is equal to liten.py -s 1MB c:\in d:\Report Location:
Generate custom report path using -r or --report=/tmp/report.txt::
./liten.py --report=/tmp/test.txt /Users/ngift/Documents
By default a report will be created in CWD, called LitenDuplicateReport.csv
Config File:
You can use a config file in the following format::
[Options]
path=/tmp
size=1MB
pattern=*.m4v
delete=TrueYou can call the config file anything and place it anywhere.
Here is an example usage::
./liten.py --config=myconfig.ini
Verbosity:
All stdout can be suppressed by using --quiet or -q.
Delete:
By using --delete the duplicate files will be automatically deleted. The API has support for an interactive mode and a dry-run mode, they have not been implemented in the CLI as of yet.
Example Library/API Usage:
>>> Liten = Liten(spath='testData')
>>> dupeFileOne = 'testData/testDocOne.txt'
>>> checksumOne = Liten.createChecksum(dupeFileOne)
>>> dupeFileTwo = 'testData/testDocTwo.txt'
>>> checksumTwo = Liten.createChecksum(dupeFileTwo)
>>> nonDupeFile = 'testData/testDocThree_wrong_match.txt'
>>> checksumThree = Liten.createChecksum(nonDupeFile)
>>> checksumOne == checksumTwo
True
>>> checksumOne == checksumThree
FalseThere is also the concept of an Action, which can be implemented later, that will allow customizable actions to occur upon an a condition that gets defined as you walk down a tree of files.
Tests:
- Run Doctests: ./liten -t or --test
- Run test_liten.py
- Run test_create_file.py then delete those test files using liten::
python2.5 liten.py --delete /tmp
Display Options:
Stdout:
stdout will show you duplicate file paths and sizes such as::
Printing dups over 1 MB using md5 checksum: [SIZE] [ORIG] [DUP]
7 MB Orig: /Users/ngift/Downloads/bzr-0-2.17.tar
Dupe: /Users/ngift/Downloads/bzr-0-4.17.tarReport:
A report named LitenDuplicateReport.csv will be created in your current working directory::
Duplicate Version, Path, Size, ModDate
Original, /Users/ngift/Downloads/bzr-0-2.17.tar, 7 MB, 07/10/2007 01:43:12 AM
Duplicate, /Users/ngift/Downloads/bzr-0-3.17.tar, 7 MB, 07/10/2007 01:43:27 AMDebug Mode Environmental Variables:
To enable print statement debugging set LITEN_DEBUG to 1
To enable pdb break point debugging set LITEN_DEBUG to 2
LITEN_DEBUG_MODE = int(os.environ.get('LITEN_DEBUG', 0))
Note: When DEBUG MODE is enabled, a message will appear to standard out
QUESTIONS: noah dot gift at gmail.com
Acknowledgements
Thanks to Rick Copeland, Titus Brown, Shannon Behrens, Anatoly Techtonik, and others for ideas, patches, and enhancements.