O(n^2 * m)
fileSimilars.pl walks along the given dirs to find all similar files.
fileSimilars homepage http://xpt.sourceforge.net/tools/file_similars/
fileSimilars releases https://sourceforge.net/project/showfiles.php?group_id=163815&package_id=207422
fileSimilars.pl will find all similar files, not only identical ones: different version (.txt, .html, or .pdf) and different compression methods (.zip, .gz, .tar.gz, .bip2), MP3 files with slightly different names or even different sample rates, etc. It uses advanced soundex vector algorithm to determine the file similarities.
The file similarity checking is extremely fast. It uses advanced soundex vector algorithm to determine the similarity between files. Generally it means that if there are n files, each having approximately m words in the file name, the degree of calculation is merely
O(n^2 * m)
regardless of file size. This is over hundreds times faster than any existing file fingerprinting technology.
Please visit the xpt tools support page.
The self-test output will help you understand what the module do and what would you expect from the outcome.
PERL_DL_NONLAZY=1 /usr/bin/perl "-Iblib/lib" "-Iblib/arch" test.pl 1..4 # Running under perl version 5.010000 for linux # Current time local: Wed Oct 29 11:35:06 2008 # Current time GMT: Wed Oct 29 15:35:06 2008 # Using Test.pm version 1.25 # Testing File::Searcher::Similars version 1.23
9 test/(eBook) GNU - Python Standard Library 2001.pdf 3 test/CardLayoutTest.java 5 test/GNU - 2001 - Python Standard Library.pdf 4 test/GNU - Python Standard Library (2001).rar 9 test/LayoutTest.java 3 test/PopupTest.java 2 test/Python Standard Library.zip 5 test/TestLayout.java ok 1
Note:
The fileSimilars.pl script will pick out similar files from them in next test.
Let's assume that the number represent the file size in KB.
## ========= 3 'CardLayoutTest.java' 'test/' 5 'TestLayout.java' 'test/'
## ========= 4 'GNU - Python Standard Library (2001).rar' 'test/' 5 'GNU - 2001 - Python Standard Library.pdf' 'test/' ok 2
Note:
There are 2 groups of similar files picked out by the script. The second group makes more sense.
The similar files are picked because their file names looks similar.
However, the file size plays an important role as well.
There are 2 files in the second similar files group.
The file Python Standard Library.zip is not considered to be similar to the group because its size is not similar to the group.
## ========= 3 'CardLayoutTest.java' 'test/' 5 'TestLayout.java' 'test/'
## ========= 4 'Python Standard Library.zip' 'test/' 4 'GNU - Python Standard Library (2001).rar' 'test/' 5 'GNU - 2001 - Python Standard Library.pdf' 'test/' ok 3
Note:
There are 3 files in the second similar files group.
The file Python Standard Library.zip is now in the 2nd similar files group because its size is now similar to the group.
## ========= 3 'CardLayoutTest.java' 'test/' 5 'TestLayout.java' 'test/'
## ========= 4 'GNU - Python Standard Library (2001).rar' 'test/' 5 'GNU - 2001 - Python Standard Library.pdf' 'test/' 6 'Python Standard Library.zip' 'test/' 9 '(eBook) GNU - Python Standard Library 2001.pdf' 'test/' ok 4
Note:
There are 4 files in the second similar files group.
The file Python Standard Library.zip is still in the group.
But this time, because it is also considered to be similar to the .pdf file (since their size are now similar, 6 vs 9), a 4th file the .pdf is now included in the 2nd group.
If the size of file Python Standard Library.zip is 12(KB), then the second similar files group will be split into two. Do you know why and which files each group will contain?
perl Makefile.PL make make test make install
There includes in the package a client program called fileSimilars.pl. Copy it to a directory which is in the PATH.
Issue fileSimilars.pl to get help on how to use it. And also,
perldoc File::Searcher::Similars