The script below looks for duplicate files in up to 10 directories at once, treating them as one combined set. Note that it matches files by name, not by content.
It should be dramatically faster than both fdupes (running fdupes -r) and fslint: on a relatively small, locally stored directory of 40GB, the script took 5 seconds to create a dupe list, while fdupes and fslint took much longer (~90 / 100 seconds). On a larger directory (700GB, ~350000 files) on a relatively slow external USB drive, it took 90 minutes. Judging by its progress indication, fdupes would have needed over 200-250 minutes, but I didn't wait for it to finish. (The progress indication is nice; the script below doesn't show progress.)
I should mention that fslint, for example, offers additional functionality that the script (as it is) does not, so the comparison is strictly about creating the dupe list.
Furthermore, the speed depends partly on how fast the disk reads: I tested several media (among others a network drive) and saw huge differences, especially on smaller directories, where creating the file list takes up a relatively large part of the total time.
The bottom line is that it won't be a quick job either way; you might ask yourself whether the directories are too large.
How it works
When the script finds duplicates, the duplicates are listed as follows:
Creating file list... /home/jacob/Bureaublad/test2
Creating file list... /home/jacob/Bureaublad/foto
Creating file list... /home/jacob/Bureaublad/Askubuntu
Checking for duplicates (10790 files)...
------------------------------------------------------------
> found duplicate: test1.txt 2
/home/jacob/Bureaublad/test2/test1.txt
/home/jacob/Bureaublad/test2/another directory/test1.txt
------------------------------------------------------------
and so on
The script
#!/usr/bin/env python3
import os
import sys

total_filelist = []
total_names = []

def find_files(directory):
    # collect the file names (l) and their full paths (l2)
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

i = 1
while i <= 10:
    try:
        dr = (sys.argv[i])
        print("Creating file list...", dr)
        # walk the directory once and reuse both lists
        names, paths = find_files(dr)
        total_filelist = total_filelist+paths
        total_names = total_names+names
        i = i+1
    except IndexError:
        break

print("Checking for duplicates ("+str(len(total_names)),"files)...")

for name in set(total_names):
    n = total_names.count(name)
    if n > 1:
        print("-"*60,"\n> found duplicate:",
              name, n, "\n")
        for item in total_filelist:
            if item.endswith("/"+name):
                print(item)

print("-"*60, "\n")
Copy it into an empty file, save it as find_dupes.py and run it with the command:
python3 <script> <directory1> <directory2> <directory3>
Up to a maximum of 10 directories.
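The counting loop in the script calls total_names.count(name) for every distinct name, which rescans the whole list each time. As a rough sketch of an alternative (not what the script above does), the paths could be grouped per file name in a dictionary in a single pass. The function names group_by_name and duplicates are made up for this illustration, and the match is still on file name only, not on content:

```python
#!/usr/bin/env python3
import os
import sys
from collections import defaultdict


def group_by_name(directories):
    """Map each file name to the list of full paths that carry that name."""
    paths_by_name = defaultdict(list)
    for directory in directories:
        for root, dirs, files in os.walk(directory):
            for name in files:
                paths_by_name[name].append(os.path.join(root, name))
    return paths_by_name


def duplicates(paths_by_name):
    """Keep only the names that occur more than once."""
    return {name: paths for name, paths in paths_by_name.items()
            if len(paths) > 1}


if __name__ == "__main__":
    # same limit as the original script: at most 10 directories
    found = duplicates(group_by_name(sys.argv[1:11]))
    for name, paths in sorted(found.items()):
        print("-" * 60)
        print("> found duplicate:", name, len(paths))
        for path in paths:
            print(path)
```

It would be run the same way, with the directories as arguments. I have not timed this variant against the script above.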
More options of the script
It is relatively simple to add functionality, for example moving duplicates to another directory or renaming them, so that you can decide either manually or automatically which file to keep.
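As an illustration of such an extension, here is a minimal sketch that keeps the first path of a group of duplicates and moves the others into a separate directory. The function name move_extra_copies and the numbered prefix are assumptions made for this example, not part of the script. Since the script matches on file name only, it is worth checking the moved files before deleting anything:

```python
import os
import shutil


def move_extra_copies(paths, target_dir):
    """Keep paths[0] in place and move every other path into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for index, path in enumerate(paths[1:], start=1):
        # Prefix a counter so copies with the same name do not overwrite
        # each other inside target_dir.
        new_name = "{}_{}".format(index, os.path.basename(path))
        destination = os.path.join(target_dir, new_name)
        shutil.move(path, destination)
        moved.append(destination)
    return moved
```

It could be called once per group of duplicate paths, for example with the paths the script prints under one "found duplicate" header. Files left in the target directory by an earlier run may be overwritten if the names collide.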
How to make the job doable
Your directories seem huge. To make the job reasonably doable, there is another, more sophisticated way to prevent the system from "choking": instead of processing all file types (extensions) at once, you could split the job into sections per file type. In a small test on a directory of 30,000 files, this reduced the time from approx. 20 seconds (all files) to 0.3 seconds for one extension.
To make the script look for duplicates of only one file type, replace the section of the script:
def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)
by:
def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"): # example .py extension
                l.append(file)
                l2.append(root+"/"+file)
    return (l, l2)
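To avoid editing the function for every file type, the extension could be passed in as an argument. The following is only a sketch of that idea: the optional extension parameter is an assumption added for this example, and the function otherwise returns the same (names, paths) tuple as the original.

```python
import os


def find_files(directory, extension=None):
    """Return (names, paths); keep only matching files if extension is set."""
    names = []
    paths = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            # extension=None means "all files", as in the original function
            if extension is None or file.endswith(extension):
                names.append(file)
                paths.append(os.path.join(root, file))
    return (names, paths)
```

In the main loop of the script, the call would then become something like find_files(dr, ".py"), running the script once per extension.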
Finding occurring file extensions
To list all file extensions occurring in a directory, you can use the script below:
#!/usr/bin/env python3
import sys
import os

l = []
for root, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        # skip hidden files, files without an extension and backup files
        if (
        f.startswith("."),
        f.count(".") == 0,
        f.endswith("~"),
        ) == (False, False, False):
            l.append(f[f.rfind("."):])
for item in set(l):
    print(item)
Copy it into an empty file, save it as find_extensions.py and run it with the command:
python3 <script> <directory>
Example output:
.txt
.mp3
.odt
.py
.desktop
.sh
.ods
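When deciding which extensions to process first, it may help to know how many files each one has. A possible variant of the extension script, sketched here with collections.Counter and os.path.splitext, prints the extensions with their counts, most frequent first. The function name count_extensions is made up for this example; like the script above, it skips hidden files, backup files ending in "~" and files without an extension.

```python
#!/usr/bin/env python3
import os
import sys
from collections import Counter


def count_extensions(directory):
    """Count how many files carry each extension below directory."""
    counts = Counter()
    for root, dirs, files in os.walk(directory):
        for f in files:
            # skip hidden files and backup files
            if f.startswith(".") or f.endswith("~"):
                continue
            ext = os.path.splitext(f)[1]
            if ext:  # empty string means the file has no extension
                counts[ext] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    for ext, n in count_extensions(sys.argv[1]).most_common():
        print(ext, n)
```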
fslint and dupedit in particular should be very efficient. – lemonsqueeze Nov 09 '14 at 13:58