Find & Delete duplicated files on multiple harddisks at once

Question

I have 4 hard disks and want to find out which files on these four hard disks (including subdirectories) are duplicates. Each file should be checked not only within its own hard disk but against all the others as well.

The hard disks are large (3TB), so the check has to be efficient (for example file name first, then checksum).

Have a look at How to find (and delete) duplicate files. fslint and dupedit in particular should be very efficient. — lemonsqueeze, Nov 09 '14 at 13:58
To the close voters: what distinguishes this question from the linked one: very large directories in this case and multiple disks / directories in one comparison at once — Jacob Vlijm, Nov 10 '14 at 16:47

score 3 · Answer 1 · edited Jan 03 '15 at 21:12

The script below looks for duplicate files in up to 10 directories at once, treating all directories as one combined set. It matches files by name only; it does not compare file contents.

It should be dramatically faster than both fdupes (running fdupes -r) and fslint: on a relatively small, locally stored directory of 40GB, the script took 5 seconds to create a dupe list, while fdupes and fslint took much longer (~90 / 100 seconds). On a larger directory (700GB, ~350000 files) on a relatively slow external USB drive, the script took 90 minutes. Judging by its progress indication, fdupes would have taken over 200-250 minutes, but I did not wait for it to finish. (The progress indication is nice; the script below does not show progress.)
I should mention that fslint, for example, offers additional functionality that the script (as it is) does not, so the comparison is strictly about creating the dupe list.

Furthermore, the speed depends partly on how fast the disk reads: I tested several media (among others a network drive) with huge differences, especially on smaller directories, where creating the file list takes up a relatively large part of the job's time.

The bottom line is that it won't be a quick job either way; you might ask yourself whether the directories are too large.

How it works

When the script finds duplicates, the duplicates are listed as follows:

Creating file list... /home/jacob/Bureaublad/test2
Creating file list... /home/jacob/Bureaublad/foto
Creating file list... /home/jacob/Bureaublad/Askubuntu
Checking for duplicates (10790 files)...
------------------------------------------------------------ 
>  found duplicate: test1.txt 2 

/home/jacob/Bureaublad/test2/test1.txt
/home/jacob/Bureaublad/test2/another directory/test1.txt
------------------------------------------------------------

and so on
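The listing above is a grouping of full paths by file name. A minimal sketch of that grouping step, assuming a dictionary is used to collect the paths in a single pass; `group_by_name` is a hypothetical helper for illustration, not part of the script in the next section:

```python
import os
from collections import defaultdict

def group_by_name(directories):
    """Map each file name to every path that carries that name."""
    groups = defaultdict(list)
    for directory in directories:
        for root, dirs, files in os.walk(directory):
            for name in files:
                groups[name].append(os.path.join(root, name))
    # keep only the names that occur more than once
    return {name: paths for name, paths in groups.items() if len(paths) > 1}
```

Each key of the returned dictionary corresponds to one "found duplicate" block in the listing, and its value to the paths printed below it.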

The script

#!/usr/bin/env python3

import os
import sys

total_filelist = []
total_names = []

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

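i = 1
while i <= 10:
    try:
        dr = sys.argv[i]
        print("Creating file list...", dr)
        # walk the directory once and reuse both returned lists
        names, paths = find_files(dr)
        total_names = total_names+names
        total_filelist = total_filelist+paths
        i = i+1
    except IndexError:
        # no more directories given on the command line
        break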

print("Checking for duplicates ("+str(len(total_names)),"files)...")

for name in set(total_names):
    n = total_names.count(name)
    if n > 1:
        print("-"*60,"\n>  found duplicate:",
              name, n, "\n")
        for item in total_filelist:
            if item.endswith("/"+name):
                print(item)

print("-"*60, "\n")

Copy it into an empty file, save it as find_dupes.py and run it by the command:

python3 <script> <directory1> <directory2> <directory3>

You can pass up to a maximum of 10 directories.
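The question asks for a file name check first and a checksum check second, while the script above stops at the name. A minimal sketch of such a second stage, assuming the paths that share a name have already been collected into a list. `file_digest` and `confirm_duplicates` are hypothetical helper names, and SHA-256 is just one possible choice of checksum:

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def confirm_duplicates(paths):
    """Return groups (lists) of paths whose size and checksum both match."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot be a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[file_digest(p)].append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes before hashing keeps the expensive read limited to files that could actually be identical, which matters on disks of this size.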

More options of the script

It is relatively simple to add functionality to the script, for example moving duplicates to another directory or renaming them, so that you can decide either manually or automatically which file to keep.
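As an illustration of the "move duplicates to another directory" idea, here is a minimal sketch. It assumes the paths of one duplicate group are already in a list and that the first one is the copy to keep; `move_duplicates`, `target_dir` and the numeric prefix used for the new file names are hypothetical choices, not part of the script:

```python
import os
import shutil

def move_duplicates(paths, target_dir):
    """Keep the first path in place and move the others to target_dir.

    Moved files get a numeric prefix so that files with the same
    name do not overwrite each other. Returns the new locations.
    """
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for n, path in enumerate(paths[1:], start=1):
        new_name = "{}_{}".format(n, os.path.basename(path))
        new_path = os.path.join(target_dir, new_name)
        # shutil.move also works when source and target are on different disks
        shutil.move(path, new_path)
        moved.append(new_path)
    return moved
```

Moving instead of deleting leaves room for a manual check before anything is removed for good.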

How to make the job doable

Your directories seem huge. To make the job reasonably possible, there is another, more sophisticated way to prevent the system from "choking": instead of doing the job on all file types (extensions) at once, you could cut the job into sections per file type. Because files with the same name also share the same extension, no name match is missed this way. A small test on a directory of 30,000 files reduced the time from approx. 20 seconds (all files) to 0.3 seconds for one extension.

To make the script look for duplicates of only one file type, replace the section of the script:

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            l.append(file)
            l2.append(root+"/"+file)
    return (l, l2)

with:

def find_files(directory):
    l = []; l2 = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"): # example .py extension
                l.append(file)
                l2.append(root+"/"+file)
    return (l, l2)
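To avoid editing the script for every file type, the extension could also be passed in as an argument. A minimal sketch, assuming an extra `extension` parameter (hypothetical, not in the original script) that defaults to matching all files:

```python
import os

def find_files(directory, extension=None):
    """Collect file names and full paths, optionally for one extension only."""
    names = []
    paths = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            # extension=None keeps the original behaviour: take every file
            if extension is None or file.endswith(extension):
                names.append(file)
                paths.append(root + "/" + file)
    return (names, paths)
```

Called as `find_files(dr, ".py")`, it returns the same two lists as the replacement above; called as `find_files(dr)`, it behaves like the original function.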

Finding occurring file extensions

To list all file extensions that occur in a directory, you can use the script below:

#!/usr/bin/env python3

import sys
import os

extensions = set()
for root, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        # skip hidden files, files without an extension and backup files
        if not (f.startswith(".") or f.count(".") == 0 or f.endswith("~")):
            extensions.add(f[f.rfind("."):])
for item in extensions:
    print(item)

Copy it into an empty file, save it as find_extensions.py and run it by the command:

python3 <script> <directory>

Example output:

.txt
.mp3
.odt
.py
.desktop
.sh
.ods
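When deciding how to split the job, it may help to know how many files each extension accounts for. A minimal sketch using the same filter rules as the script above; `count_extensions` is a hypothetical helper name:

```python
import os
from collections import Counter

def count_extensions(directory):
    """Count files per extension, skipping hidden, extensionless and backup files."""
    counts = Counter()
    for root, dirs, files in os.walk(directory):
        for f in files:
            if f.startswith(".") or "." not in f or f.endswith("~"):
                continue
            counts[f[f.rfind("."):]] += 1
    return counts
```

Iterating over `count_extensions(path).most_common()` lists the extensions from most to least frequent, so the largest sections of the job can be planned first.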

score 0 · Answer 2 · edited Apr 13 '17 at 12:23

If you'd like to use a very capable GUI, try FSlint from the Software Center.

(I see that @lemonsqueeze suggested this in the comments above).

Here is an answer that outlines FSlint usage: https://askubuntu.com/a/472244/100356

answered Nov 10 '14 at 18:11 by Enterprise

score 0 · Accepted Answer · answered Nov 12 '14 at 10:29

I used the FSlint project and find to get the thing done.

My process to get all of this sorted out on multiple disks, with the requirement to run everything via CLI & screen:

sudo apt-get install fslint
find path1 path2 path3 -type f -empty -delete && find path1 path2 path3 -type d -empty -delete (to get rid of all empty or not completely copied stuff)
/usr/share/fslint/fslint/findsn path1 path2 path3 (to delete everything which is stored in the same directory with the same size on different disks)
/usr/share/fslint/fslint/findup path1 path2 path3 (to delete all duplicate files)
find path1 path2 path3 -type d -empty -delete (to get rid of the directories which are empty after findup)

After that I was able to mount all disks as a combined drive with mhddfs again, without duplicates wasting disk space.
