
FSlint can find duplicate files. But suppose I have 10,000 songs or images and want to find ONLY those files that are identical but have different names? Right now, I get a list of hundreds of dupes (in different folders). I want the names to be consistent, so I want to see only the identical files with different names, not identical files with the same name.

Can FSlint with advanced parameters (or a different program) accomplish this?

  • You could write a bash script that uses md5sum to check if the files are identical, and just say something like: if the md5sums are equal but the filenames aren't, then export the filenames to a list of some kind (a rough sketch of this idea follows these comments). – Gerowen Feb 08 '16 at 07:27
  • Since you're talking about songs, Picard might be worth a look, it allows for automatic file renaming / moving based on tags. The only trouble I had with it was that I had to do two runs due to my different system for compilations vs single-artist-albums... – Tobias Kienzler Feb 08 '16 at 13:43
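
Building on Gerowen's comment, a rough bash sketch could look like this (GNU md5sum and awk are assumed; it prints every group of content-equal files that contains at least two distinct file names, with a blank line between groups):

find . -type f -exec md5sum {} + | sort | awk '
{
    hash = $1
    path = substr($0, 35)          # md5sum output: 32 hex chars, 2 separator chars, path
    n = split(path, parts, "/")    # parts[n] is the basename
    group[hash] = group[hash] path "\n"
    if (!((hash, parts[n]) in seen)) { seen[hash, parts[n]] = 1; names[hash]++ }
}
END {
    for (hash in group)
        if (names[hash] > 1)       # at least two different basenames share this content
            printf "%s\n", group[hash]
}'

Taking the path from a fixed byte offset keeps file names containing spaces intact; names containing newlines would still break this sketch.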

3 Answers


If you're okay with a list that contains duplicates with equal names as well as duplicates with different names, you can use this command line:

find . -type f -exec sha256sum {} \; | sort | uniq -w64 --all-repeated=separate | cut -b 67-

sha256sum prints a 64-character hex digest, two spaces, and the file path; sort brings equal digests together; uniq -w64 --all-repeated=separate keeps only lines whose first 64 bytes repeat and separates the groups with blank lines; and cut -b 67- then strips the digest and separator again, leaving only the paths.

For an example run, I use the following directory structure. Files whose names differ only in their trailing number have equal content:

.
├── dir1
│   ├── uname1
│   └── uname3
├── grps
├── lsbrelease
├── lsbrelease2
├── uname1
└── uname2

And now let's watch our command do some magic:

$ find . -type f -exec sha256sum {} \; | sort | uniq -w64 --all-repeated=separate | cut -b 67-
./lsbrelease
./lsbrelease2

./dir1/uname1
./dir1/uname3
./uname1
./uname2

Each group, separated from the next by a blank line, consists of files with equal content. Non-duplicate files are not listed.

Byte Commander

I have another, far more flexible and easy-to-use solution for you!

Copy the script below and paste it to /usr/local/bin/dupe-check (or any other location and file name; you need root permissions for this one).
Make it executable by running this command:

sudo chmod +x /usr/local/bin/dupe-check

As /usr/local/bin is in every user's PATH, everyone can now run it directly without specifying its location.

First, you should look at the help page of my script:

$ dupe-check --help
usage: dupe-check [-h] [-s COMMAND] [-r MAXDEPTH] [-e | -d] [-0]
                  [-v | -q | -Q] [-g] [-p] [-V]
                  [directory]

Check for duplicate files

positional arguments:
  directory             the directory to examine recursively (default '.')

optional arguments:
  -h, --help            show this help message and exit
  -s COMMAND, --hashsum COMMAND
                        external system command to generate hashes (default
                        'sha256sum')
  -r MAXDEPTH, --recursion-depth MAXDEPTH
                        the number of subdirectory levels to process: 0=only
                        current directory, 1=max. 1st subdirectory level, ...
                        (default: infinite)
  -e, --equal-names     only list duplicates with equal file names
  -d, --different-names
                        only list duplicates with different file names
  -0, --no-zero         do not list 0-byte files
  -v, --verbose         print hash and name of each examined file
  -q, --quiet           suppress status output on stderr
  -Q, --list-only       only list the duplicate files, no summary etc.
  -g, --no-groups       do not group equal duplicates
  -p, --path-only       only print the full path in the results list,
                        otherwise format output like this: `'FILENAME'
                        (FULL_PATH)´
  -V, --version         show program's version number and exit

As you can see, to get a list of all files in the current directory (and all subdirectories) that are duplicates with different file names, you need the -d flag and any valid combination of formatting options.

We still assume the same test environment. Files whose names differ only in their trailing number have equal content:

.
├── dir1
│   ├── uname1
│   └── uname3
├── grps
├── lsbrelease
├── lsbrelease2
├── uname1
└── uname2

First, let's simply run it without any options to list all duplicates:

$ dupe-check
Checked 7 files in total, 6 of them are duplicates by content.
Here's a list of all duplicate files:

'lsbrelease' (./lsbrelease)
'lsbrelease2' (./lsbrelease2)

'uname1' (./dir1/uname1)
'uname1' (./uname1)
'uname2' (./uname2)
'uname3' (./dir1/uname3)
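
And to get just what the question asks for, we add the -d flag. The following run is derived from the script's filter logic on this test tree (the order of the groups may vary):

$ dupe-check -d
Checked 7 files in total, 6 of them are duplicates by content.
Here's a list of all duplicate files with different names:

'lsbrelease' (./lsbrelease)
'lsbrelease2' (./lsbrelease2)

'uname2' (./uname2)
'uname3' (./dir1/uname3)

The two uname1 copies are omitted because they duplicate each other under the same name; all listed files carry a name that is unique within their group.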

And here is the script:

#! /usr/bin/env python3

VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO = 0, 4, 1
RELEASE_DATE, AUTHOR = "2016-02-11", "ByteCommander"

import sys
import os
import shutil
import subprocess
import argparse


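# Printer writes regular output to stdout while maintaining a single
# self-overwriting status line on stderr, so that progress messages do
# not end up in the result list.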
class Printer:
    def __init__(self, normal=sys.stdout, stat=sys.stderr):
        self.__normal = normal
        self.__stat = stat
        self.__prev_msg = ""
        self.__first = True
        self.__max_width = shutil.get_terminal_size().columns
    def __call__(self, msg, stat=False):
        if not stat:
            if not self.__first:
                print("\r" + " " * len(self.__prev_msg) + "\r", 
                      end="", file=self.__stat)
            print(msg, file=self.__normal)
            print(self.__prev_msg, end="", flush=True, file=self.__stat)
        else:
            if len(msg) > self.__max_width:
                msg = msg[:self.__max_width-3] + "..."
            if not msg:
                print("\r" + " " * len(self.__prev_msg) + "\r", 
                      end="", flush=True, file=self.__stat)
            elif self.__first:
                print(msg, end="", flush=True, file=self.__stat)
                self.__first = False
            else:
                print("\r" + " " * len(self.__prev_msg) + "\r", 
                      end="", file=self.__stat)
                print("\r" + msg, end="", flush=True, file=self.__stat)
            self.__prev_msg = msg


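# Generator similar to a depth-limited os.walk(): yields (directory,
# filenames) pairs; a negative maxdepth means unlimited recursion.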
def file_walker(top, maxdepth=-1):
    dirs, files = [], []
    for name in os.listdir(top):
        (dirs if os.path.isdir(os.path.join(top, name)) else files).append(name)
    yield top, files
    if maxdepth != 0:
        for name in dirs:
            for x in file_walker(os.path.join(top, name), maxdepth-1):
                yield x


printx = Printer()
argparser = argparse.ArgumentParser(description="Check for duplicate files")
argparser.add_argument("directory", action="store", default=".", nargs="?",
                       help="the directory to examine recursively "
                            "(default '%(default)s')")
argparser.add_argument("-s", "--hashsum", action="store", default="sha256sum",
                       metavar="COMMAND", help="external system command to "
                       "generate hashes (default '%(default)s')")
argparser.add_argument("-r", "--recursion-depth", action="store", type=int,
                       default=-1, metavar="MAXDEPTH", 
                       help="the number of subdirectory levels to process: "
                       "0=only current directory, 1=max. 1st subdirectory "
                       "level, ... (default: infinite)")
arggroupn = argparser.add_mutually_exclusive_group()
arggroupn.add_argument("-e", "--equal-names", action="store_const", 
                       const="e", dest="name_filter",
                       help="only list duplicates with equal file names")
arggroupn.add_argument("-d", "--different-names", action="store_const",
                       const="d", dest="name_filter",
                       help="only list duplicates with different file names")
argparser.add_argument("-0", "--no-zero", action="store_true", default=False,
                       help="do not list 0-byte files")
arggroupo = argparser.add_mutually_exclusive_group()
arggroupo.add_argument("-v", "--verbose", action="store_const", const=0, 
                       dest="output_level",
                       help="print hash and name of each examined file")
arggroupo.add_argument("-q", "--quiet", action="store_const", const=2, 
                       dest="output_level",
                       help="suppress status output on stderr")
arggroupo.add_argument("-Q", "--list-only", action="store_const", const=3, 
                       dest="output_level",
                       help="only list the duplicate files, no summary etc.")
argparser.add_argument("-g", "--no-groups", action="store_true", default=False,
                       help="do not group equal duplicates")
argparser.add_argument("-p", "--path-only", action="store_true", default=False,
                       help="only print the full path in the results list, "
                            "otherwise format output like this: "
                            "`'FILENAME' (FULL_PATH)´")
argparser.add_argument("-V", "--version", action="version", 
                       version="%(prog)s {}.{}.{} ({} by {})".format(
                       VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO, 
                       RELEASE_DATE, AUTHOR))
argparser.set_defaults(name_filter="a", output_level=1)
args = argparser.parse_args()

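# hashes maps each content digest to a list of (filename, full_path)
# tuples; every list with more than one entry is a group of files with
# equal content.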
hashes = {}
dupe_counter = 0
file_counter = 0
try:
    for root, filenames in file_walker(args.directory, args.recursion_depth):
        if args.output_level <= 1:
            printx("--> {} files ({} duplicates) processed - '{}'".format(
                    file_counter, dupe_counter, root), stat=True)
        for filename in filenames:
            path = os.path.join(root, filename)
            file_counter += 1
            filehash = subprocess.check_output(
                       [args.hashsum, path], universal_newlines=True).split()[0]
            if args.output_level == 0:
                printx(" ".join((filehash, path)))
            if filehash in hashes:
                dupe_counter += 1 if len(hashes[filehash]) > 1 else 2
                hashes[filehash].append((filename, path))
                if args.output_level <= 1:
                    printx("--> {} files ({} duplicates) processed - '{}'"
                           .format(file_counter, dupe_counter, root), stat=True)
            else:
                hashes[filehash] = [(filename, path)]
except FileNotFoundError:
    printx("ERROR: Directory not found!")
    exit(1)
except KeyboardInterrupt:
    printx("USER ABORTED SEARCH!")
    printx("Results so far:")

if args.output_level <= 1:
    printx("", stat=True)
    if args.output_level == 0:
        printx("")
if args.output_level <= 2:
    printx("Checked {} files in total, {} of them are duplicates by content."
            .format(file_counter, dupe_counter))

if dupe_counter == 0:
    exit(0)
elif args.output_level <= 2:
    printx("Here's a list of all duplicate{} files{}:".format(
            " non-zero-byte" if args.no_zero else "",
            " with different names" if args.name_filter == "d" else
            " with equal names" if args.name_filter == "e" else ""))

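# Walk through each group of content-equal files and apply the name
# filter: "e" keeps only names occurring more than once within a group,
# "d" keeps only names that are unique within their group.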
first_group = True
for filehash in hashes:
    if len(hashes[filehash]) > 1:
        # hashes[filehash][0][1] is the full path of the group's first file
        if args.no_zero and os.path.getsize(hashes[filehash][0][1]) == 0:
            continue
        first_group = False
        if args.name_filter == "a":
            filtered = hashes[filehash]
        else:
            filenames = {}
            for filename, path in hashes[filehash]:
                if filename in filenames:
                    filenames[filename].append(path)
                else:
                    filenames[filename] = [path]
            filtered = [(filename, path) 
                    for filename in filenames if (
                    args.name_filter == "e" and len(filenames[filename]) > 1 or
                    args.name_filter == "d" and len(filenames[filename]) == 1)
                    for path in filenames[filename]]
        if len(filtered) == 0:
            continue
        if (not args.no_groups) and (args.output_level <= 2 or not first_group):
            printx("")
        for filename, path in sorted(filtered):
            if args.path_only:
                printx(path)
            else:
                printx("'{}' ({})".format(filename, path))
Byte Commander
  • I set up a test directory with a subdirectory and copied a couple dozen files into each. I then duplicated some of the files but changed their names. Well, the script works great, except for the -d and -e options: -d tells me that no duplicates were found with different names, and -e shows me all the duplicates, including the copies with different names.

    Also, can I exclude subdirectories somehow?

    – ExNASATerry Feb 08 '16 at 18:01
  • Sorry, I can't reproduce what you say. Maybe you used an old version? I updated the script several times after I published the first version here. Please copy the current one again. If it still does not show the desired output, would you mind running tree on your test directory (you might need to sudo apt-get install tree first) to display the hierarchical file/directory structure, and then run the script again with -d and -e each time, but also add the -v flag for verbose output. Please copy the whole output of all three commands to www.pastebin.com and give me the link. – Byte Commander Feb 10 '16 at 12:18
  • I just updated the script again, adding the options -r MAXDEPTH/--recursion-depth MAXDEPTH to allow excluding subdirectories (or sub-subdirectories, or ...) and a -V/--version for better version control. The current version is 0.4. If you still observe any bugs, please give me the information requested in my last comment; otherwise I can't fix them, because I won't notice them. – Byte Commander Feb 10 '16 at 14:03
  • Seems to work great. Thx for the update! – ExNASATerry Feb 10 '16 at 18:30
  • I guess I spoke too soon. I tested it with a very simple, three file (jpgs) directory and it worked. So I ran it on one of my massive photo archives and discovered that if it finds duplicate files of different names in the same directory, it lists only one of them. If they are in different directories, it lists both. Any chance of fixing that? I can work around it, but it's a pain. Thanks. – ExNASATerry Feb 10 '16 at 21:41
  • When you use -d to only display differently named files, it will only show those duplicates that have a unique name. That means if you have file1.txt, dir1/file1.txt and file2.txt with the same content and use dupe-check -d, it will only display file2.txt, because the file1.txt files exist twice with the same content and name. This is the intended behaviour, at least I thought you wanted this. Did I misunderstand you? What output would you expect in that case? – Byte Commander Feb 11 '16 at 08:14
  • I see your point. In the above case I'd like to see all 3 names listed (1 per line) so I can see each instance and would know in this case that file2.txt is most likely mis-named. I do see that that would actually list two files that have the same name. But otherwise I have one name only--I know there must be a duplicate file with a different name but I have to go searching by file size to find it, which means re-sorting the directory by name, then size for each file. It works but was a pain. Still, it's the best solution I've found so far. But the above change would be very helpful. Thanks! – ExNASATerry Feb 11 '16 at 17:15
  • awesome - do you have this class in .net? or is there any way of getting this to work in .net? – BenKoshy Feb 12 '16 at 12:42
  • @BKSpurgeon No, it's written in Python3 and I don't know .NET, so I can't help you there. However, the code is of course open source and licensed as every other Stack Exchange content, so you may feel free to modify and adapt the script to your needs and likes, as long as you give me credits and use the same License. But why would you want that in .NET? What do you think would be its advantage over Python 3 in this case? – Byte Commander Feb 14 '16 at 17:38
  • @NobodyAtall But you can also use -e to only display duplicates with equal names, or use neither -d nor -e to display all duplicates ignoring their names. I still don't see what you want to achieve that isn't possible with any or none of the flags -d and -e. Please try again or describe in more detail. Thanks! – Byte Commander Feb 14 '16 at 17:44
  • @ByteCommander ahh why .net? Because I want it to work on windows, not ubuntu per se. chrs, this is good in terms of learning logic etc. chrs – BenKoshy Feb 14 '16 at 20:44
  • @BKSpurgeon Python is a cross-platform scripting language. When you install a Python 3 interpreter for Windows, it will work there as well. It's just that Ubuntu comes shipped with a preinstalled interpreter. – Byte Commander Feb 14 '16 at 21:27
  • Assume I have 10000 files, 5000 with duplicate names. Scattered amongst them are 500 other duplicates but with different names. I want to find those 500 and fix the names. I could list all duplicates and pore over them looking for different names. Duplicate File Finder Pro let me do that but it's god-awful tedious. Your program lets me find only the different names, but I still have to determine what the names should be, which means figuring out the duplicate files with different names in the same directories. Instead I want to list only those 500, but along with the others that are identical. – ExNASATerry Feb 14 '16 at 22:38
  • Assume the files file1A, file1B, dir1/file1A, file2A, file2B, file3A, dir1/file3A. The number in the file name indicates equal content and the capital letter is there to have different file names. You want an option to show all the file1*, because with this content there exist files with different names. You also want to see all file3*, because they have different names, but no file2* as they have equal names, right? Your wish is to show all files that have duplicate content, if any file with that same content has a different name. Give me an OK and I'll implement that. – Byte Commander Feb 15 '16 at 07:20
  • Brilliant piece of software! I learned a lot analyzing it. I have 2 questions: 1. Why do you use 'subprocess.check_output' and not just 'subprocess.run'? - 2. Any chance to add an option to exclude files and/or directories from the list to analyse? – Marc Vanhoomissen Feb 15 '16 at 13:17
  • @MarcVanhoomissen 1. because subprocess.run() was added in Python 3.5, but no Ubuntu release uses 3.5 as default for python3 yet, even if it is installed. Older releases than 15.10 might even only have 3.4. I use the older function subprocess.check_output() to stay compatible with as many Python 3 versions as possible. - 2. I could do that, sure. It will get on the list of feature requests! ;-) – Byte Commander Feb 15 '16 at 13:29
  • Correct: "show all files that have duplicate content, if any file with that same content has a different name" So file1A, file1B, and dir1/file1A would all be listed (same content, and file1B has a different name). No, file3A and dir1/file3A would not, as they are identical and have same names. file2A and file2B would be listed, as they are identical and have different names. So would, say, file4A and dir1/file4B (same content, different names, in different directories). I would also like to +1 @MarcVanhoomissen's request to exclude directories. :) Thanks. – ExNASATerry Feb 15 '16 at 19:30
  • Okay @NobodyAtall, I'll see when I have time. – Byte Commander Feb 16 '16 at 07:20
  • Nice script. I wrote a file renaming utility that does a simple check for identical files, so I was interested in your code. It does fail for me though when it encounters a broken symlink: "subprocess.CalledProcessError: Command '['sha256sum', './.mozilla/firefox/mwad0hks.default/lock']' returned non-zero exit status 1" Happens at line 112. – user2253249 Feb 16 '16 at 18:39
  • Another, probably related, problem. I ran the program again and got a bunch of single-line responses for cases where there were duplicate files of different names in three or more directories. In this case, I expected: '12f976.jpg' (./Beach/HF/12f976.jpg) 'Br01.jpg' (./Beach/Br01.jpg) 'Br01.jpg' (./Big/Br01.jpg), but got only '12f976.jpg' (./Beach/HF/12f976.jpg). It looks like if the files are duped in more than 2 directories, only one instance is reported. I'm guessing since a pair of instances had the same names, neither of those got reported as differing from the 3rd. – ExNASATerry Feb 23 '16 at 02:01
  • @user2253249 That's because the sha256sum tool seems not to be able to handle broken symlinks and exits with error code 1 in that case. It's not really a problem of my script, but I should probably catch that error and print a nicer message instead. I'll put that on the list for the next version as well (a possible way to catch it is sketched below these comments). – Byte Commander Feb 23 '16 at 13:38
  • I think I fixed it. I don't know Python (I do 6 other languages) but I changed line 160 to `args.name_filter == "d" and len(filenames[filename]) >= 1 and len(filenames[filename]) != len(hashes[filehash]))` – ExNASATerry Feb 23 '16 at 16:49
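
A minimal guard against the broken-symlink failure discussed in the comments above would be to wrap the script's hashing call (inside the for filename loop) in a try/except; this is only a sketch, the script as posted does not include it:

try:
    filehash = subprocess.check_output(
               [args.hashsum, path], universal_newlines=True).split()[0]
except subprocess.CalledProcessError:
    # the external hash command failed (e.g. on a broken symlink):
    # warn and skip this file instead of crashing
    printx("WARNING: could not hash '{}', skipping".format(path))
    continue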

Byte Commander's excellent script worked, but did not give me quite the behavior I needed (listing every duplicate group that includes at least one file with a different name). I made the following change and now it works perfectly for my purposes (and has saved me a TON of time)! I changed line 160 to:

args.name_filter == "d" and len(filenames[filename]) >= 1 and len(filenames[filename]) != len(hashes[filehash]))
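
For context, the whole filtered assignment in the script then reads (only the -d branch of the condition is changed):

filtered = [(filename, path) 
        for filename in filenames if (
        args.name_filter == "e" and len(filenames[filename]) > 1 or
        args.name_filter == "d" and len(filenames[filename]) >= 1 and
        len(filenames[filename]) != len(hashes[filehash]))
        for path in filenames[filename]]

With this condition, -d lists a whole group whenever it contains at least two distinct file names; groups whose members all share one name stay hidden.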