
I have about 167k files in a single folder (for now), renamed by the script from Renaming bunch of files, but only part of the title.
How can I find duplicate files by their names (matching only the digits in that specific spot) and delete the oldest file? For example:

    Aaaaaaa.bbb - 0000125 tag tag_tag 9tag
    Aaaaaaa.bbb - 0000002 tag 9tag
    Aaaaaaa.bbb - 0000002 tag tag_tag 9tag

None of the tools I tried provides such functionality, so only a script can help.

Ceslovas
  • So Aaaaaaa.bbb - 0000002 tag 9tag would be a duplicate of Aaaaaaa.bbb - 0000002 tag tag_tag 9tag because of 0000002, correct? – kos Nov 05 '15 at 22:35
  • Yes, that's correct. – Ceslovas Nov 05 '15 at 22:44
  • So, what defines the name? Will it always be in the format of foo.bar - XXX and the name is foo? Will there always be an extension? Will the space before the - always be the first space in the file name? – terdon Nov 10 '15 at 15:43

1 Answer


Below is a find, sort, and awk one-liner.

The basic idea is to list the files, sort them numerically (which works unless Aaaaaaa.bbb or the tags are themselves numbers), and then have awk store the third field of each filename in a prev variable and compare it with the third field of the current line. If they match, print a message.

find . -type f -print | sort --numeric | awk '{if(prev == $3) print $0" is duplicate of "prevEntry}{ prev=$3; prevEntry=$0}'

Below is a small demo:

    $ seq 6 10 | xargs printf "%07d\n" | xargs -I {} touch "Aaaaaaa.bbb - {} tag 9tag" 

    $ seq 00001 00020 | xargs printf "%07d\n" | xargs -I {} touch "Aaaaaaa.bbb - {} tag tag_tag 9tag"

    $ find . -type f -print | sort --numeric | awk '{if(prev == $3) print $0" is duplicate of "prevEntry}{ prev=$3; prevEntry=$0}'

    ./Aaaaaaa.bbb - 0000006 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000006 tag 9tag
    ./Aaaaaaa.bbb - 0000007 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000007 tag 9tag
    ./Aaaaaaa.bbb - 0000008 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000008 tag 9tag
    ./Aaaaaaa.bbb - 0000009 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000009 tag 9tag
    ./Aaaaaaa.bbb - 0000010 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000010 tag 9tag
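To go a step further and actually remove the older duplicates, here is a sketch that keeps the newest file per ID. It uses modification time (mtime) as a stand-in for download date, which is an assumption, since most filesystems don't expose a true creation time, and it builds its own demo files in a temporary folder:

```shell
# Sketch only: keep the newest file per 7-digit ID and list the older
# duplicates. mtime is assumed to approximate the download date.
dir=$(mktemp -d)
touch -d '2015-11-01' "$dir/Aaaaaaa.bbb - 0000002 tag 9tag"          # older duplicate
touch -d '2015-11-05' "$dir/Aaaaaaa.bbb - 0000002 tag tag_tag 9tag"  # newer, keep
touch -d '2015-11-03' "$dir/Aaaaaaa.bbb - 0000003 tag 9tag"          # unique, keep

dupes=$(find "$dir" -maxdepth 1 -type f -printf '%T@ %p\n' |
    sort -rn |                    # newest first
    awk '!seen[$4]++ {next} 1' |  # $4 is the ID; skip the first (newest) hit per ID
    cut -d' ' -f2-)               # drop the timestamp, keep the full path

printf 'would delete: %s\n' "$dupes"  # swap printf for rm only after checking the list
rm -rf "$dir"
```

Note that `$4` holds the ID only because the timestamp prepended by `find -printf` shifts every field right by one; if the folder path itself contained spaces, the field numbering would break.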
Sergiy Kolodyazhnyy
  • It did a great job finding and listing dups; I got the same result from your last one-liner (finding gaps, set from > 15 to < 1). The problem is not finding dups but eliminating them: deleting old files and keeping the new ones by creation (download) date. – Ceslovas Nov 06 '15 at 01:19
  • @Ceslovas Ah, I missed that part when I was reading the question... OK, I'll see what I can do there and edit my answer once I have a solution – Sergiy Kolodyazhnyy Nov 06 '15 at 01:21
  • @Ceslovas Can you please do me a favor: select two duplicate files that my command above has output, run stat --format=%Y FILENAME on each one separately, and let me know if the files give the same number – Sergiy Kolodyazhnyy Nov 06 '15 at 01:42
  • It's identical, and I double checked it – Ceslovas Nov 06 '15 at 01:50
  • @Ceslovas how about if we replace %Y with the %Z option? – Sergiy Kolodyazhnyy Nov 06 '15 at 02:03
  • The last three digits differ from each other. – Ceslovas Nov 06 '15 at 02:05
  • One last test, date +%s -r FILENAME – Sergiy Kolodyazhnyy Nov 06 '15 at 02:17
  • Also, if two files have the same modification date, which file would you prefer to keep? – Sergiy Kolodyazhnyy Nov 06 '15 at 02:22
  • Identical and the same result as with stat --format=%Y – Ceslovas Nov 06 '15 at 02:22
  • I prefer the one which was created/downloaded latest. The newer file. Don't know if that helps. – Ceslovas Nov 06 '15 at 02:33
  • The problem is that both commands that gave you identical results test the modification date... I cannot seem to find an easy way to test for creation date. I'll keep looking and will let you know if I find anything – Sergiy Kolodyazhnyy Nov 06 '15 at 02:35
  • Any progress on solving this problem? Can this script in here help you? – Ceslovas Nov 08 '15 at 20:27
  • @Ceslovas the problem is, terdon's answer that you've linked is for ext4 only. I don't think it will work for ntfs... As for ntfs, Ubuntu does come with some utilities, like ntfsinfo, but I haven't had luck with them so far; I kept getting errors. Would there be any alternative way to differentiate newer files besides the date? Like the specific contents of the files? – Sergiy Kolodyazhnyy Nov 09 '15 at 10:14
  • Well, then let's do this question, but with one exception: there are no sub-folders in /folder2, and /folder1 contains all the newest files by date. As for specifics, only the name differs (all the tag part). For now I can manage dups manually one by one in Windows, as there are not so many. Will this alternative be good enough? – Ceslovas Nov 09 '15 at 15:13
  • @Ceslovas OK, but I don't quite understand why you need to move files between those folders. If everything in folder2 is a duplicate of some folder1 contents, why not just delete folder2? Or are there files you want to keep? Also, since you already have 22 rep on the site, you should be able to join askubuntu chat; extended discussions in comments aren't quite welcome here – Sergiy Kolodyazhnyy Nov 09 '15 at 19:20