
I want to clone one very large directory (many terabytes of multi-gigabyte files) into another directory on a different drive. I have been using this command:

ionice -c 3 rsync -avz /path/to/sourcedir/ /path/to/destdir/

The process takes over a day and more often than not gets interrupted, hence the use of rsync, which can resume without restarting from zero. In theory the above command is idempotent, so whenever it fails I should be able to reissue the same command and let it work out where it was interrupted and continue from there.

Now, because the point of the operation is to retire and recycle the source drive, before doing that I wanted to be super-sure that all files had been properly copied. So I used the approach in this question to compare each file byte by byte. Sure enough, there were a number of files that had a different hash.
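(Roughly, the comparison was of this form, with placeholder paths: compute an md5sum for every file on the source, then verify that list against the destination.)

    cd /path/to/sourcedir && find . -type f -exec md5sum {} + > /tmp/source.md5
    cd /path/to/destdir && md5sum -c --quiet /tmp/source.md5   # prints only files whose hash differs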

So the theory question: does rsync, contrary to what I thought I understood, work merely on file names rather than on content, or at least length?

And the (more important) practice question: are there other options I could be using to force rsync to produce an exact clone of the source directory? In particular, when rsync is launched and the destination directory already contains a file with the same name as one in the source directory but with different content, I want the command to ensure it is replaced (or "completed") with the actual original file from the source directory.

dessert
st01
  • 1
rsync works fine. If in doubt simply remove all files in the target directory first. This will make the process take longer but might help you sleep better if you are worried. For an example of cloning see: https://askubuntu.com/questions/1028604/bash-script-to-clone-ubuntu-to-new-partition-for-testing-18-04-lts-upgrade – WinEunuuchs2Unix May 07 '18 at 11:49
  • 2
@WinEunuuchs2Unix, the point is that rsync does not "work fine" if it mysteriously managed to generate a file in the destination directory with the same name, timestamp and length as the one in the source directory but with different content (as indicated by the md5sum hash). I am truly at a loss to explain why, I am just observing the facts. And there were 96 such "wrong" files, out of 2530. Since the src directory is being written to almost every day, and since copying the lot takes more than a day, deleting everything and redoing it isn't the most satisfactory option. – st01 May 07 '18 at 17:38
Maybe you should check if the drive's S.M.A.R.T. information says 'OK'. You can also re-consider Clonezilla. It might be as fast as or even faster than checking all the md5sums on the source and target. Clonezilla will only copy used blocks (blocks that contain file data). There is also the possibility of files on the target that have been deleted on the source. -- Or have you already checked all the files with the md5sum method? – sudodus May 07 '18 at 17:54
  • 1
@sudodus, many thanks for the SMART tip. I checked and it says "Disk is OK, one attribute failed in the past (53 degrees C / 127 F)", but the useful side effect is that I found another drive in that computer that was NOT OK with a number of bad sectors. Clonezilla does not look like the appropriate solution: I'm copying from a 5 TB drive to an 8 TB one, so I want a file-system-level copy, not a partition-level one. I have indeed checked all the files with md5sum, as I thought I said in the original question. I am still at a loss as to why rsync generated so many "wrong" files. – st01 May 08 '18 at 04:42
rsync should not generate "wrong" files. Are you sure that rsync generated them? Maybe some other program or background service modified those files. Maybe the RAM is flaky (did you check it with memtest from the grub menu?). What files are different? Have you checked manually what is different between the original file and the copy? Size, some random bytes, or are they completely different, like different versions? A couple of years ago I had problems with any copy (cp and rsync and ...) via USB in 16.04 LTS, I think caused by some background service. But it worked via tar. – sudodus May 08 '18 at 05:45
  • @sudodus, I also agree that rsync should not generate wrong files. This is very puzzling. Yes, I am sure rsync generated them. I started with a blank drive as destination. No other programs or background services write to the dest drive (they do add files to the source drive, though). I have not tested the RAM since building the box (it's a PVR, so downtime is a pain). 96 files, mostly mpeg, have a different checksum while having the same length and timestamp. No, I have not watched them to spot the difference---life is too short. I'd be keen to find out the cause of the wrong files. Thanks. – st01 May 08 '18 at 16:38
  • How was the target drive connected (via SATA or USB or via a local network e.g. via SSH)? When I had problems with rsync, the target drive was connected via USB and the files were truncated (too small). So you have another kind of problem. – sudodus May 08 '18 at 17:15
  • @sudodus: both source and destination drive are inside the PC and connected with SATA. – st01 May 09 '18 at 06:52
There is some serious error when you get bad copies. I would suspect some low-level service that is doing the data transfer to the memory in the drive, or maybe the RAM, or maybe some physical/electronic hardware problem. – sudodus May 09 '18 at 07:28
  • @sudodus: sounds hard to diagnose without tearing the machine apart and a lot of trial and error, also because it's not as if every transfer is faulty! I'm slightly at a loss. – st01 May 09 '18 at 09:42
  • I must admit, that I am at a loss too. I don't know how to identify the cause of this problem. Maybe you can check the RAM overnight (with memtest from the grub menu in BIOS mode). Maybe you can run rsync in another computer and afterwards check the md5sums (to check if the problem can affect also other computers). – sudodus May 09 '18 at 10:08
I think if the RAM didn't work I'd have plenty of worse problems than just that. Could the corruption be the fault of ionice? (clutching at straws really...) – st01 May 10 '18 at 08:52
  • I don't know (what ionice could do in this case). – sudodus May 11 '18 at 16:43

1 Answer


Yes, you can make rsync look into the files to check that everything matches. From man rsync:

    -c, --checksum              skip based on checksum, not mod-time & size

Of course it will be slow, but rsync should find differences that the normal check would not find.
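For example, adding -c to your original command (a sketch using the placeholder paths from the question; a first pass with -n/--dry-run and -i/--itemize-changes only reports what would be re-transferred, so it can double as a verification step):

    # verification pass: list files whose checksums differ, without copying anything
    rsync -avcni /path/to/sourcedir/ /path/to/destdir/

    # re-copy based on checksums rather than mod-time & size
    ionice -c 3 rsync -avzc /path/to/sourcedir/ /path/to/destdir/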

But rsyncing is not cloning. If you want a cloned copy, use Clonezilla.

sudodus
  • Please tell me if you want to clone and want me to add details about it. – sudodus May 07 '18 at 07:39
  • 1
    If you would at least elaborate on the difference, I'd appreciate it, thank you. – st01 May 07 '18 at 08:04
  • checking your clonezilla link, I don't need sector-by-sector partition copying. I just need file-level copying, with all (regular) files byte-for-byte identical. So, using your terminology, I don't actually need or want cloning. – st01 May 07 '18 at 09:19
  • @st01, What difference do you want me to elaborate? The difference due to the option -c in rsync? I understand now that Clonezilla is not an option. – sudodus May 07 '18 at 10:42
  • 1
    @st01 You could also add one of the many --delete options to delete files that exist in the target but not in the source directory. – PerlDuck May 07 '18 at 11:13
  • 1
    The -c option is great to know in light of OP problems. – WinEunuuchs2Unix May 07 '18 at 17:45
@sudodus, I meant "would you please elaborate on why rsyncing is not cloning?". I understood what you meant (cloning = sector-level identical rather than file-system-level identical) when I followed your clonezilla link. I do want file-system-level identical, not sector-level identical. I called it "cloning", perhaps improperly, to distinguish it from the case where the destination has files with the same name but not byte-for-byte identical to those of the source. Your suggestion of -c is helpful, thank you. It didn't fully solve my problem, but that's because I have a worse one, hence the accept. – st01 May 08 '18 at 04:48
rsync should always copy files correctly so that each target file has the same md5sum as the corresponding source file. And normally it is enough to use the default criteria to identify which files to copy (the -c option is an extra option for special cases). But there might be extra files in the target, if you have deleted files from the source. You can use the --delete options to manage that problem. Test on a small directory tree until you know how to use all these options correctly. – sudodus May 08 '18 at 05:57
@st01, see also my comment at your question. Maybe your problem is not caused by rsync but by some other program or background service. – sudodus May 08 '18 at 06:00