On my Ubuntu server I had a disk failure. The new hard drive was quickly installed by the hoster's technicians.

Then I followed the instructions on their page to integrate the new disk into the raid. They start much like the answer given to this question (How can I quickly copy a GPT partition scheme from one hard drive to another?): copy the partition table from the old disk to the new one:

sgdisk -R /dev/sdY /dev/sdX
sgdisk -G /dev/sdY

I am pretty certain that I did not mix up the old and new drive. Then I tried to integrate the new disk into the raid with

mdadm /dev/md0 -a /dev/sda1

That command failed. I rebooted so that the new partition on sda would be picked up, but that is where it ended: the system will not boot anymore. I have access to a rescue system, but I haven't the slightest idea what I have to do to get my system up and running again.

It seems that my filesystem may be corrupted?

fsck /dev/sdb
fsck from util-linux 2.25.2
e2fsck 1.42.12 (29-Aug-2014)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sdb

Is there any way to ascertain whether the file system is indeed destroyed? I was running KVM with several VMs on there.

Yashima
  • 349

1 Answer

I have figured out what happened. It was one of two things: 1) I messed up the partition table, or 2) something else messed up the partition table. Either way, after the reboot there was NOTHING to be done.

Here's what I should have done when one disk of the raid1 died:

  • Check the raid status with cat /proc/mdstat and make sure the drive is really dead.
  • mdadm --examine on the member partitions gives more insight into the status of the raid.
  • While the system is still running, make backups of anything that is not properly backed up (e.g. before having the hard drive removed and forcing a reboot on an already stressed system).
  • Make a backup of the partition table before doing anything else (prefer using gdisk interactively, and list the partitions before backing them up to be sure the right device/hard drive is used).
  • Use mdadm to take the failed hard drive's partitions out of the raid cleanly: mark them with --fail, then --remove them.
  • Instead of copying the partition table from one drive to the other, load it from the backup.
  • A reboot may be needed to get the partitions properly set up (make sure everything is backed up first).
  • Use mdadm to add the new partitions back to the raid devices, e.g. mdadm --add /dev/md1 /dev/sda2
  • If for some reason you forgot to execute the --fail, you may be able to re-create the raid devices with this: mdadm --create /dev/md1 --assume-clean --level=1 --verbose --raid-devices=2 missing /dev/sdb2 (I am reasonably sure that this was not what destroyed the filesystems on the remaining hard drive).
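The checklist above can be sketched as a small dry-run script. This is only an illustration under assumptions that are not in the original post: the function names before_swap and after_swap, the run helper, and the backup path are made up, partition N of each disk is assumed to belong to the given md device, and partitions are assumed to be named <disk><N> (true for sdX, not for NVMe devices). By default nothing is executed; each command is only printed so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a RAID1 member replacement (hypothetical helper names).
# Commands are only printed unless DRY_RUN=0 is exported.

run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# Run while the failed disk is still attached.
# Args: surviving disk, failed disk, md device, partition number, backup file
before_swap() {
    local good="$1" failed="$2" md="$3" part="$4" backup="$5"

    # Inspect the array and the surviving member before changing anything.
    run cat /proc/mdstat
    run mdadm --detail "$md"
    run mdadm --examine "${good}${part}"

    # List, then save, the partition table of the surviving disk.
    run sgdisk --print "$good"
    run sgdisk --backup="$backup" "$good"

    # Take the failed member out of the array cleanly.
    run mdadm "$md" --fail "${failed}${part}"
    run mdadm "$md" --remove "${failed}${part}"
}

# Run after the replacement disk has been installed.
# Args: new disk, md device, partition number, backup file
after_swap() {
    local new="$1" md="$2" part="$3" backup="$4"

    # Load the saved table onto the new disk and give it fresh GUIDs.
    run sgdisk --load-backup="$backup" "$new"
    run sgdisk --randomize-guids "$new"

    # Ask the kernel to re-read the table (a reboot may still be needed).
    run partprobe "$new"

    # Add the new partition back to the array.
    run mdadm "$md" --add "${new}${part}"
}
```

For example, before_swap /dev/sdb /dev/sda /dev/md1 2 /root/gpt-sdb.bak followed by after_swap /dev/sda /dev/md1 2 /root/gpt-sdb.bak prints the commands for one array; the device names here are placeholders and must be checked against the real system before setting DRY_RUN=0.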

If I had followed the steps above, I would never have ended up in this situation. Once there, I did not find a way out. So what made me sure the data was gone?

  • From a rescue system, I was unable to mount any of the devices with mount -t ext4 /dev/md1 /mnt/mountpoint. I kept getting errors that the file system was not recognized and the magic numbers were not found.
  • Testdisk found the wrong number of partitions when trying to recreate the partition table.
  • dumpe2fs gave me locations for a bunch of magic numbers, but that did not help because none of them were valid. These positions are also fixed relative to the start of the partition, so if the partition table is wrong, they do not line up anymore.
  • fsck basically told me the same thing. One partition was sacrificed to an attempt to repair the file system, but every single inode threw an error.
  • I did a remote scan with R-Studio (commercial software from R-Tools; the scan and the recovery of files up to 256 KB are free). At first it looked like there were recoverable files, but when I used it to download a few jpgs and pngs, none contained valid image data.

I tried a variety of things to find out what went wrong with the file system, but everything came back to a messed up partition table and a failed recovery with testdisk.
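The point about magic numbers sitting at fixed positions can be checked directly. The sketch below assumes an ext2/3/4 file system that begins at the very start of the given device or image: its primary superblock starts 1024 bytes in, and the 16-bit magic 0xEF53 is stored little-endian 56 bytes into the superblock, i.e. at byte offset 1080. The function name has_ext_magic is made up for this example.

```shell
#!/usr/bin/env bash
# Return success if the ext2/3/4 magic (0xEF53, stored as bytes 53 ef)
# is found at byte offset 1080 of the given device or image file.
has_ext_magic() {
    local dev="$1" magic
    # Read exactly two bytes at offset 1080 and print them as hex.
    magic="$(dd if="$dev" bs=1 skip=1080 count=2 2>/dev/null \
        | od -An -tx1 | tr -d ' \n')"
    [ "$magic" = "53ef" ]
}

# Example (read-only; the device name is a placeholder):
#   has_ext_magic /dev/md1 && echo "ext magic present" || echo "no ext magic"
```

If the partition start is off by even one sector, this check fails although the data may still be on the disk, which is why a wrong partition table makes every tool report bad magic numbers. On a raw raid member the file system may also begin at a data offset rather than at the start of the partition, so testing the assembled /dev/mdN device is the more direct check.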

So, lessons learned: 1) keep a backup of the partition table somewhere safe (i.e. not on the server); 2) when stuff happens, do backups first; 3) have a backup strategy before stuff happens.
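Lesson 1 could look roughly like the following. This is a hedged sketch, not something from the original setup: the function name backup_partition_table, the file naming scheme, and the remote target backup@backup.example.org are all hypothetical. As a safety measure the commands are only printed unless DRY_RUN=0 is exported.

```shell
#!/usr/bin/env bash
# Sketch: save a GPT backup and copy it off the server.
# Commands are only printed unless DRY_RUN=0 is exported.

run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# Args: disk, remote scp target, optional date stamp (defaults to today)
backup_partition_table() {
    local disk="$1" remote="$2" stamp="${3:-$(date +%Y%m%d)}"
    local file
    file="/root/gpt-$(basename "$disk")-${stamp}.bak"

    # Write the partition table to a file, then copy it elsewhere.
    run sgdisk --backup="$file" "$disk"
    run scp "$file" "$remote"
}

# Example (hypothetical host):
#   backup_partition_table /dev/sdb backup@backup.example.org:tables/
```

Running this once per disk after every partitioning change keeps a copy that can later be restored with sgdisk --load-backup.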

Yashima
  • 349