My Ubuntu server recently crashed, and I have been struggling to get it back ever since.

The server became unresponsive: ping replies were sporadic and none of the services (SSH or Webmin) would connect. A clean shutdown wasn't possible either, so I eventually had to switch it off.

The hard reset seems to have destroyed the root file system: the boot folder and many others were empty, which meant I ended up in grub rescue mode after the reboot.

So I decided to reinstall the OS, which is where my journey begins.

First, what's working:

  • New installation works without a problem
  • All drives are found, including the raid
  • When I open a shell in USB drive rescue mode, I can mount all drives without a problem (raid and backup drive)

Setup is

  • SSD for the OS, home and swap (3 separate partitions)
  • Three 4 TB drives for a software raid 10 (one spare)
  • a separate 2 TB swappable drive for offline backups

And here's where I am stuck:

  • The server boots, shows the grub window and loads the kernel (lots of the usual status messages...)

  • The last successful messages seem to be

    Begin: Loading essential drivers ... done

    Begin: Running scripts/init-premount ... done

    [19.000] random: fast init done

    Begin waiting for root file system

From there on, the following is repeated many times:

Begin: Running scripts/local-block ... mdadm: no devices listed in config file were found
done

Until it finally gives up with

Gave up waiting for root device. Common problems...
...
ALERT! UUID=.... does not exist. Dropping to shell 

After which the system freezes.

The UUID listed is correct and represents the boot partition of my SSD.
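
One way to double-check this from the rescue shell is to compare the UUID the kernel is told to use as root with the UUIDs the running system actually exposes. The following is only a rough sketch in Python: the function names are made up for illustration, the sample UUID is hypothetical, and the command-line text is assumed to be copied from the `linux ... root=UUID=...` line of the installed system's GRUB configuration.

```python
import os
import re


def root_uuid_from_cmdline(cmdline):
    """Return the UUID from a root=UUID=... kernel argument, or None."""
    match = re.search(r"\broot=UUID=([0-9A-Fa-f-]+)", cmdline)
    return match.group(1) if match else None


def visible_uuids(by_uuid_dir="/dev/disk/by-uuid"):
    """Return the set of filesystem UUIDs listed under by_uuid_dir.

    An empty set is returned if the directory does not exist.
    """
    try:
        return set(os.listdir(by_uuid_dir))
    except FileNotFoundError:
        return set()


# Hypothetical kernel command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=UUID=0a1b2c3d-1111-2222-3333-444455556666 ro"
wanted = root_uuid_from_cmdline(sample)
print("root UUID requested:", wanted)
print("currently visible:", wanted in visible_uuids())
```

If the UUID is visible from the rescue system but not during the normal boot, that would suggest the initramfs cannot see the disk at that stage rather than the UUID itself being wrong.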

This somehow looks like none of the drives are accessible all of a sudden: neither the boot drive (UUID error) nor the raid array (mdadm error message).

I tried GRUB updates and reinstalls, which all give me strange errors. But whenever I boot from my USB stick, select the rescue option and open a shell on the SSD boot partition, I can happily see and mount all partitions.

Some of the GRUB messages I am getting:

update-grub

Found linux image....

Found initrd image....

WARNING: Failed to connect to lvmetad. Falling back to device scanning
grub-probe: error: cannot find a GRUB drive for /dev/sdb1. Check your device.map

I checked /etc/fstab, and all entries look good to me. The UUIDs match what I would expect, and the / and swap partitions are available.
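
For anyone who wants to repeat that fstab check less manually, here is a minimal Python sketch. It assumes the fstab text is read from the installed system's /etc/fstab and that the set of present UUIDs is collected separately (for example from the output of `blkid`); both helper names are hypothetical, and quoted `UUID="..."` entries are not handled.

```python
def fstab_uuids(fstab_text):
    """Return (mount_point, uuid) pairs for fstab entries that use UUID=."""
    pairs = []
    for line in fstab_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("UUID="):
            pairs.append((fields[1], fields[0][len("UUID="):]))
    return pairs


def missing_entries(fstab_text, present_uuids):
    """Return the fstab entries whose UUID is not in present_uuids."""
    return [(mount, uuid) for mount, uuid in fstab_uuids(fstab_text)
            if uuid not in present_uuids]


# Hypothetical fstab content and UUIDs, for illustration only.
example_fstab = """
# <file system> <mount point> <type> <options> <dump> <pass>
UUID=aaaa-1111 /     ext4 errors=remount-ro 0 1
UUID=bbbb-2222 /home ext4 defaults          0 2
UUID=cccc-3333 none  swap sw                0 0
"""
print(missing_entries(example_fstab, {"aaaa-1111", "bbbb-2222"}))
```

An empty result would mean every UUID referenced in fstab is present, which matches what I saw when checking by hand.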

Does anyone have an idea of where to look next? My next step would be to completely repartition the SSD, which I would like to avoid.

Thanks, Thomas

TZ04
  • This sounds like a disk failure to me. I'd try checking the SMART status on all your disks. – Rod Smith May 25 '17 at 13:34
  • Hm, just a bad boot sector on the SSD? I ran a file system check on all SSD partitions and it didn't show any errors. It also would not explain why seemingly all 3 (physical) drives produce problems. In a Windows world I'd say I need a driver for the HDD controller, but here? – TZ04 May 25 '17 at 14:34
  • Your question doesn't specify booting in BIOS mode, and the problem might not be in the boot sector even if you are booting in that way. A filesystem check might or might not reveal a problem. As I said, I recommend you check the SMART data. – Rod Smith May 25 '17 at 17:28
  • Thanks Rod, I did check the SMART data on all 5 drives on the server and they all show "Health". – TZ04 May 26 '17 at 13:07
  • I went ahead and started to delete the partitions on the SSD. First one to go was the root partition. Did a fresh server install and hey, back to normal!? The whole boot sequence looks different (normal) now, whereas before it came up with a much larger font and different messages. No idea what has been lurking on this partition, but glad that I got my server back. Thanks a lot for helping me along! – TZ04 May 26 '17 at 13:14
