System:

  • Ryzen 5, no integrated graphics
  • B450 Tomahawk Max motherboard
  • ADATA SX8100 512 GB SSD
  • Nvidia GeForce 1660 main GPU
  • Dual boot Ubuntu 20.04 and Windows 10
  • UEFI firmware
  • No overclocking or other tweaks

I have had occasional problems in the past where the system would enter a kernel panic on boot, complaining first that initramfs decoding failed followed by being unable to mount root. The recovery mode option for the same kernel version would also panic, although with far more messages displayed.

I would usually deal with this by selecting an older kernel, which would boot fine, and then run Boot-Repair. I would then be good for a random number of boots until it all started over again.

I was never able to find the cause and just dealt with the occasional inconvenience. However, now none of my kernels boot. All I can do is boot from a live USB. I updated the GRUB config from inside a chroot, so now my Windows menu option is also gone.

The recovery mode messages ask me to specify my root partition with the root= boot option, and then say "here are the available partitions", followed by a kernel panic message. No partitions are listed, so it seems the kernel is not detecting any at all. This seems confirmed by the message that it can't mount root fs on unknown-block(0,0), indicating it can't identify what block device to use.

I've checked that the root UUID shown in the boot messages matches the UUID of my actual boot partition. I have not made any partition table modifications recently.
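For anyone repeating that UUID check from a live session, here is a minimal sketch. It assumes the installed root filesystem is mounted at /mnt and lives on /dev/nvme0n1p5 (both are assumptions to adjust), and the helper name root_arg_from_cmdline is made up for illustration.

```shell
# Sketch only: print the value of the first root= argument found on a
# kernel command line passed as a single string.
root_arg_from_cmdline() {
    for arg in $1; do
        case "$arg" in
            root=*) printf '%s\n' "${arg#root=}"; return 0 ;;
        esac
    done
    return 1
}

# Example use from a live session (paths and device are assumptions):
#   line=$(grep -m1 '^[[:space:]]*linux.*root=' /mnt/boot/grub/grub.cfg)
#   root_arg_from_cmdline "$line"                 # what GRUB passes to the kernel
#   sudo blkid -s UUID -o value /dev/nvme0n1p5    # what the partition reports
```

If the two values agree, the root= argument itself is probably not the problem, which points back at the kernel not seeing the disk at all.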

I've tried removing and re-seating the SSD.

How do I troubleshoot this? How do I get the kernel to detect my SSD?

Normal boot error messages:

enter image description here

Recovery mode boot messages

enter image description here

Per comments, I found the SMART status of the SSD.

Results of sudo smartctl -a /dev/nvme0n1:

kubuntu@kubuntu:~$ sudo smartctl -a /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-42-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       ADATA SX8100NP
Serial Number:                      2J4620042048
Firmware Version:                   VB411D43
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Dec 3 00:57:15 2020 UTC
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         64 Pages
Warning Comp. Temp. Threshold:      118 Celsius
Critical Comp. Temp. Threshold:     150 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0        0       0
 1 +     4.00W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.0128W       -        -    3  3  3  3     4000    8000
 4 -   0.0080W       -        -    4  4  4  4     8000   30000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    11,670,921 [5.97 TB]
Data Units Written:                 7,734,266 [3.95 TB]
Host Read Commands:                 0
Host Write Commands:                0
Controller Busy Time:               0
Power Cycles:                       451
Power On Hours:                     3,897
Unsafe Shutdowns:                   319
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 8 entries)
Num              ErrCount   SQId   CmdId  Status   PELoc                   LBA        NSID    VS
  0   5490475142593210059  45613  0xa607  0x2024  0x35f8   5071432998301508804  1351944024  0xc5
  1  10305710759900180890   9804  0x6c00  0xa1d6  0xcb61     27252774468141376   380130432  0xc3
  2  11549487431983370324  16455  0x58c9  0xd23e  0x8147   6957061290267430970  3258320200  0xf5
  3   7018321358667646096  37313  0x0e1f  0x8670  0x6242    459368911713436868  1166902044  0x10
  4  11390238159922049047  38421  0xd002  0x1890  0x7d29  17438238972143084540   884054193  0x01
  5    156936697365045345  26140  0x5041  0xac10  0x4265  11916595043416224210   405107254  0xd4
  6   6790662844906997140  16528  0x5fc1  0x2ed1   0x77c   5801270468783952621    39946248  0xb0
  7   3460708732253516421   2072  0xa101  0x610c  0xc852  13889879911473169861  2147786536  0x68
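To make it easier to compare the health counters between reboots, a small filter like the one below could be used. It is only a sketch: the function name smart_summary and the choice of fields are mine, and it assumes smartctl prints one "Name: value" pair per line, as it does on a terminal.

```shell
# Sketch only: keep the handful of NVMe health fields that matter most here.
# Reads smartctl -a output on stdin.
smart_summary() {
    grep -E '^(Critical Warning|Available Spare|Percentage Used|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries):'
}

# Example use (device name taken from the output above):
#   sudo smartctl -a /dev/nvme0n1 | smart_summary
```

Running it before and after a reboot shows whether a counter such as Unsafe Shutdowns has moved.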

Re: motherboard BIOS version. I updated my BIOS both soon after building my PC to version 7C02v36 (dated 04/24/2020) and before asking this question to version 7C02v39 (dated 11/30/2020). It had no effect. The next most recent BIOS listed is dated 12/10/2020, but it is a beta version so I'm uncertain if trying it is a good idea.

FWIW, my Ubuntu boot partition is 30GB, and has 3GB free.

enter image description here

GRUB can see my boot partition, as shown in this screenshot. Immediately after snapping this picture, I typed "normal" to return to the GRUB menu. I booted with debug messages enabled, and it complained that it couldn't mount that exact partition.

enter image description here

karel
  • 114,770
rothloup
  • 213
  • Did you have any problems when using Windows? Have you updated the SSD firmware and the BIOS? Boot-Repair is not for kernel panics, just GRUB problems. I would check the SMART data from Ubuntu, a live USB, or a similar Windows tool. – crip659 Dec 02 '20 at 21:25
  • @crip659: Never had any problems booting to windows, although I can't try it anymore because updating grub from chroot caused it to not detect windows. I'm not aware of SSD firmware - how do I update that? The BIOS firmware was updated to the most recent when I built the computer, which was in April 2020. I'll research the smart data tool you mentioned - do you mean smartmontools? to get the SSD's SMART status? – rothloup Dec 03 '20 at 00:42
  • @crip659: I just managed to manually boot to windows, no problems. This appears to be related only to booting my ubuntu OS. – rothloup Dec 03 '20 at 02:18
  • SSD firmware updates come from the manufacturer of the SSD. I don't know enough, but I would imagine kernel panics would be caused by a wrong hardware driver or a corrupted OS. – crip659 Dec 03 '20 at 12:28
  • Check for more current motherboard firmware -- revisions are frequent for some hardware, and you may be several versions out of date if April was the last time you checked. – ubfan1 Dec 21 '20 at 16:25
  • @ubfan1 I added comments about the BIOS to my post now, thank you for pointing that out. I had already tried updating the BIOS before posting. – rothloup Dec 21 '20 at 16:54
  • https://askubuntu.com/questions/41930/kernel-panic-not-syncing-vfs-unable-to-mount-root-fs-on-unknown-block0-0 – WU-TANG Dec 21 '20 at 18:52
  • @WU-TANG: The question you referenced has multiple answers, and I've done all of those. Every time this happens, I use a live USB stick to boot, then chroot to my boot partition and run update-initramfs -c -k all followed by update-grub. My boot partition is not full (3GB free). This covers all of the answers, except for this one: https://askubuntu.com/a/1048477/751380. That one just demos that not detecting the disk at all causes the error message seen - so my question is, why is linux not seeing my disk, but windows can? – rothloup Dec 21 '20 at 19:15
  • i actually think i know what you mean, but just to walk through it.... your boot partition "3GB free"??? I am assuming you really mean your root partition and your boot DIRECTORY is on you root partition, which has 3 GB free??? OR do you actually have a boot partition????? because in that case you do not have 3GB free on that partition..... and there is no way there is 27GB of boot partition data on whichever partition you are referring.? It may be nothing, but you may want to clarify what you have going on there? – WU-TANG Dec 21 '20 at 19:59
  • @WU-TANG: yes, you are correct. I have a root partition, with a boot directory on it, which has 3GB free. I then have a separate "data" partition, from which I mount various directories that tend to need a lot of space, such as /tmp and /home and so on, onto mount points defined on the root partition. – rothloup Dec 21 '20 at 20:15
  • Should I be using something other than GRUB? – rothloup Dec 21 '20 at 20:18
  • i would try to find more information on the reason/effects of those "Unsafe Shutdowns: 319" in the smartctl report – WU-TANG Dec 21 '20 at 20:39
  • @WU-TANG: I just did a bunch of experiments and found that after numerous reboots between windows, standard linux, and live-USB linux, the ONLY condition that incremented that counter was when I got a kernel panic as described in my question and had to reboot. So the cause is almost entirely being forced to hard-boot after a kernel panic. Maybe a couple of power outages too, but those certainly weren't the first cause. – rothloup Dec 21 '20 at 22:17
  • are you able to mount root partition via live usb? – RedEyed Dec 24 '20 at 10:34
  • @RedEyed yes, i can. That's how i apply my temporary fix by chroot'ing to the root partition, rebuilding initrd, and update grub. It fixes things for a little while. – rothloup Dec 24 '20 at 16:29
  • so, this is a software problem. I wonder about It fixes things for a little while. How long does it work? Is it possible to narrow down the problem by the logs, i.e, what causes the next break ? – RedEyed Dec 24 '20 at 18:03
  • @RedEyed: I certainly hope and suspect that it's a software problem. However, troubleshooting software at boot is very tough. The best that I can say is that this tends to happen after I have a gaming session in windows. The best logs I can get are the screenshots of the boot messages - after a kernel panic, what else can I get? If you can suggest, I'll look. – rothloup Dec 24 '20 at 18:14
  • @RedEyed: I also don't know if there is a method to see boot messages that have scrolled off the screen after a kernel panic. Would love to see the entire boot log from the beginning. – rothloup Dec 24 '20 at 18:26
  • journalctl -b 0 -p 4 > logs.txt: -b 0 means logs of the current boot, -b -1 means the boot before the current one, and so on. -p 4 means: show only messages of priority warning or more severe. So, by changing the number after -b you can view logs of previous boots. – RedEyed Dec 25 '20 at 09:18
  • How is that drive partitioned, MBR or GPT? and depending on that answer, what did you use to create a 5th partition, Windows or Ubuntu? – WU-TANG Dec 25 '20 at 15:46
  • @RedEyed: journalctl doesn't seem to be able to capture kernel panics before my root partition is mounted, which makes sense. Where would it write to? for example, journalctl --list-boots doesn't show the many reboots I performed on 12-21 in this comment https://askubuntu.com/questions/1297016/boot-cant-detect-root-partition?noredirect=1#comment2213240_1297016 – rothloup Dec 25 '20 at 15:56
  • @WU-TANG: it's GPT. gdisk says "Found valid GPT with protective MBR; using GPT." – rothloup Dec 25 '20 at 15:59
  • (ok disregard, if it was MBR, there was a windows partitioning tool I read about that was said to cause problems booting other OSes)... Anyway, I'm very reluctant to suggest this because it's not a spinning drive... But if you could create another partition and dd a copy of that ubuntu partition onto it, I'd be interested to see if the same thing occurs. Of course youd have to create a new UUID for that new partition, edit its /etc/fstab, and then update-grub... Along those lines, I'd also check the power supply... but again those are hardware/corruption type troubleshooting – WU-TANG Dec 25 '20 at 16:20
  • @rothloup I see that you are running UEFI. Have you tried switching to legacy mode? BTW The tool on win is DISKPART and if there was trouble it was likely op error. Partition editors are powerful, and so dangerous, tools. – Nate T Dec 28 '20 at 04:04
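For reference, the temporary fix described in the comments (chroot from a live USB, rebuild the initramfs, regenerate the GRUB config) can be sketched roughly as follows. This is an illustration under assumptions, not a verified procedure: the function names, the /mnt mount point and the DRY_RUN switch are mine, and the root device /dev/nvme0n1p5 is the one named elsewhere in this thread. Only the update-initramfs -c -k all and update-grub commands come from the comments.

```shell
# Sketch only. With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# rebuild_boot_files ROOT_DEVICE [MOUNT_POINT]
# Must be run as root from the live session when not in dry-run mode.
rebuild_boot_files() {
    root_dev=$1
    target=${2:-/mnt}
    run mount "$root_dev" "$target" || return 1
    # Make the live system's device, proc and sys trees visible inside the chroot.
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$target/$fs" || return 1
    done
    run chroot "$target" update-initramfs -c -k all
    run chroot "$target" update-grub
    # Unmount in reverse order.
    for fs in sys proc dev; do
        run umount "$target/$fs"
    done
    run umount "$target"
}

# Preview the commands without touching anything:
#   DRY_RUN=1 rebuild_boot_files /dev/nvme0n1p5
```

The dry-run mode is there so the sequence can be read and checked before anything is mounted or rewritten.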

1 Answer

Boot from a USB stick with a live system on it; it doesn't matter much what Ubuntu version or flavour it is. Then try investigating and maybe even mounting that filesystem manually from a shell in that live system. The internal disk might be /dev/sdc now (an NVMe SSD such as yours shows up as /dev/nvme0n1 instead).

You can investigate partitions with any of

sudo parted --list

sudo fdisk -l

sudo blkid

Once you have identified which partition holds your root filesystem, you can try to run fsck -f on it (while it is unmounted) for a filesystem check.

Try to mount it; I'd start with a read-only mount:

sudo mount -r /dev/sdc42 /mnt

(/dev/sdc42 being the device you just identified as your root filesystem)

Then check /mnt/boot for available kernels and whether each one has a matching initrd* (the initial RAM disk containing kernel modules).
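The kernel/initrd check can be sketched as a small loop. It assumes Ubuntu's usual naming, vmlinuz-VERSION next to initrd.img-VERSION, and that the root filesystem is mounted at /mnt; the function name check_initrds is made up for this example.

```shell
# Sketch only: for every kernel image in the boot directory, report whether
# a non-empty initrd with the same version string exists next to it.
check_initrds() {
    boot_dir=${1:-/mnt/boot}
    for kernel in "$boot_dir"/vmlinuz-*; do
        [ -e "$kernel" ] || continue          # the glob matched nothing
        version=${kernel##*/vmlinuz-}
        initrd="$boot_dir/initrd.img-$version"
        if [ -s "$initrd" ]; then
            printf 'ok      %s\n' "$version"
        else
            printf 'MISSING %s\n' "$version"  # absent or zero bytes
        fi
    done
}

# Example use after the read-only mount:
#   check_initrds /mnt/boot
```

A file being present does not prove it is intact. Listing its contents with lsinitramfs (from initramfs-tools) is one further check; an error there would be consistent with the "initramfs decoding failed" message in the question.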


After reading some more comments above, it appears to me that the protective MBR might be a problem. Basically, it attempts to mirror the GPT partition table to make older tools believe they are seeing an old-style PC ("MS-DOS") partition table. That's alright as long as those older tools never attempt to modify any partitions; if they do, however, the protective MBR (which is what they will change) and the GPT (which contains the true information) may start to mismatch.

The result can be that some OS (not Linux, I am pretty sure) writes to disk blocks outside the current partitions and filesystems. If you experience problems after Windows gaming that would be a hint into that general direction.

HuHa
  • 3,385
  • HuHa: Can you suggest a way for me to confirm your suspicion? – rothloup Dec 28 '20 at 05:37
  • Compare the output of sudo fdisk -l on Linux with the output of fdisk on the C: drive on Windows: Partition sizes, start addresses (in blocks, if available). Do they match? – HuHa Dec 28 '20 at 09:50
  • HuHa: Here is the output of fdisk for all of the partitions I have: https://paste.ubuntu.com/p/tP8z3yFbgP/ I'm not sure if I'm interpreting it correctly, but I did note some nonsense sizes in the tables - i.e. line 27 shows a size of 811.6G, but my disk is only 512 GB. My linux root partition is /dev/nvme0n1p5, which is the one referenced in the OP. This command was run on my native linux installation, not a live USB stick - lmk if I need to do it differently. Were you implying that I should run "fdisk" on windows? or did you mean diskpart? – rothloup Dec 28 '20 at 15:28
  • I'm not sure what, precisely, to compare, but the sector sizes shown in the disk GPT on lines 12-17 seem to match the sector sizes shown in the first line for each individual partition. I'm not sure which of these represent the protective MBR. – rothloup Dec 28 '20 at 15:29
  • That command that you issued tries to interpret partition devices as disks, so a lot of nonsense is to be expected. ;-) – HuHa Dec 28 '20 at 17:42
  • You should have called it like this: sudo fdisk -l /dev/nvme0n1 because that's the disk. If you add p1, p2, ..., that means you are telling fdisk to look for a partition table inside the partition. Fortunately, the first output block looks alright (even though strictly speaking that's just a happy coincidence). Now please compare those numbers with those reported by the Windows fdisk; I hope it also shows block (sector) numbers. The starting block addresses should match between Linux and Windows. – HuHa Dec 28 '20 at 17:47
  • When I compare your pastebin'ed fdisk -l output with your Windows screenshot, the partition sizes look very consistent between your Linux and your Windows; but please still check if the actual block (sector) addresses match. – HuHa Dec 28 '20 at 17:58
  • I did call it as you suggest - my list include a "space" so that the first call is for the disk, then the rest are for the partitions. So it's there, in the very first section. Not a coincidence. :) But I'm confused what you mean when you say "Windows fdisk". Are you saying I should boot into windows and run "diskpart"? Or run fdisk on the windows partition (which is /dev/nvme0n1p2 and p3)? – rothloup Dec 28 '20 at 19:15
  • There used to be an fdisk command, but I am not sure if that still applies to Windows 10 (my most recent one is Windows 7). – HuHa Dec 28 '20 at 21:37
  • There is no fdisk command in my windows 10. diskpart seems to be the equivalent tool. I'm not too familiar with it, so I'm not sure if this is the right information. I don't see a direct way to read the protective MBR, and my searches for such a command didn't turn up anything useful. https://pastebin.ubuntu.com/p/DhKgw36GmM/ – rothloup Dec 28 '20 at 22:17
  • It doesn't want to tell us detailed partition addresses, but the sizes look consistent (nevermind the sort order; that might be a quirk of the tool). So we might be up a dead end here; your problem might be in a completely different area. – HuHa Dec 29 '20 at 13:06
  • One shot in the dark might be to try to reinstall Grub2: https://howtoubuntu.org/how-to-repair-restore-reinstall-grub-2-with-a-ubuntu-live-cd . If you can still manage to boot your Ubuntu, you can skip the USB stick / chroot etc. part and proceed right with the grub-install part. – HuHa Dec 29 '20 at 13:11
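On the open question of how to read the protective MBR directly: from Linux, one option is to copy the first 512-byte sector of the disk to a file and look at the four partition-table slots in it. The sketch below assumes the classic MBR layout (table at byte offset 446, 16 bytes per entry, type byte at offset 4 within each entry); the function name and the /tmp/mbr.bin path are made up for illustration.

```shell
# Sketch only: print the partition type byte of each of the four MBR slots
# in a file holding a copy of the first sector of the disk.
mbr_partition_types() {
    image=$1
    slot=0
    while [ "$slot" -lt 4 ]; do
        offset=$((446 + slot * 16 + 4))
        ptype=$(od -An -tx1 -j "$offset" -N 1 "$image" | tr -d ' \n')
        printf 'slot %d: 0x%s\n' "$slot" "$ptype"
        slot=$((slot + 1))
    done
}

# Example use (this only reads from the disk; device name as used in this thread):
#   sudo dd if=/dev/nvme0n1 of=/tmp/mbr.bin bs=512 count=1
#   mbr_partition_types /tmp/mbr.bin
```

On a disk with a plain protective MBR, one slot should report type 0xee and the others 0x00. Anything else would suggest that some tool has modified the MBR.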