
I replaced my 1 TB system SSD with a larger 2 TB one and cloned the contents using the Clonezilla utility. In the OS the filesystem still appeared as 1 TB, but I was able to extend it to 2 TB. All the data seemed fine.

After some time the filesystem became read-only. A reboot and fsck helped, but only for a few days; it has kept happening ever since. Could the new SSD be faulty? I tried upgrading Ubuntu from 18.04 to 20.04, but to no avail. The filesystem is ext4.
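For reference, this is roughly how I check and repair it each time (/dev/nvme0n1p2 is the root partition shown in the lsblk output further down):

# kernel messages from the ext4 driver around the time it remounts read-only
dmesg -T | grep -i 'EXT4-fs'

# error counters ext4 keeps in the superblock (filesystem state, error count, last error time)
sudo tune2fs -l /dev/nvme0n1p2 | grep -iE 'state|error'

# full offline check, run with the filesystem unmounted (e.g. from a live USB)
sudo e2fsck -fv /dev/nvme0n1p2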

EDIT: Smartctl report:

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 2TB
Serial Number:                      S6P1NS0T501522T
Firmware Version:                   4B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2 000 398 934 016 [2,00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2 000 398 934 016 [2,00 TB]
Namespace 1 Utilization:            726 404 530 176 [726 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5521405351
Local Time is:                      Fri Sep  2 14:42:51 2022 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.59W       -        -    0  0  0  0        0       0
 1 +     7.59W       -        -    1  1  1  1        0     200
 2 +     7.59W       -        -    2  2  2  2        0    1000
 3 -   0.0500W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1 046 751 [535 GB]
Data Units Written:                 11 045 053 [5,65 TB]
Host Read Commands:                 21 511 754
Host Write Commands:                122 266 698
Controller Busy Time:               632
Power Cycles:                       20
Power On Hours:                     258
Unsafe Shutdowns:                   14
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
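For reference, the report above is the output of something along these lines (smartmontools package; device path as on this machine):

sudo smartctl -a /dev/nvme0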

syslog tail:

Sep  1 07:35:45 containerd[1489]: time="2022-09-01T07:35:45.938574835+02:00" level=info msg="cleaning up dead shim"
Sep  1 07:35:45 dockerd[1609]: time="2022-09-01T07:35:45.938532925+02:00" level=info msg="ignoring event" container=c61acc2424d8b29cde658e65b0a12b21b7dd87a9c406532ee2fc75a68d565ab0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep  1 07:35:45 containerd[1489]: time="2022-09-01T07:35:45.954480844+02:00" level=warning msg="cleanup warnings time=\"2022-09-01T07:35:45+02:00\" level=info msg=\"starting signal loop\" namespace=moby pid=3411558 runtime=io.containerd.runc.v2\n"
Sep  1 07:35:45 kernel: [598279.313677] veth0e65189: renamed from eth0
Sep  1 07:35:46 kernel: [598279.339095] br-9972a812410e: port 5(veth77e9014) entered disabled state
Sep  1 07:35:46 systemd-udevd[3408622]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Sep  1 07:35:46 NetworkManager[1276]: <info>  [1662010546.0671] manager: (veth0e65189): new Veth device (/org/freedesktop/NetworkManager/Devices/82537)
Sep  1 07:35:46 avahi-daemon[517479]: Interface veth77e9014.IPv6 no longer relevant for mDNS.
Sep  1 07:35:46 avahi-daemon[517479]: Leaving mDNS multicast group on interface veth77e9014.IPv6 with address fe80::b82c:fff:fe77:d9b4.
Sep  1 07:35:46 kernel: [598279.397005] br-9972a812410e: port 5(veth77e9014) entered disabled state
Sep  1 07:35:46 kernel: [598279.400491] device veth77e9014 left promiscuous mode
Sep  1 07:35:46 kernel: [598279.400494] br-9972a812410e: port 5(veth77e9014) entered disabled state
Sep  1 07:35:46 avahi-daemon[517479]: Withdrawing address record for fe80::b82c:fff:fe77:d9b4 on veth77e9014.
Sep  1 07:35:46 systemd-udevd[3408622]: veth0e65189: Failed to get link config: No such device
Sep  1 07:35:46 gnome-shell[1796]: Removing a network device that was not added
Sep  1 07:35:46 NetworkManager[1276]: <info>  [1662010546.1106] device (veth77e9014): released from master device br-9972a812410e
Sep  1 07:35:46 gnome-shell[1796]: Removing a network device that was not added
Sep  1 07:35:46 systemd[67738]: run-docker-netns-9bfc9b4bb9d2.mount: Succeeded.
Sep  1 07:35:46 systemd[67738]: var-lib-docker-containers-c61acc2424d8b29cde658e65b0a12b21b7dd87a9c406532ee2fc75a68d565ab0-mounts-shm.mount: Succeeded.
Sep  1 07:35:46 systemd[67738]: var-lib-docker-overlay2-a8794c75b463c71c93b29c9643accfb4e12fffe422f7060279f3d976db072b25-merged.mount: Succeeded.
Sep  1 07:35:46 systemd[361680]: run-docker-netns-9bfc9b4bb9d2.mount: Succeeded.
Sep  1 07:35:46 systemd[361680]: var-lib-docker-containers-c61acc2424d8b29cde658e65b0a12b21b7dd87a9c406532ee2fc75a68d565ab0-mounts-shm.mount: Succeeded.
Sep  1 07:35:46 systemd[361680]: var-lib-docker-overlay2-a8794c75b463c71c93b29c9643accfb4e12fffe422f7060279f3d976db072b25-merged.mount: Succeeded.
Sep  1 07:35:46 systemd[1712]: run-docker-netns-9bfc9b4bb9d2.mount: Succeeded.
Sep  1 07:35:46 systemd[1712]: var-lib-docker-containers-c61acc2424d8b29cde658e65b0a12b21b7dd87a9c406532ee2fc75a68d565ab0-mounts-shm.mount: Succeeded.
Sep  1 07:35:46 systemd[1712]: var-lib-docker-overlay2-a8794c75b463c71c93b29c9643accfb4e12fffe422f7060279f3d976db072b25-merged.mount: Succeeded.
Sep  1 07:35:46 systemd[960393]: run-docker-netns-9bfc9b4bb9d2.mount: Succeeded.
Sep  1 07:35:46 systemd[1]: run-docker-netns-9bfc9b4bb9d2.mount: Succeeded.
Sep  1 07:35:46 systemd[960393]: var-lib-docker-containers-c61acc2424d8b29cde658e65b0a12b21b7dd87a9c406532ee2fc75a68d565ab0-mounts-shm.mount: Succeeded.
Sep  1 07:35:46 systemd[960393]: var-lib-docker-overlay2-a8794c75b463c71c93b29c9643accfb4e12fffe422f7060279f3d976db072b25-merged.mount: Succeeded.
Sep  1 07:35:46 systemd[1]: var-lib-docker-containers-c61acc2424d8b29cde658e65b0a12b21b7dd87a9c406532ee2fc75a68d565ab0-mounts-shm.mount: Succeeded.
Sep  1 07:35:46 systemd[1]: var-lib-docker-overlay2-a8794c75b463c71c93b29c9643accfb4e12fffe422f7060279f3d976db072b25-merged.mount: Succeeded.
Sep  1 07:35:46 avahi-daemon[517479]: Joining mDNS multicast group on interface veth1b1eea5.IPv6 with address fe80::d05a:41ff:fe71:6d0f.
Sep  1 07:35:46 avahi-daemon[517479]: New relevant interface veth1b1eea5.IPv6 for mDNS.
Sep  1 07:35:46 avahi-daemon[517479]: Registering new address record for fe80::d05a:41ff:fe71:6d0f on veth1b1eea5.*.
Sep  1 07:35:46 kernel: [598279.963079] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Sep  1 07:35:46 kernel: [598279.963138] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Sep  1 07:35:46 kernel: [598279.974695] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Sep  1 07:35:46 kernel: [598280.063961] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Sep  1 07:35:46 kernel: [598280.114831] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Sep  1 07:35:46 kernel: [598280.182623] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
Sep  1 07:35:46 kernel: [598280.241481] EXT4-fs error (device nvme0n1p2): __ext4_find_entry:1551: inode #40395175: comm updatedb.mlocat: checksumming directory block 0

The last line, EXT4-fs error (device nvme0n1p2): __ext4_find_entry:1551: inode #40395175: comm updatedb.mlocat: checksumming directory block 0, seems notable.
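To cut the Docker and network noise out of the log, filtering for the affected device helps; roughly:

# only the ext4 messages for the root device
grep 'EXT4-fs' /var/log/syslog

# the same from the kernel ring buffer, with a little context, right after it goes read-only
dmesg -T | grep -B2 -A2 'EXT4-fs error'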

lsblk -f output:

nvme0n1
├─nvme0n1p1 vfat                     F062-0FA7 505,8M     1% /boot/efi
└─nvme0n1p2 ext4                     83f2e983-979f-4303-a7f9-837b7a8d65f0    1,1T    35% /
nvme1n1     ext4     filesystem_home 7af8bdbe-5605-4957-af95-69a790a8f67a 1009,1G    40% /home
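For completeness, duplicate UUIDs (a possible side effect of cloning) can be ruled out with something like:

# every filesystem should report a unique UUID
sudo blkid

# which device the root filesystem is actually mounted from
findmnt /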
preator
  • Sounds much like an impending hard disk failure to me. – Artur Meinild Sep 02 '22 at 09:50
  • I fear that too, but I haven't really experienced this kind of behavior with an NVMe SSD. – preator Sep 02 '22 at 10:20
  • Use smartctl (or the nvme equivalent) to get the drive health statistics and see if it has a pending failure. Also check /var/log/syslog for relevant errors, although they might not get recorded if the drive is going read-only, in which case use dmesg as soon as possible after it goes read-only. Check that the drive data interface is well seated. Post errors you find in your question so we can answer effectively. – user10489 Sep 02 '22 at 11:10
  • Added the logs. dmesg is already full of errors, so I'll have to catch it next time; it's hard to catch, though. It seems the data is corrupted, but I don't understand why it keeps reappearing. There are 2 SSD disks in that PC and the other one is working well (it wasn't migrated with Clonezilla). – preator Sep 02 '22 at 12:56
  • Do you have both plugged in with the same UUIDs? That is not allowed. Post this: lsblk -f – oldfred Sep 02 '22 at 15:03
  • That dmesg is a bunch of noise. It might help to grep for the specific filesystem that is going bad; we need the messages around where it logs that the fs has gone read-only, explaining why it did... The last entry (__ext4_find_entry) is suspicious. – user10489 Sep 02 '22 at 22:26
  • https://askubuntu.com/questions/1100838/ext4-fs-error-device-sda2-ext4-find-entry1436-ubuntu-18-04 – user10489 Sep 02 '22 at 22:28
  • No, not the same UUIDs; I added the lsblk output. – preator Sep 05 '22 at 13:25
  • I don't think it is related to the posted issue, because the NVMe drives are both the same and the other one does not have this issue. Also, before the 2 TB drive there was a 1 TB NVMe Samsung 970 Evo (without Plus). – preator Sep 05 '22 at 13:30
  • Sounds like a driver failure to me. – Rishon_JR Sep 13 '22 at 07:48

1 Answer


If someone stumbles upon a similar issue, here is what I learned. After having problems with the drive, and to make replacing it easier, I moved the Docker volumes of one high-traffic application (Sentry) to the other drive. The application itself (docker-compose) was already on the other, working drive.

No problems since. I suspect this is not a coincidence, but that the application and its Docker volumes resting on different physical drives created the environment for the problem. No other steps were taken (no updates, etc.), because this was just preparation for replacing the drive completely on the next failure.
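The change itself was just pointing the write-heavy data at the other disk; a simplified sketch of the idea (volume name and paths are illustrative, not my actual compose file):

# stop the stack, copy the volume data onto the second drive,
# then bind-mount that path in docker-compose.yml instead of the old location
docker-compose down
sudo rsync -a /var/lib/docker/volumes/sentry-data/_data/ /home/docker-volumes/sentry-data/
# in docker-compose.yml:  - /home/docker-volumes/sentry-data:/data
docker-compose up -d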

EDIT: It happened again. There were two physical 2 TB SSD drives: one as the system drive and one as /home. First it happened on the system one, and I replaced it with an SSD of similar specs from WD. Then these lock-ups started occurring on the second Samsung drive mounted at /home, so I replaced both. The firmware was the newest version and everything was updated. It seems like either a bad batch or some common firmware/Ubuntu issue.
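For anyone checking the same thing, the firmware level can be read without vendor tooling; roughly (nvme-cli and fwupd packages):

# current firmware revision of each NVMe drive
sudo nvme list

# whether a newer firmware is published through LVFS (not every vendor ships there)
sudo fwupdmgr refresh
sudo fwupdmgr get-updates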

preator
  • If the drive is passing tests, then possibly your configuration and data access patterns are triggering a bug in the NVMe firmware. Options are to either do what you did, check if the vendor has a firmware update, or replace it with an NVMe drive from a different manufacturer. – user10489 Oct 06 '22 at 23:48