2

I've been experiencing consistent issues where my system gets into a read-only state. I can do the fixes described in other questions (like Read-only filesystem error) and this fixes it for ~a day. But each day at startup the issue comes back.

My question here isn't so much how to fix it, but is there any way for me to diagnose what particular piece of software may be causing this?

EDITS:

Adding syslog output below, from searching a bit on these error messages, my understanding is that nothing in particular is doing this; it's just repeatedly failing to find good blocks (i.e. dying drive)??

Jun 26 10:03:53 mal NetworkManager[1298]: <info>  [1593180233.9376] manager: NetworkManager state is now CONNECTED_GLOBAL
Jun 26 10:04:03 mal systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Jun 26 10:04:24 mal anacron[1290]: Job `cron.daily' terminated
Jun 26 10:04:24 mal anacron[1290]: Normal exit (1 job run)
Jun 26 10:04:24 mal systemd[1]: anacron.service: Succeeded.
Jun 26 10:05:01 mal CRON[7965]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 26 10:07:00 mal systemd-resolved[1282]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jun 26 10:07:07 mal kernel: [  529.845356] ata1.00: READ LOG DMA EXT failed, trying PIO
Jun 26 10:07:07 mal kernel: [  529.846653] ata1.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x0
Jun 26 10:07:07 mal kernel: [  529.846656] ata1.00: irq_stat 0x40000008
Jun 26 10:07:07 mal kernel: [  529.846659] ata1.00: failed command: READ FPDMA QUEUED
Jun 26 10:07:07 mal kernel: [  529.846663] ata1.00: cmd 60/08:c0:e8:47:91/00:00:14:00:00/40 tag 24 ncq dma 4096 in
Jun 26 10:07:07 mal kernel: [  529.846663]          res 41/40:00:ec:47:91/00:00:14:00:00/00 Emask 0x409 (media error) <F>
Jun 26 10:07:07 mal kernel: [  529.846665] ata1.00: status: { DRDY ERR }
Jun 26 10:07:07 mal kernel: [  529.846666] ata1.00: error: { UNC }
Jun 26 10:07:07 mal kernel: [  529.852714] ata1.00: configured for UDMA/133
Jun 26 10:07:07 mal kernel: [  529.852746] sd 0:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 26 10:07:07 mal kernel: [  529.852749] sd 0:0:0:0: [sda] tag#24 Sense Key : Medium Error [current] 
Jun 26 10:07:07 mal kernel: [  529.852752] sd 0:0:0:0: [sda] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 26 10:07:07 mal kernel: [  529.852755] sd 0:0:0:0: [sda] tag#24 CDB: Read(10) 28 00 14 91 47 e8 00 00 08 00
Jun 26 10:07:07 mal kernel: [  529.852758] blk_update_request: I/O error, dev sda, sector 345065452 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Jun 26 10:07:07 mal kernel: [  529.852779] ata1: EH complete
Jun 26 10:07:07 mal kernel: [  529.852805] EXT4-fs warning (device sda2): htree_dirblock_to_tree:997: inode #10763504: lblock 0: comm duplicity: error -5 reading directory block
Jun 26 10:07:07 mal kernel: [  529.950532] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
Jun 26 10:07:07 mal kernel: [  529.950534] ata1.00: irq_stat 0x40000008
Jun 26 10:07:07 mal kernel: [  529.950536] ata1.00: failed command: READ FPDMA QUEUED
Jun 26 10:07:07 mal kernel: [  529.950538] ata1.00: cmd 60/08:00:e8:47:91/00:00:14:00:00/40 tag 0 ncq dma 4096 in
Jun 26 10:07:07 mal kernel: [  529.950538]          res 41/40:00:ec:47:91/00:00:14:00:00/00 Emask 0x409 (media error) <F>
Jun 26 10:07:07 mal kernel: [  529.950539] ata1.00: status: { DRDY ERR }
Jun 26 10:07:07 mal kernel: [  529.950540] ata1.00: error: { UNC }
Jun 26 10:07:07 mal kernel: [  529.956788] ata1.00: configured for UDMA/133
Jun 26 10:07:07 mal kernel: [  529.956808] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 26 10:07:07 mal kernel: [  529.956810] sd 0:0:0:0: [sda] tag#0 Sense Key : Medium Error [current] 
Jun 26 10:07:07 mal kernel: [  529.956811] sd 0:0:0:0: [sda] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
Jun 26 10:07:07 mal kernel: [  529.956813] sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 14 91 47 e8 00 00 08 00
Jun 26 10:07:07 mal kernel: [  529.956814] blk_update_request: I/O error, dev sda, sector 345065452 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Jun 26 10:07:07 mal kernel: [  529.956838] ata1: EH complete
Jun 26 10:07:07 mal kernel: [  529.956840] EXT4-fs error (device sda2): __ext4_find_entry:1531: inode #10763504: comm deja-dup: reading directory lblock 0

Update 29June

Side note; this error has NOT happened for 3+ days, ever since I unplugged my kindle from the desktop. I have no idea if this has anything to do with it; but just mentioning it here for those more in the know.

Booting from live CD: output of checking the drive

enter image description here

Additionally, smartctl outputs. I don't seem to be able to get a -t long test to actually finish, but short ones seem fine.

ubuntu@ubuntu:~$ sudo smartctl -a /dev/sda2
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-26-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION === Model Family: Marvell based SanDisk SSDs Device Model: SanDisk SSD PLUS 1000GB Serial Number: 190532802370 LU WWN Device Id: 5 001b44 8b92df610 Firmware Version: UH5100RL User Capacity: 1,000,207,286,272 bytes [1.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Form Factor: 2.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ACS-2 T13/2015-D revision 3 SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Mon Jun 29 23:21:35 2020 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled

=== START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED

General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 120) seconds. Offline data collection capabilities: (0x15) SMART execute Offline immediate. No Auto Offline data collection support. Abort Offline collection upon new command. No Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. No Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 182) minutes.

SMART Attributes Data Structure revision number: 1 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3921 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 353 165 Total_Write/Erase_Count 0x0032 100 100 000 Old_age Always - 1045 166 Min_W/E_Cycle 0x0032 100 100 --- Old_age Always - 4 167 Min_Bad_Block/Die 0x0032 100 100 --- Old_age Always - 0 168 Maximum_Erase_Cycle 0x0032 100 100 --- Old_age Always - 14 169 Total_Bad_Block 0x0032 100 100 --- Old_age Always - 1688 170 Unknown_Attribute 0x0032 100 100 --- Old_age Always - 0 171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0 172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0 173 Avg_Write/Erase_Count 0x0032 100 100 000 Old_age Always - 4 174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 17 184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 34 188 Command_Timeout 0x0032 100 100 --- Old_age Always - 0 194 Temperature_Celsius 0x0022 065 055 000 Old_age Always - 35 (Min/Max 17/55) 199 SATA_CRC_Error 0x0032 100 100 --- Old_age Always - 0 230 Perc_Write/Erase_Count 0x0032 100 100 000 Old_age Always - 610 80 610 232 Perc_Avail_Resrvd_Space 0x0033 100 100 005 Pre-fail Always - 100 233 Total_NAND_Writes_GiB 0x0032 100 100 --- Old_age Always - 4201 234 Perc_Write/Erase_Ct_BC 0x0032 100 100 000 Old_age Always - 16347 241 Total_Writes_GiB 0x0030 100 100 000 Old_age Offline - 6893 242 Total_Reads_GiB 0x0030 100 100 000 Old_age Offline - 3558 244 Thermal_Throttle 0x0032 000 100 --- Old_age Always - 0

SMART Error Log Version: 1 No Errors Logged

SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Short offline Completed without error 00% 3921 -

2 Extended offline Fatal or unknown error 90% 3920 0

3 Short offline Completed without error 00% 3919 -

4 Extended offline Fatal or unknown error 90% 3889 0

5 Extended offline Self-test routine in progress 90% 3888 -

6 Extended offline Self-test routine in progress 90% 3888 -

7 Extended offline Self-test routine in progress 90% 3888 -

8 Extended offline Self-test routine in progress 90% 3888 -

9 Extended offline Self-test routine in progress 90% 3888 -

#10 Extended offline Self-test routine in progress 90% 3888 - #11 Extended offline Self-test routine in progress 90% 3888 - #12 Extended offline Self-test routine in progress 90% 3888 - #13 Extended offline Self-test routine in progress 90% 3888 - #14 Extended offline Self-test routine in progress 90% 3888 - #15 Extended offline Self-test routine in progress 90% 3888 - #16 Extended offline Self-test routine in progress 90% 3888 - #17 Extended offline Self-test routine in progress 90% 3888 - #18 Extended offline Self-test routine in progress 90% 3888 - #19 Extended offline Self-test routine in progress 90% 3888 - #20 Extended offline Self-test routine in progress 90% 3888 - #21 Extended offline Self-test routine in progress 90% 3888 -

Selective Self-tests/Logging not supported

justincely
  • 123
  • 1
  • 6
  • Note the exact time that it occurs. Look in /var/log/syslog for error messages in the seconds leading up to that time. Edit your question to show us that part of your log (copy/paste into your Question). – user535733 Jun 25 '20 at 17:46
  • Edit your question, and add screenshot(s) of the Disks app, SMART Data & Tests, SMART Data (scrollable) window. Have you run fsck? Start comments to me with @heynnema or I'll miss them. – heynnema Jun 25 '20 at 17:49
  • If the system keeps reverting to read-only, it's a failsafe when the hard drive is dying. I'd make backups more often than normal, and prepare for the possibility of buying a new hard drive. – Nmath Jun 25 '20 at 18:16
  • thanks @user535733; syslog output edited in there. @Nmath; yes, backups going multiple times each day incase it just goes at any second. – justincely Jun 26 '20 at 14:29
  • Time to follow @heynnema's advice: Look up how to run a SMART test on your hard drive to confirm faulty/dying storage hardware. Start price-shopping for a replacement disk. – user535733 Jun 26 '20 at 14:34

1 Answers1

3

fsck

  • boot to a Ubuntu Live DVD/USB in “Try Ubuntu” mode
  • open a terminal window by pressing Ctrl+Alt+T
  • type sudo fdisk -l
  • identify the /dev/sdXX device name for your "Linux Filesystem"
  • type sudo fsck -f /dev/sda2, replacing sdXX with the number you found earlier
  • repeat the fsck command if there were errors
  • type reboot

NCQ

You have NCQ errors. And possible errors on sector 345065452.

Jun 26 10:07:07 mal kernel: [  529.950536] ata1.00: failed command: READ FPDMA QUEUED
Jun 26 10:07:07 mal kernel: [  529.950538] ata1.00: cmd 60/08:00:e8:47:91/00:00:14:00:00/40 tag 0 ncq dma 4096 in
Jun 26 10:07:07 mal kernel: [  529.950538]          res 41/40:00:ec:47:91/00:00:14:00:00/00 Emask 0x409 (media error) <F>
Jun 26 10:07:07 mal kernel: [  529.950539] ata1.00: status: { DRDY ERR }
Jun 26 10:07:07 mal kernel: [  529.950540] ata1.00: error: { UNC }

Jun 26 10:07:07 mal kernel: [  529.852758] blk_update_request: I/O error, dev sda, sector 345065452 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0

Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed.

Do this to fix the NCQ errors...

Edit sudo -H gedit /etc/default/grub and change the following line to include this extra parameter. Then do sudo update-grub to write the changes to disk. Reboot. Monitor hangs, and watch /var/log/syslog or dmesg for continued error messages.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"

Bad blocking

If errors continue after the NCQ patch, then...

Note: do NOT abort a bad block scan!

Note: do NOT bad block a SSD

Note: backup your important files FIRST!

Note: this will take many hours

Note: you may have a pending HDD failure

Boot to a Ubuntu Live DVD/USB in “Try Ubuntu” mode.

In terminal...

sudo fdisk -l # identify all "Linux Filesystem" partitions

sudo e2fsck -fcky /dev/sdXX # read-only test

or

sudo e2fsck -fccky /dev/sda2 # non-destructive read/write test (recommended)

The -k is important, because it saves the previous bad block table, and adds any new bad blocks to that table. Without -k, you loose all of the prior bad block information.

The -fccky parameter...

   -f    Force checking even if the file system seems clean.

-c This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in order to find any bad blocks. If any bad blocks are found, they are added to the bad block inode to prevent them from being allocated to a file or direc‐ tory. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.

-k When combined with the -c option, any existing bad blocks in the bad blocks list are preserved, and any new bad blocks found by running badblocks(8) will be added to the existing bad blocks list.

-y Assume an answer of `yes' to all questions; allows e2fsck to be used non-interactively. This option may not be specified at the same time as the -n or -p options.

heynnema
  • 70,711
  • thanks @heynnema. I added your modifications to grub and that seems to have cleared up those bits from syslog. I've provided the outputs of smartctl, and the Disks utility; the main issue being that I can't seem to run a "long" test at all - it just never completes even after 24+ hours.

    sorry for taking too long to try out your suggestions too; had a busy weekend but i'll pay closer attention this week

    – justincely Jun 29 '20 at 23:34
  • @justincely Sorry, I just got your message. Did you ever get the long test to run? – heynnema Aug 13 '20 at 20:29
  • fraid not. interestingly, this issue has intermittently popped up and dropped away seemingly without much consistency (went away for more than a week after our last session, then popped back in randomly). The only strong constant is that it only happens in the morning when I boot up. Never does the disk get corrupted during daily use, even when i've got it on for 24+ hours. So i'm slowly disabling startup programs and errors I see in syslog, hoping I can narrow it down. @heynnema – justincely Aug 19 '20 at 15:30
  • @justincely Since you appear to have a bevy of external USB drives attached... are they connected via a powered USB hub? Are the SSD firmware(s) up to date? Do you have Windows installed? – heynnema Aug 19 '20 at 15:45
  • @justincely Are you running on the 5.4.0-26-generic kernel? Why so old? – heynnema Aug 19 '20 at 15:53
  • I think that's just the default kernel that comes with ubuntu 20.04. – justincely Aug 19 '20 at 16:28
  • @justincely 5.4.0-42-generic is current for 20.04.1... so maybe you're behind in your Software Updates? What about my powered USB hub question? Firmware? Windows? – heynnema Aug 19 '20 at 18:03
  • ahh, you're right. I did an update to 5.4.0-42-generic late last week. No changes to frequency of the failure - still intermittent, but fixable.

    No USB buses connected; just a USB3.0 backup drive, keyboard, webcam, all directly connected to the motherboard. Honestly, i'm now having an issue where something is preventing the system from shutting down; so I've got to fix that before I circle back to this.

    – justincely Aug 24 '20 at 14:58
  • @justincely Did you do the NCQ patch? – heynnema Aug 24 '20 at 15:01
  • i did do the NCQ patch, and for whatever reason I haven't seen this in weeks now. I've been keeping more up-to-date with kernel updates, and cleaning up everything I can - something in there seems to have worked...thanks fo rall the help. – justincely Sep 21 '20 at 14:22
  • @justincely Thanks for the update. Glad it's all working for you now. – heynnema Sep 21 '20 at 14:43