
One of our Samsung 2TB NVMe SSDs recently failed, so we swapped it for a new stick and have started paying careful attention to the SMART data.

Here is the output from a drive that was installed less than two weeks ago:

root@~ $ smartctl -a /dev/nvme0n1p1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-53-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 2TB
Serial Number:                      S59CNZFNA02015F
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            129,469,706,240 [129 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5a019ed120
Local Time is:                      Sun Nov 22 22:11:40 2020 EST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        0       0
 1 +     5.90W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    14,723 [7.53 GB]
Data Units Written:                 4,508,008 [2.30 TB]
Host Read Commands:                 243,468
Host Write Commands:                176,596,876
Controller Busy Time:               1,060
Power Cycles:                       4
Power On Hours:                     205
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

The part that has us concerned is this:

Data Units Written:                 4,508,008 [2.30 TB]

The rated write endurance is 250 TB, so 2.3 TB written in under two weeks is insane and doesn't make any sense. (For reference, each NVMe data unit is 1,000 × 512-byte blocks, which is how 4,508,008 units works out to ~2.3 TB.)

How do we go about trying to figure out why this number is so high?
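For context, here is the quick check we have been running to watch the counter grow; a rough sketch, assuming the smartctl output format above and that /dev/nvme0 is the right device:

# Snapshot "Data Units Written", wait an hour, then diff (run as root).
# One NVMe data unit = 1,000 x 512 bytes = 512,000 bytes.
before=$(smartctl -a /dev/nvme0 | awk '/Data Units Written/ {gsub(",", "", $4); print $4}')
sleep 3600
after=$(smartctl -a /dev/nvme0 | awk '/Data Units Written/ {gsub(",", "", $4); print $4}')
echo "Written in the last hour: $(( (after - before) * 512000 / 1024 / 1024 )) MiB"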

Thanks!

========================

@heynnema thanks for following up! Here are the responses to your comments (FYI, I disabled Ubuntu swap after installing the new SSD):

root@~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:          251Gi        42Gi       153Gi       3.0Mi        56Gi       208Gi
Swap:            0B          0B          0B

root@~ $ sysctl vm.swappiness
vm.swappiness = 60

root@~ $ grep -i swap /etc/fstab
#/swap.img none swap sw 0 0
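For the record, turning swap off amounted to the following; a sketch, assuming the stock Ubuntu swap.img setup:

# Deactivate all swap immediately; the commented-out /etc/fstab line
# above keeps it disabled across reboots.
sudo swapoff -a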

========================

Additional info:

I ran iotop as follows:

iotop -ao

and have this after running for a while:

Total DISK READ :       0.00 B/s | Total DISK WRITE :     147.34 K/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     357.38 K/s
  TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
29546 be/4 999           0.00 B    212.62 M  0.00 %  0.01 % mongod --auth --bind_ip_all [WTCheck.tThread]
  855 be/3 root          0.00 B    101.82 M  0.00 %  1.65 % [jbd2/nvme1n1p1-]
 1841 be/4 root          0.00 B     33.69 M  0.00 %  0.00 % python /opt/conda/bin/supervisord -c /etc/supervisor/supervisord.conf

It looks like the culprits are mongod and jbd2. How do I figure out what jbd2 is doing? Thanks, everyone, for your help!
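In case it helps others, here is what I am trying next to see what jbd2 flushes; a sketch, assuming the blktrace package is installed and that jbd2/nvme1n1p1 is journaling the filesystem on /dev/nvme1n1:

# Trace block-layer I/O on the device and filter for the journal thread.
# jbd2 is ext4's journaling daemon, so heavy jbd2 traffic usually mirrors
# frequent small commits/fsyncs from some other process (mongod here).
sudo blktrace -d /dev/nvme1n1 -o - | blkparse -i - | grep jbd2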

  • Are you using noatime in your mounts in fstab? cat /etc/fstab – oldfred Nov 23 '20 at 03:56
  • As mentioned below, iotop might help. However, edit your question and show me free -h and sysctl vm.swappiness and grep -i swap /etc/fstab. Start comments to me with @heynnema or I'll miss them. – heynnema Nov 23 '20 at 15:22

2 Answers


You can check it with iotop. It won't show you the total lifetime writes to a drive, but it will let you see whether apps are writing to the drive(s) a lot.

sudo apt install iotop

Then run it with elevated permissions:

sudo iotop

You should see something like the following:

Total DISK READ:         0.00 B/s | Total DISK WRITE:       248.20 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND        
1780425 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.37 % [kworker~e_power_]
   5170 be/4 terrance    0.00 B/s  248.20 K/s  0.00 %  0.00 % firefox ~orage #3]
      1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init nosplash
      2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
      3 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_gp]
      4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_par_gp]
      6 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker~-kblockd]
      8 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [mm_percpu_wq]
      9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
     10 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]
     11 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [idle_inject/0]
     14 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/0]
     15 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/1]
     16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [idle_inject/1]
     17 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
     18 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
     20 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker~-kblockd]
     21 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/2]
  keys:  any: refresh  q: quit  i: ionice  o: active  p: procs  a: accum        
  sort:  r: asc  left: SWAPIN  right: COMMAND  home: TID  end: COMMAND          
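If you want cumulative totals per process rather than the instantaneous rates shown above, accumulated mode may help (flags per iotop's man page: -a accumulates I/O since iotop started, -o shows only active tasks, -P groups threads by process):

sudo iotop -aoP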

Hope this helps!

– Terrance

========================

The important variable is Percentage Used, which is currently 0%. When it reaches 1%, multiply the number of months the drive has been in service by 100 to estimate its total life span in months.
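As a rough sketch of that arithmetic (assuming wear accrues linearly; the device name is an example):

# Pull the wear counter and the drive's age (run as root).
smartctl -a /dev/nvme0 | grep -E 'Percentage Used|Power On Hours'

# Example: 1% used after 205 power-on hours projects to roughly
# 100 * 205 = 20,500 hours (~2.3 years) of total life.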

See: How do I check system health?

  • thanks - we checked one of our other machines, which runs a similar config, and Percentage Used was 7%. That SSD is ~3 months old. We are trying to see whether the Docker containers are doing excessive writes, whether it's mongodb, or something else. – vgoklani Nov 23 '20 at 16:59