One of our Samsung 2 TB NVMe SSDs recently failed, so we replaced it with a new drive and have started paying close attention to its SMART data.
Here is the output from a drive that was installed less than two weeks ago:
root@~ $ smartctl -a /dev/nvme0n1p1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-53-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 2TB
Serial Number: S59CNZFNA02015F
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 129,469,706,240 [129 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5a019ed120
Local Time is: Sun Nov 22 22:11:40 2020 EST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.50W - - 0 0 0 0 0 0
1 + 5.90W - - 1 1 1 1 0 0
2 + 3.60W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 14,723 [7.53 GB]
Data Units Written: 4,508,008 [2.30 TB]
Host Read Commands: 243,468
Host Write Commands: 176,596,876
Controller Busy Time: 1,060
Power Cycles: 4
Power On Hours: 205
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 42 Celsius
Temperature Sensor 2: 46 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
The part that has us concerned is this:
Data Units Written: 4,508,008 [2.30 TB]
The rated lifetime is 250 TB written, so having already written 2.3 TB in under two weeks (205 power-on hours) seems insane, and it doesn't make any sense to us.
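As a sanity check on that figure, here is a rough sketch of the arithmetic behind the SMART numbers. It assumes the usual NVMe convention that one data unit is 1,000 blocks of 512 bytes, and it takes the 250 TB lifetime quoted in this post at face value (that figure is not verified against the vendor's spec sheet). Power On Hours may also undercount wall-clock time, so the hourly rate is only an estimate.

```python
# One NVMe "data unit" = 1,000 x 512-byte blocks = 512,000 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512

data_units_written = 4_508_008     # "Data Units Written" from smartctl
host_write_commands = 176_596_876  # "Host Write Commands" from smartctl
power_on_hours = 205               # "Power On Hours" from smartctl
rated_endurance_tb = 250           # lifetime quoted in the post (unverified assumption)

bytes_written = data_units_written * BYTES_PER_DATA_UNIT
tb_written = bytes_written / 1e12

# Average write rate over the drive's powered-on life so far.
gb_per_hour = bytes_written / 1e9 / power_on_hours

# Average payload per host write command, in kB.
avg_write_kb = bytes_written / host_write_commands / 1e3

# Hours until the quoted endurance is reached if the average rate continued.
hours_to_rated = rated_endurance_tb / tb_written * power_on_hours

print(f"written so far:     {tb_written:.2f} TB")
print(f"average write rate: {gb_per_hour:.1f} GB per power-on hour")
print(f"average write size: {avg_write_kb:.1f} kB per command")
print(f"hours to {rated_endurance_tb} TB:    {hours_to_rated:,.0f}")
```

A small average write size with a very large command count would point toward many small synchronous writes (a database or a filesystem journal, for example) rather than a few bulk copies.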
How do we go about trying to figure out why this number is so high?
Thanks!
========================
@heynnema thanks for following up! Here is the output you asked for in your comments (FYI, I disabled Ubuntu swap after installing the new SSD):
root@~ $ free -h
total used free shared buff/cache available
Mem: 251Gi 42Gi 153Gi 3.0Mi 56Gi 208Gi
Swap: 0B 0B 0B
root@~ $ sysctl vm.swappiness
vm.swappiness = 60
root@~ $ grep -i swap /etc/fstab
#/swap.img none swap sw 0 0
======================== additional info:
I ran iotop as follows:
iotop -ao
and have this after running for a while:
Total DISK READ : 0.00 B/s | Total DISK WRITE : 147.34 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 357.38 K/s
TID PRIO USER DISK READ DISK WRITE> SWAPIN IO COMMAND
29546 be/4 999 0.00 B 212.62 M 0.00 % 0.01 % mongod --auth --bind_ip_all [WTCheck.tThread]
855 be/3 root 0.00 B 101.82 M 0.00 % 1.65 % [jbd2/nvme1n1p1-]
1841 be/4 root 0.00 B 33.69 M 0.00 % 0.00 % python /opt/conda/bin/supervisord -c /etc/supervisor/supervisord.conf
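For comparison, the "Actual DISK WRITE" rate in this iotop snapshot can be scaled to a daily figure and set against the lifetime average implied by the SMART counters. This is only a rough sketch: it assumes iotop's "K" means KiB, that the snapshot is representative, and that both tools are looking at the same drive (iotop mentions nvme1n1p1, while smartctl was run against nvme0n1p1).

```python
KIB = 1024  # assumption: iotop reports sizes in base-1024 units

actual_write_kib_s = 357.38        # "Actual DISK WRITE" from the iotop header
smart_bytes = 4_508_008 * 512_000  # Data Units Written x 512,000 bytes per unit
smart_hours = 205                  # Power On Hours

# Scale the instantaneous iotop rate to a full day.
iotop_gb_per_day = actual_write_kib_s * KIB * 86_400 / 1e9

# Lifetime average from the SMART counters, also per day.
smart_gb_per_day = smart_bytes / 1e9 / smart_hours * 24

ratio = smart_gb_per_day / iotop_gb_per_day

print(f"iotop snapshot: {iotop_gb_per_day:.1f} GB/day")
print(f"SMART average:  {smart_gb_per_day:.1f} GB/day")
print(f"SMART average is about {ratio:.1f}x the snapshot rate")
```

If the lifetime average is several times higher than what iotop shows now, much of the 2.3 TB may have been written in earlier bursts (an initial data load, for example) rather than at the current steady rate.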
It looks like the culprits are mongod and jbd2. How do I figure out what jbd2 is doing? Thanks, everyone, for your help!
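Some background for that question: jbd2 is the kernel thread that writes the ext4 journal, so the writes iotop attributes to it are generally journal commits triggered by other processes (mongod is a plausible source here, though that is not confirmed). One way to cross-check the iotop totals is to sample the kernel's per-device write counter in /proc/diskstats. The sketch below is a minimal, hypothetical helper: the function names are made up for illustration, and the default device name nvme1n1 is taken from the jbd2 line above, which differs from the nvme0n1 device queried with smartctl.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_written(diskstats_text, device):
    """Return the cumulative 'sectors written' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, reads, reads merged, sectors read,
        # ms reading, writes, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found in diskstats")


def write_rate(before, after, seconds):
    """Bytes per second written between two counter readings."""
    return (after - before) * SECTOR_BYTES / seconds


def sample(device="nvme1n1", interval=60):
    """Read /proc/diskstats twice, `interval` seconds apart (Linux only)."""
    with open("/proc/diskstats") as f:
        before = sectors_written(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = sectors_written(f.read(), device)
    return write_rate(before, after, interval)

# Example (run on the affected host):
#   print(f"{sample('nvme1n1', 60) / 1e6:.2f} MB/s")
```

Running sample() against both nvme0n1 and nvme1n1 would show which drive is actually receiving the writes before digging further into mongod or the journal.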
cat /etc/fstab – oldfred Nov 23 '20 at 03:56
iotop might help. However, edit your question and show me free -h and sysctl vm.swappiness and grep -i swap /etc/fstab. Start comments to me with @heynnema or I'll miss them. – heynnema Nov 23 '20 at 15:22