
I first noticed the issue when I tried to tar my hard drive and then copy a 100 GB file. I have since tried lots of things, and what I see is that copying large amounts of data causes the system to fail. The following script, with roughly 1 GB of files in folder atemp1, reproduces the issue:

    cnt=0
    while true
    do
            cnt=$((cnt+1))
            echo "$cnt cp" >> cnt.log
            cp -dupR atemp1/* atemp2/
            top -b -n 1 | head -n 5 >> cnt.log
            echo "$cnt rm" >> cnt.log
            rm atemp2/*
    done
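As a side note on observing such a workload: while a copy loop like the one above runs, the backlog of data the kernel has accepted but not yet written to disk can be watched alongside top. This is a sketch assuming a Linux `/proc` filesystem; a steadily growing Dirty value means the disks cannot keep up with the incoming writes:

```shell
# Dirty and Writeback in /proc/meminfo are the amounts of page-cache
# data still waiting to be flushed to disk (values are in kB).
grep -E '^(Dirty|Writeback):' /proc/meminfo
```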

So the script does nothing but copy the same content over and over. Looking at some lines of the log file, the result is as follows:

%Cpu(s):  3.9 us, 20.5 sy,  0.0 ni, 54.5 id, 20.0 wa,  0.0 hi,  0.6 si,  0.6 st
%Cpu(s):  3.3 us, 23.5 sy,  0.0 ni, 44.8 id, 27.0 wa,  0.0 hi,  0.5 si,  1.0 st
%Cpu(s):  2.2 us, 29.4 sy,  0.0 ni, 26.6 id, 40.0 wa,  0.0 hi,  0.3 si,  1.6 st
%Cpu(s):  2.0 us, 30.3 sy,  0.0 ni, 23.8 id, 42.0 wa,  0.0 hi,  0.3 si,  1.7 st
%Cpu(s):  1.9 us, 30.7 sy,  0.0 ni, 22.4 id, 43.0 wa,  0.0 hi,  0.2 si,  1.7 st
%Cpu(s):  1.8 us, 31.2 sy,  0.0 ni, 20.9 id, 44.0 wa,  0.0 hi,  0.2 si,  1.8 st
%Cpu(s):  1.3 us, 33.4 sy,  0.0 ni, 13.3 id, 50.0 wa,  0.0 hi,  0.2 si,  2.0 st
%Cpu(s):  1.0 us, 34.7 sy,  0.0 ni,  8.9 id, 53.0 wa,  0.0 hi,  0.1 si,  2.2 st
%Cpu(s):  1.0 us, 34.9 sy,  0.0 ni,  7.9 id, 54.0 wa,  0.0 hi,  0.1 si,  2.2 st
%Cpu(s):  0.9 us, 35.0 sy,  0.0 ni,  6.8 id, 55.0 wa,  0.0 hi,  0.1 si,  2.2 st
%Cpu(s):  0.9 us, 35.3 sy,  0.0 ni,  5.5 id, 56.0 wa,  0.0 hi,  0.1 si,  2.2 st
%Cpu(s):  0.7 us, 36.7 sy,  0.0 ni,  3.2 id, 57.0 wa,  0.0 hi,  0.1 si,  2.3 st

So wa climbs continuously until the system stops. Watching top in a parallel terminal, I see wa go up to 99.7 before the failure. There is no indication in any system log file while this happens. For reference, I am using software RAID, ext4 and LVM; the HDDs are 4 TB each and the LVM volume is 500 GB. Since the files are deleted and then copied again, I assume the same part of the HDD is used every time, so it is not a defective sector. Needless to say, I already ran such checks. Does anyone have any clue about this issue? Is it a kernel problem?

Joe

2 Answers


IOWait is a CPU metric: it measures the percentage of time the CPU is idle but waiting for an I/O operation to complete. Counterintuitively, it is possible to have a healthy system with nearly 100% iowait, or a disk bottleneck with 0% iowait. Since your system is doing nothing but repetitive I/O with your script, it is not surprising to see wa approach 100%; this in and of itself is not your problem. Since you aren't getting any indications in the syslog, you should run a memtest (see 1 and 2) and then check the SMART status of the drives in question.
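As a rough illustration of what that metric means, the wa figure top reports can be derived by hand from the kernel's cumulative tick counters. This is a sketch assuming Linux, where field 6 of the `cpu` line in `/proc/stat` is cumulative iowait ticks (irq, softirq and steal ticks are omitted from the denominator for brevity):

```shell
# Sample the cumulative CPU tick counters twice, one second apart,
# then compute iowait's share of the elapsed ticks -- the wa value.
read -r _ u1 n1 s1 i1 w1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 rest < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
wa=$(( (w2 - w1) * 100 / total ))
echo "iowait over the last second: ${wa}%"
```

The point is that these ticks only count time the CPU had nothing else to do: add a CPU-bound process and the same disk backlog shows up as us time instead of wa.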

You might also have a dodgy data or power cable going to the drive(s) in use.

Further reading: https://serverfault.com/questions/12679/can-anyone-explain-precisely-what-iowait-is

Elder Geek

Well, after considerable testing I finally replaced my 200+ Euro motherboard (with CPU) with one under 100 Euro, and the system now works without problems. As a side effect, the Ethernet interfaces also get sensible names now (enp1s0 and enp2s0) instead of ens3 and rename2 before. Needless to say, the old motherboard sometimes changed the naming of the Ethernet interfaces, which was a disaster, although I could work around it with some boot parameters for the Ethernet ports. I do not want to disclose the motherboard name, but if you have similar issues you may contact me.

Joe