I first noticed the issue when I tried to tar my hard drive and then copy a 100 GB file. Since then I have tried lots of things, and basically what I am seeing is that copying large amounts of data makes the system fail. The following script, with some files in folder atemp1 adding up to around 1 GB, shows the issue:
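cnt=0
while true
do
    cnt=$((cnt+1))
    echo "$cnt cp" >> cnt.log
    # -d keep symlinks, -u copy only newer/missing files, -p preserve attributes, -R recurse
    cp -dupR atemp1/* atemp2/
    # log the top summary header (load, tasks, CPU, memory) after each copy
    top -b -n 1 | head -n 5 >> cnt.log
    echo "$cnt rm" >> cnt.log
    rm atemp2/*
done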
So the script does nothing other than copy the same content over and over. A few lines from the log file look as follows:
%Cpu(s): 3.9 us, 20.5 sy, 0.0 ni, 54.5 id, 20.0 wa, 0.0 hi, 0.6 si, 0.6 st
%Cpu(s): 3.3 us, 23.5 sy, 0.0 ni, 44.8 id, 27.0 wa, 0.0 hi, 0.5 si, 1.0 st
%Cpu(s): 2.2 us, 29.4 sy, 0.0 ni, 26.6 id, 40.0 wa, 0.0 hi, 0.3 si, 1.6 st
%Cpu(s): 2.0 us, 30.3 sy, 0.0 ni, 23.8 id, 42.0 wa, 0.0 hi, 0.3 si, 1.7 st
%Cpu(s): 1.9 us, 30.7 sy, 0.0 ni, 22.4 id, 43.0 wa, 0.0 hi, 0.2 si, 1.7 st
%Cpu(s): 1.8 us, 31.2 sy, 0.0 ni, 20.9 id, 44.0 wa, 0.0 hi, 0.2 si, 1.8 st
%Cpu(s): 1.3 us, 33.4 sy, 0.0 ni, 13.3 id, 50.0 wa, 0.0 hi, 0.2 si, 2.0 st
%Cpu(s): 1.0 us, 34.7 sy, 0.0 ni, 8.9 id, 53.0 wa, 0.0 hi, 0.1 si, 2.2 st
%Cpu(s): 1.0 us, 34.9 sy, 0.0 ni, 7.9 id, 54.0 wa, 0.0 hi, 0.1 si, 2.2 st
%Cpu(s): 0.9 us, 35.0 sy, 0.0 ni, 6.8 id, 55.0 wa, 0.0 hi, 0.1 si, 2.2 st
%Cpu(s): 0.9 us, 35.3 sy, 0.0 ni, 5.5 id, 56.0 wa, 0.0 hi, 0.1 si, 2.2 st
%Cpu(s): 0.7 us, 36.7 sy, 0.0 ni, 3.2 id, 57.0 wa, 0.0 hi, 0.1 si, 2.3 st
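To pull just the wa column out of cnt.log, something like the following sketch should work. The function name extract_wa is made up for illustration, and it assumes the top summary layout shown above (comma-separated fields, with the value in front of the wa label); other top versions may format the line differently.

```shell
# Hypothetical helper: print the iowait ("wa") value of every %Cpu(s) line in a log file.
# Assumes comma-separated fields such as " 20.0 wa", as in the listing above.
extract_wa() {
    awk -F',' '/^%Cpu/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ / wa$/) {          # this field carries the iowait value
                split($i, parts, " ")   # split on whitespace: value, then label
                print parts[1]
            }
        }
    }' "$1"
}
```

Called as extract_wa cnt.log, it prints one wa value per logged iteration, which makes the upward trend easier to follow than the full top header.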
So wa keeps going up until the system stops. In fact, watching top in a parallel terminal, I see wa climb to 99.7 before the system fails. Nothing shows up in any system log file while this happens. Finally, I am using software RAID, ext4 and LVM. The HDDs are 4 TB each. The LVM volume is 500 GB. As the files are deleted and then copied again, I assume that the same part of the HDD is always used and that it is not a defective sector. Needless to say, I have already run such checks. Does anyone have a clue about this issue? Is it a kernel problem?