We recently built a machine equipped with:
| Component | Specification |
|---|---|
| CPU | AMD Ryzen 9 3950X |
| RAM | 128 GB DDR4 3000 MHz |
| Storage | 1 TB SSD + 2x 6 TB HDD |
| GPU | NVIDIA GeForce RTX 3090 24 GB |
| OS | Ubuntu 20.04 LTS |
| PSU | 850 W certified |
We use the machine remotely for AI research. We have had several issues with an annoying bug that appears whenever the CPU is under load. Specifically, the machine freezes completely and the console prints:
Message from syslogd@machinename at Feb 13 09:37:16 ...
kernel:[ 348.578682] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [systemd-journal:660]
After a few of these messages are printed, the machine is no longer reachable over SSH, and the only option is to physically restart it. We have run GPU experiments for weeks without any problem, but as soon as we load the CPU with some tasks, the machine freezes and reports the error above.
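For reference, the kind of CPU load that triggers the freeze can be approximated with a generic stress test. This is only a sketch of how one might try to reproduce the problem; stress-ng and the one-hour duration are arbitrary choices mirroring the "full load for one hour" test mentioned in the comments, not our exact workload:

```bash
# Install a generic CPU stress tool (assumption: any sustained CPU-bound load reproduces the issue).
sudo apt install stress-ng

# Load all logical cores for one hour; adjust --cpu and --timeout as needed.
stress-ng --cpu 0 --timeout 60m --metrics-brief
```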
Has anyone experienced the same problem? How can we solve it?
Update (almost 1 month later)
By setting min_free_kbytes to 2 GB (as suggested by @DougSmythies) we mitigated the issue; however, it appeared again after almost one month of nearly GPU-only usage. This time it happened while using 24 of the 32 virtual cores.
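For anyone wanting to apply the same mitigation, this is roughly what we did; the sysctl.d file name below is just an example:

```bash
# Set vm.min_free_kbytes to 2 GB (the value is in kB) on the running system.
sudo sysctl -w vm.min_free_kbytes=2097152

# Persist it across reboots (file name is an arbitrary choice).
echo 'vm.min_free_kbytes = 2097152' | sudo tee /etc/sysctl.d/99-min-free-kbytes.conf
sudo sysctl --system   # reload all sysctl configuration
```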
Could disabling the NMI watchdog be a workaround for the complete freeze? Is there any disadvantage in doing that? The main goal is to recover the machine remotely (which is impossible once the problem occurs).
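For concreteness, disabling the watchdog would look something like the first option below. The second option is a sketch of an alternative aimed purely at remote recovery: letting the kernel panic and reboot itself when a lockup is detected, which assumes the lockup detector still fires and that the relevant detectors are built into the kernel:

```bash
# Option A: disable the NMI (hard lockup) watchdog entirely.
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
# or permanently via the kernel command line: nmi_watchdog=0

# Option B: keep the watchdogs, but panic and auto-reboot so the machine
# comes back without a physical power cycle (only works if the detector triggers).
sudo sysctl -w kernel.softlockup_panic=1   # panic on soft lockup
sudo sysctl -w kernel.hardlockup_panic=1   # panic on hard lockup (if the detector is available)
sudo sysctl -w kernel.panic=30             # reboot 30 seconds after a panic
```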
Update 2 (after several trials)
The machine seems to behave randomly, and we are not sure what causes the problem. Today we gathered an additional piece of information: looking at the journal log after the n-th crash, we found:
Apr 26 18:43:10 machine_name kernel: [UFW BLOCK] IN=enp4s0 OUT= MAC=01:00:5e:00:00:01:00:05:1a:2f:fe:40:08:00 SRC=130.192.16.207 DST=2>
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:30 machine_name kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
Apr 26 18:44:30 machine_name kernel: BUG: unable to handle page fault for address: ffff9f42feb2adc0
Apr 26 18:44:30 machine_name kernel: #PF: supervisor instruction fetch in kernel mode
At the time, we were executing "simple" Python code (minimal dependencies, e.g. NumPy).
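For completeness, this is roughly how we pull the kernel log of the boot that crashed; it assumes the systemd journal is persistent, which is not always the case by default:

```bash
# Make the systemd journal persistent so logs survive the forced power cycle.
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald

# After the next crash and reboot, read the kernel messages from the previous boot.
sudo journalctl -k -b -1 --no-pager | tail -n 100
```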
Update 3
We updated the kernel to 5.10.0, the latest supported by the NVIDIA drivers. The same code as before was running without problems, but suddenly we again got NMI watchdog warnings:
Message from syslogd@machine_name at May 1 07:50:04 ...
kernel:[163210.925806] NMI watchdog: Watchdog detected hard LOCKUP on cpu 8
Message from syslogd@machine_name at May 1 07:50:04 ...
kernel:[163210.925886] NMI watchdog: Watchdog detected hard LOCKUP on cpu 9
Message from syslogd@machine_name at May 1 07:50:04 ...
kernel:[163210.925940] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
Message from syslogd@machine_name at May 1 07:50:04 ...
kernel:[163210.925992] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
Message from syslogd@machine_name at May 1 07:50:04 ...
kernel:[163210.926044] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12
Message from syslogd@machine_name at May 1 07:50:04 ...
kernel:[163210.926093] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
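One knob we can still tune (also suggested in the comments below) is the watchdog threshold. Raising it only makes the lockup detector less sensitive; it does not fix the underlying stall:

```bash
# Check the current watchdog threshold (the default is usually 10 seconds).
cat /proc/sys/kernel/watchdog_thresh

# Double it, as suggested in the comments; pick whatever value seems reasonable.
sudo sysctl -w kernel.watchdog_thresh=20
```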
Comments

[…] /proc/sys/vm/min_free_kbytes – Doug Smythies Feb 15 '21 at 22:15

(i) Yes, we can run a stress test (we tested at full load for one hour) without any problem. (ii) For /proc/sys/vm/min_free_kbytes we currently have 67584; could an increase help? (iii) Sorry, but I didn't get the point about memory re-organization; for the memory we just ran memtester (OS-level) without errors (testing 120 GB of the 128 GB). – morenolq Feb 16 '21 at 21:50

[…] /proc/sys/vm/min_free_kbytes to like 2 or 4 gigs. – Doug Smythies Feb 16 '21 at 22:01

We set the min_free_kbytes parameter to 2 GB and have not experienced the issue since (at the moment). Thank you for the insightful comment. – morenolq Mar 19 '21 at 14:46

Message from syslogd@machinename at Mar 24 13:08:05 ... kernel:[ 9952.900248] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [python:109603]
Message from syslogd@machinename at Mar 24 13:08:29 ... kernel:[ 9976.960171] watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [java:135792]
– morenolq Mar 24 '21 at 13:34

cat /proc/sys/kernel/watchdog_thresh and maybe double it: echo 20 | sudo tee /proc/sys/kernel/watchdog_thresh (change "20" to whatever). You should also have some traceback information in /var/log/kern.log or /var/log/syslog. Your examples are only a short time after boot. What changed that caused you to re-boot? Are they repeatable? – Doug Smythies Mar 25 '21 at 14:30

We tried disabling the NMI watchdog, but nothing changed; we reproduced the exact same error, just without the NMI log. I used the pointers to the log files that you provided and grepped for python/java. What I obtained is a tainted kernel:
Mar 24 13:11:17 machinename kernel: [10144.959623] watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [java:135792]
Mar 24 13:11:17 machinename kernel: [10144.959648] CPU: 28 PID: 135792 Comm: java Tainted: P D OEL 5.4.0-67-generic #75-Ubuntu
I googled it but found no relevant information. Do you have any ideas? – morenolq Mar 26 '21 at 15:51

Additionally, for the last update, there was no Java execution; do you think it could be the same for Python? – morenolq Apr 27 '21 at 17:36

kernel:[163210.925806] NMI watchdog: Watchdog detected hard LOCKUP on cpu 8 (we got multiple warnings, even for other CPU numbers), and I have no idea what can cause the problem. Is this something you may recognize? Thank you again @ubfan1 for the suggestions. – morenolq May 01 '21 at 08:03

We didn't update to 21.04 because we wanted to stay on an LTS version of the OS. Do you think it could help? Last question: do you think migrating to Fedora/RHEL could be helpful? – morenolq May 01 '21 at 16:17

[…] min_free_kbytes to 16 gigabytes and reduce the maximum CPU frequency. – Doug Smythies May 12 '21 at 18:14

[…] min_free_kbytes a try. – Maghoumi Feb 23 '22 at 18:41

min_free_kbytes doesn't seem to be fixing it for me. Interesting to hear your freezes were always under load... As for wattage, as I mentioned in my comment above, I initially had an 850 W PSU which I later upgraded to 1200 W, which didn't really help the situation. For now, I've gone ahead and reduced my max CPU clock from 3.8 GHz to 3.0 GHz. Will see how it goes. – Maghoumi Mar 01 '22 at 17:38
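Summarising the last suggestions from the comments, the combined mitigation would look roughly like the sketch below. The 16 GB value and the 3.0 GHz cap simply mirror the figures mentioned above, and cpupower on Ubuntu comes from the linux-tools packages; treat this as an illustration, not a confirmed fix:

```bash
# Raise vm.min_free_kbytes to 16 GB (value in kB), as suggested by Doug Smythies.
sudo sysctl -w vm.min_free_kbytes=16777216

# Reduce the maximum CPU frequency, e.g. cap it at 3.0 GHz as Maghoumi did.
sudo apt install linux-tools-common linux-tools-$(uname -r)
sudo cpupower frequency-set -u 3.0GHz
```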