
We recently built a machine equipped with:

Specifications
CPU: AMD Ryzen 9 3950X
RAM: 128 GB DDR4 3000 MHz
Storage: 1 TB SSD + 2x 6 TB HDD
GPU: NVIDIA GeForce RTX 3090 24 GB
OS: Ubuntu 20.04 LTS
PSU: 850 W certified

We use the machine remotely for AI research. We have had several issues with an annoying bug that appears when the CPU is under load. Specifically, the machine freezes completely and the console prints:

Message from syslogd@machinename at Feb 13 09:37:16 ...
kernel:[ 348.578682] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [systemd-journal:660]

After a few of these messages are printed, the machine is no longer reachable over SSH and can only be recovered by physically restarting it. We have run GPU experiments for weeks without any problem, but as soon as we put load on the CPU it freezes and reports the error above.

Has anyone experienced the same problem? How can we solve it?

Update (almost 1 month later): Setting min_free_kbytes to 2 GB (as suggested by @DougSmythies) mitigated the issue, but it reappeared after almost a month of nearly GPU-only usage. It happened while 24 of the 32 virtual cores were in use. Could disabling the NMI watchdog be a workaround for the complete freeze? Is there any disadvantage to doing that? The main goal is to be able to recover the machine remotely (which is impossible once the problem occurs).
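
For reference, the relevant sysctl settings look like the following. This is only a sketch: the file name under /etc/sysctl.d/ is an arbitrary example, and the last three settings are an untested idea for the remote-recovery problem (they make the kernel panic and reboot on a lockup instead of hanging), not something we have verified.

# Raise the free-memory watermark to 2 GB (value is in kB); takes effect immediately.
echo 2097152 | sudo tee /proc/sys/vm/min_free_kbytes

# Untested idea: turn a soft/hard lockup into a panic and reboot 30 s later,
# so the machine comes back on its own instead of needing a manual reset.
# File name is just an example; hardlockup_panic exists only if the hard
# lockup detector is built into the kernel (it is on stock Ubuntu kernels).
printf '%s\n' \
  'vm.min_free_kbytes = 2097152' \
  'kernel.softlockup_panic = 1' \
  'kernel.hardlockup_panic = 1' \
  'kernel.panic = 30' | sudo tee /etc/sysctl.d/99-lockup.conf

# Apply everything under /etc/sysctl.d/ so it also persists across reboots.
sudo sysctl --system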

Update 2 (after several trials): The machine's behaviour seems random and we are not sure what causes the problem. Today we gathered one additional piece of information. Looking at the journal log after the n-th crash, we see:

Apr 26 18:43:10 machine_name kernel: [UFW BLOCK] IN=enp4s0 OUT= MAC=01:00:5e:00:00:01:00:05:1a:2f:fe:40:08:00 SRC=130.192.16.207 DST=2>
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:43:30 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:43:40 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:43:50 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:44:00 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:44:10 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: The canary thread is apparently starving. Taking action.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Demoting known real-time threads.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Successfully demoted thread 59917 of process 1598.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Successfully demoted thread 1598 of process 1598.
Apr 26 18:44:20 machine_name rtkit-daemon[1610]: Demoted 2 threads.
Apr 26 18:44:30 machine_name kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
Apr 26 18:44:30 machine_name kernel: BUG: unable to handle page fault for address: ffff9f42feb2adc0
Apr 26 18:44:30 machine_name kernel: #PF: supervisor instruction fetch in kernel mode

We were executing "simple" Python code with minimal dependencies (numpy...).

Update 3

We updated the kernel to 5.10.0, the latest supported by the NVIDIA drivers. The same code ran without problems for a while, but then we suddenly got NMI warnings again:

Message from syslogd@machine_name at May  1 07:50:04 ...
 kernel:[163210.925806] NMI watchdog: Watchdog detected hard LOCKUP on cpu 8

Message from syslogd@machine_name at May  1 07:50:04 ...
 kernel:[163210.925886] NMI watchdog: Watchdog detected hard LOCKUP on cpu 9

Message from syslogd@machine_name at May  1 07:50:04 ...
 kernel:[163210.925940] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10

Message from syslogd@machine_name at May  1 07:50:04 ...
 kernel:[163210.925992] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11

Message from syslogd@machine_name at May  1 07:50:04 ...
 kernel:[163210.926044] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12

Message from syslogd@machine_name at May  1 07:50:04 ...
 kernel:[163210.926093] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
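
In case it is useful to others reading this, these are the watchdog knobs discussed in the comments below. The threshold value is just an example, and disabling the watchdog only hides the lockup reports; it does not prevent the freeze itself.

# Check and (for example) double the watchdog threshold, in seconds.
cat /proc/sys/kernel/watchdog_thresh
echo 20 | sudo tee /proc/sys/kernel/watchdog_thresh

# Disable the NMI (hard lockup) watchdog at runtime...
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog

# ...or at boot, by adding nmi_watchdog=0 to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub and then running: sudo update-grub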

  • We, well at least I, don't have enough information to even begin to have any inkling as to how to help. A watchdog report can occur if something is hogging uninterruptible sleeps, and it doesn't always mean bad things. In my opinion you will have to pick away at this yourself and come back with more information. For example, can you run a simple CPU load stress program? Can you run with a slightly reduced maximum CPU frequency? What temperatures are involved? How much does your program(s) fragment memory, and might it be going away to re-organize the memory? Try a larger /proc/sys/vm/min_free_kbytes – Doug Smythies Feb 15 '21 at 22:15
  • Thank you so much for the comment; I don't really know where to start looking. Concerning your questions:

    (i) Yes, we can run a stress test (we tested at full load for one hour) without any problem. (ii) /proc/sys/vm/min_free_kbytes is 67584 right now; could an increase help? (iii) Sorry, I didn't get the point about memory re-organization; for the memory we just ran memtester (at the OS level) without errors (testing 120 GB of the 128 GB).

    – morenolq Feb 16 '21 at 21:50
  • My main point was that we don't know yet where to look. The sub-point about memory was that I see you have a lot of it, and I once helped someone else whose large amount of memory was becoming very fragmented, and the computer would go off on its own for a long time re-organizing it. I don't recall the exact details, but it helped greatly to increase their /proc/sys/vm/min_free_kbytes to something like 2 or 4 GB. – Doug Smythies Feb 16 '21 at 22:01
  • I found the old thing I was thinking of. I could well be leading you astray, as we really don't know anything yet. – Doug Smythies Feb 16 '21 at 22:13
  • After almost 1 month: we changed the min_free_kbytes parameter to 2 GB and have not experienced the issue since (so far). Thank you for the insightful comment. – morenolq Mar 19 '21 at 14:46
  • Interesting, thank you very much for the update. – Doug Smythies Mar 20 '21 at 13:52
  • Unfortunately the same thing happened again when there was very high CPU load (24 threads on 32 virtual cores) for more than 2 hours. We don't know how to solve it. – morenolq Mar 24 '21 at 13:28
  • Try 4 GB. Is the error the same, soft lockup? Do you monitor temperatures during this work? – Doug Smythies Mar 24 '21 at 13:33
  • Yes, the same soft lockup error. Unfortunately we were not monitoring temperatures; I thought the problem was "solved".

    Message from syslogd@machinename at Mar 24 13:08:05 ... kernel:[ 9952.900248] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [python:109603]

    Message from syslogd@machinename at Mar 24 13:08:29 ... kernel:[ 9976.960171] watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [java:135792]

    – morenolq Mar 24 '21 at 13:34
  • Given the errors (java and python), I think this time it could be more related to what we were executing. The annoying thing is that the machine freezes and cannot be recovered remotely (you need to go there and restart it manually). Do you know of any workaround? – morenolq Mar 24 '21 at 13:40
  • I do not know a workaround. I'll upvote your question though. – Doug Smythies Mar 24 '21 at 13:52
  • Thank you for the upvote; I was wondering, however, whether disabling the NMI watchdog could be useful. – morenolq Mar 24 '21 at 13:59
  • You could try changing the NMI threshold instead. cat /proc/sys/kernel/watchdog_thresh and maybe double it echo 20 | sudo tee /proc/sys/kernel/watchdog_thresh, change "20" to whatever. You should also have some traceback information in /var/log/kern.log or /var/log/syslog. Your examples are only a short time after boot. What changed that caused you to re-boot? Are they repeatable? – Doug Smythies Mar 25 '21 at 14:30
  • Very interesting points, I have some updates:
    1. We tried disabling the NMI watchdog, but nothing changed; we reproduced the exact same error, just without the NMI log lines.

    2. I used the log-file locations you provided and grepped for python/java. What I found is a tainted kernel:

    Mar 24 13:11:17 machinename kernel: [10144.959623] watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [java:135792]
    Mar 24 13:11:17 machinename kernel: [10144.959648] CPU: 28 PID: 135792 Comm: java Tainted: P D OEL 5.4.0-67-generic #75-Ubuntu

    I googled it but found no relevant information. Do you have any ideas?

    – morenolq Mar 26 '21 at 15:51
  • Please extract the entire area of the log file and edit your question, adding it there. According to the taint flags, you have at least one proprietary unsigned module loaded. If it were me, I would try to make the problem occur more often, yes MORE often, so as to provide a quicker way to debug. – Doug Smythies Mar 26 '21 at 16:12
  • Hi Doug, I added some additional information from the journalctl, thank you again for your efforts. – morenolq Apr 27 '21 at 08:59
  • Check with the motherboard manufacturer for any firmware updates. Do you really need a real-time kernel (?) for AI research? A memory reorg fail may just be Java garbage collection -- maybe you can force that to happen periodically so it doesn't interfere with other work? – ubfan1 Apr 27 '21 at 16:18
  • Wow, you made a lot of very interesting points. Actually, I didn't know my kernel was "real-time". Can you give me any pointers on that?

    Additionally, regarding the last update, there was no Java running; do you think the same could apply to Python?

    – morenolq Apr 27 '21 at 17:36
  • With two people commenting now, you will need to specifically flag the person you want to alert; @ubfan1 would not have been notified about your question. – Doug Smythies Apr 28 '21 at 19:13
  • The "real-time threads" in your posting made me think you have a low-latency kernel. "lowlatency" is in the kernel package names. "Real-time" and "garbage collect" generally do not play well together unless you do it yourself, limiting the time allowed for GC. The "kernel tried to execute NX-protected page" is a clue that something bad is happening. Track that down -- who knows what might be randomly executing that is not in a protected page. – ubfan1 Apr 28 '21 at 21:28
  • I have some additional info: 1. We updated the kernel to version 5.10.0 (the latest supported by Nvidia). 2. We ran our tests without problems for 2 days, even using the code that was causing the error. Now, suddenly, we got the error kernel:[163210.925806] NMI watchdog: Watchdog detected hard LOCKUP on cpu 8 (we got multiple warnings for other CPU numbers as well), and I have no idea what could be causing the problem. Is this something you recognize?

    Thank you again @ubfan1 for the suggestions.

    – morenolq May 01 '21 at 08:03
  • Did you check for motherboard firmware updates? What Nvidia driver are you using? Maybe try dropping back a version, or use the latest one from Nvidia. Try 21.04; that might help with the latest hardware you are running. It might be a programming error, better dealt with on Stack Overflow than on this Ubuntu site. – ubfan1 May 01 '21 at 15:57
  • Thank you @ubfan1 for your comments. Actually, we have run the same code in other configurations without problems (an Intel-based machine running Ubuntu 18.04). For the drivers, we are currently using nvidia-drivers-460 (not the latest beta, 465).

    We didn't update to 21.04 because we wanted to stay on an LTS release of the OS. Do you think it could help? Last question: do you think migrating to Fedora/RHEL could be helpful?

    – morenolq May 01 '21 at 16:17
  • With your hardware newer than the 20.04 release code, I'd definitely try 21.04, and if that works, then move to 22.04 LTS when avail. Try the graphics-drivers PPA for a 465 Nvidia driver, and if not available, get it from the Nvidia site (and be prepared to reinstall every kernel update). I doubt any release older than your hardware will help, but I know little about F/RH since I dropped it 10+ years ago. – ubfan1 May 01 '21 at 16:28
  • As suggested, we upgraded the OS and rebuilt all the drivers. However, we keep experiencing the same issue; at this point it seems to happen more when the CPU is under high load and I/O to and from RAM is frequent. We also ran CPU and RAM benchmarks and they do not show any problem. The main goal at this point is to have a machine where, if something goes wrong, the running script/software crashes but the machine stays reachable over SSH. – morenolq May 12 '21 at 16:23
  • At this point it seems to me to be related to the code you are running, which I wouldn't know how to debug. If it were me, I would set min_free_kbytes to 16 gigabytes and reduce the maximum CPU frequency. – Doug Smythies May 12 '21 at 18:14
  • Hi, @morenolq - do you have any updates on progress you've made on this? I'm encountering what seems like a very similar problem on a machine with 2x EPYC 7543 CPUs and 2x 3090s, used for a mix of deep learning and GIS with both CPU- and GPU-intensive workloads. We seem to crash weekly. I've just tried the min_free_kbytes patch and will see how that goes, but I'm curious whether there's a more effective solution yet. Thanks! – dga Dec 24 '21 at 14:20
  • Hi @dga, we solved the issue by upgrading the motherboard BIOS. However, now we have the opposite problem (the machine goes idle when it is not used), but that is more related to the OS than to the hardware. – morenolq Dec 27 '21 at 15:15
  • Thanks! Alas, we upgraded the bios on ours recently to try to address this (it seemed promising; there was a microcode patch that sounded like it might address our problem) but with no success. We'll see if the min_free_kbytes tweak helps, then. Thanks again for replying! – dga Dec 29 '21 at 01:05
  • Sorry to hear that, let me know if you have other questions or I can be helpful in any way. – morenolq Dec 30 '21 at 08:59
  • The machine has been substantially more stable since upgrading min_free_kbytes, so thank you! We've had one or two apparent lockups but nothing like the regularity with which they were happening before. (Unfortunately, we also removed a likely-glitching GPU from the machine so have changed multiple factors, but I'm not reverting the min_free_kbytes boost. :) – dga Feb 12 '22 at 02:35
  • @morenolq Wow! I've been having the EXACT same problems for about 1.5 years! I have a Ryzen 3900X with a RTX 3090 GPU, running on MSI X570 Tomahawk and 850w PSU (now upgraded to 1200w). This machine has been too unstable to run any DNN training with and I experience lockups that can only be solved by hard resetting. I see the same messages about soft lockups in dmesg which can only be seen by SSH'ing into the machine right before it absolutely freezes. At first I thought it might be due to Ryzen C-state issues, but that doesn't seem to be the case. I'll give min_free_kbytes a try. – Maghoumi Feb 23 '22 at 18:41
  • I think a combination of Ryzen CPU (with known c-state issues) and the Nvidia Driver > v455, which is also known to cause lockups might be the reason we're having all these issues. I also tried using a RTX 2080 Ti, but that also locks up the same way. The only config I haven't tried yet is RTX 2080 Ti with an older driver (perhaps < v450?). – Maghoumi Feb 23 '22 at 18:44
  • @Maghoumi I don't know if the drivers could be the issue; the only way I managed to fix it was the BIOS update. Currently we just have a problem while the machine is not running anything, but I think that is more an Ubuntu-related issue than a hardware problem. I'll try installing a different OS in the next few weeks. – morenolq Feb 24 '22 at 09:41
  • @morenolq Thanks. I'm already running the latest BIOS for my machine and I've been updating it frequently over the past 1.5 years as well. What I know is that this issue started around the time this machine was upgraded to an RTX 3090 (from an RTX 2080 Ti), RAM was upgraded to 64 GB (from 32 GB), and Ubuntu 20.04 was installed. This machine has passed 10 passes of memtest86, 16 hours of prime95's "small FFT" test and a whole bunch of other stress tests (RealBench, AIDA64, etc.), so I'm fairly certain this issue may have something to do with Ubuntu and the software config I'm running. – Maghoumi Feb 28 '22 at 16:12
  • @Maghoumi : Have you tried reducing your maximum CPU frequency a little? (For timing window reasons, and just as a test.) How often does your issue occur? It might be time to start your own question. – Doug Smythies Feb 28 '22 at 16:36
  • No, I haven't tried reducing CPU frequency, but I did try playing with PBO settings, Load Line Calibration values, RAM frequency/voltage/XMP and also CPU voltage. Will try CPU clock next. As for how often, it's totally random: sometimes within a few minutes after boot, other times it could be stable for 2 weeks, then BAM. Also my crash may happen at load, at idle, many apps open, no apps open, etc. I've tried to find a pattern but so far no luck. Sure, I may consider that if the solutions here don't help. – Maghoumi Mar 01 '22 at 08:54
  • @Maghoumi in our case it was (almost certainly) due to heavy CPU load. However, we are also planning to switch the distro to a RHEL-based one (RHEL itself or Fedora) to see if we can solve the low-load (idle) issue. I would also suggest taking a look at the wattage; the 3090 is quite power hungry. – morenolq Mar 01 '22 at 13:22
  • Literally right as I was typing this response I experienced another freeze! So min_free_kbytes doesn't seem to be fixing it for me. Interesting to hear your freezes were always under load... As for wattage, as I mentioned above, I initially had an 850 W PSU which I later upgraded to 1200 W, which didn't really help the situation. For now, I've gone ahead and reduced my max CPU clock from 3.8 GHz to 3.0 GHz. We'll see how it goes. – Maghoumi Mar 01 '22 at 17:38
  • Keep us updated; it would be interesting to find the real root of the issue! (For reference, a sketch of the frequency cap and temperature monitoring suggested above follows this thread.) – morenolq Mar 01 '22 at 21:51
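
A minimal sketch of the two suggestions that come up repeatedly in the comments above: capping the maximum CPU frequency and monitoring temperatures. The 3.0 GHz value is only an example, and the tools assumed here are cpupower (from the linux-tools packages) and sensors (from lm-sensors); they are not something the thread itself used verbatim.

# Cap the maximum CPU frequency (example value):
sudo cpupower frequency-set -u 3.0GHz

# Equivalent via sysfs (value in kHz), applied to every core:
echo 3000000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq

# Watch CPU temperatures while the workload runs:
watch -n 2 sensors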

2 Answers


I already commented about my situation above, but the thread is getting very long, so I am posting this as an answer for anyone else who may be having similar issues and is unable to resolve them.

Long story short, I had random freezes and BUG: soft lockup messages for almost 2 years. My machine was running a Ryzen 3900X and an RTX 3090 GPU on an MSI X570 Tomahawk with a 1200 W PSU and 64 GB of Corsair RAM. I would also get random BSODs in Windows 10.

Over the course of those 2 years I tried many things, including replacing most of my hardware components (except for the motherboard), and nothing resolved the issue. What was really strange to me was that, from day one, I was never able to run with XMP (AMP) profiles enabled for my RAM.

In the end, it turned out the issue was due to a faulty motherboard! I sent the Tomahawk board in for repair, but after waiting an entire month I was told they couldn't source a replacement, so they mailed me a check instead. I went ahead and bought another motherboard (MSI MPG X570S Edge MAX). Lo and behold, that solved the problem for me!

The new motherboard has been running without any issues for about a month. Interestingly, I can enable the XMP profile now and run through days of stress testing without issues.

– Maghoumi

The main problem was fixed by a BIOS update. The machine can now run very long trainings (up to 10 days at full GPU load) without problems, even in multi-user/multi-process settings.

An issue that still persists is that the machine suspends while nothing is running; we already tried masking the suspend and hibernate states, but that does not fix it.
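
For anyone wanting to try the same thing, masking the sleep-related systemd targets looks something like the following (it did not solve the idle problem in our case):

# Prevent systemd from entering any sleep state:
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target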