How to troubleshoot total system hang

Question

I have a new System76 Lemur Pro laptop with Ubuntu 20.04. I really want to love it, but I'm finding that it's completely and totally locking up several times a week, which kind of puts a damper on my feelings. I'm in contact with System76 support, but I'm also trying to do some troubleshooting of my own. I'm fairly new to Linux and am hoping to learn not just how to fix my machine, but also general troubleshooting steps that would be useful in the future.

The system: System76 Lemur Pro, i7, 40gb RAM, single SSD. Ubuntu 20.04. All updates installed. Only peripherals are a USB hub with a mouse and keyboard plugged in, and an external monitor hooked up via USB-C to DisplayPort adapter. Nothing exotic.

The crash: Several times a week, I'll return to my laptop (usually in the morning after it sits idle all night) to find that it's totally unresponsive to mouse/keyboard. Using ALT+F_ to try to switch to a terminal does not do anything. ALT + PRTSCR + REISUB does not do anything. Hitting the power button does not do anything. Trying to turn on the internal LCD does not do anything. Only holding the power button down and hard-resetting the machine allows me to recover. This did happen only one time while I was actively using the machine and the Gnome desktop stayed visible, the mouse and keyboard locked, and about 1/4 of a second of the song I was listening to just got stuck in a loop. Nothing but hard reset worked to recover.

What I've tried:

Stress testing CPU. I monitored CPU temps while running a stress test for several minutes. Temps never exceeded upper 80s, and the CPU fan kicked in to keep it under control. This seems safe, given that the hot/critical temps were listed as 100.
Running memtester. Looped through 5 times, everything passed.
Installing any updates recommended by Ubuntu.
Looking at system logs (/var/log/syslog). These logs simply go blank when the system hangs and stay blank until I hard reset it. Nothing immediately before the crash looks terribly interesting.
Disabling sleep. Was already disabled, but thought I'd mention it.

At this point, I'm not quite sure what my next steps should be. Are there other logs I can look at? Other diagnostics I can run? Should I assume it's a peripheral and disconnect keyboard/mouse/monitor/hub one at a time to try to isolate? Seems unlikely to be a common peripheral, but who knows.

Edit: as requested, here are logs from /var/log/kern.log right before one of the crashes. They include a lot of messages about CPU throttling. However, such messages also occur regularly when the computer is stable...

Oct 22 07:52:00 system76-pc kernel: [44320.095989] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 7775)
Oct 22 07:52:00 system76-pc kernel: [44320.095990] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 4669)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 719)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.095994] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.096970] mce: CPU2: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU0: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU5: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096973] mce: CPU3: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU6: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU7: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096975] mce: CPU4: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096976] mce: CPU1: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU6: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU7: Package temperature/speed normal

Would try to disconnect the peripherals, just because this is not expected behaviour of a System76 (maybe something picked up at a store). You do not say what graphics card you are using. — crip659, Oct 23 '20 at 13:24
Graphics are just built-in Intel UHD graphics. I agree this is not expected behavior - I hardly believe a company that specializes in Linux laptops is in the habit of shipping units that all crash, which is what's leading me to run hardware tests and to suspect peripherals. I figure I either got a bum unit, or I've done something out of the ordinary... — John Chrysostom, Oct 23 '20 at 13:27
Think it is one of two main things, System 76 got a bad piece of hardware or there is a bug affecting the system. Does sound like the baytrail bug. — crip659, Oct 23 '20 at 13:42
Yeah, it does sound exactly like the baytrail bug, after looking. It's an i7-10510U, though, which is Comet Lake. Will research to see if there's a similar bug/workaround. Interestingly, I do see a LOT of stuff about CPU throttling right before the system crashes in at least a couple cases I have in front of me, but I assumed that was just 'cause nothing else was running with the system idle. Could possibly be related. — John Chrysostom, Oct 23 '20 at 13:51
A fast google did show an older i7 bug that causes freezes, but you will need a better search. Would think System 76 would know about any unless newer. — crip659, Oct 23 '20 at 14:15
I went ahead and set my max cstate to 1 just to see if it helps. Easy enough to rule out, right? Will report back. — John Chrysostom, Oct 23 '20 at 14:36
CPU throttling messages are an important clue, and are not due to system idle. Please edit your question to include some examples. — Doug Smythies, Oct 23 '20 at 15:01
There is a high probability that your issue is thermal shutdown. The throttling messages are thermal related, and are the 1st level of protection. The last level of protection is to shut down the computer. Perhaps your CPU stress test is not stressful enough (different 100% uses of CPU create different waste heat; the mprime torture test is the best I have found, and I have tried a great many) or the main heat source might be graphics. Suggest monitoring with sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 6. — Doug Smythies, Oct 23 '20 at 16:19
my thermal shutdown suggestion is inconsistent with your "and the Gnome desktop stayed visible" experience. — Doug Smythies, Oct 23 '20 at 16:44
I think CPU is very likely, though I'm not sure it's overheating. Does the total events = 7775 mean 7775 events were actually generated and logged as one? Cause that seems like it would be not normal (I'm hypothesizing). Running mprime torture tests, I still can't get the temperature above 88, which should be within the CPU's working range. I'll pass the info along to the system76 support folks too, though. Given that this often happens when the PC is idle, I'm starting to entertain the possibility that it's the CPU getting throttled down too much, as with the Bay Trail problem... — John Chrysostom, Oct 23 '20 at 18:15

score 0 · Answer 1 · answered Oct 24 '20 at 15:57

This is a partial answer, based on current information, including from the comments.

From the log files, there are indications that high CPU temperatures are involved, such that the system keeps hitting its throttling temperature limit. However, CPU stress tests indicate no problem.

As a test, find the system operating point where CPU thermal problems are not possible and run that way for long enough to determine the effect on system stability. The cost of this test will be performance. Later on, a proper thermal daemon (thermald, tlp, ...) should be investigated as a way to recover maximum performance.

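The default CPU frequency scaling driver for the i7-10510U is intel_pstate, and this answer is written for that driver. Note that the driver reports itself as intel_pstate in active mode and as intel_cpufreq in passive mode; my example computer below is running it in passive mode. Check via: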

doug@s15:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu4/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu5/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu6/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu7/cpufreq/scaling_driver:intel_cpufreq
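
The mode can also be read directly. A minimal sketch, assuming a kernel recent enough to expose the intel_pstate status file; the SYSFS_CPU variable and the pstate_mode function are hypothetical names, used only so the path can be overridden:

```shell
#!/bin/sh
# Hypothetical helper: report how the intel_pstate driver is operating.
# SYSFS_CPU is an illustrative override; the default is the real sysfs path.
SYSFS_CPU="${SYSFS_CPU:-/sys/devices/system/cpu}"

pstate_mode() {
    status_file="$SYSFS_CPU/intel_pstate/status"
    if [ -r "$status_file" ]; then
        # The kernel reports "active", "passive" or "off" here
        cat "$status_file"
    else
        # The file is absent when intel_pstate is not in use (e.g. acpi-cpufreq)
        echo "unavailable"
    fi
}

pstate_mode
```

If this prints "unavailable", the max_perf_pct and min_perf_pct files used below will probably be missing as well, and the rest of this answer does not apply as written.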

The mprime (prime95) high heat torture test is used as the CPU stress test because it consumes the most energy of any CPU stress test that I have ever tested. To protect my example computer, which has no thermal daemon running, the desired operating point of about 80 degrees will be approached from the low side. First, note the current maximum CPU frequency percent and the minimum as well (yours will be different):

doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
42

It might not be 100% if some thermal daemon is already limiting things. Anyway, I will start at 50%:

doug@s15:~$ echo 50 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
50

Then gradually raise the maximum CPU frequency percent, say in 10 percent increments, and find the operating point for about 80 degrees processor package temperature:

doug@s15:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 6
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.25    1754    725     25      3.81    0.12
0.02    1600    288     26      3.70    0.12
0.06    1600    360     26      3.70    0.12
38.82   1899    7740    39      16.28   0.12
100.00  1900    17594   41      36.20   0.12   <<< mprime torture test started
100.00  1900    17541   42      36.44   0.12
100.00  1900    17552   43      36.39   0.12
100.00  1900    17517   44      36.25   0.12
100.00  1927    17474   48      36.95   0.12
100.00  2300    17389   49      46.51   0.12
100.00  2300    17367   50      46.60   0.12
100.00  2300    17362   52      46.69   0.12
100.00  2300    17438   53      46.77   0.12
100.00  2552    18440   56      54.18   0.12
100.00  2700    17672   58      58.48   0.12
100.00  2700    17590   58      58.59   0.12
100.00  2700    17710   61      58.74   0.12
100.00  2953    17780   66      67.91   0.12
100.00  3100    17876   68      73.38   0.12  <<<< First time at 80%, temp lags.
100.00  3100    17843   69      73.55   0.12
100.00  3100    17860   70      73.64   0.12
100.00  3100    18794   71      73.78   0.12
100.00  3231    17826   77      79.69   0.12
100.00  3500    18305   80      92.33   0.12
100.00  3500    17765   81      92.66   0.12
100.00  3457    17747   80      90.72   0.12
100.00  3300    17720   81      82.62   0.12
100.00  3300    17723   81      82.72   0.12
100.00  3300    17708   80      82.81   0.12
100.00  3300    17712   83      82.95   0.12  <<<< Oops, too high
100.00  3300    17788   82      83.08   0.12
100.00  3204    17882   81      79.25   0.12
100.00  3100    17778   80      74.78   0.12
100.00  3100    18571   81      74.83   0.12
100.00  3100    17806   80      74.85   0.12
100.00  3100    17787   80      74.89   0.12 <<<< 80 percent seems stable
100.00  3100    17772   81      74.84   0.12
100.00  3100    17824   81      74.85   0.12
100.00  3100    17777   80      74.89   0.12
100.00  3100    17799   81      74.95   0.12
100.00  3100    17867   81      74.77   0.12

So, for my system, limiting the CPU frequency to 80% of maximum keeps the processor away from any additional built-in thermal throttling. Run the system this way for a while and see whether the hangs still occur.
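
The stepping itself can be scripted. This is only a rough sketch of the procedure, not what I actually ran: PSTATE_DIR, step_max_perf and its four arguments (start percent, stop percent, step, settle time in seconds) are made-up names, and it only steps upward, so you still have to watch turbostat in another terminal and back off by hand if the temperature overshoots.

```shell
#!/bin/sh
# Sketch: raise max_perf_pct in steps, pausing so the temperature can settle.
# PSTATE_DIR and step_max_perf are illustrative names, not part of any tool.
PSTATE_DIR="${PSTATE_DIR:-/sys/devices/system/cpu/intel_pstate}"

step_max_perf() {
    start=$1; stop=$2; step=$3; settle=$4
    pct=$start
    while [ "$pct" -le "$stop" ]; do
        # Writing the real sysfs file requires root
        echo "$pct" > "$PSTATE_DIR/max_perf_pct" || return 1
        echo "max_perf_pct=$pct"
        # Temperature lags the frequency change, so wait before the next step
        sleep "$settle"
        pct=$((pct + step))
    done
}

# Example, run as root with the stress test already going:
#   step_max_perf 50 80 10 60
```

The setting does not survive a reboot, so it has to be reapplied after each boot for the duration of the test.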

Thanks for this. On the basis of somebody mentioning the similarity to the Bay Trail bug, I set my max cstate to 1 and the system has been running stable since Friday. This would imply that it's not overheating that's causing the issue, but rather dropping into deeper idle states to save power and then crashing when trying to come out of them. Do you agree? (See https://askubuntu.com/questions/803640/system-freezes-completely-with-intel-bay-trail) — John Chrysostom, Oct 26 '20 at 12:49
I don't know. I do see many idle state related bugs on bugzilla. However, these particular mce's (machine check exceptions) are temperature related interrupts inside the processor itself. The issue is that running your system with a max idle state depth of 1 will cost you energy. Myself, I would attempt to isolate whether it is a particular idle state, and also try with HWP disabled and with the acpi-cpufreq driver instead of the intel_pstate driver. — Doug Smythies, Oct 26 '20 at 18:28
Thanks. I'm now up to 4 days stable with the max cstate set to 1, so I'm getting pretty confident that it's stable. I could always try pushing it further and further until I hit instability again to save as much energy as possible. I'll look into swapping drivers, thanks! — John Chrysostom, Oct 27 '20 at 15:34
Do you still get any mce's (machine check exceptions) at all? This one is very interesting, but I suppose you want to actually use your computer rather than investigate forever. (I have spent a lot of time in 2020 on an intel_pstate with HWP issue related to idle state 2 being enabled. With idle state 2 disabled it is fine.) — Doug Smythies, Oct 27 '20 at 16:04

score 0 · Accepted Answer · answered Nov 02 '20 at 13:01

This is a kernel bug associated with CPU power management. It's fixed in kernel 5.8, which comes with Ubuntu 20.10. I upgraded to 20.10, turned off all the workarounds, and am running stable now.

If upgrading to 5.8/20.10 isn't something you want to do, you can also work around the bug by keeping your CPU from going into lower-power states (this will reduce battery life, obviously). Open up /etc/default/grub and append intel_idle.max_cstate=1 to the value of GRUB_CMDLINE_LINUX_DEFAULT. Save, run sudo update-grub, then reboot. Reverse the process to remove the workaround.
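
If you'd rather script the edit than do it by hand, something like the following should work. It's a sketch, not what I actually ran: add_kernel_param is a made-up helper name, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is double-quoted on a single line, which is how the stock Ubuntu file looks.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT if it is missing.
# add_kernel_param is a hypothetical helper; it assumes a double-quoted,
# single-line value and a parameter without "/" or "&" in it.
add_kernel_param() {
    grub_file=$1; param=$2
    # Nothing to do if the parameter is already on that line
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$grub_file"; then
        return 0
    fi
    # Keep a backup, then rewrite the file from it
    cp "$grub_file" "$grub_file.bak" || return 1
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"/\1 $param\"/" \
        "$grub_file.bak" > "$grub_file"
}

# Example, run as root:
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=1
#   update-grub
```

After rebooting, cat /proc/cmdline should show the parameter if it took effect.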

It's possible a cstate value higher than 1 would still be a stable workaround, but I never experimented enough to verify.
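
For anyone who does want to experiment, individual idle states can be switched off at runtime through the cpuidle files in sysfs, without rebooting. A rough sketch that I have not used to diagnose this bug: CPU_ROOT and both function names are made up for illustration, and note that the state index in sysfs is not necessarily the same number as the C-state name (state0 is usually POLL).

```shell
#!/bin/sh
# Sketch: inspect and disable individual cpuidle states at runtime (not persistent).
# CPU_ROOT and both function names are illustrative.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Print "index name" for every idle state that cpu0 exposes
list_idle_states() {
    for d in "$CPU_ROOT"/cpu0/cpuidle/state*; do
        [ -d "$d" ] || continue
        echo "${d##*/state} $(cat "$d/name")"
    done
}

# Write 1 (disable) or 0 (enable) for one state index on every CPU.
# Needs root on the real sysfs tree.
set_idle_state_disabled() {
    index=$1; value=$2
    for f in "$CPU_ROOT"/cpu[0-9]*/cpuidle/state"$index"/disable; do
        [ -e "$f" ] || continue
        echo "$value" > "$f" || return 1
    done
}

list_idle_states
# Example, as root: disable the state with index 3 everywhere
#   set_idle_state_disabled 3 1
```

The idea would be to disable the deepest state first, run for a few days, and work from there until the hangs return.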
