0

I have a new System76 Lemur Pro laptop with Ubuntu 20.04. I really want to love it, but I'm finding that it's completely and totally locking up several times a week, which kind of puts a damper on my feelings. I'm in contact with System76 support, but I'm also trying to do some troubleshooting of my own. I'm fairly new to Linux and am hoping to learn not just how to fix my machine, but also general troubleshooting steps that would be useful in the future.

The system: System76 Lemur Pro, i7, 40 GB RAM, single SSD. Ubuntu 20.04. All updates installed. The only peripherals are a USB hub with a mouse and keyboard plugged in, and an external monitor connected via a USB-C to DisplayPort adapter. Nothing exotic.

The crash: Several times a week, I'll return to my laptop (usually in the morning, after it has sat idle all night) to find it totally unresponsive to mouse and keyboard. Using ALT+F_ to try to switch to a terminal does nothing. ALT + PRTSCR + REISUB does nothing. Pressing the power button does nothing. Trying to turn on the internal LCD does nothing. Only holding the power button down to hard-reset the machine lets me recover. This has happened only once while I was actively using the machine. That time, the mouse and keyboard locked and the Gnome desktop stayed visible, while about 1/4 of a second of the song I was listening to got stuck in a loop. Again, nothing but a hard reset worked.

What I've tried:

  • Stress testing the CPU. I monitored CPU temps while running a stress test for several minutes. Temps never exceeded the upper 80s (°C), and the CPU fan kicked in to keep them under control. This seems safe, given that the hot/critical temps were listed as 100 °C.
  • Running memtester. Looped through 5 times, everything passed.
  • Installing any updates recommended by Ubuntu.
  • Looking at system logs (/var/log/syslog). These logs simply go blank when the system hangs and stay blank until I hard reset it. Nothing immediately before the crash looks terribly interesting.
  • Disabling sleep. Was already disabled, but thought I'd mention it.

At this point, I'm not quite sure what my next steps should be. Are there other logs I can look at? Other diagnostics I can run? Should I assume it's a peripheral and disconnect keyboard/mouse/monitor/hub one at a time to try to isolate? Seems unlikely to be a common peripheral, but who knows.

Edit: as requested, here are logs from /var/log/kern.log from right before one of the crashes. They include a lot of messages about CPU throttling. However, such messages also occur regularly when the computer is stable...

Oct 22 07:52:00 system76-pc kernel: [44320.095989] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 7775)
Oct 22 07:52:00 system76-pc kernel: [44320.095990] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 4669)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 719)
Oct 22 07:52:00 system76-pc kernel: [44320.095992] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.095994] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 752)
Oct 22 07:52:00 system76-pc kernel: [44320.096970] mce: CPU2: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU0: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096972] mce: CPU5: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096973] mce: CPU3: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU6: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096974] mce: CPU7: Core temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096975] mce: CPU4: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096976] mce: CPU1: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU6: Package temperature/speed normal
Oct 22 07:52:00 system76-pc kernel: [44320.096977] mce: CPU7: Package temperature/speed normal
John Chrysostom
  • I would try disconnecting the peripherals, just because this is not expected behaviour for a System76 (maybe something picked up at a store). You do not say what graphics card you are using. – crip659 Oct 23 '20 at 13:24
  • Graphics are just built-in Intel UHD graphics. I agree this is not expected behavior - I hardly believe a company that specializes in Linux laptops is in the habit of shipping units that all crash, which is what's leading me to run hardware tests and to suspect peripherals. I figure I either got a bum unit, or I've done something out of the ordinary... – John Chrysostom Oct 23 '20 at 13:27
  • Think it is one of two main things, System 76 got a bad piece of hardware or there is a bug affecting the system. Does sound like the baytrail bug. – crip659 Oct 23 '20 at 13:42
  • Yeah, it does sound exactly like the baytrail bug, after looking. It's an i7-10510U, though, which is Comet Lake. Will research to see if there's a similar bug/workaround. Interestingly, I do see a LOT of stuff about CPU throttling right before the system crashes in at least a couple of cases I have in front of me, but I assumed that was just 'cause nothing else was running with the system idle. Could possibly be related. – John Chrysostom Oct 23 '20 at 13:51
  • A quick Google search did show an older i7 bug that causes freezes, but you will need a better search. I would think System76 would know about any such bug unless it is newer. – crip659 Oct 23 '20 at 14:15
  • I went ahead and set my max cstate to 1 just to see if it helps. Easy enough to rule out, right? Will report back. – John Chrysostom Oct 23 '20 at 14:36
  • CPU throttling messages are an important clue, and are not due to system idle. Please edit your question to include some examples. – Doug Smythies Oct 23 '20 at 15:01
  • @DougSmythies Added, as requested. – John Chrysostom Oct 23 '20 at 15:17
  • There is a high probability that your issue is thermal shutdown. The throttling messages are thermal related, and are the first level of protection. The last level of protection is to shut down the computer. Perhaps your CPU stress test is not stressful enough (different 100% uses of the CPU create different amounts of waste heat; the mprime torture test is the best I have found, and I have tried a great many), or the main heat source might be graphics. Suggest monitoring with sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 6. – Doug Smythies Oct 23 '20 at 16:19
  • my thermal shutdown suggestion is inconsistent with your "and the Gnome desktop stayed visible" experience. – Doug Smythies Oct 23 '20 at 16:44
  • I think the CPU is very likely, though I'm not sure it's overheating. Does the total events = 7775 mean 7775 events were actually generated and logged as one? 'Cause that seems like it would not be normal (I'm hypothesizing). Running mprime torture tests, I still can't get the temperature above 88, which should be within the CPU's working range. I'll pass the info along to the System76 support folks too, though. Given that this often happens when the PC is idle, I'm starting to entertain the possibility that it's the CPU getting throttled down too much, as with the Bay Trail problem... – John Chrysostom Oct 23 '20 at 18:15

2 Answers

0

This is a partial answer, based on current information, including from the comments.

From the log files, there are indications that high CPU temperatures are involved, such that the system keeps hitting its throttling temperature limit. However, CPU stress tests indicate no problem.

As a test, find the system operating point where CPU thermal problems are not possible and run that way for long enough to determine the effect on system stability. The cost of this test will be performance. Later on, a proper thermal daemon (thermald, tlp, ...) should be investigated as a way to recover maximum performance.

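The default CPU frequency scaling driver for the i7-10510U is intel_pstate, and this answer is written for that driver. Check via the following command (a reported value of intel_cpufreq, as on my system below, means the intel_pstate driver is running in passive mode):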

doug@s15:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu1/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu2/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu4/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu5/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu6/cpufreq/scaling_driver:intel_cpufreq
/sys/devices/system/cpu/cpu7/cpufreq/scaling_driver:intel_cpufreq

The mprime (prime95) high heat torture test is used as the CPU stress test because it consumes the most energy of any CPU stress test that I have ever tested. To protect my example computer, which has no thermal daemon running, the desired operating point of about 80 degrees will be found from the low side. First, note the current maximum and minimum CPU frequency percentages (yours will be different):

doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
doug@s15:~$ cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
42

It might not be 100% if some thermal daemon is already limiting things. Anyway, I will start at 50%:

doug@s15:~$ echo 50 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
50

Then gradually raise the maximum CPU frequency percent, say in 10 percent increments, and find the operating point for about 80 degrees processor package temperature:

doug@s15:~$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 6
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.25    1754    725     25      3.81    0.12
0.02    1600    288     26      3.70    0.12
0.06    1600    360     26      3.70    0.12
38.82   1899    7740    39      16.28   0.12
100.00  1900    17594   41      36.20   0.12    <<< mprime torture test started
100.00  1900    17541   42      36.44   0.12
100.00  1900    17552   43      36.39   0.12
100.00  1900    17517   44      36.25   0.12
100.00  1927    17474   48      36.95   0.12
100.00  2300    17389   49      46.51   0.12
100.00  2300    17367   50      46.60   0.12
100.00  2300    17362   52      46.69   0.12
100.00  2300    17438   53      46.77   0.12
100.00  2552    18440   56      54.18   0.12
100.00  2700    17672   58      58.48   0.12
100.00  2700    17590   58      58.59   0.12
100.00  2700    17710   61      58.74   0.12
100.00  2953    17780   66      67.91   0.12
100.00  3100    17876   68      73.38   0.12    <<<< First time at 80%, temp lags.
100.00  3100    17843   69      73.55   0.12
100.00  3100    17860   70      73.64   0.12
100.00  3100    18794   71      73.78   0.12
100.00  3231    17826   77      79.69   0.12
100.00  3500    18305   80      92.33   0.12
100.00  3500    17765   81      92.66   0.12
100.00  3457    17747   80      90.72   0.12
100.00  3300    17720   81      82.62   0.12
100.00  3300    17723   81      82.72   0.12
100.00  3300    17708   80      82.81   0.12
100.00  3300    17712   83      82.95   0.12    <<<< Oops too high
100.00  3300    17788   82      83.08   0.12
100.00  3204    17882   81      79.25   0.12
100.00  3100    17778   80      74.78   0.12
100.00  3100    18571   81      74.83   0.12
100.00  3100    17806   80      74.85   0.12
100.00  3100    17787   80      74.89   0.12    <<<< 80 percent seems stable
100.00  3100    17772   81      74.84   0.12
100.00  3100    17824   81      74.85   0.12
100.00  3100    17777   80      74.89   0.12
100.00  3100    17799   81      74.95   0.12
100.00  3100    17867   81      74.77   0.12

So, for my system, limiting the CPU frequency to 80% of maximum will keep the CPU away from any additional built-in thermal throttling. Run the system this way for a while.
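
The stepping procedure above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names perf_steps and set_max_perf_pct are made up for illustration, the 60 second settle time is an arbitrary guess, and it assumes the intel_pstate sysfs directory used earlier in this answer is present. Keep turbostat running in another terminal and stop as soon as the package temperature reaches your target.

```shell
# Sketch only: step intel_pstate's max_perf_pct upward in fixed increments.
# PSTATE_DIR defaults to the sysfs directory used earlier in this answer; it is
# a variable (an assumption of this sketch) so the helpers can be tried against
# a scratch directory first.
PSTATE_DIR="${PSTATE_DIR:-/sys/devices/system/cpu/intel_pstate}"

# Print the percentages to try, one per line: start, start+step, ... up to stop.
perf_steps() {
    _pct="$1"
    while [ "$_pct" -le "$2" ]; do
        echo "$_pct"
        _pct=$((_pct + $3))
    done
}

# Write one percentage to max_perf_pct (needs root on a real system).
set_max_perf_pct() {
    echo "$1" > "$PSTATE_DIR/max_perf_pct"
}

# Example, run as root with the mprime torture test already started:
#   for pct in $(perf_steps 50 100 10); do
#       set_max_perf_pct "$pct"
#       sleep 60    # assumed settle time; package temperature lags frequency
#   done
```

The setting does not survive a reboot, so a value found this way has to be reapplied at boot if you want to keep running with it.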

Doug Smythies
  • Thanks for this. On the basis of somebody mentioning the similarity to the Bay Trail bug, I set my max cstate to 1 and the system has been running stable since Friday. This would imply that it's not overheating that's causing the issue, but rather dropping into deeper idle states to save power and then crashing when trying to come out of them. Do you agree? (See https://askubuntu.com/questions/803640/system-freezes-completely-with-intel-bay-trail) – John Chrysostom Oct 26 '20 at 12:49
  • I don't know. I do see many idle state related bugs on bugzilla. However, these particular mce messages (machine check exception) are temperature related interrupts inside the processor itself. The issue is that running your system with a max idle state depth of 1 will cost you energy. Myself, I would attempt to isolate whether it is a particular idle state, and also try with HWP disabled and with the acpi-cpufreq driver instead of the intel_pstate driver. – Doug Smythies Oct 26 '20 at 18:28
  • Thanks. I'm now up to 4 days stable with the max cstate set to 1, so I'm getting pretty confident that it's stable. I could always try pushing it further and further until I hit instability again to save as much energy as possible. I'll look into swapping drivers, thanks! – John Chrysostom Oct 27 '20 at 15:34
  • Do you still get any mce messages (machine check exceptions) at all? This one is very interesting, but I suppose you want to actually use your computer rather than investigate forever. (I have spent a lot of time in 2020 on an intel_pstate with HWP issue related to idle state 2 being enabled. With idle state 2 disabled it is fine.) – Doug Smythies Oct 27 '20 at 16:04
0

This is a kernel bug associated with CPU power management. It's fixed in kernel 5.8, which ships with Ubuntu 20.10. I upgraded to 20.10, turned off all the workarounds, and am running stable now.

If upgrading to 5.8/20.10 isn't something you want to do, you can also work around the bug by keeping your CPU from going into lower-power states (this will reduce battery life, obviously). Open /etc/default/grub and add intel_idle.max_cstate=1 to the value of GRUB_CMDLINE_LINUX_DEFAULT. Save, run sudo update-grub, then reboot. To undo the workaround, remove the parameter, run sudo update-grub again, and reboot.
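
For anyone who prefers to script the edit, here is a minimal sketch. The helper name add_kernel_param is made up for illustration, and it assumes GNU sed, a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and a parameter that contains no "/" or "&" characters. It leaves a .bak copy next to the file it edits.

```shell
# Sketch only: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
#   $1 = grub defaults file (for example /etc/default/grub)
#   $2 = parameter to add   (for example intel_idle.max_cstate=1)
add_kernel_param() {
    # Do nothing if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$1" | grep -qF -- "$2"; then
        return 0
    fi
    # Insert the parameter just before the closing quote; keep a .bak backup.
    sed -i.bak "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $2\"/" "$1"
}

# Example on a real system (needs root, e.g. from a root shell):
#   add_kernel_param /etc/default/grub intel_idle.max_cstate=1
#   update-grub
#   reboot
# After the reboot, check that the parameter took effect:
#   cat /proc/cmdline
#   cat /sys/module/intel_idle/parameters/max_cstate
```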

It's possible a cstate value higher than 1 would still be a stable workaround, but I never experimented enough to verify.
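
If you do want to experiment, individual idle states can be switched off at runtime through the kernel's cpuidle sysfs files, without editing grub or rebooting. The sketch below is untested on the Lemur Pro; the helper names list_idle_states and disable_idle_state are made up for illustration. Note that the stateN numbers are sysfs indexes, not C-state names, so check each state's name file before disabling anything.

```shell
# Sketch only: inspect and disable cpuidle states via sysfs.
# CPU_ROOT is a variable (an assumption of this sketch) so the helpers can be
# pointed at a scratch directory for a dry run.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# List the idle states of cpu0 as: index, name, current disable flag.
list_idle_states() {
    for s in "$CPU_ROOT"/cpu0/cpuidle/state*; do
        [ -d "$s" ] || continue
        printf '%s %s disabled=%s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/disable")"
    done
}

# Disable idle state index $1 on every CPU (needs root; lasts until reboot).
disable_idle_state() {
    for f in "$CPU_ROOT"/cpu[0-9]*/cpuidle/"state$1"/disable; do
        if [ -e "$f" ]; then
            echo 1 > "$f"
        fi
    done
}

# Example, from a root shell:
#   list_idle_states
#   disable_idle_state 4    # the index 4 is only an illustration
```

Disabling the deepest state first and working toward shallower ones is one way to find the least costly stable configuration.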

John Chrysostom