12

I use a ThinkPad L13 and have thermal issues, especially under full load. When I run my Python program, which utilizes all cores, the laptop shuts down shortly afterwards.

What have I tried so far? I installed TLP and thermald on my machine. Furthermore, I changed the Intel settings in BIOS to "Balanced".

Recently, two things took place:

  1. I installed Ubuntu 20.04.

  2. Due to graphical issues with my ThinkPad, the mainboard was replaced recently. Maybe it's a hardware issue, e.g. the cooler doesn't fit properly?

Before that, no problem occurred.

The command grep -i -e temp -e therm /var/log/syslog* produces the following output on this occasion:

Apr 29 09:20:50 omikron systemd[1]: Started Daily Cleanup of Temporary Directories.
Apr 29 09:20:50 omikron systemd[1]: Starting Thermal Daemon Service...
Apr 29 09:20:50 omikron kernel: [    0.221560] mce: CPU0: Thermal monitoring enabled (TM1)
Apr 29 09:20:50 omikron kernel: [    0.376125] ACPI: \_SB_.PR00: _OSC native thermal LVT Acked
Apr 29 09:20:50 omikron kernel: [    0.539054] thermal_sys: Registered thermal governor 'fair_share'
Apr 29 09:20:50 omikron kernel: [    0.539055] thermal_sys: Registered thermal governor 'bang_bang'
Apr 29 09:20:50 omikron kernel: [    0.539056] thermal_sys: Registered thermal governor 'step_wise'
Apr 29 09:20:50 omikron kernel: [    0.539056] thermal_sys: Registered thermal governor 'user_space'
Apr 29 09:20:50 omikron kernel: [    0.539057] thermal_sys: Registered thermal governor 'power_allocator'
Apr 29 09:20:50 omikron kernel: [    0.725855] thermal LNXTHERM:00: registered as thermal_zone0
Apr 29 09:20:50 omikron kernel: [    0.725856] ACPI: Thermal Zone [THM0] (31 C)
Apr 29 09:20:50 omikron kernel: [    2.056100] proc_thermal 0000:00:04.0: enabling device (0000 -> 0002)
Apr 29 09:20:50 omikron kernel: [    2.147392] proc_thermal 0000:00:04.0: Creating sysfs group for PROC_THERMAL_PCI
Apr 29 09:20:50 omikron kernel: [    2.412750] thermal thermal_zone5: failed to read out thermal zone (-61)
Apr 29 09:20:50 omikron sensors[826]: temp1:            N/A
Apr 29 09:20:50 omikron sensors[826]: coretemp-isa-0000
Apr 29 09:20:50 omikron sensors[826]: temp1:         +1.0°C
Apr 29 09:20:50 omikron sensors[826]: temp2:         +1.0°C
Apr 29 09:20:50 omikron sensors[826]: temp3:         +4.0°C
Apr 29 09:20:50 omikron sensors[826]: temp4:         +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp5:       +121.0°C
Apr 29 09:20:50 omikron sensors[826]: temp6:       +121.0°C
Apr 29 09:20:50 omikron sensors[826]: temp7:         +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp8:         +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp9:        +64.0°C
Apr 29 09:20:50 omikron sensors[826]: temp10:        +3.0°C
Apr 29 09:20:50 omikron sensors[826]: temp11:       -80.0°C
Apr 29 09:20:50 omikron sensors[826]: temp12:        +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp13:        +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp14:        +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp15:        +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp16:        +0.0°C
Apr 29 09:20:50 omikron sensors[826]: temp1:        +48.0°C  (crit = +98.0°C)
Apr 29 09:20:50 omikron thermald[822]: [WARN]22 CPUID levels; family:model:stepping 0x6:8e:c (6:142:12)
Apr 29 09:20:50 omikron thermald[822]: [WARN]Polling mode is enabled: 4
Apr 29 09:20:50 omikron thermald[822]: [WARN]sensor id 10 : No temp sysfs for reading raw temp
Apr 29 09:20:50 omikron thermald[822]: message repeated 2 times: [ [WARN]sensor id 10 : No temp sysfs for reading raw temp]
Apr 29 09:20:50 omikron thermald[822]: I/O warning : failed to load external entity "/etc/thermald/thermal-conf.xml"
Apr 29 09:20:50 omikron thermald[822]: [WARN]error: could not parse file /etc/thermald/thermal-conf.xml
Apr 29 09:20:50 omikron thermald[822]: [WARN]sysfs open failed
Apr 29 09:20:50 omikron thermald[822]: I/O warning : failed to load external entity "/etc/thermald/thermal-conf.xml"
Apr 29 09:20:50 omikron thermald[822]: [WARN]error: could not parse file /etc/thermald/thermal-conf.xml
Apr 29 09:20:50 omikron systemd[1]: Started Thermal Daemon Service.
Apr 29 09:20:50 omikron thermald[822]: I/O warning : failed to load external entity "/etc/thermald/thermal-conf.xml"
Apr 29 09:20:50 omikron thermald[822]: [WARN]error: could not parse file /etc/thermald/thermal-conf.xml
Apr 29 09:21:04 omikron gsd-print-notif[1262]: Source ID 3 was not found when attempting to remove it
Apr 29 09:29:01 omikron kernel: [  493.759292] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759293] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759295] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759296] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759298] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759299] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759300] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759302] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759326] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.759327] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 09:29:01 omikron kernel: [  493.760277] mce: CPU4: Core temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760278] mce: CPU0: Core temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760279] mce: CPU5: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760280] mce: CPU1: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760281] mce: CPU6: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760282] mce: CPU2: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760283] mce: CPU0: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760284] mce: CPU4: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760317] mce: CPU7: Package temperature/speed normal
Apr 29 09:29:01 omikron kernel: [  493.760318] mce: CPU3: Package temperature/speed normal
Apr 29 09:35:50 omikron systemd[1]: Starting Cleanup of Temporary Directories...
Apr 29 09:35:50 omikron systemd[1]: Finished Cleanup of Temporary Directories.
Apr 29 10:14:58 omikron kernel: [ 3250.661431] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 10:14:58 omikron kernel: [ 3250.661431] mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)
Apr 29 10:14:58 omikron kernel: [ 3250.661433] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661434] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661435] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661436] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661437] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661438] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661438] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.661440] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 196)
Apr 29 10:14:58 omikron kernel: [ 3250.665320] mce: CPU3: Core temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665321] mce: CPU7: Core temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665322] mce: CPU2: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665323] mce: CPU0: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665324] mce: CPU4: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665325] mce: CPU5: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665325] mce: CPU6: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665326] mce: CPU1: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665327] mce: CPU7: Package temperature/speed normal
Apr 29 10:14:58 omikron kernel: [ 3250.665328] mce: CPU3: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.746988] mce: CPU4: Core temperature above threshold, cpu clock throttled (total events = 323)
Apr 29 10:20:05 omikron kernel: [ 3557.746989] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 323)
Apr 29 10:20:05 omikron kernel: [ 3557.746991] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.746992] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.746993] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.746994] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.747022] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.747023] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.747025] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.747026] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 650)
Apr 29 10:20:05 omikron kernel: [ 3557.749589] mce: CPU4: Core temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749590] mce: CPU0: Core temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749591] mce: CPU7: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749591] mce: CPU3: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749592] mce: CPU0: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749593] mce: CPU4: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749625] mce: CPU5: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749626] mce: CPU1: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749627] mce: CPU6: Package temperature/speed normal
Apr 29 10:20:05 omikron kernel: [ 3557.749628] mce: CPU2: Package temperature/speed normal
Apr 29 10:23:09 omikron kernel: [ 3741.654959] thermal thermal_zone0: critical temperature reached (100 C), shutting down
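
A log like the one above can be boiled down to a few counts. The following is only a minimal sketch and not part of the original output: the function name `thermal_summary` is made up, and the match strings are simply copied from the kernel messages quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count the thermal events in a syslog-style file.
# The patterns are copied from the kernel messages quoted in the question.
thermal_summary() (
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    throttled=$(grep -c 'cpu clock throttled' "$log" || true)
    normal=$(grep -c 'temperature/speed normal' "$log" || true)
    critical=$(grep -c 'critical temperature reached' "$log" || true)
    echo "throttled=$throttled normal=$normal critical=$critical"
)
```

It would be called as `thermal_summary /var/log/syslog`. A non-zero `critical` count means the kernel itself triggered at least one of the shutdowns.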

EDIT (05/01/2020):

Today I had a Zoom meeting, and the laptop got so hot that it turned off during the meeting. This is not what should happen, right? What is going on here? I was not running any demanding computation. Perhaps it has something to do with the power supply, since it was plugged in?


EDIT (05/09/2020):

I set the performance settings to the maximum level and ran the same stress test that is used in various temperature reviews of my notebook. On Windows, I get values similar to theirs. Therefore, I think it must be an issue with the new Ubuntu 20.04. Somehow, Ubuntu won't throttle the frequency so that the temperature goes down.


EDIT (07/19/2020):

I contacted Lenovo support and they repaired my notebook (whatever they did). It worked fine for a couple of weeks. Now I have the same issue again.

I've updated my BIOS version, which helps but comes with another issue: the CPU throttles down to 400 MHz as soon as the temperature gets close to overheating. As a result, my notebook is barely usable for demanding tasks.

As a possible solution, I deactivated Intel's turbo boost. The temperatures are now in a tolerable range and everything works smoothly enough. That's a compromise I am willing to make.
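
One way to switch turbo boost off from within Ubuntu is sketched below. It assumes the `intel_pstate` driver is active (as suggested in the comments), in which case the kernel exposes a `no_turbo` file. The function name `set_turbo` and its optional second argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: switch turbo boost "on" or "off" via intel_pstate.
# Assumes the intel_pstate driver is in use; the second argument (the knob
# path) exists only so the function can be tried on a scratch file.
set_turbo() (
    state="$1"
    knob="${2:-/sys/devices/system/cpu/intel_pstate/no_turbo}"
    case "$state" in
        off) value=1 ;;   # no_turbo=1 disables turbo boost
        on)  value=0 ;;   # no_turbo=0 allows it again
        *)   echo "usage: set_turbo on|off [knob]" >&2; exit 2 ;;
    esac
    if [ -w "$knob" ]; then
        echo "$value" > "$knob"
    else
        # The sysfs file is writable by root only, so fall back to sudo.
        echo "$value" | sudo tee "$knob" > /dev/null
    fi
)
```

Calling `set_turbo off` applies the setting until the next reboot; values written to sysfs are not persistent.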

YoungMath
  • 121
  • 1
    Well, your thermald config file, /etc/thermald/thermal-conf.xml, seems to be broken. I cannot decode your CPU ID into a make and model number; could you provide it? – Doug Smythies Apr 29 '20 at 18:07
  • How can I fix this? What should I do to obtain this number? – YoungMath Apr 29 '20 at 19:30
  • grep "model name" /proc/cpuinfo. I guess post your /etc/thermald/thermal-conf.xml file. – Doug Smythies Apr 29 '20 at 19:40
  • Model name: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz and the file /etc/thermald/thermal-conf.xml doesn't even exist. – YoungMath Apr 29 '20 at 19:46
  • 2
    By default, you should be using the intel-pstate CPU frequency driver. To check: cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver. Then temporarily force a lower maximum CPU frequency, and therefore less heat, say 70%: echo 70 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct. Myself, I implement thermal control by limiting processor package power. See here. I think your processor can adjust that limit, but I'm not sure. – Doug Smythies Apr 29 '20 at 21:08
  • Thanks! However, this doesn't explain what is going on here, does it? I mean, I didn't have any problems before... – YoungMath Apr 30 '20 at 07:32
  • 2
    The purpose of forcing the maximum CPU frequency to be lower is to make your system stable enough that you can investigate further without the worry of thermal stress on your processor. – Doug Smythies Apr 30 '20 at 14:40
  • Okay, sorry, I didn't make myself clear. I was aware of the purpose and it is a good temporary solution. However, the problem did not occur beforehand. So something must be wrong. And I want to find out what it is. – YoungMath Apr 30 '20 at 14:47
  • You mention that you had no issues "before". The two changes/events separating "before" from "after" are mentioned in the question. Did both things happen at the same time? I guess you are trying to diagnose if the issue is related to software or hardware, so you know what chances you have for asking whoever fixed the hardware part to do it right... – sancho.s ReinstateMonicaCellio May 02 '20 at 10:32
  • I have installed the most recent linux kernel on my laptop. So far, no heating issue occurs. Let's wait one or two hours. Then I can tell more about it. But from this point of view, this seems to be a software issue. If that fails and my computer shuts down again, I will try your suggestion. – YoungMath May 02 '20 at 11:04
  • It seems, the installation of Ubuntu's kernel 5.6.7 has settled the overheating. Can anyone confirm this? – YoungMath May 02 '20 at 20:14
  • I can only guess that somehow your thermal daemon was broken and now it's not. do you still see those messages in syslog? Can you correlate those messages with the kernel you are running? – Doug Smythies May 02 '20 at 21:29
  • Best guess (based on limited information): The older version of Ubuntu didn't know about your very new processor and so used generic stuff; then the newer kernel did know, but it had a mistake. The mistake got fixed and backported to kernel 5.6.5, "Tiger Lake's new unique ACPI device IDs for Intel thermal driver are not valid because of missing 'C' in the IDs. Fix the IDs by updating them." We would have to bisect the kernel to know for certain. Twice actually, once to isolate the works-to-broken commit and once to isolate the broken-to-fixed commit. – Doug Smythies May 02 '20 at 23:19
  • To be honest, I used a newer kernel on my old Ubuntu, too, due to Bluetooth and WLAN issues. Thus, it could really be the kernel which doesn't know my cpu. – YoungMath May 02 '20 at 23:45
  • Apparently, my approach didn't help. I have no idea what the issue is here. The temperatures are below 90°C and still, the laptop shuts down due to high temperatures. I will bring the notebook back to the shop and ask for help. Could be a hardware issue. – YoungMath May 11 '20 at 15:10
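
Since the shutdown in the log is triggered by `thermal_zone0` rather than by a core temperature, it may help to compare each thermal zone's reading with its critical trip point. This is a minimal sketch assuming the standard `/sys/class/thermal` layout, where temperatures are given in millidegrees Celsius; the function name `list_zones` and its optional base-directory argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print each thermal zone with its current temperature
# and, if present, its critical trip point, in degrees Celsius.
# The base directory is a parameter only so the function can be tested.
list_zones() (
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones cannot be read (compare the "-61" error in the log); skip them.
        temp=$(cat "$zone/temp" 2>/dev/null) || continue
        type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        crit="n/a"
        for t in "$zone"/trip_point_*_type; do
            [ -r "$t" ] || continue
            if [ "$(cat "$t")" = "critical" ]; then
                # trip_point_N_type has a sibling file trip_point_N_temp
                crit=$(( $(cat "${t%_type}_temp") / 1000 ))
            fi
        done
        echo "${zone##*/} $type temp=$(( temp / 1000 )) crit=$crit"
    done
)
```

Running `list_zones` under load shows which zone approaches its critical value, which is not necessarily the one that `sensors` reports for the CPU cores.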

3 Answers

1

A full diagnosis of a combined hardware and software system is hard to perform via askubuntu in your case. Hardware issues are particularly difficult.

An alternative for a first step in the diagnosis may be provided by installing another OS side-by-side with your Ubuntu 20.04, and performing intensive testing as well.

You could run the same Python program (if you can configure it to use all cores). Even so, it might not run under exactly the same conditions in which you see the shutdowns. There are quite a few applications for testing performance out there, and they should be good enough (or even more stringent than your program). And the test would not be "contaminated" by your possible Ubuntu 20.04 configuration.

Later on, when the full diagnosis is finished, you can get rid of this OS and reclaim the space for your Ubuntu.
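
If no dedicated benchmark is at hand, a crude all-core load can be produced with a few lines of shell. This is only a sketch and a stand-in for a real stress-testing tool; the function name `burn_cpus` is made up, and it relies on `timeout` from coreutils.

```shell
#!/bin/sh
# Hypothetical helper: keep N cores busy for a fixed number of seconds.
# This is a crude stand-in for a real stress-testing tool.
burn_cpus() (
    workers="$1"
    seconds="$2"
    i=0
    while [ "$i" -lt "$workers" ]; do
        # Each worker spins in an empty loop until timeout(1) kills it.
        timeout "$seconds" sh -c 'while :; do :; done' > /dev/null 2>&1 &
        i=$(( i + 1 ))
    done
    wait   # returns once every worker has been stopped
    echo "ran $workers workers for ${seconds}s"
)
```

For example, `burn_cpus "$(nproc)" 60` loads all cores for a minute while `sensors` is watched in a second terminal. On a machine with this problem it may trigger the shutdown, so save your work first.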

0

Try this:

mkdir ~/helper

curl https://raw.githubusercontent.com/Sepero/temp-throttle/stable/temp_throttle.sh -o ~/helper/temp_throttle.sh

chmod +x ~/helper/temp_throttle.sh

cat <<EOF > ~/helper/temp_down.sh
#!/bin/bash
/usr/bin/sudo -H -S <<< "yourpassword" -p GNOME_SUDO_PASS -u root bash -c '$HOME/helper/temp_throttle.sh 65'
EOF

chmod +x ~/helper/temp_down.sh

Test it with:

  bash ~/helper/temp_down.sh

This is only to test whether it works; I don't recommend storing your password in easily accessible text files.

You can add it to startup applications.

kenn
  • 5,162
0

A BIOS update actually solved the problem.

YoungMath
  • 121