Everyone should understand the thermal characteristics of their computer and provide adequate protection. Users are often unaware of how rapidly the processor package temperature can rise under a step-function load. An example from my 20.04 test server:
doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 0.1
PkgTmp PkgWatt
33 1.88
33 1.69
33 1.56
33 1.74
49 24.99 800 degrees per second
57 133.28 80 degrees per second
61 133.66 40 degrees per second
61 132.58 0 degrees per second
63 133.57
64 134.12
The load was applied about 4/5ths of the way through the sample interval (25 / (133.5 - 1.7) ~= 20%, so only the last fifth of the 0.1 second sample was under load), and the temperature had already risen 16 degrees in that roughly 0.02 seconds, or 800 degrees per second. The load was the prime95 torture test, using the maximum heat sub-test. The example computer is water cooled, with the water pump always running at maximum rate. The processor is an i5-10600K.
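The arithmetic above can be reproduced with a short shell sketch. The function names `load_fraction` and `deg_per_sec` are my own, for illustration only, and the inputs are the approximate figures read from the turbostat output above.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any tool).

# Fraction of a sample interval spent under load, estimated the same way
# as in the text: sample power / (loaded power - idle power).
load_fraction() {
    awk -v sample="$1" -v loaded="$2" -v idle="$3" \
        'BEGIN { printf "%.2f\n", sample / (loaded - idle) }'
}

# Rate of temperature rise: degrees risen / seconds elapsed.
deg_per_sec() {
    awk -v rise="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", rise / secs }'
}

load_fraction 25 133.5 1.7   # prints 0.19, i.e. roughly 20% of the 0.1 s sample
deg_per_sec 16 0.02          # 33 -> 49 in about 0.02 s: prints 800
deg_per_sec 8 0.1            # 49 -> 57 over a full sample: prints 80
deg_per_sec 4 0.1            # 57 -> 61 over a full sample: prints 40
```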
For ASUS motherboards, note that the CPU fan sensor is actually an external thermistor, and its reading lags the actual processor package temperature in both time and value. On my ASUS motherboard, under heavy load, the CPU fan sensor reads 12 degrees below the actual processor temperature.
In the end, it is possible for the processor package temperature to hit the shutdown limit so fast that various monitoring programs or daemons don't even notice. Sometimes thermal protection needs to react sooner to have time to take effect before any overshoot temperature triggers a shutdown.
Method 1: Thermald
For `/etc/thermald/thermal-conf.xml` use the very basic and simple configuration, as per the `man thermal-conf.xml` page:
<?xml version="1.0"?>
<!--
use "man thermal-conf.xml" for details
-->
<!-- BEGIN -->
<ThermalConfiguration>
<Platform>
<Name>Overide CPU default passive</Name>
<ProductName>*</ProductName>
<Preference>QUIET</Preference>
<ThermalZones>
<ThermalZone>
<Type>cpu</Type>
<TripPoints>
<TripPoint>
<Temperature>41000</Temperature>
<type>passive</type>
</TripPoint>
</TripPoints>
</ThermalZone>
</ThermalZones>
</Platform>
</ThermalConfiguration>
<!-- END -->
Note: I am using a ridiculously low trip point of 41 degrees because my system is water cooled and I cannot otherwise reach temperatures suitable for the example.
doug@s19:~$ sudo systemctl start thermald
doug@s19:~$ sudo systemctl status thermald
● thermald.service - Thermal Daemon Service
Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2021-11-05 07:41:45 PDT; 17s ago
Main PID: 3461 (thermald)
Tasks: 2 (limit: 38214)
Memory: 2.2M
CGroup: /system.slice/thermald.service
└─3461 /usr/sbin/thermald --systemd --dbus-enable --adaptive
Nov 05 07:41:45 s19 systemd[1]: Starting Thermal Daemon Service...
Nov 05 07:41:45 s19 systemd[1]: Started Thermal Daemon Service.
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: 22 CPUID levels; family:model:stepping 0x6:a5:5 (6:165:5)
Nov 05 07:41:45 s19 thermald[3461]: Polling mode is enabled: 4
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: sensor id 5 : No temp sysfs for reading raw temp
Nov 05 07:41:45 s19 thermald[3461]: XML zone: invalid sensor type []
While the thermald status output shows some complaints, it actually works properly, although it is a little slow to respond:
doug@s19:~$ sudo turbostat --quiet --Summary --show PkgWatt,PkgTmp --interval 1
PkgTmp PkgWatt
33 1.44
33 1.34
33 1.33
58 63.26
61 114.43
61 114.68
48 86.59
47 55.48
47 55.53
41 42.77
43 33.43
41 34.30
41 28.04
43 33.63
40 34.45
44 33.57
41 34.40
44 33.85
34 14.50
34 1.33
34 1.33
Adjust the trip point as needed to get the most out of your system while still preventing any temperature overshoot from causing a shutdown. Too low a trip point might reduce system performance to undesirable levels.
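When experimenting with different trip points, a small helper can save some hand editing. The following is only a sketch: `set_trip_point` is a hypothetical function name, the 75 degree value in the usage comment is an arbitrary example, and it assumes the file holds a single trip point whose value is in millidegrees Celsius, as in the configuration above. It rewrites every `<Temperature>` element it finds, so it is not suitable for files with several different trip points.

```shell
#!/bin/sh
# Hypothetical helper: set the <Temperature> value(s) in a thermald
# configuration file. Argument 2 is whole degrees Celsius; the file
# stores millidegrees Celsius.
set_trip_point() {
    conf="$1"
    celsius="$2"
    millic=$((celsius * 1000))
    # Write to a new file first, then replace the original only on success.
    sed "s|<Temperature>[0-9]*</Temperature>|<Temperature>${millic}</Temperature>|" \
        "$conf" > "$conf.new" && mv "$conf.new" "$conf"
}

# Example (run as root; then restart thermald so it re-reads the file):
#   set_trip_point /etc/thermald/thermal-conf.xml 75
#   systemctl restart thermald
```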
Method 2: TCC Offset
If your kernel is new enough and your processor is supported, the TCC offset can be used to have the processor itself do the thermal throttling. Depending on the timing window parameters, the response can be much faster than with thermald. For this example, the timing window was set in BIOS to the fastest response time.
First, find which cooling device:
doug@s19:~$ grep . /sys/devices/virtual/thermal/cooling_device*/type
/sys/devices/virtual/thermal/cooling_device0/type:Fan
/sys/devices/virtual/thermal/cooling_device10/type:Processor
/sys/devices/virtual/thermal/cooling_device11/type:Processor
/sys/devices/virtual/thermal/cooling_device12/type:Processor
/sys/devices/virtual/thermal/cooling_device13/type:Processor
/sys/devices/virtual/thermal/cooling_device14/type:Processor
/sys/devices/virtual/thermal/cooling_device15/type:Processor
/sys/devices/virtual/thermal/cooling_device16/type:Processor
/sys/devices/virtual/thermal/cooling_device17/type:intel_powerclamp
/sys/devices/virtual/thermal/cooling_device18/type:TCC Offset
/sys/devices/virtual/thermal/cooling_device1/type:Fan
/sys/devices/virtual/thermal/cooling_device2/type:Fan
/sys/devices/virtual/thermal/cooling_device3/type:Fan
/sys/devices/virtual/thermal/cooling_device4/type:Fan
/sys/devices/virtual/thermal/cooling_device5/type:Processor
/sys/devices/virtual/thermal/cooling_device6/type:Processor
/sys/devices/virtual/thermal/cooling_device7/type:Processor
/sys/devices/virtual/thermal/cooling_device8/type:Processor
/sys/devices/virtual/thermal/cooling_device9/type:Processor
It is device 18. Set the offset and then check it via turbostat without the --quiet option:
doug@s19:~$ echo 59 | sudo tee /sys/devices/virtual/thermal/cooling_device18/cur_state
59
doug@s19:~$ sudo /home/doug/temp-k-git/linux/tools/power/x86/turbostat/turbostat --Summary --show Bzy_MHz,PkgWatt,PkgTmp --interval 0.1
turbostat version 21.05.04 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 0x16 CPUID levels
CPUID(1): family:model:stepping 0x6:a5:5 (6:165:5) microcode 0xec
...
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x3b641422 (41 C) (100 default - 59 offset)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x883f0800 (37 C)
...
Bzy_MHz PkgTmp PkgWatt
800 33 1.35
800 33 1.34
800 34 1.40
4187 49 86.23
4100 52 91.72
4100 53 91.29
...
Notice that the throttling is virtually immediate; 4.8 GHz would have been the unthrottled CPU frequency. Note that the throttling limit for my processor (not all processors) is the non-turbo maximum clock frequency of 4.1 GHz, so it cannot throttle far enough to actually reach the ridiculously low limit of 41 degrees.
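The offset written to `cur_state` is simply the default TCC activation temperature minus the temperature at which throttling should start, as the turbostat line "(100 default - 59 offset)" shows. A minimal sketch of that arithmetic follows; `tcc_offset` is a hypothetical helper name, and the default of 100 degrees is specific to my processor, so check your own turbostat output first.

```shell
#!/bin/sh
# Hypothetical helper: offset = default TCC temperature - desired limit.
# Both arguments are whole degrees Celsius.
tcc_offset() {
    tcc="$1"
    target="$2"
    if [ "$target" -ge "$tcc" ] || [ "$target" -lt 0 ]; then
        echo "target must be between 0 and $((tcc - 1))" >&2
        return 1
    fi
    echo $((tcc - target))
}

tcc_offset 100 41   # prints 59, the offset used in the example above
tcc_offset 100 45   # prints 55
```

The kernel may reject offsets larger than the cooling device supports; the `max_state` file next to `cur_state` should show the largest accepted value.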
EDIT 1: To have the TCC offset set automatically on boot, a post-boot service is suggested.
EDIT 2: The TCC offset will be lost after a suspend/resume cycle. The modified post-boot service is now below.
doug@s19:~$ cat /etc/systemd/system/doug-post-boot-s19.service
[Unit]
Description=Doug post boot script for s19
After=suspend.target
[Service]
ExecStart=/home/doug/post-boot/doug-post-boot-s19.sh
[Install]
WantedBy=multi-user.target suspend.target
Where the called script is:
doug@s19:~$ cat post-boot/doug-post-boot-s19.sh
#!/bin/sh
#
# doug-post-boot-s19.sh 2022.04.11
# For s19 Ubuntu test server.
# Often this service is disabled, because
# it isn't really needed.
#
# Do desired post boot changes and configurations.
logger "doug-post-boot-s19.sh - begin..."
/home/doug/post-boot/doug-set-tcc-offset.sh
# insert other scripts here, as required.
logger "doug-post-boot-s19.sh - exiting, done..."
exit 0
and the script it calls is:
doug@s19:~$ cat post-boot/doug-set-tcc-offset.sh
#!/bin/sh
#
# doug-set-tcc-offset.sh 2022.04.11
# Set the desired TCC offset.
# Requires a new enough kernel.
# Note: The cooling device number is not
# guaranteed not to change. This script
# should be improved to auto figure out
# the proper cooling device. (I do not
# know how.)
logger "doug-set-tcc-offset.sh - begin..."
# TCC is 100 degrees for s19.
# Therefore a trip point of 45 degrees requires an offset of 55.
echo 55 > /sys/devices/virtual/thermal/cooling_device18/cur_state
logger "doug-set-tcc-offset.sh - exiting, done..."
exit 0
Note: it might not make sense to the reader that I merely call another script from the first one. Indeed it is not required for this server, but on other servers I have many post-boot things to do, and it makes it cleaner this way.
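On the script comment about figuring out the proper cooling device automatically: one possible approach, sketched below and not tested across reboots or kernel versions, is to search the `type` files for the string `TCC Offset`, the same way the earlier `grep` listing was read by eye. The function name `find_tcc_device` is hypothetical, and the optional base directory argument exists only so the function can be tried against a test directory.

```shell
#!/bin/sh
# Hypothetical helper: print the sysfs directory of the cooling device
# whose type is "TCC Offset". Returns non-zero if none is found.
find_tcc_device() {
    base="${1:-/sys/devices/virtual/thermal}"
    for type_file in "$base"/cooling_device*/type; do
        [ -r "$type_file" ] || continue
        if [ "$(cat "$type_file")" = "TCC Offset" ]; then
            dirname "$type_file"
            return 0
        fi
    done
    return 1
}

# Possible use in doug-set-tcc-offset.sh (must run as root):
#   dev=$(find_tcc_device) || { logger "TCC Offset device not found"; exit 1; }
#   echo 55 > "$dev/cur_state"
```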
Once everything is set up, enable the service:
doug@s19:~$ sudo systemctl enable doug-post-boot-s19.service
Created symlink /etc/systemd/system/multi-user.target.wants/doug-post-boot-s19.service → /etc/systemd/system/doug-post-boot-s19.service.
Test by rebooting, and then check the service status:
doug@s19:~$ sudo systemctl status doug-post-boot-s19.service
[sudo] password for doug:
● doug-post-boot-s19.service - Doug post boot script for s19
Loaded: loaded (/etc/systemd/system/doug-post-boot-s19.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2022-04-11 17:19:19 PDT; 18s ago
Process: 839 ExecStart=/home/doug/post-boot/doug-post-boot-s19.sh (code=exited, status=0/SUCCESS)
Main PID: 839 (code=exited, status=0/SUCCESS)
Apr 11 17:19:19 s19 systemd[1]: Started Doug post boot script for s19.
Apr 11 17:19:19 s19 systemd[1]: doug-post-boot-s19.service: Succeeded.
And finally check that the offset actually was set:
doug@s19:~$ cat /sys/devices/virtual/thermal/cooling_device18/cur_state
55
sudo /turbostat --Summary --quiet --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,RAMWatt,GFXWatt,CorWatt --interval 6
. My i5-10600K is from the same era, and I had to enable HWE on 20.04 server to use a newer kernel. Suggest you try a newer kernel, just as a test. – Doug Smythies Nov 04 '21 at 22:51