Premise
I have recently upgraded from 17.04 to 17.10.
I have the following video cards and processor (on a laptop):
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
Intel Core i7 Quad Core Processor 7700HQ (2.8GHz, 3.8GHz Turbo)
In 17.04, I was using the nvidia-375 driver (build 66 if memory serves).
After upgrading, I noticed that my Steam games ran very poorly.
In some cases, some games would seemingly overheat the machine to the point that it automatically turned off.
I have added the graphics-drivers/ppa/ubuntu artful repository and switched to the later nvidia-387 driver, which seems to bring performance back to levels similar to those before my Ubuntu upgrade.
However, some games still seem to overheat my machine and lead to a hard automatic shutdown.
I have tried exploring the logs in /var/log a bit, but I am not knowledgeable enough to tell which information is relevant and which isn't, or whether there is any relevant information in the logs at all in such cases.
I have done the initial due diligence, i.e. checking for dust and that the fans work (no dust, both fans work).
Actual question
I am not asking "how to fix this and make my games work"; I realize how hard that would be to answer, given the context.
However, I would like to understand what the recommended must-gather is for such a situation, so that I can either try to investigate on my own, ask a more specific question here, or (probably more suitable) convey that information to the game vendor and request support.
As mentioned, I strongly suspect this is related to video card drivers or CPU overheating.
Update 1
I have replicated the issue with a few additional Nvidia driver versions. Here are the versions I have tried so far, all of which replicate the issue:
- 375.66 - used to work well in 17.04, laggy graphics in 17.10 and replicates auto-shutdowns
- 384.90 - not tried in 17.04, laggy graphics in 17.10 (but better than 375.66), replicates auto-shutdowns
- 387.12 - seemingly no difference compared to 384.90 within context
I also noticed that every game demanding a CPU speed that my processor can only reach in turbo mode replicates the issue (some seem to take longer than others).
This last finding is interesting, because it means the shutdown is likely triggered once the CPU has spent a certain time in turbo mode, and might not be related to the GPU after all.
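One way I could test the turbo hypothesis is to switch turbo off and see whether the shutdowns stop. Below is a minimal sketch, assuming the CPU is driven by the intel_pstate driver, which exposes a no_turbo flag in sysfs (0 means turbo is allowed, 1 means it is disabled). The helper name turbo_state and its optional path argument are my own, for illustration only.

```shell
#!/usr/bin/env bash
# Report whether turbo is enabled, based on intel_pstate's no_turbo flag.
# Assumption: the intel_pstate driver is in use; otherwise the file is absent.
# turbo_state is a hypothetical helper; the path argument only eases testing.
turbo_state() {
    local flag_file="${1:-/sys/devices/system/cpu/intel_pstate/no_turbo}"
    if [ ! -r "$flag_file" ]; then
        echo "unknown"
        return 1
    fi
    case "$(cat "$flag_file")" in
        0) echo "enabled" ;;
        1) echo "disabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Usage (changing the flag requires root and does not survive a reboot):
#   turbo_state
#   echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
#   turbo_state
```

If the games stay up with turbo disabled, that would point towards the CPU rather than the GPU driver.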
I have grepped for "temperat*" in /var/log, but the only matching entries are from repowerd, and while I don't really understand what they mean, they show a temperature=0.00, which I suspect I can disregard as noise within context.
I'm about to change the thermald logging level and see if there's anything relevant once the issue replicates - will update later.
Update 2
I have replicated the issue after setting up the following debugging processes:
- [as administrator] watch -n10 "sensors >> ~/sensors.log"
- [as administrator] watch -n10 "hddtemp /dev/sda1 >> ~/hddtemp.log"
Tailing those files after starting the machine again indicates the following, seemingly acceptable temperatures:
/dev/sda1: ST1000LX015-1U7172: 37°C
iwlwifi-virtual-0
Adapter: Virtual device
temp1: +54.0°C
acpitz-virtual-0
Adapter: Virtual device
temp1: +79.0°C
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +78.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +77.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +78.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +72.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +75.0°C (high = +100.0°C, crit = +100.0°C)
pch_skylake-virtual-0
Adapter: Virtual device
temp1: +75.5°C
I've grepped thermald logs from the syslog and piped them into another log file for readability.
In my debug-level thermald logs, I've tried looking for "common" patterns (I have no idea how to really read that information), within the time range of the occurrence.
Some entries did not occur close to the occurrence of the shutdown.
My search included keywords like "warn", "error", "fail", "critical", "invalid".
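For reference, the filtering I did can be sketched roughly as below. It assumes thermald's messages end up in /var/log/syslog tagged with the string "thermald"; the function name thermald_suspects is my own.

```shell
#!/usr/bin/env bash
# Print thermald lines from a syslog file that contain any of the keywords
# listed above (case-insensitive).
# thermald_suspects is a hypothetical helper; the path argument defaults to
# /var/log/syslog and can point at any saved copy of the log.
thermald_suspects() {
    local syslog="${1:-/var/log/syslog}"
    grep 'thermald' "$syslog" | grep -iE 'warn|error|fail|critical|invalid'
}

# Usage:
#   thermald_suspects /var/log/syslog > ~/thermald-suspects.log
```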
Here are the only findings I can share - all entries repeat, not necessarily in this order...
sysfs read failed constraint_0_max_power_uw
- occurred before and close to shutdown
dram:powercap RAPL invalid max power limit range
failed to open /dev/acpi_thermal_rel
read_trip_points 1/trip_point_0_type:critical
index 0: type:critical temp:115000 hyst:1 zone id:1 sensor id:1 cdev size:0
Buggy max temp: to close to critical 90000
Core temp DTS :critical 100000, max 90000, psv 95000
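On reading those numbers: as far as I can tell, the kernel's thermal sysfs interface reports temperatures in millidegrees Celsius, so temp:115000 would be a 115°C critical trip point, and 100000 / 90000 / 95000 would be 100, 90 and 95°C. Here is a small sketch to convert such values and to dump the live readings of every thermal zone. The function names are mine, and the directory argument exists only so the function can be pointed at a test tree.

```shell
#!/usr/bin/env bash
# Convert a millidegree Celsius value (as found under /sys/class/thermal) to °C.
millideg_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Print name, type and current temperature of each thermal zone.
# A zone whose temperature cannot be read is reported as "read-failed"
# rather than silently skipped.
list_thermal_zones() {
    local base="${1:-/sys/class/thermal}"
    local zone raw
    for zone in "$base"/thermal_zone*; do
        [ -e "$zone/temp" ] || continue
        if raw=$(cat "$zone/temp" 2>/dev/null); then
            printf '%s %s %s\n' "$(basename "$zone")" \
                "$(cat "$zone/type" 2>/dev/null)" "$(millideg_to_c "$raw")"
        else
            printf '%s read-failed\n' "$(basename "$zone")"
        fi
    done
}

# Usage:
#   millideg_to_c 115000
#   list_thermal_zones
```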
As my initial grep for thermald logs was a little wide, I also bumped into some possibly relevant kernel log entries:
thermal thermal_zone2: failed to read out thermal zone (-5)
- occurred close to shutdown
This would narrow it down to either or both of the entries logged close to the time of the shutdown.
However, I still have no clue how to read that data, or whether I am completely misled in gathering the data in the first place.
Maybe my watch interval should be much shorter?
Maybe there is actually no overheating, but some (kernel?) issue that prevents a proper read of the temperatures?
Any clarification welcome.
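For what it's worth, here is a sketch of how I might tighten the logging: each sample is timestamped and flushed to disk with sync, so the last reading before a hard power-off has a better chance of surviving. The helper name log_snapshot and the 2-second interval are my own choices, not a recommendation I found anywhere.

```shell
#!/usr/bin/env bash
# Append one timestamped snapshot of a command's output to a log file and
# flush it to disk.
# log_snapshot is a hypothetical helper: first argument is the log file,
# the remaining arguments are the command to run (e.g. sensors).
log_snapshot() {
    local logfile="$1"
    shift
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@" 2>&1
    } >> "$logfile"
    sync
}

# Usage: sample every 2 seconds instead of every 10.
#   while true; do log_snapshot ~/sensors.log sensors; sleep 2; done
```

With timestamps in the file, the last sample can be lined up against the thermald and kernel log entries from just before the shutdown.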
Last update, off-topic
I have now reinstalled Ubuntu 17.04.
The issue does not replicate.
The figures from sensors and hddtemp are slightly lower than the ones measured on 17.10, but only slightly.
Note that on 17.04 I need to boot the kernel with the pci=noacpi parameter in order to be able to start up and shut down properly. Maybe it's related...
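To make sure I compare like with like between the two installs, I can check whether the parameter is actually active on the running kernel by looking at /proc/cmdline. A minimal sketch follows; the helper name has_kernel_param is mine, and the file argument is only there for testing.

```shell
#!/usr/bin/env bash
# Succeed if the given parameter appears as a whole word on the kernel
# command line.
# has_kernel_param is a hypothetical helper; it reads /proc/cmdline by default.
has_kernel_param() {
    local param="$1"
    local cmdline_file="${2:-/proc/cmdline}"
    # One parameter per line, then match the full line literally.
    tr -s ' \t' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

# Usage:
#   if has_kernel_param pci=noacpi; then echo "pci=noacpi is active"; fi
```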
I guess I'll stay clueless for now...