I recently installed Ubuntu 16.10, and since then the system has been rebooting by itself. The output of last | grep "Oct 31" is:

aegefel  tty7         :0               Mon Oct 31 15:15    gone - no logout
reboot   system boot  4.8.0-26-generic Mon Oct 31 15:14   still running
aegefel  tty7         :0               Mon Oct 31 15:02 - down   (00:04)
reboot   system boot  4.8.0-26-generic Mon Oct 31 15:02 - 15:06  (00:04)
aegefel  tty7         :0               Mon Oct 31 14:33 - crash  (00:28)
reboot   system boot  4.8.0-26-generic Mon Oct 31 14:33 - 15:06  (00:33)
aegefel  tty7         :0               Mon Oct 31 14:12 - crash  (00:20)
reboot   system boot  4.8.0-26-generic Mon Oct 31 14:12 - 15:06  (00:54)
aegefel  tty7         :0               Mon Oct 31 13:08 - crash  (01:04)
reboot   system boot  4.8.0-26-generic Mon Oct 31 13:08 - 15:06  (01:58)

Which leads me to believe the reboots are caused by crashes.
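For anyone who wants to tally this from a longer history, here is a minimal Python sketch (not part of the original post) that counts how each login session in the last output ended. The names summarize_sessions and SAMPLE are made up for illustration; the sample text is copied from the output above.

```python
import re

# Sample text copied from the `last | grep "Oct 31"` output in the question.
SAMPLE = """\
aegefel  tty7         :0               Mon Oct 31 15:15    gone - no logout
reboot   system boot  4.8.0-26-generic Mon Oct 31 15:14   still running
aegefel  tty7         :0               Mon Oct 31 15:02 - down   (00:04)
reboot   system boot  4.8.0-26-generic Mon Oct 31 15:02 - 15:06  (00:04)
aegefel  tty7         :0               Mon Oct 31 14:33 - crash  (00:28)
reboot   system boot  4.8.0-26-generic Mon Oct 31 14:33 - 15:06  (00:33)
aegefel  tty7         :0               Mon Oct 31 14:12 - crash  (00:20)
reboot   system boot  4.8.0-26-generic Mon Oct 31 14:12 - 15:06  (00:54)
aegefel  tty7         :0               Mon Oct 31 13:08 - crash  (01:04)
reboot   system boot  4.8.0-26-generic Mon Oct 31 13:08 - 15:06  (01:58)
"""


def summarize_sessions(text):
    """Count how each user session ended: 'crash', 'down', or 'other'."""
    counts = {"crash": 0, "down": 0, "other": 0}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "reboot" pseudo-user entries.
        if not fields or fields[0] == "reboot":
            continue
        match = re.search(r"- (crash|down)\b", line)
        counts[match.group(1) if match else "other"] += 1
    return counts


if __name__ == "__main__":
    print(summarize_sessions(SAMPLE))
```

On the sample above this reports three sessions ending in "crash", one in "down", and one still open, which matches the impression that most sessions did not end with a clean shutdown.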

I don't know what causes this, but it has happened when I tried to watch a movie and when I ran a backup.

How should I proceed?

EDIT 1

The command more /var/log/syslog* gives me:

Nov  6 18:18:17 aegefel-Akoya-E6424-MD99850 gnome-terminal-[2674]: Allocating size to GtkBox 0x55558d2b47b0 without calling gtk_widget_get_preferred_width/height(). How does the code know the size to allocate?
Nov  6 18:18:17 aegefel-Akoya-E6424-MD99850 gnome-terminal-[2674]: Allocating size to GtkBox 0x55558d2b47b0 without calling gtk_widget_get_preferred_width/height(). How does the code know the size to allocate?
Nov  6 18:18:31 aegefel-Akoya-E6424-MD99850 gnome-terminal-[2674]: Allocating size to GtkBox 0x55558d2b4120 without calling gtk_widget_get_preferred_width/height(). How does the code know the size to allocate?
Nov  6 18:18:31 aegefel-Akoya-E6424-MD99850 gnome-terminal-[2674]: Allocating size to GtkBox 0x55558d2b4120 without calling gtk_widget_get_preferred_width/height(). How does the code know the size to allocate?
Nov  6 18:18:36 aegefel-Akoya-E6424-MD99850 systemd[1]: Starting Stop ureadahead data collection...
Nov  6 18:18:36 aegefel-Akoya-E6424-MD99850 systemd[1]: Started Stop ureadahead data collection.

After that, nothing was logged for almost a minute, so I suppose the PC rebooted at that point.

The command ls -alt /var/crash gives me for today:

total 21672
drwxrwsrwt  2 root     whoopsie     4096 Nov  6 14:26 .
-rwxrwxrwx  1 root     whoopsie        0 Nov  6 14:26 .lock

EDIT 2

This happens only when my CPU usage is at 40% - 50% or more (my CPU is an Intel Core i5 6267U 2.9GHz).

EDIT 3

The command sensors gives me the following:

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +37.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:         +34.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:         +36.0°C  (high = +100.0°C, crit = +100.0°C)

acpitz-virtual-0
Adapter: Virtual device
temp1:        +38.0°C  (crit = +98.0°C)

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +35.0°C  

The high temperature is equal to the critical one. Maybe my laptop just overheats and the fan doesn't have time to bring the temperature down. I tried to lower the high threshold, but this automatically lowers the critical threshold as well (the critical must be equal to the high).
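As a rough way to see how far each reading is from its critical limit, here is a minimal Python sketch that parses sensors-style text. It is only an illustration: the names headroom, LINE and SAMPLE are made up, the sample is copied from the output above, and the parser assumes the line layout shown there (other chips or lm-sensors versions may print differently).

```python
import re

# Sample text copied from the `sensors` output in the question.
SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +37.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:         +34.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:         +36.0°C  (high = +100.0°C, crit = +100.0°C)

acpitz-virtual-0
Adapter: Virtual device
temp1:        +38.0°C  (crit = +98.0°C)

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +35.0°C
"""

# "<label>: +<temp>°C", optionally followed by "... crit = +<crit>°C".
LINE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r"(?:.*crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C)?"
)


def headroom(text):
    """Return {(chip, label): crit - current} in °C for readings with a crit limit."""
    result = {}
    chip = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" not in line:
            chip = line  # a chip header such as "coretemp-isa-0000"
            continue
        match = LINE.match(line)
        if match and match.group("crit") is not None:
            margin = float(match.group("crit")) - float(match.group("temp"))
            result[(chip, match.group("label"))] = margin
    return result


if __name__ == "__main__":
    for (chip, label), margin in headroom(SAMPLE).items():
        print(f"{chip} / {label}: {margin:.1f} °C below critical")
```

At idle the sample shows 60 °C or more of margin on every sensor that reports a limit, so a single idle reading says little; the interesting numbers are the ones captured under load shortly before a reboot.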

EDIT 4

Here you have

And here are the crashes from 20 November

EDIT 5

After some tests, I think the problem is GPU overheating. In fact, my laptop reboots only when I try to watch a movie, when I test some free games, or when I use Unreal Engine 4. The reason my PC didn't reboot with Blender is that Blender uses the CPU (not the GPU) by default. I have an Intel Iris Graphics 550 (Skylake GT3e). Any ideas?

Aegefel
  • 3
    You might ls -alt /var/crash in terminal, to see the last app that crashed the system. You might more /var/log/syslog* and look for entries JUST BEFORE the actual crash to help determine the cause. – heynnema Oct 31 '16 at 15:14
  • 2
    The temperature looks fine, but you may keep an eye on it for a while and keep track of it using sudo sh -c 'while sleep 15; do clear; sh -c "date --iso-8601='seconds'; sensors -A" | tee -a /var/log/mytemp.log; done' . Could you post /var/log/kern.log and /var/log/kern.log.1 to http://paste.ubuntu.com then their links to the question. – user.dz Nov 25 '16 at 11:25
  • check the logs in /var/log and the temperature. you may need to install some tools for the temperature. –  Nov 25 '16 at 15:52
  • I installed some tools for the temperature, monitored it with watch -n0.1 sensors, and rendered a short animation with Blender. I tried leaving the critical temperature as it is, and I tried raising it to 110°; in both cases the PC didn't crash (in the first case the temperature stayed under 100°, and in the second it sometimes went up to 103° - 104°) – Aegefel Nov 25 '16 at 17:13
  • 1
    Can I suggest you install sysstat and configure it appropriately http://www.leonardoborda.com/blog/how-to-configure-sysstatsar-on-ubuntudebian/. To ensure all available statistics are collected, also modify /etc/sysstat/sysstat and ensure the line with SADC_OPTIONS is modified to SADC_OPTIONS="-S XALL". Later, when the system reboots, you can trace through sar and get to the root cause. Helpful links: http://www.thegeekstuff.com/2011/03/sar-examples/?utm_source=feedburner and http://aionica.computerlink.ro/2011/02/visualize-sar-reports-with-awk-and-gnuplot/. – AnthonyK Nov 26 '16 at 12:28
  • There are more than 1000 errors in your kern.log, among them are 27 segfaults. Have you run a memtest? Tried a previous kernel? – Elder Geek Nov 26 '16 at 13:59
  • Here is the output of the command sar after a crash-reboot. I recently completely reinstalled Ubuntu 16.10. I will try a memtest – Aegefel Nov 26 '16 at 15:49
  • @ElderGeek I just checked disk from the installation image and nothing found.. – Aegefel Nov 26 '16 at 15:57
  • @Aegefel, 103° is too high. The log even has a temperature alert showing that the CPU frequency was throttled (it runs slower, with lower performance): CPU1: Core temperature above threshold, cpu clock throttled (total events = 1), along with this error, which seems related: mce: [Hardware Error]: Machine check events logged – user.dz Nov 26 '16 at 15:59
  • 1
    @Aegefel, Could you check mcelog output as explained in https://askubuntu.com/questions/605369/mce-hardware-error-machine-check-events-logged-appears-in-syslog-what-sho . BTW, Elder Geek means to check memory (RAM) you can get it in grub menu . – user.dz Nov 26 '16 at 16:11
  • Can you update your question with output from nautilus --version? – WinEunuuchs2Unix Nov 26 '16 at 16:23
  • 1
    @Aegefel the output of sar really doesn't tell me anything. I was suggesting that you run memtest86+ (often available on installation media, or if not, at http://www.memtest.org/), as I suspect faulty RAM. – Elder Geek Nov 27 '16 at 22:07
  • @Aegefel, for memtest, I expected all Ubuntu releases to install it by default. Anyway, you can install it using sudo apt-get install memtest86+ , reboot, press the Shift key at boot to get the GRUB menu, and select memtest there. Run a light/quick/short test first; if it is OK, continue with the long/complete test. – user.dz Nov 28 '16 at 08:49
  • I tried the memtest and I didn't get any errors. I also tried cleaning the fan. It is better but didn't solve the problem. The mcelog says it is an overheating problem on CPU 1 and 3 – Aegefel Dec 18 '16 at 16:05

2 Answers

2

If you are truly concerned about the rebooting due to kernel panics, as the title of your post suggests, you can check the file /etc/sysctl.conf for a directive similar to kernel.panic = n, where n is the number of seconds to delay before rebooting in the event of a kernel panic. Research indicates that the kernel is not supposed to reboot on a panic by default.
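Besides looking in /etc/sysctl.conf, the value currently in effect can be read from /proc/sys/kernel/panic. Here is a minimal Python sketch of that check; the function names are made up for illustration. The interpretation follows the kernel's sysctl documentation: zero means no automatic reboot, a positive number is a delay in seconds, and a negative number means an immediate reboot.

```python
from pathlib import Path


def describe_panic_timeout(value):
    """Explain what a kernel.panic value means."""
    if value == 0:
        return "no automatic reboot; the kernel stays halted after a panic"
    if value > 0:
        return f"reboot {value} seconds after a panic"
    return "reboot immediately after a panic"


def read_panic_timeout(path="/proc/sys/kernel/panic"):
    """Return the running kernel's panic timeout, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    timeout = read_panic_timeout()
    if timeout is None:
        print("kernel.panic could not be read on this system")
    else:
        print(f"kernel.panic = {timeout}: {describe_panic_timeout(timeout)}")
```

If this reports 0 and the machine still restarts on its own, the restarts are probably not panic-triggered reboots, which points back toward a hardware cause.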

If instead, as I suspect, you are more concerned with determining the root cause of these reboots (some hardware-related failure, in my opinion), you'll want to review the machine check events to determine what hardware is malfunctioning. If you don't have the file /var/log/mcelog, you may need to install the mcelog package by enabling the Universe repository (if not already enabled in your sources) and issuing the command sudo apt install mcelog. From then on, these events will be logged to /var/log/mcelog.

For clarity, here's an excerpt from man mcelog:

X86 CPUs report errors detected by the CPU as machine check events (MCEs). These can be data corruption detected in the CPU caches, in main memory by an integrated memory controller, data transfer errors on the front side bus or CPU interconnect or other internal errors. Possible causes can be cosmic radiation, instable power supplies, cooling problems, broken hardware, or bad luck.

Most errors can be corrected by the CPU by internal error correction mechanisms. Uncorrected errors cause machine check exceptions which may panic the machine.

More information on the mcelog file format can be found here

Linux systems don't typically reboot due to a kernel panic by default, so you may wish to check the file /etc/sysctl.conf mentioned previously.

Sources:

http://www.techrepublic.com/blog/linux-and-open-source/auto-reboot-linux-after-a-kernel-panic/

http://packages.ubuntu.com

"mce: [Hardware Error]: Machine check events logged" appears in syslog. What should I do?

http://mcelog.org/logfile.html

Based on your mcelog, CPUs 1 and 3 in your system are overheating, throttling down, cooling off, and throttling back up (all of this is by design, to protect the CPU from overheating). The root cause could be poorly applied thermal compound between the CPU and heatsink, a loose heatsink, blocked vents, or overly dusty or failing cooling equipment (the fan?). Another (unlikely) possibility is a failure in the thermal detection capabilities of the CPU.
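To see which cores throttle most often, the throttling messages in the kernel log can be tallied. Here is a minimal Python sketch, assuming the message format quoted in the comments on the question ("CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)"). The name throttle_events is made up, and the second sample line is hypothetical, added only to show more than one CPU.

```python
import re

# Matches the throttling message format quoted in the comments on the question.
THROTTLE = re.compile(
    r"CPU(?P<cpu>\d+): \w+ temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)

SAMPLE_LINES = [
    # Quoted from the comments on the question.
    "CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)",
    # Hypothetical line, for illustration only.
    "CPU3: Core temperature above threshold, cpu clock throttled (total events = 4)",
    # Unrelated lines are ignored.
    "mce: [Hardware Error]: Machine check events logged",
]


def throttle_events(lines):
    """Return {cpu_number: highest 'total events' value seen} for throttling messages."""
    totals = {}
    for line in lines:
        match = THROTTLE.search(line)
        if match:
            cpu = int(match.group("cpu"))
            events = int(match.group("events"))
            totals[cpu] = max(totals.get(cpu, 0), events)
    return totals


if __name__ == "__main__":
    print(throttle_events(SAMPLE_LINES))
```

To run it on a real log, pass an open file instead of the list, for example throttle_events(open("/var/log/kern.log", errors="replace")). Because search is used rather than match, the timestamp and hostname at the start of each real log line do not matter.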

Elder Geek
  • I don't have the kernel.panic line in my /etc/sysctl.conf file. But in the mcelog file, there was this. So it seems to be a hardware problem. Any ideas on how to solve it? – Aegefel Dec 01 '16 at 16:22
  • @Aegefel updated answer – Elder Geek Dec 01 '16 at 21:45
  • Thank you. My fan is actually dusty, but I can't unmount it to clean it. I should give it to someone to clean it for me. It might also be a factory problem (do we say "factory problem"?), so I would have to send it back :( – Aegefel Dec 05 '16 at 21:07
  • @Aegefel You can often successfully clear dust from a fan without unmounting it by using a can of compressed air (air duster) – Elder Geek Dec 06 '16 at 01:51
1

The title of this topic is not clear.

Anyway, if you need help investigating your system crash and none of the previous comments were useful, try these:

  1. Increase kernel log verbosity.
  2. Stop the kernel from automatically rebooting on a crash/panic.
  3. Try to log in remotely (e.g. via ssh) to your system and check the logs.
  4. As @user.dz stated, use e.g. memtest86+ from http://www.memtest.org/ to thoroughly check your RAM.
  5. Because you said this happens only when your CPU is used at 40% - 50% or more, could it be a PSU issue? I mean that your system may require more power than the PSU can supply.
d a i s y
mattia.b89