tl;dr: I have a machine running Ubuntu Server that I want to run 24/7, but it tends to shut down every day. What can I look into? I have checked a few things (see below), but I have not been able to fix it and I am running out of ideas :)

Context

The Problem

I have a new machine (Lenovo P340) that I want to run without interruption. It is a desktop that is permanently connected to power; more precisely, to a UPS, although power outages are not a problem where I live.

The machine runs Ubuntu Server and hosts Docker containers (mostly web services), but after a few hours it tends to "shut down" by itself. I would love to know how to fix this.

What I mean by Shutdown

When I say shutdown, I mean that I cannot interact with the machine: none of the Docker apps run, and I cannot SSH into it. If I connect a keyboard and press Enter, or try to switch virtual consoles with Ctrl+Alt+F1/F3, nothing happens. I do not think it powers off completely; it may be entering some power-saving mode, since the power light on the desktop stays on.

This happens while the Docker apps are running, and sometimes even while I am connected via SSH or Samba (connected, but without any actual user input).

I have never seen it happening while I am actually active on the machine (e.g. while I am on ssh running commands or executing things). This is what makes me think that this might be related to power management. However, it could just be that it didn't happen at the same time. The shutdown happens around once per day and at different times. It can happen in the morning, afternoon or evening.

The only way I've found to "get out" of it is to hold the power button down until the machine turns off and then switch it on again. One soft press (to perhaps "unlock" it) didn't seem to work, but I can't be sure without a proper way to interact with the machine.

Troubleshooting

I have been reading, testing and collecting data. Putting some of the things I've tried below.

Hypothesis: System not up to date

I currently have Ubuntu 20.04.1 LTS (GNU/Linux 5.6.0-1042-oem x86_64) and I periodically run updates on the machine. I've done it as part of this troubleshooting.

Hypothesis: Memory is overloaded and system shuts down

One of the most common causes I've found mentioned is memory or CPU overload. The machine is new: it has an Intel Core i9-10900 (2.8 GHz, 10 cores, vPro) and 64GB of RAM, and I am only running a few (~10) containers.

I have also been taking snapshots of memory usage every 15 minutes and storing them. This is an example of top -b -o %MEM -n 1 > top.txt just before the machine stopped.

top - 06:30:01 up 1 day, 11:10,  0 users,  load average: 0.05, 0.03, 0.02
Tasks: 511 total,   1 running, 508 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0.6 us,  0.3 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64033.5 total,  43098.9 free,   7544.9 used,  13389.7 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  56011.2 avail Mem 

Memory usage is about 8GB out of 64GB. The only thing that catches my attention is the 2 zombie tasks; otherwise, none of the processes use many resources.
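To put a number on a snapshot like this, the summary line can be parsed directly. This is only a minimal sketch, assuming top prints the "MiB Mem" line in the layout shown above; the helper name mem_used_pct and the log path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# mem_used_pct: read `top -b -n 1` output on stdin and print used memory
# as a percentage of total, taken from the "MiB Mem" summary line.
# Assumes the field layout shown in the snapshot above
# (field 4 = total, field 8 = used).
mem_used_pct() {
    awk '/^MiB Mem/ { printf "%.1f\n", $8 / $4 * 100; exit }'
}

# Hypothetical usage (the log path is an assumption):
#   top -b -o %MEM -n 1 \
#     | tee "/var/log/top-$(date +%Y%m%dT%H%M%S).txt" \
#     | mem_used_pct
```

Writing each snapshot to a timestamped file, rather than overwriting top.txt, keeps the last one before a freeze available after the reboot.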

Hypothesis: UPS monitor is telling the machine to shut down

The server is connected to a UPS. I thought this could be the cause, as the UPS monitor sometimes failed to connect (although that usually did not switch the machine off). However, after completely disabling the upsmon configuration, the UPS is no longer monitored and I see no logs from it, yet the problem still happens.

Side note: in terms of power, the machine is still connected to the UPS; it is just not monitoring the UPS status. I live in an area where power cuts are uncommon.

Hypothesis: Machine is going to sleep if not being used

I am taking snapshots every 30 minutes, partly to keep the machine awake. They are run by Jenkins, which connects to the server over SSH and runs the two commands that capture logs and memory usage. I would expect this to count as interaction. For one full day I ran them every 5 minutes and, on that day, no "shutdowns" happened. I am not sure whether that was a coincidence or due to the process, so I am testing again to see the results.
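A more direct way to test this hypothesis might be to search the previous boot's journal for signs that the system tried to sleep. The sketch below is a heuristic only: the function name find_sleep_events is made up, and the pattern list is an assumption built from messages such as "Reached target Sleep" and "reason 'sleeping'", so it may miss other wording or match unrelated lines.

```shell
#!/bin/sh
# find_sleep_events: read journal text on stdin and print the lines that
# suggest the system tried to sleep, suspend or hibernate.
# The patterns are a guess at typical wording, not an exhaustive list.
find_sleep_events() {
    grep -iE 'reached target sleep|suspend|hibernat|sleeping' || true
}

# Usage on the affected machine (previous boot, no pager):
#   sudo journalctl -b -1 --no-pager | find_sleep_events
```

If this prints nothing for a boot that ended in a freeze, a hang or hardware problem becomes more likely than a sleep state.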

Hypothesis: GUI has power management and I should remove it

I had a GUI installed, but I have already uninstalled it on the advice of @guiverc in the comments.

This used to be the sessions I had:

nito-server:~$ ls /usr/share/xsessions/
gnome-xorg.desktop  gnome.desktop  ubuntu.desktop

I have followed this tutorial, this answer and this other answer to remove both GNOME and the Ubuntu desktop.

After running this, no desktop sessions are listed any more:

nito-server:~$ ls /usr/share/xsessions/

Despite this, the system still shuts down periodically.

Hypothesis: Disabling power interface via GRUB would solve it

I researched a bit on this forum and often saw suggestions to change the GRUB options that control the power interface (ACPI). I tried different variants, progressively adding options:

  • GRUB_CMDLINE_LINUX_DEFAULT="text"
  • GRUB_CMDLINE_LINUX_DEFAULT="text acpi=force"
  • GRUB_CMDLINE_LINUX_DEFAULT="text nomodeset acpi=force"
  • GRUB_CMDLINE_LINUX_DEFAULT="text nomodeset pci=noaer"
  • GRUB_CMDLINE_LINUX_DEFAULT="text nomodeset acpi=force pci=noaer"

I checked the logs from just before the shutdown by running sudo journalctl -b -1 -e:

Jan 18 05:33:55 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:36:26 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:36:26 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:36:26 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:36:26 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:36:39 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:36:39 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:36:39 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:36:39 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:37:39 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:37:39 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:37:39 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:37:39 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:39:05 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:39:05 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:39:05 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:39:05 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:43:31 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:43:31 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:43:31 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:43:31 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:48:59 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:48:59 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:48:59 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:48:59 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:49:07 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:49:07 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:49:07 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:49:07 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:50:02 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:50:02 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:50:02 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:50:02 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr                 
Jan 18 05:50:32 nito-server kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:04:00.0
Jan 18 05:50:32 nito-server kernel: nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Jan 18 05:50:32 nito-server kernel: nvme 0000:04:00.0: AER:   device [15b7:5006] error status/mask=00000001/0000e000
Jan 18 05:50:32 nito-server kernel: nvme 0000:04:00.0: AER:    [ 0] RxErr     

After checking this, the last configuration is currently set to: GRUB_CMDLINE_LINUX_DEFAULT="text pcie_aspm=off".

This seems to make the problem happen less often, but it still occurs.
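When comparing GRUB variants like these, it may help to confirm that the option actually reached the running kernel and to count the corrected AER errors per boot. The following is a rough sketch; the function names are invented for illustration, and it assumes that /etc/default/grub was edited and sudo update-grub was run before rebooting.

```shell
#!/bin/sh
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline (the running kernel's options).
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# count_aer_corrected: count "AER: Corrected error received" lines on stdin.
count_aer_corrected() {
    grep -c 'AER: Corrected error received' || true
}

# Usage on the affected machine:
#   sudo update-grub && sudo reboot        # after editing /etc/default/grub
#   has_kernel_param pcie_aspm=off && echo "pcie_aspm=off is active"
#   sudo journalctl -k -b -1 --no-pager | count_aer_corrected
```

A falling count between boots would suggest the option is having an effect, even if the freezes themselves turn out to have a different cause.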

Hypothesis: System (BIOS) has a power savings mode

I've checked the BIOS. There's an Enhanced Power Saving Mode. However, (1) this is about entering power saving mode when it's already Off, not about turning off; and (2) it's Disabled. In Power, most features were about automatic or controlled power on. Nothing else related.

Hypothesis: Issues with cron drive the machine to shut down

There are no cron jobs configured on the machine directly. The only crons come from Jenkins which is configured inside a docker container.

crontab -l shows no crontab for nito

ll /etc/cron.hourly/ shows:

total 20
drwxr-xr-x   2 root root  4096 Aug  1 00:28 ./
drwxr-xr-x 134 root root 12288 Jan 29 16:37 ../
-rw-r--r--   1 root root   102 Feb 14  2020 .placeholder
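The other periodic cron directories can be checked for runnable scripts in the same way. This is a small sketch under assumptions: the function name list_cron_scripts is made up, and it only approximates what run-parts would execute (run-parts applies additional file name rules).

```shell
#!/bin/sh
# list_cron_scripts DIR
# Print the regular, executable, non-hidden files directly inside DIR.
# This is an approximation of what run-parts would pick up.
list_cron_scripts() {
    find "$1" -maxdepth 1 -type f -perm -u+x ! -name '.*' | sort
}

# Usage on the affected machine:
#   for d in /etc/cron.hourly /etc/cron.daily /etc/cron.weekly; do
#       list_cron_scripts "$d"
#   done
#   systemctl list-timers --all    # systemd timers can also run periodic jobs
```

A directory that holds only .placeholder, like the one listed above, produces no output.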

CURRENT STATUS AND LOGS

After all of the above, the machine stabilized for a while, but shutdowns still happen every 48-72h. These are the last journal entries (sudo journalctl -b -1 -e):

Jan 22 07:03:00 nito-server sshd[74442]: pam_unix(sshd:session): session opened for user nito by (uid=0)
Jan 22 07:03:00 nito-server systemd[1]: Created slice User Slice of UID 1000.
Jan 22 07:03:00 nito-server systemd[1]: Starting User Runtime Directory /run/user/1000...
Jan 22 07:03:00 nito-server systemd-logind[1131]: New session 61 of user nito.
Jan 22 07:03:00 nito-server systemd[1]: Finished User Runtime Directory /run/user/1000.
Jan 22 07:03:00 nito-server systemd[1]: Starting User Manager for UID 1000...
Jan 22 07:03:00 nito-server systemd[74473]: pam_unix(systemd-user:session): session opened for user nito by (uid=0)
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Paths.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Timers.
Jan 22 07:03:00 nito-server systemd[74473]: Starting D-Bus User Message Bus Socket.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG network certificate management daemon.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Jan 22 07:03:00 nito-server systemd[74473]: Listening on GnuPG cryptographic agent and passphrase cache.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on debconf communication socket.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on Sound System.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on REST API socket for snapd user session agent.
Jan 22 07:03:00 nito-server systemd[74473]: Listening on D-Bus User Message Bus Socket.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Sockets.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Basic System.
Jan 22 07:03:00 nito-server systemd[1]: Started User Manager for UID 1000.
Jan 22 07:03:00 nito-server systemd[74473]: Starting Sound Service...
Jan 22 07:03:00 nito-server systemd[1]: Started Session 61 of user nito.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server rtkit-daemon[8477]: Supervising 0 threads of 0 processes of 1 users.
Jan 22 07:03:00 nito-server systemd[74473]: Started D-Bus User Message Bus.
Jan 22 07:03:00 nito-server dbus-daemon[74579]: [session uid=1000 pid=74579] AppArmor D-Bus mediation is enabled
Jan 22 07:03:00 nito-server systemd[74473]: Started Sound Service.
Jan 22 07:03:00 nito-server systemd[74473]: Reached target Main User Target.
Jan 22 07:03:00 nito-server systemd[74473]: Startup finished in 122ms.
Jan 22 07:03:00 nito-server bluetoothd[1105]: Endpoint registered: sender=:1.477 path=/MediaEndpoint/A2DPSink/sbc
Jan 22 07:03:00 nito-server bluetoothd[1105]: Endpoint registered: sender=:1.477 path=/MediaEndpoint/A2DPSource/sbc
Jan 22 07:03:07 nito-server sshd[74442]: pam_unix(sshd:session): session closed for user nito
Jan 22 07:03:07 nito-server systemd[1]: session-61.scope: Succeeded.
Jan 22 07:03:07 nito-server systemd-logind[1131]: Session 61 logged out. Waiting for processes to exit.
Jan 22 07:03:07 nito-server systemd-logind[1131]: Removed session 61.
Jan 22 07:03:07 nito-server bluetoothd[1105]: Endpoint unregistered: sender=:1.477 path=/MediaEndpoint/A2DPSink/sbc
Jan 22 07:03:07 nito-server bluetoothd[1105]: Endpoint unregistered: sender=:1.477 path=/MediaEndpoint/A2DPSource/sbc
Jan 22 07:03:07 nito-server systemd[74473]: pulseaudio.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: Stopping User Manager for UID 1000...
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Main User Target.
Jan 22 07:03:17 nito-server systemd[74473]: Stopping D-Bus User Message Bus...
Jan 22 07:03:17 nito-server systemd[74473]: dbus.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped D-Bus User Message Bus.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Basic System.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Paths.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Sockets.
Jan 22 07:03:17 nito-server systemd[74473]: Stopped target Timers.
Jan 22 07:03:17 nito-server systemd[74473]: dbus.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed D-Bus User Message Bus Socket.
Jan 22 07:03:17 nito-server systemd[74473]: dirmngr.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG network certificate management daemon.
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent-browser.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent-extra.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent-ssh.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jan 22 07:03:17 nito-server systemd[74473]: gpg-agent.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed GnuPG cryptographic agent and passphrase cache.
Jan 22 07:03:17 nito-server systemd[74473]: pk-debconf-helper.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed debconf communication socket.
Jan 22 07:03:17 nito-server systemd[74473]: pulseaudio.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed Sound System.
Jan 22 07:03:17 nito-server systemd[74473]: snapd.session-agent.socket: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Closed REST API socket for snapd user session agent.
Jan 22 07:03:17 nito-server systemd[74473]: Reached target Shutdown.
Jan 22 07:03:17 nito-server systemd[74473]: systemd-exit.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[74473]: Finished Exit the Session.
Jan 22 07:03:17 nito-server systemd[74473]: Reached target Exit the Session.
Jan 22 07:03:17 nito-server systemd[1]: user@1000.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: Stopped User Manager for UID 1000.
Jan 22 07:03:17 nito-server systemd[1]: Stopping User Runtime Directory /run/user/1000...
Jan 22 07:03:17 nito-server systemd[1]: run-user-1000.mount: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: user-runtime-dir@1000.service: Succeeded.
Jan 22 07:03:17 nito-server systemd[1]: Stopped User Runtime Directory /run/user/1000.
Jan 22 07:03:17 nito-server systemd[1]: Removed slice User Slice of UID 1000.
Jan 22 07:09:52 nito-server wpa_supplicant[1135]: wlo1: WPA: Group rekeying completed with 76:ac:b9:30:c7:b5 [GTK=CCMP]
Jan 22 07:17:01 nito-server CRON[75952]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 22 07:17:01 nito-server CRON[75953]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jan 22 07:17:01 nito-server CRON[75952]: pam_unix(cron:session): session closed for user root

Running sudo cat /var/log/syslog | grep -i "panic\|error\|hang"

Jan 29 00:00:08 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 00:00:08 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 00:00:08 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 00:00:12 nito-server systemd-resolved[1129]: message repeated 47 times: [ Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.]
Jan 29 03:01:30 nito-server networkd-dispatcher[1163]: ERROR:Unknown interface index 50 seen even after reload
Jan 29 03:01:30 nito-server networkd-dispatcher[1163]: ERROR:Unknown interface index 50 seen even after reload
Jan 29 03:01:30 nito-server kernel: [22595.674380] IPv6: ADDRCONF(NETDEV_CHANGE): vethc7eb143: link becomes ready
Jan 29 03:33:00 nito-server boltd[117319]: power: state changed: supported/on
Jan 29 03:33:05 nito-server boltd[117319]: power: state changed: supported/wait
Jan 29 03:33:05 nito-server boltd[117319]: power: state changed: supported/on
Jan 29 03:33:25 nito-server boltd[117319]: power: state changed: supported/wait
Jan 29 03:33:45 nito-server boltd[117319]: power: state changed: supported/off
Jan 29 08:31:20 nito-server boltd[117319]: power: state changed: supported/on
Jan 29 08:31:40 nito-server boltd[117319]: power: state changed: supported/wait
Jan 29 08:32:00 nito-server boltd[117319]: power: state changed: supported/off
Jan 29 08:45:13 nito-server NetworkManager[1154]: <info>  [1611881113.6500] dhcp4 (wlo1): state changed bound -> extended
Jan 29 14:12:25 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 14:12:25 nito-server systemd-resolved[1129]: message repeated 2 times: [ Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.]
Jan 29 16:36:01 nito-server systemd-resolved[1129]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Jan 29 16:36:01 nito-server systemd-resolved[1129]: message repeated 2 times: [ Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.]
Jan 29 16:36:25 nito-server NetworkManager[1154]: <info>  [1611909385.6366] manager: kernel firmware directory '/lib/firmware' changed
Jan 29 16:36:29 nito-server NetworkManager[1154]: <info>  [1611909389.8075] manager: kernel firmware directory '/lib/firmware' changed
Jan 29 16:37:04 nito-server ntpd[351772]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized
Jan 29 16:37:04 nito-server ntpd[351772]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized

Note: the power state changes shown there are probably from a manual shutdown, as there were no automatic shutdowns on Jan 29.
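Many of the lines above match only because "changed" contains the substring "hang", and the repeated NXDOMAIN messages add further noise. A slightly stricter filter could look like the sketch below; the function name scan_log is hypothetical, and excluding the DVE-2018-0001 lines is an assumption that they are unrelated to the freezes.

```shell
#!/bin/sh
# scan_log FILE
# Print lines mentioning panic, error or hang, where "hang" must not be
# preceded by a letter (so "changed" no longer matches), and drop the
# systemd-resolved DVE-2018-0001 messages.
scan_log() {
    grep -iE 'panic|error|(^|[^[:alpha:]])hang' "$1" \
        | grep -vF 'DVE-2018-0001' || true
}

# Usage on the affected machine:
#   sudo sh -c '. ./scan_log.sh; scan_log /var/log/syslog'
# (assumes the function was saved to a file named scan_log.sh)
```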

Final Decision

Hi everyone,

Thanks for your tips and hypotheses. After implementing all the changes and running all the commands recommended, the machine still shuts down. The good news is that it now happens every 72h instead of every 24h.

I have decided to reinstall Ubuntu Server and see if that solves the problem. Thanks!

  • How did you install the GUI/desktop? You've likely turned your server into a desktop system. Servers are built for performance; desktops in contrast try to shut down to save power & battery life. You're using an unsupported kernel (5.4 & 5.8 are supported focal kernels currently), though I'd look at how you added the desktop packages, and whether you undid the conversion from server to desktop that can occur when you do that... – guiverc Jan 16 '21 at 07:59
  • Yes, I am afraid so too. I do not recall what flavour (i.e. what exact UI I chose), but the command would have been something close to sudo apt-get install ubuntu-desktop. Let me try uninstall those. – nitobuendia Jan 16 '21 at 08:01
  • That command would have converted a server (performance orientated machine) to a power-saving desktop. You'll have to turn off or disable all the power saving functions (GNOME is the default desktop for 20.04 or focal). I'm not using GNOME, but I'd look for Power & Screen Saving/Lock functions & disable them.. or wait for advice from a GNOME user :) – guiverc Jan 16 '21 at 08:04
  • Makes sense. Trying to uninstall GUI. Otherwise, I will consider a fresh install. Thanks! – nitobuendia Jan 16 '21 at 08:07
  • @guiverc I've just removed the GUI and it now starts on Terminal mode fine. I think this might be the answer, but I can only confirm in probably ~24h. Do you want to post your comment as an answer or wait until I can confirm? Thanks for the pointer! – nitobuendia Jan 16 '21 at 08:21
  • You can write your own answer... a better answer to your actual question may be the GNOME settings that stop what you wanted... I just gave you the why or cause.. but thanks :) – guiverc Jan 16 '21 at 08:23
  • Look at the end of the previous boot's logs: sudo journalctl -b -1 -e – waltinator Jan 16 '21 at 20:33
  • Good call @waltinator. I didn't have any switch off today so far (indicating that maybe guiverc's solution was right). If I do, I will give it a try and store this command for the future! Thank you! – nitobuendia Jan 17 '21 at 03:29
  • Quick updates:

    (1) Despite removing the GUI as per the discussion with @guiverc, this has happened again. Nonetheless, I have added an explanation of what I did to the troubleshooting section for future reference.

    (2) I have run the journalctl command from @waltinator and it came back with a few corrected errors. I am not familiar with these and am currently looking into them. I've added them to the "System Logs" section. One thought here is that the machine still "looks on", but the timing of these logs roughly matches the last snapshot (i.e. when the machine stopped).

    Thanks for the help!

    – nitobuendia Jan 18 '21 at 10:36
  • The "PCIe Bus Error: severity=Corrected, type=Physical Layer" error seems to be related to power management, so most likely related. Following this answer now: https://askubuntu.com/questions/863150/pcie-bus-error-severity-corrected-type-physical-layer-id-00e5receiver-id -- currently checking whether adding pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT actually solves the issue. I will update in 24-48h when I can confirm that there are no more shutdowns. – nitobuendia Jan 18 '21 at 10:46
  • No more "shutdowns" in the last 48h, so I will assume this is the answer. I will publish now. Thanks waltinator and guiverc. – nitobuendia Jan 20 '21 at 11:14
  • Unfortunately, it has happened again, twice already. I have added the new issues to system logs, and I will be updating my question later (and removing the answer and adding it as troubleshooting). – nitobuendia Jan 22 '21 at 01:21
  • I have removed my answer and merged all the troubleshooting in order using a hypothesis-based approach. Shutdowns are still happening and not sure what else to try. The last journalctl didn't show anything odd that I could find. – nitobuendia Jan 23 '21 at 08:06
  • There are many answers here you can check out: https://askubuntu.com/questions/47311/how-do-i-disable-my-system-from-going-to-sleep – WinEunuuchs2Unix Jan 25 '21 at 04:28
  • According to your journalctl output for the previous boot, the system froze when cron hourly was being run. Unless part of the log wasn't posted? Can you update your question with the results of ll /etc/cron.hourly/? There might be something there causing the crash. – WinEunuuchs2Unix Jan 25 '21 at 04:44
  • Temperature issues (might be living longer due to colder ambient temp?)? If you can spare the machine for a day perhaps run Ubuntu from a pendrive with a benchmarking app and see if you get the same behaviour. Might be temp (eg BIOS shutdown to protect hardware) or simply a hardware fault? Is the shutdown real time correlated or uptime correlated? – pbhj Jan 28 '21 at 19:44
  • Try to reproduce the behavior with: 1. snap* services masked. 2. Docker containers stopped and iptables rules flushed. – ExploitFate Jan 29 '21 at 15:13
  • @WinEunuuchs2Unix There are no crons configured, updated that before. I have also updated my answer with the output of your command. Thanks for your answer. I missed the comments on my original message. Apologies. – nitobuendia Jan 30 '21 at 05:25
  • @pbhj Other people mentioned temperature too. Is there a good easy way to verify? Machine is new, so unlikely, but perfectly possible that it's faulty. The time is uptime correlated, but not sure how correlated really. It happens at different times. It used to be every 24h (now still happens on longer intervals). Thanks for your answer. I missed the comments on my original message. Apologies. – nitobuendia Jan 30 '21 at 05:26
  • @ExploitFate Could you kindly elaborate on what to expect from this test? For example, what do the iptables rules have to do with this? I have them configured as I need them. Thanks for your answer. I missed the comments on my original message. Apologies. – nitobuendia Jan 30 '21 at 05:26
  • @nitobuendia I had 2 cases with AWS EC2 with snap services when my instances rebooted randomly, the reason was snap update. Also on another VPS I had few cases with iptable rules which caused kernel panic – ExploitFate Jan 30 '21 at 20:59
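On the temperature question raised in these comments, one simple check might be to log the kernel's thermal zones at regular intervals and look at the last entries after a freeze. This is a sketch under assumptions: it relies on the /sys/class/thermal interface, which reports millidegrees Celsius, being populated on this machine, and the function names are invented for illustration.

```shell
#!/bin/sh
# millideg_to_c VALUE: convert millidegrees Celsius to degrees Celsius.
millideg_to_c() {
    awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 1000 }'
}

# print_temps: print a timestamped line for every thermal zone the kernel
# exposes. Zones without a readable temp file are skipped.
print_temps() {
    for zone in /sys/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        printf '%s %s %s C\n' "$(date +%FT%T)" "$(cat "$zone/type")" \
            "$(millideg_to_c "$(cat "$zone/temp")")"
    done
}

# Hypothetical usage, appending to a log every minute (path is an assumption):
#   while true; do print_temps >> "$HOME/temps.log"; sleep 60; done
```

If the last readings before a freeze are unremarkable, overheating becomes a less likely explanation.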

6 Answers

4

I have a pair of P340Tiny systems acting as web servers for a couple of non-profits in the community and have not run into this particular issue. At first I thought that you might be running the device headless, as doing so may trigger the BIOS on some systems to force a shut down after a certain amount of time with no display. The same sort of “feature” exists on some Lenovo All-in-One’s, shutting the machine off if there has been zero activity from keyboards or mice for 16 hours. However, looking through the user guide at the power features, I do not see any such functionality. I also looked at the Power tab in the BIOS of one of my P340Tinies and didn’t see anything suggesting the machine might shut down on its own either:

Lenovo P300-series BIOS Settings

One thing I can say is that the P340Tiny units I’m running do not exhibit the behaviour you see. They both run Ubuntu Server 20.04.1 and are configured to run until they’re told otherwise. Aside from the very occasional reboot, they’ve been running 24/7.

That said, you had mentioned that you installed the ubuntu-desktop package on your machine for a GUI, and this has me thinking there is something in systemd that is shutting your machine off.

Check for a sleep.target

On desktop systems there is a systemd-sleep service that is used for various power-saving modes. This service may still exist on your server despite having Gnome Desktop removed. You can check for its existence with this command:

sudo systemctl status sleep.target

If the target exists and has not been masked, you’ll see a response that looks something like this:

 ● sleep.target - Sleep
    Loaded: loaded (/lib/systemd/system/sleep.target; static; vendor preset: enabled)
    Active: inactive (dead)
      Docs: man:systemd.special(7)

If you see an output like this, then you’ll need to disable the power-saving bits of systemd. Fortunately, it’s not too difficult:

sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

This will disable sleep, suspend, hibernate, and the “hybrid-sleep” features in one step. You should see something like this as the output:

Created symlink /etc/systemd/system/sleep.target → /dev/null
Created symlink /etc/systemd/system/suspend.target → /dev/null
Created symlink /etc/systemd/system/hibernate.target → /dev/null
Created symlink /etc/systemd/system/hybrid-sleep.target → /dev/null

With these things disabled, now you can check for completeness:

$ systemctl status sleep.target
● sleep.target
   Loaded: masked (Reason: Unit sleep.target is masked.)
   Active: inactive (dead)

Note that the Loaded line now reads masked. Any attempt by systemd to sleep will be ignored.

This change takes effect immediately, so there is no need to reload a daemon or reboot the machine. Hopefully it will give you what you need.
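To confirm the result for all four targets at once, the symlinks that the mask command creates can be checked directly. This is a minimal sketch: the helper names are hypothetical, and it only looks in /etc/systemd/system, so a unit masked elsewhere would be reported as not masked. Running systemctl is-enabled sleep.target, which prints "masked" for a masked unit, remains the more authoritative check.

```shell
#!/bin/sh
# is_masked UNIT [UNIT_DIR]
# Succeeds if UNIT_DIR/UNIT is a symlink to /dev/null, which is how
# `systemctl mask` marks a unit. UNIT_DIR defaults to /etc/systemd/system.
is_masked() {
    dir=${2:-/etc/systemd/system}
    [ "$(readlink "$dir/$1" 2>/dev/null)" = "/dev/null" ]
}

# report_sleep_targets [UNIT_DIR]
# Print the mask state of the four sleep-related targets.
report_sleep_targets() {
    for unit in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
        if is_masked "$unit" "${1:-/etc/systemd/system}"; then
            echo "$unit: masked"
        else
            echo "$unit: NOT masked"
        fi
    done
}

# Usage on the affected machine:
#   report_sleep_targets
```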

  • Thanks a lot for the reply. My sleep.target looks like the latter example with: Loaded: masked (Reason: Unit sleep.target is masked.). The same applies to suspend, hibernate, and hybrid-sleep. The GUI was one of the potential issues here (maybe still the root cause?), but I followed the steps to fully uninstall all related packages with the commands above. I am running this nonetheless: sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target, but there was no output and I expect no impact, unfortunately. – nitobuendia Jan 24 '21 at 16:32
  • By the way, despite this not being my case, I tried to upvote your answer as I still think might be useful for other users (just not solving it for me). Unfortunately, my current points on this network are below 15, which means I can't upvote it. That's annoying :) – nitobuendia Jan 25 '21 at 15:11
  • S'all good. Trying to rebuild your situation in a VM to see if Gnome power saving features remain behind after an uninstall. If a monitor is connected to the machine and that monitor is on, then there shouldn't be anything telling the system to shut down. –  Jan 25 '21 at 15:15
  • There are no monitors connected to the desktop/server. The VM idea is good. Thanks. I feel it's hard to reproduce exactly in the same conditions, but it may help to test the server->desktop->remove. Question: when you start your system, what version exactly does it state you have? Does it say Server or just Ubuntu? – nitobuendia Jan 25 '21 at 18:00
  • There are no monitors connected? Then I am willing to bet this is a hardware "feature" that I don't have in my BIOS. A lot of desktop machines will shut themselves off when there is no monitor connected for a certain amount of time (or there is no keyboard/mouse input for a certain time). You will need to check your BIOS to ensure that all power-saving features are disabled. If there is something that specifically mentions "headless", ensure it is configured to allow headless operation –  Jan 25 '21 at 22:21
  • Thanks. I thought you had the same machine and couldn't find any options. Nonetheless, I'll have a look at the bios later today to be sure! – nitobuendia Jan 27 '21 at 01:45
  • I've checked the BIOS. There's an Enhanced Power Saving Mode. However, (1) this is about entering power saving mode when it's already Off, not about turning off; and (2) it's Disabled. In Power, most features were about automatic or controlled power on. Nothing else related. – nitobuendia Jan 28 '21 at 12:47
  • While this didn't fully solve, it's the closest to it. I ended up reinstalling Ubuntu Server, and let's see. – nitobuendia Jan 31 '21 at 14:06
  • THANK YOU!!! I was experiencing this on Ubuntu 20. I ran sudo journalctl -b -1 -e to see the systemd logs before the last shut down and noted entries like this NetworkManager[952]: <info> [1654791299.6815] device (enp1s0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed') followed by systemd[1]: Reached target Sleep.. I used the systemctl command you shared, saw the symlink output you mentioned, and now my machine has stayed up. Thank you! – James Jun 11 '22 at 16:02
  • This just started happening on my headless ubuntu webserver (a basic mini pc) when I upgraded the OS for the first time in years. I masked sleep daemon and it seems to have worked. That was a right old bit of nasty, and thanks. I will not upgrade again, will install ubuntu server instead which is designed for headless operation. – Dominic Cerisano Jun 23 '23 at 21:36
0

While the root cause is unclear, uninstalling the UI and the power management packages increased the time between shutdowns from ~24h to ~72h.

Additionally, after reinstalling Ubuntu Server, this doesn't seem to be an issue anymore.

0

Shot in the dark here.

    sudo su -
    crontab -l

Also check the crontab of the user or app that runs the Ubuntu server.

If the shutdown is as regular as clockwork, then cron may be the utility controlling the behavior.
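Scheduled jobs can also live in other users' crontabs, in the /etc/cron.* directories, or in systemd timers. Here is a rough sketch for listing all of them in one go. The function names list_login_users and show_all_schedules are made up for illustration, and the sketch assumes a standard /etc/passwd layout:

```shell
# Print the user names found in a passwd-style file (default: /etc/passwd).
# list_login_users is a hypothetical helper name, not a standard tool.
list_login_users() {
    cut -d: -f1 "${1:-/etc/passwd}"
}

# Show every user's crontab, the system-wide cron directories and systemd timers.
show_all_schedules() {
    local user
    for user in $(list_login_users); do
        echo "== crontab for $user =="
        sudo crontab -l -u "$user" 2>/dev/null || echo "(none)"
    done
    ls -l /etc/cron.d /etc/cron.daily /etc/cron.hourly 2>/dev/null
    systemctl list-timers --all
}

# Usage:
#   show_all_schedules
```

Anything in that output that calls shutdown, poweroff or systemctl suspend would be worth a closer look.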

mondotofu
  • There are no cron jobs running as root or my main user. Not sure how this relates to the problem. Could you kindly elaborate further? (Note: if you're saying it because of the "recurrent processes", I do not run those as cronjobs on the server. I run them using Jenkins, which ssh into the machine and runs them.) – nitobuendia Jan 25 '21 at 15:10
  • just looking for an unexpected process such as a backup that would somehow finish and try to suspend the system. – mondotofu Jan 25 '21 at 19:50
  • You can see the logs above. Both the recurrent ones as well as from the last shutdown. I couldn't identify anything. If you do, kindly let me know. Thanks a lot! – nitobuendia Jan 27 '21 at 01:46
0

Here's another idea: scan your syslog for any string matching acpi or power.

    less /var/log/syslog

It came from https://www.linuxquestions.org/questions/linux-general-1/linux-crash-log-66894/
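Rather than paging through the whole file, the same scan can be done with grep. This is a minimal sketch; scan_power_events is a hypothetical helper name, and reading /var/log/syslog may require sudo or membership in the adm group:

```shell
# Print the lines of a log file (default: /var/log/syslog) that mention
# "acpi" or "power", ignoring case.
# scan_power_events is a hypothetical helper name.
scan_power_events() {
    grep -iE 'acpi|power' "${1:-/var/log/syslog}"
}

# Usage:
#   scan_power_events | less
#   scan_power_events /var/log/syslog.1 | less
```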

mondotofu
  • Hi, thanks a lot. I edited my original response, but I have a process that takes a snapshot of the syslogs every 30mins. The challenge is that we do not know at what time there will be a shutdown. I removed that information because the proposed sudo journalctl -b -1 -e brings the relevant logs from the last shutdown, which helped identify some of the issues. – nitobuendia Jan 27 '21 at 01:49
0

The first thing to find out is whether the server keeps responding to ping after it "shuts down".
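One way to find out is to probe the machine from a second computer on the same network and keep a timestamped log, so you can see exactly when it stopped answering. A minimal sketch, assuming the Linux iputils ping; check_host is a hypothetical helper name and 192.168.1.50 is a placeholder address:

```shell
# Probe a host once and append a timestamped UP/DOWN line to a log file.
# check_host is a hypothetical helper name.
check_host() {
    local host=$1 logfile=$2
    # -c 1: send a single packet; -W 2: wait at most 2 seconds for the reply
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "$(date '+%F %T') $host UP" >> "$logfile"
    else
        echo "$(date '+%F %T') $host DOWN" >> "$logfile"
    fi
}

# Usage (Ctrl+C to stop):
#   while true; do check_host 192.168.1.50 ping.log; sleep 30; done
```

The first DOWN line in ping.log gives an approximate time to compare against the server's own logs.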

If the computer freezes, this is normally caused by a problem in a memory module.

If it shuts down completely, this is normally caused by:

  • Incorrect cooler setup, excessive temperature of the CPU/s
  • Insufficient power PSU (seems unlikely in this case)
  • Voltage problems. Can be caused by SATA cables partially connected
  • Overclocking

I recommend disabling overclocking if it's enabled and testing the memory. Reboot your server and select Perform Memory Test from the GRUB menu, or boot from an Ubuntu Live CD and select Memory Test. If it hangs there, you have your answer.

You can stress your system with stress-ng. It will test the memory too. Other tools to test the memory without rebooting:

    sudo apt-get install memtester

The following command allocates 4 GB of RAM and runs the test 10 times:

    sudo memtester 4096 10


If it is a server with ECC memory, memory errors will be logged and you can check them via IPMI. Servers can have a remote management GUI to access them, called iLO (HP) or iDRAC (Dell); I think on Lenovo it is called TSM (ThinkServer System Manager).

Do you have any special PCI card, such as a SAS controller or additional SATA ports? If so, I would suggest disconnecting it.

Update: Please check the logs for Kernel Panics and other errors.

    sudo grep -i "panic\|error\|hang" /var/log/syslog
    dmesg -T | grep -i "panic\|error\|hang"

To read the temperature of the drives and the CPU easily you can use:

    sudo apt install hddtemp lm-sensors

Then use:

    sudo hddtemp /dev/sda
    # Outputs
    /dev/sda: ST2000LM015-2E8174: 44°C

And:

    sensors
    coretemp-isa-0000
    Adapter: ISA adapter
    Package id 0:  +48.0°C  (high = +95.0°C, crit = +105.0°C)
    Core 0:        +46.0°C  (high = +95.0°C, crit = +105.0°C)
    Core 1:        +48.0°C  (high = +95.0°C, crit = +105.0°C)
    Core 2:        +46.0°C  (high = +95.0°C, crit = +105.0°C)
    Core 3:        +46.0°C  (high = +95.0°C, crit = +105.0°C)
    Core 4:        +45.0°C  (high = +95.0°C, crit = +105.0°C)
    Core 5:        +46.0°C  (high = +95.0°C, crit = +105.0°C)

If you want to continuously watch it you can do:

    sudo watch hddtemp /dev/sda

This refreshes the value every 2 seconds by default.

You can specify more than one drive:

    sudo watch hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd


It is probably best to capture the sensor readings in the background, so that when the machine powers off again you know what the last temperature reading was.

I wrote this small script for you:

    while true; do date | tee -a hddtemp.log; sudo hddtemp /dev/sda /dev/sdb /dev/sdc /dev/sdd | tee -a hddtemp.log; sleep 2; done

The loop prints the drive temperatures every 2 seconds and appends them to hddtemp.log.

You can do the same for CPU.
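For the CPU, a similar loop could look like the sketch below. It assumes lm-sensors is installed so that the sensors command is available; log_cpu_temp and the file name cputemp.log are just illustrative names:

```shell
# Append the current date and the full `sensors` output to a log file.
# log_cpu_temp is a hypothetical helper name; the default file name is arbitrary.
log_cpu_temp() {
    local logfile=${1:-cputemp.log}
    {
        date
        sensors
    } >> "$logfile"
}

# Usage (Ctrl+C to stop):
#   while true; do log_cpu_temp cputemp.log; sleep 2; done
```

After the next shutdown, the end of cputemp.log (for example with tail -n 20 cputemp.log) shows the last readings that were written.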

I recommend running it in the background inside a screen session.

To start screen, just type:

    screen

Then run your command from there. To detach from screen without interrupting the script, press Ctrl+A, then D.

To interrupt the script, press Ctrl+C.

Carles Mateo
  • Carles, thanks a lot for your reply. When it's "shutdown", the machine cannot be pinged. The memory hypothesis is covered in "Hypothesis: Memory is overloaded and system shuts down". The machine is not overclocking and barely using any of the memory on the snapshots I took. It's a 64GB machine running 10 docker containers (nginx, let'sencrypt, home-assistant), it always stays below 8GB in any given snapshot. I am not sure about the PCI card. Nonetheless, the server was bought about 1 month ago and I have not connected anything that didn't come with it except an additional SSD on the given slot – nitobuendia Jan 28 '21 at 12:52
  • Hi @Nitobuendia, my pleasure. Is it a server or a desktop? We should check IPMI to look for hardware problems. Memory and CPU have to perform well; I'm talking about a physical problem in a memory module. It happens with new modules, or normally after some time in service in data centers. Having to replace them is very common. – Carles Mateo Jan 28 '21 at 12:56
  • If it is a server, they normally come with a GUI for administration. I've updated my answer; I think for Lenovo it is called TSM (ThinkServer System Manager). – Carles Mateo Jan 28 '21 at 13:02
  • Thanks again. It's actually SFF (Tiny) Desktop. I wiped out the operating system (Windows 10) and installed Ubuntu Server, which does not come with GUI. I did install one in case I needed it. Some people suggested this could be the issue and the troubleshooting for that under "Hypothesis: GUI has power management and I should remove it". I am not familiar with "ThinkServer System Manager", but I have not seen any references to it anywhere. – nitobuendia Jan 28 '21 at 14:14
  • Glad to help. When I was talking about a GUI, I was referring to the tools that are available when working with servers in data centers, not to running Ubuntu with a GUI. If you have a desktop, you don't have such a tool built in. As mentioned in the message, you have to verify that the memory is OK. Also check the cables to the drives and the temperature of the CPU. It's not a bad idea to check the logs for kernel panics or the like. I will update my answer with instructions. – Carles Mateo Jan 28 '21 at 22:17
  • Any easy way to check the CPU temperature? I've added the command from /syslog/ to my first response. Nothing stands out, there's a power off, but that one was manual (machine has not restarted in the last 48h this time). No errors found with the second command. – nitobuendia Jan 29 '21 at 10:56
  • I have updated my answer with one easy way to check hdd temperature and cpu temperature. If all the drives are working at the same time the box may get hot. – Carles Mateo Jan 30 '21 at 13:37
0

You could just go to the log directory and look at the prior logs:

    cd /var/log
    ls -artl

Look at the dmesg.0 log, for example. You could also unpack the dmesg.1.gz, dmesg.2.gz, etc. files to see what transpired.

Use these in conjunction with syslog, boot.log, and kern.log.

All previous versions have either .0 or .1 in the name, or they are gzipped for preservation.

In the same directory, look for apport.log files. Again, they follow the same numbering, with the previous log file being

    apport.log.1

There might be actual crash files in the

    /var/crash   

directory.
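The rotated logs do not have to be unpacked by hand: zgrep searches plain and gzipped files alike. A rough sketch, assuming you are interested in syslog and kern.log; search_rotated_logs is a hypothetical helper name, and reading /var/log may require a user in the adm group or a root shell:

```shell
# Search the current and rotated (including gzipped) syslog and kern.log
# files in a directory (default: /var/log) for a case-insensitive pattern.
# search_rotated_logs is a hypothetical helper name.
search_rotated_logs() {
    local pattern=$1 dir=${2:-/var/log}
    zgrep -i -e "$pattern" "$dir"/syslog* "$dir"/kern.log* 2>/dev/null
}

# Usage:
#   search_rotated_logs 'panic' | less
```

Each matching line is prefixed with the file it came from, which helps to tell which boot it belongs to.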

Good luck hunting, mondotofu

mondotofu