Symptom:

The system freezes anywhere from two minutes to an hour after boot, then spontaneously reboots about ten seconds later. It doesn't matter whether the system is sitting at the login screen, idling at the desktop, or playing a video. Temperature readings are normal leading up to the freeze and reboot.

I thought that implied a memory issue, but I've tried reseating modules, swapping slots, and increasing DRAM voltage. Threads on Ryzen and the Aorus motherboard sent me down rabbit holes, and I've been toggling C-states off, increasing idle DRAM power, and so on. No joy.

Note that this AMD Ryzen 5 3600 is not a defective part: I exchanged it with AMD via RMA and saw no difference! (With an AMD Ryzen 5 3400G installed instead, the system is rock solid. However, I can't use that APU long-term in this system.)

As much information as you can stand follows. Please let me know if I've missed anything which might help further diagnose what's wrong.

I am weeks of precious time into trying to get this build stable. At this point I feel like I've tried everything except swinging a dead chicken over my head. Please help me find the root cause! I'm at my wit's end and feeling very discouraged. :(

Short list of (potentially) relevant other threads:

Hardware

  • Gigabyte X570 Aorus Elite motherboard (UEFI versions: F11 or F20)
  • AMD Ryzen 5 3600 6-Core Processor
  • 16GB Corsair Vengeance LPX memory (DDR4 2x8GB 3200MHz)
  • MSI GeForce GTX 970 GAMING 4G
    • 08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)

Pic of main components

Things I've tried with no change

  • Tested the memory exhaustively (overnight, no problems detected)
  • Reseating memory
  • Swapping memory to the opposite memory bank
  • Swapping memory sticks within the same bank
  • Swapping out the CPU via RMA with AMD
  • Different UEFI versions (F11 and F20)

Errors reported at boot typically look like this:

$ sudo journalctl | grep -i "hardware err"
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: Machine check events logged
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff87930eee MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594686497 SOCKET 0 APIC 4 microcode 8701013

Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: Machine check events logged
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffbbf30eee MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594695977 SOCKET 0 APIC a microcode 8701021

Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: Machine check events logged
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff89330eee MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594857445 SOCKET 0 APIC 1 microcode 8701021
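
For anyone comparing their own logs against these, here is a rough Python sketch that pulls the CPU, bank, status word, and TIME field out of lines in this format and decodes the flag bits of the status word. The helper names (`decode_status`, `summarize`) are made up for illustration. The bit positions are the architectural x86 MCA flags (bits 63-57) plus the TCC and SYNDV bits as defined in the Linux kernel sources for AMD; treating TIME as Unix epoch seconds is an assumption based on how the kernel fills that field. It only restates what the register says and does not identify the faulty component.

```python
import re
from datetime import datetime, timezone

# Flag bits of the MCA status word. Bits 63-57 are architectural on x86;
# TCC (55) and SYNDV (53) are assumed from the Linux kernel's AMD definitions.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
    55: "TCC",    # task context corrupt (AMD)
    53: "SYNDV",  # SYND register is valid (AMD)
}

HEADER_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
TIME_RE = re.compile(r"TIME (\d+)")


def decode_status(status):
    """Return the set flag names and the low 16-bit MCA error code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


def summarize(lines):
    """Collect one dict per machine check event found in journal lines."""
    events = []
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            events.append({
                "cpu": int(header.group(1)),
                "bank": int(header.group(2)),
                "status": int(header.group(3), 16),
            })
            continue
        stamp = TIME_RE.search(line)
        if stamp and events:
            # Assumption: TIME is seconds since the Unix epoch.
            events[-1]["time_utc"] = datetime.fromtimestamp(
                int(stamp.group(1)), tz=timezone.utc
            )
    return events


# Two lines copied from the first event in the log above.
SAMPLE = [
    "Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108",
    "Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594686497 SOCKET 0 APIC 4 microcode 8701013",
]

for event in summarize(SAMPLE):
    decoded = decode_status(event["status"])
    print(event["cpu"], event["bank"], decoded["flags"], hex(decoded["error_code"]))
```

Feeding it the full `journalctl` output makes it easy to see that all three events report the same bank and the same status word on different cores.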

More:

UEFI settings

The settings in the picture below are referring to F20, the most recent stable UEFI release.

Things I've tried with no change (note NO overclocking of any sort)

  • Every version of Gigabyte's UEFI between F11 and F20 at "optimized default" settings
  • Increasing core DRAM voltage to 1.35V
  • Many of the settings below/pictured toggled in one direction or another:
    • CPU Clock Ratio: Auto (36.00)
    • CPU Clock Control: Auto (100.00MHz)
    • Extreme Memory Profile (X.M.P): Disabled
    • CPU Vcore: Auto
    • CPU Vcore Loadline Calibration: Auto
    • CSM Support: Enabled
    • SMT Mode: Disabled
    • Power Supply Idle Control: Typical Current Idle
    • IOMMU: Enabled
    • SVM Mode: Enabled
    • ACS Enabled: Auto
    • Enable AER Cap: Auto
    • Global C-state Control: Disabled
    • DRAM Power Options > Power Down Enable: Disabled

Software

Ubuntu 20.04 LTS

$ uname -a 
Kernel: Linux obelisk-ubuntu 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash atkbd.reset=1 i8042.reset pci=assign-busses apicmaintimer idle=poll reboot=cold,hard processor.max_cstate=1 rcu_nocbs=0-11"
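
One thing worth ruling out: on Ubuntu, changes to /etc/default/grub only take effect after the GRUB configuration is regenerated (`sudo update-grub`) and the machine is rebooted. A minimal Python sketch for checking that the running kernel actually received the parameters, assuming you paste in the wanted string and the contents of /proc/cmdline; the helper names and the example "running" string are hypothetical:

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(wanted, running):
    """List wanted parameters that are absent from, or differ in, the running line."""
    want = parse_cmdline(wanted)
    have = parse_cmdline(running)
    missing = []
    for key, value in want.items():
        if key not in have or have[key] != value:
            missing.append(key if value is None else f"{key}={value}")
    return missing


# A few of the parameters from the GRUB line above.
wanted = "quiet splash idle=poll processor.max_cstate=1"
# Hypothetical contents of /proc/cmdline, for illustration only.
running = "ro quiet splash idle=poll"

print(missing_params(wanted, running))
```

On the real machine, `running` would come from `open("/proc/cmdline").read()`. An empty list means every wanted parameter reached the kernel.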

I have also tried installing the ZenStates package and setting it to disable C6.

Here's a gist with everything else I think you might ask for.

bmogilefsky
  • Thanks for introducing me to ZenStates, I'll look into it. On a Ryzen 5 HP Envy x360 I have been having frequent lockups for over 18 months; they became less frequent going from 19.10 to 20.04, but still happen seemingly at random. Loss of USB is still common, though also less so. There is a recent HP firmware update I have not tried (I'm not prepared to risk it at this stage). My only advice, not helpful I am sure, is to wait... and wait. – pierrely Jul 28 '20 at 10:21
  • I came across this one; perhaps you already know everything in it: https://forum.manjaro.org/t/amd-ryzen-problems-and-fixes/55533

    Also, have you tried disabling the TPM? https://askubuntu.com/questions/1250517/how-can-i-turn-tpm-off-or-disable-it-in-ubuntu#

    – pierrely Jul 28 '20 at 10:30
  • I'll give those a shot this weekend, thanks! I also got other suggestions to try on Reddit and will report back here if anything works: https://www.reddit.com/r/linuxhardware/comments/hz7itu/crashing_ubuntu_2004_aorus_x570_elite_and_ryzen_5/ – bmogilefsky Aug 01 '20 at 04:06
  • Did you get it solved? Same issues with the same hardware config. – Vadim Peretokin Apr 21 '21 at 06:26

1 Answer

I'm facing the same issue with a 3700X on the same motherboard model, running Debian Buster with several different kernels. The system had been stable for a long time; the issues started when I updated the BIOS at the same time as installing new memory. I flashed the BIOS back to version F3 today, and the system now seems to be stable again. Unfortunately, this old BIOS version does not seem to support ECC on my memory modules.