
My desktop is running Ubuntu 20.04. I've been noticing a lot of odd behavior over the last few months (at least) and I'm trying to figure out how to troubleshoot it.

The system has 32 GB of RAM, an AMD Ryzen 5900X CPU, an MSI MAG B550 Tomahawk motherboard, and a couple of SSDs attached.

The symptoms are:

  • Browser tabs crash all the time. Every 15 minutes or so, one of my open tabs (of which I typically have ~20-30) will crash (with "Aw Snap!" in Chrome, and similar messages in Firefox and Chromium).
  • Slack will crash intermittently. About 1-2 times per day it just hangs for about a minute and then dies.
  • My virtual machines get corrupted. I typically have 1-3 open at a time, and about once a month a VM will start complaining about having a read-only file system. When I reboot it, it drops into initramfs, and running fsck across the disk typically fixes it.

My gut is that I could have some failing RAM, but I don't know how to troubleshoot that. Are there logs I could look at that would help figure out why Chrome is crashing all the time?

Thanks in advance!

Edited to add at @heynnema's request:

david@jawad:~$ ls -lah /var/crash/
total 240M
drwxrwsrwt  2 root     whoopsie 4.0K Mar 22 10:43 .
drwxr-xr-x 15 root     root     4.0K Aug 31  2021 ..
-rw-r-----  1 david     whoopsie  35M Mar 22 01:13 _opt_google_chrome_chrome.1000.crash
-rw-r-----  1 david     whoopsie  27M Mar 22 10:43 _usr_bin_python3.8.1000.crash
-rw-r-----  1 david     whoopsie  60M Mar 15 15:51 _usr_lib_insync_PySide2_Qt_libexec_QtWebEngineProcess.1000.crash
-rw-r--r--  1 david     whoopsie    0 Mar 15 15:51 _usr_lib_insync_PySide2_Qt_libexec_QtWebEngineProcess.1000.upload
-rw-------  1 whoopsie whoopsie   37 Mar 15 15:51 _usr_lib_insync_PySide2_Qt_libexec_QtWebEngineProcess.1000.uploaded
-rw-r-----  1 david     whoopsie  98M Mar 22 07:28 _usr_lib_slack_slack.1000.crash
-rw-r-----  1 david     whoopsie  22M Mar 17 09:45 _usr_share_typora_Typora.1000.crash

Will get memtest as well.

Next edit, as requested:

# lshw -C memory
  *-firmware                
       description: BIOS
       vendor: American Megatrends International, LLC.
       physical id: 0
       version: A.60
       date: 05/12/2021
       size: 64KiB
       capacity: 32MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppynec int13floppytoshiba int13floppy360 int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 10
       slot: System board or motherboard
       size: 32GiB
     *-bank:0
          description: 2667 MHz (0.4 ns) [empty]
          product: Unknown
          vendor: Unknown
          physical id: 0
          serial: Unknown
          slot: DIMM 0
          clock: 2667MHz (0.4ns)
     *-bank:1
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0.4 ns)
          product: F4-3200C16-16GVK
          vendor: Unknown
          physical id: 1
          serial: 00000000
          slot: DIMM 1
          size: 16GiB
          width: 64 bits
          clock: 2667MHz (0.4ns)
     *-bank:2
          description: 2667 MHz (0.4 ns) [empty]
          product: Unknown
          vendor: Unknown
          physical id: 2
          serial: Unknown
          slot: DIMM 0
          clock: 2667MHz (0.4ns)
     *-bank:3
          description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2667 MHz (0.4 ns)
          product: F4-3200C16-16GVK
          vendor: Unknown
          physical id: 3
          serial: 00000000
          slot: DIMM 1
          size: 16GiB
          width: 64 bits
          clock: 2667MHz (0.4ns)
  *-cache:0
       description: L1 cache
       physical id: 13
       slot: L1 - Cache
       size: 768KiB
       capacity: 768KiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=1
  *-cache:1
       description: L2 cache
       physical id: 14
       slot: L2 - Cache
       size: 6MiB
       capacity: 6MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=2
  *-cache:2
       description: L3 cache
       physical id: 15
       slot: L3 - Cache
       size: 64MiB
       capacity: 64MiB
       clock: 1GHz (1.0ns)
       capabilities: pipeline-burst internal write-back unified
       configuration: level=3
DCHeel
  • Edit your question and show me ls -al /var/crash. Ryzen processors are very fussy about RAM. Go to https://www.memtest86.com/ and download/run their free memtest to test your memory. Get at least one complete pass of all the 4/4 tests to confirm good memory. This may take a few hours to complete. Report back. Start comments to me with @heynnema or I'll miss them. – heynnema Mar 22 '22 at 18:54
  • @heynnema - I posted /var/crash in the post last night. I had some ongoing tasks that needed to finish, so I'm just now running memtest. I think it's not looking good. It's still in pass 1/4, but I'm at 2202 errors. Does this mean the RAM is bad? What information would I need to save from this run to help with troubleshooting? – DCHeel Mar 23 '22 at 13:18
  • You found the problem. Ryzen vs RAM. Show me sudo lshw -C memory. – heynnema Mar 23 '22 at 13:46
  • @heynnema - posted – DCHeel Mar 23 '22 at 15:54
  • Please see Update #2 in my answer. Report back. – heynnema Mar 23 '22 at 19:59

2 Answers


Lots of crash logs in /var/crash.
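To see at a glance which programs are crashing, with what signal, and when, the reports can be summarized with a few shell helpers. This is only a minimal sketch: `crash_program`, `crash_field` and `crash_summary` are hypothetical names, and it assumes the reports follow the usual apport layout (files named `<mangled path>.<uid>.crash` with plain-text `Key: value` headers such as `Signal:` and `Date:` near the top).

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of apport.

# crash_program FILE: strip the directory, the ".crash" suffix and the
# trailing ".<uid>" from a report name, leaving the mangled executable path.
crash_program() {
    name=$(basename "$1" .crash)
    printf '%s\n' "${name%.*}"
}

# crash_field FILE KEY: print the value of the first "KEY: value" header
# in FILE, then stop reading (the reports can be tens of megabytes).
crash_field() {
    sed -n "/^$2: /{s///p;q;}" "$1"
}

# crash_summary [DIR]: one tab-separated line per report:
# program, signal number, date. DIR defaults to /var/crash.
crash_summary() {
    dir=${1:-/var/crash}
    for f in "$dir"/*.crash; do
        [ -e "$f" ] || continue   # no reports: the glob stays unexpanded
        printf '%s\tsignal=%s\t%s\n' \
            "$(crash_program "$f")" \
            "$(crash_field "$f" Signal)" \
            "$(crash_field "$f" Date)"
    done
}
```

Running `crash_summary` with no arguments reads /var/crash. Unrelated programs all dying with memory-access signals would be consistent with a hardware problem, though it does not prove one.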

Ryzen processors are very fussy about RAM. Go to https://memtest86.com and download/run their free memtest to test your memory. Get at least one complete pass of all the 4/4 tests to confirm good memory. This may take a few hours to complete.

Update #1:

memtest failed. Show me sudo lshw -C memory.

Update #2:

You have two 16G DIMMs of memory. First, power off the computer, and remove, then reinsert each DIMM, and re-run memtest. If it passes, then you've probably fixed the problem.

If it still fails, remove one 16G DIMM and re-run memtest with only the other one installed. If that run fails, the installed DIMM may be defective.

In either case, swap them: remove that DIMM, reinsert the other DIMM, and re-run memtest.

It's important to write down which DIMM passed, and which DIMM failed. Report back.

Update #3:

memtest fails on individual DIMMs. Suspect RAM compatibility issue, but first we need to update the BIOS and retest with memtest. Get the BIOS update at https://us.msi.com/Motherboard/MAG-B550-TOMAHAWK/support.

Note: Confirm that I have the correct web page for your motherboard (MSI MAG B550 Tomahawk MD).

Note: Have good backups before updating the BIOS.

Update #4:

Your memory DIMMs (F4-3200C16-16GVK) don't appear on the memory compatibility lists shown here.
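To pull out just the part numbers, sizes and configured clocks to compare against a compatibility list, the `lshw -C memory` text can be filtered. This is a rough sketch, assuming the output layout shown in the question; `populated_dimms` is a hypothetical helper name, and it treats a bank as populated only if it has a `size:` line.

```shell
#!/bin/sh
# Sketch only: populated_dimms is a hypothetical helper name.
# Reads `lshw -C memory` text on stdin and prints one line per populated
# bank: "<bank> <product> <size> <clock>". Empty banks have no "size:"
# line in the output above, so they are skipped.
populated_dimms() {
    awk '
        function emit() {
            if (inbank && size != "") print bank, product, size, clock
        }
        /\*-bank:/ {
            emit()
            bank = $1; sub(/^\*-/, "", bank)
            product = ""; size = ""; clock = ""
            inbank = 1
            next
        }
        /\*-/ { emit(); inbank = 0; next }   # any other section ends the bank
        inbank && $1 == "product:" { product = $2 }
        inbank && $1 == "size:"    { size = $2 }
        inbank && $1 == "clock:"   { clock = $2 }
        END { emit() }
    '
}
```

Typical use would be `sudo lshw -C memory | populated_dimms`. For the output posted in the question this should list bank:1 and bank:3, both F4-3200C16-16GVK at 16GiB and 2667MHz.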

Update #5:

Review pages 13 and 15 of the User Manual here and confirm that DIMM slot A2 is filled first, and B2 next, with exact same spec DIMMs.

Update #6:

See https://www.crucial.com/ for correct spec DIMMs. See https://www.crucial.com/compatible-upgrade-for/msi-%28micro-star%29/mag-b550-tomahawk

Update #7:

Confirm that CPU and RAM are NOT overclocked. If they are, set them back to default clocks, and retest with memtest.

Update #8:

Replaced the RAM. Everything is working fine now.

heynnema
  • If I'm getting 100+ errors in the first minute, is that enough to say the test is failed and report back? Or is it important to run the full test?

    I get the same result with both removed and reset and with each individually - within the first minute, 50+ errors, and then hundreds of errors within a few minutes. I can run the full test later when I don't need my computer if that's necessary.

    – DCHeel Mar 23 '22 at 21:18
  • @DCHeel No, if you're getting errors with only one stick installed at a time, you don't need to run it further. You have a RAM compatibility problem. First thing is to update the BIOS and retest with memtest. Go to https://us.msi.com/Motherboard/MAG-B550-TOMAHAWK/support to get the update. If memtest still fails, then we need to check RAM compatibility. Report back. – heynnema Mar 23 '22 at 21:40
  • @DCHeel See Update #3 in my answer. – heynnema Mar 23 '22 at 21:42
  • @DCHeel See Update #4 in my answer. – heynnema Mar 23 '22 at 21:53
  • @DCHeel See Update #5 in my answer. – heynnema Mar 23 '22 at 22:03
  • updated the bios as instructed. running memtest again. same deal. lots of failures 1 minute in. – DCHeel Mar 23 '22 at 22:20
  • re #4: does this mean i just need new ram? – DCHeel Mar 23 '22 at 22:21
  • Re #5: can confirm they are in a2 and b2 – DCHeel Mar 23 '22 at 22:23
  • @DCHeel Confirm that the memory DIMMs are in slots A2 and B2. Retest if necessary. Then yes, you need different memory DIMMs (meaning correct spec). See https://www.crucial.com/ – heynnema Mar 23 '22 at 22:23
  • Dang. Is it surprising that it works at all? I've been using this computer with minor annoyance almost a year.... – DCHeel Mar 23 '22 at 22:26
  • @DCHeel You could try warranty replacements, if you have the correct spec memory. – heynnema Mar 23 '22 at 22:27
  • @DCHeel You don't have your CPU or RAM overclocked, do you? – heynnema Mar 23 '22 at 22:39
  • No overclock. As far as warranty, it sounds like it's not broken, just not a good fit? – DCHeel Mar 23 '22 at 22:49
  • @DCHeel May be the wrong spec. See Update #6. – heynnema Mar 23 '22 at 22:50
  • @DCHeel Did you finally resolve the memory problem? – heynnema Mar 27 '22 at 15:19
  • I actually ordered a new set of ram on amazon, and it worked flawlessly. So I've put in an RMA to send back the original ram to GSkill. – DCHeel Mar 28 '22 at 16:03
  • @DCHeel Good news! memtest passed with the new RAM? – heynnema Mar 28 '22 at 16:22
  • Yup! Full pass. – DCHeel Apr 04 '22 at 10:56
  • Did you update the BIOS, as outlined in Update #3, and then rerun memtest with your original RAM? – heynnema Apr 04 '22 at 12:32
  • Yes, I did that - see my comment from 23 Mar: "updated the bios as instructed. running memtest again. same deal. lots of failures 1 minute in." – DCHeel Apr 05 '22 at 17:45
  • @DCHeel Good deal! Thanks for the update. – heynnema Apr 05 '22 at 17:52
  • sure thing. thanks for the help. – DCHeel Apr 06 '22 at 18:31

I had the same problem. I tried this and it worked for me:

Ubuntu 22.04 LTS introduced systemd-oomd, a user-space out-of-memory killer that's designed to "take corrective action before an OOM occurs in the kernel space". When it detects that memory pressure is getting too high, it intervenes by killing processes so that the rest of the system copes and (most) things stay running. From the user's side, that can look like applications crashing at random. I hope it helps you.

Most systemd services can be managed via the systemctl utility. In this case, we want to disable the systemd-oomd service. This can be done with:

$ systemctl disable --now systemd-oomd

You should see something like (depending on your OS):

$ systemctl disable --now systemd-oomd
Removed /etc/systemd/system/multi-user.target.wants/systemd-oomd.service.
Removed /etc/systemd/system/dbus-org.freedesktop.oom1.service.

You can then verify that the service is disabled, with:

$ systemctl is-enabled systemd-oomd

And you should then see:

$ systemctl is-enabled systemd-oomd
disabled

It is possible, however, that other services might attempt to restart the systemd-oomd service. To prevent this, you can 'mask' the service. For example:

$ systemctl mask systemd-oomd
Created symlink /etc/systemd/system/systemd-oomd.service → /dev/null.

And then systemctl is-enabled should now report:

$ systemctl is-enabled systemd-oomd
masked

See man systemctl for more details; in particular, note the caveats regarding masking of systemd services.
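Before disabling or masking anything, it may be worth confirming that systemd-oomd is present at all, since releases older than 22.04 may not ship it. This is a small sketch that wraps the `systemctl is-enabled` call shown above; `oomd_state` is a hypothetical helper name, and reporting "not-found" when systemctl is missing or prints nothing is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: oomd_state is a hypothetical helper name.
# Prints the unit-file state of systemd-oomd (for example "enabled",
# "disabled" or "masked"), or "not-found" when systemctl is unavailable
# or prints nothing for the unit.
oomd_state() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "not-found"
        return 0
    fi
    # is-enabled exits non-zero for disabled or missing units; ignore that
    # and look only at what it printed.
    state=$(systemctl is-enabled systemd-oomd 2>/dev/null) || true
    printf '%s\n' "${state:-not-found}"
}

echo "systemd-oomd state: $(oomd_state)"
```

If this reports "not-found", the steps above do not apply to that machine and the crashes have another cause.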

How do I disable the systemd OOM process killer in Ubuntu 22.04?

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review – Pilot6 Jun 22 '22 at 07:57
  • @Pilot6 Thank you very much for the advice, I already edited the answer, best regards – Juan Torchia Jun 23 '22 at 13:42