  • Device: HP ProBook 470 G4
  • Integrated GPU: Intel HD Graphics 620
  • Dedicated GPU: NVIDIA GeForce 930MX

My laptop just returned from a service center (because of a CPU failure). Everything was working well before the CPU failed. Now, I installed Ubuntu 20.04 and the proprietary NVIDIA drivers.

Note: I tried EVERY version of the driver. My GPU supports 390, 418, 430, 435, 440, 450 and 455. There's also a strange thing... when I install 440, APT installs 450. Same happens for 430 and 418. 435 is being replaced by 455. Anyway, here's my problem:

When I boot my laptop, it gets stuck on a black screen before gdm3 starts. I can't even switch the TTY. Only SSH is working. When I got the dmesg log, I saw this:

[   16.620560] ACPI Warning: \_SB.PCI0.RP01.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20200528/nsarguments-59)
[   17.126534] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
[   17.126546] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[   18.695141] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.0
[   18.695154] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)                     
[   18.695159] pcieport 0000:00:1c.0: AER:   device [8086:9d10] error status/mask=00100000/00010000                                                     
[   18.695161] pcieport 0000:00:1c.0: AER:    [20] UnsupReq               (First)                                                                       
[   18.695164] pcieport 0000:00:1c.0: AER:   TLP Header: 34000000 00000010 00000000 00000000                                                            
[   18.695173] nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
[   18.695208] pcieport 0000:00:1c.0: AER: device recovery failed
[   18.699191] NVRM: GPU at PCI:0000:01:00: GPU-9fe5f99e-479c-1100-e75b-dc4310990232
[   18.699194] NVRM: Xid (PCI:0000:01:00): 79, pid=1521, GPU has fallen off the bus.                                                                    
[   18.699197] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[   18.699206] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[   19.031183] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
[   19.423191] irq 16: nobody cared (try booting with the "irqpoll" option)
[   19.423195] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           OE     5.8.0-050800-generic #202008022230
[   19.423195] Hardware name: HP HP ProBook 470 G4/8234, BIOS P85 Ver. 01.37 10/19/2020
[   19.423196] Call Trace:
[   19.423197]  <IRQ>
[   19.423202]  dump_stack+0x70/0x8d
[   19.423205]  __report_bad_irq+0x3a/0xaf
[   19.423206]  note_interrupt.cold+0x8/0x60
[   19.423208]  handle_irq_event+0xaa/0xb1
[   19.423208]  handle_fasteoi_irq+0x7d/0x1c0
[   19.423210]  asm_call_on_stack+0x12/0x20
[   19.423211]  </IRQ>
[   19.423213]  common_interrupt+0xbc/0x160
[   19.423214]  asm_common_interrupt+0x1e/0x40
[   19.423215] RIP: 0010:poll_idle+0x9b/0xb9
[   19.423217] Code: 44 89 e8 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 f7 48 89 de e8 77 71 dd ff 49 89 c6 b8 c9 00 00 00 49 8b 17 83 e2 08 75 b1 f3 90 <83> e8 01 75 f1 65 8b 3d 59 97 04 63 e8 34 f6 51 ff 4c 29 e0 4c 39
[   19.423217] RSP: 0018:ffffa8f3000ffe10 EFLAGS: 00000246
[   19.423218] RAX: 0000000000000020 RBX: ffff9b8bc05b7500 RCX: 000000000000001f
[   19.423219] RDX: 0000000000000000 RSI: ffff9b8bc05b7500 RDI: ffffffff9df6d760
[   19.423219] RBP: ffffa8f3000ffe38 R08: 0000000485b61e74 R09: 0000000000000001
[   19.423220] R10: 0000000000000003 R11: ffff9b8bc05ab364 R12: 0000000485b61e74
[   19.423221] R13: 0000000000000000 R14: 00000000000007d0 R15: ffff9b8bb5300000
[   19.423223]  cpuidle_enter_state+0x81/0x3f0
[   19.423224]  cpuidle_enter+0x2e/0x40
[   19.423226]  cpuidle_idle_call+0x145/0x200
[   19.423227]  do_idle+0x7a/0xe0
[   19.423228]  cpu_startup_entry+0x20/0x30
[   19.423230]  start_secondary+0xe6/0x100
[   19.423232]  secondary_startup_64+0xb6/0xc0
[   19.423233] handlers:
[   19.423236] [<00000000750c932b>] i801_isr [i2c_i801]
[   19.423237] Disabling IRQ #16
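
In a longer log, the lines that matter here (the PCIe AER reports and the NVIDIA Xid event) can be filtered out of a saved copy. The following is only a sketch: the function names `gpu_bus_errors` and `xid_codes` are made up for illustration, and the patterns assume the message format shown in the log above.

```shell
# Hypothetical helpers; the names are illustrative and not part of any tool.
# Both read a saved log file, for example one created with: dmesg > boot.log

# Print only the PCIe AER lines and the NVIDIA Xid lines.
gpu_bus_errors() {
    grep -E 'AER:|NVRM: Xid' "$1"
}

# Print just the numeric Xid codes, one per line.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' "$1"
}
```

Run against the log above, `xid_codes boot.log` would report the code 79, which the driver itself describes as "GPU has fallen off the bus".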

I can always sudo prime-select intel && sudo systemctl restart gdm3 using SSH to get the display manager working, but the NVIDIA card just doesn't work.

Note: I don't think this indicates a GPU failure. I can get the GPU working by adding some boot arguments. For example, I tried these:

quiet splash rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009' pci=nomsi

Adding the arguments fixed the issue, but only for 1 boot. So, when I started my laptop, everything was working fine, even suspending. The GPU was working so I know it isn't failing. Also it's working fine in Windows. When I restart my laptop, it gets stuck on the black screen again (yes, I updated grub to make the changes permanent).
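
One way to rule out a GRUB configuration problem is to check that the arguments actually reached the running kernel, which exposes its command line in /proc/cmdline. A minimal sketch follows; the function name `has_karg` is hypothetical, and the optional file argument exists only so the check can be tried against a saved copy.

```shell
# has_karg ARG [FILE]: succeed if ARG is one of the space-separated words on
# the kernel command line. FILE defaults to /proc/cmdline.
# Assumption: ARG is a simple argument without spaces. A quoted value such as
# acpi_osi='Windows 2009' is split into two words and will not match as a whole.
has_karg() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}
```

For example, `has_karg pci=nomsi && echo "pci=nomsi is active"` prints the message only if the argument was applied on the current boot.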

pci=nomsi disables MSI, but it doesn't solve my issue. The GPU still "falls off the bus", but with different error messages (failed to enable MSI).

Is there a way to maybe disable the PCIe errors so the NVIDIA driver doesn't crash? I really think it's crashing because the kernel spams it with error messages. Any help would be greatly appreciated.

Edit 1: I tried the irqpoll option but it didn't fix anything... An odd thing here is everything works fine in Windows. It's just Ubuntu (I might try other distros if necessary). I can't open the laptop's case because it would void the repair warranty.

Edit 2: Output of lspci -tv:

-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
           +-02.0  Intel Corporation HD Graphics 620
           +-14.0  Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
           +-14.2  Intel Corporation Sunrise Point-LP Thermal subsystem
           +-17.0  Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode]
           +-1c.0-[01]----00.0  NVIDIA Corporation GM108M [GeForce 930MX]
           +-1c.4-[02]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-1c.5-[03]----00.0  Intel Corporation Wireless 7265
           +-1d.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader
           +-1f.0  Intel Corporation Sunrise Point-LP LPC Controller
           +-1f.2  Intel Corporation Sunrise Point-LP PMC
           +-1f.3  Intel Corporation Sunrise Point-LP HD Audio
           \-1f.4  Intel Corporation Sunrise Point-LP SMBus
adazem009
  • 1,032
  • Edit your question and show me lspci -tv. Remove "rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009' pci=nomsi" and sudo update-grub. Your Nvidia driver should be 460.39, see https://www.nvidia.com/en-us/geforce/drivers/ – heynnema Jan 29 '21 at 21:20
  • @heynnema Yes, my driver version is 460. I didn't mention it because I asked this question before the new version came out. It's the same on all releases anyway. – adazem009 Jan 31 '21 at 14:58
  • Please see my answer. Report back. – heynnema Jan 31 '21 at 15:31

2 Answers

It seems it's a hardware issue. I'm not sure what's wrong with the GPU but I think it's attached to the motherboard incorrectly. I'll try to disassemble the laptop and see what's wrong. If I'm not able to fix it, I'll take the laptop to a service center again.

My tests:

  • It wasn't happening before
  • It started to happen in Windows too now (I got something like error 46 in the device manager)
  • It doesn't happen on every boot. Sometimes the GPU works, but it stops working on the next restart, hibernation or suspend.
  • I'm experiencing random PCIe bus errors (>100 dmesg messages per second) even when the Intel GPU is selected. Removing the GPU from the kernel (by writing 1 to /sys/bus/pci/devices/0000:01:00.0/remove) solves this issue without a reboot.
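
The sysfs write from the last point can be sketched as a pair of small functions. The names `pci_remove` and `pci_rescan` and the `SYSFS` variable are hypothetical; `SYSFS` exists only so the functions can be dry-run against a scratch directory. The `remove` and `rescan` files themselves are the kernel's standard PCI sysfs interface, and writing to them requires root. Whether a rescan brings the GPU back in a working state on this hardware is not something this sketch can promise.

```shell
# Root of sysfs; overridable only for dry runs (an assumption of this sketch).
SYSFS=${SYSFS:-/sys}

# pci_remove ADDR: detach the device at ADDR (e.g. 0000:01:00.0) from the kernel.
pci_remove() {
    node="$SYSFS/bus/pci/devices/$1/remove"
    [ -e "$node" ] || { echo "no such PCI device: $1" >&2; return 1; }
    echo 1 > "$node"
}

# pci_rescan: ask the kernel to re-enumerate the PCI bus, which may bring a
# previously removed device back.
pci_rescan() {
    echo 1 > "$SYSFS/bus/pci/rescan"
}
```

In a root shell, `pci_remove 0000:01:00.0` corresponds to the manual write described above.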
adazem009
  • 1,032

Device 1c.0 is causing the problem... and the AER (Advanced Error Reporting) is reporting it...

       +-1c.0-[01]----00.0  NVIDIA Corporation GM108M [GeForce 930MX]

Although, like you, I suspect a hardware problem, for testing purposes, we can try this...

AER

sudo -H gedit /etc/default/grub # edit this file

Find:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

Change it to:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"

Save the file.

sudo update-grub # update GRUB

reboot # reboot the computer
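
The same edit can be made without opening an editor. This is a rough sketch rather than a polished tool: the function name `add_grub_arg` is hypothetical, it assumes GNU sed (as shipped with Ubuntu), it assumes the GRUB_CMDLINE_LINUX_DEFAULT line uses double quotes as shown above, and it only handles simple arguments that contain no spaces, `|`, `&` or backslashes.

```shell
# add_grub_arg ARG [FILE]: append ARG to GRUB_CMDLINE_LINUX_DEFAULT in FILE
# (default /etc/default/grub) unless it is already there.
add_grub_arg() {
    arg=$1
    file=${2:-/etc/default/grub}
    # Read the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $arg "*) return 0 ;;   # already present, nothing to do
    esac
    # Insert the argument just before the closing quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"\$|\1 $arg\"|" "$file"
}
```

As root, `add_grub_arg pci=noaer` followed by `update-grub` and a reboot would be equivalent to the manual steps above. Keeping a backup copy of /etc/default/grub first is a sensible precaution.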

Otherwise, you'll need to send it back to the Service Center.

heynnema
  • 70,711
  • Hi. I've already tried this. It just hides the errors but the GPU still "falls off the bus". I've also tried kernel options like pcie_aspm=off, but nothing is working for me. – adazem009 Jan 31 '21 at 19:04
  • @adazem009 Send it back to the Service Center. – heynnema Jan 31 '21 at 19:08
  • I think I'll try to repair it myself because I can't afford more repairs. I looked at some disassembly videos for this laptop and I think I can check what's wrong with the GPU easily. I'll buy a new laptop in about 2 years anyway... – adazem009 Jan 31 '21 at 20:32
  • @adazem009 You may void the service warranty if you open the laptop. I'd bring it back and let THEM check it out. – heynnema Jan 31 '21 at 20:41
  • I disassembled the laptop when the CPU failed and it wasn't in warranty anyway. It's a 3 year old laptop and I took it to an unofficial service center. They replaced the CPU and now the laptop works. There are problems like caps lock + num lock blinks 5 times before boot. Also the dedicated GPU has issues. – adazem009 Jan 31 '21 at 20:43
  • @adazem009 if the num lock blinks 5 times, that indicates a hard failure. Look in the User Manual to see what 5 blinks means. – heynnema Jan 31 '21 at 20:45