I'm slowly losing my mind over having to install graphics drivers for an NVIDIA Tesla T4 on an Ubuntu system. For context, I work in a group where we were given a virtual server with a dedicated graphics card so we can use CUDA for compute-intensive applications.
I've been following the official installation documentation provided by NVIDIA. I used the network repo installation method for Ubuntu. I linked the installation logs and system information below. But every time I try to validate the installation, it fails.
# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The message is odd to me since the Tesla T4 is well supported by the latest NVIDIA driver v515.65.01. So I've started digging.
What I've done so far
I've been trying to diagnose the problem, starting with the kernel module. lsmod lists the NVIDIA kernel module as loaded, but modprobe nvidia fails, claiming there is no such device.
# lsmod | grep -i nvidia
nvidia 40796160 1
drm 491520 7 vmwgfx,drm_kms_helper,nvidia,ttm
# modinfo nvidia
filename: /lib/modules/5.4.0-125-generic/updates/dkms/nvidia.ko
firmware: nvidia/515.65.01/gsp.bin
alias: char-major-195-*
version: 515.65.01
supported: external
license: NVIDIA
srcversion: 8049D44E2C1B08F41E1B8A6
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: drm
retpoline: Y
name: nvidia
vermagic: 5.4.0-125-generic SMP mod_unload modversions
...
# modprobe nvidia -vv
modprobe: INFO: ../libkmod/libkmod.c:365 kmod_set_log_fn() custom logging function 0x56379348cc70 registered
insmod /lib/modules/5.4.0-125-generic/updates/dkms/nvidia.ko
modprobe: INFO: ../libkmod/libkmod-module.c:892 kmod_module_insert_module() Failed to insert module '/lib/modules/5.4.0-125-generic/updates/dkms/nvidia.ko': No such device
modprobe: ERROR: could not insert 'nvidia': No such device
modprobe: INFO: ../libkmod/libkmod.c:332 kmod_unref() context 0x563793ca6450 released
There is only one NVIDIA device node, /dev/nvidiactl. I've never dealt with NVIDIA before, so I don't know whether this is unusual.
# ll /dev | grep -i nvidia
crw-rw-rw- 1 root root 195, 255 Aug 26 08:50 nvidiactl
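For comparison, a working driver setup normally exposes one /dev/nvidiaN node per GPU next to /dev/nvidiactl, plus /dev/nvidia-uvm once the nvidia-uvm module is loaded. Below is a minimal sketch that reports which of these exist. The function name check_nvidia_nodes and its optional directory argument are made up for illustration, and the node list assumes a single GPU.

```shell
# Report which of the usual NVIDIA device nodes exist.
# $1: directory to inspect, defaults to /dev (hypothetical parameter).
check_nvidia_nodes() {
  dev_dir="${1:-/dev}"
  # Assumption: a single GPU, so only nvidia0 is expected.
  for node in nvidiactl nvidia0 nvidia-uvm; do
    if [ -e "$dev_dir/$node" ]; then
      echo "$node: present"
    else
      echo "$node: missing"
    fi
  done
}
```

Calling check_nvidia_nodes without arguments inspects /dev. A lone nvidiactl without nvidia0 would be consistent with the driver never having bound to the GPU.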
I checked the kernel ring buffer immediately after installation, and it shows the error that probably triggers the nvidia-smi error message. After a reboot, the kernel ring buffer gets spammed with the same message, most likely (but not verified) due to the NVIDIA persistence daemon service unit.
# dmesg
[ 1408.306561] nvidia: loading out-of-tree module taints kernel.
[ 1408.306573] nvidia: module license 'NVIDIA' taints kernel.
[ 1408.306574] Disabling lock debugging due to kernel taint
[ 1408.328692] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1408.337548] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 1408.339537] nvidia 0000:02:00.0: enabling device (0100 -> 0102)
[ 1408.340568] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 1408.340694] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1eb8)
NVRM: installed in this system is not supported by the
NVRM: NVIDIA 515.65.01 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
NVRM: in this release's README, available on the operating system
NVRM: specific graphics driver download page at www.nvidia.com.
[ 1408.341210] nvidia: probe of 0000:02:00.0 failed with error -1
[ 1408.341239] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 1408.341240] NVRM: None of the NVIDIA devices were initialized.
[ 1408.341491] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
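The NVRM lines name the bus address and PCI ID of the device the driver rejected. Below is a small sketch for pulling both out of the log so they can be passed on to lspci. The helper name nvrm_rejected_gpus is hypothetical, and the pattern assumes the exact message format shown in the dmesg output above.

```shell
# Read kernel log text on stdin and print "<bus address> <vendor:device>"
# for every GPU reported in an NVRM "The NVIDIA GPU ..." line.
nvrm_rejected_gpus() {
  sed -n 's/.*NVRM: The NVIDIA GPU \([0-9a-f:.]*\) (PCI ID: \([0-9a-f]*:[0-9a-f]*\)).*/\1 \2/p'
}
```

Typical use would be dmesg | nvrm_rejected_gpus, followed by lspci -nn -s with the reported bus address (0000:02:00.0 here) to see how the guest enumerates the card.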
I found this section in the Arch Wiki which seemed to describe the exact problem I was having. I added the pcie_port_pm=off kernel parameter to the GRUB config and rebooted, but it still doesn't work.
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-125-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity pcie_port_pm=off
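To confirm that a parameter really reached the running kernel rather than just the GRUB config file, the command line can be checked token by token. This is a rough sketch: has_kernel_param and its optional file argument are illustrative names, not an existing tool.

```shell
# Succeed if the exact parameter $1 appears on the kernel command line.
# $2: file to read, defaults to /proc/cmdline (hypothetical parameter).
has_kernel_param() {
  # Split on spaces and compare whole tokens literally (-x -F),
  # so "pcie_port_pm=off" does not match "pcie_port_pm=official".
  tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example, has_kernel_param pcie_port_pm=off && echo "parameter is active" prints the message on the system above.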
I also blacklisted the Nouveau driver manually. It shouldn't be an issue since the server is headless and Xorg isn't even installed, but better to be safe than sorry. But again, no luck.
# cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
blacklist nouveau
options nouveau modeset=0
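Whether the blacklist took effect can be checked against the list of loaded modules. Below is a minimal sketch, assuming the usual /proc/modules layout with the module name in the first column. module_loaded is a made-up helper name.

```shell
# Succeed if module $1 is currently loaded.
# $2: modules file, defaults to /proc/modules (hypothetical parameter).
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

Running module_loaded nouveau || echo "nouveau is not loaded" should print the message if the blacklist works. If nouveau still shows up, regenerating the initramfs (update-initramfs -u on Ubuntu) and rebooting is the usual next step.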
I contacted the sysadmins at my workplace about the issue and they basically responded saying "we haven't encountered this issue yet" and "we work with Windows 99% of the time anyway". Our work group is indeed an outlier since we rely on Linux only in a nearly pure Windows network. But I've basically been given permission by the sysadmins to do whatever's necessary to fix the issue and to report back any findings.
However, I'm currently at the end of my knowledge. I'm comfortable with the command line, but not enough to dig even deeper. Our sysadmins don't seem to really know what's going on, and since SSH is my only way of accessing the server, I'm not confident doing anything that might brick it. I hope that there's something painfully obvious that I'm missing here, or that there's someone out there who has had the same issue before. I already spent an entire workday trying to resolve the issue and I'm at the start of another one, so any pointers and helpful insights are much appreciated.
ls /sys/firmware doesn't show an efi entry. Could booting into UEFI without secure boot be the solution? I just want to verify this first, because changes like these require me to talk to our sysadmins. – mjugl Aug 29 '22 at 06:13
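The presence of /sys/firmware/efi is the common way to tell how a Linux system was booted: the directory exists only when the kernel was started through UEFI. Below is a short sketch of that check. boot_mode and its optional argument are illustrative, and it says nothing about whether switching modes would fix the driver problem.

```shell
# Print "UEFI" if the firmware directory contains an efi entry,
# otherwise "legacy BIOS".
# $1: firmware directory, defaults to /sys/firmware (hypothetical parameter).
boot_mode() {
  if [ -d "${1:-/sys/firmware}/efi" ]; then
    echo "UEFI"
  else
    echo "legacy BIOS"
  fi
}
```

On the server described here, boot_mode would be expected to report legacy BIOS, matching the missing efi entry.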