
I'm slowly losing my mind over having to install graphics drivers for an NVIDIA Tesla T4 on an Ubuntu system. For context, I work in a group that was given a virtual server with a dedicated graphics card so we can use CUDA for compute-intensive applications.

I've been following the official installation documentation provided by NVIDIA. I used the network repo installation method for Ubuntu. I linked the installation logs and system information below. But every time I try to validate the installation, it fails.

# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
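
For reference, the installation steps I ran were roughly the following, reconstructed from NVIDIA's network-repo instructions for Ubuntu 20.04 (the exact keyring version and repo path may differ by release; the linked logs below have the full detail):

# wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
# dpkg -i cuda-keyring_1.0-1_all.deb
# apt-get update
# apt-get install -y cuda-drivers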

The message is odd to me since the Tesla T4 is listed as supported by the driver release I installed, 515.65.01. So I've started digging.

What I've done so far

I've been trying to diagnose the problem, starting with the kernel module. The NVIDIA module does show up in lsmod, yet modprobe nvidia fails with "No such device".

# lsmod | grep -i nvidia     
nvidia              40796160  1         
drm                   491520  7 vmwgfx,drm_kms_helper,nvidia,ttm
# modinfo nvidia        
filename:       /lib/modules/5.4.0-125-generic/updates/dkms/nvidia.ko
firmware:       nvidia/515.65.01/gsp.bin                                                                              
alias:          char-major-195-*
version:        515.65.01
supported:      external
license:        NVIDIA
srcversion:     8049D44E2C1B08F41E1B8A6
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        drm
retpoline:      Y
name:           nvidia
vermagic:       5.4.0-125-generic SMP mod_unload modversions
...
# modprobe nvidia -vv
modprobe: INFO: ../libkmod/libkmod.c:365 kmod_set_log_fn() custom logging function 0x56379348cc70 registered
insmod /lib/modules/5.4.0-125-generic/updates/dkms/nvidia.ko 
modprobe: INFO: ../libkmod/libkmod-module.c:892 kmod_module_insert_module() Failed to insert module '/lib/modules/5.4.0-125-generic/updates/dkms/nvidia.ko': No such device
modprobe: ERROR: could not insert 'nvidia': No such device
modprobe: INFO: ../libkmod/libkmod.c:332 kmod_unref() context 0x563793ca6450 released
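
In case it helps with diagnosis: lspci with -k shows which kernel driver, if any, is bound to the card (on a working setup the "Kernel driver in use" line should say nvidia), and the PCI address 0000:02:00.0 comes from the dmesg output further down:

# lspci -nnk -d 10de:
# ls -l /sys/bus/pci/devices/0000:02:00.0/driver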

The only NVIDIA device node present is /dev/nvidiactl. I've never dealt with NVIDIA hardware before, so I don't know whether that's unusual.

# ll /dev | grep -i nvidia
crw-rw-rw-   1 root root    195, 255 Aug 26 08:50 nvidiactl

I checked the kernel ring buffer immediately after installation, and it shows an error that is probably what triggers the nvidia-smi message. After a reboot, the ring buffer gets spammed with the same block of messages over and over, most likely (though I haven't verified it) because the NVIDIA persistence daemon service unit keeps retrying.

# dmesg
[ 1408.306561] nvidia: loading out-of-tree module taints kernel.
[ 1408.306573] nvidia: module license 'NVIDIA' taints kernel.
[ 1408.306574] Disabling lock debugging due to kernel taint
[ 1408.328692] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1408.337548] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 1408.339537] nvidia 0000:02:00.0: enabling device (0100 -> 0102)
[ 1408.340568] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 1408.340694] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1eb8)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 515.65.01 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[ 1408.341210] nvidia: probe of 0000:02:00.0 failed with error -1
[ 1408.341239] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 1408.341240] NVRM: None of the NVIDIA devices were initialized.
[ 1408.341491] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
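
I haven't verified that the persistence daemon is really what keeps retrying, but its unit state and journal should show it (standard systemd commands, assuming the unit is named nvidia-persistenced as in the Ubuntu packages):

# systemctl status nvidia-persistenced
# journalctl -b -u nvidia-persistenced --no-pager | tail -n 20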

I found a section in the Arch Wiki that seemed to describe the exact problem I was having. I added the pcie_port_pm=off kernel parameter to the GRUB config and rebooted, but it still doesn't work.

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-125-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity pcie_port_pm=off
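
For reference, I added the parameter by appending it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then regenerated the config and rebooted:

# update-grub
# reboot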

I also blacklisted the Nouveau driver manually. It shouldn't matter, since the server is headless and Xorg isn't even installed, but better safe than sorry. Again, no luck.

# cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf 
blacklist nouveau
options nouveau modeset=0
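
For completeness, I also rebuilt the initramfs afterwards so the blacklist is picked up at early boot as well (this is what the usual guides recommend):

# update-initramfs -u
# reboot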

I contacted the sysadmins at my workplace about the issue, and they essentially responded with "we haven't encountered this issue yet" and "we work with Windows 99% of the time anyway". Our work group is indeed an outlier: we rely on Linux in a nearly pure Windows network. Still, the sysadmins have given me permission to do whatever is necessary to fix the issue and to report back any findings.

However, I'm now at the limit of my knowledge. I'm comfortable with the command line, but not enough to dig much deeper. Our sysadmins don't really seem to know what's going on, and since SSH is my only way of accessing the server, I'm hesitant to do anything that might brick it. I hope there's something painfully obvious that I'm missing, or that someone out there has run into the same issue before. I've already spent an entire workday trying to resolve this and I'm at the start of another one, so any pointers and helpful insights are much appreciated.

Logs

  • Disable Secure Boot in UEFI and try again – ChanganAuto Aug 26 '22 at 20:37
  • @ChanganAuto I just found that the server is booted in BIOS mode, not UEFI. ls /sys/firmware doesn't have an efi entry. Could booting into UEFI without secure boot be the solution? Just want to verify it first because changes like these require me to talk to our sysadmins first. – mjugl Aug 29 '22 at 06:13
  • Every piece of current hardware, and anything from the last decade, is UEFI, so UEFI mode is preferred if not mandatory. But in this case I can't say it would make a difference. – ChanganAuto Aug 29 '22 at 07:36
  • Take a look at https://askubuntu.com/questions/1406888/ubuntu-22-04-gpu-passthrough-qemu and google nvidia gpu passthrough from ubuntu virtual machine – ubfan1 Sep 02 '22 at 18:34

1 Answer


It's been a hot minute but I figured it out. The issue boiled down to the fact that our sysadmins forgot to mention two key pieces of information.

  1. We cannot use the stock NVIDIA drivers. Since we're running on a virtual server (VMware), we need NVIDIA's vGPU software. The sysadmins have since provided the correct driver package, and the GPU has been running fine ever since.

  2. To keep using the GPU, we need to set up a license key. The documentation and the key itself have also been provided by the sysadmins, after a long wait.

This is probably a very niche problem. In the end it came down to a communication issue. So if anyone else sees this error while trying to get an NVIDIA GPU to work on a virtual server, it may be worth pushing for the two points mentioned above.
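
For anyone else hitting this: a quick way to check whether the machine is a VM at all (and therefore probably needs the vGPU guest driver rather than the stock one) is systemd-detect-virt, which reports vmware on our guest. Once the vGPU driver and license are in place, the licensing state should also show up in the nvidia-smi query output:

# systemd-detect-virt
# nvidia-smi -q | grep -i license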
