A recent kernel update broke my CUDA installation (it works fine if I boot an older kernel). Very similar setups have survived kernel updates on other machines; the major difference is that this installation is Ubuntu Server and the others are Ubuntu Desktop. Does this sound like a DKMS issue, or something else? How can I get my CUDA modules to build themselves against new kernels?
I have Ubuntu 16.04, CUDA 10.0 (installed from the local .deb), NVIDIA driver 410.48 (installed automatically during the CUDA install), and a 2080 Ti GPU.
$ ls -al /boot
total 111740
drwxr-xr-x 3 root root 4096 Apr 9 12:02 .
drwxr-xr-x 24 root root 4096 Apr 4 16:53 ..
-rw-r--r-- 1 root root 1252376 Jan 16 23:29 abi-4.4.0-142-generic
-rw-r--r-- 1 root root 190580 Jan 16 23:29 config-4.4.0-142-generic
-rw-r--r-- 1 root root 190580 Mar 26 14:02 config-4.4.0-145-generic
drwxr-xr-x 5 root root 4096 Apr 9 12:02 grub
-rw-r--r-- 1 root root 50832836 Apr 4 16:54 initrd.img-4.4.0-142-generic
-rw-r--r-- 1 root root 39170185 Apr 9 11:15 initrd.img-4.4.0-145-generic
-rw-r--r-- 1 root root 182704 Jan 28 2016 memtest86+.bin
-rw-r--r-- 1 root root 184380 Jan 28 2016 memtest86+.elf
-rw-r--r-- 1 root root 184840 Jan 28 2016 memtest86+_multiboot.bin
-rw-r--r-- 1 root root 255 Jan 16 23:29 retpoline-4.4.0-142-generic
-rw------- 1 root root 3904797 Jan 16 23:29 System.map-4.4.0-142-generic
-rw------- 1 root root 3906115 Mar 26 14:02 System.map-4.4.0-145-generic
-rw------- 1 root root 7184032 Jan 16 23:29 vmlinuz-4.4.0-142-generic
-rw------- 1 root root 7188984 Mar 27 10:03 vmlinuz-4.4.0-145-generic
$ dkms status
bbswitch, 0.8, 4.4.0-142-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-145-generic, x86_64: installed
nvidia-410, 410.48, 4.4.0-142-generic, x86_64: installed
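The output above has an nvidia-410 entry for 4.4.0-142 but none for 4.4.0-145. A minimal sketch of how that gap could be detected automatically follows. The function name `missing_kernels` is hypothetical, and the parsing assumes the comma-separated `dkms status` format shown here (newer DKMS releases print a different layout).

```shell
#!/bin/sh
# missing_kernels MODULE KERNEL...
# Hypothetical helper: reads `dkms status`-style text on stdin and prints
# every KERNEL for which MODULE has no line ending in ": installed".
# Assumes the "module, version, kernel, arch: state" layout shown above.
missing_kernels() {
    module=$1
    shift
    status=$(cat)    # capture stdin once so it can be scanned per kernel
    for kernel in "$@"; do
        printf '%s\n' "$status" | awk -F', ' -v m="$module" -v k="$kernel" '
            $1 == m && $3 == k && $4 ~ /: installed$/ { found = 1 }
            END { exit (found ? 0 : 1) }
        ' || printf '%s\n' "$kernel"
    done
}

# Usage (on the affected machine):
#   dkms status | missing_kernels nvidia-410 4.4.0-142-generic 4.4.0-145-generic
```

For a kernel reported as missing, the usual manual step is `sudo dkms install -m nvidia-410 -v 410.48 -k 4.4.0-145-generic`, which asks DKMS to build and install that module version for that kernel. Whether the build succeeds on this particular system is not something the output above can confirm.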
$ ls -al /usr/src
total 44
drwxr-xr-x 11 root root 4096 Apr 9 12:02 .
drwxr-xr-x 12 root root 4096 Mar 14 12:56 ..
drwxr-xr-x 2 root root 4096 Mar 14 11:05 bbswitch-0.8
drwxr-xr-x 5 root root 4096 Mar 14 14:55 cudnn_samples_v7
drwxr-xr-x 3 root root 4096 Mar 14 12:56 gmock
drwxr-xr-x 4 root root 4096 Mar 14 12:56 gtest
drwxr-xr-x 27 root root 4096 Feb 27 18:41 linux-headers-4.4.0-142
drwxr-xr-x 7 root root 4096 Feb 27 18:43 linux-headers-4.4.0-142-generic
drwxr-xr-x 27 root root 4096 Apr 4 16:53 linux-headers-4.4.0-145
drwxr-xr-x 7 root root 4096 Apr 4 16:53 linux-headers-4.4.0-145-generic
drwxr-xr-x 8 root root 4096 Mar 14 14:49 nvidia-410-410.48
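DKMS needs the matching kernel headers to build a module, so it can help to cross-check /boot against /usr/src. The sketch below does that comparison; the function name `kernels_without_headers` is hypothetical, and it assumes Ubuntu's naming scheme (`vmlinuz-<version>` and `linux-headers-<version>`) as seen in the listings here.

```shell
#!/bin/sh
# kernels_without_headers BOOT_DIR SRC_DIR
# Hypothetical helper: prints the version of every vmlinuz-* image in
# BOOT_DIR that has no linux-headers-<version> directory in SRC_DIR.
kernels_without_headers() {
    boot_dir=$1
    src_dir=$2
    for image in "$boot_dir"/vmlinuz-*; do
        [ -e "$image" ] || continue            # glob matched nothing
        kernel=${image##*/vmlinuz-}            # strip path and "vmlinuz-" prefix
        [ -d "$src_dir/linux-headers-$kernel" ] || printf '%s\n' "$kernel"
    done
}

# Usage:
#   kernels_without_headers /boot /usr/src
```

With the listings shown in this question, both 4.4.0-142-generic and 4.4.0-145-generic have header directories, so the function would print nothing, which suggests missing headers are not the cause here.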
$ ls -alR /var/lib/dkms
[Very long output] https://pastebin.com/RRMsBT0s
[...] dkms status and ls -al /boot. Report back to @heynnema – heynnema Apr 10 '19 at 19:28
[...] ls -al /usr/src and ls -alR /var/lib/dkms. What make/model video card do you have? – heynnema Apr 11 '19 at 12:33
[...] dkms status show the missing 4th line? It's my understanding that the CUDA install also installs the NVIDIA video driver, yes? If not, then go to nvidia.com and find the latest driver for your card, which may be 418.56, and install it separately (I don't know if there are any CUDA compatibility issues, and you might have to ask on the NVIDIA forums about that). Report back please. – heynnema Apr 15 '19 at 14:28
[...] dkms status shows the same thing. I guess I need to do a deeper purge. The NVIDIA docs don't say how to do this, but I'll try following these instructions: https://askubuntu.com/questions/530043/removing-nvidia-cuda-toolkit-and-installing-new-one – ezekiel Apr 16 '19 at 10:12
Purging and reinstalling CUDA (while booted into the new kernel) made no difference. Ultimately, purging CUDA and nouveau and installing the latest NVIDIA driver directly from nvidia.com got me the expected output from nvidia-smi. However, it has created more problems that are probably not so relevant to this question. I am seeking further assistance in the NVIDIA forums. When (if!) I get a full resolution, I will post back here. – ezekiel Apr 17 '19 at 13:07
[...] dkms status to show correctly? Did you follow ALL the instructions outlined in the installation document? You may have to wait to get help from the NVIDIA forums. Please keep me posted. – heynnema Apr 17 '19 at 13:31
I'm going to try purging cuda 10-0 from dpkg and then installing it with the runfile. I think that CUDA 10.1 isn't really installed, as there is nothing under /usr/local/cuda and nvcc -V doesn't do anything. nvidia-smi just claims that it is. – ezekiel Apr 17 '19 at 13:41