
A recent kernel update broke my Cuda installation (it works fine if I boot the older kernel). Very similar setups have survived kernel updates on other machines; the major difference is that this installation is Ubuntu Server and the others are Ubuntu Desktop. Does this sound like a DKMS issue, or something else? How can I get my Cuda/Nvidia kernel modules to rebuild themselves against new kernels?

I have Ubuntu 16.04, Cuda 10.0 (installed from the local .deb), Nvidia driver 410.48 (installed automatically during the Cuda install), and a 2080 Ti GPU.

$ ls -al /boot
total 111740
drwxr-xr-x  3 root root     4096 Apr  9 12:02 .
drwxr-xr-x 24 root root     4096 Apr  4 16:53 ..
-rw-r--r--  1 root root  1252376 Jan 16 23:29 abi-4.4.0-142-generic
-rw-r--r--  1 root root   190580 Jan 16 23:29 config-4.4.0-142-generic
-rw-r--r--  1 root root   190580 Mar 26 14:02 config-4.4.0-145-generic
drwxr-xr-x  5 root root     4096 Apr  9 12:02 grub
-rw-r--r--  1 root root 50832836 Apr  4 16:54 initrd.img-4.4.0-142-generic
-rw-r--r--  1 root root 39170185 Apr  9 11:15 initrd.img-4.4.0-145-generic
-rw-r--r--  1 root root   182704 Jan 28  2016 memtest86+.bin
-rw-r--r--  1 root root   184380 Jan 28  2016 memtest86+.elf
-rw-r--r--  1 root root   184840 Jan 28  2016 memtest86+_multiboot.bin
-rw-r--r--  1 root root      255 Jan 16 23:29 retpoline-4.4.0-142-generic
-rw-------  1 root root  3904797 Jan 16 23:29 System.map-4.4.0-142-generic
-rw-------  1 root root  3906115 Mar 26 14:02 System.map-4.4.0-145-generic
-rw-------  1 root root  7184032 Jan 16 23:29 vmlinuz-4.4.0-142-generic
-rw-------  1 root root  7188984 Mar 27 10:03 vmlinuz-4.4.0-145-generic

$ dkms status
bbswitch, 0.8, 4.4.0-142-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-145-generic, x86_64: installed
nvidia-410, 410.48, 4.4.0-142-generic, x86_64: installed

$ ls -al /usr/src
total 44
drwxr-xr-x 11 root root 4096 Apr  9 12:02 .
drwxr-xr-x 12 root root 4096 Mar 14 12:56 ..
drwxr-xr-x  2 root root 4096 Mar 14 11:05 bbswitch-0.8
drwxr-xr-x  5 root root 4096 Mar 14 14:55 cudnn_samples_v7
drwxr-xr-x  3 root root 4096 Mar 14 12:56 gmock
drwxr-xr-x  4 root root 4096 Mar 14 12:56 gtest
drwxr-xr-x 27 root root 4096 Feb 27 18:41 linux-headers-4.4.0-142
drwxr-xr-x  7 root root 4096 Feb 27 18:43 linux-headers-4.4.0-142-generic
drwxr-xr-x 27 root root 4096 Apr  4 16:53 linux-headers-4.4.0-145
drwxr-xr-x  7 root root 4096 Apr  4 16:53 linux-headers-4.4.0-145-generic
drwxr-xr-x  8 root root 4096 Mar 14 14:49 nvidia-410-410.48

$ ls -alR /var/lib/dkms
[Very long output] https://pastebin.com/RRMsBT0s
ezekiel
  • Edit your question and show me dkms status and ls -al /boot. Report back to @heynnema – heynnema Apr 10 '19 at 19:28
  • @heynnema have added the diagnostic info you asked for :) – ezekiel Apr 11 '19 at 10:04
  • The nvidia module didn't build for your current kernel. Now show me ls -al /usr/src and ls -alR /var/lib/dkms. What make/model video card do you have? – heynnema Apr 11 '19 at 12:33
  • You accidentally did the first command twice. You didn't tell me what make/model your video card is. – heynnema Apr 11 '19 at 16:51
  • Please see my answer. Please remember to accept it if it solves your problem. Thanks! – heynnema Apr 11 '19 at 17:24
  • @heynnema I did add that I have a 2080Ti GPU - have updated to give the second output. Will reinstall cuda and report back shortly as you suggest. – ezekiel Apr 12 '19 at 10:15
  • Status please... – heynnema Apr 14 '19 at 16:30
  • I did (all sudo): apt-get --purge remove cuda; reboot; apt update && apt install cuda; reboot - no difference. Then: dpkg --remove cuda; dpkg --remove cuda-10-0; dpkg --install cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64.deb; apt update; apt install cuda; reboot - no difference. I think I was just removing and reinstalling the tiny meta-packages cuda and cuda-10-0; there were no errors or warnings, but nothing was built. There are 20 packages listed in dpkg under cuda-*. Maybe I should --remove each one and then rm everything under /usr/local/cuda and /usr/local/cuda-10-0? – ezekiel Apr 15 '19 at 14:08
  • After reinstalling, does dkms status show the missing 4th line? It's my understanding that the cuda install also installs the nvidia video driver, yes? If not, then go to nvidia.com and find the latest driver for your card, which may be 418.56, and install it separately (I don't know if there are any cuda compatibility issues, and you might have to ask on the nvidia forums about that). Report back please. – heynnema Apr 15 '19 at 14:28
  • Yes, the Cuda install also installs the Nvidia driver, although not necessarily the most recent driver. dkms status shows the same thing. I guess I need to do a deeper purge. The Nvidia docs don't say how to do this, but I'll try following these instructions: https://askubuntu.com/questions/530043/removing-nvidia-cuda-toolkit-and-installing-new-one – ezekiel Apr 16 '19 at 10:12
  • Thanks for the update. Let me know how you do. You may just have to reinstall cuda, and manually get the latest Nvidia driver that supports your card, and install it. I think the latest is 418.56. – heynnema Apr 16 '19 at 12:30
  • @heynnema I believe that the problem was in fact that the older nvidia-driver that is packaged with cuda is simply not compatible with the recent kernel update.

    Purging and reinstalling cuda (whilst booted into the new kernel) made no difference. Ultimately, purging cuda and nouveau and installing the latest Nvidia driver directly from nvidia.com got me the expected output from nvidia-smi. However, it has created more problems that are probably not so relevant to this question. I am seeking further assistance in the Nvidia forums. When (if!) I get a full resolution, I will post back here.

    – ezekiel Apr 17 '19 at 13:07
  • Specifically, the problem now is that, according to nvidia-smi, CUDA 10.1 has apparently installed itself with the driver. I didn't expect this, and I specifically want CUDA 10.0 because 10.1 is known not to work with tensorflow and others: apt install cuda-10-0 lists nvidia-410 as a package to install. I probably don't want to install multiple versions of the Nvidia driver. So is it possible to remove cuda 10.1, retain the 418 driver, and install cuda 10.0 without installing the 410 driver? – ezekiel Apr 17 '19 at 13:11
  • Tough choice. Does the cuda 10.1 directly affect you? If not, keep it and the newer video driver. Did installing the new Nvidia video driver allow dkms status to show correctly? Did you follow ALL the instructions outlined in the installation document? You may have to wait to get help from the Nvidia forums. Please keep me posted. – heynnema Apr 17 '19 at 13:31
  • $ dkms status: nvidia, 418.56, 4.4.0-145-generic, x86_64: installed – ezekiel Apr 17 '19 at 13:36
  • Yeah cuda 10.1 is unacceptable - we need tensorflow. There is only one instruction that comes with the driver install - you just run the .sh with sudo.

    I'm going to try purging cuda 10-0 from dpkg and then installing it with the runfile. I think that cuda 10.1 isn't really installed as there is nothing under /usr/local/cuda and nvcc -V doesn't do anything. nvidia-smi just claims that it is.

    – ezekiel Apr 17 '19 at 13:41
  • I can do so if that makes sense for the purposes of the forums, but I haven't actually solved my problem. Your answer was very useful in helping me understand dkms a little more and in narrowing down what the issue is, but it wasn't actually a solution? – ezekiel Apr 17 '19 at 13:44
  • I added an Update to my answer. The answer correctly identified the nature of the problem. Total fixes are going to require Nvidia support, or Nvidia forum, response. As I mentioned, we can worry about accepting the answer later... and I can/will update more as you get feedback from Nvidia :-) – heynnema Apr 17 '19 at 13:50
  • Let me add Update #2 to my answer... give me a couple of minutes... – heynnema Apr 17 '19 at 13:55
  • Problem solved. The driver doesn't actually install cuda 10.1; confusingly, nvidia-smi just reports 'CUDA Version: 10.1' to indicate that the driver is capable of supporting up to cuda 10.1. Having purged the cuda .deb install with "apt-get remove --purge cuda && apt-get autoremove", I then installed cuda 10.0 using the runfile method, saying "no" when it offered to install the driver. So cuda 10.0 now works with the -145 kernel. – ezekiel Apr 18 '19 at 08:57
  • $ dkms status: nvidia, 418.56, 4.4.0-145-generic, x86_64: installed – ezekiel Apr 18 '19 at 08:57

1 Answer


$ dkms status

bbswitch, 0.8, 4.4.0-142-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-145-generic, x86_64: installed
nvidia-410, 410.48, 4.4.0-142-generic, x86_64: installed

This shows us that the nvidia-410 dkms driver did not build on kernel 4.4.0-145-generic. There should be a 4th line that looks like:

nvidia-410, 410.48, 4.4.0-145-generic, x86_64: installed

Who knows why it didn't build on the -145 kernel... there IS a dkms.conf file there.
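
If you want to see why, DKMS usually leaves a build log behind. The paths below are the standard DKMS locations for this module/version (they only exist if a build was actually attempted):

ls /var/lib/dkms/nvidia-410/410.48/ # shows which kernels DKMS has built, or tried to build, against

cat /var/lib/dkms/nvidia-410/410.48/build/make.log # log from the last attempted build, if present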

Cuda 10.0 can be downloaded from Nvidia's CUDA download archive (developer.nvidia.com); the installation document is linked from the same page.

Download the Cuda package and reinstall it. Then do a dkms status command and verify that it shows the 4th line, as I show above.
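
Roughly, with the local repo .deb you mentioned in the comments (adjust the filename if yours differs), the reinstall would look something like this:

sudo dpkg -i cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64.deb

sudo apt update

sudo apt install cuda # should pull in nvidia-410 and trigger the dkms build against the -145 kernel

sudo reboot

dkms status # verify the 4th line appears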

Note: if you'd like a newer version of the Nvidia driver, the latest version is 418.56. I can't say if it's compatible with Cuda 10.0.

Note: if there's a bug in the Cuda/Nvidia software package(s), you may have to do this every time the kernel is updated :-(
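
If it comes to that, a full Cuda reinstall shouldn't be necessary each time. Asking DKMS to rebuild its registered modules for the new kernel should be enough; a sketch, assuming the nvidia-410 source tree under /usr/src is intact (substitute the actual new kernel version):

sudo dkms autoinstall -k 4.4.0-145-generic # build/install every registered module for that kernel

sudo dkms install nvidia-410/410.48 -k 4.4.0-145-generic # or target just the nvidia module

dkms status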

Update #1:

Recent updates have caused the Nvidia video driver 410 not to build on the current kernel.

Cuda 10.0 installs Nvidia video driver 410.

Cuda 10.1 gets installed with Nvidia video driver 418.

Cuda 10.1 has problems with tensorflow.

User needs Cuda 10.0 with a working Nvidia video driver.

User is requesting further help from Nvidia.

Update #2:

Try this...

Remove Cuda 10.1 and video driver 418.

Reinstall Cuda 10.0 and video driver 410.

This will put you back to the beginning status.
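
A rough sketch of those two steps, assuming the 418 driver went on via the nvidia.com .run installer (which provides an nvidia-uninstall command) and Cuda came from the local repo .deb:

sudo nvidia-uninstall # removes a driver installed from a nvidia.com runfile

sudo apt-get remove --purge cuda && sudo apt-get autoremove # clear out the apt-installed Cuda packages

Then reinstall the Cuda 10.0 local .deb as shown earlier, which should bring driver 410 back with it.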

Let's try and build the 410 driver manually...

sudo dkms build nvidia-410/410.48 # the build may fail, but should give us a reason

sudo dkms install nvidia-410/410.48 # run if the build is clean

dkms status # verify 410 installation on current kernel

Update #3:

It turned out that the Nvidia 418 driver does not actually install Cuda 10.1 (nvidia-smi merely reports the highest Cuda version the driver supports), and that the Cuda 10.0 runfile can be installed without its bundled 410 driver.

The final solution was for the user to manually install the latest Nvidia video driver (418.56), then manually install the required Cuda 10.0 via the runfile (declining its bundled driver), and it's all working again.
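
For later readers, the working sequence (pieced together from the comments; the runfile names are examples and should be taken from the Nvidia download pages) was roughly:

sudo apt-get remove --purge cuda && sudo apt-get autoremove # remove the .deb-installed Cuda/driver stack

sudo sh NVIDIA-Linux-x86_64-418.56.run # driver runfile from nvidia.com, run from a text console

sudo sh cuda_10.0.130_410.48_linux.run # Cuda 10.0 runfile; answer "no" when it offers the bundled 410 driver

nvidia-smi # the "CUDA Version: 10.1" it reports is only the maximum supported version, not an installed toolkit

nvcc -V # checks the toolkit, once /usr/local/cuda-10.0/bin is on PATH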

heynnema