This is a question on a topic that has, in different variations, been asked already. However, since none of the answers I found was applicable to my problem, I will first outline the problem and then, in case anyone else finds themselves in the same spot, outline the answers I tried. Perhaps they work for you. In any case, I would be grateful for any new information on this issue.

Ubuntu version: 16.04

Kernel: 4.15.0-133-generic

Since I wanted to use CUDA 11, I uninstalled my previous NVIDIA driver with

    sudo apt --purge remove "*nvidia*"

and tried to remove everything from the previous CUDA versions via

    sudo apt --purge remove "*cuda*" "*cublas*" "*cufft*" "*curand*" "*cusolver*" "*cusparse*" "*npp*" "*nvjpeg*" "cuda*" "nsight*"

and

    sudo apt-get autoremove

I then installed the graphics driver and CUDA from the command line as described on the NVIDIA page, as well as here. For the installation to succeed, this step had to be performed in a text console (Ctrl+Alt+F1), and the X server had to be stopped via sudo service lightdm stop (at least I think that's what it does). After installing both the driver and the CUDA toolkit and rebooting the system, I successfully ran the deviceQuery program as well as a simulation I wrote for CUDA. However, in the graphical interface I was stuck in a login loop (references to similar posts below).

Since none of the remedies listed below worked, I tried installing CUDA and the NVIDIA driver from the graphics-drivers PPA via sudo add-apt-repository ppa:graphics-drivers/ppa. After installing the appropriate driver via sudo apt-get install nvidia-460 and rebooting, I could access the graphical interface again. nvidia-smi shows a running NVIDIA driver:

    $ nvidia-smi
    Tue Feb 23 14:50:14 2021       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro P3000        Off  | 00000000:01:00.0  On |                  N/A |
    | N/A   50C    P0    23W /  N/A |    405MiB /  6078MiB |      2%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1322      G   /usr/lib/xorg/Xorg                260MiB |
    |    0   N/A  N/A      2502      G   compiz                             49MiB |
    |    0   N/A  N/A     32082      G   ...gAAAAAAAAA --shared-files       91MiB |
    +-----------------------------------------------------------------------------+
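The header line of that output carries the two values that matter for CUDA. A minimal sketch of pulling them out in a script, assuming the header layout shown above; the helper name `smi_field` is made up for illustration, and the sample line is copied from the output rather than read from a live `nvidia-smi` call:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract one "Label: value" field from the nvidia-smi header.
# $1 = field label (e.g. "Driver Version"); the header text is read from stdin.
smi_field() {
    sed -n "s/.*$1: *\([^ |]*\).*/\1/p" | head -n 1
}

# Sample header copied from the output above.
# On a live system, pipe the output of `nvidia-smi` into smi_field instead.
header='| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: N/A      |'

echo "driver: $(printf '%s\n' "$header" | smi_field 'Driver Version')"
echo "cuda:   $(printf '%s\n' "$header" | smi_field 'CUDA Version')"
```

As far as I understand, an `N/A` in the CUDA Version field usually means that nvidia-smi could not load the driver's CUDA library (`libcuda.so`), so it is probably worth checking that library separately.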

On the other hand, no method of installing CUDA (neither the runfile without a new installation of the driver, nor sudo apt install nvidia-cuda-toolkit, nor sudo apt install cuda-toolkit-11-2) leads to a working CUDA installation. Programs compile with nvcc without problems; however, ./deviceQuery returns

    $ ./deviceQuery
    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    cudaGetDeviceCount returned 35
    -> CUDA driver version is insufficient for CUDA runtime version
    Result = FAIL

and other programs terminate once the CUDA parts are reached. Note that the stated reason for the failure (driver version is insufficient) does not seem correct, since the installed driver is 460.32.03, which is sufficient according to the NVIDIA manual. On the other hand, nvidia-smi also doesn't seem to notice that CUDA is installed. Currently, with the driver installed from the PPA and CUDA installed from the runfile, I have

    $ lspci -k | grep -EA3 'VGA|3D|Display'
    00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
        Subsystem: Lenovo Device 224c
        Kernel driver in use: i915
        Kernel modules: i915
    --
    01:00.0 3D controller: NVIDIA Corporation GP104GLM [Quadro P3000 Mobile] (rev a1)
        Subsystem: Lenovo Device 224c
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_460_drm, nvidia_460
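As for the "driver version is insufficient" message, the version comparison itself can be sanity-checked in a script. The sketch below assumes GNU `sort -V` is available; the helper name `version_ge` is made up, and the minimum version passed in is an assumption for illustration that should be replaced with the value from the CUDA release notes for the toolkit in question:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 >= version $2 (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="460.32.03"   # driver version reported by nvidia-smi above
required="460.27.03"    # ASSUMPTION: look up the real minimum in the CUDA release notes

if version_ge "$installed" "$required"; then
    echo "driver $installed satisfies minimum $required"
else
    echo "driver $installed is older than minimum $required"
fi
```

If the comparison passes and error 35 still shows up, another thing that may be worth checking is whether the dynamic loader can find the driver's `libcuda.so` at all, for example with `ldconfig -p | grep libcuda`.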

I would be very grateful for any ideas on how to either make the driver installed via the runfile work together with the X server or make the driver from the PPA work together with CUDA.

Thank you and best,

David

Now for some tried and failed solutions:

With the driver installed from the runfile:

  1. try installing gdm instead of lightdm, as specified here by WindowsEscapist
  2. make sure the .Xauthority file is owned by the user.

With the driver installed from ppa:graphics-drivers/ppa:

  1. try running sudo optirun ./deviceQuery, as specified in this link
  2. try setting the PRIME profile to NVIDIA in NVIDIA X Server Settings (already set)
  3. try using sudo prime-select nvidia, as suggested here
  • There are several good answers on this site for CUDA installation methods, but I think all of them assume you start with a working Nvidia setup, and suppress any subsequent attempt by CUDA to update or change it. Stick with gdm. – ubfan1 Feb 23 '21 at 17:16
  • Hi ubfan1, thank you for your answer! I tried installing gdm as an alternative to lightdm, but with either I landed in the login-loop (although both login-interfaces looked the same, so maybe I did something wrong?). Since the nvidia driver works together with cuda in the runfile setup, I think the Xserver cannot communicate/ find the driver, could that be? – Matsumoto San Feb 23 '21 at 18:09
  • You must be running a hwe update to get the 4.15 kernel on 16.04, so ensure you have all the supporting packages, like the equivalent X server. Even if you get a running setup with the Nvidia supplied driver, you are probably a kernel update away from a failed driver recompile and "update broke my system" problem. – ubfan1 Feb 23 '21 at 18:58
  • Hmm, I'm not sure I understand you correctly. I did the hwe update, then tried the runscript installation again, with the same result as before. Then I followed this description and reinstalled the xserver, as well as lightdm, with no success. Unfortunately, installing gdm3 did not work at all, and there was no login-screen. If you could expand a bit on your answer, you would help me a lot. – Matsumoto San Feb 23 '21 at 22:44
  • See https://askubuntu.com/questions/1077061/how-do-i-install-nvidia-and-cuda-drivers-into-ubuntu/1288405#1288405 for a detailed run file installation. I have answered with ways to set up the CUDA files from the downloaded deb source, allowing multiple parallel installations of different versions. Reject any CUDA offer to mess with the Nvidia drivers: in the run file, uncheck the Nvidia box; in the deb unpack, don't process the Nvidia sub-debs. If you start with a working system using Nvidia drivers, just getting the CUDA files should not break anything. – ubfan1 Feb 23 '21 at 23:11
  • Brilliant, thank you! I upgraded to 18.04, since apparently nvidia-driver-450 is only available from bionic on upwards, and installed that. This is sufficient to install CUDA from the runfile, and everything works. Well, apart from a weird error saying libmpi is not reachable, but I think this is something for another thread, if at all. Thank you very much! – Matsumoto San Feb 24 '21 at 19:46
  • btw, if you want to post an answer yourself, I'll retract mine. – Matsumoto San Feb 24 '21 at 19:52
  • You can accept your own answer after a few days, and close the question. It was good to upgrade from 16.04, it's only got a couple of months left before End of Support. – ubfan1 Feb 24 '21 at 21:34

1 Answer

Since the nvidia-460 driver offered in the version of Ubuntu I had at first did not work with the CUDA toolkit from the runfile on NVIDIA's website, I did an HWE update. In the updated version of 16.04, it was not possible for me to install nvidia-driver-460 or nvidia-driver-450, so I upgraded to bionic (18.04) and then installed nvidia-driver-450. As @ubfan1 pointed out, the rest of the answer is in this link, where the toolkit is installed via the runfile, but without the driver.
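For reference, the toolkit-only runfile step might look roughly like the sketch below. It only prints the command instead of running it, the helper name `toolkit_only_cmd` is made up, and the runfile name is a placeholder rather than the exact file I used. As far as I know, `--silent` together with `--toolkit` is the runfile's way of installing just the toolkit and leaving the already working driver alone:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the toolkit-only install command for a given CUDA runfile.
# This is a dry run: nothing is installed, the command line is only echoed.
toolkit_only_cmd() {
    printf 'sudo sh %s --silent --toolkit\n' "$1"
}

# ASSUMPTION: placeholder file name; substitute the runfile that was actually downloaded.
toolkit_only_cmd "cuda_X.Y.Z_linux.run"
```

After the installation, the toolkit's bin directory (typically `/usr/local/cuda/bin`) usually has to be added to `PATH` before nvcc is found, and rerunning ./deviceQuery is a quick way to confirm that the driver and the toolkit now work together.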