My TensorFlow 2.3.1 setup with CUDA 10.1 was working fine until I mistakenly updated the NVIDIA drivers and CUDA.

These are the steps I am using to install CUDA 10.1:

  1. Purge all cuda and nvidia drivers

sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"

sudo apt-get --purge remove "nvidia*"

sudo apt-get autoremove
sudo apt-get autoclean
sudo rm -rf /usr/local/cuda*

Reboot

  2. After this I follow the instructions from the TensorFlow GPU install page

https://www.tensorflow.org/install/gpu

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub

sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb

sudo apt-get update

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt-get update

  3. sudo apt-get install --no-install-recommends nvidia-driver-450

  4. sudo apt-get install --no-install-recommends cuda-10-1

This creates two folders in /usr/local: cuda-10.1 and cuda-10.2.

At this step it also removes the 450 driver and installs 455. These are some of the messages I get:

The following packages will be REMOVED: libnvidia-cfg1-450 libnvidia-compute-450 libnvidia-decode-450 libnvidia-encode-450 libnvidia-extra-450 libnvidia-fbc1-450 libnvidia-gl-450 libnvidia-ifr1-450 nvidia-compute-utils-450 nvidia-dkms-450 nvidia-driver-450 nvidia-kernel-common-450 nvidia-kernel-source-450 nvidia-utils-450 xserver-xorg-video-nvidia-450

If I go ahead and install libcudnn7 and TensorFlow:

sudo apt-get install --no-install-recommends \
    libcudnn7=7.6.5.32-1+cuda10.1 \
    libcudnn7-dev=7.6.5.32-1+cuda10.1

I get this in Python:

tf.config.list_physical_devices("GPU")

2020-10-07 13:10:02.262260: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 450.80.2 does not match DSO version 455.23.5 -- cannot find working devices in this configuration
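The error above names two versions: the kernel module that is currently loaded (450.80.2) and the user-space driver library, or DSO, that TensorFlow found (455.23.5). Here is a minimal sketch, not part of TensorFlow, that pulls both numbers out of such a log line so they can be compared. The function name `parse_driver_mismatch` and the regular expression are assumptions based only on the single message shown here.

```python
import re

# Pattern assumed from the one log line quoted above; other TensorFlow
# versions may word this message differently.
_MISMATCH = re.compile(
    r"kernel version (?P<kernel>[\d.]+) does not match DSO version (?P<dso>[\d.]+)"
)


def parse_driver_mismatch(log_line):
    """Return (kernel_version, dso_version), or None if the line does not match."""
    match = _MISMATCH.search(log_line)
    if match is None:
        return None
    return match.group("kernel"), match.group("dso")


line = (
    "kernel version 450.80.2 does not match DSO version 455.23.5 "
    "-- cannot find working devices in this configuration"
)
print(parse_driver_mismatch(line))
```

When the two numbers differ, the loaded kernel module and the installed libcuda come from different driver releases, which is the state the package list above points to.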

To fix this I tried

  1. uninstalling 455

sudo apt purge nvidia-455*

and reinstalling TensorFlow. Now I get this error in Python:

tf.config.list_physical_devices("GPU")

2020-10-07 13:20:46.923513: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-07 13:20:46.959289: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-07 13:20:46.959608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.62GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-07 13:20:46.959626: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-07 13:20:46.959769: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
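The log shows TensorFlow opening each CUDA library by name and failing on libcublas.so.10. The same check can be done without TensorFlow by asking the dynamic linker directly through `ctypes`. This is a rough sketch: the helper names `can_load` and `report` are made up for illustration, and the library list is taken from the messages in this post (plus libcudnn.so.7 for cuDNN 7), so it is an assumption for any other CUDA version.

```python
import ctypes

# Names taken from the log above and from the cuDNN 7 install; adjust them
# for other CUDA versions.
CUDA_LIBS = ["libcuda.so.1", "libcudart.so.10.1", "libcublas.so.10", "libcudnn.so.7"]


def can_load(lib_name):
    """Return True if the dynamic linker can open lib_name."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


def report(lib_names=CUDA_LIBS):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in lib_names}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Any library reported as MISSING is one the linker cannot find on its search path, which is what the `dlerror` text in the log is saying about libcublas.so.10.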

How can I fix this? Thanks.

T.singh
    Use the .run file for cuda install instead of the .deb file. The .run file allows you to unselect the video driver it is trying to install. Then it should all work fine. Download the .run file from https://developer.nvidia.com/cuda-10.1-download-archive-update2?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal – Terrance Oct 07 '20 at 18:20

1 Answer

Terrance's comment helped fix the unwanted driver upgrade, but I also had to install additional packages and set up some config files.

This guide helped with the additional steps: https://www.pugetsystems.com/labs/hpc/How-to-install-CUDA-9-2-on-Ubuntu-18-04-1184/

These are the steps I used for CUDA 10.1 with the NVIDIA 450 driver on Ubuntu 18.04.

Steps:

Before installing CUDA from the runfile, install the driver.

##Driver: 450 is used per the TensorFlow requirement; 455 does not work with the current TensorFlow version

  1. sudo apt-get install --no-install-recommends nvidia-driver-450

##Get the runfile for CUDA 10.1

  2. wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run

##Install dependencies

  3. sudo apt-get install freeglut3 freeglut3-dev libxi-dev libxmu-dev

##Run the installer and follow its prompts

  4. sudo sh cuda_10.1.243_418.87.00_linux.run

#The installer warns about the pre-existing driver; continue.
#Select everything except the driver in the menu. CUDA will then be installed; check with ls /usr/local, which should show a cuda-10.1 folder.

  5. Create a shell script for the CUDA profile

#you can use any text editor; root is needed to write under /etc

sudo vim /etc/profile.d/cuda.sh

##add the following lines to this file to put CUDA on the path

export PATH=$PATH:/usr/local/cuda-10.1/bin
export CUDADIR=/usr/local/cuda-10.1

##Create another file so the dynamic linker can find the CUDA libraries (used here instead of setting LD_LIBRARY_PATH)

sudo vim /etc/ld.so.conf.d/cuda.conf

#add this line

/usr/local/cuda-10.1/lib64

#run

sudo ldconfig
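After `sudo ldconfig`, the linker cache should list the libraries under /usr/local/cuda-10.1/lib64. Below is a small sketch for checking that from Python by parsing the text printed by `ldconfig -p`. The helper names `find_in_cache` and `read_cache` are hypothetical, and the `sample` text is an illustrative excerpt I wrote in the usual `ldconfig -p` layout, not output captured from this machine.

```python
import subprocess


def find_in_cache(cache_text, lib_name):
    """Return the paths that `ldconfig -p` output lists for lib_name."""
    paths = []
    for line in cache_text.splitlines():
        # Entries look like: "<name> (<flags>) => <path>"
        name, sep, path = line.strip().partition(" => ")
        if sep and name.split(" ", 1)[0] == lib_name:
            paths.append(path)
    return paths


def read_cache():
    """Run `ldconfig -p` and return its stdout as text."""
    return subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    ).stdout


# Illustrative sample only (assumed layout, not real output from this setup).
sample = (
    "2 libs found in cache `/etc/ld.so.cache'\n"
    "\tlibcublas.so.10 (libc6,x86-64) => /usr/local/cuda-10.1/lib64/libcublas.so.10\n"
    "\tlibc.so.6 (libc6,x86-64) => /lib/x86_64-linux-gnu/libc.so.6\n"
)
print(find_in_cache(sample, "libcublas.so.10"))
```

On the real machine, `find_in_cache(read_cache(), "libcublas.so.10")` returning an empty list would suggest the cuda.conf entry was not picked up or the library is not in that folder.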

  6. For cuDNN, use these steps for the tar file installation

https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html

These are the 4 commands:

tar -xzvf cudnn-10.1-linux-x64-v7.6.5.32.tgz

sudo cp cuda/include/cudnn*.h /usr/local/cuda/include

sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
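To confirm which cuDNN version was copied into place, the version macros in cudnn.h can be read back. This is a sketch under the assumption that the cuDNN 7 header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` as plain `#define` lines. The function name `cudnn_version` is hypothetical and the `sample` string is a hand-written excerpt, not the real header.

```python
import re


def cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn.h text, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{key}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Hand-written excerpt for illustration (assumed macro layout).
sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 6\n#define CUDNN_PATCHLEVEL 5\n"
print(cudnn_version(sample))
```

On the installed system the input would come from the copied header, for example `cudnn_version(open("/usr/local/cuda/include/cudnn.h").read())`.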

  7. If you get this error while using TensorFlow

failed call to cuInit: CUDA_ERROR_UNKNOWN

#use this

sudo apt install nvidia-modprobe

  8. If you want to install TensorRT, these links are helpful

https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html#installing-tar

Why do I get "/sbin/ldconfig.real: /usr/local/cuda/lib64/libcudnn.so.7 is not a symbolic link"?

T.singh