
I am trying to install CUDA 10.0 on a DGX-1 server running Ubuntu 16.04. I followed the "runfile installation" instructions at https://docs.nvidia.com/cuda/archive/10.0/cuda-installation-guide-linux/index.html#runfile.

I chose to install the CUDA driver, the CUDA Toolkit, and the CUDA Samples.

I removed the previous versions of the NVIDIA driver and CUDA with the following commands (as suggested in "How can I install CUDA on Ubuntu 16.04?"):

sudo apt-get purge nvidia-cuda*
sudo apt-get purge nvidia-*
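
To confirm what the purge left behind, a check like the following may help. It is only a sketch: `leftover_packages` is a hypothetical helper name, and it assumes the standard `dpkg -l` column layout (status, name, version, architecture, description). It filters on the package name column, because a plain `grep nvidia` also matches package descriptions. Note that a driver installed from the runfile is not registered with dpkg, so it will not appear here either way.

```shell
#!/bin/sh
# Sketch: list installed packages whose *name* mentions nvidia or cuda.
# Assumption: input is `dpkg -l` output (status, name, version, arch, description).

# leftover_packages is a hypothetical helper; it reads dpkg -l lines on stdin
# and prints the names of packages in state "ii" that match nvidia or cuda.
leftover_packages() {
    awk '$1 == "ii" && ($2 ~ /nvidia/ || $2 ~ /cuda/) { print $2 }'
}

# Only query dpkg where it is available.
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | leftover_packages
fi
```

An empty result would suggest that no NVIDIA or CUDA packages remain installed through apt.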

After step 4.2.6 ("Reboot the system to reload the graphical interface"), I checked the CUDA version as follows:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

However, when I run "nvidia-smi", I get the following error:

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I went on to step 4.4 ("Device Node Verification") and found that the device files "/dev/nvidia*" do not exist. I tried to create them manually, but running "modprobe" returns an error:

sudo /sbin/modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Exec format error
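
An "Exec format error" from modprobe often means the module was built for a different kernel than the one currently running, so one thing worth checking is the module's `vermagic` string against `uname -r`. The following is a minimal diagnostic sketch, not a fix. It assumes the module is named `nvidia` and that `modinfo` can locate it; `vermagic_release` and `module_matches_kernel` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: compare the kernel release a module was built for with the running kernel.
# Assumption: the module is named "nvidia" and is visible to modinfo.

# Print the kernel release, i.e. the first field of a vermagic string such as
# "4.4.0-142-generic SMP mod_unload modversions".
vermagic_release() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

# Succeed (exit 0) when the vermagic string in $1 names the kernel release in $2.
module_matches_kernel() {
    [ "$(vermagic_release "$1")" = "$2" ]
}

running="$(uname -r)"
# modinfo -F prints a single field; the result is empty if the module is missing.
magic="$(modinfo -F vermagic nvidia 2>/dev/null || true)"

if [ -z "$magic" ]; then
    echo "nvidia module not found for kernel $running"
elif module_matches_kernel "$magic" "$running"; then
    echo "nvidia module matches running kernel $running"
else
    echo "mismatch: module built for $(vermagic_release "$magic"), running $running"
fi
```

If the releases match, the kernel log (for example `dmesg | tail` right after the failed `modprobe`) usually records a more specific reason for the rejection.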

How can I solve this problem? Thanks!

Other details:

lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

uname -m && cat /etc/*release
x86_64
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8280021
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"

gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

uname -r
4.4.0-142-generic

cat /proc/version
Linux version 4.4.0-142-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019

dpkg -l | grep nvidia
ii  dgx-peer-mem-loader                             1.1-10                                        amd64        Ensure nvidia is loaded before nv_peer_mem
  • The custom Ubuntu version that ships with Nvidia comes with CUDA preinstalled; I never had to install it or any of the drivers. – Yaron Sep 22 '19 at 07:31
  • @Yaron, it was previously running CUDA 9.0. I am not sure whether it was preinstalled or somebody installed it. Now I need to install CUDA 10.0. – Khassan Sep 22 '19 at 07:41
  • Make sure you have all the correct Nvidia repositories and try using them; you might end up upgrading your Ubuntu installation. The point of DGX is to eliminate any drift you might have and stick with the preinstalled system. That is bad for research, but it is pretty cumbersome to solve otherwise. – Yaron Sep 22 '19 at 07:46
