I am trying to install CUDA 10.0 on Ubuntu 16.04 running on a DGX-1 server. I followed the "runfile installation" instructions at https://docs.nvidia.com/cuda/archive/10.0/cuda-installation-guide-linux/index.html#runfile.
I chose to install the CUDA driver, the CUDA Toolkit, and the CUDA Samples.
I removed the previous versions of the NVIDIA driver and CUDA with the following commands (as suggested in "How can I install CUDA on Ubuntu 16.04?"):
sudo apt-get purge nvidia-cuda*
sudo apt-get purge nvidia-*
After step 4.2.6 ("Reboot the system to reload the graphical interface"), I checked the CUDA version as follows:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
However, when I run "nvidia-smi", I get the following error:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I went to step 4.4 ("Device Node Verification") and found that the device files "/dev/nvidia*" don't exist. I tried to create them manually, but running "modprobe" returns an error:
sudo /sbin/modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Exec format error
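In case it helps with diagnosis: as far as I understand, one common cause of "Exec format error" is a module that was built for a different kernel than the one currently running. I have not confirmed that this is the cause here. A minimal sketch for checking it, where kernel_of_vermagic is a hypothetical helper made up for illustration:

```shell
#!/bin/sh
# Sketch: compare the kernel the nvidia module was built for with the running kernel.
# kernel_of_vermagic is a hypothetical helper, not part of any NVIDIA or kmod tool.

# The first whitespace-separated field of a vermagic string is the kernel release.
kernel_of_vermagic() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

running=$(uname -r)
# modinfo prints nothing (and fails) if no nvidia module is installed for this kernel.
vermagic=$(/sbin/modinfo -F vermagic nvidia 2>/dev/null || true)

if [ -z "$vermagic" ]; then
    echo "no nvidia module found for kernel $running"
elif [ "$(kernel_of_vermagic "$vermagic")" = "$running" ]; then
    echo "module vermagic matches running kernel: $running"
else
    echo "mismatch: module built for $(kernel_of_vermagic "$vermagic"), running $running"
fi
```

Running "dmesg | tail" right after the failed modprobe should also show the kernel's own reason for rejecting the module.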
Please help me solve the problem. Thanks!
========================================================================== Other details.
lspci | grep -i nvidia
06:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
86:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
uname -m && cat /etc/*release
x86_64
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2018-03-20"
DGX_SWBUILD_VERSION="3.1.6"
DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd"
DGX_SERIAL_NUMBER=QTFCOU8280021
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
uname -r
4.4.0-142-generic
cat /proc/version
Linux version 4.4.0-142-generic (buildd@lgw01-amd64-033) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019
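The gcc banner above reads "gcc (GCC) 5.4.0", while /proc/version names the Ubuntu build of 5.4.0, so the gcc on my PATH may not be the distribution compiler that built the kernel. I don't know whether this matters for the driver build. A small sketch to put the two side by side, where kernel_gcc_version is a hypothetical helper made up for illustration:

```shell
#!/bin/sh
# Sketch: show the gcc that built the kernel next to the gcc found on PATH.
# kernel_gcc_version is a hypothetical helper: it extracts "X.Y.Z" from the
# "gcc version X.Y.Z" phrase of a /proc/version-style line.
kernel_gcc_version() {
    printf '%s\n' "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

kernel_gcc=$(kernel_gcc_version "$(cat /proc/version 2>/dev/null)")
gcc_path=$(command -v gcc 2>/dev/null || true)
current_gcc=$(gcc -dumpversion 2>/dev/null || true)

echo "kernel built with gcc: ${kernel_gcc:-unknown}"
echo "gcc on PATH:           ${current_gcc:-not found} (${gcc_path:-no gcc})"
```

If the path printed is not /usr/bin/gcc, the compiler on PATH is probably a locally built one rather than the Ubuntu package.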
dpkg -l | grep nvidia
ii dgx-peer-mem-loader 1.1-10 amd64 Ensure nvidia is loaded before nv_peer_mem