I have been trying to install CUDA for the past few days so that I can fit my TensorFlow CNNs on the GPU. This is what is currently installed on my machine (Ubuntu 20.04 LTS, RTX 3060):

tensorflow-gpu 2.4

python 3.8.10

cuDNN 8.0

CUDA 11.0

nvidia-driver-495

The driver was installed alongside CUDA 11.0.

When I fit a model, I can see that my GPU allocates all of its memory, but the training output stays at "Epoch 1/50" and never goes any further.
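A quick way to separate "TensorFlow does not see the card" from "TensorFlow sees it but training stalls" is to list the GPUs TensorFlow reports. This is only a minimal sketch: the helper name `visible_gpus` is made up for illustration, and it assumes a TensorFlow 2.x install where `tf.config.list_physical_devices` is available.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow reports, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow loaded but found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    print(visible_gpus())
```

An empty list here would point at the driver or CUDA libraries rather than at the model itself.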

I tried to downgrade my driver to nvidia-driver-470, as the 495 is not officially out. This action caused everything to stop working: my GPU no longer allocates memory when fitting, nvidia-smi does not work anymore, and importing tensorflow now returns:

Could not load dynamic library 'libcudart.so.11.0'; dlerror: ,

which was not the case previously.

Does anyone know where this issue may come from?

Thanks

edit 1:

After reboot, importing Tensorflow returns:

tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2021-11-02 06:24:40.852786: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

The directories /usr/lib/cuda/include and /usr/lib/cuda/lib64 do exist.
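The directories existing does not by itself mean the library file is in them. A small sketch like the following checks each LD_LIBRARY_PATH entry for a given file name. The helper name `find_in_library_path` is hypothetical, and the check is only partial, since the dynamic loader also consults the ldconfig cache and the default system directories.

```python
import os
from pathlib import Path


def find_in_library_path(lib_name, library_path=None):
    """Return the full path of lib_name in every search-path entry that contains it."""
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in library_path.split(":"):
        if not entry:  # skip empty entries, e.g. from a trailing ':'
            continue
        candidate = Path(entry) / lib_name
        if candidate.exists():
            hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    # Library name taken from the error message above.
    print(find_in_library_path("libcudart.so.11.0"))
```

An empty result would be consistent with the "cannot open shared object file" message, although it does not say where the file actually lives.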

edit 2:

After reinstalling CUDA by following this answer: https://askubuntu.com/a/1288405/231142

importing TensorFlow works and does not report any issues. However, creating my callbacks:

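from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, TensorBoard

# Stop early when the monitored metric stalls, and halve the learning rate on a plateau
EarlyStop = EarlyStopping(patience=10, restore_best_weights=True)
Reduce_LR = ReduceLROnPlateau(monitor='val_accuracy', verbose=2, factor=0.5, min_lr=0.00001)
# Keep only the best model on disk and write TensorBoard logs to ./logs
model_check = ModelCheckpoint('model.hdf5', monitor='val_loss', verbose=1, save_best_only=True)
tensorbord = TensorBoard(log_dir='logs')
callback = [EarlyStop, Reduce_LR, model_check, tensorbord]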

returns:

2021-11-02 20:09:55.607299: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:09:55.607335: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:09:55.608325: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-11-02 20:09:55.609026: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.2'; dlerror: libcupti.so.11.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609320: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.5/lib64:/usr/lib/cuda/include:/usr/lib/cuda/lib64:/usr/local/cuda-11.5/lib64
2021-11-02 20:09:55.609372: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-11-02 20:09:55.609476: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-11-02 20:09:55.609527: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.

Model fitting starts and uses all of my GPU and CPU, but it is still very slow and outputs:

2021-11-02 20:09:55.832301: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.269844: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:56.669900: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 428802048 exceeds 10% of free system memory.
2021-11-02 20:09:56.821919: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 571736064 exceeds 10% of free system memory.
2021-11-02 20:09:57.065544: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/20
2021-11-02 20:09:59.868007: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
  1/137 [..............................] - ETA: 1:15:21 - loss: 0.7485 - accuracy: 0.3871
2021-11-02 20:10:30.404084: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-02 20:10:30.404114: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-02 20:10:30.404277: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.

There may be an issue with the libcupti.so.11.2 library, but I have not found the cause yet.
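In the log above, TensorFlow asks for libcupti.so.11.2 while LD_LIBRARY_PATH points at a cuda-11.5 tree, so it may help to see which libcupti files that tree actually ships and where. Below is a rough sketch with a hypothetical helper `find_library_dirs`. It assumes the toolkit lives under /usr/local/cuda-11.5 (the path shown in the log). In many toolkit installs CUPTI sits under extras/CUPTI/lib64 rather than lib64, but that is an assumption to verify on this machine.

```python
from pathlib import Path


def find_library_dirs(root, pattern="libcupti.so*"):
    """Return the sorted directories under root that contain a file matching pattern."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted({str(match.parent) for match in root_path.rglob(pattern)})


if __name__ == "__main__":
    # Path taken from the LD_LIBRARY_PATH shown in the log; adjust if yours differs.
    for directory in find_library_dirs("/usr/local/cuda-11.5"):
        print(directory)
```

If a directory shows up that is not on LD_LIBRARY_PATH, that would explain why the bare libcupti.so lookup fails. The versioned libcupti.so.11.2 lookup would still fail if only an 11.5 build is installed.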

Louis
  • I hate to ask this, but when you "deprecated" your NVIDIA driver, did you reboot your system so that the older driver takes effect? – Terrance Nov 02 '21 at 04:36
  • I did, for good measure. Importing tensorflow now returns:

    2021-11-02 06:01:48.281681: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64: 2021-11-02 06:01:48.281751: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

    – Louis Nov 02 '21 at 05:04
  • I am not sure how you set up your system for CUDA, but you might want to look at my answer here and see whether you missed a step in the CUDA installation, such as the additional lines that need to be added to the ~/.profile file. I wish I had a better card on my home system, as I cannot run some of the tensorflow tests because my card is older, but other CUDA tests pass. Sometimes running sudo ldconfig can fix library file issues as well. – Terrance Nov 02 '21 at 14:28
  • I followed the instructions in your link and updated the post with the new state. – Louis Nov 02 '21 at 19:17
