
I am getting started with CUDA programming, but I am facing a problem that is very hard to fix. After some time, the system gives the error:

NVRM: GPU at 0000:03:00.0 has fallen off the bus

The computer then needs to be powered off before the nVidia card is detected again.
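To check whether the card has dropped off after a batch of runs, the kernel log can be searched for that message. The following is only a minimal sketch: the pattern is modelled on the single log line quoted above, other driver versions may word the message differently, and the function name `fallen_off_devices` is made up for illustration.

```python
import re

# Pattern modelled on the one log line quoted in this post (assumption:
# other driver versions may phrase the message differently).
FALLEN_OFF = re.compile(r"NVRM: GPU at ([0-9a-fA-F:.]+) has fallen off the bus")


def fallen_off_devices(log_text):
    """Return the PCI addresses named in 'fallen off the bus' messages."""
    return FALLEN_OFF.findall(log_text)
```

The log text could come, for example, from the output of `dmesg` captured with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; reading the kernel log may require root on some systems.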

At first I thought it was a fault in my code: when I ran the same executable 1000 times, the first 200 iterations were OK and gave the same output, but then the system gave the aforementioned error and all the remaining iterations failed. I then took the matrixMul example from CUDA, compiled it, and ran it 1000 times. The same error happened around iteration 200! That pointed me to a driver problem.

Therefore, and unfortunately without any success, I tested the same procedure with:

  • Several drivers: some old ones (which Google results stated could fix the problem), the latest long-lived release, the latest experimental, beta, etc.
  • CUDA 5 and CUDA 4.2 with the aforementioned drivers.
  • I booted on text only without
  • I removed xorgserver completely.
  • I enabled persistence mode.
  • Several solutions proposed in forums and found through Google searches.

None of these worked.

Please remember how simple the test is: I compile the matrixMul example (with just make) and run the executable 1000 times. I also tested this on my MacBook Pro and everything went fine (although, of course, with a different OS, card, etc.). I am clueless right now.
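For anyone who wants to repeat the test, the loop can be sketched as below. This is an illustrative harness, not the exact script I used: the function name `first_failing_run` is made up, and it assumes the executable reports a failure through a non-zero exit status.

```python
import subprocess


def first_failing_run(cmd, runs=1000):
    """Run cmd up to `runs` times.

    Returns the 1-based index of the first run that exits with a
    non-zero status, or None if every run succeeds.
    """
    for i in range(1, runs + 1):
        # Output is discarded; only the exit status is inspected.
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            return i
    return None
```

Assuming the compiled sample sits in the current directory, `first_failing_run(["./matrixMul"])` would report the iteration at which the runs start failing.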

What I haven't tested yet:

  • Another kernel version.
  • Another Linux distribution (desperate solution).

This is my system info:

  • Ubuntu 12.04.2
  • CUDA 5
  • Current driver version : 313.30 (downloaded directly from nvidia)
  • Ubuntu kernel : 3.2.
  • g++ version : 4.6
  • nVidia Card : Quadro 4000 (GF 100)

Please, if you have any suggestion, let me know. Thanks in advance.

iluvatar
  • I highly recommend reading this: http://askubuntu.com/questions/235760/unity-does-not-appear-after-installing-proprietary-nvidia-drivers-gpu-has-falle. I would also like to know whether, after getting the error, the whole system freezes and you need to turn the computer off and on. – Luis Alvarado Apr 15 '13 at 15:32
  • Hi Luis, thanks for your comment. No, the whole system is not frozen. The card is just not detected any more. I do not have unity/xserver/lightdm running. I am only interested in CUDA programming on the machine. I am actually running the test through ssh, with no graphical interface. Do you have a system similar to mine? If you run the matrixMul example 1000 times, do you have any problem? I already tried what I could from the post you wrote (I cannot exchange cards, though); the only remaining option is to clean the machine as you did ... – iluvatar Apr 15 '13 at 16:21
  • No, the only thing in common is the error. With the information you just mentioned it is an entirely different thing. – Luis Alvarado Apr 15 '13 at 16:25
  • I just want to add that my desperate solution worked: I removed Ubuntu, installed Slackware, and now the card is working properly. I cannot attribute the full failure to Ubuntu; maybe I tried too many things. – iluvatar Aug 30 '13 at 20:46
