MATLAB: Why do I get the error “CUDNN_STATUS_EXECUTION_FAILED” when training a neural network on a GPU on a server?

Tags: deep learning, error, gpu, neural network, server, Statistics and Machine Learning Toolbox, training

When I train a neural network on a GPU on a server, training usually fails after some time with the following error message:

Error using trainNetwork (line 154)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Caused by:
Error using nnet.internal.cnngpu.lstmForwardTrain
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.

This generally happens when someone else launches another program on the same GPU.

Best Answer

In general, it is not a good idea to share the GPU for computations across different programs or users. This will very likely cause kernel execution timeouts, memory issues and other failures.
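
To confirm that another program really is using the same GPU, one option is to ask nvidia-smi which compute processes are currently running. The snippet below is only a rough sketch: it assumes nvidia-smi is on the system path of the server, and the variable names are illustrative.

```matlab
% Sketch: list the compute processes currently using the GPUs on this machine.
% Assumes nvidia-smi is installed and on the system path.
query = 'nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv';

[status, out] = system(query);   % status is the exit code, out the printed text
if status == 0
    disp(out);                   % one CSV row per process holding a GPU context
else
    fprintf('nvidia-smi could not be run (exit status %d).\n', status);
end
```

If processes other than MATLAB show up in the list while training is running, the GPU is being shared.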

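Please try changing the compute mode of the GPU to "Exclusive Process" (EXCLUSIVE_PROCESS in nvidia-smi), so that no other process can grab the GPU while MATLAB is performing computations. Changing the compute mode requires administrator rights on the server, and the setting does not persist across reboots. Please see the nvidia-smi documentation for more information: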

http://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf
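
As a minimal sketch of that step, the snippet below builds the nvidia-smi command for one GPU and then prints the compute mode MATLAB currently reports. The GPU index 0 is an assumption (nvidia-smi counts from 0, whereas gpuDevice counts from 1, and the two orderings may differ), the variable names are illustrative, and the gpuDevice query needs Parallel Computing Toolbox.

```matlab
% Sketch: build the command that switches a GPU to exclusive-process mode,
% then show the compute mode MATLAB sees for the currently selected GPU.
gpuIndex = 0;   % assumption: nvidia-smi index of the GPU used for training

% -i selects the GPU and -c sets its compute mode.
% The command must be run with administrator (root) rights on the server.
cmd = sprintf('nvidia-smi -i %d -c EXCLUSIVE_PROCESS', gpuIndex);
fprintf('Run as administrator on the server:\n  %s\n', cmd);

% Query the current mode from MATLAB (needs a supported GPU).
try
    g = gpuDevice;   % currently selected GPU
    fprintf('%s compute mode: %s\n', g.Name, g.ComputeMode);
catch err
    fprintf('Could not query the GPU: %s\n', err.message);
end
```

Once the administrator has run the command, the ComputeMode property should no longer read 'Default'. The default behaviour can be restored with the same command and DEFAULT in place of EXCLUSIVE_PROCESS.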
