MATLAB: CUDA_ERROR_LAUNCH_FAILED when training large networks

deep learning, gpu

I have trained networks with trainNetwork() on my GPU in MATLAB R2018b for over a year without any issues.
Since upgrading to MATLAB R2020b, I have only been able to train small networks. The same script that ran flawlessly in R2018b with an arbitrarily large number of units (e.g., n = 2000) now works in R2020b only up to about n = 50, and crashes for n > 100.
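For reference, the training call is structured roughly like the sketch below (the layer stack, sizes, and synthetic data are illustrative assumptions, not the exact contents of my RNNprediction script; n is the number of hidden units that decides whether training crashes):

% Illustrative sketch only -- layer sizes, names, and synthetic data are
% assumptions standing in for the real script and for traind.x / traind.y.
numFeatures  = 3;        % example input dimension
numResponses = 1;        % example output dimension
n = 200;                 % hidden units; crashes appear for n > 100 in R2020b

% Synthetic sequence data in place of the real training set
x = rand(numFeatures, 500);
y = rand(numResponses, 500);

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(n, 'OutputMode', 'sequence')
    fullyConnectedLayer(numResponses)
    regressionLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 100, ...
    'ExecutionEnvironment', 'gpu', ...   % runs on the GeForce RTX 2070
    'Verbose', true);

net = trainNetwork({x}, {y}, layers, options);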
The reported error is typically:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error using trainNetwork (line 183)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in RNNprediction (line 170)
net = trainNetwork({traind.x}, {traind.y}, layers, options);
The crash happens between the 2nd and 5th training iteration. When it does, I have to restart MATLAB before I can do any training at all, since reset(gpuDevice) also fails and returns:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The CUDA error was:
all CUDA-capable devices are busy or unavailable
Training of the same network runs smoothly on CPU (although very slowly).
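If it helps, the CPU run differs only in the ExecutionEnvironment option, roughly like this (other option values shown are just examples):

% Same training, forced onto the CPU (slow but stable in my case)
options = trainingOptions('adam', ...
    'MaxEpochs', 100, ...
    'ExecutionEnvironment', 'cpu', ...   % 'gpu' is what triggers the crash
    'Verbose', true);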
NOTE: I have already increased the WDDM TDR Delay to 60 seconds, but nothing has changed. I have also tried disabling TDR altogether, with no success.
Here are some CUDA properties:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10.2000
ToolkitVersion: 10.2000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
MultiprocessorCount: 36
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1

Best Answer

This issue seems to be specific to the training of recurrent neural networks. Following https://www.mathworks.com/matlabcentral/answers/485733-cuda-crashes-when-training-lstm-on-geforce-rtx-2080-super, I have fixed my issue by installing R2020a, with CUDA toolkit 10.1 and NVIDIA Studio Driver Version 431.86 WHQL (https://www.nvidia.com/Download/driverR ... 1050/en-us).
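After reinstalling, a quick way to confirm that MATLAB picked up the expected toolkit and driver is to query gpuDevice again and run a small operation on the device before launching a long training job (a minimal sketch, using the same property names as in the output above):

% Sanity check after switching to R2020a + Studio Driver 431.86
g = gpuDevice;
fprintf('Driver %.4f, Toolkit %.4f, %s\n', ...
    g.DriverVersion, g.ToolkitVersion, g.Name);

% Run a trivial kernel and wait for it to finish; an error here
% indicates the driver/toolkit setup is still broken
A = gpuArray.rand(1000);
wait(g);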