MATLAB: CUDA_ERROR_LAUNCH_FAILED when training large networks

deep learning, gpu

I have trained networks with trainNetwork() on my GPU in MATLAB R2018b for over a year without any issues.
Since upgrading to MATLAB R2020b, I have only been able to train small networks. The same script that ran flawlessly in R2018b with an arbitrarily large number of units (e.g., n = 2000) now works in R2020b only up to about n = 50, and crashes for n > 100.
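For reference, the training call is structured roughly like the sketch below (the layer stack, sizes, and synthetic data are illustrative assumptions, not the exact contents of my RNNprediction script; n is the number of hidden units that decides whether training crashes):

% Illustrative sketch only -- layer sizes, names, and synthetic data are
% assumptions standing in for the real script and for traind.x / traind.y.
numFeatures  = 3;        % example input dimension
numResponses = 1;        % example output dimension
n = 200;                 % hidden units; crashes appear for n > 100 in R2020b

% Synthetic sequence data in place of the real training set
x = rand(numFeatures, 500);
y = rand(numResponses, 500);

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(n, 'OutputMode', 'sequence')
    fullyConnectedLayer(numResponses)
    regressionLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 100, ...
    'ExecutionEnvironment', 'gpu', ...   % runs on the GeForce RTX 2070
    'Verbose', true);

net = trainNetwork({x}, {y}, layers, options);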
The reported error is typically:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error using trainNetwork (line 183)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in RNNprediction (line 170)
net = trainNetwork({traind.x}, {traind.y}, layers, options);
The crash happens between the 2nd and 5th training iteration. When it does, I have to restart MATLAB before I can do any training at all, since reset(gpuDevice) also fails and returns:
Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error using parallel.gpu.CUDADevice/reset
An unexpected error occurred during CUDA execution. The CUDA error was:
all CUDA-capable devices are busy or unavailable
Training of the same network runs smoothly on CPU (although very slowly).
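If it helps, the CPU run differs only in the ExecutionEnvironment option, roughly like this (other option values shown are just examples):

% Same training, forced onto the CPU (slow but stable in my case)
options = trainingOptions('adam', ...
    'MaxEpochs', 100, ...
    'ExecutionEnvironment', 'cpu', ...   % 'gpu' is what triggers the crash
    'Verbose', true);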
NOTE: I have already increased the WDDM TDR Delay to 60 seconds, but nothing has changed. I have also tried disabling TDR altogether, with no success.
Here are some CUDA properties:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce RTX 2070'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10.2000
ToolkitVersion: 10.2000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
MultiprocessorCount: 36
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1

Best Answer

This issue seems to be specific to the training of recurrent neural networks. Following https://www.mathworks.com/matlabcentral/answers/485733-cuda-crashes-when-training-lstm-on-geforce-rtx-2080-super, I have fixed my issue by installing R2020a, with CUDA toolkit 10.1 and NVIDIA Studio Driver Version 431.86 WHQL (https://www.nvidia.com/Download/driverR ... 1050/en-us).
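After reinstalling, a quick way to confirm that MATLAB picked up the expected toolkit and driver is to query gpuDevice again and run a small operation on the device before launching a long training job (a minimal sketch, using the same property names as in the output above):

% Sanity check after switching to R2020a + Studio Driver 431.86
g = gpuDevice;
fprintf('Driver %.4f, Toolkit %.4f, %s\n', ...
    g.DriverVersion, g.ToolkitVersion, g.Name);

% Run a trivial kernel and wait for it to finish; an error here
% indicates the driver/toolkit setup is still broken
A = gpuArray.rand(1000);
wait(g);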