MATLAB: Initializing GPU on multiple workers causes an unknown error

cuda 8.0, cuda jit, gpu, gpudevice, MATLAB, MATLAB Compiler SDK, Parallel Computing Toolbox, parpool, pascal gpu, r2016b

I've noticed that the following simple code results in a weird error when I use R2016b on a machine with two GTX1080Ti and one K2200:

% start a _new_ Matlab instance first!
parpool(16); 
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )

The error message I get:

Error using parallel.FevalOnAllFuture/fetchOutputs (line 69)
One or more futures resulted in an error.
Caused by:
    Error using parallel.internal.pool.deserialize>@()gather(gpuArray(1))
    An unexpected error occurred during CUDA execution. The CUDA error was:
    unknown error
    <-- repeated multiple times -->

After that, all GPU functionality gets completely broken:

>> a=gpuArray(1)
Error using gpuArray
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error

Even re-starting Matlab won't help. The fix is to clear the CUDA JIT cache folder, "%USERPROFILE%\AppData\Roaming\NVIDIA\ComputeCache".
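For reference, a minimal sketch of clearing that cache from MATLAB itself rather than by hand. It assumes the default Windows cache location quoted above (a custom CUDA_CACHE_PATH would point elsewhere), and I assume it has to run in a fresh session before any GPU function is called, since the cache files may otherwise be in use:

```matlab
% Assumption: the CUDA JIT cache is in its default Windows location.
cacheDir = fullfile(getenv('USERPROFILE'), 'AppData', 'Roaming', 'NVIDIA', 'ComputeCache');

% exist(...,'dir') returns 7 for folders and is available in R2016b.
if exist(cacheDir, 'dir') == 7
    % 's' removes the folder together with all of its contents.
    [ok, msg] = rmdir(cacheDir, 's');
    if ~ok
        warning('Could not remove %s: %s', cacheDir, msg);
    end
end
```

The driver recreates the folder on the next JIT compilation, so the first GPU call afterwards will be slow again.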

However, the following "longer pre-initialization" works OK for me:

% start a _new_ Matlab instance first and clear CUDA JIT cache if there was an error.
gpuDevice(1)
gather(gpuArray(1)) 
parpool(); 
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) ) 
fetchOutputs(parfevalOnAll(@() gather(gpuArray(1)),1))

AFAIU:

MATLAB R2016b, which I use here, was designed for CUDA 7.5, and it has no binaries for CUDA compute capability 6.1.
That's why MATLAB uses the CUDA JIT to recompile a ton (~400 MB) of stuff when the user calls any GPU-related function for the first time. (This also causes many "gpuDevice() is slow" questions.)
There's something wrong with that JIT when it is combined with parpool (a race condition?).
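To check the version mismatch described above on a given machine, something like the following sketch could be used. It only reads standard gpuDevice properties; note that gpuDevice(k) selects device k, so the last device stays selected afterwards, and on an affected setup the first call may itself trigger the slow JIT step:

```matlab
% List every CUDA device MATLAB can see, with its compute capability
% and the CUDA toolkit/driver versions MATLAB reports for it.
n = gpuDeviceCount;
for k = 1:n
    d = gpuDevice(k);   % selects device k as a side effect
    fprintf('%d: %s, compute capability %s, toolkit %g, driver %g\n', ...
        d.Index, d.Name, d.ComputeCapability, d.ToolkitVersion, d.DriverVersion);
end
```

A device whose compute capability is newer than the toolkit version reported here is, as far as I understand, the case where the JIT path is taken.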

My system is: Windows 10, CUDA 8.0 (cuda_8.0.61_win10) with patch 2 (cuda_8.0.61.2_windows), nvidia driver r384.94. The CUDA_CACHE_MAXSIZE environment variable is set to 2147483647.

My questions:

Is my "longer pre-initialization" workaround actually safe? Is it a real workaround for that race condition, or is it only as good as the original code (stable on my specific system, but likely to fail on some other)? Assume I have to stay with R2016b for now, targeting CUDA 8.0 and Pascal GPUs (building a DLL).
The same code works in R2017b, R2018a and above. Is that only because they don't use the CUDA JIT here, or is the real underlying issue actually fixed? (I don't have a device with compute capability >6.x at hand, so I'm unable to check that.)
R2017a behaves like R2016b here, even though it claims CUDA 8.0 support: it still writes something (but only ~40 MB) to the CUDA JIT cache, fails in test #1 (the first snippet) and works in test #2 (the pre-initialization snippet).
