MATLAB: Initializing GPU on multiple workers causes an unknown error

Tags: cuda 8.0, cuda jit, gpu, gpudevice, MATLAB, MATLAB Compiler SDK, Parallel Computing Toolbox, parpool, pascal gpu, r2016b

I've noticed that the following simple code results in a strange error when I run R2016b on a machine with two GTX 1080 Ti cards and one K2200:
% start a _new_ Matlab instance first!
parpool(16);
fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )
The error message I get:
Error using parallel.FevalOnAllFuture/fetchOutputs (line 69)
One or more futures resulted in an error.
Caused by:
Error using parallel.internal.pool.deserialize>@()gather(gpuArray(1))
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
<-- repeated multiple times -->
After that, all GPU functionality is completely broken:
>> a=gpuArray(1)
Error using gpuArray
An unexpected error occurred during CUDA execution. The CUDA error was:
unknown error
Even restarting MATLAB doesn't help. The fix is to clear the CUDA JIT cache folder, "%USERPROFILE%\AppData\Roaming\NVIDIA\ComputeCache".
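For reference, clearing the cache could be scripted roughly as follows. This is a minimal sketch that assumes the cache lives at the default path quoted above; the variable name cacheDir is only illustrative. It is probably best run in a fresh MATLAB session before any GPU function has been called, so that no cache files are in use.

```matlab
% Sketch: delete the CUDA JIT cache so it is rebuilt on the next GPU call.
% Assumes the default Windows cache location mentioned in the question.
cacheDir = fullfile(getenv('USERPROFILE'), 'AppData', 'Roaming', ...
                    'NVIDIA', 'ComputeCache');

if exist(cacheDir, 'dir') == 7
    % The 's' flag removes the folder together with all of its contents.
    [ok, msg] = rmdir(cacheDir, 's');
    if ~ok
        warning('Could not remove %s: %s', cacheDir, msg);
    end
else
    fprintf('No cache folder found at %s\n', cacheDir);
end
```

The next GPU call will then trigger a full JIT recompilation, so expect the first gpuDevice or gpuArray call to be slow again.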
However, the following "longer pre-initialization" works OK for me:
% start a _new_ Matlab instance first and clear CUDA JIT cache if there was an error.
gpuDevice(1)
gather(gpuArray(1))
parpool();
fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) )
fetchOutputs(parfevalOnAll(@() gather(gpuArray(1)),1))
As far as I understand:
  1. MATLAB R2016b, which I use here, was built for CUDA 7.5 and has no binaries for CUDA compute capability 6.1.
  2. That's why MATLAB uses the CUDA JIT to recompile a ton (~400 MB) of code the first time the user calls any GPU-related function. (This also causes many "gpuDevice() is slow" questions.)
  3. There's something wrong with that JIT when it is combined with parpool (a race condition?).
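If the failure really is a race between workers that all JIT-compile at once, one way to probe that would be to initialize the GPU on one worker at a time. The sketch below does this with spmd and labBarrier (the R2016b-era names). It rests on the unverified assumption that serializing the first GPU call avoids concurrent writes to the JIT cache, so treat it as an experiment, not a confirmed fix.

```matlab
% Sketch: serialize GPU initialization across pool workers.
% Assumption (not verified): only the concurrent first call is problematic.
gpuDevice(1);            % warm up the client first, as in the workaround
gather(gpuArray(1));

pool = gcp();            % reuse the open pool, or start the default one

spmd
    for k = 1:numlabs
        if labindex == k
            gpuDevice(1);          % select device 1 on this worker only
            gather(gpuArray(1));   % force a trivial round trip to the GPU
        end
        labBarrier;                % all workers wait before the next one starts
    end
end
```

If this variant succeeds where the plain parfevalOnAll call fails on a clean cache, that would support the race-condition guess; if it fails too, the cause is likely elsewhere.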
My system is: Windows 10, CUDA 8.0 (cuda_8.0.61_win10) with patch 2 (cuda_8.0.61.2_windows), NVIDIA driver r384.94. The CUDA_CACHE_MAXSIZE environment variable is set to 2147483647.
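To compare setups between machines, the relevant values can be printed from inside MATLAB. This is a small sketch using properties of the object returned by gpuDevice. Note that DriverVersion here is the CUDA version supported by the driver, not the NVIDIA release number such as r384.94, and that calling gpuDevice itself triggers GPU initialization.

```matlab
% Sketch: report what MATLAB sees for the currently selected GPU.
d = gpuDevice();   % initializes the GPU if it has not been used yet

fprintf('Device:             %s\n', d.Name);
fprintf('Compute capability: %s\n', d.ComputeCapability);
fprintf('CUDA driver:        %g\n', d.DriverVersion);
fprintf('CUDA toolkit:       %g\n', d.ToolkitVersion);

% Empty output means the variable is not set in MATLAB's environment.
fprintf('CUDA_CACHE_MAXSIZE: %s\n', getenv('CUDA_CACHE_MAXSIZE'));
```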
My questions:
  1. Is my "longer pre-initialization" workaround actually "safe"? Is it a real workaround for that "race condition"? Or is it only as good as the original (it might be stable on my specific system but likely to fail on others)? Assume I have to stay with R2016b for now, targeting CUDA 8.0 and a Pascal GPU (building a DLL).
  2. The same code works OK in R2017b-R2018a and above. Is that just because they don't use the CUDA JIT here? Or is the real underlying issue actually fixed? (I don't have a device with compute capability >6.x at hand, so I'm unable to check that.) R2017a behaves like R2016b here, even though it claims CUDA 8.0 support: it still writes something (but just ~40 MB) to the CUDA JIT cache, fails in test #1 and works in test #2.

Best Answer

As noted in the comments, it looks like the issue does not occur with newer driver versions. So, sorry for the noise.