I've noticed that the following simple code results in an weird error, if I use R2016b on a machine with two GTX1080Ti and one K2200 :
% start a _new_ Matlab instance first!parpool(16);fetchOutputs( parfevalOnAll(@() gather(gpuArray(1)),1) )
The error message I get:
Error using parallel.FevalOnAllFuture/fetchOutputs (line 69)One or more futures resulted in an error.Caused by:Error using parallel.internal.pool.deserialize>@()gather(gpuArray(1))An unexpected error occurred during CUDA execution. The CUDA error was:unknown error<-- repeated multiple times -->
After that, all GPU functionality gets completely broken:
>> a=gpuArray(1)Error using gpuArrayAn unexpected error occurred during CUDA execution. The CUDA error was:unknown error
Even re-starting Matlab won't help. The fix is to clear the CUDA JIT cache folder, "%USERPROFILE%\AppData\Roaming\NVIDIA\ComputeCache".
However, the following "longer pre-initialization" works OK for me:
% start a _new_ Matlab instance first and clear CUDA JIT cache if there was an error.gpuDevice(1)gather(gpuArray(1))parpool();fetchOutputs( parfevalOnAll(@() gpuDevice(1),1) )fetchOutputs(parfevalOnAll(@() gather(gpuArray(1)),1))
- Matlab R2016b that I use here, was designed for CUDA 7.5, and there are no binaries for CUDA Compute Capability 6.1.
- That's why Matlab uses CUDA JIT to recompile a ton (~400 MB) of stuff when user calls any gpu-related function the first time. (Which also causes many " gpuDevice() is slow " questions.
- There's something wrong with that JIT, if combined with parpool (a race condition?).
My system is: Windows 10, CUDA 8.0 (cuda_8.0.61_win10) with patch 2 (cuda_220.127.116.11_windows), nvidia driver r384.94. The CUDA_CACHE_MAXSIZE environment variable is set to 2147483647.
- Is my "longer pre-initialization" workaround actually "safe"? Is it a real workaround for those "race condition"? Or is it as good as the original (might be stable on my specific system, but is likely to fail on some other)? Assuming I have to stay with R2016b for now, targeting CUDA 8.0 and Pascal GPU (building a dll).
- Same code works OK in R2017b-R2018a and above. Is that just because they don't use CUDA JIT here? Or is the real underlying issue actually fixed? (I don't have a device with compute capability >6.x at hand, so I'm unable to check that.)R2017a behaves like R2016b here, even though it claims CUDA 8.0 support – it still writes something (but just ~40MB) to CUDA JIT cache, fails in test #1 and works in test #2.