MATLAB: How to control which GPUs and CPUs get which tasks during multiple calls to trainNetwork

Tags: cpu, cpu/gpu, Deep Learning Toolbox, gpu, Parallel Computing Toolbox, spmd, trainnetwork

I am working on a machine with a number of CPU cores (40) and a number of GPUs (4). I need to train a large number of shallow LSTM neural networks (~500,000), and would like to use my compute resources as efficiently as possible.
Here are the options I've come up with:
1) parpool('local') gives at most 40 workers, which matches the number of CPU cores available. Apparently parpool('local') does not provide access to the GPUs – is this correct? I can then use spmd to launch a separate instance of trainNetwork on each CPU core, which runs 40 such instances at a time.
I have three questions about this:
First, is there a way to use both the GPUs and CPUs as separate labs (i.e., with different labindex values) in my spmd block? Why do I not have a total of 44 available workers from parpool?
Second, is there a way to assign more than one CPU core to a particular lab? For example, could I divide my 40 cores into 8 groups of 5 and deploy a separate instance of trainNetwork to each of the 8 groups?
Third, given that I am using LSTMs, my 'ExecutionEnvironment' options are 'gpu', 'cpu', and 'auto', but it appears that the 'cpu' option uses more than one CPU core at a time: the timing for each task increases by a factor of about 6 when I use spmd versus running only one instance of trainNetwork (with 'ExecutionEnvironment' = 'cpu') at a time. This leads me to believe that a single instance of trainNetwork with 'ExecutionEnvironment' = 'cpu' uses more than one CPU core. Is this correct?
2) I can access the GPUs individually using gpuDevice, and I can run 4 instances of trainNetwork simultaneously on my 4 GPUs. This works well, with effectively linear speedup compared to using only one GPU at a time, but apparently it does not take advantage of my CPUs.
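For reference, the shape of what I'm doing for option 2 is roughly the following (just a sketch – the task count, the dummy data, and the shallow LSTM layers are placeholders, not my real code):

numTasks = 8;                                % placeholder; really ~500,000
XCell = cell(numTasks, 1);
YCell = cell(numTasks, 1);
for k = 1:numTasks                           % dummy sequence-to-one data
    XCell{k} = arrayfun(@(~) rand(3, 50), (1:20)', 'UniformOutput', false);
    YCell{k} = rand(20, 1);
end
layers = [ sequenceInputLayer(3)             % placeholder shallow LSTM
           lstmLayer(10, 'OutputMode', 'last')
           fullyConnectedLayer(1)
           regressionLayer ];

delete(gcp('nocreate'));                     % open one worker per GPU
parpool('local', gpuDeviceCount);
spmd
    gpuDevice(labindex);                     % bind this worker to one GPU
    for k = labindex:numlabs:numTasks        % this worker's share of tasks
        net = trainNetwork(XCell{k}, YCell{k}, layers, ...
            trainingOptions('adam', ...
                'ExecutionEnvironment', 'gpu', ...
                'Verbose', false));
        % ... save net for task k ...
    end
end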
Ideally, I'd like a way to (1) test scaling across multiple CPUs for my particular trainNetwork problem, and (2) run multiple parallel instances of trainNetwork that use all of my hardware. The best option seems to be to let the GPUs each take a share of the trainNetwork instances in parallel, and then to deploy groups of CPUs (with the optimal group size currently unknown) to handle the remaining instances.
Is there a way to do this?
Thank you,
Grey

Best Answer

The computation on the GPU is so much faster than on the CPU for a typical Deep Learning example that there are only disadvantages to getting the CPU cores involved for the most intensive parts of the computation. Of course the CPU is being used, for all the MATLAB business logic, but that is generally low overhead and not suitable for GPU execution.
When you train on the CPU only, the heavy computation is heavily vectorized and multithreaded, so there is a good chance that moving to parallel execution won't give much of an additional advantage. Parallel execution across multiple CPUs comes more into its own when you go multi-node, i.e. when you have a cluster of multiple machines.
You can control how much multithreading MATLAB does using maxNumCompThreads. You could run this inside parfevalOnAll to set the multithreading level on each worker of your pool before training. That way you may find a good balance between the number of MATLAB workers and the number of threads per worker for your particular network. You may indeed find that for your network there is a pool size for which training in parallel is effective even on a single machine.
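For example, something along these lines (a sketch only – the 8-by-5 split is just a starting point to vary, and XCell, YCell, and layers stand for the per-task data and network from your own code):

numWorkers = 8;                              % e.g. 8 workers x 5 threads on 40 cores
threadsPerWorker = 5;
delete(gcp('nocreate'));
parpool('local', numWorkers);

% Cap the multithreading each worker uses during training.
f = parfevalOnAll(@maxNumCompThreads, 0, threadsPerWorker);
wait(f);

numTasks = numel(XCell);                     % XCell, YCell, layers as in your sketch
nets = cell(numTasks, 1);
parfor k = 1:numTasks                        % each worker trains a share on the CPU
    nets{k} = trainNetwork(XCell{k}, YCell{k}, layers, ...
        trainingOptions('adam', ...
            'ExecutionEnvironment', 'cpu', ...
            'Verbose', false));
end

Timing a fixed batch of tasks while varying numWorkers and threadsPerWorker (keeping their product at or below the 40 cores) should tell you quickly whether any split beats the default single-process multithreading on your machine.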