MATLAB: Why is batch() so slow?

batch, parallel, Parallel Computing Toolbox

I'm trying to use batch() to load some data from a slow disk in the background, but it is extremely slow. See the code example with timings below. The slowdown seems larger than the overhead of communicating with the worker can explain (note that the example does not even transfer the loaded data from the worker to the client).
>> a = rand(512, 512, 1000);
>> save('a');
>> tic; load('a'); toc
Elapsed time is 5.574926 seconds.
>> tic; b = batch(@load, 1, {'a'}); toc; tic; wait(b); toc;
Elapsed time is 0.444297 seconds.
Elapsed time is 41.229590 seconds.
You can see that the batch job takes more than 35 s longer than the same operation on the client. This is not because a new MATLAB worker has to be started: in my example, a worker was already running (if no worker were running, the batch(…) command itself would take longer, not the wait(b)).
Where does this overhead come from? How can I avoid it? (I also tried parfeval, but it has a memory leak that makes it unusable, which MathWorks has confirmed as a known bug.)
Thanks, Matthias

Best Answer

Firstly, if you're using the local cluster type, then the batch command absolutely does need to launch the worker MATLAB process. It is not already running, and you can verify this using Task Manager or similar. (Clusters of type MJS keep the workers running.) The time for the batch command itself is just the time needed to create the parallel.Job and parallel.Task objects for the batch job and save them to disk.
Roughly speaking, the time taken to submit the job and wait for its results can be broken down like this:
  1. Time taken to create and submit the batch job to the scheduler
  2. Time taken to launch the worker process (unless you're using MJS)
  3. Time taken for the worker to load the job and task information
  4. Time for the worker to actually run the task
  5. Time for the worker to save the task results to disk (or database for MJS)
I suspect that most of the "missing" time comes from item 5 in the list above: as you've written it, the 512x512x1000 array is returned by your task function @load, and this result gets saved to disk by the worker.
How long does your save('a') command take? I suspect item 5 would take at least that long.
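One way to check the item 5 hypothesis is to run the same load on the worker twice, once returning the whole array and once returning only something tiny. The following is a minimal sketch, not a definitive benchmark. The file name a_small.mat and the helper loadSizeOnly are made up for illustration, and the array is kept small so the sketch runs quickly. It also assumes the worker starts in the client's current folder, which is the batch default as far as I know.

```matlab
% Small test file so the sketch runs quickly; the question used 512x512x1000.
a = rand(64, 64, 10);
save('a_small.mat', 'a');            % hypothetical file name

% Hypothetical task function: loads the file on the worker but returns
% only the size of the array, so almost nothing has to be saved as output.
loadSizeOnly = @(fname) size(getfield(load(fname), 'a'));

% Variant 1: the full array is the task output (as in the question).
tic; bFull = batch(@load, 1, {'a_small.mat'});
wait(bFull); tFull = toc;

% Variant 2: only a 1x3 size vector is the task output.
tic; bSize = batch(loadSizeOnly, 1, {'a_small.mat'});
wait(bSize); tSize = toc;

fprintf('full array returned: %.1f s\n', tFull);
fprintf('size only returned:  %.1f s\n', tSize);

out = fetchOutputs(bSize);           % cell array with one element, the size vector

% Clean up the job data stored on disk.
delete(bFull);
delete(bSize);
```

With the full-size array from the question, a large gap between the two timings would point at saving the task output as the main cost. If the two timings are close, the time is going elsewhere, for example into worker startup.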
Note that there are several additional properties on the job object that can help you work out what's going on - see the reference page. In particular, note CreateTime, SubmitTime, StartTime, and FinishTime. The underlying task object has the same properties (except SubmitTime).
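As a rough sketch of how these properties could be read after a job finishes, the snippet below prints the job and task timestamps named above. The trivial task (@rand) is only a stand-in for illustration. The property names are the ones given in this answer. I believe newer releases expose datetime-valued equivalents (CreateDateTime, SubmitDateTime, StartDateTime, FinishDateTime), so check the reference page for your release if the names below are not available.

```matlab
% Stand-in task for illustration; replace with the real batch call.
job = batch(@rand, 1, {100});
wait(job);

% Job-level timestamps (property names as given in the answer;
% assumed here to be character vectors, as in older releases).
fprintf('Job created:   %s\n', job.CreateTime);
fprintf('Job submitted: %s\n', job.SubmitTime);
fprintf('Job started:   %s\n', job.StartTime);
fprintf('Job finished:  %s\n', job.FinishTime);

% The underlying task has the same properties except SubmitTime.
t = job.Tasks(1);
fprintf('Task created:  %s\n', t.CreateTime);
fprintf('Task started:  %s\n', t.StartTime);
fprintf('Task finished: %s\n', t.FinishTime);

finalState = job.State;   % kept so the state can be checked after cleanup
delete(job);              % remove the job data from disk
```

The gap between submission and job start roughly covers worker launch (item 2), and the gap between task start and task finish covers running the task (item 4). Which of the remaining items falls inside which interval is something to confirm by experiment rather than assume.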