In terms of memory, that is. A parfor loop uses vastly more memory than its for-loop counterpart, apparently because it makes copies of all of the data for each worker. But it does this even when the data are read-only, so such copies are completely unnecessary: simultaneous reads of a piece of data from multiple threads are generally fine. Moreover, MATLAB already knows which data are read-only, through its variable 'classification'. Yet the copies are made anyway. I have lost a lot of time as my system grinds to a halt while trying to run parallelized code on large data files. Is there any way to remedy the situation? Or is it just a programming fail we have to live with (at least for now)?
MATLAB: Are parfor loops so inefficient
data parallel, data parallelism, efficiency, memory, Parallel Computing Toolbox, parfor
Related Solutions
Imagine that each MATLAB worker required 1 byte of data and instructions, and that you had a petabyte of memory. Clearly, after you increased the number of workers past 10^15, all of the memory would be used just in maintaining the workers, and you would not be able to improve performance by adding more workers.
The actual amount of memory used as overhead per worker varies with release. These days about 2 gigabytes is a good estimate, so with your petabyte of memory you would not be able to improve performance beyond roughly 500,000 workers.
You probably don't have a petabyte, though. You probably have 8 or 16 or 32 gigabytes, maybe 64. And in reality you need to account for the data used on each worker: some algorithms need very little memory, but some need gigabytes each. It would not be uncommon to start running out of memory by 8 workers.
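The arithmetic above is easy to sketch. Here is a minimal back-of-envelope calculation (the 2 GB per-worker overhead is the estimate from above, not a documented constant, and the helper name is my own):

```python
# Rough ceiling on worker count imposed by memory alone.
# Figures are illustrative estimates, not MathWorks-documented values.
def max_workers(total_mem_gb, per_worker_overhead_gb, per_worker_data_gb=0.0):
    """Number of workers that fit before memory is exhausted."""
    per_worker = per_worker_overhead_gb + per_worker_data_gb
    return int(total_mem_gb // per_worker)

# A petabyte (10^6 GB) with ~2 GB overhead per worker:
print(max_workers(1_000_000, 2))   # 500000
# A typical 32 GB desktop, 2 GB overhead plus 1 GB of data per worker:
print(max_workers(32, 2, 1))       # 10
```

The second case shows why the memory limit bites long before the core count does on ordinary hardware.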
Now... you have to get each worker the data it needs to work on, and you need to transfer the results back. So each iteration could potentially require sending a notable amount of data around. If the amount of work done with the data is small, then the overhead of sending and receiving the data can dominate. This is fairly common.
Next: each worker is a process that needs to be scheduled by the operating system. In practice the operating system needs a core to handle scheduling, device interrupts, the antivirus and firewall, polling for new email, user interaction, and so on. It doesn't necessarily need a dedicated core, but you should not count on getting much computation done on that core, so subtract one from your core count. The workers then have to be allocated to the remaining cores. If they are heavy CPU users, they will not give up the core often enough to make hyperthreading useful. Hyperthreading is fast process switching, not additional computing resources: when a process has to wait on something, the CPU can quickly switch to new work, but heavy computation is not waiting on anything external except during transfer of data between processes. Hyperthreading can actually slow down something that uses the CPU extensively.
We are now at the point where, once the number of workers exceeds (cores minus one), the workers contend for core access. Setting the number of workers equal to the number of cores is common, but one of them may not run at full speed because the operating system is using a core.
People have studied the optimal number of workers in various scenarios. There are some computations and data patterns, called "embarrassingly parallel", for which more cores always means more performance. But for more general tasks, it is common for performance to increase sharply up to 4 workers, moderately up to 6 (often enough to be worthwhile), less so up to 8... and beyond that it often becomes questionable whether more cores are cost effective.
If you were asked to choose between 16 cores at 2 gigahertz and 6 cores at 4 gigahertz, there are times when the larger number of slower cores is a big advantage, but more often you are better off with fewer, much faster cores (and correspondingly fewer workers).
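The diminishing returns described above can be sketched with Amdahl's law, a standard model (not a MATLAB-specific measurement): if a fraction p of the work parallelizes perfectly, the speedup on n workers is 1 / ((1 - p) + p/n).

```python
# Amdahl's law: speedup with n workers when a fraction p of the work
# is parallelizable. An illustrative model, not measured parfor data.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# With 90% parallel work, gains flatten quickly as workers are added:
for n in (1, 4, 6, 8, 16):
    print(n, round(speedup(0.9, n), 2))
```

Even with 90% of the work parallelizable, going from 8 to 16 workers gains far less than going from 1 to 4 did, which matches the pattern of sharp early gains and questionable later ones.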
If each calculation is independent (that is, you don't need the entire 15000000x1 cell to make one calculation), then you can rewrite the parfor loop to use a sliced variable: each worker receives only a "slice" of the large 15000000x1 cell. This avoids passing the whole cell to every worker as a broadcast variable, which would consume far too much memory.
The parfor loop should look something like this:
LargeCell = repmat({zeros(24)}, 100, 1); % Represent your 15000000x1 cell
Results = zeros(size(LargeCell));        % Store results here
parfor k = 1:length(LargeCell)
    Results(k) = complex_function(LargeCell{k}); % Your CPU-intensive calculation
end