MATLAB: Efficiency of for vs parfor loop for iterating a moderately large matrix operation


I'm currently running a program that takes several hours to complete. Roughly half of that time is spent on one line of code (repeated hundreds of thousands of times), which does multiplication, comparison, and addition on a ~300×300 matrix (not sparse) and ~1×300 vectors.
I wanted to look into parallel-processing to speed up the program a bit. Technically the program is supposed to run linearly, where each iteration relies on output from the last, but I can do the for loop iterations in batches with the same initial values. However that's more of a follow-up question.
The question I have pertains to the results I'm getting with using a for loop, vs a parfor loop for essentially just executing this one line of code several thousand times.
The test code below runs two for-loops and records the time taken by each. The second for-loop performs the calculation in one line. The first for-loop runs the same calculation, but splits it into four smaller operations and then combines them. Each for-loop runs 10,000 iterations. This is repeated 20 times, and the mean time for each for-loop is recorded. Then the program is run with "parfor" replacing the for-loops, with the core limit set to 1, 2, 3, 4, and 6.
The code, and results are below:
%%%%parfor test
%%%%figure out how the hell to use parfor
L_rate = .01;
cd = 1;
Vl0 = rand(1,256);
Hl0 = rand(1,301);
Vlcd = rand(1,257);
Hlcd = rand(1,301);
dwnet = zeros(257,301);
%%%calculation to perform:
%%%dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
for k = 1:20
    %%%loop 1: the calculation split into four smaller operations
    tic
    parfor (n=1:10000,1)
        dwnet = zeros(257,301);
        v1 = L_rate/sqrt(cd);
        v2 = [Vl0,1]';
        v3 = rand(size(Hl0))<Hl0;
        v4 = Vlcd'*Hlcd;
        dwnet = v1*(v2*v3-v4);
    end
    time1(k,1) = toc;

    %%%loop 2: the same calculation in one line
    tic
    parfor (n=1:10000,1)
        dwnet = zeros(257,301);
        dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
    end
    time2(k,1) = toc;
end
%%%%Time results
%%%            forloop_1   forloop_2
%%%for         14.89sec    13.89sec
%%%parfor,1     9.95sec    10.98sec
%%%parfor,2     9.74sec    10.60sec
%%%parfor,3     9.91sec    10.90sec
%%%parfor,4     9.92sec    10.95sec
%%%parfor,6     9.93sec    10.93sec
%%%Run on Intel Core i7-Q720 (first-gen i-series; 4 cores, 8 logical threads)
So, as I expected, using a regular for-loop it is in fact faster to do everything with one line. This makes sense, as I'm adding time by writing v1 v2 v3 and v4 on top of ultimately doing the same calculation with dwnet.
But what fails to make sense to me is that when I use parfor, the split-up version (the first loop) suddenly becomes MORE efficient than the one-line version, by roughly 10%. Also, increasing the core limit while using parfor actually seems to slightly slow the algorithm down, though that could just be a result of my poor understanding of parfor.
Question: Why is the first for-loop more efficient than the 2nd when using parfor? Feel free to expound upon parfor as much as you care to.
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch). Any advice on how to do that with parfor? I need to get the cumulative sum of dwnets from each iteration to update Vl0 and Hl0, so I'd have to add up the dwnets for each worker, and then add them all up between the workers. But I'm fuzzy on how that's handled, and how parfor handles variables.
Any help answering this question is greatly appreciated.

Best Answer

The statements
dwnet = zeros(257,301);
are unnecessary overhead and are clouding the issue. I re-implemented your code as below, removing all 3 occurrences of these statements and also handling v1 more efficiently. On my machine, the results of a plain for-loop are not significantly different. I expected no difference.
When running with parfor, I find that loop 1 is indeed about 25% faster than loop 2. I assume it's because parfor pre-parses the contents of the loop, analyzing which variables are temporary and which have other roles, see Classification of Variables. Because of this, I imagine parfor can pre-allocate temporary variables or optimize their use in some other obscure way.
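To make the variable classification mentioned above concrete, here is a minimal sketch of how the documented parfor categories (broadcast, sliced, reduction, temporary) would apply to a loop shaped like the one in the question. The sizes follow the question; the 8-row `Vlcd`, the `rowsum` array, and the `dwsum` accumulator are hypothetical and exist only to show each category. It illustrates the classification itself and does not confirm the speculation about why loop 1 runs faster.

```matlab
v1   = 0.01;               % broadcast: set before the loop, only read inside it
Vl0  = rand(1,256);        % broadcast
Hl0  = rand(1,301);        % broadcast
Hlcd = rand(1,301);        % broadcast
Vlcd = rand(8,257);        % sliced input (hypothetical layout): iteration n reads row n only
rowsum = zeros(8,1);       % sliced output (hypothetical): iteration n writes element n only
dwsum  = zeros(257,301);   % reduction (hypothetical accumulator): summed across iterations

parfor n = 1:8
    % v3 and dwnet are temporary: assigned before use in every iteration,
    % so they are private to each iteration and not returned to the client
    v3    = rand(size(Hl0)) < Hl0;
    dwnet = v1*([Vl0,1].'*v3 - Vlcd(n,:).'*Hlcd);
    rowsum(n) = sum(dwnet(:));   % sliced output
    dwsum     = dwsum + dwnet;   % reduction: the order of the additions is not guaranteed
end
```

In this sketch `dwnet` plays the same role as in the benchmark: since it is a temporary, its value is discarded at the end of each iteration unless it is copied into a sliced or reduction variable.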
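v1 = L_rate/sqrt(cd);   % scalar factor computed once, outside the loops
for k = 1:5
    % loop 1: calculation split into steps
    tic
    parfor (n=1:10000,2)
    %for n=1:1e4
        v2 = v1*[Vl0,1].';
        v3 = rand(size(Hl0))<Hl0;
        v4 = Vlcd'*Hlcd;
        dwnet = v2*v3-v4;
    end
    time1(k,1) = toc;

    % loop 2: the same calculation in one line
    tic
    parfor (n=1:10000,2)
    %for n=1:1e4
        dwnet = (v1*[Vl0,1]')*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd;
    end
    time2(k,1) = toc;
end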
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch).
You would probably have to use a double for-loop
for i=1:N
parfor j=1:M
where the inner parfor loop loops over the batches over which Vl0 and Hl0 remain constant. It might not be worthwhile, though. Batches that small could probably be vectorized, circumventing a for-loop altogether. At least, your simplified example certainly can be.
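As a rough illustration of the vectorization idea, the sketch below computes the summed dwnet for one batch both with a loop and with matrix products, and compares the two. It assumes a hypothetical batch size `B`, that the per-iteration `Vlcd` and `Hlcd` are stacked as rows of `VlcdB` and `HlcdB`, and that the random draws are generated up front in `R` so both versions see the same numbers; none of these names come from the original code.

```matlab
B     = 50;                % hypothetical batch size
v1    = 0.01;              % stands for L_rate/sqrt(cd)
Vl0   = rand(1,256);
Hl0   = rand(1,301);
VlcdB = rand(B,257);       % assumed layout: row j is Vlcd for iteration j
HlcdB = rand(B,301);       % assumed layout: row j is Hlcd for iteration j
R     = rand(B,301);       % one row of random draws per iteration

% Loop version: accumulate dwnet over the batch
dwsumLoop = zeros(257,301);
for j = 1:B
    v3 = R(j,:) < Hl0;
    dwsumLoop = dwsumLoop + v1*([Vl0,1].'*v3 - VlcdB(j,:).'*HlcdB(j,:));
end

% Vectorized version: the sum of outer products becomes one matrix product
H = R < repmat(Hl0, B, 1);                       % B-by-301 logical
dwsumVec = v1*([Vl0,1].'*sum(H,1) - VlcdB.'*HlcdB);

maxErr = max(abs(dwsumLoop(:) - dwsumVec(:)));   % should be at round-off level
```

The identity used here is that the sum over j of `VlcdB(j,:).'*HlcdB(j,:)` equals `VlcdB.'*HlcdB`, and that `[Vl0,1].'` can be factored out of the first term because it is constant within the batch.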
You also have to factor in communication to/from the labs. A normal parfor loop would have to broadcast its data to/from the client every time it begins and ends, so in your case at the beginning and end of every i-th iteration of the outer for-loop. It may help, however, to use Edric Ellis' Worker Object Wrapper, which can make data persist between parfor-loops.
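For the accumulation part of the follow-up, a minimal sketch of the double-loop structure described above might look like this, with the per-batch sum handled by a parfor reduction variable. `N`, `M`, `VlcdB`, `dwsum`, and the update step at the end of each batch are all placeholders: the real update rule for Vl0 and Hl0 is not given in the question, so the one below is only a stand-in.

```matlab
N    = 3;                  % hypothetical number of batches (outer updates)
M    = 20;                 % hypothetical number of iterations per batch
v1   = 0.01;               % stands for L_rate/sqrt(cd)
Vl0  = rand(1,256);
Hl0  = rand(1,301);
Hlcd = rand(1,301);

for i = 1:N
    VlcdB = rand(M,257);       % stand-in for the Vlcd values of this batch
    dwsum = zeros(257,301);    % reduction variable, reset for every batch

    parfor j = 1:M
        v3 = rand(size(Hl0)) < Hl0;
        % Each worker keeps a partial sum; parfor combines the partial
        % sums across workers when the loop finishes.
        dwsum = dwsum + v1*([Vl0,1].'*v3 - VlcdB(j,:).'*Hlcd);
    end

    % Placeholder update only, NOT the asker's actual rule
    Vl0 = Vl0 + mean(dwsum(1:256,:), 2).';
    Hl0 = Hl0 + mean(dwsum, 1);
end
```

Written this way there is no need to add up the dwnets per worker by hand. Vl0, Hl0, and Hlcd are still sent to the workers at the start of every inner parfor, which is the communication cost mentioned above.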