Hi.
I'm currently running a program that takes several hours to complete. Roughly half of that time is spent on one line of code (repeated hundreds of thousands of times), which does multiplication, comparison, and addition on a ~300×300 matrix (not sparse) and ~1×300 vectors.
I wanted to look into parallel processing to speed up the program a bit. Technically the program is supposed to run sequentially, where each iteration relies on output from the last, but I can do the for-loop iterations in batches with the same initial values. However, that's more of a follow-up question.
The question I have pertains to the results I'm getting with using a for loop, vs a parfor loop for essentially just executing this one line of code several thousand times.
The test code I have below runs two for-loops and records the time to execute each one. The second for-loop runs the mathematical process in one line. The first for-loop runs the same algorithm, but separates it into four smaller operations and then combines them to complete the algorithm. Each for-loop runs 10,000 times. This is repeated 20 times, and the mean time required for both for-loops is recorded. Then the program is run with "parfor" replacing the for-loops, with the core limit set to 1, 2, 3, 4, and 6.
The code and results are below:
%%%%parfor test
%%%%figure out how the hell to use parfor
L_rate = .01;
cd = 1;
Vl0 = rand(1,256);
Hl0 = rand(1,301);
Vlcd = rand(1,257);
Hlcd = rand(1,301);
dwnet = zeros(257,301);
%%%calculation to perform:
%%%dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
for k = 1:20
    tic
    parfor (n=1:10000,1)
        dwnet = zeros(257,301);
        v1 = L_rate/sqrt(cd);
        v2 = [Vl0,1]';
        v3 = rand(size(Hl0))<Hl0;
        v4 = Vlcd'*Hlcd;
        dwnet = v1*(v2*v3-v4);
    end
    time1(k,1) = toc; %toc;
    tic
    parfor (n=1:10000,1)
        dwnet = zeros(257,301);
        dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
    end
    time2(k,1) = toc; %toc;
end
plot([time1,time2])
mean([time1,time2])
%%%%Time results
%%%           forloop_1   forloop_2
%%%for        14.89sec    13.89sec
%%%parfor,1    9.95sec    10.98sec
%%%parfor,2    9.74sec    10.60sec
%%%parfor,3    9.91sec    10.90sec
%%%parfor,4    9.92sec    10.95sec
%%%parfor,6    9.93sec    10.93sec
%%%Run on Intel Core i7-Q720 (first-gen i-series; 4 cores, 8 logic threads)
So, as I expected, using a regular for-loop it is in fact faster to do everything with one line. This makes sense, as I'm adding time by writing v1 v2 v3 and v4 on top of ultimately doing the same calculation with dwnet.
But what fails to make sense to me is that when I use parfor, the split-up version (the first loop) suddenly becomes MORE efficient than the one-line version, by roughly 10%. Also, increasing the core limit while using parfor actually seems to slow the algorithm down slightly, though that could just be a result of my poor understanding of parfor.
Question: Why is the first for-loop more efficient than the second when using parfor? Feel free to expound upon parfor as much as you care to.
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch). Any advice on how to do that with parfor? I need to get the cumulative sum of dwnets from each iteration to update Vl0 and Hl0, so I'd have to add up the dwnets for each worker, and then add them all up between the workers. But I'm fuzzy on how that's handled, and how parfor handles variables.
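To make the follow-up concrete, here is a rough sketch of one batch as described above, assuming the running total can be written as a parfor reduction variable (a variable updated only as x = x + ...). The names batchSize, VlcdBatch, scale, and dwnetSum are hypothetical and not from my real program, and the random VlcdBatch is only a stand-in for the real per-iteration Vlcd values. I don't know whether this is the recommended way to do it.

```matlab
% Sketch only: batchSize, VlcdBatch, scale and dwnetSum are hypothetical names.
L_rate = .01;
cd = 1;
Vl0 = rand(1,256);
Hl0 = rand(1,301);
Hlcd = rand(1,301);

batchSize = 50;                    % assumed batch size (10-100 per batch)
VlcdBatch = rand(batchSize,257);   % assumed: one Vlcd row per iteration

scale = L_rate/sqrt(cd);           % constant within a batch
v2 = [Vl0,1]';                     % Vl0 is held fixed within a batch
dwnetSum = zeros(257,301);         % running total over the batch

parfor n = 1:batchSize
    Vlcd = VlcdBatch(n,:);         % each iteration reads only its own row
    dwnet = scale*(v2*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
    dwnetSum = dwnetSum + dwnet;   % reduction: partial sums are combined after the loop
end

% dwnetSum would then be used to update Vl0 and Hl0 before the next batch.
```

The idea is that Vl0, Hl0, and Hlcd stay the same for the whole batch, only Vlcd changes per iteration, and dwnetSum is the cumulative sum I would need afterwards.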
Any help answering this question is greatly appreciated.
Best Answer