MATLAB: Parallelizing multiple iterations and randperm for a Monte Carlo simulation

bsxfungpuarrayparforrandpermrepmat

I have a function which randomly picks stocks to buy, subject to the constraint of a fixed number of holding periods and fixed number of stocks that can be bought at any one time. The actual function is as follows:
function profit=monteCarloSim(stockPrices,numHoldingPeriods,numStocks,numSimuls)
[rows,cols] = size(stockPrices);
stockReturns = diff(stockPrices);
profit = zeros(numSimuls,1);
parfor i=1:numSimuls
buyWeights = zeros(rows-1,cols);
buySignal = randperm(rows-1,numHoldingPeriods);
buyRandomAssetPicks = randperm(cols, numStocks);
buyWeights(buySignal,buyRandomAssetPicks)=1/numStocks;
portfolioReturns = sum(stockReturns.*buyWeights,2);
profit(i) = sum(portfolioReturns);
end
end
I was planning to limit the number of operations in the loop by assigning each iteration of buyWeights to a 3D matrix. This way, I would take the portfolioReturns and profits(i) steps out of the loop, and completely vectorize them.
However, this approach becomes complicated when I have a huge number of simulations and a large data set, because there's not enough memory to store the whole 3D matrix. So I have to come up with a few extra steps to manage memory, e.g. build smaller chunks of the 3D matrix iteratively.
Even if I had the memory to do the 3D matrix approach, I'll still have to iterate the randperm function. I can't find a function to repeat this more effectively and I'm still a novice at using bsxfun.
Is there a way to speed this function up? I also have a good number of GPU cores, if a gpuArray method helps.
Thanks in advance!

Best Answer

You could use the faster FEX: Shuffle instead of RANDPERM. But unfortunately this function is not threadsafe currently and the replied random numbers will be equal in each thread.
Allocating buyWeights repeatedly consumes time. Better set it to zero:
buyWeights = zeros(rows-1,cols);
parfor i=1:numSimuls
buyWeights(:) = 0;
...
But here the PARFOR is a problem also. I cannot test this currently, but I guess this could help for both problems:
block = round(linspace(1, numSimuls + 1, numberOfCores + 1));
parfor iCore = 1:numberOfCores
buyWeights = zeros(rows-1,cols);
Shuffle(rand(1,4) * 2^32, 'seed');
for iS = block(iCore):block(iCore + 1) - 1
buyWeights(:) = 0;
buySignal = Shuffle(rows-1, 'index', numHoldingPeriods);
buyRandomAssetPicks = Shuffle(cols, 'index', numStocks);
buyWeights(buySignal,buyRandomAssetPicks) = 1/numStocks;
profit(iS) = sum(stockReturns(:) .* buyWeights(:));
end
end
In addition it is avoided to calculate the sum over the 2nd dimension at first, because this means accessing elementes spread through wide sections of the RAM. Better create the sum over the first dimension at first:
b = sum(stockReturns .* buyWeights, 1);
profit(iS) = sum(b);
Or join it to one SUM command as in the example above.
NOTE: Please test and debug this, because it is written from the scratch in a text editor.