MATLAB: How to use map and reduce efficiently

mapreduce

I have the following table, which has over 40 million rows and 5 columns:
Screenshot 2019-08-08 at 15.58.48.png
The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID – some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.
I wish to do 2 things:
1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.
2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.
3) divide A by B.
This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.
ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);
where the functions are defined as
function gvkeyMapFun2(data, ~, intermKVStore)
% gets quarter variables
vQuarter = num2str(data.rdate); % char format
% gets cusip in char format
vNCUSIP = cell2mat(data.cusip);
% creates quarter-ncusip identifer
IDnum = strcat(vQuarter,vNCUSIP);
IDnum = cellstr(IDnum);
% finds unique NCUSIPS-quarter
[intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
% gets variables of intersst
dataOwnership = cellfun(@(x) str2double(x),data.shares);
dataTotalShares = data.shrout2;
for ii = 1:numel(intermKeys)
totalOwnership = sum(dataOwnership(idx==ii));
totalShares = mode(dataTotalShares(idx==ii));
totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ];
add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
end
end
and
function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
databasereducedFinal = array2table([]);
while hasnext(intermValIter)
databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
end
add(outKVStore, 'output', databasereducedFinal);
end
I then run
output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});
to try and get the table because "output" looks like this:
Screenshot 2019-08-08 at 12.12.29.png
I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.
thank you.

Best Answer

Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},
t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));
I was getting an error message
What was the error message. I would suspect that the a posteriori str2double conversion would really slow things down and really it shouldn't be necessary.