I have the following table, which has over 40 million rows and 5 columns:
The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID – some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.
I wish to do 2 things:
1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.
2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.
3) divide A by B.
This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.
ds = datastore('myFile.csv');ds.TextscanFormats{3} = '%q';ds.TextscanFormats{4} = '%q';outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);output = readall(outds);
where the functions are defined as
function gvkeyMapFun2(data, ~, intermKVStore) % gets quarter variables
vQuarter = num2str(data.rdate); % char format
% gets cusip in char format
vNCUSIP = cell2mat(data.cusip); % creates quarter-ncusip identifer
IDnum = strcat(vQuarter,vNCUSIP); IDnum = cellstr(IDnum); % finds unique NCUSIPS-quarter
[intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
% gets variables of intersst
dataOwnership = cellfun(@(x) str2double(x),data.shares); dataTotalShares = data.shrout2; for ii = 1:numel(intermKeys) totalOwnership = sum(dataOwnership(idx==ii)); totalShares = mode(dataTotalShares(idx==ii)); totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ]; add(intermKVStore, intermKeys{ii}, totalOwnershipInfo); end end
and
function gvkeyReduceFun2(intermKey, intermValIter, outKVStore) databasereducedFinal = array2table([]); while hasnext(intermValIter) databasereducedFinal = [databasereducedFinal; getnext(intermValIter)]; end add(outKVStore, 'output', databasereducedFinal); end
I then run
output = readall(outds);c = vertcat(output{:, 2});tableBig = vertcat(c{:});
to try and get the table because "output" looks like this:
I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.
thank you.
Best Answer