MATLAB: How to use map and reduce efficiently

I have the following table, which has over 40 million rows and 5 columns:

The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID – some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.

I wish to do 2 things:

1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.

2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.

3) divide A by B.

This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.

ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);

where the functions are defined as

function gvkeyMapFun2(data, ~, intermKVStore)
 
 
    % gets quarter variables
    
    vQuarter = num2str(data.rdate); % char format
    
    % gets cusip in char format
 
    vNCUSIP = cell2mat(data.cusip);
       
    % creates quarter-ncusip identifer
 
     IDnum = strcat(vQuarter,vNCUSIP);
      
     IDnum = cellstr(IDnum);
            
    % finds unique NCUSIPS-quarter
        
    [intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
        
    % gets variables of intersst
   
     dataOwnership = cellfun(@(x) str2double(x),data.shares);
 
     dataTotalShares =  data.shrout2;
 
     for ii = 1:numel(intermKeys)
        
         totalOwnership  = sum(dataOwnership(idx==ii));
               
         totalShares  = mode(dataTotalShares(idx==ii));
                
         totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1)   ];
        
         add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
         
     end
 
end

and

function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
 
    databasereducedFinal = array2table([]);
 
    while hasnext(intermValIter)
        databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
    end
  
    add(outKVStore, 'output', databasereducedFinal);
 
end

I then run

output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});

to try and get the table because "output" looks like this:

I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.

thank you.

Best Answer

Assuming your shares variable is numeric and assuming your grouping variables are {'rdate', 'cusip'},

t = tall(ds);
[group, rdate, cusip] = findgroups(t.rdate, t.cusip);
shareratio  = splitapply(@(shares, shrout2) sum(shares) / mode(shrout2), t.shares, t.shrout2, group);
result = gather(table(rdate, cusip, shareratio));

I was getting an error message

What was the error message. I would suspect that the a posteriori str2double conversion would really slow things down and really it shouldn't be necessary.

Best Answer

Related Solutions

MATLAB: How to save a tall array / table to a text or csv file

MATLAB: Accumarray with tall arrays

Related Question