MATLAB: Table with a few identical column names

identicaltable

Hello Matlab Community,

I got a not-so-regular data table (I converted it to .csv) containing identically named columns, and I have to calculate mean values both horizontally and vertically.

The schame is the following: The first cells of each row is a string, the others are doubles.

a b b b c c [...]
x  
x  
y  
z  
z  
z  
[...]

The output table should be:

a b c [...]  
x   
y  
z  
[...]

Where the values are the mean values, like the x row is the means of

-all the x rows (like the 'grup by' command in sql) and

-all the b, c, … values

I know that in standard database management I would have to re-structure and normalize the table, but is there any solution in Matlab to solve this?

All I could try is to group the rows:

function [ C ] = CsvAvg( in_csv )
T = readtable(in_csv);
C = table;
C(1,1)=T(1,1);
for i=2:height(T)-1 
           if strcmp(strjoin(table2cell(T(i,1))),strjoin(table2cell(T(i-1,1))))==0
            C(i,1)=T(i,1);
            for j=2:width(T)
               % 
            end
           else C(i,1)=cell2table(cellstr(' '));
           end
end
end

I noticed that Matlab re-name the identical column names (which is correct in most cases).

Thank you in advance and Im sorry for my lack of English language skills. Im a beginner in Matlab programming too.

Best Answer

Grouping the rows is trivially achieved with varfun in just one line:

newT = varfun(@mean, T, 'GroupingVariables', 1)

The tricky bit is indeed the grouping of the columns as it's not something that's really supported by tables. Probably the easiest way is to use a custom function with rowfun. This can do the row grouping at the same time. Something like:

function varargout = groupingfunction(columngroups, columns)
  %this function to be used with rowfun with the option 'SeparateInputs', false
  %columngroup: row vector of integers from 1:n indicating how to group the column together (all 1 columns together, all 2 columns together, etc.)
  %columns contains the content of the columns of the table
  columngroups = repmat(columngroups, size(columns, 1), 1);  %replicate grouping for each row
  varargout = num2cell(accumarray(columngroups(:), columns(:), [], @mean));
end

You can then use that with rowfun:

columgroups = [1 1 1 2 2 3 3 3 4 4 4 4 5 6 6 6];  %generate that however you want. Indicates how to group the columns together, ignoring the first column
columnames = {'a', 'b', 'c', 'd', 'e', 'f'};  %as many as there are unique values in columngroups
newT = rowfun(@(cols) groupingfunction(columngroups, cols), T, 'GroupingVariables', 1,'SeparateInputs', false, 'NumOutputs', numel(unique(columngroups)), 'OutputVariableNames', columnnames);

A possible way of generating columngroups and columnnames, assuming they're all named prefix_number

prefixes = regexp(T.Properties.VariableNames(2:end), '.*?(?=(_\d+)?$)', 'match', 'once');
[columnnames, ~, columngroups] = unique(prefixes);

edit, now that you've posted some demo data: The above works without issue on your demo data, as long as you transpose the columngroups vector returned by unique:

T = readtable('test.csv');
prefixes = regexp(T.Properties.VariableNames(2:end), '.*?(?=(_\d+)?$)', 'match', 'once');
[columnnames, ~, columngroups] = unique(prefixes);
newT = rowfun(@(cols) groupingfunction(columngroups.', cols), T, 'GroupingVariables', 1,'SeparateInputs', false, 'NumOutputs', numel(unique(columngroups)), 'OutputVariableNames', columnnames);

Related Solutions

MATLAB: Is there an efficient way to sort specific ranges/subsections of rows independently in a table by a particular column

splitapply() provided that you do not mind that disconnected regions with the same identifier code will be brought together

MATLAB: Save in a string the values obtained after varfun and grouping

If you want to create a new table with two new variables, one for the number of unique countries and one for the list of unique countries then a) you have to use rowfun instead of varfun, and b) you can't use an anonymous function (since you need a function with 2 outputs) but a normal m function:

function [count, list] = uniquecountries(countries)
   %inputs: 
   %    countries: a list of countries (categorical)
   %    count: the number of unique countries (double)
   %    list: the list of unique countries, joined by a ', ' (scalar cell array of a char vector)
   %Note: list is returned as a scalar cell array instead of a char vector so that rowfun can concatenate the output of uniquecountries
   validateattributes(countries, {'categorical'}, {});
   list = unique(countries);
   count = numel(list);
   list = {strjoin(cellstr(list), ', ')};
end

Then:

sections = rowfun(@uniquecountries, fullTable, ...
                  'GroupingVariables', {'ID1','Date'}, ...
                  'InputVariables', 'Country', ...
                  'NumOutputs', 2, ...
                  'OutputVariableNames', {'Country_count', 'Country_list'})

Alternatively, if you don't want a separate m function, you can have two calls to rowfun or varfun to create two tables that you then join. It will be much slower however:

sectionscount = rowfun(@(v) numel(unique(v)), fullTable, ...
                       'GroupingVariables', {'ID1','Date'}, ...
                       'InputVariables', 'Country', ...
                       'OutputVariableNames', 'Country_count');
sectionslist = rowfun(@(v) {strjoin(cellstr(unique(v)), ', ')}, fullTable, ...
                       'GroupingVariables', {'ID1','Date'}, ...
                       'InputVariables', 'Country', ...
                       'OutputVariableNames', 'Country_list');
sections = join(sectionscount, sectionslist)

Best Answer

Related Solutions

MATLAB: Is there an efficient way to sort specific ranges/subsections of rows independently in a table by a particular column

MATLAB: Save in a string the values obtained after varfun and grouping

Related Question