MATLAB: Finding groups based on matching multiple values in a column

findgroupsgroupingismembersortunique

I'd like to find groups in my table based on the values of multiple columnns- but I'd like to have the groups allow multiple specific values in one of the columns.

For the "I'd like to find groups in my table based on the values of multiple columnns" part, I can do that:

>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
  6×4 table
        Store          Country        Manager       Revenue
    ______________    __________    ____________    _______
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123  
    {'Lowes'     }    {'China' }    {'Johnson' }      456  
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789  
    {'Menards'   }    {'France'}    {'Miller'  }     1010  
    {'Menards'   }    {'USA'   }    {'Williams'}     1112  
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314  
    
>> [G,~,idx]=unique(T(:,[1 2]),'stable') %want 'stable' option so not using findgroups
G =
  5×2 table
        Store          Country  
    ______________    __________
    {'Home Depot'}    {'USA'   }
    {'Lowes'     }    {'China' }
    {'Home Depot'}    {'Canada'}
    {'Menards'   }    {'France'}
    {'Menards'   }    {'USA'   }
idx =
     1
     2
     3
     4
     5
     3

But what I'd like to do is introduce an 'or' grouping rule to designate the region that Country is in. In this case, I want to group the Country by region as well. I can also do that… though not sure my converting to cell method is the best way, but it works.

     >> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
     1
     3
     5
     6

But ultimately, I want to do this at the same time- show groups that match both the Store as well as the Region. I am not sure the best way to go about this- my actual table is very large. I thought of creating a new varibale for Region and then match off that- is that the way to go or can I combine these into one sort? I was also considering looping methods, but not sure the most efficient way to proceed. My desired result would be:

  T =
  6×4 table
        Store          Country        Manager       Revenue
    ______________    __________    ____________    _______
    {'Home Depot'}    {'USA'   }    {'Smith'   }      123  
    {'Lowes'     }    {'China' }    {'Johnson' }      456  
    {'Home Depot'}    {'Canada'}    {'Jones'   }      789  
    {'Menards'   }    {'France'}    {'Miller'  }     1010  
    {'Menards'   }    {'USA'   }    {'Williams'}     1112  
    {'Home Depot'}    {'Canada'}    {'Brown'   }     1314     
    
    idx =
     1
     2
     1
     3
     4
     1

Thanks for any insight.

Best Answer

I recommend adding the region column to your tables. I doubt it will have an impact on computation time relative to using a separate column and it will keep the data tidy.

I prefer to work with catgegories over string when working with categorical data.

T=table(categorical({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Menards'}),...
    categorical({'ISS';'Mexico';'Canada';'France';'USA';'Germany'}),...
    categorical({'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'}),...
    [123;456;789;1010;1112;1314],...
    'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'})  % modified example
T = 6x4 table
      Store       Country    Manager     Revenue
    __________    _______    ________    _______

    Home Depot    ISS        Smith         123  
    Lowes         Mexico     Johnson       456  
    Home Depot    Canada     Jones         789  
    Menards       France     Miller       1010  
    Menards       USA        Williams     1112  
    Menards       Germany    Brown        1314

Add region column for regions (assuming you have a key already)

% Column 1 is region name
% Column 2 is a list of countries in the region
regions = {
    categorical({'North America'}), categorical({'Canada', 'USA', 'Mexico'});
    categorical({'Europe'}),        categorical({'France', 'Germany', 'Bulgaria'});
    };
T.Region = repmat(categorical({'NA'}), height(T),1); % Default is NA
for i = 1:size(regions,1); 
    T.Region(ismember(T.Country, regions{i,2})) = regions{i,1};
end
disp(T)
      Store       Country    Manager     Revenue       Region    
    __________    _______    ________    _______    _____________

    Home Depot    ISS        Smith         123      NA           
    Lowes         Mexico     Johnson       456      North America
    Home Depot    Canada     Jones         789      North America
    Menards       France     Miller       1010      Europe       
    Menards       USA        Williams     1112      North America
    Menards       Germany    Brown        1314      Europe

Summary of Stores by region using

G = groupsummary(T,groupvars,method,datavars)

Tsummary = groupsummary(T,["Store","Region"],["sum","mean"], "Revenue")
Tsummary = 5x5 table
      Store          Region        GroupCount    sum_Revenue    mean_Revenue
    __________    _____________    __________    ___________    ____________

    Home Depot    NA                   1             123             123    
    Home Depot    North America        1             789             789    
    Lowes         North America        1             456             456    
    Menards       North America        1            1112            1112    
    Menards       Europe               2            2324            1162

Time test

I replicated the demo table many times to create a single table with 184,320 rows and times this solution 1000 times using tic/toc (including the construction of the regions array but not including the disp()). The median of the distribution of durations was 0.07 seconds.

Best Answer

Related Solutions

MATLAB: Find matched string in table

MATLAB: Get value from table/matrix based on common attribute

Related Question