MATLAB: Finding groups based on matching multiple values in a column

findgroupsgroupingismembersortunique

I'd like to find groups in my table based on the values of multiple columnns- but I'd like to have the groups allow multiple specific values in one of the columns.
For the "I'd like to find groups in my table based on the values of multiple columnns" part, I can do that:
>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
>> [G,~,idx]=unique(T(:,[1 2]),'stable') %want 'stable' option so not using findgroups
G =
5×2 table
Store Country
______________ __________
{'Home Depot'} {'USA' }
{'Lowes' } {'China' }
{'Home Depot'} {'Canada'}
{'Menards' } {'France'}
{'Menards' } {'USA' }
idx =
1
2
3
4
5
3
But what I'd like to do is introduce an 'or' grouping rule to designate the region that Country is in. In this case, I want to group the Country by region as well. I can also do that… though not sure my converting to cell method is the best way, but it works.
>> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
1
3
5
6
But ultimately, I want to do this at the same time- show groups that match both the Store as well as the Region. I am not sure the best way to go about this- my actual table is very large. I thought of creating a new varibale for Region and then match off that- is that the way to go or can I combine these into one sort? I was also considering looping methods, but not sure the most efficient way to proceed. My desired result would be:
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
idx =
1
2
1
3
4
1
Thanks for any insight.

Best Answer

I recommend adding the region column to your tables. I doubt it will have an impact on computation time relative to using a separate column and it will keep the data tidy.
I prefer to work with catgegories over string when working with categorical data.
T=table(categorical({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Menards'}),...
categorical({'ISS';'Mexico';'Canada';'France';'USA';'Germany'}),...
categorical({'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'}),...
[123;456;789;1010;1112;1314],...
'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'}) % modified example
T = 6x4 table
Store Country Manager Revenue __________ _______ ________ _______ Home Depot ISS Smith 123 Lowes Mexico Johnson 456 Home Depot Canada Jones 789 Menards France Miller 1010 Menards USA Williams 1112 Menards Germany Brown 1314
Add region column for regions (assuming you have a key already)
% Column 1 is region name
% Column 2 is a list of countries in the region
regions = {
categorical({'North America'}), categorical({'Canada', 'USA', 'Mexico'});
categorical({'Europe'}), categorical({'France', 'Germany', 'Bulgaria'});
};
T.Region = repmat(categorical({'NA'}), height(T),1); % Default is NA
for i = 1:size(regions,1);
T.Region(ismember(T.Country, regions{i,2})) = regions{i,1};
end
disp(T)
Store Country Manager Revenue Region __________ _______ ________ _______ _____________ Home Depot ISS Smith 123 NA Lowes Mexico Johnson 456 North America Home Depot Canada Jones 789 North America Menards France Miller 1010 Europe Menards USA Williams 1112 North America Menards Germany Brown 1314 Europe
Summary of Stores by region using
Tsummary = groupsummary(T,["Store","Region"],["sum","mean"], "Revenue")
Tsummary = 5x5 table
Store Region GroupCount sum_Revenue mean_Revenue __________ _____________ __________ ___________ ____________ Home Depot NA 1 123 123 Home Depot North America 1 789 789 Lowes North America 1 456 456 Menards North America 1 1112 1112 Menards Europe 2 2324 1162
Time test
I replicated the demo table many times to create a single table with 184,320 rows and times this solution 1000 times using tic/toc (including the construction of the regions array but not including the disp()). The median of the distribution of durations was 0.07 seconds.