How to eliminate redundant columns in matrix of nominal data

Tags: categorical-data, linear-algebra

I have data like this:

   c1  c2  c3  c4  c5
r1  a   a   b   c   d
r2 NA   a   b   c   a
r3 NA   b   b   b   c
r4  a   b   c   b   c

Here, each row is a data record and each column is a feature. I have several thousand such matrices; their sizes range from 10 to 100 rows, and most have around 50 columns. In this data, NA entries must be treated as missing: an NA value cannot be counted as differing from (or matching) any other value, and it is not possible to simply drop the rows or columns where NAs occur.

My goal is to figure out, for each matrix, how many columns are really needed to distinguish every row from every other row, and which set(s) of columns achieve that. For example, in the matrix given above, columns c3 and c5 together are enough to discriminate all four rows.
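To make the NA convention concrete, here is roughly what such a pairwise check could look like (the DataFrame construction and the helper name `distinguishes` are just for illustration): two rows count as distinguished by a set of columns only if they differ in at least one column where both values are observed.

from itertools import combinations

import pandas as pd

# the example matrix, with missing entries stored as pd.NA
df = pd.DataFrame(
    {'c1': ['a', pd.NA, pd.NA, 'a'],
     'c2': ['a', 'a', 'b', 'b'],
     'c3': ['b', 'b', 'b', 'c'],
     'c4': ['c', 'c', 'b', 'b'],
     'c5': ['d', 'a', 'c', 'c']},
    index=['r1', 'r2', 'r3', 'r4'], dtype='string')

def distinguishes(df, cols):
    """True if every pair of rows differs in at least one observed value of cols."""
    sub = df[list(cols)]
    for i, j in combinations(df.index, 2):
        diff = sub.loc[i] != sub.loc[j]   # pd.NA wherever either value is missing
        if not diff.fillna(False).any():  # no usable (non-NA) difference found
            return False
    return True

print(distinguishes(df, ['c3', 'c5']))  # True
print(distinguishes(df, ['c1', 'c2']))  # False: r1 and r2 cannot be told apart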

To me this seems like a two-step problem. The first question ("how many columns are needed to distinguish all rows from each other?") seems akin to finding matrix rank, but I'm not sure how to do that with nominal data. I'm considering recoding each value as a power of two (with NA as 0), then doing something like Gaussian elimination, but where the coefficient of each row can only be 0 or 1. I haven't yet convinced myself that this will work, so before going down that road I wanted to ask whether anyone knows a better approach, or can spot an obvious pitfall that will make mine fail.
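For what it's worth, the recoding step I have in mind would look something like this (just a sketch of the encoding, reusing the df built above; whether an elimination step over such codes gives a meaningful notion of rank is exactly what I'm unsure about):

def recode_powers_of_two(df):
    """Map each distinct value within a column to a power of two; NA becomes 0."""
    recoded = {}
    for col in df.columns:
        levels = df[col].dropna().unique()
        mapping = {val: 2 ** k for k, val in enumerate(levels)}
        recoded[col] = df[col].map(mapping).fillna(0).astype(int)
    return pd.DataFrame(recoded, index=df.index)

print(recode_powers_of_two(df))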

For the second question ("which set(s) of columns suffice to keep all rows distinct?"), I only care about finding the smallest such solutions, not all solutions. So for the example above, the solution "c3,c5" is all I want; I don't care about solutions like "c3,c4,c5" that include extra, redundant information. I do, however, expect some cases where there are multiple "equally minimal" solutions. For this part of the problem I've already implemented a brute-force-ish approach and I'm content to use it, as long as I know in advance how many columns are needed. But I am of course also open to suggestions on how to do that part more efficiently.
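The brute force is essentially "try subsets of increasing size until one works", something along these lines (again just illustrative, using the `distinguishes` helper and df from above):

from itertools import combinations

def minimal_column_sets(df):
    """Smallest subset size that separates all rows, and every subset of that size."""
    for size in range(1, len(df.columns) + 1):
        hits = [cols for cols in combinations(df.columns, size)
                if distinguishes(df, cols)]
        if hits:
            return size, hits
    return None, []  # some pair of rows cannot be told apart at all

print(minimal_column_sets(df))  # (2, [('c3', 'c5')]) for the example matrix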

Best Answer

I ended up solving this with 0/1 integer linear programming (via pulp). Here's an example script, with an extra column added to the sample data from the question (to illustrate the step of finding multiple solutions):

from io import StringIO
from itertools import combinations

import pandas as pd
import pulp

data = """\
   c1  c2  c3  c4  c5  c6
r1  a   a   b   c   d   a
r2 NA   a   b   c   a   b
r3 NA   b   b   b   c   a
r4  a   b   c   b   c   b
"""

df = pd.read_table(StringIO(data), sep=' ', header=0, skipinitialspace=True,
                   dtype=pd.StringDtype())

model = pulp.LpProblem('mwe')
feat_vars = [pulp.LpVariable(feat, cat=pulp.LpBinary)
             for feat in df.columns]
row_pairs = combinations(df.index, 2)
for row_pair in row_pairs:
    # here are the elementwise distances between the rows...
    rows = df.loc[list(row_pair)]
    dist = rows.iloc[0] != rows.iloc[1]
    # ...which become the coefficients for the feat_vars in each constraint
    if pd.notna(dist).any():
        model += sum(dist[feat_var.name] * feat_var
                     for feat_var in feat_vars
                     if pd.notna(dist[feat_var.name])) >= 1
# set objective & solve; objective is "minimize" by default, so will find
# a solution that uses the fewest feature variables
model.setObjective(pulp.lpSum(feat_vars))
model.solve(pulp.GLPK_CMD(msg=False))
converged = model.status == pulp.LpStatusOptimal

if converged:
    # var.value() will be 1 for retained features, 0 for dropped features
    retained_feat_vars = list(
        filter(lambda var: var.value(), model.variables()))
    retained_feat_names = [f.name for f in retained_feat_vars]

    # return a one-row dataframe per solution so the aggregation is clean
    _df = pd.DataFrame(index=[0])
    _df['n_features_needed'] = len(retained_feat_vars)
    _df['features'] = ','.join(retained_feat_names)

    # find other "equally minimal" solutions. make sure they're the same length
    # as the first solution (shouldn't be necessary, but may speed things up)
    model += pulp.lpSum(feat_vars) == len(retained_feat_names)
    while converged:
        # exclude the solution just found: at least one of its features must be
        # dropped (earlier exclusion constraints stay in the model, so all
        # previously found solutions remain ruled out)
        model += pulp.lpSum(retained_feat_vars) <= len(retained_feat_vars) - 1
        # re-solve
        model.solve(pulp.GLPK_CMD(msg=False))
        converged = model.status == pulp.LpStatusOptimal
        # add new solutions to the results dataframe
        if converged:
            retained_feat_vars = list(
                filter(lambda var: var.value(), model.variables()))
            retained_feat_names = [f.name for f in retained_feat_vars]
            _row = pd.DataFrame(index=[0])
            _row['n_features_needed'] = len(retained_feat_vars)
            _row['features'] = ','.join(retained_feat_names)
            _df = pd.concat([_df, _row], axis='index', ignore_index=True)

The resulting _df looks like this:

   n_features_needed features
0                  2    c3,c5
1                  2    c2,c6
2                  2    c4,c6
3                  2    c5,c6
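
As a sanity check (not part of the solver workflow above; the helper name is just illustrative), you can confirm that each reported feature set really separates every pair of rows under the same NA convention used in the constraints:

from itertools import combinations

def separates_all_rows(df, cols):
    """True if every pair of rows differs in at least one observed value of cols."""
    for i, j in combinations(df.index, 2):
        diff = df.loc[i, list(cols)] != df.loc[j, list(cols)]
        if not diff.fillna(False).any():
            return False
    return True

# uses `df` and `_df` from the script above
assert all(separates_all_rows(df, feats.split(','))
           for feats in _df['features'])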