How to eliminate redundant columns in matrix of nominal data

Tags: categorical-data, linear-algebra

I have data like this:

   c1  c2  c3  c4  c5
r1  a   a   b   c   d
r2 NA   a   b   c   a
r3 NA   b   b   b   c
r4  a   b   c   b   c

Here, each row is a data record and each column is a feature. I have several thousand such matrices; their sizes range from 10 to 100 rows, and most have around 50 columns. In this data, NA entries must be treated as missing: an NA value cannot be counted as differing from (or matching) any other value, and it is not possible to simply drop the rows or columns where NAs occur.

My goal is to figure out, for each matrix, how many columns are really needed to distinguish every row from every other row, and which set(s) of columns achieve that. For example, in the matrix given above, columns c3 and c5 together are enough to discriminate all four rows.
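To make the NA convention concrete, here is roughly what such a pairwise check could look like (the DataFrame construction and the helper name `distinguishes` are just for illustration): two rows count as distinguished by a set of columns only if they differ in at least one column where both values are observed.

from itertools import combinations

import pandas as pd

# the example matrix, with missing entries stored as pd.NA
df = pd.DataFrame(
    {'c1': ['a', pd.NA, pd.NA, 'a'],
     'c2': ['a', 'a', 'b', 'b'],
     'c3': ['b', 'b', 'b', 'c'],
     'c4': ['c', 'c', 'b', 'b'],
     'c5': ['d', 'a', 'c', 'c']},
    index=['r1', 'r2', 'r3', 'r4'], dtype='string')

def distinguishes(df, cols):
    """True if every pair of rows differs in at least one observed value of cols."""
    sub = df[list(cols)]
    for i, j in combinations(df.index, 2):
        diff = sub.loc[i] != sub.loc[j]   # pd.NA wherever either value is missing
        if not diff.fillna(False).any():  # no usable (non-NA) difference found
            return False
    return True

print(distinguishes(df, ['c3', 'c5']))  # True
print(distinguishes(df, ['c1', 'c2']))  # False: r1 and r2 cannot be told apart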

To me this seems like a two-step problem. The first question ("how many columns are needed to distinguish all rows from each other?") seems akin to finding matrix rank, but I'm not sure how to do that with nominal data. I'm considering recoding each value as a power of two (with NA as 0), then doing something like Gaussian elimination, but where the coefficient of each row can only be 0 or 1. I haven't yet convinced myself that this will work, so before going down that road I wanted to ask whether anyone knows a better approach, or can spot an obvious pitfall that will make mine fail.
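For what it's worth, the recoding step I have in mind would look something like this (just a sketch of the encoding, reusing the df built above; whether an elimination step over such codes gives a meaningful notion of rank is exactly what I'm unsure about):

def recode_powers_of_two(df):
    """Map each distinct value within a column to a power of two; NA becomes 0."""
    recoded = {}
    for col in df.columns:
        levels = df[col].dropna().unique()
        mapping = {val: 2 ** k for k, val in enumerate(levels)}
        recoded[col] = df[col].map(mapping).fillna(0).astype(int)
    return pd.DataFrame(recoded, index=df.index)

print(recode_powers_of_two(df))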

For the second question ("which set(s) of columns suffice to keep all rows distinct?"), I only care about finding the smallest such solutions, not all solutions. So for the example above, the solution "c3,c5" is all I want; I don't care about solutions like "c3,c4,c5" that include extra, redundant information. I do, however, expect some cases where there are multiple "equally minimal" solutions. For this part of the problem I've already implemented a brute-force-ish approach and I'm content to use it, as long as I know in advance how many columns are needed. But I am of course also open to suggestions on how to do that part more efficiently.
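The brute force is essentially "try subsets of increasing size until one works", something along these lines (again just illustrative, using the `distinguishes` helper and df from above):

from itertools import combinations

def minimal_column_sets(df):
    """Smallest subset size that separates all rows, and every subset of that size."""
    for size in range(1, len(df.columns) + 1):
        hits = [cols for cols in combinations(df.columns, size)
                if distinguishes(df, cols)]
        if hits:
            return size, hits
    return None, []  # some pair of rows cannot be told apart at all

print(minimal_column_sets(df))  # (2, [('c3', 'c5')]) for the example matrix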

Best Answer

I ended up solving this with 0/1 integer linear programming (via pulp). Here's an example script, with an extra column added to the sample data from the question (to illustrate the step of finding multiple solutions):

from io import StringIO
from itertools import combinations

import pandas as pd
import pulp

data = """\
   c1  c2  c3  c4  c5  c6
r1  a   a   b   c   d   a
r2 NA   a   b   c   a   b
r3 NA   b   b   b   c   a
r4  a   b   c   b   c   b
"""

df = pd.read_table(StringIO(data), sep=' ', header=0, skipinitialspace=True,
                   dtype=pd.StringDtype())

model = pulp.LpProblem('mwe')
feat_vars = [pulp.LpVariable(feat, cat=pulp.LpBinary)
             for feat in df.columns]
row_pairs = combinations(df.index, 2)
for row_pair in row_pairs:
    # here are the elementwise distances between the rows...
    rows = df.loc[list(row_pair)]
    dist = rows.iloc[0] != rows.iloc[1]
    # ...which become the coefficients for the feat_vars in each constraint
    if pd.notna(dist).any():
        model += sum(dist[feat_var.name] * feat_var
                     for feat_var in feat_vars
                     if pd.notna(dist[feat_var.name])) >= 1
# set objective & solve; objective is "minimize" by default, so will find
# a solution that uses the fewest feature variables
model.setObjective(pulp.lpSum(feat_vars))
model.solve(pulp.GLPK_CMD(msg=False))
converged = model.status == pulp.LpStatusOptimal

if converged:
    # var.value() will be 1 for retained features, 0 for dropped features
    retained_feat_vars = list(
        filter(lambda var: var.value(), model.variables()))
    retained_feat_names = [f.name for f in retained_feat_vars]

    # return a one-row dataframe per solution so the aggregation is clean
    _df = pd.DataFrame(index=[0])
    _df['n_features_needed'] = len(retained_feat_vars)
    _df['features'] = ','.join(retained_feat_names)

    # find other "equally minimal" solutions. make sure they're the same length
    # as the first solution (shouldn't be necessary, but may speed things up)
    model += pulp.lpSum(feat_vars) == len(retained_feat_names)
    while converged:
        # exclude the solution just found: at least one of its features must be
        # dropped (earlier exclusion constraints stay in the model, so all
        # previously found solutions remain ruled out)
        model += pulp.lpSum(retained_feat_vars) <= len(retained_feat_vars) - 1
        # re-solve
        model.solve(pulp.GLPK_CMD(msg=False))
        converged = model.status == pulp.LpStatusOptimal
        # add new solutions to the results dataframe
        if converged:
            retained_feat_vars = list(
                filter(lambda var: var.value(), model.variables()))
            retained_feat_names = [f.name for f in retained_feat_vars]
            _row = pd.DataFrame(index=[0])
            _row['n_features_needed'] = len(retained_feat_vars)
            _row['features'] = ','.join(retained_feat_names)
            _df = pd.concat([_df, _row], axis='index', ignore_index=True)

The resulting _df looks like this:

   n_features_needed features
0                  2    c3,c5
1                  2    c2,c6
2                  2    c4,c6
3                  2    c5,c6
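
As a sanity check (not part of the solver workflow above; the helper name is just illustrative), you can confirm that each reported feature set really separates every pair of rows under the same NA convention used in the constraints:

from itertools import combinations

def separates_all_rows(df, cols):
    """True if every pair of rows differs in at least one observed value of cols."""
    for i, j in combinations(df.index, 2):
        diff = df.loc[i, list(cols)] != df.loc[j, list(cols)]
        if not diff.fillna(False).any():
            return False
    return True

# uses `df` and `_df` from the script above
assert all(separates_all_rows(df, feats.split(','))
           for feats in _df['features'])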