Solved – Understand important features in UMAP

clustering · dimensionality reduction · optimization · tsne

I am using a dimensionality reduction algorithm (UMAP) to cluster high-dimensional data.

Specifically, I have ~50000 vectors of dimension ~20000 to visualise. These vectors are highly structured: they lie on low-dimensional manifolds that I do not know in advance. For this reason, UMAP clusters them perfectly: I can easily see the clusters, and they match exactly the structure I was expecting.

I know that, among the ~20000 entries of each vector, only a few actually play a role in the final dimensionality reduction; I just do not know which ones. In other words, most of the features carry little information, and I would like to identify those uninformative features and discard them.

Is there a way to understand which entries are important in the final prediction?

Best Answer

One possibility, which may seem a bit backwards but would probably work, is to fit some model that predicts the UMAP embedding from the original features, and then look at which features play a role in that prediction.

Since the embedding is potentially a highly non-linear transformation, I would consider a neural network (either with some normalization beforehand, e.g. rankgauss, or with batch normalization as the first layer, followed by a few dense layers), and then inspect which of the inputs get meaningful weight for any of the examples.
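A minimal sketch of this surrogate-model idea, under assumptions: the data and the "embedding" below are synthetic stand-ins (a real workflow would use the actual UMAP output as the regression target), scikit-learn's `MLPRegressor` stands in for a hand-built network, `StandardScaler` stands in for rankgauss/batch normalization, and permutation importance stands in for inspecting activations — permuting an input that the network relies on degrades its fit, which gives a per-feature relevance score.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))  # small stand-in for the ~50000 x ~20000 data

# Fake 2-D "UMAP" target that depends only on features 0 and 1,
# mimicking the situation where few entries drive the embedding.
embedding = np.c_[X[:, 0] + 0.5 * X[:, 1], X[:, 1] - X[:, 0]]

# Normalize inputs, then fit a small dense network to predict the embedding.
Xs = StandardScaler().fit_transform(X)
net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
net.fit(Xs, embedding)

# Permutation importance: shuffling an informative feature hurts R^2 the most.
imp = permutation_importance(net, Xs, embedding, n_repeats=5, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:2]  # indices of the key features
```

The ranking in `imp.importances_mean` should then point back at the original feature indices that the embedding actually depends on.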

Other prediction tools that can handle non-linear relationships, such as XGBoost or LightGBM, may be even better options: they do not need input normalization and allow easy interrogation using SHAP values.