Solved – t-SNE with mixed continuous and binary variables

dimensionality reduction, mixed type data, python, tsne, unsupervised learning

I am currently investigating the visualisation of high-dimensional data using t-SNE. I have some data with mixed binary and continuous variables, and the embedding appears to cluster on the binary variables much too readily. Of course this is expected for data scaled between 0 and 1: the Euclidean distance will always be greatest/smallest between binary variables. How should one deal with mixed binary/continuous datasets using t-SNE? Should we drop the binary columns? Is there a different metric we can use?

As an example, consider this Python code:

import numpy as np
import matplotlib.pyplot as plt

# two continuous features and one binary feature
x1 = np.random.rand(200)
x2 = np.random.rand(200)
x3 = np.r_[np.ones(100), np.zeros(100)]

X = np.c_[x1, x2, x3]

# plot of the original data
plt.scatter(x1, x2, c=x3)
# … format graph

So my raw data is:

[Figure raw_data: scatter of x1 against x2, coloured by x3]

where the colour is the value of the third feature (x3); in 3D the data points lie in two planes (the x3=0 plane and the x3=1 plane).

I then perform t-SNE:

from sklearn.manifold import TSNE  # scikit-learn implementation
from sklearn.preprocessing import StandardScaler

X_transformed = StandardScaler().fit_transform(X)
tsne = TSNE(n_components=2, perplexity=5)
X_embedded = tsne.fit_transform(X_transformed)

with the resulting plot:

[Figure tsne_data: t-SNE embedding with two well-separated clusters coloured by x3]

and the data has of course clustered by x3. My gut instinct is that, because a distance metric is not well defined for binary features, we should drop them before performing any t-SNE. That would be a shame, as these features may contain useful information for generating the clusters.

Best Answer

Disclaimer: I only have tangential knowledge of the topic, but since no one else answered, I will give it a try.

Distance is important

Any dimensionality reduction technique based on distances (t-SNE, UMAP, MDS, PCoA and possibly others) is only as good as the distance metric you use. As @amoeba correctly points out, there cannot be a one-size-fits-all solution: you need a distance metric that captures what you deem important in the data, i.e. one where rows you would consider similar have a small distance and rows you would consider different have a large distance.
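
To see concretely why plain Euclidean distance on the standardized data behaves badly here, consider this small sketch re-creating the question's data (the seed and variable names are my own): after standardization, a 50/50 binary column puts a constant gap of 2 standard units between the two groups, which dominates the typical per-feature difference of the continuous columns.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.c_[rng.random(200), rng.random(200),
          np.r_[np.ones(100), np.zeros(100)]]
Xs = StandardScaler().fit_transform(X)

# gap contributed by the binary feature alone, for any cross-group pair:
# ones map to +1, zeros to -1, so the gap is always 2.0 standard units
print(abs(Xs[0, 2] - Xs[-1, 2]))                    # 2.0

# typical per-feature difference of a standardized uniform feature
print(np.mean(np.abs(Xs[:100, 0] - Xs[100:, 0])))   # ≈ 1.15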

How do you choose a good distance metric? First, let me take a short detour:

Ordination

Well before the glory days of modern machine learning, community ecologists (and quite likely others) were trying to make nice plots for exploratory analysis of multidimensional data. They call the process ordination, and it is a useful keyword to search for in the ecology literature, going back at least to the 70s and still going strong today.

The important thing is that ecologists have very diverse datasets and deal with mixtures of binary, integer and real-valued features (e.g. presence/absence of species, number of observed specimens, pH, temperature). They've spent a lot of time thinking about distances and transformations to make ordinations work well. I do not understand the field very well, but, for example, the review by Legendre and De Cáceres, Beta diversity as the variance of community data: dissimilarity coefficients and partitioning, shows an overwhelming number of possible distances you might want to check out.

Multidimensional scaling

The go-to tool for ordination is multidimensional scaling (MDS), especially the non-metric variant (NMDS), which I encourage you to try in addition to t-SNE. I don't know of a polished Python equivalent, but the R implementation in the metaMDS function of the vegan package does a lot of tricks for you (e.g. running multiple runs until it finds two that are similar).
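
Since the question uses scikit-learn, here is a minimal NMDS sketch in Python (my assumption, not a drop-in replacement for metaMDS: sklearn's MDS with metric=False is a bare-bones NMDS without the restarts-until-convergence logic). It reuses X_transformed from the question's code, but any dissimilarity matrix can be plugged in, including a custom mixed distance as discussed below.

from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# build a dissimilarity matrix; here simply Euclidean on the scaled data
D = squareform(pdist(X_transformed, metric='euclidean'))

# metric=False gives non-metric MDS; n_init runs several random starts
nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
           n_init=10, random_state=0)
X_nmds = nmds.fit_transform(D)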

This has been disputed, see the comments: the nice part about MDS is that it also projects the features (columns), so you can see which features drive the dimensionality reduction. This helps you to interpret your data.

Keep in mind that t-SNE has been criticized as a tool for deriving understanding; see e.g. this exploration of its pitfalls. I've heard UMAP solves some of the issues, but I have no experience with it. I also don't doubt that part of the reason ecologists use NMDS is culture and inertia; maybe UMAP or t-SNE are actually better. I honestly don't know.

Rolling your own distance

If you understand the structure of your data, the ready-made distances and transformations might not be the best for you, and you might want to build a custom distance metric. While I don't know what your data represent, it might be sensible to compute the distance separately for the real-valued variables (e.g. using Euclidean distance, if that makes sense) and for the binary variables, and then add the two. Common distances for binary data are, for example, the Jaccard distance and the cosine distance. You might need to introduce a multiplicative coefficient between the two parts, as Jaccard and cosine both take values in $[0,1]$ regardless of the number of features, while the magnitude of the Euclidean distance grows with the number of features. A sketch of this idea follows below.
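
Here is a minimal sketch using the simulated X from the question, assuming the continuous columns come first and the binary column last; the weight w is a hypothetical knob you would need to tune for your data.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

X_cont = X[:, :2]   # real-valued features (x1, x2)
X_bin = X[:, 2:]    # binary features (x3)

# Euclidean distance on the continuous part, Jaccard on the binary part
d_cont = pdist(X_cont, metric='euclidean')
d_bin = pdist(X_bin, metric='jaccard')

# w balances the two parts: Jaccard lives in [0, 1] while the Euclidean
# part grows with the number of continuous features, so w needs tuning
w = 1.0
D = squareform(d_cont + w * d_bin)

# feed the precomputed distances to t-SNE
# (init='random' is required when metric='precomputed')
tsne = TSNE(n_components=2, perplexity=5, metric='precomputed', init='random')
X_embedded = tsne.fit_transform(D)

With w = 0 the binary features are ignored entirely (the "drop the columns" option from the question), so sweeping w lets you control how much the binary information is allowed to shape the embedding.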

A word of caution

Keep in mind at all times that, with so many knobs to tune, you can easily fall into the trap of tuning until you see what you wanted to see. This is difficult to avoid completely in exploratory analysis, but you should be cautious.
