Solved – Visualizing all high-dimensional categorical data

data visualization

I'm running many combinations of hyperparameters: each choice for one transformation step (e.g. preprocessing, classifier) is combined with every choice for the others in a Cartesian product, and each combination has a fit metric (accuracy). One possible solution would be a table where each column is a classifier and each row is a preprocessor; however, in this case I'm working with more than two hyperparameters.

What's a good, interactive way to display this data to a user (e.g. a person interested in exploring the data)?

Best Answer

Displaying the data in a 3-dimensional plot that you can rotate, zoom and so on is not a terrible option.

You'll need to project the high-dimensional data onto 3 dimensions, of course. Two methods for doing this are t-SNE and PCA.

PCA projects onto the axes that maximize the variance retained in the resulting plot, which is the same as minimizing the residual variance lost during the projection. This is a fairly straightforward projection to understand intuitively. The downside is that the projection is linear, so you'll lose relationships that need a more manifold-like projection to show.
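As a rough sketch of the PCA route in R: the grid below is hypothetical (the hyperparameter names, levels and accuracy values are all made up), and one-hot encoding the categorical levels before PCA is an assumption of this example, not something the answer prescribes.

```r
# Hypothetical hyperparameter grid; names and accuracy values are made up.
set.seed(1)
grid <- expand.grid(
  preprocessor = c("none", "pca", "tfidf"),
  classifier   = c("svm", "rf", "logreg"),
  scaler       = c("standard", "minmax"),
  stringsAsFactors = TRUE
)
grid$accuracy <- runif(nrow(grid), min = 0.6, max = 0.95)  # placeholder metric

# One-hot encode every factor level (assumption: levels have no ordering).
factors <- grid[, c("preprocessor", "classifier", "scaler")]
X <- model.matrix(
  ~ . - 1, data = factors,
  contrasts.arg = lapply(factors, contrasts, contrasts = FALSE)
)

# PCA on the indicator columns; keep the first three components for a 3D view.
pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]

# Share of the total variance kept by the three plotted components.
kept <- sum(pca$sdev[1:3]^2) / sum(pca$sdev^2)

# Static 2D look at the first two components, shaded by accuracy
# (darker = more accurate). A 3D viewer would take all of `scores`.
shade <- gray(0.9 * (1 - (grid$accuracy - min(grid$accuracy)) /
                           diff(range(grid$accuracy))))
plot(scores[, 1:2], col = shade, pch = 19, xlab = "PC1", ylab = "PC2")
```

The `kept` value is worth checking before trusting the picture: if the first three components hold only a small share of the variance, the plot is hiding most of the structure. In a fully crossed grid like this one, several combinations can also land on the same point in the first two components.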

t-SNE is sort of the opposite: it projects onto a potentially very convoluted, complex manifold that doesn't need to have any kind of global coherence. It can represent local structure fairly well and handle high-dimensional manifolds, but it loses any sense of the actual global structure.

As an example, if you have two interlocking rings, t-SNE can show them as two flat, separated, non-interlocking rings. This page shows some very interesting examples: https://distill.pub/2016/misread-tsne/

At a practical level, TensorFlow's TensorBoard includes both a t-SNE projector and viewer and a PCA projector and viewer: https://www.tensorflow.org/programmers_guide/embedding#visualizing_embeddings

Related Solutions

Exploratory Data Analysis – Quick Glance at a Dataset for Insights

@Ondrej and @Michelle have provided some good information here. I wonder if I can contribute by addressing some points not mentioned elsewhere. I wouldn't beat yourself up about not being able to glean much from the data in tabular form; tables are generally not a very good way to present information (cf. Gelman et al., Turning Tables into Graphs). On the other hand, asking for a tool that will automatically generate all of the right graphs to help you explore a new data set is almost like asking for a tool that will do your thinking for you. (Don't take that the wrong way; I recognize your question makes clear that you aren't going that far. I just mean that there will never really be such a tool.) A nice discussion that is related to this can be found here.

These things having been said, I wanted to talk a little about the kinds of plots that you might want to use to explore your data. The plots listed in the question would be a good start, but we might be able to optimize that a little. To start with, making "a large number of plots" correlating pairs of variables might not be ideal. A scatterplot only displays the marginal relationship between two variables. Important relationships can often be hidden in some combination of multiple variables. So the first way to beef up this approach is to make a scatterplot matrix that displays all pairwise scatterplots simultaneously. Scatterplot matrices can be enhanced in various ways: e.g., they can be combined with univariate kernel density plots of each variable's distribution, different markers / colors can be used to plot different groups, and possible nonlinear relationships can be assessed by overlaying a loess fit. The scatterplot.matrix function in the car package in R (named scatterplotMatrix in current versions of car) can do all of these things nicely (an example can be seen halfway down the page linked above).
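For readers without the car package, a similar enhanced scatterplot matrix can be sketched with base R's pairs. This is only an illustration: the built-in iris data stands in for your own, panel_density is a helper written for this example, and base R's panel.smooth draws a lowess curve, a close relative of the loess fit mentioned above.

```r
vars   <- iris[, 1:4]                 # four numeric variables
groups <- as.integer(iris$Species)    # one colour per species

# Diagonal panel: kernel density estimate of each variable.
panel_density <- function(x, ...) {
  d <- density(x)
  usr <- par("usr")
  old <- par(usr = c(usr[1:2], 0, 1.1 * max(d$y)))  # rescale the y axis only
  on.exit(par(old))
  lines(d$x, d$y)
}

pairs(
  vars,
  col         = groups,
  lower.panel = panel.smooth,   # points plus a lowess curve
  diag.panel  = panel_density
)
```

The upper triangle keeps plain grouped scatterplots, so the smoothed and unsmoothed views of each pair sit side by side.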

However, while scatterplot matrices are a good start, they are still only displaying the marginal projections. There are a few ways to try to move beyond this. One is to explore 3-dimensional plots using the rgl package in R. Another approach is to use conditional plots; coplots can help with relationships amongst 3 or 4 variables simultaneously. An especially useful approach is to use a scatterplot matrix interactively (although this will require more effort to learn), e.g. by 'brushing'. Brushing allows you to highlight a point or points in one frame of a matrix, and those points will simultaneously be highlighted in all of the other frames. By moving the brush around, you can see how all of the variables change together. UPDATE: Another possibility that I had forgotten to mention is to use a parallel coordinates plot. This has a disadvantage in not making your response variable distinct, but could be useful, for example, in examining inter-correlations amongst your X variables.
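Two of these options need very little code. The sketch below uses the built-in iris data purely as a stand-in; coplot is in base R's graphics package and parcoord comes from MASS, a recommended package that normally ships with R (the guard is there in case it is not installed).

```r
n_panels <- nlevels(iris$Species)   # the coplot draws one panel per level

# Conditional plot: sepal length against sepal width, one panel per species.
coplot(Sepal.Length ~ Sepal.Width | Species, data = iris)

# Parallel coordinates: one line per flower, one vertical axis per variable,
# coloured by species so that groups can be followed across the axes.
if (requireNamespace("MASS", quietly = TRUE)) {
  MASS::parcoord(iris[, 1:4], col = as.integer(iris$Species))
}
```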

I also want to commend you for examining your data sorted by date collected. Although data are always gathered over time, people don't always do this. Plotting a line graph is nice, but I would suggest you supplement that with graphs of autocorrelations and partial autocorrelations. In R, the functions for these are acf and pacf respectively.
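A minimal sketch of that trio of plots follows. The series is simulated (an AR(1) process with an arbitrarily chosen coefficient of 0.7) and only stands in for a variable sorted by collection date.

```r
set.seed(123)
# Simulated AR(1) series standing in for a variable in collection order.
x <- arima.sim(model = list(ar = 0.7), n = 500)

op <- par(mfrow = c(3, 1))
plot(x, type = "l", main = "Series in time order")
a <- acf(x, main = "Autocorrelation")
p <- pacf(x, main = "Partial autocorrelation")
par(op)

lag1_acf  <- a$acf[2]   # a$acf[1] is lag 0, which is always 1
lag1_pacf <- p$acf[1]   # pacf output starts at lag 1
```

For a series like this one, the autocorrelations die away gradually while the partial autocorrelations drop off after the first lag, which is the kind of pattern a line graph alone does not make obvious.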

I recognize that all of this doesn't quite answer your question in the sense of giving you a tool that will make all the plots for you automatically, but one implication is that you wouldn't actually have to make as many plots as you fear, e.g., a scatterplot matrix is just one line of code. In addition, in R, it should be possible to write a function / some reusable code for yourself that would partly automate some of this (e.g., I can imagine a function that takes in a list of variables and a date-ordering, sorts them, pops up a new window for each with line, acf, and pacf plots).
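One way such a helper might look is sketched below. The function name explore_series, its arguments and the example data frame are all hypothetical, and the sketch assumes the variables are numeric with no missing values.

```r
# Hypothetical helper: sort by a date column, then draw a line plot, ACF and
# PACF for each requested variable (in a new window when run interactively).
explore_series <- function(data, vars, date_col, new_window = interactive()) {
  ordered <- data[order(data[[date_col]]), , drop = FALSE]
  for (v in vars) {
    if (new_window) dev.new()
    old <- par(mfrow = c(3, 1))
    plot(ordered[[date_col]], ordered[[v]], type = "l",
         xlab = date_col, ylab = v, main = v)
    acf(ordered[[v]], main = paste("ACF:", v))
    pacf(ordered[[v]], main = paste("PACF:", v))
    par(old)
  }
  invisible(ordered)
}

# Usage with made-up data whose rows are deliberately out of date order.
set.seed(42)
df <- data.frame(date = as.Date("2020-01-01") + sample(0:99))
df$temp <- sin(as.numeric(df$date) / 5) + rnorm(100, sd = 0.2)
sorted <- explore_series(df, vars = "temp", date_col = "date")
```

Returning the sorted data invisibly keeps the function usable for its plots alone while still letting you reuse the ordered rows.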

Solved – Visualizing high dimensional binary data

Even if the data are binary, you can run a scaled Principal Component Analysis (PCA). Projecting the observations onto the 2D plane spanned by the first two principal components gives you an idea of how your data cluster.

In R:

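# data is your data.frame/matrix of 0/1 values (one row per observation)
pca <- prcomp(data, scale. = TRUE)
# Scree plot to see how much variance each principal component captures
plot(pca)
# Projection onto the first two principal components.
# pca$x holds the scores of the centered and scaled data; multiplying the raw
# data by pca$rotation would skip that centering and scaling.
plot(pca$x[, 1:2])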
