Solved – Visualizing high dimensional binary data

binary data, data visualization

What is a good way to visualize high dimensional (say n=10) binary data? I remember reading something about that a few years ago.

Say, for instance, you want to plot or cluster pizzas based on their toppings, e.g. ham, chicken, mushrooms, etc.

Best Answer

Even though the data are binary, you can run a scaled Principal Component Analysis (PCA). Projecting the observations onto the 2D plane spanned by the first two principal components gives you an idea of how your data cluster.

In R:

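# data is your data.frame/matrix of 0/1 values
# (drop constant columns first: scaling fails on zero variance)
pca <- prcomp(data, scale. = TRUE)
# Scree plot: variance captured by each principal component
plot(pca)
# Projection onto the first two principal components.
# pca$x already holds the centered and scaled scores; data %*% pca$rotation
# would skip the centering/scaling and does not work on a data.frame.
plot(pca$x[, 1:2])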
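
To make the recipe above concrete, here is a minimal sketch on simulated data. The pizza data, the two "styles" and their topping probabilities are assumptions made up for illustration; they are not from the question.

```r
# Simulated example (hypothetical data): 200 pizzas, 10 binary toppings,
# drawn from two made-up styles with different topping probabilities.
set.seed(1)
toppings <- c("ham", "chicken", "mushrooms", "onion", "olives",
              "pepperoni", "pineapple", "peppers", "spinach", "bacon")
p_meat <- c(0.8, 0.7, 0.3, 0.3, 0.2, 0.8, 0.3, 0.2, 0.1, 0.8)
p_veg  <- c(0.1, 0.1, 0.8, 0.7, 0.7, 0.1, 0.2, 0.8, 0.7, 0.1)
style  <- rep(c("meat", "veg"), each = 100)
probs  <- rbind(matrix(p_meat, 100, 10, byrow = TRUE),
                matrix(p_veg,  100, 10, byrow = TRUE))
pizzas <- matrix(rbinom(length(probs), size = 1, prob = probs),
                 nrow = 200, dimnames = list(NULL, toppings))

# Constant columns have zero variance and cannot be scaled, so drop them.
pizzas <- pizzas[, apply(pizzas, 2, var) > 0, drop = FALSE]

pca <- prcomp(pizzas, scale. = TRUE)

# Share of the total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained)[1:2], 2))

# Scores on the first two components, coloured by the simulated style.
plot(pca$x[, 1:2], pch = 19,
     col = ifelse(style == "meat", "red", "blue"))
```

With binary data many observations share the same coordinates, so adding a little jitter or transparency to the points may make the plot easier to read.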

Related Solutions

Solved – Visualizing 2-letter combinations

Here is a start: visualize these on a grid of first and second letters:

combi <- c("Ad", "am", "ar", "as", "bc", "bd", "bp", "br", "BR", "bs", 
"by", "c",  "C",  "cc", "cd", "ch", "ci", "CJ", "ck", "Cl", "cm", "cn", 
"cq", "cs", "Cs", "cv", "d",  "D",  "dc", "dd", "de", "df", "dg", "dn", 
"do", "ds", "dt", "e",  "E",  "el", "ES", "F",  "FF", "fn", "gc", "gl", 
"go", "H",  "Hi", "hm", "I",  "ic", "id", "ID", "if", "IJ", "Im", "In", 
"ip", "is", "J",  "lh", "ll", "lm", "lo", "Lo", "ls", "lu", "m",  "MH", 
"mn", "ms", "N",  "nc", "nd", "nn", "ns", "on", "Op", "P",  "pa", "pf", 
"pi", "Pi", "pm", "pp", "ps", "pt", "q",  "qf", "qq", "qr", "qt", "r",  
"Re", "rf", "rk", "rl", "rm", "rt", "s",  "sc", "sd", "SJ", "sn", "sp", 
"ss", "t",  "T",  "te", "tr", "ts", "tt", "tz", "ug", "UG", "UN", "V",  
"VA", "Vd", "vi", "Vo", "w",  "W",  "y")

## first: first character; second: the rest (empty, hence NA, for one-letter names)
df <- data.frame (first = factor (gsub ("^(.).", "\\1", combi), 
                                  levels = c (LETTERS, letters)),
                  second = factor (gsub ("^.", "", combi), 
                                  levels = c (LETTERS, letters)),
                  combi = combi)

library(ggplot2)
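ggplot (data = df, aes (x = first, y = second)) + 
   geom_text (aes (label = combi), size = 3) + 
   ## geom_point () +
   ## keep unused letters on the axes so that 26.5 separates upper from lower case
   scale_x_discrete (drop = FALSE) + 
   scale_y_discrete (drop = FALSE) + 
   geom_vline (xintercept = 26.5, col = "grey") + 
   geom_hline (yintercept = 26.5, col = "grey")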

[Figure: grid of the names by first and second letter]

## geom_bar () counts the occurrences of each letter (the axis is discrete)
ggplot (data = df, aes (x = second)) + geom_bar ()

[Figure: counts by second letter]

## geom_bar () counts the occurrences of each letter (the axis is discrete)
ggplot (data = df, aes (x = first)) + geom_bar ()

[Figure: counts by first letter]

I gather:

- Of the one-letter names:
  - fortunately i, j, k, and l are available (so I can index up to 4d arrays)
  - unfortunately t (time) and c (concentration) are gone. So are m (mass), V (volume) and F (force). No radius r nor diameter d.
  - I can have pressure (p), amount of substance (n), and length l, though.
  - Maybe I'll have to change to Greek names: ε is OK, but then shouldn't `π <- pi`?
- I can have whatever lowerUPPER name I want.
- In general, starting with an upper-case letter is a safer bet than lower case.
- Don't start with c or d.
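
The one-letter observations above can be cross-checked with a few lines of base R. This is only a sketch: `taken` copies the one-letter entries from the `combi` list by hand, and the helper `is_free` is a hypothetical name whose result depends on which packages happen to be attached in your session.

```r
# One-letter names copied by hand from the combi list above.
taken <- c("c", "C", "d", "D", "e", "E", "F", "H", "I", "J", "m",
           "N", "P", "q", "r", "s", "t", "T", "V", "w", "W", "y")

# Single letters that do not appear in that list.
free_lower <- setdiff(letters, taken)
free_upper <- setdiff(LETTERS, taken)
print(free_lower)
print(free_upper)

# Hypothetical helper: is a name unused in the current session?
# exists() searches the attached packages, so the answer is session-dependent.
is_free <- function(name) !exists(name)
print(is_free("c"))   # c() is a base function, so this name is not free
```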

Solved – Visualizing high dimensional data

You could give t-SNE a try. It is pretty straightforward to use, and it works with Octave in addition to Matlab and Python. Take a look at the guide to get a first plot within a minute.
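
For readers working in R like the other answers on this page, a rough equivalent might look as follows. Note the assumptions: the answer itself only mentions Matlab, Octave and Python; the `Rtsne` package is a substitution made here, the data are simulated, and the perplexity value is an arbitrary choice that usually needs tuning.

```r
# Sketch only: runs if the Rtsne package is installed (an assumption).
if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(42)
  # Simulated binary data (hypothetical): 300 observations, 10 features.
  X <- matrix(rbinom(300 * 10, size = 1, prob = 0.4), nrow = 300)

  # Binary data often contain identical rows; Rtsne rejects duplicates
  # by default, so keep only the distinct rows.
  X <- unique(X)

  fit <- Rtsne::Rtsne(X, dims = 2, perplexity = 20)

  # fit$Y is an nrow(X) x 2 matrix with the embedded coordinates.
  plot(fit$Y, pch = 19, xlab = "t-SNE 1", ylab = "t-SNE 2")
}
```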
