Solved – Visualizing 2-letter combinations

data-visualization r

The answers to this question on SO returned a set of approximately 125 one- to two-letter names:
https://stackoverflow.com/questions/6979630/what-1-2-letter-object-names-conflict-with-existing-r-objects

  [1] "Ad" "am" "ar" "as" "bc" "bd" "bp" "br" "BR" "bs" "by" "c"  "C" 
 [14] "cc" "cd" "ch" "ci" "CJ" "ck" "Cl" "cm" "cn" "cq" "cs" "Cs" "cv"
 [27] "d"  "D"  "dc" "dd" "de" "df" "dg" "dn" "do" "ds" "dt" "e"  "E" 
 [40] "el" "ES" "F"  "FF" "fn" "gc" "gl" "go" "H"  "Hi" "hm" "I"  "ic"
 [53] "id" "ID" "if" "IJ" "Im" "In" "ip" "is" "J"  "lh" "ll" "lm" "lo"
 [66] "Lo" "ls" "lu" "m"  "MH" "mn" "ms" "N"  "nc" "nd" "nn" "ns" "on"
 [79] "Op" "P"  "pa" "pf" "pi" "Pi" "pm" "pp" "ps" "pt" "q"  "qf" "qq"
 [92] "qr" "qt" "r"  "Re" "rf" "rk" "rl" "rm" "rt" "s"  "sc" "sd" "SJ"
[105] "sn" "sp" "ss" "t"  "T"  "te" "tr" "ts" "tt" "tz" "ug" "UG" "UN"
[118] "V"  "VA" "Vd" "vi" "Vo" "w"  "W"  "y"

And R import code:

nms <- c("Ad","am","ar","as","bc","bd","bp","br","BR","bs","by","c","C","cc","cd","ch","ci","CJ","ck","Cl","cm","cn","cq","cs","Cs","cv","d","D","dc","dd","de","df","dg","dn","do","ds","dt","e","E","el","ES","F","FF","fn","gc","gl","go","H","Hi","hm","I","ic","id","ID","if","IJ","Im","In","ip","is","J","lh","ll","lm","lo","Lo","ls","lu","m","MH","mn","ms","N","nc","nd","nn","ns","on","Op","P","pa","pf","pi","Pi","pm","pp","ps","pt","q","qf","qq","qr","qt","r","Re","rf","rk","rl","rm","rt","s","sc","sd","SJ","sn","sp","ss","t","T","te","tr","ts","tt","tz","ug","UG","UN","V","VA","Vd","vi","Vo","w","W","y")

Since the point of the question was to come up with a memorable list of object names to avoid, and most humans are not so good at making sense out of a solid block of text, I would like to visualize this.

Unfortunately, I'm not certain of the best way to do this. I had thought of something like a stem-and-leaf plot, except that, since there are no repeated values, each "leaf" would be placed in its appropriate column rather than being left-justified. Another option is a wordcloud-style adaptation in which each letter is sized according to its prevalence.
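To make the stem-and-leaf idea concrete, here is a minimal base-R sketch. It assumes a small lower-case subset of the names (chosen only for illustration) and uses the first letter as the stem and the second letter as a leaf printed in its own fixed column:

```r
# Illustrative subset only; the full vector of names would work the same way.
nms <- c("am", "ar", "as", "bc", "bd", "cc", "cd", "dc", "dd")

first  <- substr(nms, 1, 1)   # stems
second <- substr(nms, 2, 2)   # leaves

rows <- sort(unique(first))
cols <- sort(unique(second))

# One column per possible leaf; blank where the combination does not occur.
grid <- matrix(" ", nrow = length(rows), ncol = length(cols),
               dimnames = list(rows, cols))
grid[cbind(first, second)] <- second   # index by (row name, column name)

stems <- paste0(rows, " | ", apply(grid, 1, paste, collapse = " "))
cat(stems, sep = "\n")
```

Because every leaf has a fixed column, gaps in a row show at a glance which second letters are still free for a given first letter.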

How might this be most clearly and efficiently visualized?

Visualizations which do either of the following fit in the spirit of this question:

Primary goal: Enhance the memorizability of the set of names by revealing patterns in the data
Alternate goal: Highlight interesting features of the set of names (e.g. the distribution of letters, the most common letters, etc.)

Answers in R are preferred, but all interesting ideas are welcome.

Ignoring the single-letter names is allowed, since those are easier to just give as a separate list.
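For anyone who wants to take that route, a minimal sketch of the split, assuming a short illustrative subset of the names rather than the full vector:

```r
# Illustrative subset only.
nms <- c("Ad", "am", "c", "C", "cc", "df", "F", "pi", "t", "T")

single <- nms[nchar(nms) == 1]   # one-letter names, listed separately
double <- nms[nchar(nms) == 2]   # two-letter names, the ones to visualize

print(single)
print(double)
```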

Best Answer

Here is a start: visualize these on a grid of first and second letters:

combi <- c("Ad", "am", "ar", "as", "bc", "bd", "bp", "br", "BR", "bs", 
"by", "c",  "C",  "cc", "cd", "ch", "ci", "CJ", "ck", "Cl", "cm", "cn", 
"cq", "cs", "Cs", "cv", "d",  "D",  "dc", "dd", "de", "df", "dg", "dn", 
"do", "ds", "dt", "e",  "E",  "el", "ES", "F",  "FF", "fn", "gc", "gl", 
"go", "H",  "Hi", "hm", "I",  "ic", "id", "ID", "if", "IJ", "Im", "In", 
"ip", "is", "J",  "lh", "ll", "lm", "lo", "Lo", "ls", "lu", "m",  "MH", 
"mn", "ms", "N",  "nc", "nd", "nn", "ns", "on", "Op", "P",  "pa", "pf", 
"pi", "Pi", "pm", "pp", "ps", "pt", "q",  "qf", "qq", "qr", "qt", "r",  
"Re", "rf", "rk", "rl", "rm", "rt", "s",  "sc", "sd", "SJ", "sn", "sp", 
"ss", "t",  "T",  "te", "tr", "ts", "tt", "tz", "ug", "UG", "UN", "V",  
"VA", "Vd", "vi", "Vo", "w",  "W",  "y")

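## first letter (a one-letter name is its own first letter) and second letter
## (empty, hence NA, for one-letter names); upper case comes first in the levels
df <- data.frame (first = factor (gsub ("^(.).", "\\1", combi), 
                                  levels = c (LETTERS, letters)),
                  second = factor (gsub ("^.", "", combi), 
                                  levels = c (LETTERS, letters)),
                  combi = combi)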
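If the regular expressions are hard to read, the same split can be sketched with `substr`. This is an equivalent illustration on a small made-up subset (the name `df_small` is hypothetical), not the code used for the plots. It also shows why 26.5 is the boundary between upper and lower case: with levels `c (LETTERS, letters)`, upper case occupies levels 1 to 26 and lower case 27 to 52.

```r
# Illustrative subset only.
combi <- c("Ad", "am", "c", "Cs", "pi", "T")

first  <- substr(combi, 1, 1)
second <- substr(combi, 2, 2)
second[second == ""] <- NA        # one-letter names have no second letter

lv <- c(LETTERS, letters)         # upper case first, then lower case
df_small <- data.frame(first  = factor(first,  levels = lv),
                       second = factor(second, levels = lv),
                       combi  = combi)

# Integer codes: "A" is level 1, "a" is level 27.
print(as.integer(df_small$first))
```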

library(ggplot2)
ggplot (data = df, aes (x = first, y = second)) + 
   geom_text (aes (label = combi), size = 3) + 
   ## geom_point () +
   geom_vline (xintercept = 26.5, col = "grey") + 
   geom_hline (yintercept = 26.5, col = "grey")

[Figure: grid of the names, first letter on the x axis, second letter on the y axis]

## second is a factor, so count its levels with geom_bar ()
ggplot (data = df, aes (x = second)) + geom_bar ()

[Figure: counts by second letter]

## first is a factor as well
ggplot (data = df, aes (x = first)) + geom_bar ()

[Figure: counts by first letter]
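The same counts can be read off as numbers without plotting. A minimal base-R sketch, assuming a small illustrative subset of the names:

```r
# Illustrative subset only.
combi <- c("cc", "cd", "ch", "dc", "dd", "pa", "pf", "ss")

first  <- substr(combi, 1, 1)
counts <- sort(table(first), decreasing = TRUE)   # most common first letters first

print(counts)
```

Sorting the table puts the most heavily used first letters at the front, which is the pattern the bar charts are meant to reveal.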

I gather:

- of the one-letter names,
  - fortunately i, j, k, and l are available (so I can index up to 4d arrays)
  - unfortunately t (time), c (concentration) are gone. So are m (mass), V (volume) and F (force). No radius r nor diameter d.
  - I can have pressure (p), amount of substance (n), and length l, though.
  - Maybe I'll have to change to Greek names: ε is OK, but then shouldn't
```
π <- pi
```
    ?
- I can have whatever lowerUPPER name I want.
- In general, starting with an upper-case letter is a safer bet than lower case.
- Don't start with c or d.
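A quick way to check a candidate name yourself is `exists ()`. The sketch below is deliberately restricted to the base package (`envir = baseenv ()`, `inherits = FALSE`), which is an assumption made for reproducibility: the list in the question also covers names from other packages, so a name reported as free here may still clash once more packages are attached.

```r
candidates <- c("i", "k", "t", "c", "F", "pi")

# TRUE if base R already defines an object with that name.
taken <- vapply(candidates, exists, logical(1),
                envir = baseenv(), inherits = FALSE)

print(taken)
```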

Related Solutions

Solved – Visualization of binned frequency distribution in R

This kind of plot could be generated with geom_rect.

Your data:

names <- read.csv("http://samswift.org/files/app_c.csv")
sum50 <- tapply(names$count, (seq_along(names$count)-1) %/% 50, sum)
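To see what the `tapply` line does, here is the same binning idea on a tiny made-up vector with a bin size of 3 instead of 50. Integer division of the zero-based position gives the bin index, and `tapply` sums the counts within each bin:

```r
count <- c(5, 1, 4, 2, 8, 3, 7)            # made-up counts

bin  <- (seq_along(count) - 1) %/% 3       # 0 0 0 1 1 1 2
sums <- tapply(count, bin, sum)            # one sum per bin, named by bin index

print(sums)
```

The last bin holds only one value because 7 is not a multiple of 3, which is the same situation the answer handles for the final label.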

First, we need additional variables:

The cumulative sum:

cum <- rev(cumsum(rev(sum50)))

Put all into a data frame. The variables start and stop indicate where the rectangles should begin and end, respectively:

data <- data.frame(sum = sum50,
                   names = paste(as.numeric(names(sum50)) * 50 + 1,
                                 as.numeric(names(sum50)) * 50 + 50, sep = "-"),
                   start = c(cum[-1], 0),
                   stop = cum, stringsAsFactors = FALSE)
data$names[nrow(data)] <- paste(as.numeric(names(sum50)[length(sum50)]) * 50 + 1,
                                as.numeric(names(sum50)[length(sum50)]) * 50 + 
                                                          nrow(names) %% 50, sep = "-")

The variable center is the center between start and stop position:

data$center <- (data$stop - data$start)/2 + data$start
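A small worked example may help here. Assuming three made-up bin sums (the names `x_start` and `x_stop` are only for this sketch), the reversed cumulative sum lays the bins end to end so that each rectangle is exactly as wide as its sum:

```r
sum50 <- c(10, 13, 7)                      # made-up bin sums

cum     <- rev(cumsum(rev(sum50)))         # 30 20 7
x_start <- c(cum[-1], 0)                   # 20 7 0
x_stop  <- cum                             # 30 20 7
center  <- (x_stop - x_start) / 2 + x_start

print(data.frame(sum50, x_start, x_stop, center))
```

Since each width `x_stop - x_start` equals the bin sum and the height is also the bin sum, an equal-aspect plot draws every bin as a square.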

For this example, I use the first five rows:

data <- data[1:5, ]

Plot:

library(ggplot2)

ggplot(data, aes(xmin = start, xmax = stop, ymin = 0, ymax = sum)) +
  geom_rect(fill = NA, colour = "black") +
  scale_x_reverse("bin", breaks = data$center, labels = data$names) +
  coord_equal() # because we want squares

[Figure: squares for the first five bins]

This is the version based on the complete data set. You should consider using only a subset of x-axis labels.

[Figure: squares for all bins of the complete data set]

Solved – Understanding and Interpreting letter value boxplots

The key term is letter-value (box)plots and the key reference is now

Hofmann, Heike, Wickham, Hadley and Kafadar, Karen. 2017. Letter-value plots: Boxplots for large data. Journal of Computational and Graphical Statistics. http://dx.doi.org/10.1080/10618600.2017.1305277

Earlier versions of this paper can easily be found on-line.

As I understand it the width of each box just indicates how a box is defined. The fattest box is between letter values that are (approximate) quartiles, the next fattest boxes stretch between (approximate) quartiles and the (approximate) octiles beyond in either tail, and so on. Positively, this is just an extension of the common box plot convention that each box indicates that it is the interval between quartiles and the width is otherwise just a conventional choice. (Only occasionally are boxes shown that indicate the number of values in each.)

A little more negatively, people have to learn that the width of the box is otherwise arbitrary. It's not, for example, a boxy version of a density plot.

But the interpretation is otherwise similar to that of box plots, e.g. the central half of a sample is within these limits; the central three-quarters within these limits; and so on. Are groups or variables similar or different in distribution?
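To make "these limits" concrete, here is a minimal sketch of how letter values can be computed. It follows Tukey's depth rule as I understand it (the median has depth (n + 1)/2, and each further depth is (floor(previous depth) + 1)/2); the function name `letter_values` is hypothetical, and plotting packages may differ in details.

```r
# Letter values by the depth rule (a sketch; assumes k is small enough
# that every depth stays >= 1).
letter_values <- function(x, k = 3) {
  x <- sort(x)
  n <- length(x)
  depth <- (n + 1) / 2                       # depth of the median
  rows <- vector("list", k)
  for (i in seq_len(k)) {
    idx <- c(floor(depth), ceiling(depth))   # average two neighbours if depth ends in .5
    rows[[i]] <- c(depth = depth,
                   lower = mean(x[idx]),          # counted from the bottom
                   upper = mean(x[n + 1 - idx]))  # counted from the top
    depth <- (floor(depth) + 1) / 2          # depth of the next letter value
  }
  out <- do.call(rbind, rows)
  rownames(out) <- c("M", "F", "E", "D", "C", "B", "A")[seq_len(k)]
  out
}

lv <- letter_values(1:15)
print(lv)
```

The row F gives the fourths, which bound roughly the central half of the sample, and the row E gives the eighths, which bound roughly the central three-quarters; each further row would define the next, narrower box in the plot.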

For a survey of letter values with different emphasis, see

Cox, N. J. 2016. Speaking Stata: Letter values as selected quantiles. Stata Journal 16(4): 1058-1071. http://www.stata-journal.com/article.html?article=st0465

I have to worry, on behalf of those who advocate this plot, that naive users are all too likely to interpret it as a blocky version of a violin plot, just as histograms are discretised density plots. The ideal of showing more detail than a box plot is admirable, and the practice usually helps, but there are many other ways to do that. Naturally, advice to read how it is defined and constructed should always be followed.
