Visualization of binned frequency distribution in R

I am trying to visually represent the distribution of last names in the US. Specifically, I am trying to show that the most common names (say the top 50) are very common, but that frequency drops off quickly after that. The conclusion I hope to support is that, outside the top X, it is not very meaningful to distinguish more common names from less common ones, because each of them accounts for only a very small portion of the population.

I have the observed frequency of all last names appearing more than 100 times in the 2000 census.

## Data from the US Census, extracted and CSV re-hosted 
## http://www.census.gov/genealogy/www/data/2000surnames/names.zip
names <- read.csv("http://samswift.org/files/app_c.csv")

My intuition was to bin the ranked list by groups of 50. Most common names 1-50, 51-100, …

sum50 <-  tapply(names$count, (seq_along(names$count)-1) %/% 50, sum)

So we now have the sum population of people with a top 50 name, a second 50 name, etc.
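To illustrate how the integer-division index forms the bins, here is a minimal sketch on a made-up vector, using a bin size of 3 instead of 50. The values in `toy_count` are invented for illustration and are not census data.

```r
# Hypothetical counts, already sorted from most to least common
toy_count <- c(90, 80, 70, 20, 10, 5, 1)
bin_size <- 3

# Positions 1-3 map to bin 0, positions 4-6 to bin 1, position 7 to bin 2
bin_id <- (seq_along(toy_count) - 1) %/% bin_size

# Sum the counts within each bin; the result is named by bin index
toy_sum <- tapply(toy_count, bin_id, sum)
print(toy_sum)
```

The last bin holds only one value here, which mirrors what happens with the real data whenever the number of names is not a multiple of the bin size.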

I was imagining a plot like this:
[image: desired plot]
where the x-axis is the ordered factor of bins (1-50, 51-100, ...) and the y-axis is the summed population in that bin. I think it's important that the bar widths scale with the y variable too, so that the area of each square conveys the mass of the population.

So, a two-part question really (although I think that's frowned upon):

1. How might I generate this plot in R with the provided data? I generally use ggplot2, but I am not wedded to it. I tried using geom_bar and setting the width, but I failed to generate anything even remotely functional.
2. Do you have a better idea for how to visualize the assertion I'm making, or do you disagree with the assertion entirely?

Best Answer

This kind of plot could be generated with geom_rect.

Your data:

names <- read.csv("http://samswift.org/files/app_c.csv")
sum50 <- tapply(names$count, (seq_along(names$count)-1) %/% 50, sum)

First, we need additional variables:

The cumulative sum, accumulated from the last bin back to the first, so that the first entry is the grand total:

cum <- rev(cumsum(rev(sum50)))

Put everything into a data frame. The variables start and stop indicate where each rectangle should begin and end on the x-axis, respectively:

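# One row per bin: summed population, rank-range label, and rectangle x-extent
data <- data.frame(sum = sum50,
                   names = paste(as.numeric(names(sum50)) * 50 + 1,
                                 as.numeric(names(sum50)) * 50 + 50, sep = "-"),
                   start = c(cum[-1], 0),
                   stop = cum, stringsAsFactors = FALSE)
# Relabel the last, incomplete bin with its true upper rank
# (assumes nrow(names) is not a multiple of 50)
data$names[nrow(data)] <- paste(as.numeric(names(sum50)[length(sum50)]) * 50 + 1,
                                as.numeric(names(sum50)[length(sum50)]) * 50 + 
                                                          nrow(names) %% 50, sep = "-")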

The variable center is the midpoint between the start and stop positions:

data$center <- (data$stop - data$start)/2 + data$start
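The start/stop bookkeeping can be checked on a small example. This is only a sketch with invented bin sums (`toy_sum` and the other `toy_` names are hypothetical); it shows that each rectangle's width equals its bin sum, which is what makes the rectangles square once the height is set to the same value.

```r
# Hypothetical bin sums, largest bin first
toy_sum <- c(240, 35, 1)

# Reverse cumulative sum: 276, 36, 1 (the first entry is the grand total)
toy_cum <- rev(cumsum(rev(toy_sum)))

# Each rectangle spans from the next bin's cumulative value up to its own
toy_start <- c(toy_cum[-1], 0)   # 36, 1, 0
toy_stop  <- toy_cum             # 276, 36, 1

# Width of each rectangle equals the bin sum: 240, 35, 1
toy_width <- toy_stop - toy_start

# Midpoints, used later as axis break positions: 156, 18.5, 0.5
toy_center <- (toy_stop - toy_start) / 2 + toy_start
print(data.frame(toy_start, toy_stop, toy_width, toy_center))
```

Because the x-axis is reversed in the plot, the largest bin (which ends at the grand total) is drawn on the left.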

For this example, I use the first five rows:

data <- data[1:5, ]

Plot:

library(ggplot2)

ggplot(data, aes(xmin = start, xmax = stop, ymin = 0, ymax = sum)) +
  geom_rect(fill = NA, colour = "black") +
  scale_x_reverse("bin", breaks = data$center, labels = data$names) +
  coord_equal() # because we want squares

The same code, without the subsetting step, produces the plot for the complete data set. In that case you should consider labelling only a subset of the bins on the x-axis.
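One possible way to thin the axis labels is to pass only every k-th break and label to the scale. The sketch below uses an invented six-bin data frame called `toy` as a stand-in for the data frame built above, and the step of 2 is an arbitrary choice; with the full data set a much larger step would be needed.

```r
# Toy stand-in for the data frame above (values invented for illustration)
toy <- data.frame(sum = c(50, 40, 30, 20, 10, 5))
toy$names  <- paste(seq(1, 251, by = 50), seq(50, 300, by = 50), sep = "-")
toy$stop   <- rev(cumsum(rev(toy$sum)))
toy$start  <- c(toy$stop[-1], 0)
toy$center <- (toy$start + toy$stop) / 2

# Keep every second bin label (the step is an assumption; tune it to the data)
keep <- seq(1, nrow(toy), by = 2)

# Draw the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(xmin = start, xmax = stop, ymin = 0, ymax = sum)) +
    geom_rect(fill = NA, colour = "black") +
    scale_x_reverse("bin", breaks = toy$center[keep], labels = toy$names[keep]) +
    coord_equal()
  print(p)
}
```

All rectangles are still drawn; only the tick labels are reduced.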
