I am trying to visually represent the distribution of last names in the US. Specifically, I am trying to show that the distribution is such that the most common names (say the top 50) are very common but that it drops off quickly after that. The conclusion I hope to support is that it's not that meaningful to differentiate between common and less common among names that are not in the top X because they are all a very small portion of the population.
I have the observed frequency of all last names appearing more than 100 times in the 2000 census.
## Data from the US Census, extracted and CSV re-hosted
## http://www.census.gov/genealogy/www/data/2000surnames/names.zip
names <- read.csv("http://samswift.org/files/app_c.csv")
My intuition was to bin the ranked list by groups of 50. Most common names 1-50, 51-100, …
sum50 <- tapply(names$count, (seq_along(names$count)-1) %/% 50, sum)
So we now have the sum population of people with a top 50 name, a second 50 name, etc.
I was imagining a plot where the x-axis is the ordered factor of bins (1-50, 51-100, ...) and the y-axis is the total population in that bin. I think it's important that bar widths scale with the y variable too, so that the area of each square conveys the mass of the population.
So, two part question really (although I think that's frowned upon)
- How might I generate this plot in R with the provided data? I generally use ggplot2, but I am not wed to it. I tried using geom_bar and setting the width, but I failed to generate anything even a bit functional.
- Do you have a better thought on how to visualize the assertion I'm making, or do you disagree with the assertion entirely?
Best Answer
This kind of plot can be generated with geom_rect.
Your data are names and sum50, as defined in the question.
First, we need additional variables:
The cumulative sum:
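The code for this step did not survive, so here is a minimal sketch of what it might look like. It assumes sum50 is the tapply() result from the question; the three bin totals below are made-up placeholders so the snippet runs without downloading the file, and the as.vector() call is an assumption about what the "additional variables" preparation involved.

```r
# Placeholder bin totals (illustrative only, not census figures).
# With the real data this would be the sum50 computed in the question.
sum50 <- c("0" = 500, "1" = 300, "2" = 200)

# Drop the names that tapply() attaches, leaving a plain numeric vector
sum50 <- as.vector(sum50)

# Running total: the right-hand edge of each bin's rectangle
cumsum50 <- cumsum(sum50)
cumsum50
```

Because each bin's width equals its total, the running total is exactly where that bin's rectangle ends on the x-axis.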
Put everything into a data frame. The variables start and stop indicate where the rectangles should begin and end, respectively.
The variable center is the midpoint between the start and stop positions.
For this example, I use the first five rows:
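A sketch of how such a data frame could be built, under some assumptions: the column names sum50 and label, the "1-50" style label text, and the placeholder bin totals are all illustrative choices, not taken from the census file or confirmed by the original answer.

```r
# Placeholder bin totals (illustrative only, not census figures)
sum50 <- c(500, 300, 200, 120, 80, 60)
cumsum50 <- cumsum(sum50)
n <- length(sum50)

# Upper and lower rank of each bin, kept as integers for clean labels
hi <- seq_len(n) * 50L
lo <- hi - 49L

dat <- data.frame(
  start = c(0, head(cumsum50, -1)),   # left edge: where the previous bin ended
  stop  = cumsum50,                   # right edge: running total
  sum50 = sum50,                      # bar height
  label = sprintf("%d-%d", lo, hi),   # "1-50", "51-100", ...
  stringsAsFactors = FALSE
)

# Midpoint of each rectangle
dat$center <- (dat$start + dat$stop) / 2

# Keep only the first five bins for the example
dat5 <- head(dat, 5)
dat5
```

With this layout, stop - start equals sum50 for every row, so each bar is as wide as it is tall.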
Plot:
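The plotting call might look roughly like the following. This is a sketch, not necessarily the exact code behind the figure: the small data frame is a hand-written placeholder, and the fill colour, outline colour, and axis titles are arbitrary choices.

```r
library(ggplot2)

# Placeholder rectangles (illustrative numbers, not census data)
dat <- data.frame(
  start = c(0, 500, 800),
  stop  = c(500, 800, 1000),
  sum50 = c(500, 300, 200),
  label = c("1-50", "51-100", "101-150"),
  stringsAsFactors = FALSE
)
dat$center <- (dat$start + dat$stop) / 2

# One rectangle per bin: width and height both equal the bin total
p <- ggplot(dat) +
  geom_rect(aes(xmin = start, xmax = stop, ymin = 0, ymax = sum50),
            fill = "grey60", colour = "white") +
  scale_x_continuous(breaks = dat$center, labels = dat$label) +
  labs(x = "Rank bin", y = "People in bin")
print(p)
```

Placing the axis breaks at center puts each bin label under the middle of its rectangle.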
This is the version based on the complete data set. You should consider using only a subset of x-axis labels.
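One way to thin out the labels is to pass only some of the centers as axis breaks. The sketch below assumes placeholder bin totals with a steep drop-off and labels every 5th bin; the step of 5 and the variable name show are arbitrary.

```r
library(ggplot2)

# Placeholder bin totals with a steep drop-off (illustrative, not census data)
sum50 <- round(1000 / seq_len(20))
dat <- data.frame(
  start = c(0, head(cumsum(sum50), -1)),
  stop  = cumsum(sum50),
  sum50 = sum50,
  label = sprintf("%d-%d", seq_len(20) * 50L - 49L, seq_len(20) * 50L),
  stringsAsFactors = FALSE
)
dat$center <- (dat$start + dat$stop) / 2

# Indices of the bins that get an axis label (every 5th bin)
show <- seq(1, nrow(dat), by = 5)

p <- ggplot(dat) +
  geom_rect(aes(xmin = start, xmax = stop, ymin = 0, ymax = sum50)) +
  scale_x_continuous(breaks = dat$center[show], labels = dat$label[show])
print(p)
```

With the full data set the later bins are very narrow, so a much larger step, or labelling only the first few bins, may be needed.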