Data Visualization – Removing Extraneous Points Near the Center of a QQ-Plot in R

data visualizationqq-plotr

I'm trying to plot a QQ-plot with two data sets of about 1.2 million points, in R (using qqplot, and feeding the data into ggplot2). The calculation is easy enough, but the resulting graph is painfully slow to load, because there's so many points. I've tried linear approximation to reduce the number of points to 10000 (this is what the qqplot function does anyway, if one of your data sets is bigger than the other), but then you lose a lot of the detail in the tails.

Most of the data points towards the centre are basically useless – they overlap so much that there's probably about 100 per pixel. Is there any simple way of removing data that is too close together, without loosing the more sparse data toward the tails?

Best Answer

Q-Q plots are incredibly autocorrelated except in the tails. In reviewing them, one focuses on the overall shape of the plot and on tail behavior. Ergo, you will do fine by coarsely subsampling in the centers of the distributions and including a sufficient amount of the tails.

Here is code illustrating how to sample across an entire dataset as well as how to take extreme values.

quant.subsample <- function(y, m=100, e=1) {
  # m: size of a systematic sample
  # e: number of extreme values at either end to use
  x <- sort(y)
  n <- length(x)
  quants <- (1 + sin(1:m / (m+1) * pi - pi/2))/2
  sort(c(x[1:e], quantile(x, probs=quants), x[(n+1-e):n]))
  # Returns m + 2*e sorted values from the EDF of y
}

To illustrate, this simulated dataset shows a structural difference between two datasets of approximately 1.2 million values as well as a very small amount of "contamination" in one of them. Also, to make this test stringent, an interval of values is excluded from one of the datasets altogether: the QQ plot needs to show a break for those values.

set.seed(17)
n.x <- 1.21 * 10^6
n.y <- 1.20 * 10^6
k <- floor(0.0001*n.x)
x <- c(rnorm(n.x-k), rnorm(k, mean=2, sd=2))
x <- x[x <= -3 | x >= -2.5]
y <- rbeta(n.y, 10,13)

We can subsample 0.1% of each dataset and include another 0.1% of their extremes, giving 2420 points to plot. Total elapsed time is less than 0.5 seconds:

m <- .001 * max(n.x, n.y)
e <- floor(0.0005 * max(n.x, n.y))

system.time(
  plot(quant.subsample(x, m, e), 
       quant.subsample(y, m, e), 
       pch=".", cex=4,
       xlab="x", ylab="y", main="QQ Plot")
  )

No information is lost whatsoever:

QQ plot

Related Question