Data Visualization – Removing Extraneous Points Near the Center of a QQ-Plot in R

Tags: data visualization, qq-plot, r

I'm trying to plot a QQ-plot with two data sets of about 1.2 million points, in R (using qqplot, and feeding the data into ggplot2). The calculation is easy enough, but the resulting graph is painfully slow to load, because there are so many points. I've tried linear approximation to reduce the number of points to 10000 (this is what the qqplot function does anyway, if one of your data sets is bigger than the other), but then you lose a lot of the detail in the tails.

Most of the data points towards the centre are basically useless – they overlap so much that there are probably about 100 per pixel. Is there any simple way of removing data that is too close together, without losing the sparser data toward the tails?

Best Answer

Q-Q plots are incredibly autocorrelated except in the tails. In reviewing them, one focuses on the overall shape of the plot and on tail behavior. Ergo, you will do fine by coarsely subsampling in the centers of the distributions and including a sufficient amount of the tails.

Here is code illustrating how to sample across an entire dataset as well as how to take extreme values.

quant.subsample <- function(y, m=100, e=1) {
  # m: size of a systematic sample
  # e: number of extreme values at either end to use
  x <- sort(y)
  n <- length(x)
  quants <- (1 + sin(1:m / (m+1) * pi - pi/2))/2
  sort(c(x[1:e], quantile(x, probs=quants), x[(n+1-e):n]))
  # Returns m + 2*e sorted values from the EDF of y
}
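As a rough illustration of why the sine-based grid in quant.subsample suits this purpose, the sketch below prints the probabilities it produces for a small, arbitrarily chosen m = 9. The probabilities bunch up near 0 and 1 and spread out around 0.5, so the systematic sample is densest toward the tails.

```r
# Probability grid used inside quant.subsample, shown for a small m.
# m = 9 is an arbitrary illustrative choice, not a recommended setting.
m <- 9
quants <- (1 + sin(1:m / (m + 1) * pi - pi/2)) / 2

round(quants, 3)        # probabilities, symmetric about 0.5
gaps <- diff(quants)    # spacing between consecutive probabilities
round(gaps, 3)          # small near the ends, largest around the median
```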

To illustrate, this simulated dataset shows a structural difference between two datasets of approximately 1.2 million values as well as a very small amount of "contamination" in one of them. Also, to make this test stringent, an interval of values is excluded from one of the datasets altogether: the QQ plot needs to show a break for those values.

set.seed(17)
n.x <- 1.21 * 10^6
n.y <- 1.20 * 10^6
k <- floor(0.0001*n.x)
x <- c(rnorm(n.x-k), rnorm(k, mean=2, sd=2))
x <- x[x <= -3 | x >= -2.5]
y <- rbeta(n.y, 10,13)

We can subsample 0.1% of each dataset and include another 0.1% from its extremes (0.05% at each end), giving 2420 points to plot. Total elapsed time is less than 0.5 seconds:

m <- .001 * max(n.x, n.y)
e <- floor(0.0005 * max(n.x, n.y))

system.time(
  plot(quant.subsample(x, m, e), 
       quant.subsample(y, m, e), 
       pch=".", cex=4,
       xlab="x", ylab="y", main="QQ Plot")
  )
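The point count quoted above can be checked directly. The sketch below repeats quant.subsample so that it runs on its own, then applies it to a smaller stand-in vector z (a hypothetical dataset, used only to keep the check fast) with the same m and e.

```r
# Copy of quant.subsample from the answer, repeated so this block is self-contained
quant.subsample <- function(y, m=100, e=1) {
  x <- sort(y)
  n <- length(x)
  quants <- (1 + sin(1:m / (m+1) * pi - pi/2))/2
  sort(c(x[1:e], quantile(x, probs=quants), x[(n+1-e):n]))
}

n.max <- 1.21 * 10^6           # size of the larger dataset
m <- .001 * n.max              # systematic quantiles: 0.1% of n.max
e <- floor(0.0005 * n.max)     # extremes kept at each end: 0.05% of n.max
n.points <- m + 2*e            # values returned per dataset

set.seed(1)
z <- rnorm(10^5)               # hypothetical stand-in data, smaller than the original
s <- quant.subsample(z, m, e)
c(expected = n.points, actual = length(s))
```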

No information is lost whatsoever:

[Figure: QQ plot]

Related Solutions

Solved – Is Lorenz curve the same as QQ-plot

The Lorenz curve is just a cumulative distribution function for a random variable bounded between 0 and 1, e.g., a proportion. In economics, the Lorenz curve asks, "what fraction of income is earned by the lowest x% of earners?" Typically, it is compared to the uniform distribution over [0,1], a distribution that would arise under perfect equality in income. The Gini coefficient is the area under the perfect equality curve less the area under the Lorenz curve, normalized by the area under the perfect equality curve; note that the area under the perfect equality curve is equal to 0.5.
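The Gini calculation described above can be sketched in a few lines of R. The function name lorenz.gini and the toy income vectors are made up for illustration; the area under the Lorenz curve is approximated with the trapezoid rule on the empirical curve, which is one simple choice among several.

```r
# Hypothetical helper: Gini coefficient from the empirical Lorenz curve
lorenz.gini <- function(income) {
  inc <- sort(income)
  n <- length(inc)
  p <- c(0, (1:n) / n)                  # cumulative share of earners
  L <- c(0, cumsum(inc) / sum(inc))     # cumulative share of income (Lorenz curve)
  # Trapezoid-rule area under the Lorenz curve
  area <- sum(diff(p) * (L[-1] + L[-length(L)]) / 2)
  # Area under the perfect-equality line is 0.5
  (0.5 - area) / 0.5
}

g.equal <- lorenz.gini(rep(10, 5))           # everyone earns the same: about 0
g.single <- lorenz.gini(c(0, 0, 0, 0, 100))  # one earner holds everything: about 0.8
c(equal = g.equal, single = g.single)
```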

So, to be clear, while a QQ plot compares two distributions to one another, the Lorenz curve considers only one distribution that has a range over [0,1].

Solved – Why is the fit line of a QQ (normality) plot not y=x, reflecting the theoretical normal distribution

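The y-axis is labelled "Quantiles of Input Samples", so having a fixed line at $y=x$ would be useless except for the case where the samples are actually generated from $N(0,1)$. In fact, the line is a function of the data. In R, for example, qqline by default draws the line through the first and third quartiles of the sample and of the reference distribution, so for roughly normal data its intercept is close to $\bar{x}$ and its slope is close to $s$.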

Here is an illustration in R (note that the left frame has intercept $\approx 1$ and the right frame has slope $\approx 2$):

x11 <- rnorm(100,1) # 100 draws from N(1,1): mean 1, sd 1
x02 <- rnorm(100,0,2) # 100 draws from N(0,2): mean 0, sd 2
par(mfrow=c(1,2)) # two frames side by side
qqnorm(x11, main="N(1,1)")
qqline(x11)
abline(v=c(0,1), h=c(1,2), lty=2) # dashed reference lines
qqnorm(x02, main="N(0,2)")
qqline(x02)
abline(v=c(0,1), h=c(0,2), lty=2)
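To see where such a line comes from, the sketch below reconstructs it by hand, assuming the default behaviour of R's qqline (a line through the first and third quartiles of the sample and of the standard normal). The seed and the variable names are arbitrary; the result is compared with the sample mean and standard deviation, which it should approximate for roughly normal data.

```r
set.seed(1)                               # arbitrary seed for reproducibility
x <- rnorm(100, 0, 2)                     # 100 draws with mean 0, sd 2

q.sample <- quantile(x, c(0.25, 0.75))    # sample quartiles (y-axis)
q.theory <- qnorm(c(0.25, 0.75))          # standard normal quartiles (x-axis)

# Line through the two quartile points
slope <- unname(diff(q.sample) / diff(q.theory))
intercept <- unname(q.sample[1] - slope * q.theory[1])

rbind(line   = c(intercept = intercept, slope = slope),
      moment = c(intercept = mean(x),   slope = sd(x)))
```

The two rows will not match exactly, since the quartile-based line is less sensitive to the tails than the mean and standard deviation are.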
