Quantiles – How to Generate QQ Plot for Sets of Different Sizes


I'm coming from the question "Similarity between two sets of random values" (which still stands, BTW), where @whuber suggested in the comments that I use a Q-Q plot to assess the similarity of two discrete distributions.

To recapitulate, I have two sets of random floats in [0,1], of different sizes:

$$A = \{0.3637852, 0.2330702, 0.1683102, 0.2127219, 0.0152532, …, N_A\}$$
$$B = \{0.4541056, 0.7521812, 0.0266602, 0.5099002, 0.3468181, …, N_B\}$$

where $N_B > N_A$, and I need to generate the Q-Q plot for these sets. From what I've read, if the sets were of equal size ($N_B = N_A$), I would simply sort both sets in ascending order and then plot one against the other in a 1:1 correspondence.

My sets differ in size, and the sources I've found [1], [2] both say that I need to sort both sets and then interpolate the larger set ($B$ in my case). This is counter-intuitive to me: I would expect to have to generate more points from the smaller set $A$ (through interpolation) so that I can plot this expanded set $A'$ against $B$.

So obviously I'm not understanding the process correctly. What steps should I follow to generate a Q-Q plot for two distributions like the ones I presented above?

Best Answer

Generate some data:

 A <- rgamma(50,4,.1)
 B <- rnorm(30,40,20)

And then:

 plot(sort(B),sort(quantile(A,probs=ppoints(B))))

[Figure: Q-Q plot with interpolated A]

Here I interpolated the larger set (that is, estimated 30 quantiles from 50 points); you can do the same to "interpolate" the smaller set but I don't think that helps at all, since the extra 'information' is really just a function of your quantile interpolation function.
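Since the question targets Python, here is a minimal pure-Python sketch of the same idea as the R one-liner above: sort the smaller set, then read off matching quantiles from the larger one. The helper names `ppoints`, `quantile` and `qq_pairs` are illustrative, not from any library. The `ppoints` helper mirrors the default offsets of R's function of the same name. The interpolation rule is assumed to be simple linear interpolation, which is R's default `quantile` type.

```python
import math
import random

def ppoints(n, a=None):
    """Plotting positions (i - a) / (n + 1 - 2a), with R-like default offsets."""
    if a is None:
        a = 3.0 / 8.0 if n <= 10 else 0.5
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]

def quantile(data, p):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def qq_pairs(small, large):
    """Pair each sorted value of `small` with the matching quantile of `large`."""
    probs = ppoints(len(small))
    return [(x, quantile(large, p)) for x, p in zip(sorted(small), probs)]

# Illustrative data roughly matching the R example (shape 4, rate 0.1 means scale 10).
rng = random.Random(0)
A = [rng.gammavariate(4, 10) for _ in range(50)]
B = [rng.gauss(40, 20) for _ in range(30)]

pairs = qq_pairs(B, A)  # 30 points: sorted B against interpolated quantiles of A
```

Each tuple in `pairs` is one point of the Q-Q plot, so the two coordinates can be passed to any scatter-plot routine.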

That looks like this:

[Figure: Q-Q plot with the smaller set interpolated]

Here they are both plotted together (with the second plot above now having X and Y swapped so they both have the same variable on the same axis); the greater number of points has been plotted with the smaller, red circles. I think there's not really any additional information of any real value in the plot with more points, but you are free to disagree:

[Figure: both Q-Q plots overlaid, with the larger number of points drawn as smaller red circles]

So in short, look at ?quantile and ?ppoints to see what's going on.

edit: sorry, I just noticed you have $N_B>N_A$. Mine is the other way around. I assume you can interchange the roles of A and B.

By the way, you can easily avoid sorting the sets; it's just that ppoints doesn't return things in the same order as its argument and I was being lazy.

If you define alpha=.5 then quantile(A,probs=(rank(B)-alpha)/(length(B)+1-2*alpha)) should reproduce the above plot without calling sort on either argument to plot.

(personally, I prefer values nearer alpha=.375 but not so much that I would bother fiddling with the ppoints default in most cases)
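The rank-based formula translates to Python along these lines. This is a sketch under two assumptions: the data contain no ties (so plain ranks are enough), and quantiles are linearly interpolated. The helper names `ranks`, `quantile` and `qq_unsorted` are made up for illustration.

```python
import math

def quantile(data, p):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def ranks(values):
    """1-based ranks of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def qq_unsorted(small, large, alpha=0.5):
    """Q-Q points in the original order of `small`, with no sorting of `small`."""
    n = len(small)
    probs = [(r - alpha) / (n + 1 - 2 * alpha) for r in ranks(small)]
    return [(x, quantile(large, p)) for x, p in zip(small, probs)]

points = qq_unsorted([3.0, 1.0, 2.0], [10, 20, 30, 40, 50])
```

The points come back in the order of the first argument. A scatter plot does not care about point order, so the picture is the same as with the sorted version.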

That should be enough detail that you can implement it in Python, I think.

Related Solutions

Solved – Q-Q plot and sample size

I think there is less here than meets the eye. You need to recognize that the appearance of these plots will bounce around with different data. I modified your code with:

set.seed(2501)
par(mfrow=c(3,3), pty="s")

And then ran the rest of your code three times. Here is the resulting plot:

[Figure: 3×3 grid of Q-Q plots from three runs of the code]

Sometimes the distinction between the left and center plots is clear and sometimes it isn't. That's the way it goes. Data are information. More data give you more information (all else being equal), and it is easier to see / figure out what you want to know.

One thing that may help you is to explore the qqPlot function in the car package, which plots a 95% confidence band around the plot. The band shows how much a dataset might vary from the ideal form, which helps you judge the deviations you see in your observed data. Here it is with the last iteration of y:

[Figure: Q-Q plot from qqPlot with a 95% confidence band]

Given the amount that 100 data can vary from the ideal, you just don't have enough information to reject the possibility of normality for these data (even though they were drawn from a $t$-distribution with 3 degrees of freedom).
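To get a feel for how much a normal sample of this size can wander, one option is a simulated pointwise band. The sketch below is a plain Monte Carlo illustration and is not claimed to be how qqPlot computes its band. The function name `normal_qq_envelope` and the settings (500 simulations, seed 0) are arbitrary choices for the example.

```python
import random

def normal_qq_envelope(n, sims=500, level=0.95, seed=0):
    """Pointwise Monte Carlo band for the order statistics of n standard normal draws."""
    rng = random.Random(seed)
    # Each row is one simulated sample, sorted so that position i holds
    # the i-th order statistic of that sample.
    samples = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(sims)]
    lo_idx = int((1 - level) / 2 * (sims - 1))
    hi_idx = (sims - 1) - lo_idx
    lower, upper = [], []
    for i in range(n):
        column = sorted(row[i] for row in samples)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return lower, upper

lower, upper = normal_qq_envelope(100)
```

The band is noticeably wider at the extremes than in the middle, which is why tail points that stray from the line are weaker evidence against normality than they may look.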

Normal Distribution – Why is the QQ Plot for Normal Distribution a Straight Line?

Why is the QQ Plot for Normal Distribution a Straight Line?

The values on the y-axis are the sorted data values (the order statistics).

The values on the x-axis are what you'd expect sorted data (at the same sample size) from a standard normal distribution to give. That is, the smallest data value is paired with what you expect the smallest data value from a standard normal distribution of the same sample size to be, the second smallest data value is paired with the expected second-smallest data value from a standard normal distribution, and so on up to the largest value.

Since any normal distribution is a scaled and shifted standard normal distribution, and scaling and shifting just change the axes, not the appearance of the plot, samples drawn from any normal distribution should yield a plot where the values are close to a straight line (if the values are really from a normal distribution, the plot can wiggle away from a straight line but it won't deviate in a consistent fashion from a straight line).
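A small Python check of the scale-and-shift argument, as a sketch: the x-values use the plotting positions $(i - 0.5)/n$ as a common approximation to the expected order statistics (an assumption of this example), and `pearson` is a hand-written helper. The sample size, seed, and the N(40, 20) parameters are arbitrary.

```python
import random
from statistics import NormalDist

def pearson(x, y):
    """Pearson correlation, written out to keep the example self-contained."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 200
z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # standard normal sample
y = sorted(40 + 20 * v for v in z)            # the same sample as N(40, 20)

# Approximate expected order statistics of a standard normal sample of size n.
probs = [(i - 0.5) / n for i in range(1, n + 1)]
theoretical = [NormalDist().inv_cdf(p) for p in probs]

r_standard = pearson(theoretical, sorted(z))
r_shifted = pearson(theoretical, y)
```

The two correlations agree up to rounding, because scaling and shifting only relabel the y-axis. Both are close to 1, which is the numerical counterpart of the points lying near a straight line.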

Also, according to the accepted answer to "Percentile vs quantile vs quartile", quantiles are in the range [0,1], but QQ plots show quantiles that are clearly outside the [0, 1] range.

The answers there are worded in a way that might mislead you. The quantile function takes values between 0 and 1 as its argument, and produces values in the range taken by the original random variable or sample (depending on whether population or sample quantiles are under discussion).

So with a standard normal distribution (which can take values between $-\infty$ and $\infty$), $q(\frac12)$ is the median ($0$), $q(0.025)$ is the 2.5th percentile ($-1.96$), and so on: $q=F^{-1}(p)$ for $0< p< 1$ (equality can occur for distributions that are not on an infinite range). For a sample quantile, there are various possible definitions (R offers nine different ones), but they all attempt to choose values so as to give approximately the right proportion of data below them.
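The distinction between the argument $p$ and the value $q(p)$ can be checked in a couple of lines of Python, using `NormalDist.inv_cdf` from the standard library as $F^{-1}$ (the variable names are just for illustration):

```python
from statistics import NormalDist

q = NormalDist(mu=0.0, sigma=1.0).inv_cdf  # q = F^{-1} for the standard normal

median = q(0.5)        # the argument 0.5 is in (0, 1); the result is about 0
lower_tail = q(0.025)  # about -1.96
upper_tail = q(0.975)  # about 1.96, well outside [0, 1]
```

The inputs all lie strictly between 0 and 1, while the outputs live on the scale of the variable itself, which is what a Q-Q plot's axes show.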
