Quantiles – How to Generate QQ Plot for Sets of Different Sizes

qq-plot, quantiles

This follows on from my earlier question, "Similarity between two sets of random values" (which still stands, BTW), where @whuber suggested in the comments that I use a Q-Q plot to assess the similarity of two discrete distributions.

To recapitulate, I have two sets of random floats in [0,1], of different sizes:

$$A = \{0.3637852, 0.2330702, 0.1683102, 0.2127219, 0.0152532, \dots\}, \quad |A| = N_A$$
$$B = \{0.4541056, 0.7521812, 0.0266602, 0.5099002, 0.3468181, \dots\}, \quad |B| = N_B$$

where $N_B > N_A$.

I need to generate the Q-Q plot for these sets. From what I've read, if the sets were of equal size ($N_B = N_A$), I would simply sort both sets in ascending order and then plot one against the other in a 1:1 correspondence.
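For reference, the equal-size case amounts to something like the following Python sketch. The function name `qq_pairs_equal` is made up for illustration, and the toy data are just the first five values of each set listed above.

```python
def qq_pairs_equal(a, b):
    """Pair the i-th smallest value of a with the i-th smallest value of b."""
    if len(a) != len(b):
        raise ValueError("this shortcut only applies to equal-sized samples")
    return list(zip(sorted(a), sorted(b)))

# Toy example: the first five values of A and B quoted in the question.
a = [0.3637852, 0.2330702, 0.1683102, 0.2127219, 0.0152532]
b = [0.4541056, 0.7521812, 0.0266602, 0.5099002, 0.3468181]

# Each pair is one point (x, y) of the Q-Q plot.
pairs = qq_pairs_equal(a, b)
```

Plotting the first element of each pair against the second gives the Q-Q plot; any plotting library would do for that step.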

My sets differ in size, and the sources I've found [1], [2] both say that I need to sort both sets and then interpolate the larger set ($B$ in my case). This is counter-intuitive to me, since I would expect to have to generate more points from the smaller set $A$ (through interpolation) so that I could plot this expanded set $A'$ against $B$.

So I'm obviously not understanding the process correctly. What steps should I follow to generate a Q-Q plot for two distributions like the ones presented above?

Best Answer

Generate some data:

 # A: 50 gamma draws (shape 4, rate 0.1); B: 30 normal draws (mean 40, sd 20)
 A <- rgamma(50, 4, .1)
 B <- rnorm(30, 40, 20)

And then:

 # 30 interpolated quantiles of A at B's plotting positions, against sorted B
 plot(sort(B),sort(quantile(A,probs=ppoints(B))))

[Figure: Q-Q plot with interpolated A]

Here I interpolated the larger set (that is, estimated 30 quantiles from 50 points). You can do the same to "interpolate" the smaller set, but I don't think that helps at all, since the extra 'information' is really just a product of your quantile interpolation function.

That looks like this:

[Figure: Q-Q plot with the smaller set interpolated]

Here they are both plotted together (with the second plot above now having X and Y swapped so they both have the same variable on the same axis); the greater number of points has been plotted with the smaller, red circles. I think there's not really any additional information of any real value in the plot with more points, but you are free to disagree:

[Figure: both Q-Q plots overlaid; the plot with more points is drawn with smaller, red circles]

So in short, look at ?quantile and ?ppoints to see what's going on.
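For anyone without R to hand, here is a rough pure-Python sketch of what those two functions contribute. It assumes the plotting-position formula and default described in ?ppoints and the linear-interpolation rule that quantile uses by default (type 7). The helper names `quantile_sorted` and `qq_points` are my own for illustration, not part of any library.

```python
import math

def ppoints(n, a=None):
    """Plotting positions (i - a) / (n + 1 - 2a) for i = 1..n.

    The default for `a` follows what ?ppoints documents:
    3/8 when n <= 10, otherwise 1/2.
    """
    if a is None:
        a = 3 / 8 if n <= 10 else 1 / 2
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]

def quantile_sorted(xs, p):
    """Linear interpolation between order statistics of the sorted list xs
    (assumed to match R's default quantile rule, type 7)."""
    h = (len(xs) - 1) * p              # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)      # clamp so p = 1 stays in range
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def qq_points(fewer, more):
    """Pair each sorted value of the set with fewer points with the
    interpolated quantile of the set with more points."""
    xs = sorted(more)
    probs = ppoints(len(fewer))
    return [(v, quantile_sorted(xs, p)) for v, p in zip(sorted(fewer), probs)]
```

Calling `qq_points(B, A)` with the sizes used here (30 and 50) returns 30 (x, y) pairs to scatter-plot. With $N_B > N_A$, as in the question, the arguments would simply be swapped.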

edit: sorry, I just noticed you have $N_B>N_A$. Mine is the other way around. I assume you can interchange the roles of A and B.

By the way, you can easily avoid sorting the sets; it's just that ppoints doesn't return things in the same order as its argument (it always returns increasing probabilities), and I was being lazy.

If you define alpha=.5 (the ppoints default when there are more than 10 points), then quantile(A,probs=(rank(B)-alpha)/(length(B)+1-2*alpha)) should reproduce the above plot without calling sort on either argument to plot.

(Personally, I prefer values nearer alpha=.375, but not so much that I would bother fiddling with the ppoints default in most cases.)
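A minimal Python sketch of that rank-based version, assuming there are no ties in the data (R's rank averages tied ranks by default, which this does not attempt). The names `ranks` and `rank_probs` are illustrative only.

```python
def ranks(values):
    """1-based rank of each value, in the original order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

def rank_probs(values, alpha=0.5):
    """Probabilities (rank - alpha) / (n + 1 - 2*alpha), one per value,
    kept in the same order as `values` so no sorting is needed later."""
    n = len(values)
    return [(r - alpha) / (n + 1 - 2 * alpha) for r in ranks(values)]

# Toy example: three unsorted values.
probs = rank_probs([0.3, 0.1, 0.2])
```

Each probability lines up with its own unsorted value, so evaluating the other set's quantile function at these probabilities gives the y-coordinates directly. Passing `alpha=0.375` gives the alternative mentioned above.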

That should be enough detail that you can implement it in Python, I think.
