Statistical Plots – How to Read the X-Axis of a Q-Q Plot?

normal distributionqq-plot

I am new to qqplot and trying to figure out the labels on the x-axis of this plot:

enter image description here

The left plot is the Verbal SAT score of 105 college students majoring in CS. The right plot is the GPA score. Source of the plot

I understand that the qqplot means "quantile vs quantile", the y-axis is each student's test score, but what is the x-axis and why it is from -2 to 2?


Update: For anybody whose looking for a more visual explanation of QQ-plot, check out this video on youtube: https://www.youtube.com/watch?v=X9_ISJ0YpGw

After reading and watching dozens of articles and videos, let me explain what is the x-axis and the y-axis in layman's term.

The y-axis is the values you recorded. Like the verbal SAT score in my example. Because there are 105 of verbal SAT sores, there should be 105 values on the x-axis as well.

There are three steps involved in order to get these 105 values:

Step 1

Sort the 105 values in ascending order and give each one of them an index number starting from 1, 2, 3 …. all the way up to 105. Let's use i to denote each index number.

Step 2

Use this formula to find a list of percentages (I will explain what are they later, just keep on reading):

$\left(\dfrac{i-0.5}{n}\right)$

So you will get 105 values that looks like this:

$\left(\dfrac{1-0.5}{105}\right)$, $\left(\dfrac{2-0.5}{105}\right)$, $\left(\dfrac{3-0.5}{105}\right)$$\left(\dfrac{105-0.5}{105}\right)$

Step 3

Find the z-score of the values we just calculated in step 2. These z-scores are the x-axis of the qq-plot

Now, plot the verbal SAT scores as the y-axis and the z-scores as the x-axis, you will get a qq-plot.


Ok, now it is the time to explain the meaning of step 2 & 3. Let's start with the definition of cumulative distribution function.

Let's say we have a test and the maximum score a participant can get is 100 points. After taking the test, 50% of the participants scored 60 points or less, 40% of the participants scored 50 points or less, 30% of the participants scored 40 points or less, and so on.

The cumulative distribution function (aka. CDF) takes one input, a test score, and it outputs a percentage which tells you the percentage of people who scored less than or equal to that test score. With the example case above, If I put 60 points into the CDF function, I should get 50% back. If I put 50 points I get 40%.

The inverse of the cumulative distribution function takes one input, a percentage and it outputs the corresponding test score. If I put in 50%, it should give me 60 points. It does nothing fancy other than reverse the function of CDF.

Step 2 basically calculated a list of percentages and when we feed that list of percentages into Step 3 and get back a list of z-scores. So Step 2 + Step 3 = calculating the inverse of the cumulative distribution function.

The list of percentages calculated in Step 2 looks like this: 0.005, 0.014, … 0.995. The list of z-scores in Step 3 looks like this: -2.575829304, -2.197286377, … 2.575829304. It means for participants who scored less than or equal to 0.5% (0.005 x 100) of the total participants, the corresponding z-score is -2.57.

So what does the qq-plot tell us in our example: If the verbal SAT score indeed follows a standard normal distribution (mean=0, variance=1), then the outline of the histogram of the test scores should resemble the bell curve of the standard normal distribution.

But it is really hard to judge if a histogram matches the shape of a bell curve, a straight line is probably a better choice for the naked eyes. Let's match the observed values against the z-scores.

Best Answer

This seems to be a qqplot of the data compared with a standard normal distribution, so I would have thought the $x$ values should the typical values of the population quantiles of a standard normal distribution

So with $105$ observations I would have thought the extreme left $x$ value should be not far away from $\Phi^{-1}\left(\dfrac{0.5}{105}\right) \approx -2.59$ and the one next to it near $\Phi^{-1}\left(\dfrac{1.5}{105}\right) \approx -2.19$, with the extreme right values being the corresponding $\Phi^{-1}\left(\dfrac{104.5}{105}\right) \approx +2.59$ and $\Phi^{-1}\left(\dfrac{103.5}{105}\right) \approx +2.19$. Visually, this seems to be close to what you have in the charts

Related Question