Solved – Formal definition of the qqline used in a Q-Q plot

distributions, fitting, heavy-tailed, r

I'm doing some distribution fitting work and I'm looking at Q-Q plots and how they can be used visually to interpret goodness of fit.

My data is heavy-tailed so I am looking at Weibull, log-normal, Pareto and log-logistic distributions initially.

For a Weibull distribution, I understand how the points on the Q-Q plot are constructed (using the quantiles of observed data vs. the quantiles of an estimated Weibull distribution). The piece I am not clear on is how the line used in Q-Q plots is calculated/constructed.

The R documentation for the qqplot() function provides the following description:

qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.

Another post on Cross Validated seems to indicate that the line is essentially a line constructed from the parameters of the theoretical (estimated) distribution. Is this a true statement and correct interpretation?

If a link to a formal definition could be provided I'd very much appreciate it.

Best Answer

Sort of "both": the line depends on both the observed quantiles (which define the y-axis of the QQ plot) and the expected/theoretical/reference quantiles (which define the x-axis). The documentation (which you quote) should always be taken as the canonical reference:

‘qqline’ adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the ‘probs’ quantiles, by default the first and third quartiles.

If in doubt, UTSL ("Use the Source, Luke"). Here is a slightly abridged and commented version of the qqline source:

 ## quantiles (.25 and 0.75 by default) of data
 y <- quantile(y, probs, names=FALSE, type=qtype, na.rm = TRUE)
 ## quantiles of reference/theoretical distribution
 x <- distribution(probs)
 ## ...
 slope <- diff(y)/diff(x)  ## observed slope between quantiles
 int <- y[1L]-slope*x[1L]  ## intercept
 abline(int, slope, ...)   ## draw the line
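The abridged source above can be turned into a small runnable check. This is only a sketch: the exponential sample and the seed are arbitrary choices made for illustration, and the variable names (yq, xq) are not from the original source. It computes the line by hand from the two quartile pairs and overlays it on the one drawn by qqline(); the two should coincide.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
y     <- rexp(200)               # illustrative skewed sample (an assumption)
probs <- c(0.25, 0.75)           # the qqline() defaults

# Observed quartiles of the data (type 7 is the default qtype in qqline)
yq <- quantile(y, probs, names = FALSE, type = 7, na.rm = TRUE)
# Quartiles of the reference distribution (standard normal by default)
xq <- qnorm(probs)

slope <- diff(yq) / diff(xq)     # slope between the two quartile points
int   <- yq[1L] - slope * xq[1L] # intercept

qqnorm(y)
qqline(y, col = 2, lwd = 2)            # line drawn by R
abline(int, slope, col = 4, lty = 2)   # hand-computed line, should overlap
```

By construction the line passes through the two points (xq[1], yq[1]) and (xq[2], yq[2]), so it is determined by both the data and the reference distribution.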

For what it's worth, I believe that this approach (line connecting central quantiles) is used because it fulfills the following criteria for exploratory/diagnostic approaches:

quick (e.g. no need to run a linear regression, just find the quantiles and draw a straight line)
robust (it only depends on the behavior of the central part of the distribution, won't be thrown off by weird tails)
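For the Weibull case raised in the question, the same construction can be sketched by passing a fitted quantile function to qqline() through its distribution argument. The data here are simulated as a stand-in for real observations, the parameters are estimated with MASS::fitdistr(), and the helper name qfit is made up for this example.

```r
library(MASS)                    # recommended package shipped with R; provides fitdistr()

set.seed(42)                     # arbitrary seed
y <- rweibull(500, shape = 1.5, scale = 2)   # simulated stand-in for real data

fit <- fitdistr(y, "weibull")    # maximum-likelihood estimates
shp <- fit$estimate[["shape"]]
scl <- fit$estimate[["scale"]]

# Quantile function of the fitted Weibull (hypothetical helper name)
qfit <- function(p) qweibull(p, shape = shp, scale = scl)

# Q-Q plot: fitted-Weibull quantiles on the x-axis, sample quantiles on the y-axis
qqplot(qfit(ppoints(length(y))), y,
       xlab = "Fitted Weibull quantiles", ylab = "Sample quantiles")
qqline(y, distribution = qfit, col = 2)   # line through the two quartile points
abline(0, 1, lty = 2)                     # identity line, for comparison
```

The qqline() result is not built from the fitted parameters alone: it joins the observed quartiles to the fitted quartiles. When the fit is good it lies close to the identity line, and a visible gap between the two is itself a hint of misfit in the central part of the distribution.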

Related Solutions

r – Extensions to Default Diagnostic Plots for lm in R

Package car has quite a lot of useful functions for diagnostic plots of linear and generalized linear models. Compared to vanilla R plots, they are often enhanced with additional information. I recommend you try example("<function>") on the following functions to see what the plots look like. All plots are described in detail in chapter 6 of Fox & Weisberg. 2011. An R Companion to Applied Regression. 2nd ed.

residualPlots() plots Pearson residuals against each predictor (scatterplots for numeric variables including a Lowess fit, boxplots for factors)
marginalModelPlots() displays scatterplots of the response variable against each numeric predictor, including a Lowess fit
avPlots() displays partial-regression plots: for each predictor, this is a scatterplot of a) the residuals from the regression of the response variable on all other predictors against b) the residuals from the regression of the predictor against all other predictors
qqPlot() for a quantile-quantile plot which includes a confidence envelope
influenceIndexPlot() displays each value for Cook's distance, hat-value, p-value for outlier test, and studentized residual in a spike-plot against the observation index
influencePlot() gives a bubble-plot of studentized residuals against hat-values, with the size of the bubble corresponding to Cook's distance, also see dfbetaPlots() and leveragePlots()
boxCox() displays a profile of the log-likelihood for the transformation parameter $\lambda$ in a Box-Cox power-transform
crPlots() is for component + residual plots, a variant of which are CERES plots (Combining conditional Expectations and RESiduals), provided by ceresPlots()
spreadLevelPlot() is for assessing non-constant error variance and displays absolute studentized residuals against fitted values
scatterplot() provides much-enhanced scatterplots including boxplots along the axes, confidence ellipses for the bivariate distribution, and prediction lines with confidence bands
scatter3d() is based on package rgl and displays interactive 3D-scatterplots including wire-mesh confidence ellipsoids and prediction planes, make sure to run example("scatter3d")
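A few of these functions can be tried on a small model as follows. This is a minimal sketch that assumes the car package is installed; the mtcars model is an arbitrary stand-in chosen only because the data set ships with R.

```r
# Arbitrary illustrative model on a built-in data set
m <- lm(mpg ~ wt + hp, data = mtcars)

if (requireNamespace("car", quietly = TRUE)) {
  car::qqPlot(m)               # Q-Q plot of studentized residuals with envelope
  car::residualPlots(m)        # Pearson residuals against each predictor
  car::avPlots(m)              # partial-regression (added-variable) plots
  car::influenceIndexPlot(m)   # Cook's distance, hat-values, etc. by index
} else {
  message("Package 'car' is not installed; skipping the diagnostic plots.")
}
```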

In addition, have a look at bplot() from package rms for another approach to illustrating the joint distribution of three variables.

R Programming – Use of the Line Produced by qqline() in R

As you can see in the plot obtained by

 y <- rnorm(2000) * 4 - 4
 qqnorm(y); qqline(y, col = 2, lwd = 2, lty = 2)

the diagonal would not make sense because the first axis is scaled in terms of the theoretical quantiles of a $\mathcal{N}(0,1)$ distribution. I think using the first and third quartiles to set the line gives a robust approach for estimating the parameters of the normal distribution, when compared with using the empirical mean and variance, say. Departures from the line (except in the tails) are indicative of a lack of normality.
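The remark about robust parameter estimates can be checked numerically. In this sketch (the seed is arbitrary and the names yq and xq are illustrative), the slope and intercept of the quartile line are computed for the same kind of sample and compared with the sample mean and standard deviation; for normal data the intercept estimates the mean and the slope estimates the standard deviation.

```r
set.seed(123)                    # arbitrary seed
y <- rnorm(2000) * 4 - 4         # same construction as above: mean -4, sd 4

probs <- c(0.25, 0.75)
yq <- quantile(y, probs, names = FALSE)   # observed quartiles
xq <- qnorm(probs)                        # standard normal quartiles

slope <- diff(yq) / diff(xq)     # quartile-based estimate of the sd
int   <- yq[1L] - slope * xq[1L] # quartile-based estimate of the mean

print(c(intercept = int, slope = slope))   # estimates from the Q-Q line
print(c(mean = mean(y), sd = sd(y)))       # moment-based estimates
```

Because the standard normal quartiles are symmetric about zero, the intercept is simply the midpoint of the two sample quartiles and the slope is IQR(y) divided by the interquartile range of the standard normal, so neither is affected by a handful of extreme observations.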
