MATLAB: How are the fitted line and the y-axis (probabilities) levels within probplot computed

Statistics and Machine Learning Toolbox

Exactly how are the fitted line and the reference levels on the y axis computed for probplot?
I could not find that piece of information in the documentation.

Best Answer

This information has been added to the documentation as of MATLAB R2017b, as seen in the "algorithms" section of the "probplot" page.
For releases prior to MATLAB R2017b, here is a detailed explanation of how this is implemented within MATLAB:
1. Computation of the reference line:
Suppose random variable X has a lognormal distribution. Then by definition, log(X) has a normal distribution. As a result, there exist constants A and B > 0 such that log(Y) = A + B*log(X) has a normal distribution with mean mu = 0 and standard deviation sigma = 1. Equivalently, Y has a lognormal distribution with parameters mu = 0 and sigma = 1. Since B > 0,
Prob( X <= exp(t) ) = Prob( log(X) <= t ) = Prob( (log(Y) - A)/B <= t ) = Prob( log(Y) <= A + B*t ) = Prob( Y <= exp(A + B*t) ) = FY( exp(A + B*t) )
Where FY( . ) is the CDF of a lognormal distribution with parameters mu = 0 and sigma = 1.
Suppose we are given N observations of X, say x_1, x_2, ..., x_N. To estimate A and B, probplot uses the 25 and 75 percentiles of values x_i. Let q25 = 0.25 and q75 = 0.75. Let exp(t25) and exp(t75) be such that:
Prob( X <= exp(t25) ) = q25
Prob( X <= exp(t75) ) = q75
exp(t25) can be approximated by quantile([x_1,...,x_N],q25). Similar logic holds for approximating exp(t75). Therefore, we can approximate t25 and t75 like this:
t25 ~= log(quantile([x_1,...,x_N],q25)).
t75 ~= log(quantile([x_1,...,x_N],q75)).
Now,
q25 = Prob( X <= exp(t25) ) = FY( exp(A + B*t25) )
q75 = Prob( X <= exp(t75) ) = FY( exp(A + B*t75) )
Equivalently,
invFY(q25) = exp(A + B*t25)
invFY(q75) = exp(A + B*t75)
where invFY( . ) is the inverse CDF of a lognormal distribution with parameters mu = 0 and sigma = 1. Taking natural logs:
log(invFY(q25)) = A + B*t25
log(invFY(q75)) = A + B*t75
These two equations can now be used to solve for A and B. Notice that B > 0 as required.
2. Lines created in probplot and the y-axis labeling:
Suppose x_(1), x_(2),...,x_(N) are the sorted observations in increasing order. I assume no ties in the x_(i) to keep things simple. For very small delta, Prob( X <= x_(i) - delta ) ~= (i-1)/N and Prob( X <= x_(i) + delta ) ~= i/N. Hence, a sensible estimate of
Prob( X <= x_(i) ) ~= average of (i-1)/N and i/N = (i - 0.5)/N (the midpoint rule).
If x_i's do really come from a lognormal distribution then [using t = log(x_(i)) in the first equation]:
(i - 0.5)/N ~= Prob( X <= x_(i) ) = Prob( X <= exp(log(x_(i))) ) = FY( exp(A + B*log(x_(i))) )
or
log( invFY( (i - 0.5)/N ) ) ~= A + B*log(x_(i))
We would expect a plot of log(x_(i)) vs. log( invFY( (i - 0.5)/N ) ) to be a straight line. We plot two lines:
(a) A plot of log(x_(i)) vs. A + B*log(x_(i)) [the expected behavior if x_i's do really come from a lognormal distribution]
(b) A plot of log(x_(i)) vs. log( invFY( (i - 0.5)/N ) ) [the observed behavior]
Then we visually compare (a) and (b). We could plot (a) for a range of x values not necessarily restricted to x_(i). If you put a breakpoint in the probplot subfunction setprobticks, you can see that probplot changes the labels on the y-axis to be on the probability scale. The datatips are converted into probabilities in the probplotDatatipCallback listener.
If y is a value on the y-axis, then for some p, y = log( invFY( p ) ). Inverting FY(exp(y)) = p. The tick label for y is the string p.