Suppose we have the CDF for the Pareto Distribution given by:
$$ P(X \leq x) = 1-\left(\frac{x_m}{x}\right)^\alpha \;\;\;\;\;\;\;\;\;\; x \geq x_m$$
What is the intuitive way to find the alpha for which the 80/20 rule holds?
distributionspareto-distribution
Suppose we have the CDF for the Pareto Distribution given by:
$$ P(X \leq x) = 1-\left(\frac{x_m}{x}\right)^\alpha \;\;\;\;\;\;\;\;\;\; x \geq x_m$$
What is the intuitive way to find the alpha for which the 80/20 rule holds?
By using ecdf
and density
, you're not actually doing the Pareto calculations, but instead using estimates based on a sample that are, by their non-parametric nature, not guaranteed (read: not going to) have the desired property.
Try the following:
x <- seq(0.1,10,by=0.1)
fx <- dpareto(x, 1.5, 0.05)
Fx <- ppareto(x, 1.5, 0.05)
plot((1-Fx)/fx ~ x)
You'll get the nice straight line out:
First, it's well known that $\sum_{i=1}^n \log x_i$ is complete & minimal sufficient for the shape parameter $\alpha$ of an untruncated Pareto distribution—& truncation doesn't affect the completeness or sufficiency of a statistic (Smith, 1957). Consequently (1) you ought to be basing an estimator on the sum of logged observations (or the sample geometric mean, or an equivalent), & (2) you'll be lucky to find an estimator based on the sum of observations (or the sample arithmetic mean, or an equivalent), either derived in papers or textbooks, or implemented in software libraries.
If you do have the full sample, or $\sum_{i=1}^n \log x_i$, Aban et al. (2006) derive the score function for $\alpha$, which you can find the root of to give its maximum-likelihood estimate (for the cases of both known & unknown bounds†), & show that its distribution is asymptotically Gaussian (you can get standard error estimates from the gradient at the root). (They also cite a paper deriving the uniformly minimum-variance unbiased estimator.)
If for some reason all you have to work with is $\sum_{i=1}^n x_i$, then you could try a method-of-moments approach to estimating $\alpha$. Note the arithmetic mean exists for all $\alpha>0$, unlike the case of an untruncated Pareto distribution—see Do all bounded probability distributions have a definite mean?. Given a Pareto distribution with density
$$ f(x)= \frac{\alpha \beta^\alpha}{x^{\alpha+1}} $$
its expectation when truncated above at $\tau$ is
$$ \begin{align} \operatorname{E} X | X < \tau &= \frac{\int^\tau x f(x)\ \mathrm{d} x}{\int^\tau f(x) \ \operatorname{d} x} \\ &= \frac{\int_\beta^\tau x^{-\alpha}\ \mathrm{d}x} {\int_\beta^\tau x^{-(\alpha+1)}\ \mathrm{d}x}\\ &= \frac{\alpha}{1-\alpha}\cdot\frac{[x^{1-\alpha}]^\tau_\beta}{[x^\alpha]^\tau_\beta}\\ &= \frac{\alpha}{1-\alpha}\cdot\frac{\tau^{1-\alpha} - \beta^{1-\alpha}}{\tau^{-\alpha}-\beta^{-\alpha}}\\ &= \frac{\alpha\beta}{\alpha-1}\cdot\frac{1 - \left(\frac{\beta}{\tau}\right)^{\alpha-1}}{1-\left(\frac{\beta}{\tau}\right)^{\alpha}}\\ \end{align} $$
So plug in the sample mean $\bar x$ for $\operatorname{E} X$, & $\hat\alpha$ for $\alpha$, & find the root for $\hat\alpha$ numerically.‡ (You could bootstrap to get confidence interval/standard error/bias estimates.)
# function to generate Pareto variates
rpareto <- function(n, alpha, beta) beta/(1-runif(n))^(1/alpha)
# function to calculate expectation
EX <- Vectorize(function(alpha, beta, tau, epsilon = 1e-7){
if (abs(alpha-1) > epsilon){
(alpha*beta/(alpha-1)) * (1 - (beta/tau)^(alpha-1)) / (1 - (beta/tau)^alpha)
}
else {
alpha*beta*log(tau/beta) / (1 - (beta/tau)^alpha)
}
})
# function that you'll find the root of
f <- function(alpha, x.bar, beta, tau, epsilon = 1e-7) {
x.bar - EX(alpha, beta, tau, epsilon = 1e-7)
}
alpha_0 <- 2 # true value of alpha
beta <- 3
tau <- 9
# make up some data
set.seed(1234)
x <- rpareto(200, alpha_0, beta)
x <- x[x < tau]
length(x)
plot(ecdf(x), verticals=TRUE, pch="")
# find sample mean
(x.bar <- mean(x))
# find roughly where alpha.hat lies
alpha <- seq(1, 3, by=0.1)
plot(alpha, EX(alpha, beta, tau), type="l", ylab="EX")
abline(h = x.bar, lty=2)
# calculate alpha.hat
(alpha.hat <- uniroot(f, x.bar=x.bar, beta = beta, tau = tau, lower = 2, upper = 3)$root)
† With regard to the question in your 2nd paragraph: as you know the bound parameters, the sample minimum & maximum don't enter into the estimation of $\alpha$. If you didn't know the bound parameters, the sample minimum & maximum are their M.L.E.s., & you can substitute them for the known bound parameters in the score equation for $\alpha$ (what changes is the standard errors you'd calculate for the M.L.E. of $\alpha$).
‡ Note $$\lim_{\alpha \rightarrow 1} \frac{1-\left(\frac{\beta}{\tau}\right)^{\alpha-1}}{\alpha-1} = \log \frac{\tau}{\beta}$$
for when $\alpha \approx 1$ in numerical calculations.
Smith (1957), "A Note on Truncation and Sufficient Statistics", Ann. Math. Statist., 28, 1
Best Answer
The basic result is: $\alpha=\log_4 5=1.160964...$
The calculation comes from the Lorenz curve; specifically you're asking for the $\alpha$ for which $L(0.8)=0.2$.
$L$ is defined as
$$L(F)=\frac{\int_{x_\mathrm{m}}^{x(F)}xf(x)\,dx}{\int_{x_\mathrm{m}}^\infty xf(x)\,dx} =\frac{\int_0^F x(F')\,dF'}{\int_0^1 x(F')\,dF'}$$
where $x(F)$ is the inverse of the cdf. The denominator is the mean of the distribution.
For the Pareto distribution, the Lorenz curve is $L(F) = 1-(1-F)^{1-\frac{1}{\alpha}}$, from which we obtain the equation we need to solve:
$$0.2 = 1-(1-0.8)^{1-\frac{1}{\alpha}}.\,$$
Hence
$$1-\frac{1}{\alpha} = \log(0.8)/\log(0.2)$$
$$\alpha=\frac{1}{1-\log(0.8)/\log(0.2)}=1.160964\ldots$$
This is mentioned on the Wikipedia page for the Pareto principle, and on the page on the Pareto distribution
The details are given here.