Distributions – How to Approximate the Distribution of a Linear Combination of Beta-Distributed Independent Random Variables

approximation, beta distribution, central limit theorem, distributions, mathematical-statistics

This question is related to two other questions on Cross Validated, which have already been answered:

In short, my question is this: Should I use specific results such as those collected by Gupta and Nadarajah (2004) (see also the answer by @kjetil-b-halvorsen to a previous question) to approximate the distribution of the linear combination of $n=20$ independent beta-distributed random variables, or would the CLT be accurate enough in this case? The context: statistical quality control on the production of a standard industrial setting (not NASA, I mean).


This is my concrete situation:

I have a sequence $X_1, X_2, X_3, \dots$ of independent random variables that can be assumed to follow a beta distribution, each of them with their respective distribution parameters, not necessarily equal — that is:

$$
X_i \sim \mathrm{Beta}(a_i,b_i) \text{,} \quad \forall\; i \text{.}
$$

Actually, all $X_i$'s should have the same distribution. I mean, theoretically speaking, there is an underlying distribution $\mathrm{Beta}(a,b)$ which all the $X_i$'s should come from, but the process is not under statistical control.

I am interested in approximately determining the distribution of the average of $n$ of those $X_i$'s. Without loss of generality, I would like to approximate the distribution of

$$
Y = \frac{1}{n}\sum_{i=1}^n{X_i} \text{.}
$$

An approach based on concrete data is possible (I mean, calculating concrete values for $Y$ from concrete values for the sequence of $X_i$'s and trying to fit a distribution) and will be done. But I am also interested in connecting the distribution of $Y$ with the distribution of the $X_i$'s in a more theoretical way, so that we are able to deduce things about $Y$ based on what happens to the $X_i$'s.

Using the Lindeberg-Feller CLT (see https://stats.stackexchange.com/a/156464/44075), I could state —if I am not wrong— that $Y$ is approximately distributed as a normal variable with mean $\mu_Y$ and standard deviation $\sigma_Y$, where $\mu_Y$ can be estimated as the mean of a sample of $X_i$'s and $\sigma_Y$ can be estimated as the sample (quasi)standard deviation of the $X_i$'s divided by $\sqrt{n}$.
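When the parameters $a_i$ and $b_i$ are known (or estimated), the mean and standard deviation of the normal approximation can also be computed directly from the beta moments rather than from a sample. The following is a minimal sketch under that assumption; the function names and the parameter values are hypothetical and only illustrate the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def normal_approx_of_mean(params):
    """Normal approximation to Y = (1/n) * sum(X_i) for independent
    X_i ~ Beta(a_i, b_i), given params = [(a_1, b_1), ..., (a_n, b_n)]."""
    n = len(params)
    moments = [beta_mean_var(a, b) for a, b in params]
    mu_y = sum(m for m, _ in moments) / n
    # Independence: Var(Y) = (1/n^2) * sum of the component variances.
    sigma_y = sqrt(sum(v for _, v in moments)) / n
    return NormalDist(mu_y, sigma_y)


# Hypothetical example: 20 components whose first parameter drifts upward.
params = [(8 + 0.1 * i, 2.0) for i in range(20)]
approx = normal_approx_of_mean(params)
print("mean:", approx.mean, "sd:", approx.stdev)
print("approximate central 95% interval:",
      approx.inv_cdf(0.025), approx.inv_cdf(0.975))
```

Because the component variances are summed individually, this does not require the $X_i$'s to share the same parameters.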

On the other hand, Johannesson and Giri (1995), who are cited by Gupta and Nadarajah (2004), provide two ways to approximate $Y$ using a beta distribution. The more complex of them says that $Y$ is approximately equal to $\rho Z/\gamma$, where $Z$ is a standard beta random variable with parameters $g$ and $h$, and where $\rho$, $\gamma$, $g$ and $h$ can be determined using explicit equations that can be translated to be estimated from a sample of $X_i$'s.
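For comparison, a much simpler beta approximation than the one described above is plain moment matching: choose a standard beta on $[0,1]$ whose mean and variance equal those of $Y$. The sketch below is not the Johannesson and Giri method (it has no $\rho$ or $\gamma$ scaling); it is a swapped-in illustration, and the function names and parameter values are hypothetical.

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def moment_matched_beta(params):
    """Return (g, h) such that Beta(g, h) has the same mean and variance
    as Y = (1/n) * sum(X_i), with independent X_i ~ Beta(a_i, b_i)."""
    n = len(params)
    moments = [beta_mean_var(a, b) for a, b in params]
    mu_y = sum(m for m, _ in moments) / n
    var_y = sum(v for _, v in moments) / n ** 2
    # Method of moments for a beta on [0, 1]:
    # g + h = mu * (1 - mu) / var - 1, split in proportion mu : (1 - mu).
    total = mu_y * (1 - mu_y) / var_y - 1
    return mu_y * total, (1 - mu_y) * total


# Hypothetical example: 20 identical Beta(8, 2) components.
g, h = moment_matched_beta([(8.0, 2.0)] * 20)
print("matched beta parameters:", g, h)
```

With a single component the function returns the original parameters, which is a quick sanity check on the formulas.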

So, which of the approaches should I use? The normal approximation or the beta one?

As I said above, in my concrete case, the value of $n$ is $20$ or so.


EDIT:

I am interested in this matter because I was warned about the fact that the convergence rate of $Y$ to a normal distribution (when $n$ tends to infinity) is not stated by the Lindeberg-Feller CLT.

Best Answer

If the skewnesses of the beta components are all low, then the absolute third moments should also be low*, and the normal approximation should tend to come in quite quickly (see the Berry-Esseen theorem for non-i.i.d. variates).

* I don't mean this comment as a general one, just in respect of beta variates. For example, for a beta variate with skewness $\gamma_1$ the kurtosis is bounded below by $1 + \gamma_1^2$ and above by $3 + \tfrac{3}{2}\gamma_1^2$, so when $\gamma_1$ is small the kurtosis cannot be much above 3, and I believe the absolute third moment of a standardized variate should be smaller than the fourth moment. Those two things together suggest that a small third moment implies a small absolute third moment.
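The Berry-Esseen bound for independent, non-identically distributed summands has the form $\sup_x |F_n(x) - \Phi(x)| \le C_0 \sum_i \rho_i \big/ \left(\sum_i \sigma_i^2\right)^{3/2}$, where $\rho_i = E|X_i - \mu_i|^3$. The sketch below evaluates it for beta components. It rests on assumptions: the default constant `c0=0.56` is a commonly quoted upper bound and should be treated as an assumption, the absolute third moments come from a crude midpoint rule (unreliable when $a_i < 1$ or $b_i < 1$), and all names are hypothetical.

```python
from math import exp, lgamma, log


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))


def abs_third_central_moment(a, b, steps=10_000):
    """E|X - mu|^3 for X ~ Beta(a, b), via the midpoint rule on (0, 1)."""
    mu = a / (a + b)
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h  # midpoints never hit 0 or 1
        total += abs(x - mu) ** 3 * beta_pdf(x, a, b)
    return total * h


def berry_esseen_bound(params, c0=0.56):
    """Upper bound on the sup-distance between the cdf of the standardized
    sum (equivalently, mean) and the standard normal cdf.
    c0 is an assumed value for the Berry-Esseen constant."""
    var_sum = 0.0
    rho_sum = 0.0
    for a, b in params:
        var_sum += a * b / ((a + b) ** 2 * (a + b + 1))
        rho_sum += abs_third_central_moment(a, b)
    return c0 * rho_sum / var_sum ** 1.5


# Compare a symmetric case with a strongly skewed one, n = 20 each.
print("Beta(5, 5):  ", berry_esseen_bound([(5.0, 5.0)] * 20))
print("Beta(100, 1):", berry_esseen_bound([(100.0, 1.0)] * 20))
```

The bound is a worst-case guarantee on the cdf difference, so the actual error at $n = 20$ can be considerably smaller than the number it returns.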

However, the "closeness" that the theorem bounds is closeness of cdfs, and a small difference in cdfs does not necessarily mean that whatever other properties you care about are close to those of a normal; it may make more sense to identify which properties you're after and investigate those directly.

On the other hand, if the skewness is high, we would not expect a very rapid approach to normality; indeed, simulation readily shows that noticeable skewness can persist in the standardized mean. For example, here's a histogram for 10000 simulations of standardized means of 20 beta(100,1) variates:

[Figure: histogram of 10000 simulated standardized means of 20 beta(100,1) variates]
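A simulation of this kind can be reproduced along the following lines. This is a sketch, not the code used for the original histogram; the seed and function names are arbitrary. For i.i.d. components the skewness of the mean is the component skewness divided by $\sqrt{n}$, which for beta(100,1) and $n = 20$ is roughly $-0.43$, so the sample skewness printed below should land near that value.

```python
import random
from math import sqrt


def standardized_means(a, b, n=20, n_sims=10_000, seed=1):
    """Simulate n_sims means of n i.i.d. Beta(a, b) draws and standardize
    each with the exact mean and standard error of the mean."""
    rng = random.Random(seed)
    mu = a / (a + b)
    sigma = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    se = sigma / sqrt(n)
    return [
        (sum(rng.betavariate(a, b) for _ in range(n)) / n - mu) / se
        for _ in range(n_sims)
    ]


def sample_skewness(xs):
    """Moment-based sample skewness m3 / m2^(3/2)."""
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5


z = standardized_means(100, 1)
print("sample skewness of standardized means:", sample_skewness(z))
```

Swapping in a near-symmetric component such as beta(5,5) gives a sample skewness close to zero, which matches the low-skewness case discussed earlier.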

Anyway, these points may help you figure out when you might just decide to work with the normal approximation rather than the more complicated formulas.
