Solved – The correct way to fit a normal distribution to truncated/trimmed data

Tags: estimation, normal distribution, quantiles

I have data where I believe that most of it is normally distributed, but a few extreme outliers are drawn from a different distribution. To estimate the mean and variance of the normal part of my mostly-normal data, I had hoped to exclude the outliers by trimming away the bottom and top 10% of the values, and then to use the MASS::fitdistr function in R. Perhaps unsurprisingly to experts, this results in a biased (incorrect) estimate of the variance. The problem is easy to demonstrate in R. (Ignore for now the possibility of non-normally distributed subpopulations in the data, and suppose the data is completely normally distributed.)

# generate random data
set.seed(0)
all_normal <- rnorm(1000000, mean=0, sd=10)

# eliminate the bottom 10% and top 10% of the values
trunc <- 0.1
lb <- quantile(all_normal, trunc)
ub <- quantile(all_normal, 1-trunc)
not_normal <- all_normal[all_normal > lb & all_normal < ub]

# fit a distribution to the truncated data
print(MASS::fitdistr(not_normal, 'normal'))

This results in

      mean            sd     
  -0.007818535    6.617770945 
 ( 0.007398893) ( 0.005231807)

The best-fit standard deviation is $6.618\pm0.005$, but the true standard deviation in my numerical example is obviously 10. My truncation-based approach is not suitable.

Is there a generally recognized approach to fitting truncated data? Is there, for example, a theoretical correction factor, computable from my trunc value, that I can use to recover an unbiased estimate of the standard deviation from top- and bottom-quantile trimmed data? There appear to be formulas for recovering an estimated standard deviation from trimmed data, but what if I want to estimate the mean and standard deviation simultaneously? Can I assume that the estimated mean and the estimated standard deviation do not co-vary?

Best Answer

In a 2014 paper in BMC Medical Research Methodology, Wan et al. give a formula for estimating the sample standard deviation of assumed-normal data when given only the bottom and top quartiles, i.e. the 25th percentile and 75th percentile. The formula, numbered (16) in their paper, is:

$$S \approx \frac{q_3 - q_1}{2 \Phi^{-1}\left(\frac{0.75n - 0.125}{n+0.25}\right)}$$

Here, $q_3$ is the 75th percentile, $q_1$ is the 25th percentile, $n$ is the number of data points, and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function. The authors say that this approximation for the sample sd is valid at large $n$.
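As a quick sanity check, formula (16) is straightforward to transcribe into R. (The function name wan_sd is mine, not from the paper.)

```r
# Wan et al. (2014), formula (16): estimate sd from the sample quartiles.
# q1, q3 are the 25th/75th percentiles, n is the number of data points.
wan_sd <- function(q1, q3, n) {
  (q3 - q1) / (2 * qnorm((0.75 * n - 0.125) / (n + 0.25)))
}

# Check on a large normal sample with known sd = 10
set.seed(1)
x <- rnorm(1e6, mean = 0, sd = 10)
q <- quantile(x, c(0.25, 0.75))
wan_sd(q[[1]], q[[2]], length(x))  # close to 10
```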

I was interested in the case of arbitrary quantiles, not just quartiles. I didn't find a formula for those, but based on guesswork and extrapolation, I tried out:

$$S \approx \frac{q_{1-x} - q_x}{2 \Phi^{-1}\left(\frac{(1-x)n - \frac{x}{2}}{n+x} \right)}$$

Here, $x\in(0, \frac{1}{2})$ is the lower quantile used, e.g. 0.1 for my data trimmed at the 10th and 90th percentiles, and the other symbols are as above; at $x = 0.25$ this reduces exactly to formula (16).

This approximation was excellent in my particular case, but then I noticed that Wikipedia gives a closed-form expression for the quantiles of normally distributed data:

$$ q_x = \mu + \sigma \sqrt{2} \DeclareMathOperator\erfinv{erf^{-1}}\erfinv{\left(2x - 1\right)}$$

It's easy to see that, given values for $q_x$ and $q_{1-x}$, the $\sigma$ parameter can be calculated as:

$$\sigma = \frac{q_{1-x} - q_x}{\sqrt{2}\left[\erfinv(1-2x) - \erfinv(2x-1)\right]}$$

Since the inverse error function is odd, so that $\erfinv(2x-1) = -\erfinv(1-2x)$, this is the same as

$$\sigma = \frac{q_{1-x} - q_x}{2\sqrt{2}\erfinv(1-2x)}$$

This solution is exact for the population quantiles and doesn't depend on any large-$n$ approximation (the only remaining sources of error are the assumption we have already made about the normality of the data in between the quantiles, and the sampling error in the quantile estimates themselves).
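In R there is no built-in $\erfinv$, but none is needed: since $\Phi^{-1}(p) = \sqrt{2}\,\erfinv(2p-1)$, the denominator $2\sqrt{2}\,\erfinv(1-2x)$ is just $2\,\Phi^{-1}(1-x)$, i.e. 2 * qnorm(1 - x). A minimal sketch on the question's data, also estimating $\mu$ as the midpoint of the two symmetric quantiles (valid by the symmetry of the normal distribution):

```r
# Closed-form estimates of mu and sigma from two symmetric quantiles.
# Note: sqrt(2) * erfinv(1 - 2x) == qnorm(1 - x), so qnorm suffices.
set.seed(0)
all_normal <- rnorm(1000000, mean = 0, sd = 10)
x <- 0.1
q_lo <- quantile(all_normal, x)
q_hi <- quantile(all_normal, 1 - x)
sigma_hat <- (q_hi - q_lo) / (2 * qnorm(1 - x))
mu_hat <- (q_lo + q_hi) / 2
c(mu = unname(mu_hat), sigma = unname(sigma_hat))  # close to (0, 10)
```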

However, since I am not a statistician, I'd like to note the possible deficiencies in my answer:

  1. I don't fully understand where the approximation used by Wan et al. came from.

  2. Their formula is for the sample standard deviation, but the $\sigma$ parameter is the standard deviation of the population. I am probably eliding an important distinction between the two.

  3. For my application, I am estimating $\mu$ by maximum likelihood, and then using the formula above to estimate $\sigma$. That means I don't know how the errors in the two estimates are correlated. In fact I don't even have a good estimate of the error in my estimate of $\sigma$ above.

I'd appreciate more expert answers!
