Solved – The correct way to fit a normal distribution to truncated/trimmed data

Tags: estimation, normal distribution, quantiles

I have data where I believe that most of it is normally distributed, but a few extreme outliers are drawn from a different distribution. To estimate the mean and variance of the normal part of my mostly-normal data, I had hoped to exclude the outliers by trimming away the bottom and top 10% of the values, and then to use the MASS::fitdistr function in R. Perhaps unsurprisingly to experts, this results in a biased (incorrect) estimate of the variance. The problem is easy to demonstrate in R. (Ignore for now the possibility of non-normally distributed subpopulations in the data, and suppose the data is completely normally distributed.)

# generate random data
set.seed(0)
all_normal <- rnorm(1000000, mean=0, sd=10)

# eliminate the bottom 10% and top 10% of the values
trunc <- 0.1
lb <- quantile(all_normal, trunc)
ub <- quantile(all_normal, 1-trunc)
not_normal <- all_normal[all_normal > lb & all_normal < ub]

# fit a distribution to the truncated data
print(MASS::fitdistr(not_normal, 'normal'))

This results in

      mean            sd     
  -0.007818535    6.617770945 
 ( 0.007398893) ( 0.005231807)

The best-fit standard deviation is $6.618\pm0.005$, but the true standard deviation in my numerical example is obviously 10. My truncation-based approach is not suitable.

Is there a generally recognized approach to fitting truncated data? Is there, for example, a theoretical correction factor, computable from my trunc value, that I can use to recover an unbiased estimate of the standard deviation from top- and bottom-quantile trimmed data? There appear to be formulas for recovering an estimated standard deviation from trimmed data, but what if I want to estimate the mean and standard deviation simultaneously? Can I assume that the estimated mean and the estimated standard deviation do not co-vary?

Best Answer

In a 2014 paper in BMC Medical Research Methodology, Wan et al. give a formula for estimating the sample standard deviation of assumed-normal data when given only the bottom and top quartiles, i.e. the 25th percentile and 75th percentile. The formula, numbered (16) in their paper, is:

$$S \approx \frac{q_3 - q_1}{2 \Phi^{-1}\left(\frac{0.75n - 0.125}{n+0.25}\right)}$$

Here, $q_3$ is the 75th percentile, $q_1$ is the 25th percentile, $n$ is the number of data points, and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function. The authors say that this approximation for the sample sd is valid at large $n$.
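As a quick sanity check, formula (16) is straightforward to transcribe into R. (The function name wan_sd is mine, not from the paper.)

```r
# Wan et al. (2014), formula (16): estimate sd from the sample quartiles.
# q1, q3 are the 25th/75th percentiles, n is the number of data points.
wan_sd <- function(q1, q3, n) {
  (q3 - q1) / (2 * qnorm((0.75 * n - 0.125) / (n + 0.25)))
}

# Check on a large normal sample with known sd = 10
set.seed(1)
x <- rnorm(1e6, mean = 0, sd = 10)
q <- quantile(x, c(0.25, 0.75))
wan_sd(q[[1]], q[[2]], length(x))  # close to 10
```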

I was interested in the case of arbitrary quantiles, not just quartiles. I didn't find a formula for those, but based on guesswork and extrapolation, I tried out:

$$S \approx \frac{q_{1-x} - q_x}{2 \Phi^{-1}\left(\frac{(1-x)n - \frac{x}{2}}{n+x} \right)}$$

Here, $x\in(0, \frac{1}{2})$ is the lower quantile used, e.g. 0.1 for my data trimmed at the 10th and 90th percentiles, and the other symbols are as above; at $x = 0.25$ this reduces exactly to formula (16).

This approximation was excellent in my particular case, but then I noticed that Wikipedia gives a closed-form expression for the quantiles of normally distributed data:

$$ q_x = \mu + \sigma \sqrt{2} \DeclareMathOperator\erfinv{erf^{-1}}\erfinv{\left(2x - 1\right)}$$

It's easy to see that, given values for $q_x$ and $q_{1-x}$, the $\sigma$ parameter can be calculated as:

$$\sigma = \frac{q_{1-x} - q_x}{\sqrt{2}\left[\erfinv(1-2x) - \erfinv(2x-1)\right]}$$

Since the inverse error function is odd, so that $\erfinv(2x-1) = -\erfinv(1-2x)$, this is the same as

$$\sigma = \frac{q_{1-x} - q_x}{2\sqrt{2}\erfinv(1-2x)}$$

This solution is exact for the population quantiles and doesn't depend on any large-$n$ approximation (the only remaining sources of error are the assumption we have already made about the normality of the data in between the quantiles, and the sampling error in the quantile estimates themselves).
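In R there is no built-in $\erfinv$, but none is needed: since $\Phi^{-1}(p) = \sqrt{2}\,\erfinv(2p-1)$, the denominator $2\sqrt{2}\,\erfinv(1-2x)$ is just $2\,\Phi^{-1}(1-x)$, i.e. 2 * qnorm(1 - x). A minimal sketch on the question's data, also estimating $\mu$ as the midpoint of the two symmetric quantiles (valid by the symmetry of the normal distribution):

```r
# Closed-form estimates of mu and sigma from two symmetric quantiles.
# Note: sqrt(2) * erfinv(1 - 2x) == qnorm(1 - x), so qnorm suffices.
set.seed(0)
all_normal <- rnorm(1000000, mean = 0, sd = 10)
x <- 0.1
q_lo <- quantile(all_normal, x)
q_hi <- quantile(all_normal, 1 - x)
sigma_hat <- (q_hi - q_lo) / (2 * qnorm(1 - x))
mu_hat <- (q_lo + q_hi) / 2
c(mu = unname(mu_hat), sigma = unname(sigma_hat))  # close to (0, 10)
```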

However, since I am not a statistician, I'd like to note the possible deficiencies in my answer:

  1. I don't fully understand where the approximation used by Wan et al. came from.

  2. Their formula is for the sample standard deviation, but the $\sigma$ parameter is the standard deviation of the population. I am probably eliding an important distinction between the two.

  3. For my application, I am estimating $\mu$ by maximum likelihood, and then using the formula above to estimate $\sigma$. That means I don't know how the errors in the two estimates are correlated. In fact I don't even have a good estimate of the error in my estimate of $\sigma$ above.

I'd appreciate more expert answers!
