Solved – Mean value of truncated normal distribution

I have a bunch of data where each observation represents an error $\in [0,1]$ (the computed error between a variable and its ground truth).

Extra info: Each value is the difference between a computed shortest path and the shortest path given by a known heuristic. Since no path can be shorter than the shortest, the differences will always be non-negative.

Edit: My data set is available here: data, and can be loaded with:

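# Keep only the columns of interest from the imported data set
study <- subset(heuristic_export, select = c(Case, Heuristic, Error))
# Restrict to the errors of the Chebyshev heuristic
chebyshev <- subset(study, Heuristic == "CHEBYSHEV", select = c(Error))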

The variable of interest is chebyshev$Error.

You can visualize the distribution in the following histogram: [Figure: histogram of the errors]

Same visualization with more bins (50): [Figure: histogram of the errors, 50 bins]

At first I naively tried to transform the data into a normal distribution (without any success), then I thought that the data would follow an exponential distribution and tried a bunch of goodness-of-fit tests.
But after some thought and investigation, it seems plausible that my errors follow a truncated normal distribution: a normal distribution restricted to the interval [0,1].

If that's the case, how can I, for example, test a hypothesis about the population mean?

E.g. $H_0\!:\ \mu = 0.1,\quad H_{\rm alt}\!:\ \mu <0.1$

Or even, how can I compute the sample mean? Any recommendations?

Any help would be appreciated.

EDIT: Is the CLT applicable to this problem? And if so, can I apply a simple hypothesis test to the resulting dataset? Data is cheap.

EDIT2: Following gung's advice, I tried fitting both a log-normal and a gamma distribution to the data. From the following plots, my interpretation is that the gamma distribution fits the data better. I have no experience with mixture models, or even with the gamma distribution, so help, examples or resources on understanding the distribution, hypothesis testing or probability calculation for mixture models would be appreciated. Here are the plots:

Log Normal fit: [Figure]

Gamma fit: [Figure]

Best Answer

With enough data, you can use an ordinary $t$ method to compute a CI for the mean, with the CLT as a justification for doing so. However, with the distribution you show, the mean does not seem very meaningful (sorry for the pun).
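As a rough sketch of the $t$ approach applied to the hypotheses in the question ($H_0\!:\ \mu = 0.1$ against $H_{\rm alt}\!:\ \mu < 0.1$), the R snippet below runs a one-sided one-sample $t$ test. The vector `error` is simulated stand-in data (an assumption, since the real `chebyshev$Error` values are not reproduced here); with the actual data you would pass `chebyshev$Error` instead.

```r
# Simulated stand-in for chebyshev$Error (assumption: not the real data).
# Roughly 40% exact zeros, the rest small positive errors capped at 1.
set.seed(1)
n <- 500
error <- ifelse(runif(n) < 0.4, 0, pmin(rexp(n, rate = 10), 1))

# Sample mean of the errors.
sample_mean <- mean(error)

# One-sided one-sample t test of H0: mu = 0.1 against H_alt: mu < 0.1.
# The CLT is the justification for using it on non-normal data with large n.
res <- t.test(error, mu = 0.1, alternative = "less")

print(sample_mean)
print(res$p.value)   # small values are evidence against H0 in favour of mu < 0.1
print(res$conf.int)  # one-sided interval: lower bound is -Inf
```

Because the alternative is one-sided, `res$conf.int` is a one-sided interval; use `alternative = "two.sided"` if an ordinary two-sided CI for the mean is what you want.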

For example, it seems like there are a lot of zeros, so the probability of a nonzero error may be more descriptive than the mean. The probability of an error greater than some meaningful threshold may be more descriptive still. You can estimate either of these, and obtain a confidence interval, using standard methods for estimating proportions.
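A minimal sketch of the proportion-based alternative, again on simulated stand-in data (an assumption, not the real `chebyshev$Error`). The cutoff `threshold <- 0.05` is purely hypothetical; pick whatever value is meaningful for your problem. `binom.test` gives an exact (Clopper-Pearson) interval and `prop.test` a score-based one; either is a standard choice.

```r
# Simulated stand-in for chebyshev$Error (assumption: not the real data).
set.seed(1)
n <- 500
error <- ifelse(runif(n) < 0.4, 0, pmin(rexp(n, rate = 10), 1))

threshold <- 0.05  # hypothetical "meaningful" cutoff

# Counts of observations with a nonzero error and with an error above the cutoff.
n_nonzero <- sum(error > 0)
n_above   <- sum(error > threshold)

# Point estimates of the two proportions.
p_nonzero <- n_nonzero / n
p_above   <- n_above / n

# 95% confidence intervals for the two proportions.
ci_nonzero <- binom.test(n_nonzero, n)$conf.int  # exact binomial interval
ci_above   <- prop.test(n_above, n)$conf.int     # score interval with continuity correction

print(c(estimate = p_nonzero, lower = ci_nonzero[1], upper = ci_nonzero[2]))
print(c(estimate = p_above,   lower = ci_above[1],   upper = ci_above[2]))
```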
