I am doing a classification task and obtained an accuracy of 97.5%. I then calculated the confidence interval, assuming a normal distribution, at the 95% confidence level with: Accuracy +/- 1.96*Standard Deviation (see Alex A Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms (2002)). I am getting a value of 0.975 +/- 0.048. Since the upper bound would be more than 1, can that be right?
Solved – Can a confidence interval be greater than 1
accuracy, classification, confidence interval, standard deviation
Related Solutions
You should read Spreadsheet Addiction and the links from that page before trusting any results from Excel.
From your question it appears that you don't have a firm grasp on what confidence intervals and prediction intervals are. You should really consult a good intro stats book, and/or take a class or meet with a consultant to get these concepts down. But here is a short explanation:
The confidence interval is a statement about where we believe the true population parameter (the mean above) to be, based on the sample data. So not knowing the population mean does not mean that you cannot compute a confidence interval. If your sample is large and you are willing to assume that the population is not overly skewed and would not produce outliers, then the Central Limit Theorem says that a confidence interval on the mean based on the assumption of a normal population will be a good approximation even if the population is not normal. So you can use normal-based theory without knowing whether the population is normal, as long as you are willing to make the above assumptions.
The prediction interval is a statement about where we expect future individual data points to be. This prediction will depend much more on the shape of the distribution.
The big difference in concept is whether you are talking about the mean of all future data, or individual data points (I could not tell which you are interested in from the question).
The norminv function in Excel does not fit a normal distribution; it returns the x-value for a given area under the curve (probability) of a normal distribution with the specified mean and standard deviation. That function could be used as part of the computations for either interval, but doing so assumes that you know the population standard deviation. If you are using the sample standard deviation, it is more appropriate to use the t distribution rather than the normal. Also note that the prediction interval takes into account the uncertainty in your estimates of the mean and standard deviation in addition to the randomness of the individual data points, so norminv probably is not what you want.
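To make concrete what an inverse normal CDF returns, here is a minimal Python sketch using the standard library's `statistics.NormalDist`. The wrapper name `norminv` and the example numbers are assumptions chosen for illustration; this is not Excel's implementation.

```python
from statistics import NormalDist


def norminv(p, mean, sd):
    """Return x such that P(X <= x) = p for X ~ Normal(mean, sd).

    Hypothetical helper that mirrors the argument order of Excel's norminv.
    """
    return NormalDist(mu=mean, sigma=sd).inv_cdf(p)


# Critical value that leaves 2.5% in the upper tail of a standard normal;
# this is where the familiar 1.96 multiplier comes from.
z = norminv(0.975, 0.0, 1.0)

# The 50% point of any normal distribution is its mean.
x = norminv(0.5, 10.0, 2.0)

print(round(z, 4), x)
```

The function only looks up a quantile of a normal distribution whose parameters you supply. It does not estimate those parameters from data, which is why it cannot account for the uncertainty in a sample mean or sample standard deviation.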
You can compute the arithmetic mean of the log growth rate:
- Let $V_t$ be the value of your portfolio at time $t$
- Let $R_t = \frac{V_t}{V_{t-1}}$ be the growth rate of your portfolio from $t-1$ to $t$
The basic idea is to take logs and do your standard stuff. Taking logs transforms multiplication into a sum.
- Let $r_t = \log R_t$ be the log growth rate.
$$\bar{r} = \frac{1}{T} \sum_{t=1}^T r_t \quad \quad s_r = \sqrt{\frac{1}{T-1} \sum_{t=1}^T \left( r_t - \bar{r}\right)^2}$$
Then your standard error $\mathit{SE}_{\bar{r}}$ for your sample mean $\bar{r}$ is given by:
$$ \mathit{SE}_{\bar{r}} = \frac{s_r}{\sqrt{T}}$$
The 95 percent confidence interval for $\mu_r = \operatorname{E}[r_t]$ would be approximately: $$\left( \bar{r} - 2 \mathit{SE}_{\bar{r}} , \bar{r} + 2 \mathit{SE}_{\bar{r}} \right)$$
Exponentiate to get confidence interval for $e^{\mu_r}$
Since $e^x$ is a strictly increasing function, a 95 percent confidence interval for $e^{\mu_r}$ would be:
$$\left( e^{\bar{r} - 2 \mathit{SE}_{\bar{r}}} , e^{\bar{r} + 2 \mathit{SE}_{\bar{r}}} \right)$$
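The steps so far can be sketched in Python as follows. The function name `geometric_mean_ci` and the sample portfolio values are made up for illustration, and the multiplier of 2 is the same rough approximation used in the formulas above.

```python
import math
from statistics import mean, stdev


def geometric_mean_ci(values, z=2.0):
    """Approximate CI for exp(mu_r) from portfolio values V_0, ..., V_T.

    Hypothetical helper: works on log growth rates, then exponentiates.
    """
    # r_t = log(V_t / V_{t-1})
    log_growth = [math.log(b / a) for a, b in zip(values, values[1:])]
    T = len(log_growth)
    r_bar = mean(log_growth)                 # sample mean of log growth
    se = stdev(log_growth) / math.sqrt(T)    # stdev uses the T-1 denominator
    lower, upper = r_bar - z * se, r_bar + z * se
    # exp is strictly increasing, so the endpoints map across directly
    return math.exp(r_bar), (math.exp(lower), math.exp(upper))


# Made-up portfolio values, for illustration only
values = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]
gm, (low, high) = geometric_mean_ci(values)
print(gm, low, high)
```

With so few observations, a t-based multiplier would be more defensible than 2; the constant is kept here only to match the formulas in the text.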
And we're done. Why are we done?
Observe $\bar{r} = \frac{1}{T} \sum_t r_t$ is the log of the geometric mean
Hence $e^{\bar{r}}$ is geometric mean of your sample. To show this, observe the geometric mean is given by:
$$ \mathit{GM} = \left(R_1R_2\ldots R_T\right)^\frac{1}{T}$$
Hence if we take the log of both sides:
\begin{align*} \log \mathit{GM} &= \frac{1}{T} \sum_{t=1}^T \log R_t \\ &= \bar{r} \end{align*}
Some examples to build intuition:
- Say the mean log growth rate you compute is $.02$. Then the geometric mean is $\exp(.02) \approx 1.0202$.
- Say the mean log growth rate you compute is $-.05$. Then the geometric mean is $\exp(-.05) \approx .9512$.
For $x \approx 1$, we have $\log(x) \approx x - 1$ and for $y \approx 0$, we have $\exp(y) \approx y + 1$. Further away though, those tricks break down:
- Say the mean log growth rate you compute is $.69$. Then the geometric mean is $\exp(.69) \approx 2$ (i.e. the value doubles every period).
If all your log growth rates $r_t$ are near zero (or equivalently $\frac{V_t}{V_{t-1}}$ is near 1), then you'll find that the geometric mean and the arithmetic mean will be quite close.
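A quick numerical check of that last point, as a sketch with made-up growth rates (the helper names and both lists are hypothetical):

```python
import math


def arithmetic_mean(rates):
    return sum(rates) / len(rates)


def geometric_mean(rates):
    # exp of the mean log growth rate, as derived above
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


# Growth rates R_t close to 1 (log growth rates near zero)
small = [1.01, 0.99, 1.02, 0.98]
# Growth rates far from 1
large = [2.0, 0.5, 3.0, 0.25]

gap_small = arithmetic_mean(small) - geometric_mean(small)
gap_large = arithmetic_mean(large) - geometric_mean(large)
print(gap_small, gap_large)
```

The gap is tiny for the first list and substantial for the second. The arithmetic mean is never below the geometric mean, so both gaps are non-negative.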
Another answer that might be useful:
As this answer discusses, log differences are basically percent changes.
Comment: it's useful in finance to get comfortable thinking in logs. It's similar to thinking in terms of percent changes but mathematically cleaner.
Best Answer
This sounds like you are using the normal approximation interval, which is not optimal in any case and is especially unsuited for probabilities close to 0 or 1 (e.g. 97.5%).
Look at the following graph.
For the first histogram a normal distribution would work fairly well. In the second case you can see that the distribution has considerable skew, which makes the normal distribution inappropriate.
In either case, there is no need to use normal approximations for confidence intervals as more exact answers can be derived (in contrast to other more complex statistics, where sometimes a normal approximation is needed). Better options to construct confidence intervals for binomial proportions are described in the link above as well.
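As a rough illustration, the sketch below compares the normal approximation (Wald) interval with the Wilson score interval, one of the commonly recommended alternatives for binomial proportions. The question does not state the test-set size; 39 correct out of 40 is an assumption chosen only because it reproduces the reported 0.975 +/- 0.048. The function names are hypothetical as well.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, conf=0.95):
    """Normal approximation interval: p +/- z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, conf=0.95):
    """Wilson score interval; its endpoints always lie within [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# ASSUMPTION: 39 of 40 test cases correct (accuracy 0.975); n is not in the question
wald_low, wald_high = wald_interval(39, 40)
wilson_low, wilson_high = wilson_interval(39, 40)
print(wald_low, wald_high)
print(wilson_low, wilson_high)
```

Under that assumption the Wald upper bound exceeds 1, which is impossible for an accuracy, while the Wilson interval stays inside [0, 1] and is asymmetric around 0.975. An exact (Clopper-Pearson) interval is another option, but it needs a beta quantile function that the Python standard library does not provide.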