Solved – Mean and variance of multiple trials from normal dist

normal-distribution, probability

Imagine I have some process with mean $\mu$ and variance $\sigma^2$, which are both known empirically. If I sample from this process 1000 times, what's the probability that the mean of those 1000 samples is less than $\mu+\epsilon$?

More details:

I have a binary classification model (it returns probabilities using logistic regression). I can estimate the empirical mean log-loss and the variance of this log-loss, $\mu$ and $\sigma^2$ respectively. Let's say these are $\mu = 0.692$ and $\sigma = 0.01$. If I then use my classifier on 1000 new sample points, I'd like to know the probability that the mean log-loss of my classifier across those 1000 samples is less than 0.693.

At the moment I have a pretty clumsy numerical method using the binomial distribution: I compute the CDF of a normal distribution with the $\mu$ and $\sigma$ above to find the probability that any one point has log-loss less than 0.693, then I repeatedly sample from a binomial distribution with this probability and count the fraction of draws in which more than half the samples fall below 0.693.
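
For concreteness, here is a minimal sketch of that procedure, assuming NumPy/SciPy; the threshold and sample size are the ones quoted above, and `n_draws` is an arbitrary simulation size:

```python
# Sketch of the binomial-based procedure described above.
# mu, sigma, and the 0.693 threshold come from the question.
import numpy as np
from scipy.stats import norm, binom

mu, sigma = 0.692, 0.01
threshold = 0.693
n_samples = 1000       # points the classifier is scored on
n_draws = 100_000      # Monte Carlo repetitions (arbitrary choice)

# Probability that a single point's log-loss falls below the threshold,
# assuming log-losses are N(mu, sigma^2).
p_single = norm.cdf(threshold, loc=mu, scale=sigma)

# How many of the 1000 points fall below the threshold in each draw,
# then the fraction of draws in which more than half of them do.
counts = binom.rvs(n_samples, p_single, size=n_draws)
prob_majority_below = np.mean(counts > n_samples / 2)
print(prob_majority_below)
```

Note that this simulates the probability that a *majority* of points fall below the threshold, which is not the same quantity as the probability that the *mean* log-loss falls below it.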

Best Answer

You are asking about the standard error. If the variable of interest is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the distribution of sample means in samples of size $n$ follows a normal distribution with mean $\mu$ and standard deviation $\tfrac{\sigma}{\sqrt{n}}$. From here, you can calculate the probability using the cumulative distribution function of a normal distribution with those parameters.
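
With the numbers in the question, the standard error is $\tfrac{\sigma}{\sqrt{n}} = \tfrac{0.01}{\sqrt{1000}} \approx 0.000316$, so $P(\bar{X} < 0.693) = \Phi\!\left(\tfrac{0.693 - 0.692}{0.000316}\right) \approx \Phi(3.16) \approx 0.999$. A minimal sketch of the same calculation, assuming SciPy:

```python
# Probability that the mean of 1000 samples is below the threshold,
# using the standard error of the mean (numbers from the question).
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 0.692, 0.01, 1000
threshold = 0.693

se = sigma / sqrt(n)                        # standard error, ~0.000316
prob = norm.cdf(threshold, loc=mu, scale=se)
print(prob)                                 # ~0.9992
```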

Notice, however, that this assumes a normal distribution with known $\mu, \sigma$ parameters, and that the new samples are drawn from the same distribution. These assumptions need not hold for empirical errors, especially errors computed on predictions for unseen data compared with errors from the training set.
