Calculate the Mean and Standard Deviation on a Rate

means, standard deviation, statistics

I have a data set of Time (s) and Size (bytes) values, and for each pair I calculate a Rate (s/byte). I'm trying to predict a Time for a known Size, using the mean and the 3σ of the Rates, to get something like: 11 minutes 48 seconds ± 2 minutes 30 seconds (3σ).

For now, to calculate the mean, I sum the Times and sum the Sizes and compute the mean rate from these two totals. But to calculate the standard deviation, I need to work with the individual Rates calculated from my data set.

Does it make sense to calculate the mean from the Time and Size sums, and the standard deviation from the individual calculated Rates?
Or should I calculate the mean of the individual calculated Rates, given that I'm using the standard deviation of those same Rates to give the ±3σ estimate?

Are both methods correct, or should I use the second one even though the mean calculated by the first is more accurate?

Best Answer

The mean and standard deviation being sought are those of a ratio, and the individual ratios cannot simply be averaged directly. So the only way to calculate the mean is to sum the Times and the Sizes and take the ratio of those two sums. For the 3σ estimate, the only way is to compute the standard deviation from all of the calculated ratios. I see no other way to accomplish these two tasks.
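To make the two calculations concrete, here is a minimal Python sketch. The times and sizes are invented for illustration, and the function names (`pooled_rate`, `per_item_rates`, `predict_time`) are hypothetical, not from the question. It takes the rate from the summed Times and Sizes, takes the sample standard deviation from the per-item rates, and scales both by a known size to get an estimate with a ±3σ margin.

```python
import statistics

def pooled_rate(times, sizes):
    """Rate in s/byte computed from the summed times and summed sizes."""
    return sum(times) / sum(sizes)

def per_item_rates(times, sizes):
    """Individual rate (s/byte) for each (time, size) pair."""
    return [t / s for t, s in zip(times, sizes)]

def predict_time(size, times, sizes):
    """Return (estimated time, 3-sigma margin) in seconds for a given size."""
    rate = pooled_rate(times, sizes)
    # Sample standard deviation (N - 1 divisor) of the individual rates.
    sigma = statistics.stdev(per_item_rates(times, sizes))
    return rate * size, 3 * sigma * size

# Invented example data: seconds and bytes.
times = [10.0, 22.0, 29.0, 41.0]
sizes = [1000, 2000, 3000, 4000]

estimate, margin = predict_time(5000, times, sizes)
print(f"{estimate:.1f} s ± {margin:.1f} s (3σ)")
```

Note that the pooled rate is the size-weighted mean of the individual rates, which is why it generally differs from their plain average; large transfers count for more.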

Related Solutions

Statistics – Sample Standard Deviation vs Population Standard Deviation

There are, in fact, two different formulas for standard deviation here: The population standard deviation $\sigma$ and the sample standard deviation $s$.

If $x_1, x_2, \ldots, x_N$ denote all $N$ values from a population, then the (population) standard deviation is $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2},$$ where $\mu$ is the mean of the population.

If $x_1, x_2, \ldots, x_N$ denote $N$ values from a sample, however, then the (sample) standard deviation is $$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2},$$ where $\bar{x}$ is the mean of the sample.
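As a quick check of the two formulas, the following Python sketch implements each one directly and compares it with the standard library's `statistics.pstdev` and `statistics.stdev`. The data values and the helper names `population_sd` and `sample_sd` are made up for illustration.

```python
import math
import statistics

def population_sd(xs):
    """Standard deviation with divisor N (all values of a population)."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    """Standard deviation with divisor N - 1 (values from a sample)."""
    xbar = sum(xs) / len(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values

print(population_sd(data), statistics.pstdev(data))
print(sample_sd(data), statistics.stdev(data))
```

For the same data the sample version is always the larger of the two, and the gap shrinks as $N$ grows.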

The reason for the change in formula with the sample is this: When you're calculating $s$ you are normally using $s^2$ (the sample variance) to estimate $\sigma^2$ (the population variance). The problem, though, is that if you don't know $\sigma$ you generally don't know the population mean $\mu$, either, and so you have to use $\bar{x}$ in the place in the formula where you normally would use $\mu$. Doing so introduces a slight bias into the calculation: Since $\bar{x}$ is calculated from the sample, the values of $x_i$ are on average closer to $\bar{x}$ than they would be to $\mu$, and so the sum of squares $\sum_{i=1}^N (x_i - \bar{x})^2$ turns out to be smaller on average than $\sum_{i=1}^N (x_i - \mu)^2$. It just so happens that that bias can be corrected by dividing by $N-1$ instead of $N$. (Proving this is a standard exercise in an advanced undergraduate or beginning graduate course in statistical theory.) The technical term here is that $s^2$ (because of the division by $N-1$) is an unbiased estimator of $\sigma^2$.
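For readers who want to see where the $N-1$ comes from, here is a sketch of the usual argument, assuming the $x_i$ are independent draws from a population with mean $\mu$ and finite variance $\sigma^2$. Start from the decomposition $$\sum_{i=1}^N (x_i - \mu)^2 = \sum_{i=1}^N (x_i - \bar{x})^2 + N(\bar{x} - \mu)^2,$$ which holds because the cross term $2(\bar{x} - \mu)\sum_{i=1}^N (x_i - \bar{x})$ is zero. Taking expectations, with $E[(x_i - \mu)^2] = \sigma^2$ and $E[(\bar{x} - \mu)^2] = \sigma^2/N$, gives $$N\sigma^2 = E\left[\sum_{i=1}^N (x_i - \bar{x})^2\right] + \sigma^2,$$ so $$E\left[\sum_{i=1}^N (x_i - \bar{x})^2\right] = (N-1)\sigma^2 \quad\text{and hence}\quad E[s^2] = \sigma^2.$$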

Another way to think about it is that with a sample you have $N$ independent pieces of information. However, since $\bar{x}$ is the average of those $N$ pieces, if you know $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_{N-1} - \bar{x}$, you can figure out what $x_N - \bar{x}$ is. So when you're squaring and adding up the residuals $x_i - \bar{x}$, there are really only $N-1$ independent pieces of information there. So in that sense perhaps dividing by $N-1$ rather than $N$ makes sense. The technical term here is that there are $N-1$ degrees of freedom in the residuals $x_i - \bar{x}$.
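The constraint described above can be checked numerically. In this small Python sketch (the data values are arbitrary), the residuals sum to zero, so the last one is recovered from the other $N-1$.

```python
data = [3.0, 7.0, 8.0, 12.0]  # arbitrary illustrative sample
xbar = sum(data) / len(data)

# Residuals x_i - xbar for every value in the sample.
residuals = [x - xbar for x in data]

# Because the residuals sum to zero, the last one is determined
# by the first N - 1 of them.
last_from_others = -sum(residuals[:-1])

print(residuals[-1], last_from_others)
```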

For more information, see Wikipedia's article on the sample standard deviation.

[Math] Calculating mean and standard deviation of very large sample sizes

Posting as an answer in response to comments.

Here's a way to compute the mean and standard deviation in one pass over the file. (Pseudocode.)

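n = r1 = r2 = 0;            // count, sum of samples, sum of squared samples
while (more_samples()) {
    s = next_sample();
    n += 1;
    r1 += s;                // running total of the samples
    r2 += s*s;              // running total of their squares
}
mean = r1 / n;
// Population form (divides by n): sqrt(E[x^2] - (E[x])^2)
stddev = sqrt(r2/n - (mean * mean));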

Essentially, you keep a running total of the sum of the samples and the sum of their squares. This lets you easily compute the standard deviation at the end.
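The running-totals idea can be written as a small Python function. This is a sketch under a few assumptions: the function name `one_pass_mean_stddev` is made up, the samples arrive as any iterable of numbers, and the result is the population form (dividing by $n$). Clamping the variance at zero is an addition not in the original; it guards against a tiny negative value from floating-point rounding.

```python
import math

def one_pass_mean_stddev(samples):
    """Mean and population standard deviation in a single pass."""
    n = 0
    r1 = 0.0  # running sum of the samples
    r2 = 0.0  # running sum of the squared samples
    for s in samples:
        n += 1
        r1 += s
        r2 += s * s
    if n == 0:
        raise ValueError("need at least one sample")
    mean = r1 / n
    # Variance as E[x^2] - (E[x])^2, clamped at 0 (assumption, see above).
    variance = max(r2 / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)

# Works with any iterable, including a generator over a large file.
mean, stddev = one_pass_mean_stddev(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(mean, stddev)
```

One caveat: when the mean is large compared with the spread, subtracting two nearly equal quantities loses precision, so an incremental update such as Welford's method may be preferable for such data.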
