Weighted average of MTBF vs failure rates

probability, reliability

Background

Let's say I have $m$ machines. Machine $i$ was observed running for a total time $u_i$ within some observation window, and $n_i$ failures were observed for it in that time. Now, I want an estimate of the overall failure rate across the machines. Assuming a constant failure rate means modeling the failures as a Poisson process. Maximum likelihood estimation on that process gives the failure rate (which is the rate parameter of the Poisson process):

$$\hat{\lambda} = \frac{\sum\limits_1^m n_i}{\sum\limits_1^m u_i} \tag{1}$$
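
For reference, here is a sketch of the maximum likelihood step. It assumes the counts are independent with $n_i \sim \text{Poisson}(\lambda u_i)$, which is one way to write down the constant-rate assumption. The log-likelihood and its stationary point are:

$$\ell(\lambda) = \sum_{i=1}^m \left( n_i \log(\lambda u_i) - \lambda u_i - \log n_i! \right), \qquad \frac{d\ell}{d\lambda} = \frac{1}{\lambda}\sum_{i=1}^m n_i - \sum_{i=1}^m u_i = 0 \implies \hat{\lambda} = \frac{\sum_{i=1}^m n_i}{\sum_{i=1}^m u_i}$$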

And the mean time between failures becomes:

$$\hat{\tau} = \frac{1}{\hat{\lambda}} = \frac{\sum\limits_1^m u_i}{\sum\limits_1^m n_i}$$

Using individual machines

Now let's say someone told me the failure rates of the individual machines,

$$\lambda_i = \frac{n_i}{u_i}$$

The way to combine them and produce the same result as equation (1) is to take the weighted average of the individual rates, weighted by how long we observed them for. This makes sense.

$$\hat{\lambda} = \sum\limits_1^m \lambda_i \frac{u_i}{\sum u_j}$$

This produces the same result as equation (1).
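
A quick numerical check of this identity, as a minimal sketch. The uptimes and failure counts are made-up illustrative values, not data from a real system. Exact fractions are used so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical data: (uptime u_i in hours, failures n_i) for each machine.
machines = [(1000, 2), (500, 5), (250, 1)]

total_uptime = sum(u for u, _ in machines)
total_failures = sum(n for _, n in machines)

# Equation (1): pooled failures divided by pooled uptime.
pooled_rate = Fraction(total_failures, total_uptime)

# Uptime-weighted average of the per-machine rates lambda_i = n_i / u_i.
weighted_rate = sum(
    Fraction(n, u) * Fraction(u, total_uptime) for u, n in machines
)

print(pooled_rate == weighted_rate)
```

The two agree because each term $\lambda_i u_i$ is just $n_i$, so the weighted sum collapses to $\sum n_i / \sum u_j$.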

The question: why can't I do this with MTBF

Now, what if I instead got the MTBFs (mean times between failures) of the individual machines,

$$\tau_i = \frac{u_i}{n_i}$$

How do I combine these to get the overall MTBF? It turns out I can't take the weighted average of the individual MTBFs. Why does this work with the rates but not with the MTBFs? And is there a formula (with justification) for going from the individual MTBFs to the overall MTBF, like there was for failure rates?

Best Answer

You already established that $$\hat \lambda = \sum_{i=1}^m \lambda_i w_i,$$ where $$w_i = \frac{u_i}{\sum u_j}$$ is a weighting factor representing the proportion of uptime observed for machine $i$. Moreover, you also established $$\hat \tau = \frac{1}{\hat \lambda}, \quad \tau_i = \frac{1}{\lambda_i}.$$ So all that remains to do is to write $$\hat \tau = \frac{1}{\hat \lambda} = \left(\sum_{i=1}^m \frac{w_i}{\tau_i}\right)^{-1}.$$ This is the desired relationship.
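
As a minimal sketch, the relationship can be checked numerically. The data below are made up for illustration. Every machine is assumed to have at least one failure, since $\tau_i = u_i / n_i$ is undefined when $n_i = 0$.

```python
from fractions import Fraction

# Hypothetical data: uptime u_i and failure count n_i per machine (all n_i > 0).
uptimes = [1000, 500, 250]
failures = [2, 5, 1]

total_uptime = sum(uptimes)
weights = [Fraction(u, total_uptime) for u in uptimes]       # w_i
mtbfs = [Fraction(u, n) for u, n in zip(uptimes, failures)]  # tau_i = u_i / n_i

# Overall MTBF computed directly from the pooled totals.
pooled_mtbf = Fraction(total_uptime, sum(failures))

# Weighted harmonic mean of the individual MTBFs (the formula above).
harmonic = 1 / sum(w / t for w, t in zip(weights, mtbfs))

# Uptime-weighted arithmetic mean of the MTBFs, for comparison.
arithmetic = sum(w * t for w, t in zip(weights, mtbfs))

print(float(pooled_mtbf), float(harmonic), float(arithmetic))
```

With these numbers the weighted harmonic mean reproduces the pooled MTBF exactly, while the uptime-weighted arithmetic mean does not.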

The reason there is no formula of the form $$\hat \tau = \sum_{i=1}^m \tau_i w_i'$$ with weights $w_i'$ that depend only on the observation times is that each machine's mean time between failures $\tau_i$ is a reciprocal rate, where time is in the numerator. Other examples of reciprocal rates might be:

Minutes per mile traveled
Years per job completed
Hours between successive bus arrivals.

Consequently, you cannot add two reciprocal rates together to get a meaningful quantity. For instance, if car A takes $1$ minute to travel a mile, and car B takes $2$ minutes to travel a mile, it makes no sense to say their combined time to travel a mile is $1 + 2 = 3$ minutes. Instead, you have to observe that car A's rate is $1$ mile per minute, and car B's rate is $0.5$ miles per minute, and so their average rate is a weighted average of these two rates, where the weights are the proportions of time each car travels.
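
To make the car example concrete, suppose (purely as an illustrative assumption) that each car drives for $2$ minutes, so the time weights are $w_A = w_B = \tfrac{1}{2}$. Car A covers $2$ miles and car B covers $1$ mile, so together they cover $3$ miles in $4$ minutes of driving:

$$\text{average rate} = \tfrac{1}{2}(1) + \tfrac{1}{2}(0.5) = 0.75 \text{ miles per minute}, \qquad \text{minutes per mile} = \left(\frac{1/2}{1} + \frac{1/2}{2}\right)^{-1} = \frac{4}{3}$$

This is the same weighted harmonic mean as in the MTBF formula, and it differs from the time-weighted arithmetic mean $\tfrac{1}{2}(1) + \tfrac{1}{2}(2) = 1.5$ minutes per mile.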
