System Availability of a server cluster in terms of MTTF, MTTR and RMT

probability

Consider the System Availability (A) of a server cluster in terms of
three parameters: namely the mean time to failure (MTTF), the mean
time to repair (MTTR), and a regular maintenance time (RMT). The MTTF
reflects the average uptime between two adjacent natural failures. The
MTTR is the downtime due to natural failure. The RMT refers to
scheduled downtime for hardware/software maintenance or updates. (a)
Consider a cloud system with a demanded availability A = 98%. If the MTTF
is known to be two years (or 365 × 24 × 2 = 17,520 hours) and the MTTR
is known to be 24 hours, what is the value of RMT in hours per month that
you can schedule for this cloud system?

So if $$A = \frac{MTTF}{MTTF + MTTR}$$
$$A = \frac{17520}{17520+24}$$
$$A = \frac{17520}{17544} = 0.998632$$
Then 24 hours times 30 days: $$24 \times 30 = 720$$
$$1-0.998632 = 0.001368$$
$$0.001368 \times 720 = 0.98496$$
$$0.98496 \times 24 = 23.63$$
So is the answer 23.63 hours? That doesn't seem right.

(b) Consider a cloud cluster of three servers. The cluster is
considered available (or acceptable with a satisfactory performance
level), if at least k servers are operating normally where k ≤ 3.
Assume that each server has an availability rate of p (or a failure
rate of 1 − p). Derive a formula to calculate the total cluster
availability A (i.e., the probability that the cluster is available
satisfactorily). Note that A is a function of k and p.

So for all three servers to be available, the availability is $A = p \cdot p \cdot p = p^3$; the probability that one and only one server is available is $3p(1-p)(1-p)$, that is, the probability of one server being up and the other two being down. Any one of the three servers can be the one that is up, so am I supposed to add the three cases together?

The cited formula, which I'm struggling to understand, looks like this:
[image: formula from the reading]

Best Answer

Intuitively, you have spent almost none of your unavailability budget on random failures because $0.998$ is so much closer to $1$ than $0.98$. You are allowed to spend $0.02$ of the time down, which would be $0.02 \cdot 720=14.4$ hours per month. You should reduce the $0.02$ to account for the random failures. Your $0.001368\cdot 720$ gives the average number of hours per month that the system is down due to random failure. There is no sense multiplying that by $24$.
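
The budget argument above can be checked numerically. This is only a minimal sketch: it assumes that downtime from random failures and downtime from scheduled maintenance simply add up, and it uses the 30-day (720-hour) month from the question. The function name `rmt_hours_per_month` and its parameters are made up for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the 30-day month used in the question


def rmt_hours_per_month(target_availability, mttf_hours, mttr_hours,
                        hours_per_month=HOURS_PER_MONTH):
    """Hours per month left for scheduled maintenance (hypothetical helper).

    Assumes failure downtime and maintenance downtime are additive.
    """
    # Fraction of time lost to random failures: MTTR / (MTTF + MTTR)
    failure_fraction = mttr_hours / (mttf_hours + mttr_hours)
    # Total fraction of time the system is allowed to be down
    downtime_budget = 1.0 - target_availability
    # Whatever the failures do not use up can go to maintenance
    return (downtime_budget - failure_fraction) * hours_per_month


rmt = rmt_hours_per_month(0.98, 17520, 24)
print(f"{rmt:.2f} hours per month")  # roughly 13.4 hours
```

Under these assumptions the failures use up about $0.98$ hours of the $14.4$-hour monthly budget, which leaves roughly $13.4$ hours per month for maintenance.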

For (b), the chance of exactly one server being up is $3p(1-p)^2$ as you say. That is exactly the $i=1$ term of the sum. If you understand where $3p(1-p)^2$ came from, you should be able to identify the other terms in the sum. The $i=2$ term is ${3\choose 2}p^2(1-p)^1$.
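
To see how the terms combine, here is a small sketch. The formula from the reading is not reproduced here, so this assumes it is the usual binomial sum over $i = k, \dots, 3$ of ${3\choose i}p^i(1-p)^{3-i}$, with the servers failing independently. The name `cluster_availability` is made up for illustration.

```python
from math import comb


def cluster_availability(k, p, n=3):
    """Probability that at least k of n independent servers are up.

    Assumed form: sum over i = k..n of C(n, i) * p^i * (1 - p)^(n - i).
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Each term is "exactly i servers up"; the i = 1 term is 3 * p * (1 - p)**2.
p = 0.9
for k in range(1, 4):
    print(k, cluster_availability(k, p))
```

With $k=3$ only the $i=3$ term survives and the result is $p^3$; with $k=1$ the sum is everything except the case where all three servers are down, so it equals $1-(1-p)^3$.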
