[Math] Probability of system failure in a distributed network

approximation-theorybinomial distributioncomputational complexitypr.probabilitysoft-question

I am trying to build a mathematical model of the availability of a file in a distributed file-system. The system works like this: a node $x$ stores a file $f$ (encoed using erasure codes) at $rb$ remotes nodes, $r$ is the replication-factor where $b$ is a constant. Erasure-coded files have the property that the file can be restored $iff$ at least $b$ of the remote nodes are available and return its part of the file.

The simplest approach to this is to assume that all remote nodes are independent of each other and have the same availability $p$. With these assumptions the availability of a file becomes $$pf = \sum_{i=b}^{rb} \binom{rb}{i} p^i(1 – p)^{rb – i}$$

Unfortunately these assumptions can introduce a non-neligible error, as shown by this paper: http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf.

The other extreme is to calculate the probability of each possible combination of availaible/non-available node and take the sum of all these outcomes (which is sort of what they suggest in the paper above, just more formally than what I just described). This approach is more correct but has a computational cost of $O(2^{rb})$.

Do you guys have any ideas of a good approximation which introduces less error than the binomial distribution-aproximation but with better computational cost than $O(2^{rb})$?

You can assume that the availability-data of each node is a set of tuples consisting of $(measurement-date, node\ measuring, node\ being\ measured, succes/failure-bit)$

Best Answer

Instead of summing from $b$ to $rb$ and calculating the probability of finding the file, why not decrease the number of possibilities to consider by summing from $0$ to $b-1$ and calculating the probability of not finding the file? As long as the replication factor is greater than 1, you will see a savings in calculation time.

$$p_{fail} = \sum_{i=0}^{b} \binom{rb}{i} p^{rb-i} (1 - p)^{i}$$

In the same manner, instead of trying to deal with all probabilites of success over $rb$ nodes, you only have to deal with the first $b$ binomial coefficients summed number of possibilities, rather than $2^{rb}$ possibilities for failure.

Related Question