Predict the next number in a sequence of decreasing probabilities

probabilityprobability distributionssequences-and-series

Apologies if the question is phrased incorrectly, i've struggled to understand the right terminology which would have allowed me to research this myself.

I have a series of probabilities in a table, where the probability of an event happening three times is less than the probability of it happening twice.

I have historical data of predictions showing the predicted probability of an event happening two and three times. I then have more predictions which only show the probability of it happening twice, but are missing the predictions for the event happening three times.

The complete data that had predictions for the event happening both two and three times appears to increase in a fairly logical pattern and I think I should be able to estimate what the probability of the event happening three times is where that data is missing.

two_events three_events diff percent_increase
20.78 5.42 15.36 73.917228
21.14 5.58 15.56 73.604541
23.78 6.76 17.02 71.572750
23.96 6.89 17.07 71.243740
24.98 7.32 17.66 70.696557
24.99 7.32 17.67 70.708283
25.22 7.43 17.79 70.539255
26.65 8.14 18.51 69.455910
28.02 8.85 19.17 68.415418
29.57 9.67 19.90 67.297937
30.94 10.42 20.52 66.321913
31.05 10.49 20.56 66.215781
31.29 10.62 20.67 66.059444
32.00 11.03 20.97 65.531250
32.71 11.44 21.27 65.025986
34.21 12.33 21.88 63.957907
34.31 12.39 21.92 63.888079
35.56 13.17 22.39 62.964004
36.13 13.52 22.61 62.579574
36.46 13.74 22.72 62.314866
36.58 13.81 22.77 62.247130
38.86 15.31 23.55 60.602162
39.97 16.06 23.91 59.819865
41.62 17.22 24.40 58.625661
41.83 17.37 24.46 58.474779
42.95 18.22 24.73 57.578580
45.15 19.35 25.80 57.142857
46.33 20.73 25.60 55.255774
48.45 22.42 26.03 53.725490
48.52 22.49 26.03 53.647980
49.80 23.54 26.26 52.730924
50.78 24.37 26.41 52.008665
52.55 25.90 26.65 50.713606
53.11 26.40 26.71 50.291847
61.85 34.84 27.01 43.670170
62.34 35.36 26.98 43.278794
64.98 38.22 26.76 41.181902
69.28 43.17 26.11 37.687644
72.29 46.91 25.38 35.108590
75.13 50.63 24.50 32.610142

I think I should be able to use this to predict that if the probability of two events happening is 21%, then the probability of three events happening is going to be roughly 5.50% for example.

two_events three_events diff percent_increase
20.78 5.42 15.36 73.917228
21 (example) ??? (presumably 5.50 or 5.51) ??? ???
21.14 5.58 15.56 73.604541

To be clear the first column is not sequential in anyway, I am only looking at predicting what the second column would be based on the first.

It also does not need to be a perfect formula, only roughly correct

But I do not know how to come up with a formula to calculate this or the right terminology to use to search for it.


Thanks to a great link in the answers i've been given the following formula:

$$y=a(x+b)^d+c$$

With these variables:

a = 0.00000154534
b = 75.0975
c = 3.47686
d = 6.59594

But I am not sure how to write that out in something like Excel or code such as Python, Javascript or SQL.

If I use

((0.00000154534(21+75.0975))^3.47686)+6.59594

to work out the value of 21, which should be 5.50, they all result in an error and seem to require something between a(x+b).

I assume there is a function for whatever a(x+b) is doing that you would have to use in Excel or similar, but i'm afraid I have no idea what it is called and have found it impossible to Google.

Confusingly typing ((0.00000154534(21+75.0975))^3.47686)+6.59594 into Google comes out with 6.59594 so i'm clearly making a basic error somewhere.

Could you please provide an example either using an Excel formula or code?

Best Answer

Welcome to MSE!

What you're describing sounds like you're looking to interpolate or extrapolate from this data to other possible points.

You have two possibilities. In both cases, though, the difference $P_2-P_3$ isn't all that useful. Nor is the increase from one data point to the next. What you'd really like is to find a function such that $P_3=f(P_2)$.

Since this data set looks relatively smooth, you could just do two-point interpolation. In other words, if you have a $P_2$ within the bounds of the dataset, take the points above and below it, and pretend the line they define is the function (for this part of the data). You don't even really have to "graph" it, just find the formula of the line and plug in your $P_2$ value.

Sadly, that only works for interpolation. If you need something outside the bounds of the dataset, you'll have to do more.

Better than two-point interpolation is to use a curve-fitting program. There are several places online to do a curve-fit; personally I like Desmos because it plays well, though sometimes it doesn't like copy/paste. You also need some guess as to the form of the equation for Desmos--other systems likely do, Excel being an example.

This looks approximately quadratic, at least in the middle of its range. Possibly a different exponent would be useful. Letting the $x$-axis be $P_2$ and the $y$-axis be $P_3$, I'd try fitting to the forms: $$y=ax^2+c$$ or more complex: $$y=a(x-b)^d+c$$ to account for a possible offset on the $x$-axis.

Good luck!

(I say approximately quadratic because, for instance, $16=40\% \cdot 40$, and $24 \approx 50\%\cdot 50$.)