Solved – I need someone to check the conditional probability calculation function

Tags: conditional probability, pandas, python, self-study

I am following the book "Think Stats. Probability and Statistics for Programmers" and doing the exercises using numpy + pandas.

Currently I am on exercise 2.7 on conditional probability:

Exercise 2.7 Write a function that implements either of these
algorithms and computes the probability that a baby will be born
during Week 39, given that it was not born prior to Week 39.
Generalize the function to compute the probability that a baby will be
born during Week x, given that it was not born prior to Week x, for
all x. Plot this value as a function of x for first babies and others.

So, for example, I have a series first_babies["prglength"] that looks like

prglength
37
39
39
41
46

Each number is the number of weeks a pregnancy lasted. So the first record means a first baby was born in week 37.

I need to create a function that takes a week number x and calculates the probability of birth on week x given that the baby was not born prior to week x.
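Written out as a formula, and assuming X denotes the pregnancy length in weeks (notation introduced here for illustration, not taken from the book), the quantity to estimate is roughly:

```latex
P(X = x \mid X \ge x)
  = \frac{P(X = x)}{P(X \ge x)}
  \approx \frac{\#\{\text{pregnancies of length } x\}}{\#\{\text{pregnancies of length} \ge x\}}
```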

I made the following solution:

def probability_of_birth_on_week(x, df):
    filtered = df[ df >= x ]
    hist = np.histogram(filtered, bins=len(filtered.drop_duplicates()), normed = True)
    try:
        return hist[0][0]
    except:
        return 0



weeks = sorted(first_babies["prglength"].drop_duplicates())
#print(probability_of_birth_on_week(46, first_babies["prglength"]))
plt.plot(weeks, [ probability_of_birth_on_week(x, first_babies["prglength"]) for x in weeks] )
plt.plot(weeks, [ probability_of_birth_on_week(x, other_babies["prglength"]) for x in weeks] )

You can check the full notebook here: https://www.dropbox.com/s/f0166s5o4lb08sx/firstbabies.ipynb?dl=0

So what I do is remove all births in weeks before x and normalize the rest using the numpy.histogram function, with the week numbers as bins.

I am getting numbers that look somewhat okay, but I am not sure I did it right. The book doesn't provide the correct answer, and I am using a different dataset (the book uses a 2006 study and I decided to take a 2011 study).

Can you take a look at my solution and tell me if I have any mistakes?

Update:
Following @BrentKerby's advice, I edited the code to separate the data into only two bins: the first holds only the values equal to x, the second holds all values larger than x.
Updated code:

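def probability_of_birth_on_week(x, df):
    # Keep only pregnancies that lasted at least x weeks.
    filtered = df[ df >= x ]
    if len(filtered) == 0:
        # Nothing left to condition on (max() would fail on empty data).
        return 0
    # Two bins: [x, x+1) holds births in week x, the second holds all later weeks.
    hist = np.histogram(filtered, bins=[x, x+1, max(filtered)+1], density = True)
    # The first bin has width 1, so its density is the fraction born in week x.
    return hist[0][0]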
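As a cross-check on the two-bin histogram, the same probability can be obtained by counting directly: births in week x divided by pregnancies that reached week x. This is only a minimal sketch; the function name conditional_birth_probability is made up for illustration, and the toy data is the five-value sample series shown earlier, not the real dataset.

```python
import numpy as np

def conditional_birth_probability(x, weeks):
    """Share of pregnancies ending in week x among those lasting at least x weeks."""
    weeks = np.asarray(weeks)
    # Pregnancies that had not ended before week x.
    at_risk = np.count_nonzero(weeks >= x)
    if at_risk == 0:
        # Nothing to condition on; mirror the question's fallback of 0.
        return 0.0
    # Pregnancies that ended exactly in week x.
    born_now = np.count_nonzero(weeks == x)
    return born_now / at_risk

# Toy data: the sample values listed in the question.
sample = [37, 39, 39, 41, 46]
p39 = conditional_birth_probability(39, sample)
print(p39)  # 2 of the 4 pregnancies reaching week 39 end in week 39 -> 0.5
```

If both functions are run on the same integer-valued series, they should agree for every week x that has data at or beyond it.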

Best Answer

Your approach of using a histogram is clever. But why set the number of bins equal to the number of distinct values in the filtered data? You really only need two bins, one for x and one for the data larger than x, so why not just use that? If I understand the documentation of np.histogram correctly, passing an integer for bins produces that many equal-width bins, which does not ensure that the x data alone goes into the first bin; other values just above x could also land in the first bin, giving incorrect results.

To give an example, suppose x=20 and the data (after filtering) is [20, 21, 25, 40] (unrealistic in this context, but let's ignore that). The histogram will then be constructed using 4 bins, 20-25, 25-30, 30-35, and 35-40, so both 20 and 21 go into the first bin, which is not what is desired.
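The example can be reproduced with a few lines of NumPy. This sketch uses the same four values and compares an integer bin count with explicit bin edges; the variable names are illustrative only.

```python
import numpy as np

data = [20, 21, 25, 40]

# An integer `bins` asks for that many equal-width bins between min and max.
counts_equal, edges_equal = np.histogram(data, bins=4)

# Explicit edges: [20, 21) isolates x = 20, and [21, 41] holds everything later.
counts_two, edges_two = np.histogram(data, bins=[20, 21, 41])

print(edges_equal, counts_equal)  # counts are [2, 1, 0, 1]: 20 and 21 share a bin
print(edges_two, counts_two)      # counts are [1, 3]: only 20 is in the first bin
```

With explicit edges the first bin has width 1, so the first count divided by the total (1/4 here) is the conditional probability for x = 20.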
