Solved – Fitting Beta Distributions to Data

beta distributionfittingnumpyscipy

I am trying to reproduce some beta distribution parameters found in this published paper. I have two data sets, y1 and y2, that are generated in the following way (NumPy code):

size = 5000
x = np.arange(size)

# Dataset #1
I = np.random.randint(0, size, size=size)
k = I.shape[0]
nnmark = np.zeros(k)
y1 = np.zeros(k)

for i in range(k):
    j = I[i]
    nnmark[min(i, j)] = nnmark[min(i, j)] + 1
    nnmark[max(i, j)] = nnmark[max(i, j)] - 1

y1 = np.cumsum(nnmark)


# Dataset #2
I = np.empty(size, dtype=np.int)
for i in range(I.shape[0]):
    I[i] = np.random.randint(i, size)

for i in range(k):
    j = I[i]
    nnmark[min(i, j)] = nnmark[min(i, j)] + 1
    nnmark[max(i, j)] = nnmark[max(i, j)] - 1

y2 = np.cumsum(nnmark)

Plotted Data

y1 is shown in black and y2 is shown in red above. According to the paper, both curves can be approximated by a beta distribution. In the original paper, they claim that since y1 has a width that is equal to the size and a height that is equal to 0.5*size then this is a special case of the beta distribution with beta(2, 2, a, c). Finally, their code shows that this any value from this distribution can be computed from:

y = 2*(x)*(size-z)/size

It isn't clear how this is derived. I am only familiar with basic beta distributions with alpha and beta parameters bounded between [0,1] and some references explain that a and c often refer to the location and scale of the distribution. I also noticed that the height of y1 and y2 are quite large and, unfortunately, I don't know how to create, fit, or rescale a beta distribution so as to obtain the correct parameters that would fit either of these curves.

My final goal would be to change the size (currently it is set to 5,000 for demonstration purposes but it can be smaller or larger) and be able to represent that dataset with an approximate beta distribution (albeit, it will likely need to be stretched horizontally and vertically).

I have been looking at the SciPy beta distribution function but the documentation is vague. I've gotten as far as:

a1, b1, c1, d1 = beta.fit(y1, loc=0, scale=size)
a2, b2, c2, d2 = beta.fit(y2, loc=0, scale=size)

But neither of the PDFs look like the original data when plotted next to it.

Best Answer

To generate an accurate curve fit, you can sample the histogram via the scipy.stats.rv_histogram function and then average over the fitted parameters from multiple iterations:

n_iter = 1000
params = np.empty((n_iter, 2))
for i in range(n_iter):    
    n_samples = 1000
    hist_dist = scipy.stats.rv_histogram((y2, np.append(np.arange(k), k)))
    data = hist_dist.rvs(size=n_samples)
    a, b, c, d = scipy.stats.beta.fit(data, floc=0, fscale=k)

    params[i, 0] = a
    params[i, 1] = b

a_mean = np.round(np.mean(params[:, 0]), 2)
b_mean = np.round(np.mean(params[:, 1]), 2)
print(a_mean, b_mean)

y2_normalized_fit = scipy.stats.beta.pdf(np.arange(k), a_mean, b_mean, loc=0, scale=k)
scaling_factor, _, _, _ = np.linalg.lstsq(y2_normalized_fit.reshape(-1,1), y2, rcond=None)

Then, you multiply the scaling_factor by the fitted curve to get the right curve.

y2_normalized_fit * scaling_factor

enter image description here

Related Question