I am trying to reproduce some beta distribution parameters found in this published paper. I have two data sets, y1
and y2
, that are generated in the following way (NumPy code):
size = 5000
x = np.arange(size)
# Dataset #1
I = np.random.randint(0, size, size=size)
k = I.shape[0]
nnmark = np.zeros(k)
y1 = np.zeros(k)
for i in range(k):
j = I[i]
nnmark[min(i, j)] = nnmark[min(i, j)] + 1
nnmark[max(i, j)] = nnmark[max(i, j)] - 1
y1 = np.cumsum(nnmark)
# Dataset #2
I = np.empty(size, dtype=np.int)
for i in range(I.shape[0]):
I[i] = np.random.randint(i, size)
for i in range(k):
j = I[i]
nnmark[min(i, j)] = nnmark[min(i, j)] + 1
nnmark[max(i, j)] = nnmark[max(i, j)] - 1
y2 = np.cumsum(nnmark)
y1
is shown in black and y2
is shown in red above. According to the paper, both curves can be approximated by a beta distribution. In the original paper, they claim that since y1
has a width that is equal to the size
and a height that is equal to 0.5*size
then this is a special case of the beta distribution with beta(2, 2, a, c)
. Finally, their code shows that this any value from this distribution can be computed from:
y = 2*(x)*(size-z)/size
It isn't clear how this is derived. I am only familiar with basic beta distributions with alpha and beta parameters bounded between [0,1] and some references explain that a
and c
often refer to the location
and scale
of the distribution. I also noticed that the height of y1
and y2
are quite large and, unfortunately, I don't know how to create, fit, or rescale a beta distribution so as to obtain the correct parameters that would fit either of these curves.
My final goal would be to change the size
(currently it is set to 5,000 for demonstration purposes but it can be smaller or larger) and be able to represent that dataset with an approximate beta distribution (albeit, it will likely need to be stretched horizontally and vertically).
I have been looking at the SciPy beta distribution function but the documentation is vague. I've gotten as far as:
a1, b1, c1, d1 = beta.fit(y1, loc=0, scale=size)
a2, b2, c2, d2 = beta.fit(y2, loc=0, scale=size)
But neither of the PDFs look like the original data when plotted next to it.
Best Answer
To generate an accurate curve fit, you can sample the histogram via the
scipy.stats.rv_histogram
function and then average over the fitted parameters from multiple iterations:Then, you multiply the
scaling_factor
by the fitted curve to get the right curve.