I have some data that follows some unknown probability function. I would like to roughly extract that probability function.
My approach is to plot the data in a histogram and smooth it out using LOWESS; I implemented the smoothing following this post. I then use interpolation to create my probability density function (or at least I think so). But here I am stuck: I know I have to normalize the distribution somehow, but simply dividing by the number of data points does not do the trick (read: something is completely wrong). How do I do this? Is there a better approach?
I would like to implement all of this in Python. Here is my code so far:
import numpy as np
from scipy import interpolate, integrate
from scipy.stats import rv_continuous
import statsmodels.api as sm
import matplotlib.pyplot as plt


class CustomDistribution(rv_continuous):
    def __init__(self, custom_pdf):
        super().__init__()
        self.custom_pdf = custom_pdf

    def _pdf(self, x, *args):
        # the interpolated LOWESS fit acts as the pdf
        return self.custom_pdf(x)

    # def _ppf(self, q, *args):
    #     return


# create some normally distributed values and make a histogram
a = np.random.normal(size=10000)
counts, bins = np.histogram(a, bins=100, density=True)
bin_widths = np.diff(bins)
bin_centers = bins[:-1] + bin_widths / 2  # midpoints, not right edges

# roughly fit a smooth curve through the histogram
lowess = sm.nonparametric.lowess(counts, bin_centers, frac=0.1)
fit = interpolate.interp1d(
    lowess[:, 0], lowess[:, 1], kind='cubic', fill_value=0,
    bounds_error=False
)

# integral = integrate.quad(fit, -np.inf, np.inf)[0]
# print(integral)

distr = CustomDistribution(fit)
# print(distr.rvs(size=1))

plt.hist(a, 100, density=True, label='original data')
plt.plot(bin_centers, fit(bin_centers), label='fit')
plt.hist(distr.rvs(size=100), 100, density=True, label='data from fit')
plt.legend()
plt.show()
In order to be able to draw random variates from this distribution later on, I also need to implement the inverse cdf / _ppf(), according to the rv_continuous documentation.
EDIT
So I have been able to draw some random variates from the fit, but it takes incredibly long. The main reason seems to be that in order to draw the variates efficiently, I need a properly implemented inverse cdf / _ppf(). From the documentation of rv_continuous:
The default method _rvs relies on the inverse of the cdf, _ppf, applied to a uniform random variate. In order to generate random variates efficiently, either the default _ppf needs to be overwritten (e.g. if the inverse cdf can be expressed in an explicit form) or a sampling method needs to be implemented in a custom _rvs method.
Because I don't have that, the cdf is created by numerically integrating the pdf, but for some reason the integral does not seem to converge:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
So I guess I need a good way to create my own implementation of at least the cdf, but better yet of the inverse cdf.
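One way to sketch such an implementation, assuming the pdf is available on a grid of points, is to tabulate the cdf with a cumulative trapezoid rule and then invert it by interpolation, which enables inverse-transform sampling. The standard normal pdf below is just a stand-in for the fitted curve from the question:

```python
import numpy as np
from scipy import integrate, interpolate

# Placeholder for the fitted pdf from the question: a tabulated standard normal.
xs = np.linspace(-5, 5, 1001)
pdf_vals = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

# Cumulative integral of the pdf, normalized so the cdf ends at exactly 1.
cdf_vals = integrate.cumulative_trapezoid(pdf_vals, xs, initial=0)
cdf_vals /= cdf_vals[-1]

cdf = interpolate.interp1d(xs, cdf_vals, bounds_error=False, fill_value=(0, 1))
# Invert by swapping the roles of x and y (requires a monotone cdf).
ppf = interpolate.interp1d(cdf_vals, xs)

# Inverse-transform sampling: push uniform variates through the inverse cdf.
rng = np.random.default_rng(0)
u = rng.uniform(0, 1, size=1000)
samples = ppf(u)
```

This sidesteps quad() entirely, so the non-convergence warning never arises; the accuracy is limited only by the grid resolution.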
Best Answer
So after lots of research and many iterations of trial & error, I found a solution that is satisfying to me. I am going to list some of the posts and articles that I came across during my research in the hopes that it might be helpful to somebody else.
First of all, I would like to mention this answer, which led me down my original path in the first place. I still don't fully understand the presented solution, though, which is mainly why I tried my own approach shown in the question. It also mentions the possibility of adding Gaussian tails, which I would have liked, but it fails to explain how to do so.
Then there is this post, which asks about drawing random variates from a known discrete distribution; this is very easy to do in Python using numpy.random.choice(). However, it is not exactly what I am looking for, although one answer there covers drawing samples from a known continuous distribution. Since I don't know my distribution, this does not work for me.

Eventually, I found this answer, which explains the approach that works best for me. The technique is called 'kernel density estimation' (KDE), which is probably a well-known concept in mathematics, but was new to me.
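As an aside, the discrete case really is a one-liner; the values and probabilities below are made up purely for illustration:

```python
import numpy as np

# Hypothetical discrete distribution: outcomes with their probabilities.
values = np.array([1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.4])  # must sum to 1

# Draw weighted samples (Generator.choice is the modern form of np.random.choice).
rng = np.random.default_rng(0)
samples = rng.choice(values, size=10000, p=probs)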
Here is an article talking about how to work with splines in Python. This was still too complicated for me, but then I found this answer, which explains how to use a Gaussian KDE in Python. As far as I understand it, a Gaussian KDE just means placing a Gaussian curve on each data point and averaging them, which is easy enough for me to understand and, in hindsight, a pretty obvious solution.
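To make that idea concrete, here is a minimal hand-rolled Gaussian KDE (the bandwidth value is an arbitrary choice for illustration; the libraries below pick it more carefully):

```python
import numpy as np

def manual_gaussian_kde(x, data, bandwidth):
    """Evaluate a Gaussian KDE at points x: one Gaussian of standard
    deviation `bandwidth` per data point, averaged over all points."""
    u = (x[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=500)
xs = np.linspace(-4, 4, 401)
density = manual_gaussian_kde(xs, data, bandwidth=0.3)
```

Because each kernel integrates to 1 and we average, the result is automatically a normalized density, which is exactly the normalization step I was missing in my histogram approach.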
My Solution
I know of two libraries that offer Gaussian KDE fitting in Python. The one I came across first is the scipy.stats.gaussian_kde class. Its advantages are that it offers several ways to determine the bandwidth of the Gaussian kernels and comes with methods that directly evaluate the pdf and resample from it. It didn't work 100% for me at first, though, which is why I looked for standard ways to fit other kernel types.

That led me to this article, which mentions the sklearn.neighbors.KernelDensity class. Its advantage is that, next to the Gaussian kernel, it also offers a whole bunch of other kernels ('tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'). However, generating the pdf is more cumbersome, since score_samples() returns the log of the density. Both versions come with their own sets of parameters that can be tuned, although I haven't fully explored them yet.
Below you find code which compares the two methods using some sample data:
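A minimal version of such a comparison, assuming normally distributed sample data (the sklearn bandwidth of 0.3 is an arbitrary choice for illustration), might look like this:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
xs = np.linspace(-4, 4, 200)

# scipy: bandwidth is chosen automatically (Scott's rule by default).
scipy_kde = gaussian_kde(data)
scipy_pdf = scipy_kde(xs)

# sklearn: expects 2-D input and an explicit bandwidth;
# score_samples() returns the log density, hence the exp().
skl_kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(data[:, None])
skl_pdf = np.exp(skl_kde.score_samples(xs[:, None]))

# Both fits can also generate new random variates directly.
scipy_samples = scipy_kde.resample(100).ravel()
skl_samples = skl_kde.sample(100).ravel()
```

Note that the built-in resample()/sample() methods make the custom _ppf() from my original question unnecessary.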
Here is the dataset that I used for the plot above: