Solved – statsmodels KDE icdf

kernel-smoothing, statsmodels

GOAL

I am interested in the inverse cumulative distribution function, aka the quantile function. My plan is to draw random samples from the fitted distribution using the inverse probability integral transform, i.e., feeding samples from a uniform [0,1] distribution into the inverse cumulative distribution function.
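For context, a minimal illustration of the inverse probability integral transform (my own example, using the exponential distribution as a stand-in because its quantile function has a closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantile (inverse CDF) of an Exp(rate) distribution: Q(u) = -ln(1 - u) / rate
rate = 2.0
def exp_icdf(u):
    return -np.log(1.0 - u) / rate

# Inverse probability integral transform:
# push uniform [0, 1) samples through the quantile function
u = rng.uniform(size=100_000)
x = exp_icdf(u)

print(x.mean())  # should be close to 1 / rate = 0.5
```

The same recipe works for any quantile function, including one obtained by interpolating a KDE-based `icdf` array.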

WHAT I TRIED SO FAR

I have fit a univariate KDE to a dataset:

from statsmodels.nonparametric.api import KDEUnivariate
kde = KDEUnivariate(X.values)  # X is my data (a pandas Series)
kde.fit()

Now, the guide states that the icdf is a function.
However, what I get is a NumPy array, which is also fine -- I am happy to interpolate the data myself.
What I do not get is where those numbers come from.

To begin with, kde.icdf is bigger in size than my sample X.values -- suggesting that the library has somehow imposed a finer-grained mesh (in the source code, gridsize = len(self.density)).

In[88]: kde.icdf
Out[88]: 
array([-0.08641907, -0.08623651, -0.07847717, ...,  0.09789788,
        0.10873394,  0.10898888])
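Putting the pieces together, here is a sketch of what I believe is going on (an assumption based on reading the statsmodels source, not on documented behaviour): `kde.icdf` is evaluated on an evenly spaced quantile mesh of the same length as `kde.density`, so it can be interpolated to draw samples:

```python
import numpy as np
from statsmodels.nonparametric.api import KDEUnivariate

rng = np.random.default_rng(42)
sample_data = rng.normal(size=1000)

kde = KDEUnivariate(sample_data)
kde.fit()

# The icdf array has gridsize points, matching the density grid,
# not the original sample size
assert len(kde.icdf) == len(kde.density)

# Inverse probability integral transform via linear interpolation
# of the icdf array over a uniform quantile mesh
quantile_mesh = np.linspace(0, 1, len(kde.icdf))
u = rng.uniform(size=10_000)
samples = np.interp(u, quantile_mesh, kde.icdf)

print(samples.mean(), samples.std())  # roughly 0 and 1 for N(0, 1) data
```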

Secondly, I found the chart here odd.
I would have expected $icdf: [0,1] \rightarrow X$, where $X$ is the support.

Something like:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.api import KDEUnivariate

sample_data = np.random.normal(size=1000)
kde = KDEUnivariate(sample_data)
kde.fit()

quantiles_mesh = np.linspace(0, 1, len(kde.density))
plt.plot(quantiles_mesh, kde.icdf)
plt.show()

What is the reasoning behind that chart?

Bottom line: given my goal, I am not sure I am going in the right direction.
How do I get there on the path of least resistance?

Best Answer

Since you have not given reproducible code (and some of your links are dead), I am going to focus only on the statistical issue here -- i.e., the choice of technique for estimating the quantile function.


Rather than trying to reinvent an existing field of statistics, I would suggest you start by reading some of the existing statistical literature on non-parametric estimation of quantile functions. A good starting point for this would be Sheather and Marron (1990), which reviews and analyses some standard kernel-based quantile estimators (which are a subclass of L-estimators), and also discusses bandwidth estimation for these kernels.

A common non-parametric method due to Parzen (1979) is to take an initial quantile function $\tilde{Q}$ (which is commonly taken to be the empirical quantile function) and a kernel density $K$ and form the corresponding smoothed quantile estimator:

$$\hat{Q}(u) \equiv \frac{1}{h} \int \limits_0^1 \tilde{Q}(p) \cdot K \Big( \frac{u-p}{h} \Big) \ dp.$$

If one takes $\tilde{Q}$ to be the empirical quantile function of the data then we have:

$$\tilde{Q}(p) = \min \Big\{ X_{(i)} \Big| i \geqslant p \cdot n \Big\} = X_{(\lceil p \cdot n \rceil)},$$

which gives:

$$\begin{equation} \begin{aligned} \hat{Q}(u) &= \frac{1}{h} \int \limits_0^1 X_{(\lceil p \cdot n \rceil)} \cdot K \Big( \frac{u-p}{h} \Big) \ dp \\[6pt] &= \sum_{i=1}^n \frac{X_{(i)}}{h} \int \limits_{(i-1)/n}^{i/n} K \Big( \frac{u-p}{h} \Big) \ dp. \\[6pt] \end{aligned} \end{equation}$$
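To make this concrete, here is a minimal numerical sketch of the estimator (my own illustration, not from any particular library). With a Gaussian kernel $K = \phi$, the inner integral has the closed form $\Phi\big(\tfrac{u-(i-1)/n}{h}\big) - \Phi\big(\tfrac{u-i/n}{h}\big)$, so the estimator reduces to a weighted sum of the order statistics:

```python
import numpy as np
from scipy.stats import norm

def smoothed_quantile(u, x, h):
    """Kernel-smoothed quantile estimator with a Gaussian kernel:
    a weighted sum of the order statistics X_(1) <= ... <= X_(n)."""
    x_sorted = np.sort(x)
    n = len(x_sorted)
    grid = np.arange(n + 1) / n  # interval endpoints 0, 1/n, ..., 1
    # Closed form of (1/h) * integral of K((u - p)/h) over ((i-1)/n, i/n]
    weights = norm.cdf((u - grid[:-1]) / h) - norm.cdf((u - grid[1:]) / h)
    return np.sum(weights * x_sorted)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

# For standard normal data, the smoothed median should be near 0
print(smoothed_quantile(0.5, x, h=0.05))
```

Note that this is an L-estimator: each evaluation at $u$ is a convex-combination-like weighting of the sorted data, with weights concentrated around the $\lceil u \cdot n \rceil$-th order statistic.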
