[Math] Non-linear regression for cumulative distribution function

interpolation, logistic regression, regression, statistics

I have twenty probability distributions based on a simulation. The corresponding cumulative distribution plot for one distribution looks like this:

Simulated result

I believe that most of the results will look something like this, not necessarily symmetric though.

I would like to have an approximate simple analytic formula for this curve which has small errors around the percentiles 10% and 90%, and for the median. It would be nice if the function could be defined with as few points as possible, maybe the three points.

I have constructed an approximation with a 6th order polynomial. I am not very satisfied with this as I need too many points (101) to have small errors in the extremes, as shown in the picture below.

Polynomial regression

I also tried with a sigmoid function

$$
\frac{1}{1+e^{k(x-x_{\text{med}})}}
$$

where $k$ is a constant and $x_{\text{med}}$ is the median of the curve. It looks smooth, but I had to adjust the constant $k$ manually. And as far as I can see, this function will be symmetric. The simulated distributions may have skewness.

Hoping for some suggestions for possible functions. Thanks in advance.

EDIT:
As Neal points out, I can determine the constant $k$ by calculating the slope at the median. The problem with the sigmoid function is that it doesn't handle skewness. User121049 proposes that the generalized logistic function

$$
Y(t) = A + \frac{K-A}{(C+Qe^{-B(t-M)})^{1/\nu}}
$$

might be an option, but I have problems determining the constants.
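For reference, the slope relation for the simple sigmoid follows from differentiating it as written earlier (a worked step using the same symbols, nothing beyond the formula already given):

$$
y(x)=\frac{1}{1+e^{k(x-x_{\text{med}})}}
\quad\Rightarrow\quad
y'(x)=-\frac{k\,e^{k(x-x_{\text{med}})}}{\left(1+e^{k(x-x_{\text{med}})}\right)^{2}},
\qquad
y'(x_{\text{med}})=-\frac{k}{4}
$$

so $k=-4\,y'(x_{\text{med}})$, where $y'(x_{\text{med}})$ is the slope of the CDF at the median. With this sign convention an increasing CDF corresponds to $k<0$.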

I added the data for two of the distributions for the percentiles 0% to 100% at 1% intervals. I multiplied the numbers by 100 to make them easier to share. In the final model I do not want to extract this many percentiles per distribution. Hopefully fewer than 10 would be enough, or some other parameters such as the variance could be used. The reason: I have many distributions and this is an Excel model, so extracting 101 points will make the model slow. I only use the whole data set below to test the approximated formula. I need a compact formula for the CDF to use further on in the model AND to share in a digestible written format as a report.

Best Answer

The given data is graphically represented on the next figure :

First, we look at how well the logistic function fits the given data:

$$(x_1,y_1),(x_2,y_2),...,(x_i,y_i),...,(x_n,y_n)$$

$$ y_i\simeq\frac{1}{1+e^{k(x_i-x_{\text{m}})}} \tag 1$$

$$x_i\simeq x_{\text{m}}+\frac{1}{k}\ln\left(\frac{1}{y_i}-1\right) $$

We compute $z_1, z_2, ..., z_i, ..., z_n$ with:

$$z_i=\ln\left(\frac{1}{y_i}-1\right)$$

Then the points $(x_i,z_i)$ are plotted on the next figure (in BLUE and RED, respectively, for the two given examples).
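The transform $z_i=\ln\left(\frac{1}{y_i}-1\right)$ can be sketched in Python with NumPy. This is only an illustration on synthetic data: the values `k_true` and `x_m_true` and the helper name `logit_transform` are hypothetical, and the data are generated from the logistic form $1/(1+e^{k(x-x_{\text{m}})})$, which is the form the inversion for $x_i$ corresponds to. For an exact logistic curve the points $(x_i,z_i)$ fall on a straight line with slope $k$ and intercept $-k\,x_{\text{m}}$.

```python
import numpy as np

def logit_transform(y):
    """Return z = ln(1/y - 1) for CDF values strictly between 0 and 1."""
    y = np.asarray(y, dtype=float)
    return np.log(1.0 / y - 1.0)

# Hypothetical parameters (not from the question's data).
# With this sign convention an increasing CDF needs k < 0.
k_true, x_m_true = -0.4, 2.0

# Synthetic, exactly logistic CDF values.
x = np.linspace(-10.0, 14.0, 25)
y = 1.0 / (1.0 + np.exp(k_true * (x - x_m_true)))

# The transformed points lie on the line z = k*x - k*x_m.
z = logit_transform(y)
slope, intercept = np.polyfit(x, z, 1)
k_est = slope
x_m_est = -intercept / slope
```

On real simulated percentiles the transformed points would deviate from the line wherever the distribution is not logistic, which is what the plot of $(x_i,z_i)$ is meant to reveal.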

If function $(1)$ fit perfectly, the points would lie on a straight line. We observe a non-negligible deviation for small and large values of $x$.

This suggests adding a corrective term, which has to be an odd function about a center $c$. The simplest one has the form $\quad \alpha (x-c)^3\quad$ where $\alpha$ is a small coefficient to be determined. $c$ is read directly from the given data as the point where $y(c)=0.5$. Don't confuse $c$ with $x_m$ above, even though they are of the same order of magnitude.

For the first data : $c=-7.819$ and for the second : $c=7.048$ (no adjustment is necessary).

We see on the figure that with this kind of corrective term, the points (plotted in BLACK) can become nicely aligned.

In fact, the proposed function is: $$y(x)\simeq\frac{1}{1+e^{k(x-x_{\text{m}})+\alpha(x-c)^3}} \tag 2$$ where there are three parameters to adjust: $k$, $x_{\text{m}}$ and $\alpha$.

What is more, computing those three parameters is very easy: it is a simple linear regression (no iterative computation, no initial guess).

Consider the data : $$(x_1,z_1),(x_2,z_2),...,(x_i,z_i),...,(x_n,z_n)$$ and the linear relationship (with the above known value of $c$ ) : $$z=kx+\beta+\alpha (x-c)^3$$ where $\beta=-kx_{\text{m}} \quad\to\quad x_{\text{m}}=-\frac{\beta}{k} $

A standard linear regression for $k$, $\beta$, $\alpha$ gives the wanted parameters of equation (2).
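The regression described above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the computation actually used for the figures: the data are synthetic, generated from the model with hypothetical values (`k_true`, `x_m_true`, `alpha_true`, `c`), and the function names are invented for illustration. The model is written with a plus sign in the denominator, consistent with $z=\ln\left(\frac{1}{y}-1\right)$.

```python
import numpy as np

def fit_corrected_logistic(x, y, c):
    """Least-squares fit of z = k*x + beta + alpha*(x - c)**3, with z = ln(1/y - 1).

    Returns (k, x_m, alpha), where x_m = -beta / k.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (y > 0.0) & (y < 1.0)      # z is infinite at y = 0 and y = 1
    x, y = x[keep], y[keep]
    z = np.log(1.0 / y - 1.0)
    # Columns of the design matrix match the unknowns k, beta, alpha.
    design = np.column_stack([x, np.ones_like(x), (x - c) ** 3])
    (k, beta, alpha), *_ = np.linalg.lstsq(design, z, rcond=None)
    return k, -beta / k, alpha

def corrected_logistic(x, k, x_m, alpha, c):
    """Evaluate y(x) = 1 / (1 + exp(k*(x - x_m) + alpha*(x - c)**3))."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(k * (x - x_m) + alpha * (x - c) ** 3))

# Hypothetical parameters and synthetic data (not the question's data).
k_true, x_m_true, alpha_true, c = -0.35, 1.5, -0.002, 1.2
x = np.linspace(-8.0, 10.0, 9)        # only nine points
y = corrected_logistic(x, k_true, x_m_true, alpha_true, c)

k, x_m, alpha = fit_corrected_logistic(x, y, c)
```

Because the synthetic data follow the model exactly, the fit recovers the generating parameters; with real percentiles the recovered values would only be approximate, and the quality of the fit should be checked against the full data set.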

The result is shown on the next figure :

Of course, there is no need to keep all the digits given by the computer. Three or four significant digits are amply sufficient.

Note: The value of $c$ is not critical. It comes from the $50^{th}$ point in the given data, but one can take any other point nearby. For example, for the first data, instead of $-7.819$ one can take $-7$ or $-8$ without a significant change in the final fit.

Note: In the regression computation, the points $(x_0,y_0=0)$ and $(x_{100},y_{100}=1)$ are excluded, since $z_i=\ln\left(\frac{1}{y_i}-1\right)$ is infinite there and cannot correspond to any finite value of $x$.
