Curve Fitting – Fitting the Second Integral of a Normal Distribution

curve fittingnonlinear regressionnormal distributionregression

I have a data set of points that I need to fit to a normal distribution, where the points approximate the curve of the distribution's second integral. Examples of such curves are given below in the first image, and an example of the data set is given in the second image.

How would one approach curve fitting here? I was thinking of implementing non-linear regression (Gauss-Newton), but that method assumes that the partial derivatives of the target function are known. Is there any clean way to approximate the residuals and jacobian matrix for this problem? Any existing software libraries, or references of interest?

Thank you

second integral of normal distribution

https://i.stack.imgur.com/VC5fQ.png

Best Answer

Let $\Psi(z) = \int_{- \infty}^z \Phi(t) \> dt$. Note $\Phi$ is approximately constant as $\vert z \vert \to \infty$. This means that that $\Psi$ is going to be linear in these same regions.

Because of this behaviour, I think it is reasonable to use a natural spline to approximate $\Psi$. A Natural spline is linear in the tails of the data (just like our target function). The question becomes: where the knots should be placed? I'm sure the knots could be placed intelligently leveraging properties of $\Phi$ and the Gaussian density, but I'll just let the rms library pick them for me today. If you were able to intelligently select knot locations, then you could pass them to rcs using the parms argument.

Using R...

library(pracma)
library(rms)

# Generate data
x = seq(-8, 8, 0.25)
f = pnorm(x)
Psi = cumtrapz(x, f)

train_ix = abs(x)<=5
test_ix = abs(x)>5

xtrain = x[train_ix]
xtest = x[test_ix]

ytrain = Psi[train_ix]
ytest = Psi[test_ix]


model = ols(ytrain ~ rcs(xtrain))

plot(x, predict(model, newdata=list(xtrain=x)), col='red', type='l')
points(xtrain, ytrain)
points(xtest, ytest, col='blue', pch=2)

Which produces the following fit

enter image description here

The triangles here are data which were not used to fit the model. Because we have used a natural spline (which is linear in its tails) we do fairly well. Shown below is the absolute error between model and integral (pay attention to those $x$ which are larger than 5 in absolute value, those are the test points)

enter image description here

Because $\Phi$ is not actually linear in the tails (although nearly) our model suffers slightly as evidenced by the growth in absolute error. The relative error is quite good in the right tail, but appears to explode in the left tail since $\Psi$ is quite small in those regions.

As for a model formula, here it is

enter image description here

Apologies for the awful formatting, I the latex function from rms doesn't seem to play nice with typesetting here.

Related Question