Curve Fitting – Fitting the Second Integral of a Normal Distribution

curve fitting, nonlinear regression, normal distribution, regression

I have a data set of points that I need to fit to a normal distribution, where the points approximate the curve of the distribution's second integral. Examples of such curves are given below in the first image, and an example of the data set is given in the second image.

How would one approach curve fitting here? I was thinking of implementing non-linear regression (Gauss-Newton), but that method assumes that the partial derivatives of the target function are known. Is there a clean way to approximate the residuals and Jacobian matrix for this problem? Are there any existing software libraries or references of interest?

Thank you

Best Answer

Let $\Psi(z) = \int_{- \infty}^z \Phi(t) \> dt$. Note that $\Phi$ is approximately constant as $\vert z \vert \to \infty$, which means that $\Psi$ is approximately linear in those same regions.
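As a side note, assuming $\Phi$ is the standard normal CDF and writing $\varphi$ for its density (notation introduced only for this note), integration by parts gives a closed form that makes the tail behaviour explicit:

$$
\Psi(z) = \Big[\, t\,\Phi(t) \,\Big]_{-\infty}^{z} - \int_{-\infty}^{z} t\,\varphi(t)\,dt = z\,\Phi(z) + \varphi(z),
$$

using $\int t\,\varphi(t)\,dt = -\varphi(t)$ and $t\,\Phi(t) \to 0$ as $t \to -\infty$. It follows that $\Psi(z) \to 0$ as $z \to -\infty$ and $\Psi(z) - z \to 0$ as $z \to \infty$, so both tails approach straight lines.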

Because of this behaviour, I think it is reasonable to use a natural spline to approximate $\Psi$. A natural spline is linear in the tails of the data (just like our target function). The question becomes: where should the knots be placed? I'm sure the knots could be placed intelligently by leveraging properties of $\Phi$ and the Gaussian density, but I'll just let the rms library pick them for me today. If you were able to select knot locations intelligently, you could pass them to rcs using the parms argument.

Using R...

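library(pracma)  # cumtrapz() for cumulative trapezoidal integration
library(rms)     # ols() and rcs() for restricted cubic (natural) splines

# Generate data: integrate the normal CDF numerically to approximate Psi.
# Starting at -8 stands in for -Inf, since pnorm(-8) is negligible.
x = seq(-8, 8, 0.25)
f = pnorm(x)
Psi = cumtrapz(x, f)

# Fit on |x| <= 5 and hold out the tails to test extrapolation
train_ix = abs(x)<=5
test_ix = abs(x)>5

xtrain = x[train_ix]
xtest = x[test_ix]

ytrain = Psi[train_ix]
ytest = Psi[test_ix]

# Natural spline fit; rms picks the knot locations
model = ols(ytrain ~ rcs(xtrain))

# Fitted curve (red), training points (circles), held-out points (blue triangles)
plot(x, predict(model, newdata=list(xtrain=x)), col='red', type='l')
points(xtrain, ytrain)
points(xtest, ytest, col='blue', pch=2)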

Which produces the following fit

The triangles here are data which were not used to fit the model. Because we have used a natural spline (which is linear in its tails), we do fairly well. Shown below is the absolute error between model and integral. Pay attention to those $x$ which are larger than 5 in absolute value: those are the test points.

Because $\Psi$ is not actually linear in the tails (although nearly), our model suffers slightly, as evidenced by the growth in absolute error. The relative error is quite good in the right tail, but appears to explode in the left tail since $\Psi$ is quite small in those regions.
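The error comparison described here can be approximated without extra packages. The sketch below is a stand-in, not the rms fit itself: it uses splines::ns() (which ships with R) in place of rcs(), a hand-rolled cumulative trapezoid rule in place of pracma::cumtrapz(), and the default quantile knots of ns(), which will generally differ from the ones rms chooses. The exact error values will therefore not match the figures.

```r
library(splines)  # ns(): natural cubic spline basis, linear beyond the boundary knots

# Same grid as in the answer; a cumulative trapezoid rule approximates Psi
x <- seq(-8, 8, 0.25)
f <- pnorm(x)
Psi <- c(0, cumsum(diff(x) * (head(f, -1) + tail(f, -1)) / 2))

train_ix <- abs(x) <= 5
train <- data.frame(x = x[train_ix], y = Psi[train_ix])

# Natural spline with 4 degrees of freedom (an assumed choice, not the rms knots)
fit <- lm(y ~ ns(x, df = 4), data = train)

# Absolute and relative error on the full grid, including the held-out tails
pred <- predict(fit, newdata = data.frame(x = x))
abs_err <- abs(pred - Psi)
rel_err <- ifelse(Psi > 0, abs_err / Psi, NA)  # Psi is exactly 0 at the first grid point

# Largest absolute error among training and held-out points
tapply(abs_err, ifelse(train_ix, "train", "test"), max)
```

Plotting abs_err and rel_err against x gives the same kind of picture as the error plots: the relative error is dominated by the left tail, where Psi is close to zero.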

As for a model formula, here it is

Apologies for the awful formatting; the latex function from rms doesn't seem to play nice with typesetting here.
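For the Gauss-Newton route raised in the question, a parametric fit is also an option. The sketch below is separate from the spline approach and rests on several assumptions: the data follow $\sigma\,\Psi((x-\mu)/\sigma)$ plus noise, $\Psi$ has the closed form $z\,\Phi(z) + \varphi(z)$ (obtained by integration by parts, with $\varphi$ the standard normal density), and the data are simulated with made-up parameters. The names psi and model_fn are hypothetical helpers. R's nls() uses Gauss-Newton by default and approximates the Jacobian numerically when no gradient is supplied.

```r
# Closed form of the second integral of the standard normal density:
# Psi(z) = z * Phi(z) + phi(z)
psi <- function(z) z * pnorm(z) + dnorm(z)

# Assumed model: second integral of a N(mu, sigma^2) density
model_fn <- function(x, mu, sigma) sigma * psi((x - mu) / sigma)

# Simulated data; the true values mu = 1, sigma = 2 are made up for illustration
set.seed(1)
x <- seq(-8, 8, 0.25)
dat <- data.frame(
  x = x,
  y = model_fn(x, mu = 1, sigma = 2) + rnorm(length(x), sd = 0.01)
)

# nls() defaults to Gauss-Newton; derivatives are computed numerically here
fit <- nls(y ~ model_fn(x, mu, sigma), data = dat,
           start = list(mu = 0, sigma = 1))
coef(fit)
```

Under this model the partial derivatives are also available analytically, $\partial f/\partial\mu = -\Phi(z)$ and $\partial f/\partial\sigma = \varphi(z)$ with $z = (x-\mu)/\sigma$, so a hand-written Gauss-Newton would not need finite differences.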

Related Solutions

Solved – Dividing and forecasting a normal distribution

The straight answer to Q1 is "yes", it is definitely possible to cut up an underlying normally distributed continuous variable into an ordinal variable with 1 to 10 levels. You need something that can tell you the cumulative distribution function (often called CDF) of a normal distribution with a given mean and variance (you only need these two parameters to characterise a normal distribution). Then you need to calculate the differences between the values this returns for your various bin cutoffs (as its straight return will be the cumulative probability of a value at X or lower).

I'm sorry I don't use C#, but in R this would be something like the below. This is a 10-point example in which the normal distribution you think is your underlying latent variable has a mean of 5 and a standard deviation of 2 (the third argument of pnorm is the standard deviation, not the variance), and the bins are minus infinity to 1.5, 1.5 to 2.5, 2.5 to 3.5, ... , 9.5 to infinity.

> options(digits=2)
> x <- pnorm(1:10+0.5, 5, 2)*100
> x[10] <- 100            # otherwise is just 9.5 to 10.5, not infinity
> x                       # ie cumulative prob (in %) to each bin
 [1]   4  11  23  40  60  77  89  96  99 100    
> c(x[1], diff(x))        # differences between the cumulative probs
 [1]  4.0  6.6 12.1 17.5 19.7 17.5 12.1  6.6  2.8  1.2
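The same bin probabilities can be computed in one step by padding the cutoffs with infinite endpoints, which avoids overwriting the last cumulative value by hand. A minimal sketch, where bin_probs is a hypothetical helper name and the inputs are the same mean, standard deviation and cutoffs as in the console session:

```r
# Hypothetical helper: probability mass of a normal variable in each bin
bin_probs <- function(cutoffs, mean, sd) {
  # pnorm(-Inf) is 0 and pnorm(Inf) is 1, so the outer bins are open-ended
  diff(pnorm(c(-Inf, cutoffs, Inf), mean = mean, sd = sd))
}

# Nine cutoffs at 1.5, 2.5, ..., 9.5 give ten bins
p <- bin_probs(1:9 + 0.5, mean = 5, sd = 2)
round(100 * p, 1)  # percentages per bin
sum(p)             # the bin probabilities sum to 1
```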

Subsequently, the straight answer to Q2 is also "yes" there are definitely such methods but they should be used with caution and it is probably a little difficult just here to summarise all the pros and cons of the different ways of doing this.

It's also worth knowing that there are other methods for analysing this sort of ordinal data.

Solved – How to programmatically detect segments of a data series to fit with different curves

My interpretation of the question is that the OP is looking for methodologies that would fit the shape(s) of the examples provided, not the HAC residuals. In addition, automated routines that don't require significant human or analyst intervention are desired. Box-Jenkins methods may not be appropriate, despite their emphasis in this thread, since they require substantial analyst involvement.

R packages exist for this type of non-moment-based pattern matching. Permutation distribution clustering is one such pattern-matching technique, developed by a Max Planck Institute scientist, that meets the criteria you've outlined. Its application is to time series data, but it's not limited to that. Here's a citation for the R package that's been developed:

pdc: An R Package for Complexity-Based Clustering of Time Series by Andreas Brandmaier

In addition to PDC, there's the machine learning iSAX routine developed by Eamonn Keogh at UC Riverside, which is also worth comparing.

Finally, there's this paper on Data Smashing: Uncovering Lurking Order in Data by Chattopadhyay and Lipson. Beyond the clever title, there is a serious purpose at work. Here's the abstract: "From automatic speech recognition to discovering unusual stars, underlying almost all automated discovery tasks is the ability to compare and contrast data streams with each other, to identify connections and spot outliers. Despite the prevalence of data, however, automated methods are not keeping pace. A key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning. We demonstrate the application of this principle to the analysis of data from a number of real-world challenging problems, including the disambiguation of electro-encephalograph patterns pertaining to epileptic seizures, detection of anomalous cardiac activity from heart sound recordings and classification of astronomical objects from raw photometry. In all these cases and without access to any domain knowledge, we demonstrate performance on a par with the accuracy achieved by specialized algorithms and heuristics devised by domain experts. We suggest that data smashing principles may open the door to understanding increasingly complex observations, especially when experts do not know what to look for."

This approach goes way beyond curvilinear fit. It's worth checking out.
