Solved – Fit a Gaussian to data with R with optim and nls

curve fittingfittingnormal distributionr

I want to fit a Gaussian to the following data:

    nf.marry <- c(617,10173,19878,14882,8339,5252,3727,5861)
    nf.marry  <- nf.marry/sum(nf.marry)
    mean.age <- c(18, 22.5, 27.5, 32.5, 37.5, 42.5, 47.5, 57.5)

I tried with the optim function as

    r <- nf.marry
    x <- seq_along(r)
    f <- function(par)
    { 
    m <- par[1]
    sd <- par[2]
    k <- par[3]
    rhat <- k * exp(-0.5 * ((mean.age - m)/sd)^2)
    sum((r - rhat)^2)
    }

    optim(c(26.5, 3, 1), f, method="BFGS", control=list(reltol=1e-9))

enter image description here

I also wanted to try if I can improve my fit (even if for these data I cannot expect much more better). So I did

   tab <- data.frame(x=seq_along(r), r=r)
   res <- nls(r ~ k*exp(-1/2*(x-mu)^2/sigma^2), start=c(mu=26.5,sigma=3,k=0.2) , data = tab)

But I got an error about singular gradient.
How can I correct it? I have no clue actually…

Best Answer

Solving your direct question, singular gradient

The seq_along(r) returns c(1:8) which is much different from your original mean.age variable.

The change of the independent variable requires a change of the parameterization. And in this case your starting conditions are wrong.

either change the starting conditions

tab <- data.frame(x=seq_along(r), r=r)
res <- nls(r ~ k*exp(-1/2*(x-mu)^2/sigma^2),start=c(mu=3,sigma=1,k=0.2) , data = tab)

or use the mean.age variable

tab <- data.frame(x=mean.age, r=r)
res <- nls(r ~ k*exp(-1/2*(x-mu)^2/sigma^2), start=c(mu=26.5,sigma=3,k=0.2) , data = tab)

Other comments regarding your fit

A Gaussian distribution may not be the proper model. Or at least, I do not believe that this is a distribution of married people, or otherwise there are very few married people at older age (it seems more like a number of weddings which seems more logical to decrease in time as people that are already married have less probability to get in a wedding) .

You could model this with an differential equation where the marriage rate is related to the number of bachelors $B$. Note that the marriage rate is also related to the rate of change in bachelors. Let the rate be a linearly increasing function in time (you can get more sophisticated, which would make more sense if more data, or background, is available).

$$\frac{\partial B}{\partial t} = a(t-t_0) B$$

Dividing both sides by B

$$\frac{\partial B/ \partial t}{B} = \frac{\partial ln(B)}{\partial t} = a(t-t_0) t$$

Integrating both sides and taking the exponent

$$ B = e^{c +\frac{1}{2} a(t-t_0)^2}$$

and the marriage rate is

$$- \frac{\partial B}{\partial t} = c^\prime a(t-t_0) e^{\frac{1}{2} a(t-t_0)^2}$$

where we change the position of the integration constant by using $c^\prime= -e^c$.

This leads to:

$t_0 = 17.5$ and $a=0.008$

at 17.5 years the marriage rate is zero and increases yearly by 0.008. Thus at age 18.5, 0.008 of the non married population gets married, at age 19.5 this is 0.016, etc.

Weakness of this model is that the marriage rate growing linearly is a rough approximation and it may be more like a curved shape. Also, the divorces are not taken into account, and the population is considered constant (no people dying, as well as older ages belonging to the same cohort as younger ages).

comparison Gaussian curve with differential equation model

# data and libraries
library(stats)

nf.marry <- c(617,10173,19878,14882,8339,5252,3727,5861)
nf.marry  <- nf.marry/sum(nf.marry)
mean.age <- c(18, 22.5, 27.5, 32.5, 37.5, 42.5, 47.5, 57.5)

# model 1
mar <- nf.marry
x <- seq_along(r)

f <- function(age,par)
{
  t_0 <- par[1]
  rate <- par[2]
  h <- par[3]
  rhat <- h*rate*(age-t_0)*exp(-rate*0.5*(age-t_0)^2)
  rhat
}

d <- function(par)
{ 
  rhat <- f(mean.age,par)
  sum((mar - rhat)^2)
}

result <- optim(c(17, 0.25,4), d, method="BFGS", control=list(reltol=1e-9))
result 

# model 2
f2 <- function(age,par)
{
  m <- par[1]
  sd <- par[2]
  k <- par[3]
  rhat <- k * exp(-0.5 * ((age - m)/sd)^2)
  rhat
}

d2 <- function(par)
{ 
  rhat <- f2(mean.age,par)
  sum((mar - rhat)^2)
}

result2 <- optim(c(26.5, 3, 1), d2, method="BFGS", control=list(reltol=1e-9))

#comparison
plot(mean.age,nf.marry,xlab="age",ylab="fraction marrying")
lines(18:60,f2(18:60,result2$par),col=1)
lines(18:60,f(18:60,result$par),col=2)
legend(x=45,y=0.28,legend=c("Gaussian","differential equation"),col=c(1,2),lty=c(1,1))