Solved – Calculate log-likelihood “by hand” for generalized nonlinear least squares regression (nlme)

least squares, maximum likelihood, mixed model, nonlinear regression, r

I'm trying to calculate the log-likelihood for a generalized nonlinear least squares regression for the function $f(x)=\frac{\beta_1}{(1+\frac{x}{\beta_2})^{\beta_3}}$ optimized by the gnls function in the R package nlme, using the variance-covariance matrix generated by distances on a phylogenetic tree assuming Brownian motion (corBrownian(phy=tree) from the ape package). The following reproducible R code fits the gnls model using x,y data and a random tree with 9 taxa:

require(ape)
require(nlme)
require(expm)
tree <- rtree(9)
x <- c(0,14.51,32.9,44.41,86.18,136.28,178.21,262.3,521.94)
y <- c(100,93.69,82.09,62.24,32.71,48.4,35.98,15.73,9.71)
data <- data.frame(x,y,row.names=tree$tip.label)
model <- y~beta1/((1+(x/beta2))^beta3)
f=function(beta,x) beta[1]/((1+(x/beta[2]))^beta[3])
start <- c(beta1=103.651004,beta2=119.55067,beta3=1.370105)
correlation <- corBrownian(phy=tree)
fit <- gnls(model=model,data=data,start=start,correlation=correlation)
logLik(fit)

I would like to calculate the log-likelihood "by hand" (in R, but without use of the logLik function) based on the estimated parameters obtained from gnls so it matches the output from logLik(fit). NOTE: I am not trying to estimate parameters; I just want to calculate log-likelihood of the parameters estimated by the gnls function (although if someone has a reproducible example of how to estimate parameters without gnls, I would be very interested in seeing it!).

I'm not really sure how to go about doing this in R. The linear algebra notation described in Mixed-Effects Models in S and S-Plus (Pinheiro and Bates) is very much over my head and none of my attempts have matched logLik(fit). Here are the details described by Pinheiro and Bates:

The log-likelihood for the generalized nonlinear least squares model $y_i=f_i(\phi_i,v_i)+\epsilon_i$ where $\phi_i=A_i\beta$ is calculated as follows:

$l(\beta,\sigma^2,\delta|y)=-\frac 12 \Bigl\{ N\log(2\pi\sigma^2)+\sum\limits_{i=1}^M{\Bigl[\frac{||y_i^*-f_i^*(\beta)||^2}{\sigma^2}+\log|\Lambda_i|\Bigr]}\Bigr\}$

where $N$ is the number of observations, and $f_i^*(\beta)=f_i^*(\phi_i,v_i)$.

$\Lambda_i$ is positive-definite, $y_i^*=\Lambda_i^{-T/2}y_i$ and $f_i^*(\phi_i,v_i)=\Lambda_i^{-T/2}f_i(\phi_i,v_i)$

For fixed $\beta$ and $\lambda$, the ML estimator of $\sigma^2$ is

$\hat\sigma^2(\beta,\lambda)=\sum\limits_{i=1}^M||y_i^*-f_i^*(\beta)||^2 / N$

and the profiled log-likelihood is

$l(\beta,\lambda|y)=-\frac12\Bigl\{N[\log(2\pi/N)+1]+N\log\Bigl(\sum\limits_{i=1}^M||y_i^*-f_i^*(\beta)||^2\Bigr)+\sum\limits_{i=1}^M\log|\Lambda_i|\Bigr\}$

which is used with a Gauss-Seidel algorithm to find the ML estimates of $\beta$ and $\lambda$. A less biased estimate of $\sigma^2$ is used:

$\hat\sigma^2=\sum\limits_{i=1}^M\Bigl\|\hat\Lambda_i^{-T/2}[y_i-f_i(\hat\beta)]\Bigr\|^2/(N-p)$

where $p$ represents the length of $\beta$.

I have compiled a list of specific questions that I am facing:

1. What is $\Lambda_i$? Is it the distance matrix produced by big_lambda <- vcv.phylo(tree) in ape, or does it need to be somehow transformed or parameterized by $\lambda$, or something else entirely?
2. Would $\sigma^2$ be fit$sigma^2, or the equation for the less biased estimate (the last equation in this post)?
3. Is it necessary to use $\lambda$ to calculate log-likelihood, or is that just an intermediate step for parameter estimation? Also, how is $\lambda$ used? Is it a single value or a vector, and is it multiplied by all of $\Lambda_i$ or just off-diagonal elements, etc.?
4. What is $||y-f(\beta)||$? Would that be norm(y-f(fit$coefficients,x),"F") in the package Matrix? If so, I'm confused about how to calculate the sum $\sum\limits_{i=1}^M||y_i^*-f_i^*(\beta)||^2$, because norm() returns a single value, not a vector.
5. How does one calculate $\log|\Lambda_i|$? Is it log(diag(abs(big_lambda))) where big_lambda is $\Lambda_i$, or is it logm(abs(big_lambda)) from the package expm? If it is logm(), how does one take the sum of a matrix (or is it implied that it is just the diagonal elements)?
6. Just to confirm, is $\Lambda_i^{-T/2}$ calculated like this: t(solve(sqrtm(big_lambda)))?
7. How are $y_i^*$ and $f_i^*(\beta)$ calculated? Is it either of the following:

y_star <- t(solve(sqrtm(big_lambda))) %*% y

and

f_star <- t(solve(sqrtm(big_lambda))) %*% f(fit$coefficients,x)

or would it be

y_star <- t(solve(sqrtm(big_lambda))) * y

and

f_star <- t(solve(sqrtm(big_lambda))) * f(fit$coefficients,x) ?

If all of these questions are answered, in theory, I think the log-likelihood should be calculable to match the output from logLik(fit). Any help on any of these questions would be greatly appreciated. If anything needs clarification, please let me know. Thanks!

UPDATE: I have been experimenting with various possibilities for the calculation of the log-likelihood, and here is the best I have come up with so far. logLik_calc is consistently about 1 to 3 off from the value returned by logLik(fit). Either I'm close to the actual solution, or this is purely by coincidence. Any thoughts?

  C <- vcv.phylo(tree) # variance-covariance matrix
  tC <- t(solve(sqrtm(C))) # C^(-T/2)
  log_C <- log(diag(abs(C))) # log|C|
  N <- length(y)
  y_star <- tC%*%y 
  f_star <- tC%*%f(fit$coefficients,x)
  dif <- y_star-f_star  
  sigma_squared <-  sum(abs(y_star-f_star)^2)/N
  # using fit$sigma^2 also produces a slightly different answer than logLik(fit)
  logLik_calc <- -((N*log(2*pi*(sigma_squared)))+
       sum(((abs(dif)^2)/(sigma_squared))+log_C))/2

Best Answer

Let's start with the simpler case where there is no correlation structure for the residuals:

fit <- gnls(model=model,data=data,start=start)
logLik(fit)

The log likelihood can then be easily computed by hand with:

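N <- fit$dims$N  # number of observations
p <- fit$dims$p  # number of estimated coefficients
# fit$sigma uses denominator N-p; rescale it to the ML estimate (denominator N)
sigma <- fit$sigma * sqrt((N-p)/N)
sum(dnorm(y, mean=fitted(fit), sd=sigma, log=TRUE))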

Since the residuals are independent, we can just use dnorm(..., log=TRUE) to get the individual log likelihood terms (and then sum them up). Alternatively, we could use:

sum(dnorm(resid(fit), mean=0, sd=sigma, log=TRUE))

Note that fit$sigma is based on the "less biased estimate of $\sigma^2$" (denominator $N-p$), whereas the log-likelihood uses the ML estimate (denominator $N$), so we need to make the correction manually first.
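The uncorrelated case can also be checked without nlme. The following is only a sketch: it swaps in nls from base R (not gnls) on the question's data, and the names fit_nls, sigma_ml, ll_hand and ll_profiled are made up for illustration. It compares the dnorm sum, the profiled formula with a single group and $\Lambda = I$, and logLik.

```r
# Data and starting values from the question
x <- c(0, 14.51, 32.9, 44.41, 86.18, 136.28, 178.21, 262.3, 521.94)
y <- c(100, 93.69, 82.09, 62.24, 32.71, 48.4, 35.98, 15.73, 9.71)
start <- c(beta1 = 103.651004, beta2 = 119.55067, beta3 = 1.370105)

# Assumption: nls (base R) stands in for gnls without a correlation structure
fit_nls <- nls(y ~ beta1 / (1 + x / beta2)^beta3, start = start)

N <- length(y)
rss <- sum(resid(fit_nls)^2)

# ML estimate of sigma: residual sum of squares divided by N, not N - p
sigma_ml <- sqrt(rss / N)

# Sum of independent normal log densities
ll_hand <- sum(dnorm(y, mean = fitted(fit_nls), sd = sigma_ml, log = TRUE))

# Profiled log-likelihood with one group and Lambda = I, so log|Lambda| = 0
ll_profiled <- -0.5 * N * (log(2 * pi / N) + 1 + log(rss))

all.equal(ll_hand, as.numeric(logLik(fit_nls)))
all.equal(ll_hand, ll_profiled)
```

If both comparisons come back TRUE, the only thing separating a hand calculation from logLik in the independent case is which denominator is used for $\sigma^2$.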

Now for the more complicated case where the residuals are correlated:

fit <- gnls(model=model,data=data,start=start,correlation=correlation)
logLik(fit)

Here, we need to use the multivariate normal distribution. I am sure there is a function for this somewhere, but let's just do this by hand:

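N <- fit$dims$N
p <- fit$dims$p
yhat <- cbind(fitted(fit))  # fitted values as a column vector
R <- vcv(tree, cor=TRUE)    # correlation matrix implied by the tree
sigma <- fit$sigma * sqrt((N-p)/N)  # rescale to the ML estimate (denominator N)
# covariance matrix of the residuals, equal to sigma^2 * R
S <- diag(sigma, nrow=nrow(R)) %*% R %*% diag(sigma, nrow=nrow(R))
# multivariate normal log density evaluated at y
-1/2 * log(det(S)) - 1/2 * t(y - yhat) %*% solve(S) %*% (y - yhat) - N/2 * log(2*pi)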
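The multivariate normal expression and the Pinheiro and Bates formula from the question should give the same number. Here is a base-R sketch of that equivalence, under loudly stated assumptions: beta is fixed at the question's starting values rather than taken from a gnls fit, and Lambda is an arbitrary positive-definite correlation matrix standing in for vcv(tree, cor=TRUE). It also shows one way to read $\Lambda^{-T/2}$ (via a Cholesky factor) and $\log|\Lambda|$ (the log of the determinant, a single number).

```r
# Data and mean function from the question
x <- c(0, 14.51, 32.9, 44.41, 86.18, 136.28, 178.21, 262.3, 521.94)
y <- c(100, 93.69, 82.09, 62.24, 32.71, 48.4, 35.98, 15.73, 9.71)
f <- function(beta, x) beta[1] / (1 + x / beta[2])^beta[3]

# Assumptions: beta is a stand-in for coef(fit); Lambda is an AR(1)-style
# correlation matrix standing in for vcv(tree, cor = TRUE)
beta <- c(103.651004, 119.55067, 1.370105)
N <- length(y)
Lambda <- 0.5^abs(outer(seq_len(N), seq_len(N), "-"))

r <- y - f(beta, x)  # raw residuals

# Cholesky factor: Lambda = t(U) %*% U, so Lambda^(-T/2) %*% r solves t(U) z = r
U <- chol(Lambda)
r_star <- backsolve(U, r, transpose = TRUE)
rss_star <- sum(r_star^2)                # ||y* - f*(beta)||^2
log_det_Lambda <- 2 * sum(log(diag(U)))  # log|Lambda|

# Pinheiro and Bates form (one group), sigma^2 at its ML value (denominator N)
sigma2 <- rss_star / N
ll_whitened <- -0.5 * (N * log(2 * pi * sigma2) + rss_star / sigma2 + log_det_Lambda)

# Multivariate normal form, with S = sigma^2 * Lambda
S <- sigma2 * Lambda
ll_mvn <- -0.5 * (log(det(S)) + drop(t(r) %*% solve(S, r)) + N * log(2 * pi))

all.equal(ll_whitened, ll_mvn)
```

The Cholesky route avoids forming a matrix square root or an explicit inverse, which is usually the more stable choice; any factor with Lambda = t(U) %*% U gives the same squared norm.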

Related Solutions

Solved – Finding the MLE for a univariate exponential Hawkes process

The Nelder-Mead simplex algorithm seems to work well. It is implemented in Java by the Apache Commons Math library at https://commons.apache.org/math/ . I've also written a paper about Hawkes processes, "Point Process Models for Multivariate High-Frequency Irregularly Spaced Data".

felix, using exp/log transforms seems to ensure positivity of the parameters. As for the small alpha thing, search arxiv.org for a paper called "Limit theorems for nearly unstable Hawkes processes".
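A rough sketch of those two suggestions combined, in R rather than Java. Nothing here comes from the answer itself: the intensity $\lambda(t)=\mu+\alpha\sum_{t_j<t}e^{-\beta(t-t_j)}$ is the usual exponential-kernel form and is assumed, the event times are made up, and hawkes_nll is a hypothetical helper name. Optimizing over the logs of the parameters keeps all three positive, and optim uses Nelder-Mead by default.

```r
# Hypothetical event times on [0, T_end] (illustrative only, not real data)
times <- c(0.5, 0.9, 1.0, 1.4, 3.2, 3.3, 3.9, 6.1, 6.2, 6.4, 8.0, 9.5)
T_end <- 10

# Negative log-likelihood; theta = log(c(mu, alpha, beta)) keeps them positive
hawkes_nll <- function(theta, times, T_end) {
  mu <- exp(theta[1]); alpha <- exp(theta[2]); beta <- exp(theta[3])
  n <- length(times)
  A <- numeric(n)  # A[i] = sum over j < i of exp(-beta * (t_i - t_j))
  for (i in seq_len(n)[-1]) {
    A[i] <- exp(-beta * (times[i] - times[i - 1])) * (1 + A[i - 1])
  }
  log_intensity <- sum(log(mu + alpha * A))
  compensator <- mu * T_end +
    (alpha / beta) * sum(1 - exp(-beta * (T_end - times)))
  -(log_intensity - compensator)
}

start <- log(c(mu = 1, alpha = 0.5, beta = 1))
opt <- optim(start, hawkes_nll, times = times, T_end = T_end)  # Nelder-Mead
est <- exp(opt$par)  # back-transform to mu, alpha, beta
est
```

The log transform enforces positivity only; it does not enforce the stationarity condition alpha < beta, which would need a different parameterization.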

Solved – How to use the EM algorithm to calculate MLEs for a latent variable formulation of a zero inflated Poisson model

The root of the difficulty you are having lies in the sentence:

Then using the EM algorithm, we can maximize the second log-likelihood.

As you have observed, you can't. Instead, what you maximize is the expected value of the second log likelihood (known as the "complete data log likelihood"), where the expected value is taken over the $z_i$.

This leads to an iterative procedure, where at the $k^{th}$ iteration you calculate the expected values of the $z_i$ given the parameter estimates from the $(k-1)^{th}$ iteration (this is known as the "E-step"), then substitute them into the complete data log likelihood (see EDIT below for why we can do this in this case), and maximize that with respect to the parameters to get the estimates for the current iteration (the "M-step").

The complete-data log likelihood for the zero-inflated Poisson in the simplest case - two parameters, say $\lambda$ and $p$ - allows for substantial simplification when it comes to the M-step, and this carries over to some extent to your form. I'll show you how that works in the simple case via some R code, so you can see the essence of it. I won't simplify as much as possible, since that might cause a loss of clarity when you think of your problem:

# Generate data
# Lambda = 1,  p(zero) = 0.1
x <- rpois(10000,1)
x[1:1000] <- 0

# Sufficient statistic for the ZIP
sum.x <- sum(x)

# (Poor) starting values for parameter estimates
phat <- 0.5
lhat <- 2.0

zhat <- rep(0,length(x))
for (i in 1:100) {
  # zhat[x>0] <- 0 always, so no need to make the assignment at every iteration
  zhat[x==0] <- phat/(phat +  (1-phat)*exp(-lhat))

  lhat <- sum.x/sum(1-zhat) # in effect, removing E(# zeroes due to z=1)
  phat <- mean(zhat)   

  cat("Iteration: ",i, "  lhat: ",lhat, "  phat: ", phat,"\n")
}

Iteration:  1   lhat:  1.443948   phat:  0.3792712 
Iteration:  2   lhat:  1.300164   phat:  0.3106252 
Iteration:  3   lhat:  1.225007   phat:  0.268331 
...
Iteration:  99   lhat:  0.9883329   phat:  0.09311933 
Iteration:  100   lhat:  0.9883194   phat:  0.09310694
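One way to sanity-check the EM fixed point is to maximize the observed-data ZIP log-likelihood directly and compare. This is a sketch only: zip_nll is a hypothetical helper, the logit/log parameterization is my choice, and a seed is added for reproducibility, so the numbers will not match the trace printed above.

```r
set.seed(1)  # assumption: seed added here; the original run was unseeded
x <- rpois(10000, 1)
x[1:1000] <- 0

# Observed-data negative log-likelihood of the zero-inflated Poisson;
# par = c(logit(p), log(lambda)) keeps p in (0, 1) and lambda positive
zip_nll <- function(par, x) {
  p <- plogis(par[1])
  lambda <- exp(par[2])
  n0 <- sum(x == 0)
  ll_zero <- n0 * log(p + (1 - p) * exp(-lambda))
  ll_pos <- sum(log(1 - p) + dpois(x[x > 0], lambda, log = TRUE))
  -(ll_zero + ll_pos)
}
opt <- optim(c(0, 0), zip_nll, x = x, method = "BFGS")
p_ml <- plogis(opt$par[1])
lambda_ml <- exp(opt$par[2])

# Same EM updates as above, run longer and without printing
phat <- 0.5
lhat <- 2.0
zhat <- numeric(length(x))
for (i in 1:500) {
  zhat[x == 0] <- phat / (phat + (1 - phat) * exp(-lhat))  # E-step
  lhat <- sum(x) / sum(1 - zhat)                           # M-step for lambda
  phat <- mean(zhat)                                       # M-step for p
}

rbind(EM = c(p = phat, lambda = lhat),
      direct = c(p = p_ml, lambda = lambda_ml))
```

The two rows should agree closely, since EM climbs the same observed-data likelihood that optim maximizes directly.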

In your case, at each step you'll do a weighted Poisson regression where the weights are 1-zhat to get the estimates of $\beta$ and therefore $\lambda_i$, and then maximize:

$\sum (\mathbb{E}z_i\log{p_i} + (1-\mathbb{E}z_i)\log{(1-p_i)})$

with respect to the coefficient vector of your matrix $\mathbf{G}$ to get the estimates of $p_i$. The expected values $\mathbb{E}z_i = p_i/(p_i+(1-p_i)\exp{(-\lambda_i)})$, again calculated at each iteration.

If you want to do this for real data, as opposed to just understanding the algorithm, R packages already exist; here's an example http://www.ats.ucla.edu/stat/r/dae/zipoisson.htm using the pscl library.

EDIT: I should emphasize that what we are doing is maximizing the expected value of the complete-data log likelihood, NOT maximizing the complete-data log likelihood with the expected values of the missing data/latent variables plugged in. As it happens, if the complete-data log likelihood is linear in the missing data, as it is here, the two approaches are the same, but otherwise, they aren't.
