Solved – Maximum likelihood estimation for state space models using BFGS

Tags: kalman filter, optimization

I have written some code that can do Kalman filtering (using a number of different Kalman-type filters, e.g. the Information Filter) for linear Gaussian state space analysis with an n-dimensional state vector.

Let us consider the linear Gaussian State Space model

$$y_t = \mathbf{Z}_{t}\alpha_{t} + \epsilon_{t},$$
$$\alpha_{t + 1} = \mathbf{T}_{t}\alpha_{t} + \mathbf{R}_{t}\eta_{t},$$

where $y_{t}$ is our observation vector and $\alpha_{t}$ our state vector at time step $t$. The quantities in bold are the transformation matrices of the state space model, which are set according to the characteristics of the system under consideration. We also have

$$\epsilon_{t} \sim NID(0, \mathbf{H}_{t}),$$
$$\eta_{t} \sim NID(0, \mathbf{Q}_{t}),$$
$$\alpha_{1} \sim NID(a_{1}, \mathbf{P}_{1}).$$

where $t = 1,\ldots, n$. Now, I have derived and implemented the Kalman filter recursion for this generic state space model, running it with guessed initial parameters and variance matrices $\mathbf{H}_{1}$ and $\mathbf{Q}_{1}$.

[Figure omitted: the points are the Nile River water levels for January over 100 years, the solid line is the Kalman-estimated state, and the dashed lines are the 90% confidence intervals.]
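For reference, the simplest special case of this framework, and the one typically used for the Nile series, is the local level model, in which all system matrices reduce to scalars:

$$y_t = \alpha_t + \epsilon_t, \qquad \epsilon_t \sim NID(0, \sigma^2_{\epsilon}),$$
$$\alpha_{t+1} = \alpha_t + \eta_t, \qquad \eta_t \sim NID(0, \sigma^2_{\eta}),$$

i.e. $\mathbf{Z}_t = \mathbf{T}_t = \mathbf{R}_t = 1$, $\mathbf{H}_t = \sigma^2_{\epsilon}$ and $\mathbf{Q}_t = \sigma^2_{\eta}$, so the unknown parameters are just the two variances.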

Now, in order to estimate the initial parameters of my model ($\mathbf{H}_{1}$ and $\mathbf{Q}_{1}$) I have adopted the Expectation-Maximisation (EM) method to maximise the log-likelihood function

$$\log L(Y_{n}|\psi) = -\frac{np}{2}\log(2\pi) - \frac{1}{2}\sum^{n}_{t = 1}\left(\log|\mathbf{F}_{t}| + v^{\mathsf{T}}_{t}\mathbf{F}_{t}^{-1}v_{t}\right), \;\;\;\; (1)$$

with respect to the parameter vector $\psi$, where $v_{t}$ is the one-step-ahead prediction error (innovation) and $\mathbf{F}_{t}$ is its variance. The vector $\psi$ contains the unknown parameters (i.e., either the measurement and state disturbance variances themselves, or a re-parametrisation of these variances; see below).
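For a univariate model such as the local level model above, (1) can be accumulated directly inside the Kalman filter recursion. The following is only a minimal sketch with placeholder names (kf_loglik and the initialisation values a1, P1 are not taken from any package):

# Minimal univariate local-level Kalman filter that accumulates (1).
# 'y' is the data, 'H' and 'Q' the two variances, 'a1' and 'P1' the
# initial state mean and variance (placeholder names, illustration only).
kf_loglik <- function(y, H, Q, a1 = 0, P1 = 1e7) {
  n <- length(y)
  a <- a1; P <- P1
  ll <- -n / 2 * log(2 * pi)
  for (t in seq_len(n)) {
    v <- y[t] - a            # prediction error v_t
    F <- P + H               # prediction error variance F_t
    ll <- ll - 0.5 * (log(F) + v^2 / F)
    K <- P / F               # Kalman gain
    a <- a + K * v           # predicted state for t + 1 (T_t = 1)
    P <- P * (1 - K) + Q     # predicted state variance for t + 1
  }
  ll
}

The objective handed to any optimizer is then simply the negative of this value evaluated at the trial parameters.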

A crucial role in the maximisation of (1) is played by the gradient or score vector

$$\partial_{1}(\psi) = \partial \log L(Y_{n}|\psi)/\partial\psi,\;\;\;\; (2)$$
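Where an analytical score is not available, (2) can always be approximated numerically from the filter's log-likelihood. A central-difference sketch (loglik here is a placeholder for a function that runs the Kalman filter at psi and returns (1)):

# Numerical approximation of the score vector (2) by central differences.
# 'loglik' is assumed to run the Kalman filter for a given parameter
# vector psi and return the log-likelihood (1).
num_score <- function(loglik, psi, eps = 1e-6) {
  g <- numeric(length(psi))
  for (i in seq_along(psi)) {
    dpsi <- numeric(length(psi)); dpsi[i] <- eps
    g[i] <- (loglik(psi + dpsi) - loglik(psi - dpsi)) / (2 * eps)
  }
  g
}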

At least two algorithms can be used to maximise (1). The first is the so-called EM (Expectation-Maximisation) algorithm, and the second is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm. I have successfully implemented the EM algorithm and this is providing correct results but is very slow; in light of this I now wish to implement the BFGS algorithm as this has better performance.

So, expanding (2) around a current trial value $\tilde{\psi}$ using Taylor's theorem, we can write

$$\partial_{1}(\psi) \simeq \tilde{\partial}_{1}(\psi) + \tilde{\partial}_{2}(\psi)(\psi - \tilde{\psi}),$$

where $\partial_{2}(\psi)$ is the Hessian matrix (second derivative of the log-likelihood function with respect to $\psi$) and $\tilde{\partial}_{N}(\psi) = \partial_{N}(\psi)|_{\psi = \tilde{\psi}}$. Setting the left-hand side to zero (the first-order condition at the maximum) and solving for $\psi$, we can write our updated parameter vector as

$$\bar{\psi} = \tilde{\psi} - s\,\tilde{\partial}_{2}(\psi)^{-1}\tilde{\partial}_{1}(\psi) = \tilde{\psi} + s\,\tilde{\pi}(\psi), \;\;\;\; (3)$$

where $\tilde{\pi}(\psi) = -\tilde{\partial}_{2}(\psi)^{-1}\tilde{\partial}_{1}(\psi)$ is the search direction and $s$ is the step size. So the basic outline of the BFGS algorithm for obtaining the parameter vector that maximizes (1) is:

  1. Initialise parameter vector $\psi = \psi^{*}$.

  2. Run the Kalman filter and the disturbance smoother and from their output compute the score vector (2) at $\psi = \psi^{*}$. While running the Kalman filter, also evaluate the log-likelihood function (1).

  3. Use (3) to obtain new values of $\psi$, say $\psi^{+}$, set $\psi^{*} = \psi^{+}$ and go back to step 2 until the value of the log-likelihood function (1) no longer improves (a rough sketch of this loop is given after the list).
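To make the bookkeeping in steps 1–3 concrete, here is a rough sketch of the outer quasi-Newton loop. The functions loglik, score and line_search are placeholders for the routines described above (the score may come from the disturbance smoother or from a numerical approximation), so this is an illustration only, not a working implementation of any particular package:

# Sketch of a BFGS iteration for maximizing (1); all helpers are placeholders.
bfgs_mle <- function(psi, loglik, score, line_search, tol = 1e-8, maxit = 100) {
  B  <- diag(length(psi))           # approximation to the inverse Hessian of -logL
  g  <- -score(psi)                 # gradient of the negative log-likelihood
  ll <- loglik(psi)
  for (k in seq_len(maxit)) {
    direction <- -drop(B %*% g)     # search direction, pi(psi) in the notation above
    s <- line_search(psi, direction)
    psi_new <- psi + s * direction
    ll_new  <- loglik(psi_new)
    g_new   <- -score(psi_new)
    # standard BFGS update of the inverse Hessian approximation
    dpsi <- psi_new - psi
    dg   <- g_new - g
    rho  <- 1 / drop(crossprod(dg, dpsi))
    I    <- diag(length(psi))
    B <- (I - rho * dpsi %*% t(dg)) %*% B %*% (I - rho * dg %*% t(dpsi)) +
         rho * dpsi %*% t(dpsi)
    improved <- ll_new - ll
    psi <- psi_new; g <- g_new; ll <- ll_new
    if (improved < tol) break       # stop when (1) no longer improves
  }
  psi
}

The line_search placeholder is exactly the missing piece the question below is about.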

This seems fine, but my problem is how to calculate the step size $s$ in (3) above. I know a line search is required, but since $\psi$ is a vector it is not clear which scalar function I should hand to a traditional line search technique in order to evaluate $s$.

Question: what line search method should I use in order to evaluate $s$?


P.S. Assume that I would like to use the zoom line search algorithm (Algorithm 3.5 of Numerical Optimization by Nocedal & Wright).

Best Answer

I will give you an answer based on my experience developing the R package stsm, which is introduced in this document. By default the package uses the function optimize, which combines a golden section search with successive parabolic interpolation. Other line search algorithms, such as the zoom line search algorithm, will do the job similarly well. The algorithm in optimize is convenient because it does not require the gradient of the function to be minimized. Here is some pseudo code to compute the optimal step size:

fcn.step <- function(step, model, direction)
{
  # candidate parameters obtained by moving 'step' along the search direction
  trial.pars <- model$pars + step * direction
  # return the negative log-likelihood so that minimizing it maximizes (1);
  # logLik() stands for whatever routine evaluates the likelihood of the model
  -logLik(model, pars = trial.pars)
}
# line search over the step size; extra arguments are passed on to fcn.step
s <- optimize(f = fcn.step, interval = c(0, 1),
              model = model, direction = direction)$minimum

fcn.step is the function to be minimized; its only free argument is the step size, while the parameters of the model are fixed at the values from the last iteration of the BFGS optimization algorithm. The negative of the log-likelihood function is minimized with respect to the step size; direction is the search direction computed at the current iteration of the BFGS algorithm, $\tilde{\pi}(\psi)$ in your notation.

The choice of the upper limit of the interval over which the optimal step size s is searched is also important. By default, the package stsm calculates the maximum value of s that is compatible with positive values of the variance parameters updated in equation (3). If this value is lower than $1$ it is taken as the upper bound passed to the line search algorithm; otherwise the bound is set equal to $1$. You may look at the function step.maxsize in the package or the function calc.constraint in this implementation of the BFGS algorithm.
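To sketch the idea behind that bound (this is an illustration of mine, not the step.maxsize code itself): the upper limit is the largest step that keeps every variance parameter positive along the search direction, capped at 1.

# Largest step size that keeps all (variance) parameters positive along
# 'direction', capped at 1.  Illustration only, not the stsm code.
max_step <- function(pars, direction, cap = 1) {
  neg <- direction < 0                    # only these coordinates can hit zero
  if (!any(neg)) return(cap)
  s.max <- min(-pars[neg] / direction[neg])
  min(cap, s.max)
}
# used, for example, as:
# s <- optimize(fcn.step, interval = c(0, max_step(model$pars, direction)),
#               model = model, direction = direction)$minimum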

With this approach you can apply the BFGS algorithm without the risk of reaching a local optimum with negative variance parameters. Otherwise, I would recommend either the L-BFGS-B algorithm, which allows setting a lower bound of zero on the parameter space over which the algorithm searches for the optimum, or a reparameterization of the model, e.g. variance = theta^2 or variance = exp(theta), maximizing the likelihood with respect to theta.
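If you prefer the reparameterization route, a minimal sketch (loglik_var is a placeholder for a routine that evaluates the log-likelihood at given variances) would be:

# Reparameterize the variances as exp(theta) so the search is unconstrained;
# 'loglik_var' is assumed to take the variances themselves as its argument.
negloglik_theta <- function(theta, loglik_var) -loglik_var(exp(theta))
# e.g. fit <- optim(par = log(start.variances), fn = negloglik_theta,
#                   loglik_var = loglik_var, method = "BFGS")
# variances.hat <- exp(fit$par)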

As regards the EM algorithm, as you mention, it is slow to converge. Strangely enough, it converges more slowly as it approaches the local optimum. For some insight into this issue and a modified EM algorithm you may look at this document. I am the author of that document; you may send me an e-mail if you have any questions about it or if you want a copy (the link is not always available).

Optimization of the likelihood function of state space models can be hard in practice, so some enhancements to the general-purpose optimization algorithm are recommended. For example, one of the parameters of the model can be concentrated out of the likelihood function; maximum likelihood in the frequency domain also enjoys some computational advantages and may provide good starting values for the optimization of the time-domain likelihood function.
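To illustrate the concentration idea in the univariate case ($p = 1$): writing $\mathbf{H}_{t} = \sigma^2_{\epsilon}$ and $\mathbf{Q}_{t} = q\,\sigma^2_{\epsilon}$ (with $\mathbf{P}_{1}$ scaled accordingly or initialized diffusely), the innovations $v_{t}$ and the scaled variances $F_{t}^{*} = \mathbf{F}_{t}/\sigma^2_{\epsilon}$ obtained from a filter run with unit scale do not depend on $\sigma^2_{\epsilon}$, so the scale has the closed-form maximizer

$$\hat{\sigma}^2_{\epsilon} = \frac{1}{n}\sum^{n}_{t = 1}\frac{v_{t}^{2}}{F_{t}^{*}},$$

and substituting it back into (1) gives the concentrated log-likelihood

$$\log L_{c} = -\frac{n}{2}\log(2\pi) - \frac{n}{2} - \frac{n}{2}\log\hat{\sigma}^2_{\epsilon} - \frac{1}{2}\sum^{n}_{t = 1}\log F_{t}^{*},$$

which has to be maximized over $q$ only.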
