Solved – Maximum Likelihood in Multivariate Linear Regression

Tags: linear model, maximum likelihood, multivariate normal distribution, optimization, regression

This question is about fitting a multivariate linear regression by maximum likelihood, under a specific parameterization of the covariance matrix, when the number of observations is smaller than the number of responses. It arises in an applied project that I'm part of.

Let $Y_i \in \mathbb R^r, i=1, \dots, n$ be independent multivariate normal with (non-stochastic) mean $\beta'X_i$ and covariance matrix $\Sigma = \Lambda (R_1 \otimes R_2)\Lambda$. $\Lambda$ is a diagonal matrix and the $R_i$ are correlation matrices of sizes $r_1$ and $r_2$, respectively, where $r_1 \times r_2 = r$.
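As a concrete illustration of this covariance parameterization (the sizes $r_1 = 2$, $r_2 = 3$ and the matrix entries below are made up), $\Sigma = \Lambda (R_1 \otimes R_2)\Lambda$ can be assembled in numpy as:

```python
import numpy as np

# Illustrative sizes: r1 = 2, r2 = 3, so r = r1 * r2 = 6
Lam = np.diag([0.5, 1.0, 1.5, 2.0, 0.8, 1.2])   # diagonal scale matrix Lambda
R1 = np.array([[1.0, 0.3],
               [0.3, 1.0]])                      # r1 x r1 correlation matrix
R2 = np.array([[1.0, 0.2, 0.1],
               [0.2, 1.0, 0.4],
               [0.1, 0.4, 1.0]])                 # r2 x r2 correlation matrix

Sigma = Lam @ np.kron(R1, R2) @ Lam              # r x r covariance matrix
```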

The number of predictors, $p$, is small enough that the MLE of $\beta$ may be computed as $\hat{\beta} = (X'X)^{-1}X'Y$, where $Y$ and $X$ are $n\times r$ and $n\times p$ matrices with the $Y_i$ and $X_i$ as rows.
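For instance, with simulated data (the sizes and noise level below are arbitrary), $\hat{\beta}$ can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 50, 3, 6   # illustrative sizes; n >= p so X'X is invertible

X = rng.standard_normal((n, p))            # n x p design matrix
beta_true = rng.standard_normal((p, r))    # p x r coefficient matrix
Y = X @ beta_true + 0.1 * rng.standard_normal((n, r))

# MLE of beta; lstsq solves the normal equations (X'X) beta = X'Y stably
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```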

Now, after profiling out $\beta$ the profile log-likelihood is:

$$
\ell(\Lambda, R_1, R_2) = -\frac{nr}{2}\log(2\pi) - n\log\vert \Lambda\vert - \frac{nr_2}{2}\log\vert R_1\vert - \frac{nr_1}{2}\log\vert R_2\vert - \frac{n}{2}\mathrm{tr}\left[\Lambda^{-1}(R_1^{-1}\otimes R_2^{-1})\Lambda^{-1}S\right],
$$

where $S = (Y - X\hat{\beta})'(Y - X\hat{\beta})/n$.
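A sketch of evaluating this profile log-likelihood at given parameter values, assuming numpy and the definitions above (the helper function is hypothetical, not from the original post):

```python
import numpy as np

def profile_loglik(lam, R1, R2, S, n):
    """Profile log-likelihood ell(Lambda, R1, R2).

    lam : 1-D array holding the diagonal of Lambda (length r = r1 * r2)
    R1, R2 : correlation matrices of sizes r1 and r2
    S : r x r matrix (Y - X beta_hat)'(Y - X beta_hat) / n
    """
    r1, r2 = R1.shape[0], R2.shape[0]
    r = r1 * r2
    _, logdet_R1 = np.linalg.slogdet(R1)
    _, logdet_R2 = np.linalg.slogdet(R2)
    # A = Lambda^{-1} (R1^{-1} kron R2^{-1}) Lambda^{-1}, using that Lambda is diagonal
    A = np.kron(np.linalg.inv(R1), np.linalg.inv(R2)) / np.outer(lam, lam)
    # tr(A S) = elementwise sum of A * S since both matrices are symmetric
    trace_term = np.sum(A * S)
    return (-0.5 * n * r * np.log(2.0 * np.pi)
            - n * np.sum(np.log(lam))
            - 0.5 * n * r2 * logdet_R1
            - 0.5 * n * r1 * logdet_R2
            - 0.5 * n * trace_term)
```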

Question: Can this be optimized over $\Lambda, R_1, R_2$ subject to the constraint that $\Lambda$ is diagonal and $R_1,R_2$ are correlation matrices? I do have (unconstrained) gradients for all parameters.


Best Answer

I solved this problem with a blockwise coordinate descent-type algorithm. Given the low interest in this question I leave out the details, but if anyone interested in this or a similar problem runs into this question and answer, I'd be happy to expand.

Initialize $k = 0$, $\Lambda^0 = I$, $R_1^0 = I$ and $R_2^0 = I$. Then repeat until convergence:

  1. Update $R_1^{k + 1}$ by solving the first-order condition $\partial \ell(\Lambda^{k}, R_1, R_2^k)/\partial R_1^{-1} = 0$
  2. Update $R_2^{k + 1}$ by solving the first-order condition $\partial \ell(\Lambda^{k}, R_1^{k + 1}, R_2)/\partial R_2^{-1} = 0$
  3. Rescale $R_1^{k + 1}$, $R_2^{k + 1}$ and $\Lambda^k$ to satisfy the constraints without changing the likelihood value. E.g., $\Lambda^k \leftarrow \Lambda^k[(R_1^{k + 1} \circ I) \otimes (R_2^{k + 1} \circ I)]^{1/2}$, where $\circ$ denotes the Hadamard product and $\otimes$ the Kronecker product.
  4. Let $\Lambda = \mathrm{diag}(1/\theta_1, \dots, 1/\theta_r)$ and for $i = 1, \dots, r$ update $\theta_i$ by solving the first-order condition $\partial \ell(\theta_1^{k + 1}, \dots, \theta_{i-1}^{k + 1}, \theta_i, \theta_{i + 1}^{k}, \dots, \theta_r^k, R_1^{k + 1}, R_2^{k + 1}) / \partial\theta_i = 0$.
  5. Set $k \leftarrow k + 1$.
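The rescaling step can be checked numerically: normalizing the unconstrained first-order-condition solutions to unit diagonals and absorbing the extracted scales into $\Lambda$ leaves $\Sigma$, and hence the likelihood, unchanged. A sketch with made-up intermediate iterates:

```python
import numpy as np

rng = np.random.default_rng(1)
r1, r2 = 2, 3

# Hypothetical intermediate iterates: Lam diagonal; V1, V2 symmetric positive
# definite but not unit-diagonal (e.g. raw first-order-condition solutions).
Lam = np.diag(rng.uniform(0.5, 2.0, r1 * r2))
A1 = rng.standard_normal((r1, r1)); V1 = A1 @ A1.T + r1 * np.eye(r1)
A2 = rng.standard_normal((r2, r2)); V2 = A2 @ A2.T + r2 * np.eye(r2)

Sigma_before = Lam @ np.kron(V1, V2) @ Lam

# Rescale V1, V2 to correlation matrices; (V o I)^{1/2} is the diagonal
# matrix of square roots of the diagonal entries of V.
d1 = np.sqrt(np.diag(V1)); d2 = np.sqrt(np.diag(V2))
R1 = V1 / np.outer(d1, d1)
R2 = V2 / np.outer(d2, d2)
# Absorb the scales into Lam: diag of D1 kron D2 is kron(d1, d2)
Lam = Lam @ np.diag(np.kron(d1, d2))

Sigma_after = Lam @ np.kron(R1, R2) @ Lam
# Sigma_before and Sigma_after agree, so the likelihood value is unchanged.
```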

Some notes: every update is available in closed form, and each blockwise subproblem is convex, but the full problem is not jointly convex in general, i.e. there is no guarantee that the point of convergence is the global maximum of the likelihood function.