Solved – Maximum Likelihood in Multivariate Linear Regression

Tags: linear model, maximum likelihood, multivariate normal distribution, optimization, regression

This question is about fitting a multivariate linear regression by maximum likelihood, under a specific parameterization of the covariance matrix, when the number of observations is smaller than the number of responses. It arises in an applied project that I'm part of.

Let $Y_i \in \mathbb R^r, i=1, \dots, n$ be independent multivariate normal with (non-stochastic) mean $\beta'X_i$ and covariance matrix $\Sigma = \Lambda (R_1 \otimes R_2)\Lambda$. $\Lambda$ is a diagonal matrix and the $R_i$ are correlation matrices of sizes $r_1$ and $r_2$, respectively, where $r_1 \times r_2 = r$.
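As a concrete illustration of this covariance parameterization (the sizes $r_1 = 2$, $r_2 = 3$ and the matrix entries below are made up), $\Sigma = \Lambda (R_1 \otimes R_2)\Lambda$ can be assembled in numpy as:

```python
import numpy as np

# Illustrative sizes: r1 = 2, r2 = 3, so r = r1 * r2 = 6
Lam = np.diag([0.5, 1.0, 1.5, 2.0, 0.8, 1.2])   # diagonal scale matrix Lambda
R1 = np.array([[1.0, 0.3],
               [0.3, 1.0]])                      # r1 x r1 correlation matrix
R2 = np.array([[1.0, 0.2, 0.1],
               [0.2, 1.0, 0.4],
               [0.1, 0.4, 1.0]])                 # r2 x r2 correlation matrix

Sigma = Lam @ np.kron(R1, R2) @ Lam              # r x r covariance matrix
```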

The number of predictors, $p$, is small enough that the MLE of $\beta$ may be computed as $\hat{\beta} = (X'X)^{-1}X'Y$, where $Y$ and $X$ are $n\times r$ and $n\times p$ matrices with the $Y_i$ and $X_i$ as rows.
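For instance, with simulated data (the sizes and noise level below are arbitrary), $\hat{\beta}$ can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 50, 3, 6   # illustrative sizes; n >= p so X'X is invertible

X = rng.standard_normal((n, p))            # n x p design matrix
beta_true = rng.standard_normal((p, r))    # p x r coefficient matrix
Y = X @ beta_true + 0.1 * rng.standard_normal((n, r))

# MLE of beta; lstsq solves the normal equations (X'X) beta = X'Y stably
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```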

Now, after profiling out $\beta$ the profile log-likelihood is:

$$
\ell(\Lambda, R_1, R_2) = -\frac{nr}{2}\log(2\pi) - n\log\vert \Lambda\vert - \frac{nr_2}{2}\log\vert R_1\vert - \frac{nr_1}{2}\log\vert R_2\vert - \frac{n}{2}\mathrm{tr}\left[\Lambda^{-1}(R_1^{-1}\otimes R_2^{-1})\Lambda^{-1}S\right],
$$

where $S = (Y - X\hat{\beta})'(Y - X\hat{\beta})/n$.
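A sketch of evaluating this profile log-likelihood at given parameter values, assuming numpy and the definitions above (the helper function is hypothetical, not from the original post):

```python
import numpy as np

def profile_loglik(lam, R1, R2, S, n):
    """Profile log-likelihood ell(Lambda, R1, R2).

    lam : 1-D array holding the diagonal of Lambda (length r = r1 * r2)
    R1, R2 : correlation matrices of sizes r1 and r2
    S : r x r matrix (Y - X beta_hat)'(Y - X beta_hat) / n
    """
    r1, r2 = R1.shape[0], R2.shape[0]
    r = r1 * r2
    _, logdet_R1 = np.linalg.slogdet(R1)
    _, logdet_R2 = np.linalg.slogdet(R2)
    # A = Lambda^{-1} (R1^{-1} kron R2^{-1}) Lambda^{-1}, using that Lambda is diagonal
    A = np.kron(np.linalg.inv(R1), np.linalg.inv(R2)) / np.outer(lam, lam)
    # tr(A S) = elementwise sum of A * S since both matrices are symmetric
    trace_term = np.sum(A * S)
    return (-0.5 * n * r * np.log(2.0 * np.pi)
            - n * np.sum(np.log(lam))
            - 0.5 * n * r2 * logdet_R1
            - 0.5 * n * r1 * logdet_R2
            - 0.5 * n * trace_term)
```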

Question: Can this be optimized over $\Lambda, R_1, R_2$ subject to the constraint that $\Lambda$ is diagonal and $R_1,R_2$ are correlation matrices? I do have (unconstrained) gradients for all parameters.


Best Answer

I solved this problem with a blockwise coordinate descent-type algorithm. Given the low interest in this question I leave out the details, but if anyone interested in this or a similar problem runs into this question and answer, I'd be happy to expand.

Initialize $k = 0$, $\Lambda^0 = I$, $R_1^0 = I$ and $R_2^0 = I$. Then repeat until convergence:

  1. Update $R_1^{k + 1}$ by solving the first-order condition $\partial \ell(\Lambda^{k}, R_1, R_2^k)/\partial R_1^{-1} = 0$
  2. Update $R_2^{k + 1}$ by solving the first-order condition $\partial \ell(\Lambda^{k}, R_1^{k + 1}, R_2)/\partial R_2^{-1} = 0$
  3. Rescale $R_1^{k + 1}$, $R_2^{k + 1}$ and $\Lambda^k$ to satisfy the constraints without changing the likelihood value. E.g., $\Lambda^k \leftarrow \Lambda^k[(R_1^{k + 1} \circ I) \otimes (R_2^{k + 1} \circ I)]^{1/2}$, where $\circ$ denotes the Hadamard product and $\otimes$ the Kronecker product.
  4. Let $\Lambda = \mathrm{diag}(1/\theta_1, \dots, 1/\theta_r)$ and for $i = 1, \dots, r$ update $\theta_i$ by solving the first-order condition $\partial \ell(\theta_1^{k + 1}, \dots, \theta_{i-1}^{k + 1}, \theta_i, \theta_{i + 1}^{k}, \dots, \theta_r^k, R_1^{k + 1}, R_2^{k + 1}) / \partial\theta_i = 0$.
  5. Set $k \leftarrow k + 1$.
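The rescaling step can be checked numerically: normalizing the unconstrained first-order-condition solutions to unit diagonals and absorbing the extracted scales into $\Lambda$ leaves $\Sigma$, and hence the likelihood, unchanged. A sketch with made-up intermediate iterates:

```python
import numpy as np

rng = np.random.default_rng(1)
r1, r2 = 2, 3

# Hypothetical intermediate iterates: Lam diagonal; V1, V2 symmetric positive
# definite but not unit-diagonal (e.g. raw first-order-condition solutions).
Lam = np.diag(rng.uniform(0.5, 2.0, r1 * r2))
A1 = rng.standard_normal((r1, r1)); V1 = A1 @ A1.T + r1 * np.eye(r1)
A2 = rng.standard_normal((r2, r2)); V2 = A2 @ A2.T + r2 * np.eye(r2)

Sigma_before = Lam @ np.kron(V1, V2) @ Lam

# Rescale V1, V2 to correlation matrices; (V o I)^{1/2} is the diagonal
# matrix of square roots of the diagonal entries of V.
d1 = np.sqrt(np.diag(V1)); d2 = np.sqrt(np.diag(V2))
R1 = V1 / np.outer(d1, d1)
R2 = V2 / np.outer(d2, d2)
# Absorb the scales into Lam: diag of D1 kron D2 is kron(d1, d2)
Lam = Lam @ np.diag(np.kron(d1, d2))

Sigma_after = Lam @ np.kron(R1, R2) @ Lam
# Sigma_before and Sigma_after agree, so the likelihood value is unchanged.
```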

Some notes: every update is available in closed form, and each blockwise subproblem is convex, but the full problem is not jointly convex in general, i.e. there is no guarantee that the point of convergence is the global maximum of the likelihood function.