Solved – score function of bivariate/multivariate normal distribution

covariance-matrix, mathematical-statistics, matrix, multivariate-normal-distribution, normal-distribution

I tried to obtain the score vector (the gradient of the log-density with respect to the parameters) of the multivariate normal distribution.

The density of the multivariate normal distribution:
\begin{align*}
f(x) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left[-\frac{1}{2}(\boldsymbol{x} -\boldsymbol{\mu})^{\rm{T}} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} -\boldsymbol{\mu})\right].
\end{align*}

Using matrix calculus, I can obtain this quite easily:

\begin{align*}
\boldsymbol{u}(y, \theta) = \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} =
\begin{bmatrix}
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_1}\\
\vdots\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_{2d + \frac{d(d-1)}{2}}}\\
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \mu_1}\\
\vdots\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \mu_d}\\
\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_1^2}\\
\vdots\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_d^2}\\
\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_{1,2}}\\
\vdots\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_{d-1,d}}\\
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\mu}}\\
\text{diagonal elements of } \left(\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}}\right)\\
\text{upper triangular elements of } \left(\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}}\right)
\end{bmatrix}
\end{align*}
where
\begin{align*}
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\mu}} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{y} -\boldsymbol{\mu}) \quad \mbox{and} \quad \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}} = -\frac{1}{2} \left(\boldsymbol{\Sigma}^{-1} -\boldsymbol{\Sigma}^{-1}(\boldsymbol{y} -\boldsymbol{\mu})(\boldsymbol{y} -\boldsymbol{\mu})^{\rm{T}}\boldsymbol{\Sigma}^{-1}\right).
\end{align*}
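These matrix formulas can be spot-checked by numerical differentiation; here is a minimal base-R sketch for the $\boldsymbol{\mu}$-gradient (the helper names `logf.mvn` and `num.grad.mu` and the example values are illustrative, not from the derivation):

```r
# Sketch: spot-check the mu-gradient above by central finite differences.
# Base R only; the helper names 'logf.mvn' and 'num.grad.mu' are mine.
logf.mvn = function(y, mu, Sigma) {
  d = length(y)
  -0.5*(d*log(2*pi) + log(det(Sigma)) +
          drop(t(y - mu) %*% solve(Sigma) %*% (y - mu)))
}

num.grad.mu = function(y, mu, Sigma, h = 1e-6) {
  sapply(seq_along(mu), function(j) {
    e = replace(numeric(length(mu)), j, h)
    (logf.mvn(y, mu + e, Sigma) - logf.mvn(y, mu - e, Sigma))/(2*h)
  })
}

# Hypothetical example values
y = c(1, 2); mu = c(0, 0)
Sigma = matrix(c(2, 0.5, 0.5, 1), 2, 2)
analytic = drop(solve(Sigma) %*% (y - mu))
numeric.g = num.grad.mu(y, mu, Sigma)
max(abs(analytic - numeric.g))  # small (finite-difference error only)
```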

As a check, I derived this again, this time with element-wise (non-matrix) calculus in the bivariate case.

The bivariate normal density:
\begin{align*}
f(x_1, x_2) &= \frac{1}{2\pi\sigma_1 \sigma_2 \sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x_1 -\mu_1)^2}{\sigma_1^2} -\frac{2\rho(x_1 -\mu_1)(x_2 -\mu_2)}{\sigma_1 \sigma_2} +\frac{(x_2 -\mu_2)^2}{\sigma_2^2} \right)\right]\\
&= \frac{1}{2\pi\sigma_1 \sigma_2 \sqrt{1-\frac{\sigma_{1,2}^2}{\sigma_1^2 \sigma_2^2}}} \exp\left[-\frac{1}{2 \left(1-\frac{\sigma_{1,2}^2}{\sigma_1^2 \sigma_2^2}\right)}\left(\frac{(x_1 -\mu_1)^2}{\sigma_1^2} -\frac{2\sigma_{1,2}(x_1 -\mu_1)(x_2 -\mu_2)}{\sigma_1^2 \sigma_2^2} +\frac{(x_2 -\mu_2)^2}{\sigma_2^2} \right)\right]\\
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2}} \exp\left[-\frac{1}{2 (\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2)}\left(\sigma_2^2 (x_1 -\mu_1)^2 -2\sigma_{1,2}(x_1 -\mu_1)(x_2 -\mu_2) +\sigma_1^2 (x_2 -\mu_2)^2 \right)\right]
\end{align*}
Log density:
\begin{align*}
\log f(x_1, x_2) &= -\log 2\pi -\frac{1}{2} \log \left(\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2\right) \\
&\quad -\frac{1}{2 (\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2)}\left(\sigma_2^2 (x_1 -\mu_1)^2 -2\sigma_{1,2}(x_1 -\mu_1)(x_2 -\mu_2) +\sigma_1^2 (x_2 -\mu_2)^2 \right)
\end{align*}
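As a quick sanity check, the expanded log-density above can be compared numerically to the matrix form; a minimal base-R sketch (all variable names and values here are illustrative):

```r
# Sketch: numerically confirm that the expanded bivariate log-density above
# matches the matrix form; all variable names and values are illustrative.
x = c(1.3, -0.4); mu = c(0.5, 0.2)
s1sq = 2; s2sq = 1.5; s12 = 0.6

D = s1sq*s2sq - s12^2
z = s2sq*(x[1] - mu[1])^2 - 2*s12*(x[1] - mu[1])*(x[2] - mu[2]) +
  s1sq*(x[2] - mu[2])^2
logf.expanded = -log(2*pi) - 0.5*log(D) - z/(2*D)

Sigma = matrix(c(s1sq, s12, s12, s2sq), 2, 2)
logf.matrix = -log(2*pi) - 0.5*log(det(Sigma)) -
  0.5*drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))

all.equal(logf.expanded, logf.matrix)  # TRUE
```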

Elements of the score vector:
\begin{align*}
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_1} &= \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \mu_1}
= \frac{1}{(1-\rho^2)}\left(\frac{x_1 -\mu_1}{\sigma_1^2} -\frac{\rho(x_2 -\mu_2)}{\sigma_1 \sigma_2}\right)\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_2} &= \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \mu_2}
= \frac{1}{(1-\rho^2)}\left(\frac{x_2 -\mu_2}{\sigma_2^2} -\frac{\rho(x_1 -\mu_1)}{\sigma_1 \sigma_2}\right)\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_3} &= \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_1^2}
= \frac{1}{2\left(\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2 \right)} \left(-\sigma_2^2 +\frac{\sigma_2^2}{\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2}z -(x_2-\mu_2)^2 \right)\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_4} &= \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_2^2}
= \frac{1}{2\left(\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2 \right)} \left(-\sigma_1^2 +\frac{\sigma_1^2}{\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2}z -(x_1-\mu_1)^2 \right)\\
\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \theta_5} &= \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_{1,2}}
= \frac{1}{\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2} \left(\sigma_{1,2} +(x_1 -\mu_1)(x_2 -\mu_2) -\frac{\sigma_{1,2}}{\sigma_1^2 \sigma_2^2 -\sigma_{1,2}^2}z\right)
\end{align*}
where
\begin{align*}
z = \left(\sigma_2^2 (x_1 -\mu_1)^2 -2\sigma_{1,2}(x_1 -\mu_1)(x_2 -\mu_2) +\sigma_1^2 (x_2 -\mu_2)^2 \right).
\end{align*}
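The $\sigma_{1,2}$ derivative above can likewise be spot-checked against a central finite difference of the expanded log-density; a minimal base-R sketch (function name and example values illustrative):

```r
# Sketch: check the sigma_{1,2} derivative above against a central finite
# difference of the expanded log-density (base R; names illustrative).
logf.bvn = function(x, mu, s1sq, s2sq, s12) {
  D = s1sq*s2sq - s12^2
  z = s2sq*(x[1] - mu[1])^2 - 2*s12*(x[1] - mu[1])*(x[2] - mu[2]) +
    s1sq*(x[2] - mu[2])^2
  -log(2*pi) - 0.5*log(D) - z/(2*D)
}

x = c(1.3, -0.4); mu = c(0.5, 0.2)      # hypothetical values
s1sq = 2; s2sq = 1.5; s12 = 0.6; h = 1e-6

D = s1sq*s2sq - s12^2
z = s2sq*(x[1] - mu[1])^2 - 2*s12*(x[1] - mu[1])*(x[2] - mu[2]) +
  s1sq*(x[2] - mu[2])^2
analytic = (s12 + (x[1] - mu[1])*(x[2] - mu[2]) - s12*z/D)/D
numeric.d = (logf.bvn(x, mu, s1sq, s2sq, s12 + h) -
               logf.bvn(x, mu, s1sq, s2sq, s12 - h))/(2*h)
abs(analytic - numeric.d)  # small, confirming the element-wise formula
```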

Then, I implemented both versions in R.

Implementation of the matrix version:

score.mvn.func = function(y, mu, Sigma) {
  # This function analytically computes the score vector of the multivariate normal distribution.

  # Measure the dimension of the data
  d = length(y)

  # Safety check: all inputs must have matching dimensions.
  if (length(mu) != d || any(dim(Sigma) != c(d, d))) {
    stop("The dimension of the input is not proper.")
  }

  # Frame to write down the result
  score.vec = rep(NA, 2*d + choose(d,2))

  # Invert Sigma once and reuse it
  Sigma.inv = solve(Sigma)

  # Differentiate with respect to 'mu'.
  score.vec[1:d] = Sigma.inv %*% (y - mu)
  # Differentiate with respect to 'Sigma'.
  logf.by.Sigma = -(1/2)*(Sigma.inv - Sigma.inv %*% (y - mu) %*% t(y - mu) %*% Sigma.inv)
  # Diagonal: derivatives with respect to the variances
  score.vec[(d+1):(2*d)] = diag(logf.by.Sigma)
  # Upper triangle: derivatives with respect to the covariances
  score.vec[(2*d+1):length(score.vec)] = logf.by.Sigma[upper.tri(logf.by.Sigma)]

  return(score.vec)
}

Implementation of the non-matrix version:

score.bvn.func = function(y, mu, Sigma) {
  # This function analytically computes the score vector of the bivariate normal distribution.

  # Measure the dimension of the data
  d = 2

  # Safety check: all inputs must have matching dimensions.
  if (length(mu) != d || any(dim(Sigma) != c(d, d))) {
    stop("The dimension of the input is not proper.")
  }

  # Transform covariance to correlation
  rho = Sigma[1,2]/sqrt(Sigma[1,1]*Sigma[2,2])

  # Frame to write down the result
  score.vec = rep(NA, 2*d + choose(d,2))

  # Differentiate with respect to 'mu'.
  score.vec[1] = 1/(1 - rho^2)*
    (
      (y[1] - mu[1])/(Sigma[1,1]) - (rho*(y[2] - mu[2]))/sqrt(Sigma[1,1]*Sigma[2,2])
    )

  score.vec[2] = 1/(1 - rho^2)*
    (
      (y[2] - mu[2])/(Sigma[2,2]) - (rho*(y[1] - mu[1]))/sqrt(Sigma[1,1]*Sigma[2,2])
    )

  score.vec[3] = 1/(2*(Sigma[1,1]*Sigma[2,2] - Sigma[1,2]^2))*(
    -Sigma[2,2] + Sigma[2,2]/(Sigma[1,1]*Sigma[2,2] - Sigma[1,2]^2)*(
      Sigma[2,2]*(y[1] - mu[1])^2 - 2*Sigma[1,2]*(y[1] - mu[1])*(y[2] - mu[2]) + Sigma[1,1]*(y[2] - mu[2])^2
    ) -
      (y[2] - mu[2])^2
  )

  score.vec[4] = 1/(2*(Sigma[1,1]*Sigma[2,2] - Sigma[1,2]^2))*(
    -Sigma[1,1] + Sigma[1,1]/(Sigma[1,1]*Sigma[2,2] - Sigma[1,2]^2)*(
      Sigma[2,2]*(y[1] - mu[1])^2 - 2*Sigma[1,2]*(y[1] - mu[1])*(y[2] - mu[2]) + Sigma[1,1]*(y[2] - mu[2])^2  
    ) -
      (y[1] - mu[1])^2
  )

  score.vec[5] = (1/(Sigma[1,1]*Sigma[2,2] - Sigma[1,2]^2))*(
    Sigma[1,2] + (y[1] - mu[1])*(y[2] - mu[2]) - (Sigma[1,2]/(Sigma[1,1]*Sigma[2,2] - Sigma[1,2]^2))*(
      Sigma[2,2]*(y[1] - mu[1])^2 - 2*Sigma[1,2]*(y[1] - mu[1])*(y[2] - mu[2]) + Sigma[1,1]*(y[2] - mu[2])^2
    )
  )

  return(score.vec)
}

Then I compared the two methods:

> # Simulate data
> set.seed(1)
> X = cbind(rnorm(100,3,1),rnorm(100,2,2))
> mu = colMeans(X)
> Sigma = cov(X)
> 
> # Compute score vector with 2 different methods
> score.vec.method.1 = score.mvn.func(y = X[1,], mu = mu, Sigma = Sigma)
> score.vec.method.2 = score.bvn.func(y = X[1,], mu = mu, Sigma = Sigma)
> 
> # Display the result
> score.vec.method.1
[1] -0.91214631 -0.31788468 -0.20375656 -0.08570979  0.14468974
> score.vec.method.2
[1] -0.91214631 -0.31788468 -0.20375656 -0.08570979  0.28937949

As expected, they give the same result, except for $u_5 = \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_{1,2}}$.

I tried it with different data points, and it seems that $u_5$ from the non-matrix method is exactly twice $u_5$ from the matrix method.

So,
\begin{align*}
2\left(\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}}\right)_{1,2} = \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_{1,2}}
\end{align*}

instead of
\begin{align*}
\left(\frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \boldsymbol{\Sigma}}\right)_{1,2} = \frac{\partial \log f(y, \boldsymbol{\theta})}{\partial \sigma_{1,2}}
\end{align*}

What did I do wrong?
I triple-checked both my analytical derivation and the code, but I couldn't find anything that could cause this problem…

Best Answer

The problem is in the matrix differentiation. Because the covariance matrix is symmetric, each off-diagonal parameter $\sigma_{i,j}$ appears in two entries of $\boldsymbol{\Sigma}$, so differentiating with respect to $\sigma_{i,j}$ picks up both contributions. Accounting for the symmetry,

\begin{align*}
\frac{\partial \ell}{\partial \boldsymbol{\Sigma}} = -\boldsymbol{\Sigma}^{-1} + \frac{\operatorname{diag}(\boldsymbol{\Sigma}^{-1})}{2} + \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\rm{T}}\boldsymbol{\Sigma}^{-1} - \frac{\operatorname{diag}\left(\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\rm{T}}\boldsymbol{\Sigma}^{-1}\right)}{2}
\end{align*}

where $\ell$ is the log-likelihood function.
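A minimal R sketch of this symmetry adjustment (the function name `score.Sigma.sym` and the example values are illustrative): double the off-diagonal entries of the naive gradient while keeping the diagonal, which makes the $(1,2)$ entry agree with the element-wise derivative $\partial \log f / \partial \sigma_{1,2}$.

```r
# Sketch: symmetry-adjusted Sigma-gradient; the function name
# 'score.Sigma.sym' is illustrative, not from the post.
score.Sigma.sym = function(y, mu, Sigma) {
  P = solve(Sigma)
  # Naive (unrestricted) matrix gradient, as in the question
  G = -(1/2)*(P - P %*% (y - mu) %*% t(y - mu) %*% P)
  # Double the off-diagonal entries, keep the diagonal
  2*G - diag(diag(G))
}

# Hypothetical example values
y = c(1, 2); mu = c(0, 0)
Sigma = matrix(c(2, 0.5, 0.5, 1), 2, 2)
S = score.Sigma.sym(y, mu, Sigma)
# S[1,2] now equals the element-wise derivative d log f / d sigma_{1,2}
```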
