Regression – Comparing Multivariate Linear Regression vs. Several Univariate Models

Tags: multivariate-analysis, multivariate-regression, regression

In the univariate regression setting, we model

$$y = X\beta + \epsilon$$

where $y \in \mathbb{R}^n$ is a vector of $n$ observations and $X \in \mathbb{R}^{n \times m}$ is the design matrix with $m$ predictors. The least-squares solution is $\hat{\beta} = (X^TX)^{-1}X^Ty$.

In the multivariate regression setting, we model

$$Y = X\beta + \epsilon$$

where $Y \in \mathbb{R}^{n \times p}$ is a matrix of $n$ observations on $p$ different response variables, and $\beta \in \mathbb{R}^{m \times p}$. The least-squares solution is $\hat{\beta} = (X^TX)^{-1}X^TY$.

My question is: how is this different from performing $p$ separate univariate linear regressions? I have read that the multivariate case takes the correlation between the dependent variables into account, but I don't see that from the math.

Best Answer

In the setting of classical multivariate linear regression, we have the model:

$$Y = X \beta + \epsilon$$

where $X$ represents the independent variables, $Y$ represents multiple response variables, and $\epsilon$ is a Gaussian noise term whose rows are i.i.d. across observations. The noise has zero mean and may be correlated across response variables. The maximum likelihood solution for the weights is equivalent to the least squares solution, regardless of the noise correlations [1][2]:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

This is equivalent to independently solving a separate regression problem for each response variable. This can be seen from the fact that the $i$th column of $\hat{\beta}$ (containing weights for the $i$th output variable) can be obtained by multiplying $(X^T X)^{-1} X^T$ by the $i$th column of $Y$ (containing values of the $i$th response variable).
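
Here is a minimal numerical check of that equivalence (a sketch, assuming NumPy; the toy dimensions and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 100, 3, 4                       # observations, predictors, responses
X = rng.normal(size=(n, m))
B_true = rng.normal(size=(m, p))
E = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # noise correlated across responses
Y = X @ B_true + E

# Multivariate least squares: all responses at once
B_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# p separate univariate regressions, one column of Y at a time
B_sep = np.column_stack([
    np.linalg.lstsq(X, Y[:, i], rcond=None)[0] for i in range(p)
])

print(np.allclose(B_joint, B_sep))        # True: the point estimates coincide
```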

However, multivariate linear regression differs from separately solving individual regression problems because statistical inference procedures account for correlations between the multiple response variables (e.g. see [2],[3],[4]). For example, the noise covariance matrix shows up in sampling distributions, test statistics, and interval estimates.
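
To make this concrete (a standard construction; the divisor $n - m$ below assumes $X$ has full column rank $m$), inference is built on the residual covariance estimate

$$\hat{\Sigma} = \frac{1}{n - m}\,(Y - X\hat{\beta})^T (Y - X\hat{\beta})$$

whose off-diagonal entries estimate the correlations between the responses. Multivariate test statistics such as Wilks' lambda are functions of this whole matrix, whereas $p$ separate univariate fits only ever use its diagonal.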

Another difference emerges if we allow each response variable to have its own set of covariates:

$$Y_i = X_i \beta_i + \epsilon_i$$

where $Y_i$ represents the $i$th response variable, and $X_i$ and $\epsilon_i$ represent its corresponding design matrix and noise term. As above, the noise terms can be correlated across response variables. In this setting, there exist estimators that are more efficient than least squares and that cannot be reduced to solving a separate regression problem for each response variable; see, for example, the seemingly unrelated regressions estimator of [1], sketched below.
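
As an illustration, here is a sketch of Zellner's two-step feasible GLS estimator for this setting [1], assuming NumPy/SciPy; the function name `sur_fgls` and the data layout are illustrative, not a reference implementation:

```python
import numpy as np
from scipy.linalg import block_diag

def sur_fgls(X_list, y_list):
    """Two-step feasible GLS for seemingly unrelated regressions (Zellner 1962).

    X_list: list of (n, k_i) design matrices, one per response variable
    y_list: list of (n,) response vectors sharing the same n observations
    Returns the stacked coefficient vector (beta_1, ..., beta_p concatenated).
    """
    n = y_list[0].shape[0]

    # Step 1: equation-by-equation OLS, used only to estimate the noise covariance
    resid = np.column_stack([
        y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        for X, y in zip(X_list, y_list)
    ])                                         # (n, p) residual matrix
    Sigma_hat = resid.T @ resid / n            # cross-equation noise covariance estimate

    # Step 2: GLS on the stacked system, with Cov(noise) = Sigma ⊗ I_n
    X_stack = block_diag(*X_list)              # (n*p, sum_i k_i)
    y_stack = np.concatenate(y_list)           # (n*p,)
    Omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
    A = X_stack.T @ Omega_inv @ X_stack
    b = X_stack.T @ Omega_inv @ y_stack
    return np.linalg.solve(A, b)
```

When every $X_i$ is the same matrix, this estimator reduces to equation-by-equation least squares; the efficiency gain appears only when the covariate sets differ, which is exactly the point above.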

References

  1. Zellner (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias.
  2. Helwig (2017). Multivariate linear regression. [Slides]
  3. Fox and Weisberg (2011). Multivariate linear models in R. [Appendix to: An R Companion to Applied Regression]
  4. Maitra (2013). Multivariate Linear Regression Models. [Slides]