Solved – Multivariate regression vs. multiple univariate regression models

Tags: definition, hypothesis-testing, modeling, multivariate-regression, regression

This is a naive question, but I am a little confused over the term "multivariate" regression. And note this question does not (to my knowledge) pertain to "multiple" regression.

When people use the term "multivariate" regression, are they referring to the case where multiple output variables simultaneously depend on a set of input variables? Or are they referring to the case where the output variables individually depend on a set of input variables?

For example, say you have a single input variable $x$ and 2 output variables $y_1, y_2$.

The first case refers to some relationship that looks like (assuming a linear relationship):
$$
\boldsymbol{y} = \begin{bmatrix}
y_1 \\
y_2
\end{bmatrix} = wx+b
$$

where $w$ is the weight and $b$ is the offset. The key here is that both output variables share the same weight and bias.

The second case refers to some relationship that looks like this:
$$
\boldsymbol{y} = \begin{bmatrix}
y_1 \\
y_2
\end{bmatrix} = x\begin{bmatrix}
w_1 \\
w_2
\end{bmatrix}+\begin{bmatrix}
b_1 \\
b_2
\end{bmatrix}
\\
\text{or}
\\
y_1 = w_1x+b_1 \\
y_2 = w_2x+b_2
$$

The former suggests that the output variables are coupled, while the latter does not.

My readings have suggested that "multivariate regression" typically means the former, but I've seen some literature that implies the latter. Are both of these "multivariate" regression? When is one used over the other? Do we use the former when we care about the relations between the output variables, and the latter when we don't?

Best Answer

Multivariate regression means that one has multiple response variables $Y$. In matrix form this is $$ Y = XB + E $$ where $Y$ is $n\times m$ ($m$ responses observed on $n$ units), $X$ is $n\times p$, $B$ is $p\times m$, and finally $E$ is $n\times m$. This is formally very similar to $m$ multiple regressions. The model must then be completed by assumptions on the error term $E$. If the errors in the $m$ equations are independent, then the model is close to $m$ separate regressions. This is discussed in the answers here: Explain the difference between multiple regression and multivariate regression, with minimal use of symbols/math
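The equivalence in the independent-errors case can be checked numerically. The sketch below (a minimal simulation with made-up dimensions, not from the answer) fits $Y = XB + E$ once for all $m$ responses and again column by column, and the estimated coefficient matrices coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 3, 2          # n units, p predictors, m responses

X = rng.normal(size=(n, p))
B = rng.normal(size=(p, m))  # true coefficient matrix
E = rng.normal(size=(n, m))  # errors independent across equations
Y = X @ B + E                # multivariate model Y = XB + E

# One multivariate least-squares fit for all m responses at once...
B_multi, *_ = np.linalg.lstsq(X, Y, rcond=None)

# ...matches m separate univariate regressions, column by column.
B_sep = np.column_stack([np.linalg.lstsq(X, Y[:, j], rcond=None)[0]
                         for j in range(m)])

print(np.allclose(B_multi, B_sep))  # True
```

Unrestricted multivariate least squares simply ignores any correlation in $E$, which is why the answer stresses that the interesting cases arise once assumptions or restrictions are added.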

So why the tussle?

  1. Dependence between the error terms of the separate equations leads to SUR (seemingly unrelated regressions).

  2. Some coefficients might be shared between the separate equations, or there might be other restrictions on the coefficient matrix $B$. This can lead to more efficient estimation or, in the case of reduced-rank regression, better predictions.

  3. The null hypothesis of a test might involve multiple equations simultaneously; MANOVA is an example.

(These cases include both of your examples.)
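Point 2 can be illustrated with a small simulation (my own sketch, not from the answer): two equations share a single slope $w$, with intercepts omitted for simplicity. Estimating the slope from one equation alone is compared against a pooled estimate that imposes the restriction $w_1 = w_2$ by stacking the data; pooling roughly halves the estimator's variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w, reps = 50, 2.0, 2000   # true shared slope w in both equations

w_sep, w_pool = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    y1 = w * x + rng.normal(size=n)
    y2 = w * x + rng.normal(size=n)
    # Separate estimate: slope from equation 1 alone
    w_sep.append((x @ y1) / (x @ x))
    # Pooled estimate: impose w1 = w2 by stacking both equations
    xs = np.concatenate([x, x])
    ys = np.concatenate([y1, y2])
    w_pool.append((xs @ ys) / (xs @ xs))

print(np.var(w_sep), np.var(w_pool))  # pooled variance is markedly smaller
```

Both estimators are unbiased here; the restriction buys efficiency, which is exactly the gain the second point describes.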
