Multivariate Linear Models – Casting as Multiple Regression for Better Analysis

Tags: linear model, multiple regression, multivariate regression, regression

Is recasting a multivariate linear regression model as a multiple linear regression entirely equivalent? I'm not referring to simply running $t$ separate regressions.

I have read in a few places (Bayesian Data Analysis by Gelman et al., and Multivariate Statistics: Old School by Marden) that a multivariate linear model can easily be reparameterized as a multiple regression. However, neither source elaborates on this at all: they essentially just mention it, then continue using the multivariate model. Mathematically, I'll write the multivariate version first,

$$ \underset{n \times t}{\mathbf{Y}} = \underset{n \times k}{\mathbf{X}} \hspace{2mm}\underset{k \times t}{\mathbf{B}} + \underset{n \times t}{\mathbf{R}},
$$
where the bold variables are matrices with their dimensions written below them. As usual, $\mathbf{Y}$ holds the data, $\mathbf{X}$ is the design matrix, $\mathbf{R}$ contains normally distributed residuals, and $\mathbf{B}$ is what we want to make inferences about.

To reparameterize this as the familiar multiple linear regression, one simply rewrites the variables as:

$$ \underset{nt \times 1}{\mathbf{y}} = \underset{nt \times kt}{\mathbf{D}} \hspace{2mm} \underset{kt \times 1}{\boldsymbol{\beta}} + \underset{nt \times 1}{\mathbf{r}},
$$

where the reparameterizations used are $\mathbf{y} = \operatorname{row}(\mathbf{Y})$, $\boldsymbol\beta = \operatorname{row}(\mathbf{B})$, and $\mathbf{D} = \mathbf{X} \otimes \mathbf{I}_{t}$. Here $\operatorname{row}()$ means that the rows of the matrix are arranged end to end into one long vector, and $\otimes$ is the Kronecker product. (Since $\mathbf{B}$ is $k \times t$, the vector $\boldsymbol\beta$ has $kt$ entries, which is why the identity block in the Kronecker product is $\mathbf{I}_t$.)
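To see the equivalence concretely, here is a minimal sketch in base R with simulated data; the dimension choices and the `row_vec` helper are illustrative, not taken from either book:

```r
set.seed(1)
n <- 10; k <- 3; p <- 4   # p plays the role of t in the notation above

X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))  # n x k design matrix
B <- matrix(rnorm(k * p), k, p)                      # k x p coefficient matrix
Y <- X %*% B + matrix(rnorm(n * p), n, p)            # n x p response matrix

row_vec <- function(A) c(t(A))   # row(A): stack the rows end to end

y <- row_vec(Y)        # np x 1 stacked response
D <- X %x% diag(p)     # np x kp: X Kronecker I_t

beta_hat <- solve(crossprod(D), crossprod(D, y))     # stacked OLS
B_hat    <- solve(crossprod(X), crossprod(X, Y))     # multivariate OLS
all.equal(c(beta_hat), row_vec(B_hat))               # TRUE
```

The stacked OLS solution reproduces $\operatorname{row}(\hat{\mathbf{B}})$ exactly, which is consistent with the answer below: point estimation and prediction are unaffected by which formulation you use.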

So, if this is so easy, why bother writing books on multivariate models, test statistics for them, etc.? Why not just transform the variables first and use common univariate techniques? I'm sure there is a good reason; I'm just having a hard time thinking of one, at least in the case of a linear model. Are there situations with the multivariate linear model and normally distributed errors where this reparameterization does not apply, or where it limits the analyses you can undertake?

Sources where I have seen this:

Marden – Multivariate Statistics: Old School, sections 5.3–5.5. The book is available free from http://istics.net/stat/

Gelman et al. – Bayesian Data Analysis. I have the second edition, in which there is a short paragraph in Ch. 19, 'Multivariate Regression Models', titled 'The equivalent univariate regression model'.

Basically, can you do everything with the equivalent linear univariate regression model that you could with the multivariate model? If so, why develop methods for multivariate linear models at all?

What about with Bayesian approaches?

Best Answer

Basically, can you do everything with the equivalent linear univariate regression model that you could with the multivariate model?

I believe the answer is no.

If your goal is simply to estimate the effects (the parameters in $\mathbf{B}$) or to make predictions from the model, then yes, it does not matter which of the two formulations you adopt.

However, for statistical inference, and especially for classical significance testing, the multivariate formulation is practically irreplaceable. To be concrete, let me use a typical data analysis in psychology as an example. The data from $n$ subjects are expressed as

$$ \underset{n \times t}{\mathbf{Y}} = \underset{n \times k}{\mathbf{X}} \hspace{2mm}\underset{k \times t}{\mathbf{B}} + \underset{n \times t}{\mathbf{R}}, $$

where the $k-1$ between-subjects explanatory variables (factors and/or quantitative covariates) are coded as the columns of $\mathbf{X}$, while the $t$ levels of the repeated-measures (within-subject) factors are represented as the columns of $\mathbf{Y}$.

With the above formulation, any general linear hypothesis can be easily expressed as

$$\mathbf{L} \mathbf{B} \mathbf{M} = \mathbf{C},$$

where $\mathbf{L}$ contains the weights on the between-subjects explanatory variables, $\mathbf{M}$ contains the weights on the levels of the repeated-measures factors, and $\mathbf{C}$ is a constant matrix, usually $\mathbf{0}$.
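As a concrete (illustrative) case: with one between-subjects grouping variable, so that $k = 2$ (intercept plus a group dummy), and $t = 3$ repeated measures, the group-by-time interaction is tested with

$$\mathbf{L} = \begin{pmatrix} 0 & 1 \end{pmatrix}, \qquad \mathbf{M} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{pmatrix}, \qquad \mathbf{C} = \underset{1 \times 2}{\mathbf{0}},$$

where $\mathbf{L}$ picks out the group effect in $\mathbf{B}$ and the columns of $\mathbf{M}$ contrast successive time points; $\mathbf{L}\mathbf{B}\mathbf{M} = \mathbf{0}$ then says the group effect is constant across time.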

The beauty of the multivariate formulation lies in its separation of the two types of variables, between- and within-subject. It is this separation that allows the easy formulation of three types of significance testing under the multivariate framework: classical multivariate testing, repeated-measures multivariate testing, and repeated-measures univariate testing. Furthermore, Mauchly's test for violations of sphericity and the corresponding correction methods (Greenhouse-Geisser and Huynh-Feldt) for univariate testing also become natural in the multivariate system. This is exactly how statistical packages implement those tests: for example, the car package in R, GLM in IBM SPSS Statistics, and the REPEATED statement in PROC GLM of SAS.
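As a sketch of how this looks with car in R (assuming a hypothetical wide-format data frame `dat` with a between-subjects factor `group` and repeated measures `t1`, `t2`, `t3`; the names are made up for illustration):

```r
library(car)

mod <- lm(cbind(t1, t2, t3) ~ group, data = dat)   # multivariate fit

idata <- data.frame(time = factor(1:3))            # within-subject layout
fit <- Anova(mod, idata = idata, idesign = ~ time, type = "III")

## Reports the multivariate tests, the univariate repeated-measures
## tests, Mauchly's sphericity test, and the Greenhouse-Geisser and
## Huynh-Feldt corrections in one call.
summary(fit, multivariate = TRUE, univariate = TRUE)
```

Note that the model is fit once in its multivariate form; all three types of testing, plus the sphericity diagnostics, then fall out of that single fit.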

I'm not so sure whether the formulation matters in Bayesian data analysis, but I doubt the testing capability above could be formulated and implemented as easily under the univariate platform.
