Multiple Regression – Methods to Predict Multiple Dependent Variables

Tags: multiple-regression, multivariate-regression

I have a situation in which I have $n$ observations, each with $p$ independent variables and $q$ dependent variables. I would like to build a model or series of models to obtain predictions of the $q$ dependent variables for a new observation.

One way is to build multiple models, each predicting a single dependent variable. An alternative is to build a single model that predicts all the dependent variables at once (multivariate regression, PLS, etc.).

My question is: does taking multiple DVs into account simultaneously lead to a more robust/accurate/reliable model? Given that some of the $q$ dependent variables might be correlated with each other, does that hamper or help a single-model approach? Are there references I could look up on this topic?

Best Answer

You need to check for correlations amongst your dependent variables (edit: @BilalBarakat's answer is right; it is the residuals that matter here). If all or some are independent, you can run separate analyses on each. For whichever ones are not independent, you could run a multivariate analysis. This will maximize your power while holding the type I error rate at your alpha level.
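A quick way to check this is to fit equation-by-equation least squares and inspect the correlation matrix of the residuals. A minimal sketch, using synthetic data in which a shared noise source deliberately induces correlated residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 300, 4, 2
X = rng.normal(size=(n, p))

# a shared noise component makes the two DVs' errors correlated
shared = rng.normal(size=(n, 1))
E = np.hstack([shared + rng.normal(scale=0.3, size=(n, 1)),
               shared + rng.normal(scale=0.3, size=(n, 1))])
Y = X @ rng.normal(size=(p, q)) + E

# residuals from separate (equation-by-equation) OLS fits
Xd = np.hstack([np.ones((n, 1)), X])  # add intercept column
B_hat, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
resid = Y - Xd @ B_hat

# correlation matrix of the residuals across the q DVs
R = np.corrcoef(resid, rowvar=False)
```

If the off-diagonal entries of `R` are near zero, separate analyses lose little; large off-diagonal entries are the case where a joint (multivariate) analysis has something to offer.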

You should know, however, that this will not by itself make your analysis more accurate or robust. That is a separate issue from whether your model predicts the data better than the null model. In fact, with so much going on, unless you have a lot of data it is likely that you would get very different parameter estimates from a new sample; it is even possible for the sign of a beta to flip. Much depends on the sizes of $p$ and $q$ and the nature of their correlation matrices, but the volume of data required for robustness can be massive. Remember that, although many people use 'significant' and 'reliable' as synonyms, they are not the same thing. It is one thing to know that a variable is not independent of another variable, but another thing entirely to pin down the nature of that relationship in the population from your sample. It is quite possible to run a study twice and find a predictor significant both times, yet with parameter estimates different enough to be theoretically meaningful.
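The instability point is easy to demonstrate by simulation. The sketch below (entirely synthetic; the effect sizes, sample size, and collinearity level are illustrative assumptions) repeatedly draws small samples with weak true effects and correlated predictors, and counts how often an estimated coefficient comes out with the wrong sign:

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([0.15, -0.1, 0.05])  # weak true effects

def fit_once(n):
    # two highly correlated predictors plus one independent one
    z = rng.normal(size=(n, 1))
    X = np.hstack([z + 0.2 * rng.normal(size=(n, 1)),
                   z + 0.2 * rng.normal(size=(n, 1)),
                   rng.normal(size=(n, 1))])
    y = X @ beta_true + rng.normal(size=n)
    coef, *_ = np.linalg.lstsq(np.hstack([np.ones((n, 1)), X]), y, rcond=None)
    return coef[1:]  # drop the intercept

estimates = np.array([fit_once(50) for _ in range(500)])

# fraction of replications in which each coefficient's sign flips
sign_flips = np.mean(np.sign(estimates) != np.sign(beta_true), axis=0)
```

With $n = 50$ and this much collinearity, a noticeable fraction of replications flip the sign of at least one coefficient, even though every sample comes from the same population.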

Furthermore, unless you are doing structural equation modeling, you cannot easily incorporate your theoretical knowledge about the variables; techniques like MANOVA tend to be purely empirical.

Another approach is to use what you know about the issue at hand. For example, if you have several different measures of the same construct (which you could check with a factor analysis), you can combine them, e.g., by converting them to z-scores and averaging. Knowledge of other sources of correlation (e.g., a common cause or mediation) can also be exploited. Some people are uncomfortable putting so much weight on domain knowledge, and I acknowledge that this is a philosophical issue, but I think it can be a mistake to require the analyses to do all of the work and to assume that whatever they produce is the best answer.
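The z-score-and-average idea looks like this in practice. A minimal sketch, assuming three hypothetical measures of one construct on different scales (the latent variable and loadings here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# three hypothetical measures of the same construct, on different scales
latent = rng.normal(size=n)
measures = np.column_stack([
    10 + 2 * latent + rng.normal(scale=1.0, size=n),
    50 + 5 * latent + rng.normal(scale=2.0, size=n),
    -3 + 1 * latent + rng.normal(scale=0.5, size=n),
])

# z-score each measure so the scales are comparable, then average
z = (measures - measures.mean(axis=0)) / measures.std(axis=0, ddof=1)
composite = z.mean(axis=1)
```

The composite is a single DV you can analyze with an ordinary univariate model, and it typically tracks the underlying construct better than any one noisy measure alone.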

As for a reference, any good multivariate textbook should discuss these issues; Tabachnick and Fidell's *Using Multivariate Statistics* is well regarded as an accessible, applied treatment of the topic.