Multiple Regression – Shared Variance Between Independent Variables in a Linear Multiple Regression Equation

Tags: multiple-regression, sums-of-squares

In a linear multiple regression equation, if the beta weights reflect the contribution of each individual independent variable over and above the contribution of all the other IVs, where in the regression equation is the variance shared by all the IVs that predicts the DV?

For example, if the Venn diagram displayed below (and taken from CV's 'about' page here: https://stats.stackexchange.com/about) were relabeled to be 3 IVs and 1 DV, where would the area with the asterisk enter into the multiple regression equation?

[Venn diagram of overlapping circles, with an asterisk marking the region shared by all of them]

Best Answer

To understand what that diagram could mean, we have to define some things. Let's say that Venn diagram displays the overlapping (or shared) variance amongst 4 different variables, and that we want to predict the level of $Wiki$ by recourse to our knowledge of $Digg$, $Forum$, and $Blog$. That is, we want to be able to reduce the uncertainty (i.e., variance) in $Wiki$ from the null variance down to the residual variance. How well can that be done? That is the question that a Venn diagram is answering for you.

Each circle represents a set of points, and thereby, an amount of variance. For the most part, we are interested in the variance in $Wiki$, but the figure also displays the variances in the predictors. There are a few things to notice about our figure. First, each variable has the same amount of variance--they are all the same size (although not everyone will use Venn diagrams quite so literally). Also, there is the same amount of overlap, etc., etc.

A more important thing to notice is that there is a good deal of overlap amongst the predictor variables. This means that they are correlated. This situation is very common when dealing with secondary (i.e., archival) data, observational research, or real-world prediction scenarios. On the other hand, if this were a designed experiment, it would probably imply poor design or execution.

To continue with this example for a little bit longer, we can see that our predictive ability will be moderate; most of the variability in $Wiki$ remains as residual variability after all the variables have been used (eyeballing the diagram, I would guess $R^2\approx.35$). Another thing to note is that, once $Digg$ and $Blog$ have been entered into the model, $Forum$ accounts for none of the variability in $Wiki$.
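You can see this sort of situation with a quick simulation. The data below are entirely made up (the variable names just follow the example), but they are constructed so that the predictors overlap heavily and $Forum$ carries no information about $Wiki$ beyond what $Digg$ and $Blog$ already provide--so its incremental $R^2$ is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A shared latent factor makes the predictors correlated (fabricated data).
shared = rng.normal(size=n)
digg = shared + rng.normal(size=n)
blog = shared + rng.normal(size=n)

# Forum is built from Digg and Blog plus noise that is unrelated to Wiki,
# so it has nothing unique to contribute to predicting Wiki.
forum = 0.5 * (digg + blog) + rng.normal(size=n)

# The response depends on the shared factor, plus plenty of residual noise.
wiki = shared + 2.0 * rng.normal(size=n)

def r_squared(y, *predictors):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_db = r_squared(wiki, digg, blog)
r2_dbf = r_squared(wiki, digg, blog, forum)
print(f"R^2 with Digg + Blog: {r2_db:.3f}")
print(f"R^2 adding Forum:     {r2_dbf:.3f}")
print(f"Forum's increment:    {r2_dbf - r2_db:.4f}")
```

Forum is correlated with Wiki on its own (through its overlap with Digg and Blog), yet entering it last adds essentially nothing--exactly the situation depicted in the diagram.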

Now, after having fit a model with multiple predictors, people often want to test those predictors to see if they are related to the response variable (although it's not clear this is as important as people seem to believe it is). Our problem is that to test these predictors, we must partition the Sum of Squares, and since our predictors are correlated, there are SS that could be attributed to more than one predictor. In fact, in the asterisked region, the SS could be attributed to any of the three predictors. This means that there is no unique partition of the SS, and thus no unique test. How this issue is handled depends on the type of SS that the researcher uses and on other judgments made by the researcher. Since many software applications return type III SS by default, many people throw away the information contained in the overlapping regions without realizing they have made a judgment call. I explain these issues and the different types of SS in more detail here.
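The order-dependence is easy to demonstrate. In the sketch below (simulated data, two correlated predictors), the sequential--type I--SS for each predictor is computed under both entry orders; whichever predictor enters first is credited with the shared SS, so each predictor's SS shrinks when it enters second:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Two positively correlated predictors and a response (fabricated data).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)

def model_ss(y, *predictors):
    """Model (regression) sum of squares for an OLS fit with an intercept."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((fitted - y.mean()) ** 2))

# Type I (sequential) SS: each term's SS is its increment over the terms
# already in the model, so it depends on the order of entry.
ss_x1_first = model_ss(y, x1)
ss_x2_second = model_ss(y, x1, x2) - ss_x1_first
ss_x2_first = model_ss(y, x2)
ss_x1_second = model_ss(y, x1, x2) - ss_x2_first

print("x1 entered first :", round(ss_x1_first, 1))
print("x1 entered second:", round(ss_x1_second, 1))
print("x2 entered first :", round(ss_x2_first, 1))
print("x2 entered second:", round(ss_x2_second, 1))
```

The gap between a predictor's first-entry and second-entry SS is precisely the shared region of the diagram; type III SS gives every predictor only its second-entry (last-entry) value, which is how the overlap gets discarded.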

The question, as stated, specifically asks where all of this shows up in the betas / the regression equation. The answer is that it does not. Some information about that is contained in my answer here (although you'll have to read between the lines a little bit).