The general rule of thumb (based on Frank Harrell's book, Regression Modeling Strategies) is that if you expect to be able to detect reasonable-size effects with reasonable power, you need 10-20 observations per parameter (covariate) estimated. Harrell discusses a lot of options for "dimension reduction" (getting your number of covariates down to a more reasonable size), such as PCA, but the most important thing is that, in order to have any confidence in the results, dimension reduction must be done without looking at the response variable. Doing the regression again with just the significant variables, as you suggest above, is in almost every case a bad idea.
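As a minimal sketch of what that looks like (made-up data and names), a PCA-based reduction uses only the covariates, so the response never influences which components you keep:

set.seed(1)
X <- as.data.frame(matrix(rnorm(100 * 20), nrow = 100))  # 20 covariates, 100 observations
y <- rnorm(100)                                          # response
pc <- prcomp(X, scale. = TRUE)        # PCA on the covariates only; y is never used here
scores <- as.data.frame(pc$x[, 1:3])  # keep the first few components
fit <- lm(y ~ ., data = scores)       # only now does the response enter the analysis
summary(fit)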
However, since you're stuck with a data set and a set of covariates you're interested in, I don't think that running the multiple regression this way is inherently wrong. I think the best thing would be to accept the results as they are, from the full model (don't forget to look at the point estimates and confidence intervals to see whether the significant effects are estimated to be "large" in some real-world sense, and whether the non-significant effects are actually estimated to be smaller than the significant effects or not).
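For instance, assuming your fitted full model is stored in an object called fit (a placeholder name), the estimates and intervals are easy to pull out:

coef(fit)     # point estimates for every coefficient
confint(fit)  # 95% confidence intervals, for judging whether effects are "large" in real-world terms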
As to whether it makes any sense to do an analysis without the predictor that your field considers important: I don't know. It depends on what kind of inferences you want to make from the model. In the narrow sense, the regression model is still well-defined ("what are the marginal effects of these predictors on this response?"), but someone in your field might quite rightly say that the analysis just doesn't make sense. It would help a little if you knew that your predictors are uncorrelated with the well-known predictor (whatever it is), or that the well-known predictor is constant or nearly constant in your data: then at least you could say that something other than the well-known predictor has an effect on the response.
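If you happen to have the well-known predictor measured anyway, both conditions are quick to check (dat, known, x1, x2, x3 below are placeholders for your data frame and variables):

cor(dat$known, dat[, c("x1", "x2", "x3")])  # how correlated is it with your predictors?
sd(dat$known)                               # is it (nearly) constant in your data?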
@juod provides a great explanation of the interpretation of the regression coefficients. I want to add that for models with categorical predictors with more than two levels, you may find an ANOVA table more informative than typical regression output.
The ANOVA-style output will give you an F test for each effect, whereas the regression output gives you tests for each regression coefficient; a categorical variable with $k$ levels will have $k-1$ coefficients (from $k-1$ dummy codes), so a single variable will be represented across multiple lines of output. Any interactions involving those categorical variables will also be spread across multiple lines of output. This can make it difficult to tell at a glance whether, for example, there is a significant interaction between GENDER and YEAR. For factors with two levels, the F-test in the ANOVA output will be equivalent to the t-test in the standard regression output ($F=t^2$).
To get ANOVA-style output, you can use aov in base R, or Anova in the car package --- I recommend the latter. Note that aov will give you Type 1 sums of squares, which may not make sense unless you have a balanced design. Anova lets you select the type of sums of squares you want to calculate. See this previous answer for relevant discussion.
In addition, you'll note in the help documentation for aov and Anova that they recommend you use orthogonal contrast codes for your categorical predictors. By default, R uses traditional dummy coding, which sets the first level of a factor as the reference group and then tests each other level against that --- those are not orthogonal comparisons. If you want to use an ANOVA output summary, first make sure you're using orthogonal contrasts when you estimate the model:
(The dataset you provided is actually too small to test the model you use, so I'm creating a new dataset here with more cases)
set.seed(24601)
SCORE <- sample(15:25, 30, replace = TRUE)                        # simulated response
GENDER <- gl(n = 2, k = 1, length = 30, labels = c("m", "f"))     # two-level factor
YEAR <- gl(n = 3, k = 1, length = 30, labels = c("1", "2", "3"))  # three-level factor
result <- lm(SCORE ~ GENDER * YEAR,
             contrasts = list(GENDER = contr.helmert, YEAR = contr.helmert))  # orthogonal (Helmert) contrasts
library(car)
Anova(result, type = 2) # type 2 sums of squares (in this case it's a balanced design, so the type of SS won't make a difference)
Here's the output:
Anova Table (Type II tests)
Response: SCORE
Sum Sq Df F value Pr(>F)
GENDER 8.533 1 0.9143 0.3485
YEAR 0.067 2 0.0036 0.9964
GENDER:YEAR 37.267 2 1.9964 0.1577
Residuals 224.000 24
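As a check on the $F=t^2$ point above for the two-level GENDER factor (using the objects from the code above), you can compare the regression t statistic with the ANOVA F:

summary(result)$coefficients  # the t value in the GENDER1 row, squared, reproduces the GENDER F above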
Best Answer
Although manipulating variables to maximize correlation has an intuitive appeal, it is not usually a good approach, for several reasons:
In multiple regression, the individual (bivariate) correlations between the independent variables and the dependent variable can be low, yet the least-correlated IVs can be the most important predictors of the DV. See this thread for an example and this one for a theoretical discussion. This suggests that looking at the individual correlations can be useless or misleading.
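Here is a small simulated illustration of that point (all names and numbers made up): x2 has essentially zero bivariate correlation with y, yet it is an important predictor once x1 is in the model, because it soaks up the noise in x1.

set.seed(101)
signal <- rnorm(1000)
noise  <- rnorm(1000)
x1 <- signal + noise        # a noisy measure of the signal
x2 <- noise                 # measures only the noise, so it is unrelated to y on its own
y  <- signal + rnorm(1000)
cor(x2, y)                  # essentially zero
summary(lm(y ~ x1 + x2))    # but x2 gets a large, highly significant (negative) coefficient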
You can (accidentally or deliberately) make the correlation arbitrarily close to $\pm 1$ by means of a transformation of the dependent values that creates a single extreme outlier. In the following example $A$ (blue) and $B$ (red) are normally distributed--but always positive--and $C=A+B$ plus normally distributed error, except for a single outlying value at $(A,B,C)=(1, 1/16, 20)$. Despite this strong relationship between $C$ and the untransformed values of $A$ and $B$, the correlation of $A$ and $C$ is -0.2 (notice the sign is wrong!), the correlation of $B$ and $C$ is -0.25 (wrong sign again), and the correlation of $(A-B)/(A+B)$ and $C$ is +0.45 (much stronger than either of the other correlations).
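The exact data behind those numbers aren't reproduced here, but the mechanism is easy to demonstrate in generic form (made-up data): two unrelated variables acquire a strong correlation as soon as one extreme point is added.

set.seed(7)
x <- rnorm(50)
y <- rnorm(50)
cor(x, y)                # near zero: the variables are unrelated
cor(c(x, 20), c(y, 20))  # one extreme point pushes the correlation close to +1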
Typically, one re-expresses independent variables in order to establish a more linear relationship with the dependent variable. You can test that visually if you like, making sure to discount high-leverage outlying values that might appear with some re-expressions.
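A minimal sketch of that kind of visual check (made-up data; the log re-expression is just an example): plot the response against the raw and re-expressed predictor and see which relationship looks more linear.

set.seed(3)
x <- rexp(100, rate = 0.5)          # a skewed predictor
y <- log(x) + rnorm(100, sd = 0.3)  # relationship is linear in log(x), not in x
par(mfrow = c(1, 2))
plot(x, y, main = "raw x"); lines(lowess(x, y))
plot(log(x), y, main = "log(x)"); lines(lowess(log(x), y))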