The general rule of thumb (based on Frank Harrell's book, Regression Modeling Strategies) is that if you expect to be able to detect reasonable-size effects with reasonable power, you need 10-20 observations per parameter (covariate) estimated. Harrell discusses a lot of options for "dimension reduction" (getting your number of covariates down to a more reasonable size), such as PCA, but the most important thing is that, in order to have any confidence in the results, dimension reduction must be done without looking at the response variable. Doing the regression again with just the significant variables, as you suggest above, is in almost every case a bad idea.
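As a minimal sketch of what that looks like (made-up data and names), a PCA-based reduction uses only the covariates, so the response never influences which components you keep:

set.seed(1)
X <- as.data.frame(matrix(rnorm(100 * 20), nrow = 100))  # 20 covariates, 100 observations
y <- rnorm(100)                                          # response
pc <- prcomp(X, scale. = TRUE)        # PCA on the covariates only; y is never used here
scores <- as.data.frame(pc$x[, 1:3])  # keep the first few components
fit <- lm(y ~ ., data = scores)       # only now does the response enter the analysis
summary(fit)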
However, since you're stuck with a data set and a set of covariates you're interested in, I don't think that running the multiple regression this way is inherently wrong. I think the best thing would be to accept the results as they are, from the full model (don't forget to look at the point estimates and confidence intervals to see whether the significant effects are estimated to be "large" in some real-world sense, and whether the non-significant effects are actually estimated to be smaller than the significant effects or not).
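For instance, assuming your fitted full model is stored in an object called fit (a placeholder name), the estimates and intervals are easy to pull out:

coef(fit)     # point estimates for every coefficient
confint(fit)  # 95% confidence intervals, for judging whether effects are "large" in real-world terms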
As to whether it makes any sense to do an analysis without the predictor that your field considers important: I don't know. It depends on what kind of inferences you want to make from the model. In the narrow sense, the regression model is still well-defined ("what are the marginal effects of these predictors on this response?"), but someone in your field might quite rightly say that the analysis just doesn't make sense. It would help a little if you knew that your predictors are uncorrelated with the well-known predictor (whatever it is), or that the well-known predictor is constant or nearly constant in your data: then at least you could say that something other than the well-known predictor has an effect on the response.
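If you happen to have the well-known predictor measured anyway, both conditions are quick to check (dat, known, x1, x2, x3 below are placeholders for your data frame and variables):

cor(dat$known, dat[, c("x1", "x2", "x3")])  # how correlated is it with your predictors?
sd(dat$known)                               # is it (nearly) constant in your data?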
@juod provides a great explanation of the interpretation of the regression coefficients. I want to add that for models with categorical predictors with more than two levels, you may find an ANOVA table more informative than typical regression output.
The ANOVA-style output will give you an F test for each effect, whereas the regression output gives you tests for each regression coefficient; a categorical variable with $k$ levels will have $k-1$ coefficients (from $k-1$ dummy codes), so a single variable will be represented across multiple lines of output. Any interactions involving those categorical variables will also be spread across multiple lines of output. This can make it difficult to tell at a glance whether, for example, there is a significant interaction between GENDER and YEAR. For factors with two levels, the F-test in the ANOVA output will be equivalent to the t-test in the standard regression output ($F=t^2$).
To get ANOVA-style output, you can use aov in base R, or Anova in the car package --- I recommend the latter. Note that aov will give you Type 1 sums of squares, which may not make sense unless you have a balanced design. Anova lets you select the type of sums of squares you want to calculate. See this previous answer for relevant discussion.
In addition, you'll note in the help documentation for aov and Anova that they recommend you use orthogonal contrast codes for your categorical predictors. By default, R uses traditional dummy coding, which sets the first level of a factor as the reference group and then tests each other level against that --- those are not orthogonal comparisons. If you want to use an ANOVA output summary, first make sure you're using orthogonal contrasts when you estimate the model:
(The dataset you provided is actually too small to test the model you use, so I'm creating a new dataset here with more cases)
set.seed(24601)
SCORE <- sample(15:25, 30, replace = TRUE)                        # simulated response
GENDER <- gl(n = 2, k = 1, length = 30, labels = c("m", "f"))     # two-level factor
YEAR <- gl(n = 3, k = 1, length = 30, labels = c("1", "2", "3"))  # three-level factor
result <- lm(SCORE ~ GENDER * YEAR,
             contrasts = list(GENDER = contr.helmert, YEAR = contr.helmert))  # orthogonal (Helmert) contrasts
library(car)
Anova(result, type = 2) # type 2 sums of squares (in this case it's a balanced design, so the type of SS won't make a difference)
Here's the output:
Anova Table (Type II tests)
Response: SCORE
Sum Sq Df F value Pr(>F)
GENDER 8.533 1 0.9143 0.3485
YEAR 0.067 2 0.0036 0.9964
GENDER:YEAR 37.267 2 1.9964 0.1577
Residuals 224.000 24
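As a check on the $F=t^2$ point above for the two-level GENDER factor (using the objects from the code above), you can compare the regression t statistic with the ANOVA F:

summary(result)$coefficients  # the t value in the GENDER1 row, squared, reproduces the GENDER F above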
Best Answer
Although manipulating variables to maximize correlation has an intuitive appeal, it is not usually a good approach, for several reasons:
In multiple regression, the individual (bivariate) correlations between the independent variables and the dependent variable can be low, yet the least-correlated IVs can be the most important predictors of the DV. See this thread for an example and this one for a theoretical discussion. This suggests that looking at the individual correlations can be useless or misleading.
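Here is a small simulated illustration of that point (all names and numbers made up): x2 has essentially zero bivariate correlation with y, yet it is an important predictor once x1 is in the model, because it soaks up the noise in x1.

set.seed(101)
signal <- rnorm(1000)
noise  <- rnorm(1000)
x1 <- signal + noise        # a noisy measure of the signal
x2 <- noise                 # measures only the noise, so it is unrelated to y on its own
y  <- signal + rnorm(1000)
cor(x2, y)                  # essentially zero
summary(lm(y ~ x1 + x2))    # but x2 gets a large, highly significant (negative) coefficient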
You can (accidentally or deliberately) make the correlation arbitrarily close to $\pm 1$ by means of a transformation of the dependent values that creates a single extreme outlier. In the following example $A$ (blue) and $B$ (red) are normally distributed--but always positive--and $C=A+B$ plus normally distributed error, except for a single outlying value at $(A,B,C)=(1, 1/16, 20)$. Despite this strong relationship between $C$ and the untransformed values of $A$ and $B$, the correlation of $A$ and $C$ is -0.2 (notice the sign is wrong!), the correlation of $B$ and $C$ is -0.25 (wrong sign again), and the correlation of $(A-B)/(A+B)$ and $C$ is +0.45 (much stronger than either of the other correlations).
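The exact data behind those numbers aren't reproduced here, but the mechanism is easy to demonstrate in generic form (made-up data): two unrelated variables acquire a strong correlation as soon as one extreme point is added.

set.seed(7)
x <- rnorm(50)
y <- rnorm(50)
cor(x, y)                # near zero: the variables are unrelated
cor(c(x, 20), c(y, 20))  # one extreme point pushes the correlation close to +1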
Typically, one re-expresses independent variables in order to establish a more linear relationship with the dependent variable. You can test that visually if you like, making sure to discount high-leverage outlying values that might appear with some re-expressions.
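A minimal sketch of that kind of visual check (made-up data; the log re-expression is just an example): plot the response against the raw and re-expressed predictor and see which relationship looks more linear.

set.seed(3)
x <- rexp(100, rate = 0.5)          # a skewed predictor
y <- log(x) + rnorm(100, sd = 0.3)  # relationship is linear in log(x), not in x
par(mfrow = c(1, 2))
plot(x, y, main = "raw x"); lines(lowess(x, y))
plot(log(x), y, main = "log(x)"); lines(lowess(log(x), y))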