My first reaction to Jelle's comments is "bias-schmias". You have to be careful about what you mean by a "large number of predictors". This could be "large" with respect to:
- The number of data points ("big p small n")
- The amount of time you have to investigate the variables
- The computational cost of inverting a giant matrix
My reaction was based on "large" with respect to point 1, because in that case the trade-off of bias for a reduction in variance is usually worth it. Bias only matters "in the long run", so if you have a small sample, who cares about "the long run"?
Having said all that, $R^2$ is probably not a particularly good quantity to calculate, especially when you have lots of variables (because that's pretty much all $R^2$ tells you: that you have lots of variables). I would calculate something more like a "prediction error" using cross-validation.
Ideally this "prediction error" should be based on the context of your modeling situation. You basically want to answer the question "How well does my model reproduce the data?". The context of your situation should be able to tell you what "how well" means in the real world. You then need to translate this into some sort of mathematical equation.
However, the question gives me no obvious context to go on. So a "default" would be something like PRESS:
$$PRESS=\sum_{i=1}^{N} (Y_{i}-\hat{Y}_{i,-i})^2$$
Where $\hat{Y}_{i,-i}$ is the predicted value of $Y_{i}$ from a model fitted without the $i$th data point (so $Y_i$ doesn't influence the model parameters). The terms in the summation are also known as "deletion residuals". If $N$ model fits are too computationally expensive (although most programs give you something like this in their standard output), then I would suggest grouping the data. Set the amount of time you are prepared to wait, $T$ (preferably not 0 ^_^), and divide it by the time it takes to fit your model once, $M$. This gives a total of $G=\frac{T}{M}$ re-fits, each leaving out a group of size $N_{g}=\frac{N}{G}=\frac{N\times M}{T}$.
$$PRESS=\sum_{g=1}^{G}\sum_{i=1}^{N_{g}} (Y_{ig}-\hat{Y}_{ig,-g})^2$$
Where $\hat{Y}_{ig,-g}$ is the predicted value for the $i$th observation in group $g$, from a model fitted without group $g$.
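For concreteness, here is a minimal sketch in R; the data frame `dat` and response `y` are placeholder names, not from the question:

```r
fit <- lm(y ~ ., data = dat)

# For lm(), the deletion residuals come free via the hat values,
# so PRESS needs no re-fitting at all: e_i / (1 - h_ii)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# The grouped version, re-fitting the model G times
G   <- 5                                          # in the scheme above, G = T / M
grp <- sample(rep(1:G, length.out = nrow(dat)))   # random group labels
press_grouped <- sum(sapply(1:G, function(g) {
  fit_g <- lm(y ~ ., data = dat[grp != g, ])      # fit without group g
  pred  <- predict(fit_g, newdata = dat[grp == g, ])
  sum((dat$y[grp == g] - pred)^2)                 # squared deletion residuals
}))
```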
One way to get an idea of how important each variable is is to re-fit an ordinary (unconstrained) regression with the variables in the same order, and then check proportionately how much each estimate has been shrunk towards zero, $\frac{\beta_{LASSO}}{\beta_{UNCONSTRAINED}}$. The lasso and other constrained regressions can be seen as "smooth variable selection": rather than adopting a binary "in-or-out" approach, each estimate is brought closer to zero depending on how important it is for the model (as measured by the errors).
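A sketch of that check, using the glmnet package for the lasso fit; the predictor matrix `X` and response `y` are assumed here for illustration:

```r
library(glmnet)

cv <- cv.glmnet(X, y, alpha = 1)                       # alpha = 1 gives the lasso
b_lasso <- as.vector(coef(cv, s = "lambda.min"))[-1]   # drop the intercept
b_ols   <- coef(lm(y ~ X))[-1]                         # unconstrained estimates, same order

# Near 1: barely shrunk (important); near 0: heavily shrunk
shrinkage <- b_lasso / b_ols
```

Note the ratio can misbehave when an unconstrained estimate is itself near zero, so it is a rough diagnostic rather than a formal measure.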
I'm not sure that multicollinearity is what's going on here. It certainly could be, but from the information given I can't conclude that, and I don't want to start there. My first guess is that this might be a multiple comparisons issue. That is, if you run enough tests, something will show up, even if there's nothing there.
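A quick simulation (hypothetical numbers, purely to illustrate the point) shows how easily something "shows up" under pure noise:

```r
set.seed(1)
# 1000 experiments, each running 20 t-tests on pure noise (true nulls)
any_hit <- replicate(1000, {
  p <- sapply(1:20, function(i) t.test(rnorm(30))$p.value)
  any(p < .05)
})
mean(any_hit)   # close to 1 - .95^20, i.e. roughly 0.64
```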
One of the issues I harp on is that the problem of multiple comparisons is always discussed in terms of examining many pairwise comparisons, e.g., running t-tests on every unique pairing of levels. (For a humorous treatment of multiple comparisons, look here.) This leaves people with the impression that this is the only place the problem shows up. But that is simply not true: the problem of multiple comparisons shows up everywhere. For instance, if you run a regression with 4 explanatory variables, the same issues exist. In a well-designed experiment, IVs can be orthogonal, but people routinely worry about using Bonferroni corrections on sets of a priori, orthogonal contrasts, and don't think twice about factorial ANOVAs. To my mind this is inconsistent.
The global $F$ test is what's called a 'simultaneous' test: it tests the null hypothesis that all of your predictors are unrelated to the response variable. The simultaneous test provides some protection against the problem of multiple comparisons without having to go the power-losing Bonferroni route. Unfortunately, my interpretation of what you report is that you have a null finding.
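In R, this is the $F$ statistic reported at the bottom of the model summary (the model and variable names below are placeholders):

```r
fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
summary(fit)  # the final line's F-statistic and p-value test
              # H0: all four slopes are simultaneously zero
```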
Several things militate against this interpretation, though. First, with only 43 data points, you almost certainly don't have much power. It's quite possible that there is a real effect, but you just can't resolve it without more data. Second, like both @andrea and @Dimitriy, I worry about the appropriateness of treating 4-level categorical variables as numeric. This may well not be appropriate and could have any number of effects, including diminishing your ability to detect what is really there. Lastly, I'm not sure that significance testing is quite as important as people believe. A $p$ of $.11$ is kind of low; is there really something going on there? Maybe! Who knows? There's no 'bright line' at $.05$ that demarcates real effects from mere appearance.
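One way to probe the second concern, sketched with a hypothetical 4-level predictor `level`:

```r
fit_num <- lm(y ~ as.numeric(level), data = dat)  # forces equal spacing between levels
fit_fac <- lm(y ~ factor(level), data = dat)      # lets each level have its own mean
anova(fit_num, fit_fac)  # tests whether the linear coding is losing real structure
```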
I decided to calculate a weighted mean coefficient and standard error for the null model, using the `diagis` package in R (this seems reasonable given that the coefficients are roughly normally distributed). I then compared the two coefficients and their standard errors using the following $Z$ test (more details in this other CV post):
$$Z = \frac{\beta_1-\beta_2}{\sqrt{(SE_{\beta_1})^2+(SE_{\beta_2})^2}}$$
I then calculated the p-value using the `pnorm` function in R.
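Putting that together, a minimal sketch; the coefficient draws `b`, their weights `w`, and the comparison coefficient `b2` with standard error `se2` are placeholder names, and I'm assuming diagis's `weighted_mean`/`weighted_se` helpers:

```r
library(diagis)

# weighted mean and standard error of the null-model coefficients
b1  <- weighted_mean(b, w)
se1 <- weighted_se(b, w)

# Z test comparing the two coefficients
z <- (b1 - b2) / sqrt(se1^2 + se2^2)
p <- 2 * pnorm(-abs(z))   # two-sided p-value
```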