Solved – Significance contradiction in linear regression: significant t-test for a coefficient vs non-significant overall F-statistic

hypothesis-testing, multiple-regression, multiple-comparisons, regression, t-test

I'm fitting a multiple linear regression model with 4 categorical predictor variables (4 levels each) and a numerical response. My dataset has 43 observations.

Regression gives me the following $p$-values from the $t$-tests on each slope coefficient: $.15, .67, .27, .02$. Thus, the coefficient for the 4th predictor is significant at the $\alpha = .05$ significance level.

On the other hand, the regression gives me a $p$-value from an overall $F$-test of the null hypothesis that all my slope coefficients are equal to zero. For my dataset, this $p$-value is $.11$.
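
For concreteness, here is a minimal Python sketch (using statsmodels, with synthetic data and hypothetical column names x1–x4, not my actual dataset) of where both kinds of $p$-value come from in a single fitted model:

```python
# Minimal sketch with synthetic data: one model fit yields both the per-slope
# t-test p-values and the overall F-test p-value. Column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 43  # same sample size as in the question
df = pd.DataFrame({f"x{i}": rng.integers(1, 5, size=n) for i in range(1, 5)})
df["y"] = 0.5 * df["x4"] + rng.normal(size=n)  # only x4 has a real effect here

fit = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()
print(fit.pvalues.drop("Intercept"))  # t-test p-value for each slope
print(fit.f_pvalue)                   # p-value of the F-test that all slopes are zero
```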

My question: how should I interpret these results? Which $p$-value should I use and why? Is the coefficient for the 4th variable significantly different from $0$ at the $\alpha = .05$ significance level?

I've seen a related question, $F$ and $t$ statistics in a regression, but the situation there was the opposite: high $t$-test $p$-values and a low $F$-test $p$-value. Honestly, I don't quite understand why we would need an $F$-test in addition to $t$-tests to see whether regression coefficients are significantly different from zero.

Best Answer

I'm not sure that multicollinearity is what's going on here. It certainly could be, but from the information given I can't conclude that, and I don't want to start there. My first guess is that this might be a multiple comparisons issue. That is, if you run enough tests, something will show up, even if there's nothing there.

One of the issues I harp on is that the problem of multiple comparisons is always discussed in terms of examining many pairwise comparisons—e.g., running t-tests on every unique pairing of levels. (For a humorous treatment of multiple comparisons, look here.) This leaves people with the impression that pairwise testing is the only place the problem shows up. But that is simply not true: the problem of multiple comparisons shows up everywhere. For instance, if you run a regression with 4 explanatory variables, the same issue exists. In a well-designed experiment the IVs can be orthogonal, yet people routinely worry about using Bonferroni corrections on sets of a priori, orthogonal contrasts while not thinking twice about factorial ANOVAs. To my mind, this is inconsistent.
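
To make the "run enough tests and something will show up" point concrete, here is a small simulation sketch (Python with statsmodels; the null setup, seed, and rejection count are my own assumptions for illustration, not anything from the original post):

```python
# Simulation under the global null (all true slopes zero), n = 43 as in the
# question: with four separate t-tests, the chance that at least one comes out
# "significant" at .05 is roughly 1 - 0.95**4 ~ 0.185, not 0.05.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, reps, hits = 43, 5000, 0
for _ in range(reps):
    X = sm.add_constant(rng.normal(size=(n, 4)))  # 4 independent predictors
    y = rng.normal(size=n)                        # response is pure noise
    res = sm.OLS(y, X).fit()
    hits += (res.pvalues[1:] < 0.05).any()        # any slope t-test below .05?
print(hits / reps)
```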

The global F test is what's called a 'simultaneous' test. It tests the null hypothesis that all of your predictors are jointly unrelated to the response variable. The simultaneous test provides some protection against the problem of multiple comparisons without having to go the power-losing Bonferroni route. Unfortunately, my interpretation of what you report is that you have a null finding.
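
As a companion to the simulation above (same assumed null setup), one can check that the single simultaneous F-test holds its nominal error rate where the four separate t-tests did not:

```python
# Under the same global null, the overall F-test, being one simultaneous test,
# rejects at close to its nominal 5% rate rather than ~18.5%.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, reps, f_hits = 43, 5000, 0
for _ in range(reps):
    X = sm.add_constant(rng.normal(size=(n, 4)))
    y = rng.normal(size=n)
    f_hits += sm.OLS(y, X).fit().f_pvalue < 0.05
print(f_hits / reps)  # should land near 0.05
```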

Several things militate against this interpretation. First, with only 43 observations, you almost certainly don't have much power. It's quite possible that there is a real effect, but you just can't resolve it without more data. Second, like both @andrea and @Dimitriy, I worry about the appropriateness of treating 4-level categorical variables as numeric. This may well not be appropriate, and could have any number of effects, including diminishing your ability to detect what is really there (a dummy-coded alternative is sketched below). Lastly, I'm not sure that significance testing is quite as important as people believe. A $p$ of $.11$ is fairly low; is there really something going on there? Maybe! Who knows? There's no 'bright line' at $.05$ that demarcates real effects from mere appearance.
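
On the second point, here is a minimal sketch of what treating the predictors as genuinely categorical might look like (statsmodels formula interface; the data and column names are synthetic stand-ins, not the asker's):

```python
# Sketch of the dummy-coding alternative: wrapping each 4-level predictor in
# C() treats it as categorical, so the model estimates a separate effect per
# level instead of assuming a linear 1-2-3-4 trend across the level codes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 43
df = pd.DataFrame({f"x{i}": rng.integers(1, 5, size=n) for i in range(1, 5)})
df["y"] = rng.normal(size=n)

fit = smf.ols("y ~ C(x1) + C(x2) + C(x3) + C(x4)", data=df).fit()
print(fit.f_pvalue)  # overall F-test now spans 3 dummies per predictor
```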