I'm comparing the effect of adding a variable to my linear model in R, using an ANOVA F-test:
anova(fit1,fit2)
What I noticed is that this F-test produces different results depending on whether the added variable is categorical (wrapped in factor()) or continuous.
So I do the anova comparisons:
> anova(fit3,fit4)
Analysis of Variance Table
Model 1: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink),
7) + factor(dta$sukup)
Model 2: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink),
7) + factor(dta$sukup) + factor(dta$HISEI)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 985 3067966
2 929 2857404 56 210562 1.2225 0.1314
Since Pr(>F) > 0.05, fit3 and fit4 are not significantly different.
However, with the continuous model:
> anova(fit3,fit5)
Analysis of Variance Table
Model 1: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink),
7) + factor(dta$sukup)
Model 2: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink),
7) + factor(dta$sukup) + dta$HISEI
Res.Df RSS Df Sum of Sq F Pr(>F)
1 985 3067966
2 984 3049605 1 18361 5.9244 0.01511 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You see that fit5 is the same as fit4, just with the categorical variable HISEI treated as continuous. So switching from factor to continuous form is enough to make the F-test give Pr(>F) < 0.05, which means that fit5 is significantly different from fit3.
Why does the F-test produce different results for factor() variables and continuous variables?
Best Answer
Let's consider a simple example: a single independent variable (IV) that might sensibly be treated either as a factor or as a continuous variable (depending on what you think about the suitability of the possible models, and on what you want to find out), and a single dependent variable (DV, the response).
In this example, the response is expenditure on medications and the IV is age group, (20,30], (30,40], ..., (70,80], coded as 1, 2, 3, 4, 5, 6. We have six people in each age category.
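The original data table isn't reproduced here, but data of this shape are easy to simulate. The numbers below are synthetic and chosen only to illustrate the structure (six people per group, expenditure rising roughly linearly with age group):

```r
# Synthetic illustration: 6 people in each of 6 age groups, coded 1..6.
set.seed(1)
agegroup <- rep(1:6, each = 6)
expenditure <- 100 + 25 * agegroup + rnorm(length(agegroup), sd = 20)
medexp <- data.frame(agegroup, expenditure)
str(medexp)  # 36 rows: 6 groups x 6 people each
```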
When you fit it as continuous you have a single IV for that variable, the age group as a numeric quantity (i.e. it assumes the rate of change per decade is roughly constant), while when you fit it as a factor, the model fits a different mean for each age group (and so can pick up a change of any size in any direction).
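Concretely, with simulated data of that shape, the two fits differ only in whether the IV is wrapped in factor(); the variable names here are illustrative:

```r
set.seed(1)
agegroup <- rep(1:6, each = 6)
expenditure <- 100 + 25 * agegroup + rnorm(length(agegroup), sd = 20)

fit_cont <- lm(expenditure ~ agegroup)          # intercept + slope: 2 parameters
fit_fact <- lm(expenditure ~ factor(agegroup))  # one mean per group: 6 parameters

length(coef(fit_cont))  # 2
length(coef(fit_fact))  # 6
```

The factor version spends four extra degrees of freedom to allow the six group means to fall anywhere.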
The second model will always fit more closely, but it always uses many more parameters to do so. Which one makes sense depends on the circumstances.
If the linear model is close to suitable, the saving in df costs very little in model sums of squares, so the model MSE will be high relative to the error MSE. If the linear model is not a good fit, the model MSE will be relatively low and the error MSE will be inflated.
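Because the linear fit is a special case of the factor fit, the two models are nested and anova() can test directly whether the factor model's extra df buy a significant reduction in RSS. With a synthetic, nearly linear underlying trend (as sketched above), they typically do not:

```r
set.seed(1)
agegroup <- rep(1:6, each = 6)
expenditure <- 100 + 25 * agegroup + rnorm(length(agegroup), sd = 20)

fit_cont <- lm(expenditure ~ agegroup)
fit_fact <- lm(expenditure ~ factor(agegroup))

# The factor model's RSS is never larger than the linear model's,
# but it spends 4 extra df to achieve the reduction.
cmp <- anova(fit_cont, fit_fact)
cmp
```

A small F (large Pr(>F)) here says the extra flexibility of the factor coding was not worth the df it consumed; that is exactly the trade-off driving the different p-values in the question.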