Solved – Why does the anova F-test give different results for a categorical variable added as a factor vs. as continuous

Tags: anova, categorical-data, f-test, p-value, r

I'm testing the addition of a variable to my linear model in R with an anova F-test:

anova(fit1,fit2)

What I noticed is that this F-test produces different results depending on whether the added categorical variable is entered in factor() form or in continuous form.
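For reference, the fits being compared look something like this (a sketch: the actual fitting calls aren't shown in the question, so the lm()/update() lines below are reconstructed from the model formulas that anova() prints):

# Sketch reconstructed from the printed formulas, not the original code
fit3 <- lm(dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) +
             relevel(factor(dta$aidink), 7) + factor(dta$sukup))
fit4 <- update(fit3, . ~ . + factor(dta$HISEI))  # HISEI as a factor: one dummy per extra level
fit5 <- update(fit3, . ~ . + dta$HISEI)          # HISEI as numeric: a single slope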

So I do the anova comparisons:

> anova(fit3,fit4)
Analysis of Variance Table

Model 1: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup)
Model 2: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup) + factor(dta$HISEI)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    985 3067966                           
2    929 2857404 56    210562 1.2225 0.1314

Since Pr(>F) > 0.05, fit3 and fit4 are not significantly different.

However, with the continuous model:

> anova(fit3,fit5)
Analysis of Variance Table

Model 1: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup)
Model 2: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup) + dta$HISEI
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1    985 3067966                              
2    984 3049605  1     18361 5.9244 0.01511 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can see that fit5 is the same as fit4, just with the categorical variable HISEI entered as continuous instead of as a factor. So switching from factor to continuous is enough to make the F-test give Pr(>F) < 0.05, which means that fit5 is significantly different from fit3.

Why does it produce differing results for factor() variables and continuous variables?

Best Answer

Let's consider a simple example: a single independent variable (IV) that might sensibly be treated either as a factor or as a continuous variable (depending on what you think about the suitability of possible models, and on what you want to find out), and a single dependent variable (DV, response).

In this example, the response is expenditure on medications and the IV is age group (20,30], (30,40], ..., (70,80], coded as 1, 2, 3, 4, 5, 6. We have six people in each age category:

Age     Medication cost
Grp     relative to index
 1   95  103  112  114  110  108
 2  113  126  119  121  144  121
 3  118  127  113  127  124  128
 4  131  111  134  120  140  134
 5  132  160  146  144  159  154
 6  157  176  176  165  170  168

When you fit age group as continuous you have a single regression coefficient for that variable, treating age group as a numeric quantity (i.e. it assumes the rate of change per decade is roughly constant), while when you fit it as a factor the model estimates a separate mean for each age group (and so can pick up change in any direction):

[Figure: linear regression fit vs. separate-means (factor) fit for the age-group data]
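In R, the two fits for this toy data set might be set up as follows (a sketch: the variable names age_grp and cost are mine, with values taken from the table above):

# Toy data from the table: six people in each of six age groups
age_grp <- rep(1:6, each = 6)
cost <- c( 95, 103, 112, 114, 110, 108,
          113, 126, 119, 121, 144, 121,
          118, 127, 113, 127, 124, 128,
          131, 111, 134, 120, 140, 134,
          132, 160, 146, 144, 159, 154,
          157, 176, 176, 165, 170, 168)

fit_linear <- lm(cost ~ age_grp)          # one slope: 1 df for age group
fit_factor <- lm(cost ~ factor(age_grp))  # one mean per group: 5 df for age group
anova(fit_linear, fit_factor)             # is the drop in RSS worth 4 extra df?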

The second model will always fit at least as closely, but it uses many more parameters to do so. Which approach makes sense depends on the circumstances.

If the linear model is close to suitable, the saving in degrees of freedom does not cost much in model sum of squares, so the model mean square will be high relative to the error mean square. If the linear model is not a good fit, the model mean square will be relatively low and the error mean square will be inflated by the lack of fit.
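As a quick check on the question's own numbers, the partial F statistic for a nested comparison is (ΔRSS/Δdf) / (RSS_full/df_full), and both printed results can be reproduced from the tables above. Spreading HISEI's sum of squares over 56 df washes the effect out; concentrating it in 1 df does not:

# Reproduce the question's F statistics from the printed anova tables
F_factor <- (210562 / 56) / (2857404 / 929)   # HISEI as factor: 1.2225 on 56 extra df
F_cont   <- (18361  /  1) / (3049605 / 984)   # HISEI as continuous: 5.9244 on 1 extra df

pf(F_factor, 56, 929, lower.tail = FALSE)     # ~0.13, matches Pr(>F) above
pf(F_cont,    1, 984, lower.tail = FALSE)     # ~0.015, matches Pr(>F) above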