Solved – Why does the anova F-test give different results for a categorical variable added as a factor vs. as continuous

Tags: anova, categorical-data, f-test, p-value, r

I'm testing the addition of a variable to my linear model in R with an anova F-test:

anova(fit1,fit2)

What I noticed is that this F-test produces different results depending on whether the added categorical variable is entered in factor() form or in continuous form.
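For reference, the fits being compared look something like this (a sketch: the actual fitting calls aren't shown in the question, so the lm()/update() lines below are reconstructed from the model formulas that anova() prints):

# Sketch reconstructed from the printed formulas, not the original code
fit3 <- lm(dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) +
             relevel(factor(dta$aidink), 7) + factor(dta$sukup))
fit4 <- update(fit3, . ~ . + factor(dta$HISEI))  # HISEI as a factor: one dummy per extra level
fit5 <- update(fit3, . ~ . + dta$HISEI)          # HISEI as numeric: a single slope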

So I do the anova comparisons:

> anova(fit3,fit4)
Analysis of Variance Table

Model 1: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup)
Model 2: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup) + factor(dta$HISEI)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    985 3067966                           
2    929 2857404 56    210562 1.2225 0.1314

Since Pr(>F) > 0.05, fit3 and fit4 are not significantly different.

However, with the continuous model:

> anova(fit3,fit5)
Analysis of Variance Table

Model 1: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup)
Model 2: dta$X.U.FEFF..mpist. ~ relevel(factor(dta$matem), 7) + relevel(factor(dta$aidink), 
    7) + factor(dta$sukup) + dta$HISEI
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1    985 3067966                              
2    984 3049605  1     18361 5.9244 0.01511 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can see that fit5 is the same as fit4, just with the categorical variable HISEI entered as continuous instead of as a factor. So switching from factor to continuous is enough to make the F-test give Pr(>F) < 0.05, which means that fit5 is significantly different from fit3.

Why does it produce differing results for factor() variables and continuous variables?

Best Answer

Let's consider a simple example: a single independent variable (IV) that might sensibly be treated either as a factor or as a continuous variable (depending on what you think about the suitability of possible models, and on what you want to find out), and a single dependent variable (DV, response).

In this example, the response is expenditure on medications and the IV is age group (20,30], (30,40], ..., (70,80], coded as 1, 2, 3, 4, 5, 6. We have six people in each age category:

Age     Medication cost
Grp     relative to index
 1   95  103  112  114  110  108
 2  113  126  119  121  144  121
 3  118  127  113  127  124  128
 4  131  111  134  120  140  134
 5  132  160  146  144  159  154
 6  157  176  176  165  170  168

When you fit age group as continuous you have a single regression coefficient for that variable, treating age group as a numeric quantity (i.e. it assumes the rate of change per decade is roughly constant), while when you fit it as a factor the model estimates a separate mean for each age group (and so can pick up change in any direction):

[Figure: linear regression fit vs. separate-means (factor) fit for the age-group data]
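In R, the two fits for this toy data set might be set up as follows (a sketch: the variable names age_grp and cost are mine, with values taken from the table above):

# Toy data from the table: six people in each of six age groups
age_grp <- rep(1:6, each = 6)
cost <- c( 95, 103, 112, 114, 110, 108,
          113, 126, 119, 121, 144, 121,
          118, 127, 113, 127, 124, 128,
          131, 111, 134, 120, 140, 134,
          132, 160, 146, 144, 159, 154,
          157, 176, 176, 165, 170, 168)

fit_linear <- lm(cost ~ age_grp)          # one slope: 1 df for age group
fit_factor <- lm(cost ~ factor(age_grp))  # one mean per group: 5 df for age group
anova(fit_linear, fit_factor)             # is the drop in RSS worth 4 extra df?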

The second model will always fit at least as closely, but it uses many more parameters to do so. Which approach makes sense depends on the circumstances.

If the linear model is close to suitable, the saving in degrees of freedom does not cost much in model sum of squares, so the model mean square will be high relative to the error mean square. If the linear model is not a good fit, the model mean square will be relatively low and the error mean square will be inflated by the lack of fit.
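As a quick check on the question's own numbers, the partial F statistic for a nested comparison is (ΔRSS/Δdf) / (RSS_full/df_full), and both printed results can be reproduced from the tables above. Spreading HISEI's sum of squares over 56 df washes the effect out; concentrating it in 1 df does not:

# Reproduce the question's F statistics from the printed anova tables
F_factor <- (210562 / 56) / (2857404 / 929)   # HISEI as factor: 1.2225 on 56 extra df
F_cont   <- (18361  /  1) / (3049605 / 984)   # HISEI as continuous: 5.9244 on 1 extra df

pf(F_factor, 56, 929, lower.tail = FALSE)     # ~0.13, matches Pr(>F) above
pf(F_cont,    1, 984, lower.tail = FALSE)     # ~0.015, matches Pr(>F) above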