Solved – Model for continuous response and a mix of continuous and categorical predictors

generalized linear model, predictive-models

Which statistical model is appropriate when the response is continuous and the predictors are a mix of continuous and categorical? What is the disadvantage of using a GLM with the Gaussian family?

Here is my dataset and model in R:

df <- structure(list(as.factor.pred. = structure(c(1L, 1L, 5L, 3L, 
2L, 8L, 3L, 5L, 2L, 2L, 3L, 2L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
1L, 2L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 1L, 7L, 3L, 3L, 
6L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L), .Label = c("A", "B", "C", "D", "E", "F", 
"G", "H"), class = "factor"), res = c(33, 33, 37, 32, 32, 26, 
33, 28, 25, 34, 29, 35, 26, 20, 27, 19, 30, 33, 27, 24, 26, 28, 
27, 23, 26, 25, 24, 26, 24, 25, 21, 21, 23, 24, 23, 27, 23, 20, 
21, 22, 22, 22, 22, 23, 23, 21, 22, 21, 21, 23, 23, 18, 20, 18, 
18, 18, 19)), .Names = c("as.factor.pred.", "res"), row.names = c(NA, 
-57L), class = "data.frame")

names(df)[1] <- "pred" ## fix up the names to match formula

model <- glm(res ~ pred, data = df)

summary(model)

Call:
glm(formula = res ~ pred, data = df)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.625  -3.000  -0.625   2.000  10.000  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.300e+01  9.080e-01  25.331   <2e-16 ***
predB        2.625e+00  1.471e+00   1.784   0.0806 .  
predC        4.714e+00  1.971e+00   2.391   0.0207 *  
predD        3.500e+00  3.397e+00   1.030   0.3080    
predE        6.333e+00  2.823e+00   2.243   0.0294 *  
predF       -8.283e-15  4.718e+00   0.000   1.0000    
predG        1.000e+00  4.718e+00   0.212   0.8330    
predH        3.000e+00  4.718e+00   0.636   0.5278    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 21.43562)

    Null deviance: 1277.6  on 56  degrees of freedom
Residual deviance: 1050.3  on 49  degrees of freedom
AIC: 345.85  

Best Answer

If the model you wish to fit is linear in its parameters and the errors are Gaussian with constant variance, then a linear model would be a reasonable start, fitted for example via the lm() function in R.

As the linear model is a special case of the GLM, there is no real difference between the two, although minor numerical differences may show up because the implementations use different algorithms. Fitting that same model in R via glm() should give the same fit (coefficients) up to machine precision, or at most small differences in the last few decimal places. However, fitting via glm(..., family = gaussian) would be inefficient compared to fitting via lm(), since glm() uses an iterative algorithm (iteratively reweighted least squares) where lm() solves the least-squares problem directly.

Note the lm() function in R fits the so-called General Linear Model, the fusion of "regression" and ANOVA. Hence it is fully capable of dealing with continuous and factor variables.
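As a minimal sketch of that point, the following fits a linear model with one continuous and one categorical predictor. The data are simulated purely for illustration, and the names dat, x, g, y and fit are hypothetical, not taken from the question.

```r
set.seed(1)                                   # reproducible simulated data
n   <- 60
dat <- data.frame(
  x = rnorm(n),                                             # continuous predictor
  g = factor(sample(c("A", "B", "C"), n, replace = TRUE))   # categorical predictor
)
# Assumed true model: a slope for x plus an offset for each level of g
offsets <- c(A = 0, B = 3, C = -1)
dat$y   <- 20 + 2 * dat$x + unname(offsets[as.character(dat$g)]) + rnorm(n)

fit <- lm(y ~ x + g, data = dat)              # the factor g is dummy-coded automatically
coef(fit)                                     # intercept, slope for x, contrasts gB and gC
anova(fit)                                    # sequential F tests for x, then g
```

With the default treatment contrasts, level A is the baseline, so gB and gC are estimated differences from A at a fixed value of x.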

The above is conditional upon the distribution of the errors and hence the response. You'll need to specify more about the exact problem for a more informed response.
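One informal way to check those assumptions is to look at the residuals of the fitted model. The sketch below uses simulated data (the names dat, g, y, fit and r are hypothetical); the same calls should apply to any lm() or Gaussian glm() fit.

```r
set.seed(42)                                  # reproducible simulated data
dat   <- data.frame(g = factor(rep(c("A", "B", "C"), each = 20)))
means <- c(A = 23, B = 26, C = 28)            # assumed group means, for illustration only
dat$y <- unname(means[as.character(dat$g)]) + rnorm(60, sd = 4)

fit <- lm(y ~ g, data = dat)
r   <- residuals(fit)

# Residuals against fitted values: the vertical spread should look similar in each group
plot(fitted(fit), r, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly Gaussian errors
qqnorm(r)
qqline(r)
```

These plots are only a visual guide; with very small groups, as for some levels in the question's data, they say little about the variance within those groups.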

Update

In light of the OP posting data, we can show the equivalence. In the below, model is as per the OP's question, whilst model2 is the same model fitted via lm() instead of glm().
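The call that produced model2 is not shown; presumably it was along the following lines. The data frame is rebuilt here in a more compact form than the structure() call in the question so that the snippet runs on its own; the helper names codes and res are introduced only for that purpose.

```r
# Data from the question: integer codes 1..8 correspond to factor levels A..H
codes <- c(1, 1, 5, 3, 2, 8, 3, 5, 2, 2, 3, 2, 4, 1, 1, 1, 1, 2, 2, 2,
           1, 2, 4, 1, 1, 1, 1, 3, 3, 2, 2, 2, 1, 7, 3, 3, 6, 2, 1, 1,
           1, 1, 1, 1, 5, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2)
res   <- c(33, 33, 37, 32, 32, 26, 33, 28, 25, 34, 29, 35, 26, 20, 27, 19, 30, 33, 27, 24,
           26, 28, 27, 23, 26, 25, 24, 26, 24, 25, 21, 21, 23, 24, 23, 27, 23, 20, 21, 22,
           22, 22, 22, 23, 23, 21, 22, 21, 21, 23, 23, 18, 20, 18, 18, 18, 19)
df <- data.frame(pred = factor(codes, levels = 1:8, labels = LETTERS[1:8]),
                 res  = res)

model  <- glm(res ~ pred, data = df)   # as in the question; gaussian is glm()'s default family
model2 <- lm(res ~ pred, data = df)    # assumed call for model2: same formula, fitted by lm()
```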

> anova(model, test = "F")
Analysis of Deviance Table

Model: gaussian, link: identity

Response: res

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev      F Pr(>F)
NULL                    56     1277.6              
pred  7   227.23        49     1050.3 1.5144 0.1846
> anova(model2)
Analysis of Variance Table

Response: res
          Df  Sum Sq Mean Sq F value Pr(>F)
pred       7  227.23  32.462  1.5144 0.1846
Residuals 49 1050.35  21.436

Notice that the deviance of model is the same as the sums of squares in model2, and that the rest of the important numbers, the F statistic and its p-value, are also the same. Likewise, the estimated values of the model coefficients are the same:

> coef(model)
  (Intercept)         predB         predC         predD         predE 
 2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00 
        predF         predG         predH 
-8.283393e-15  1.000000e+00  3.000000e+00 
> coef(model2)
  (Intercept)         predB         predC         predD         predE 
 2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00 
        predF         predG         predH 
-8.283393e-15  1.000000e+00  3.000000e+00 
> all.equal(coef(model), coef(model2))
[1] TRUE

There appear to be some small differences between groups A and C and between groups A and E, but overall pred does not explain a significant amount of variance in the response.