Which statistical model is appropriate when the response is continuous and the predictors are a mix of continuous and categorical? What is the disadvantage of using a GLM with the Gaussian family?
Here is my dataset and model in R:
df <- structure(list(as.factor.pred. = structure(c(1L, 1L, 5L, 3L,
2L, 8L, 3L, 5L, 2L, 2L, 3L, 2L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 2L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 1L, 7L, 3L, 3L,
6L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 2L, 2L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 2L), .Label = c("A", "B", "C", "D", "E", "F",
"G", "H"), class = "factor"), res = c(33, 33, 37, 32, 32, 26,
33, 28, 25, 34, 29, 35, 26, 20, 27, 19, 30, 33, 27, 24, 26, 28,
27, 23, 26, 25, 24, 26, 24, 25, 21, 21, 23, 24, 23, 27, 23, 20,
21, 22, 22, 22, 22, 23, 23, 21, 22, 21, 21, 23, 23, 18, 20, 18,
18, 18, 19)), .Names = c("as.factor.pred.", "res"), row.names = c(NA,
-57L), class = "data.frame")
names(df)[1] <- "pred" ## fix up the names to match formula
model <- glm(res ~ pred, data = df)
summary(model)
Call:
glm(formula = res ~ pred, data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-6.625  -3.000  -0.625   2.000  10.000

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.300e+01  9.080e-01  25.331   <2e-16 ***
predB        2.625e+00  1.471e+00   1.784   0.0806 .
predC        4.714e+00  1.971e+00   2.391   0.0207 *
predD        3.500e+00  3.397e+00   1.030   0.3080
predE        6.333e+00  2.823e+00   2.243   0.0294 *
predF       -8.283e-15  4.718e+00   0.000   1.0000
predG        1.000e+00  4.718e+00   0.212   0.8330
predH        3.000e+00  4.718e+00   0.636   0.5278
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 21.43562)

    Null deviance: 1277.6 on 56 degrees of freedom
Residual deviance: 1050.3 on 49 degrees of freedom
AIC: 345.85
Best Answer
If the model you wish to fit is linear in its parameters and the errors are Gaussian with constant variance, then a linear model is a reasonable start, for example via the lm() function in R.
As the linear model is a special case of the GLM, there is no real difference between the two, although minor differences may show up because the two implementations use different algorithms. Fitting the same model in R via glm() should give the same fit (coefficients) up to machine precision, or with small differences in the last few decimal places. However, fitting via glm(...., family = gaussian) would be inefficient compared to fitting via lm(), because glm() uses an iterative algorithm (iteratively reweighted least squares) where lm() solves the least-squares problem directly.
Note that the lm() function in R fits the so-called General Linear Model, the fusion of "regression" and ANOVA. Hence it is fully capable of dealing with continuous and factor variables.
The above is conditional upon the distribution of the errors and hence the response. You'll need to specify more about the exact problem for a more informed response.
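To make the point about mixed predictors concrete, here is a minimal sketch on simulated data. The data frame sim, its columns x, g and y, and the effect sizes are all made up for illustration and have nothing to do with the OP's data. It fits the same model with one continuous and one factor predictor via both functions and checks that they agree.

```r
# Simulated data (an assumption for illustration, not the OP's data):
# one continuous predictor x and one three-level factor g.
set.seed(1)
n <- 60
sim <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
# Hypothetical group effects: a = 0, b = 1, c = -1
group_effect <- unname(c(a = 0, b = 1, c = -1)[as.character(sim$g)])
sim$y <- 2 + 1.5 * sim$x + group_effect + rnorm(n)

fit_lm  <- lm(y ~ x + g, data = sim)                      # direct least squares
fit_glm <- glm(y ~ x + g, data = sim, family = gaussian)  # same model, fitted iteratively

# Coefficients should agree up to numerical tolerance
print(all.equal(coef(fit_lm), coef(fit_glm)))

# For the gaussian family the deviance is the residual sum of squares
print(all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2)))
```

Both fits handle the factor g through the same treatment-contrast dummy coding, so nothing special is needed to mix continuous and categorical predictors.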
Update
In light of the OP posting data, we can show the equivalence. Here, model is as per the OP's question, whilst model2 is the same model fitted via lm() instead of glm().
Notice that the Deviance of model is the same as the sums of squares in model2, and the rest of the important numbers, F and its p-value, are the same. Likewise, the estimated values of the model coefficients are the same.
There appear to be some small differences between groups A and C and between groups A and E, but overall pred does not explain a significant amount of variance in the response.
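The comparison described in this update can be reproduced along the following lines. This is a sketch rather than the original code: the data are retyped from the question in a compact form (the helper vector codes is my own name), and the two anova() calls are one reasonable way to obtain the F test from each fit.

```r
# Rebuild the OP's data: integer codes 1..8 correspond to levels A..H
codes <- c(1, 1, 5, 3, 2, 8, 3, 5, 2, 2, 3, 2, 4, 1, 1, 1, 1, 2, 2, 2,
           1, 2, 4, 1, 1, 1, 1, 3, 3, 2, 2, 2, 1, 7, 3, 3, 6, 2, 1, 1,
           1, 1, 1, 1, 5, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2)
df <- data.frame(
  pred = factor(LETTERS[codes], levels = LETTERS[1:8]),
  res  = c(33, 33, 37, 32, 32, 26, 33, 28, 25, 34, 29, 35, 26, 20, 27,
           19, 30, 33, 27, 24, 26, 28, 27, 23, 26, 25, 24, 26, 24, 25,
           21, 21, 23, 24, 23, 27, 23, 20, 21, 22, 22, 22, 22, 23, 23,
           21, 22, 21, 21, 23, 23, 18, 20, 18, 18, 18, 19)
)

model  <- glm(res ~ pred, data = df)  # as in the question; gaussian is the default family
model2 <- lm(res ~ pred, data = df)   # the same model fitted via lm()

# Coefficients and residual deviance / residual sum of squares should match
print(all.equal(coef(model), coef(model2)))
print(all.equal(deviance(model), deviance(model2)))

# F test for pred from each fit; the F statistic and p-value should agree
print(anova(model2))
print(anova(model, test = "F"))
```

The residual deviance printed for model should correspond to the 1050.3 on 49 degrees of freedom shown in the question's summary output.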