Let's say I have the following logistic regression models:
> df = data.frame(income=c(5,5,3,3,6,5),
+ won=c(0,0,1,1,1,0),
+ age=c(18,18,23,50,19,39),
+ home=c(0,0,1,0,0,1))
> md1 = glm(factor(won) ~ income + age + home,
+ data=df, family=binomial(link="logit"))
> md2 = glm(factor(won) ~ factor(income) + factor(age) + factor(home),
+ data=df, family=binomial(link="logit"))
> summary(md1)
Call:
glm(formula = factor(won) ~ income + age + home, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
1 2 3 4 5 6
-1.0845 -1.0845 0.8017 0.4901 1.7298 -0.8017
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.784832   6.326264   0.756    0.449
income     -1.027049   1.056031  -0.973    0.331
age         0.007102   0.097759   0.073    0.942
home       -0.896802   2.252894  -0.398    0.691
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178 on 5 degrees of freedom
Residual deviance: 6.8700 on 2 degrees of freedom
AIC: 14.87
Number of Fisher Scoring iterations: 4
> summary(md2)
Call:
glm(formula = factor(won) ~ factor(income) + factor(age) + factor(home),
family = binomial(link = "logit"), data = df)
Deviance Residuals:
1 2 3 4 5 6
-6.547e-06 -6.547e-06 6.547e-06 6.547e-06 6.547e-06 -6.547e-06
Coefficients: (3 not defined because of singularities)
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)      2.457e+01  1.310e+05       0        1
factor(income)5 -4.913e+01  1.605e+05       0        1
factor(income)6 -2.573e-30  1.853e+05       0        1
factor(age)19           NA         NA      NA       NA
factor(age)23   -1.383e-30  1.853e+05       0        1
factor(age)39   -3.479e-14  1.605e+05       0        1
factor(age)50           NA         NA      NA       NA
factor(home)1           NA         NA      NA       NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178e+00 on 5 degrees of freedom
Residual deviance: 2.5720e-10 on 1 degrees of freedom
AIC: 10
So depending on the mode of the predictors, R produces different output. For factor predictors, R reports a separate coefficient for each level; for numeric predictors, it reports a single coefficient per variable. I'm wondering about a couple of things.
- Is it ever useful to have the response categories expressed as individual rows?
- To express the general regression equation, how does one go from a model with the categories expressed as individual coefficients to an equation with a single B_i? For example, if gender has two coefficients, 3.5 for Male and 2.3 for Female, how does one use them in an equation such as the following (other than by converting the categories into numeric values)?
Y = B0 + B1 (Gender)
Best Answer
I don't entirely understand question 1. Are you asking when to use numeric versus factor values? Factors should be used for categorical data (discrete units that are not in any specific order); numeric values should be used for continuous, ratio, or (some) interval-level data. In the model above, age should be numeric, home (if dichotomous) gives the same fit whether or not it is a factor, and income would likely be a factor (though a numeric coding has a reasonable interpretation if the categories are equally spaced and distributed).
To know if you should be using factors, consider the following question: does a partial count (e.g. 0 < x < 1) make sense as a result? Treating non-numeric data as numeric is what gives us our famous 2.4 children.
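As a rough illustration of the numeric-versus-factor distinction, the sketch below rebuilds the data frame from the question and uses base R's model.matrix() to show which columns R constructs under each coding. The object names X_num, X_fac, and X_full are made up for this example.

```r
# Same toy data as in the question.
df <- data.frame(income = c(5, 5, 3, 3, 6, 5),
                 won    = c(0, 0, 1, 1, 1, 0),
                 age    = c(18, 18, 23, 50, 19, 39),
                 home   = c(0, 0, 1, 0, 0, 1))

# Numeric predictor: a single column, hence a single slope.
X_num <- model.matrix(~ income, data = df)
colnames(X_num)

# Factor predictor: one 0/1 indicator column per level except the
# reference level (here income = 3, the first level).
X_fac <- model.matrix(~ factor(income), data = df)
colnames(X_fac)

# All three predictors as factors, as in md2.
X_full <- model.matrix(~ factor(income) + factor(age) + factor(home),
                       data = df)
dim(X_full)  # rows = observations, columns = coefficients requested
```

With every predictor treated as a factor, the design matrix has more columns than the data have rows, so not all coefficients can be estimated. That is consistent with the "not defined because of singularities" note in the second summary.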
For question 2, if your sample is limited to the two genders (i.e. all respondents are male or female), you shouldn't be able to get a coefficient for both male and female from the model. One of them will be the reference category, meaning that it is represented as part of the constant. So, the effect of being male in your equation would be:
y = b*x(male) + B*X(other covariates) + a(constant) + e. The coefficient for male is the effect of being male, relative to female, controlling for the other covariates. If you take male out and put in female, the coefficient should be of the same magnitude but in the opposite direction (assuming your model does not include any interaction between the covariates).
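A minimal sketch of the reference-category idea follows. The data frame toy and its values are hypothetical, invented purely for illustration, and the model has no other covariates. In the notation of the question, the fitted equation is logit(P(won = 1)) = B0 + B1 (Male), where Male is a 0/1 indicator, so only one gender gets its own coefficient.

```r
# Hypothetical data, made up for illustration only.
toy <- data.frame(
  gender = factor(c("male", "male", "male", "male",
                    "female", "female", "female", "female")),
  won    = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Default coding: "female" (the first level alphabetically) is the
# reference, so it is absorbed into the intercept and only a
# "gendermale" coefficient is reported.
fit_ref_female <- glm(won ~ gender, data = toy, family = binomial)

# Switch the reference category to "male" and refit.
toy$gender <- relevel(toy$gender, ref = "male")
fit_ref_male <- glm(won ~ gender, data = toy, family = binomial)

coef(fit_ref_female)  # (Intercept) and gendermale
coef(fit_ref_male)    # (Intercept) and genderfemale
```

The gendermale and genderfemale coefficients should agree in magnitude and differ in sign, and the intercept changes because it now describes the other group.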
cf. http://www.ats.ucla.edu/stat/mult_pkg/whatstat/nominal_ordinal_interval.htm