Using R, I have fitted a linear model for a single response variable from a mix of continuous and discrete predictors. This is uber-basic, but I'm having trouble grasping how a coefficient for a discrete factor works.
Concept: Obviously, the coefficient of the continuous variable 'x' is applied in the form y = coefx(varx) + intercept
but how does that work for a factor z if the factor is non-numeric? y = coefx(varx) + coefz(factorz???) + intercept
Specific: I have fitted a model in R as `lm(log(c) ~ log(d) + h + a + f + h:a)`, where `h` and `f` are discrete, non-numeric factors. The coefficients are:
Coefficients:
Estimate
(Intercept) -0.679695
log(d) 1.791294
h1 0.870735
h2 -0.447570
h3 0.542033
a 0.037362
f1 -0.588362
f2 0.816825
f3 0.534440
h1:a -0.085658
h2:a -0.034970
h3:a -0.040637
How do I use these to create the predictive equation:
log(c) = 1.791294(log(d)) + 0.037362(a) + h??? + f???? + h:a???? + -0.679695
Or am I doing it wrong?
I THINK the concept is that if the subject falls in categories `h1` and `f2`, the equation becomes:
log(c) = 1.791294(log(d)) + 0.037362(a) + 0.870735 + 0.816825 + h:a???? + -0.679695
But I'm really not clear on how the `h:a` interaction term gets parsed. Thanks for going easy on me.
Best Answer
This is not a problem specific to R. R uses a conventional display of coefficients.
When you read such regression output (in a paper, textbook, or from statistical software), you need to know which variables are "continuous" and which are "categorical":
The "continuous" ones are explicitly numeric and their numeric values were used as-is in the regression fitting.
The "categorical" variables can be of any type, including those that are numeric! What makes them categorical is that the software treated them as "factors": that is, each distinct value that is found is considered an indicator of something distinct.
Most software will treat non-numerical values (such as strings) as factors. Most software can be persuaded to treat numerical values as factors, too. For example, a postal service code (ZIP code in the US) looks like a number but really is just a code for a set of mailboxes; it would make no sense to add, subtract, or multiply ZIP codes. (This flexibility is the source of a common mistake: if you are not paying attention, your software may treat a variable you consider categorical as continuous, or vice versa. Be careful!)
Nevertheless, categorical variables have to be represented in some way as numbers in order to apply the fitting algorithms. There are many ways to encode them. The codes are created using "dummy variables." Find out more about dummy variable encoding by searching on this site; the details don't matter here.
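As a rough illustration of what such an encoding looks like, here is a minimal sketch using R's `model.matrix()` under the default treatment contrasts. The data and the level names `h0` to `h3` are made up for the example; the question does not list the levels of `h`. Under this default the first level (`h0` here) is the reference and gets no column of its own, which would explain why a factor in a model with an intercept shows one coefficient fewer than it has levels.

```r
# Toy data: values and level names are assumptions made for this sketch only.
toy <- data.frame(
  h = factor(c("h0", "h1", "h2", "h3", "h1")),
  a = c(1.5, -1, 0, 2, 3)
)

# Build the numeric design matrix R would use for the terms h + a + h:a.
X <- model.matrix(~ h + a + h:a, data = toy)

# Each non-reference level of h gets a 0/1 indicator column (hh1, hh2, hh3),
# and each interaction column (hh1:a, ...) is that indicator multiplied by a.
print(colnames(X))
print(X)
```

The column names are the variable name followed by the level name, so `hh1` is the indicator for level `h1` and `hh1:a` equals `a` for rows in that level and zero elsewhere.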
In the question we are told that `h` and `f` are categorical ("discrete") values. By default, `log(d)` and `a` are continuous. That's all we need to know. Writing $y$ for the response $\log(c)$, the model is

$$\eqalign{
y &= \color{red}{-0.679695} & \\
&+ \color{RoyalBlue}{1.791294}\ \log(d) \\
&+ 0.870735 &\text{ if }h=h_1 \\
& -0.447570 &\text{ if }h=h_2 \\
&+ \color{green}{0.542033} &\text{ if }h=h_3 \\
&+ \color{orange}{0.037362}\ a \\
& -0.588362 &\text{ if }f=f_1 \\
&+ \color{purple}{0.816825} &\text{ if }f=f_2 \\
&+ 0.534440 &\text{ if }f=f_3 \\
& -0.085658\ a &\text{ if }h=h_1 \\
& -0.034970\ a &\text{ if }h=h_2 \\
& -\color{brown}{0.040637}\ a &\text{ if }h=h_3 \\
}$$
The rules applied here are:
1. The "intercept" term, if it appears, is an additive constant (first line).
2. Continuous variables are multiplied by their coefficients, even in "interactions" like the `h1:a`, `h2:a`, and `h3:a` terms. (This answers the original question.)
3. The coefficient for a value of a categorical variable (or factor) is included only for cases in which the factor takes that value.
For example, suppose that $\log(d)=2$, $h=h_3$, $a=-1$, and $f=f_2$. The fitted value in this model is
$$\hat{y} = \color{red}{-0.6797} + \color{RoyalBlue}{1.7913}\times (2) + \color{green}{0.5420} + \color{orange}{0.0374}\times (-1) + \color{purple}{0.8168} -\color{brown}{0.0406}\times (-1).$$
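The same arithmetic can be checked in R. This is only a sketch: the coefficients are copied from the question, while `pick()` and `predict_log_c()` are hypothetical helper names introduced here, not functions from R or from the fitted model. Levels are passed as the coefficient labels (`"h3"`, `"f2"`), and any level without a listed coefficient is assumed to contribute zero.

```r
# Coefficients copied from the regression output in the question.
intercept <- -0.679695
b_log_d   <-  1.791294
b_a       <-  0.037362
b_h  <- c(h1 =  0.870735, h2 = -0.447570, h3 =  0.542033)  # main effects of h
b_f  <- c(f1 = -0.588362, f2 =  0.816825, f3 =  0.534440)  # main effects of f
b_ha <- c(h1 = -0.085658, h2 = -0.034970, h3 = -0.040637)  # h:a interaction slopes

# Hypothetical helper: return the coefficient for one level, or 0 if that
# level has no listed coefficient (assumption: it is then a reference level).
pick <- function(b, level) if (level %in% names(b)) b[[level]] else 0

# Hypothetical helper: fitted value of log(c) for one case.
predict_log_c <- function(log_d, a, h, f) {
  intercept +
    b_log_d * log_d +      # continuous: coefficient times value
    pick(b_h, h) +         # categorical: only the matching level's coefficient
    b_a * a +              # continuous: coefficient times value
    pick(b_f, f) +         # categorical: only the matching level's coefficient
    pick(b_ha, h) * a      # interaction: matching level's coefficient times a
}

y_hat <- predict_log_c(log_d = 2, a = -1, h = "h3", f = "f2")
print(y_hat)  # about 4.265
```

With the fitted model object available, `predict()` on a new data frame would normally be the more direct route; the hand calculation is here only to show which coefficients enter the sum.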
Notice how most of the model coefficients simply do not appear in the calculation, because `h` can take on exactly one of the three values $h_1$, $h_2$, $h_3$ and therefore only one of the three coefficients $(0.870735, -0.447570, 0.542033)$ applies to `h` and only one of the three coefficients $(-0.085658, -0.034970, -0.040637)$ will multiply `a` in the `h:a` interaction; similarly, only one coefficient applies to `f` in any particular case.
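For the case raised in the question, $h=h_1$ and $f=f_2$, the same rules give the following (a worked sketch using the coefficients listed in the question, with $\log(c)$ as the response):

$$\eqalign{
\log(c) &= -0.679695 + 1.791294\ \log(d) + 0.870735 + 0.037362\ a + 0.816825 - 0.085658\ a \\
&= 1.007865 + 1.791294\ \log(d) - 0.048296\ a.
}$$

Collecting the terms in $a$ shows what the interaction does: for subjects in category $h_1$ it changes the slope of $a$ from $0.037362$ to $0.037362 - 0.085658 = -0.048296$.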