Linear Regression – How to Apply Coefficients for Factors and Interaction Terms in a Linear Equation

contrasts, linear model, regression coefficients

Using R, I have fitted a linear model for a single response variable from a mix of continuous and discrete predictors. This is uber-basic, but I'm having trouble grasping how a coefficient for a discrete factor works.

Concept: Obviously, the coefficient of the continuous variable 'x' is applied in the form y = coefx(varx) + intercept but how does that work for a factor z if the factor is non-numeric? y = coefx(varx) + coefz(factorz???) + intercept

Specific: I have fitted a model in R as lm(log(c) ~ log(d) + h + a + f + h:a) where h and f are discrete, non-numeric factors. The coefficients are:

Coefficients:
              Estimate 
(Intercept)  -0.679695 
log(d)        1.791294 
h1            0.870735  
h2           -0.447570  
h3            0.542033   
a             0.037362  
f1           -0.588362  
f2            0.816825 
f3            0.534440
h1:a         -0.085658
h2:a         -0.034970 
h3:a         -0.040637

How do I use these to create the predictive equation:

log(c) =  1.791294(log(d)) + 0.037362(a) + h??? + f???? + h:a???? + -0.679695

Or am I doing it wrong?

I THINK the concept is that if the subject falls in categories h1 and f2, the equation becomes:

log(c) =  1.791294(log(d)) + 0.037362(a) +  0.870735  + 0.816825  + h:a???? + -0.679695

But I'm really not clear on how the h:a interaction term gets parsed. Thanks for going easy on me.

Best Answer

This is not a problem specific to R. R uses a conventional display of coefficients.

When you read such regression output (in a paper, textbook, or from statistical software), you need to know which variables are "continuous" and which are "categorical":

The "continuous" ones are explicitly numeric and their numeric values were used as-is in the regression fitting.
The "categorical" variables can be of any type, including those that are numeric! What makes them categorical is that the software treated them as "factors": that is, each distinct value that is found is considered an indicator of something distinct.

Most software will treat non-numerical values (such as strings) as factors. Most software can be persuaded to treat numerical values as factors, too. For example, a postal service code (ZIP code in the US) looks like a number but really is just a code for a set of mailboxes; it would make no sense to add, subtract, or multiply ZIP codes by other numbers! (This flexibility is the source of a common mistake: if you are not paying attention, your software may treat a variable you consider to be categorical as continuous, or vice versa. Be careful!)

Nevertheless, categorical variables have to be represented in some way as numbers in order to apply the fitting algorithms. There are many ways to encode them. The codes are created using "dummy variables." Find out more about dummy variable encoding by searching on this site; the details don't matter here.
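As a minimal sketch of one such encoding (indicator, or "one-hot", coding), each level of a factor gets its own 0/1 column. Python is used here only for illustration; the function name `indicator_encode` and the level labels are hypothetical, and real software often drops one column or uses a different contrast scheme.

```python
def indicator_encode(value, levels):
    """Return one 0/1 indicator per level: 1 for the matching level, else 0."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels]

# Hypothetical level labels for the factor h.
h_levels = ["h1", "h2", "h3"]

print(indicator_encode("h3", h_levels))  # [0, 0, 1]

# A numeric-looking code such as a ZIP code is handled the same way:
# it is compared for equality, never added or multiplied.
print(indicator_encode("10001", ["10001", "60601"]))  # [1, 0]
```

Exactly one indicator is 1 for any given case, which is why only one coefficient per factor ends up contributing to a fitted value.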

In the question we are told that h and f are categorical ("discrete") variables. By default, log(d) and a are continuous. That's all we need to know. Writing y for the response log(c), the model is

$$\eqalign{ y &= \color{red}{-0.679695} & \\ &+ \color{RoyalBlue}{1.791294}\ \log(d) \\ &+ 0.870735 &\text{ if }h=h_1 \\ & -0.447570 &\text{ if }h=h_2 \\ &+ \color{green}{0.542033} &\text{ if }h=h_3 \\ &+ \color{orange}{0.037362}\ a \\ & -0.588362 &\text{ if }f=f_1 \\ &+ \color{purple}{0.816825} &\text{ if }f=f_2 \\ &+ 0.534440 &\text{ if }f=f_3 \\ & -0.085658\ a &\text{ if }h=h_1 \\ & -0.034970\ a &\text{ if }h=h_2 \\ & -\color{brown}{0.040637}\ a &\text{ if }h=h_3 \\ }$$

The rules applied here are:

The "intercept" term, if it appears, is an additive constant (first line).
Continuous variables are multiplied by their coefficients, even in "interactions" like the h1:a, h2:a, and h3:a terms. (This answers the original question.)
The coefficient for a level of a categorical variable (or factor) is included only for cases where the factor takes that level.

For example, suppose that $\log(d)=2$, $h=h_3$, $a=-1$, and $f=f_2$. The fitted value in this model is

$$\hat{y} = \color{red}{-0.6797} + \color{RoyalBlue}{1.7913}\times (2) + \color{green}{0.5420} + \color{orange}{0.0374}\times (-1) + \color{purple}{0.8168} -\color{brown}{0.0406}\times (-1).$$
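The rules and the worked example above can be checked with a short script. This is a sketch in Python rather than R, assuming the coefficient labels map to factor levels exactly as the model above reads them; the names `COEF` and `fitted_value` are made up for illustration.

```python
# Coefficients copied from the regression output in the question.
COEF = {
    "(Intercept)": -0.679695,
    "log(d)": 1.791294,
    "h1": 0.870735, "h2": -0.447570, "h3": 0.542033,
    "a": 0.037362,
    "f1": -0.588362, "f2": 0.816825, "f3": 0.534440,
    "h1:a": -0.085658, "h2:a": -0.034970, "h3:a": -0.040637,
}

def fitted_value(log_d, a, h, f):
    """Fitted log(c) for one case; h and f are level labels such as 'h3', 'f2'."""
    return (
        COEF["(Intercept)"]        # additive constant
        + COEF["log(d)"] * log_d   # continuous: coefficient times value
        + COEF[h]                  # categorical: only the matching level
        + COEF["a"] * a            # continuous: coefficient times value
        + COEF[f]                  # categorical: only the matching level
        + COEF[h + ":a"] * a       # interaction: matching level's slope times a
    )

y_hat = fitted_value(log_d=2, a=-1, h="h3", f="f2")
print(round(y_hat, 6))  # 4.265026
```

The interaction term is looked up by level, like the factor itself, and then multiplied by a, like a continuous variable.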

Notice how most of the model coefficients simply do not appear in the calculation, because h can take on exactly one of the three values $h_1$, $h_2$, $h_3$ and therefore only one of the three coefficients $(0.870735, -0.447570, 0.542033)$ applies to h and only one of the three coefficients $(-0.085658, -0.034970, -0.040637)$ will multiply a in the h:a interaction; similarly, only one coefficient applies to f in any particular case.
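One way to see why the unused coefficients drop out is to write the case as a row of numbers, with a 0 in every position whose level does not match, and take the dot product with the coefficient vector. The sketch below (Python, illustrative variable names) assumes the same indicator reading of the output as the model above and uses the example values $\log(d)=2$, $h=h_3$, $a=-1$, $f=f_2$.

```python
# Coefficients in the order they appear in the regression output.
names = ["(Intercept)", "log(d)", "h1", "h2", "h3", "a",
         "f1", "f2", "f3", "h1:a", "h2:a", "h3:a"]
coefs = [-0.679695, 1.791294, 0.870735, -0.447570, 0.542033, 0.037362,
         -0.588362, 0.816825, 0.534440, -0.085658, -0.034970, -0.040637]

log_d, a = 2.0, -1.0  # h = h3 and f = f2 are encoded as 0/1 entries below

# One entry per coefficient: 1 for the intercept, the value itself for a
# continuous variable, a 0/1 indicator for a factor level, and
# indicator * a for an interaction term.
row = [1, log_d, 0, 0, 1, a,
       0, 1, 0, 0 * a, 0 * a, 1 * a]

y_hat = sum(c * x for c, x in zip(coefs, row))
unused = [n for n, x in zip(names, row) if x == 0]

print(round(y_hat, 6))  # 4.265026
print(unused)           # ['h1', 'h2', 'f1', 'f3', 'h1:a', 'h2:a']
```

Six of the twelve coefficients are multiplied by zero for this case, so they contribute nothing to the fitted value.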
