R – Getting Equation for lm/ggplot Geom Smooth with Multiple Levels [Closed]

ggplot2lmr

I'm trying to get equations for slope intercept for an lm with a three level categorical variable and a continuous covariate.

Essentially I have plotted these using ggplot and in the legend I would like to have the equation for each of the levels of the categorical variable.

I can use the iris dataset as an example:

# here is the graph
iris %>% qplot(Sepal.Length,Sepal.Width, colour = Species, fill=Species, 
data=.) + geom_smooth(method=lm)

In theory to get the equation now I should be able to run a lm, such as this:

summary(lm(Sepal.Length~Sepal.Width+Species, data=iris))

which yields:

 Call:
 lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)

 Residuals:
 Min       1Q   Median       3Q      Max 
-1.30711 -0.25713 -0.05325  0.19542  1.41253 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared:  0.7259,    Adjusted R-squared:  0.7203 
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

However, I always struggle to remember the structure I need to use. What are the equations and how would the be calculated for me to be able to plot the same lines as in the above graph but instead of using the geom_smooth command to do it using something like geom_abline where you specify intercept and slope.

Best Answer

The most lm() model you fitted is in essence a collection of 3 sub-models (one for each of the three species of iris).

First, let's see what the three species of iris are:

levels(iris$Species)

It turns out they are "setosa", "versicolor" and "virginica". Thus, the three sub-models are as follows.

Sub-Model for "setosa" species:

Sepal.Length = beta0 + beta1*Sepal.Width + error

Sub-Model for "versicolor" species:

Sepal.Length = (beta0 + beta2)+ beta1*Sepal.Width + error

Sub-Model for "virginica" species:

Sepal.Length = (beta0 + beta3) + beta1*Sepal.Width + error

In practice, we don't know the values of the regression coefficients beta0, beta1, beta2 and beta3, so we'll estimate them from the data via the lm() model you provided. If we denote the estimated values of these coefficients by b0, b1, b2 and b3, then the estimated (or fitted) regression equations you need to plot will be given by:

Fitted Sepal.Length = b0 + b1*Sepal.Width  (for the "setosa" species)

Fitted Sepal.Length = (b0 + b2)+ b1*Sepal.Width (for the "versicolor" species)

Fitted Sepal.Length = (b0 + b3) + b1*Sepal.Width (for the "virginica" species)

Now all you have to do is pull the values of b0, b1, b2 and b3 from the summary of your fitted lm() model.

 Call:
 lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)

 Residuals:
 Min       1Q   Median       3Q      Max 
-1.30711 -0.25713 -0.05325  0.19542  1.41253 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)       b0 --> 2.2514     0.3698   6.089 9.57e-09 ***
Sepal.Width       b1 --> 0.8036     0.1063   7.557 4.19e-12 ***
Speciesversicolor b2 --> 1.4587     0.1121  13.012  < 2e-16 ***
Speciesvirginica  b3 --> 1.9468     0.1000  19.465  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-squared:  0.7259,    Adjusted R-squared:  0.7203 
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

In fact, you can extract these values directly as follows:

m <- lm(Sepal.Length~Sepal.Width+Species, data=iris)

b0 <- coef(m)[1]

b1 <- coef(m)[2]

b2 <- coef(m)[3]

b3 <- coef(m)[4]

So the intercepts you need for the "setosa", "versicolor" and "virginica" species are given by b0, b0 + b2 and b0 + b3, respectively.

The slopes are all given by b1, since they are the same across the three species according to your model specification.