Solved – Compare linear regression models (same and different response variables)

linear model, regression

How can I (1) compare two linear models fitted to different years, and (2) compare two models with different response variables?

My data have 4 variables: y_meas, x, year, y_calc. "y_meas" is a lab-measured response variable, and "y_calc" is an estimate of the same variable obtained from a standard calculation. "x" is a dosage, similar(ish) between the two years:

# create dataset
set.seed(100)
dat <- within(data.frame(x = rep(1:10, times = 2)),
              {
                year <- rep(1990:1991, each = 10)
                y_meas <- 0.5 * x * (1:20) + rnorm(20)
                y_calc <- 0.3 * x * (1:20) + rnorm(20)
                year <- factor(year)   # convert to a factor
              })

I have two related questions:

(1) Is there any difference between the slopes/intercepts of the models for 1990 and 1991?

m.1990 <- lm(y_meas ~ x, data = subset(dat, year == 1990))
m.1991 <- lm(y_meas ~ x, data = subset(dat, year == 1991))
anova(m.1990)
anova(m.1991)
# both models are significant

I can't run anova(m.1990, m.1991) because the models are fitted to different subsets of the data and are not nested. Do I need to use year as a dummy variable and run an ANCOVA? What would that look like (roughly)?
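My rough guess at that ANCOVA is the sketch below (year as a factor interacting with x), but I'd like confirmation:

# tentative ANCOVA: one model spanning both years
m.both <- lm(y_meas ~ x * year, data = dat)
anova(m.both)    # the x:year term would test whether the slopes differ
summary(m.both)  # year1991 gives the intercept shift, x:year1991 the slope shift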

(2) Assuming I can combine 1990 and 1991, can I compare the slope/intercept of 'y_meas ~ x' and 'y_calc ~ x'? Yes, these are two different response variables, but they are on the same scale.

Best Answer

Just put everything in one model:

library(reshape2)
# reshape to long format: 'variable' identifies y_meas vs. y_calc,
# 'value' holds the response
dat1 <- melt(dat, id.vars = c("x", "year"))

mod <- lm(value ~ x * variable * year, data = dat1)
anova(mod)

#Analysis of Variance Table
#
#Response: value
#                Df  Sum Sq Mean Sq   F value    Pr(>F)    
#x                1 13546.1 13546.1 1087.7872 < 2.2e-16 ***
#variable         1  1746.5  1746.5  140.2458 3.103e-13 ***
#year             1  4994.1  4994.1  401.0389 < 2.2e-16 ***
#x:variable       1   860.9   860.9   69.1300 1.687e-09 ***
#x:year           1  1399.1  1399.1  112.3510 5.352e-12 ***
#variable:year    1   292.1   292.1   23.4527 3.137e-05 ***
#x:variable:year  1    81.6    81.6    6.5533   0.01539 *  
#Residuals       32   398.5    12.5                        
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see, you get the significant main effects and interactions you'd expect from your artificial data.
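A side note, not part of the original answer: unlike the two separate per-year fits, models within this combined framework are nested, so anova() can compare them directly, e.g. to test the three-way interaction as a whole:

# drop the three-way interaction; this reduced model is nested in 'mod'
mod2 <- lm(value ~ (x + variable + year)^2, data = dat1)
anova(mod2, mod)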

It's always recommended to plot the data along with the fitted model:

library(ggplot2)
# predict at the extremes of x only: two points suffice for straight lines
newdat <- expand.grid(x = c(1, 10), variable = c("y_calc", "y_meas"), year = factor(1990:1991))
newdat <- cbind(newdat, predict(mod, newdata = newdat, interval = "confidence"))

ggplot(dat1, aes(x = x, y = value, colour = year, shape = variable, linetype = variable)) +
  geom_ribbon(data = newdat, aes(y = fit, ymin = lwr, ymax = upr, fill = year, colour = NULL), alpha = 0.2) +
  geom_line(data = newdat, aes(y = fit)) +
  geom_point()

[Plot of the data and fitted lines with confidence bands]

This shows that the differences between the slopes matter most, which the summary table confirms:

summary(mod)

#Coefficients:
#                          Estimate Std. Error t value Pr(>|t|)    
#(Intercept)                -6.3443     2.4107  -2.632 0.012963 *  
#x                           3.2300     0.3885   8.314 1.69e-09 ***
#variabley_meas             -4.4852     3.4092  -1.316 0.197648    
#year1991                   -0.2361     3.4092  -0.069 0.945218    
#x:variabley_meas            2.2357     0.5494   4.069 0.000288 ***
#x:year1991                  3.1235     0.5494   5.685 2.71e-06 ***
#variabley_meas:year1991    -0.1319     4.8213  -0.027 0.978341    
#x:variabley_meas:year1991   1.9891     0.7770   2.560 0.015394 *  
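If you want the four fitted slopes themselves, rather than reading them off the treatment contrasts above, one option (not used in the original answer) is the emmeans package:

library(emmeans)
# estimated slope of x for each variable/year combination, with CIs
emtrends(mod, ~ variable * year, var = "x")
# pairwise differences between those slopes
pairs(emtrends(mod, ~ variable * year, var = "x"))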

Note that the usual diagnostics (in particular for variance homogeneity) should of course be performed.
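A minimal sketch of such checks, using base R's built-in diagnostic plots:

# residuals vs. fitted, Q-Q plot, scale-location, residuals vs. leverage
par(mfrow = c(2, 2))
plot(mod)
par(mfrow = c(1, 1))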