Data Visualization – Visualizing Multivariate Multiple Regression of Continuous Data in R

continuous datadata visualizationmultiple regressionmultivariate regressionr

I have created a multivariate multiple regression model with 3 dependent and 3 independent variables in R, and would like to generate meaningful visualizations. All variables are continuous. When working with multiple regression models with 1 dependent variable, this is fairly easy.

For example

set.seed(0)

df <- data.frame(ind1 = c(1:10),
                ind2 = runif(10,5,15),
                ind3 = runif(10,5,15))
df$dep1 <- df$ind1 * df$ind2 * df$ind3
df$dep2 <- df$ind1 * df$ind2 * df$ind3 * runif(10)
df$dep3 <- df$ind1 * df$ind2 * df$ind3 * abs(rnorm(10))

model1 <- lm(data = df, dep1 ~ ind1 + ind2 + ind3)

There are simple options for different insightful visualizations

library('ggplot2')
library('ggeffects')

ggplot(ggpredict(model1, terms = c("ind1 [1,5,10]", "ind2", "ind3")), 
       aes(x, predicted, color = group)) + geom_line() + facet_wrap(~facet)

library('car')
avPlots(model1)

library('sjPlot')
plot_model(model1, type = 'diag')

However, in a model with 3 dependent variables

model2 <- lm(data = df, cbind(dep1, dep2, dep3) ~ ind1 + ind2 + ind3)

these options seem to go out the window. I am hoping to come up with something a little more powerful than simply

hist(resid(model2))

Let me know if this topic is a better fit for R StackOverflow.

Best Answer

I think you might do best to review multivariate regression (meaning no disrespect). There is a short tutorial at UVA's stats help page here: https://data.library.virginia.edu/getting-started-with-multivariate-multiple-regression/. They explain that multivariate regression is mostly the same as several univariate regressions, except that there are covariances between the betas for the different outcomes that need to be taken into account when testing the variables. In particular, they mention that the relevant plots are the same:

The same diagnostics we check for models with one predictor should be checked for these as well.

I will use their example to illustrate some data visualizations below (coded in R). I'll start with a scatterplot matrix of the data. GEN is binary, so I'll represent that with a different color and symbol. After fitting the model, if you try to run plot.lm() you'll get an error. However, it's easy enough to reproduce those plots manually. To plot a multiple regression model without interactions, you can pick a variable of interest and make a scatterplot with it and the response and draw the fitted function over it. Be sure to adjust the intercept by setting other variables at their means (see my answer to How to visualize a fitted multiple regression model?). You can also make scatterplots and qq-plots of the residuals (the latter lets you assess if their distribution is similar).

ami_data = read.table("http://static.lib.virginia.edu/statlab/materials/data/ami_data.DAT")
names(ami_data) = c("TOT","AMI","GEN","AMT","PR","DIAP","QRS")

summary(ami_data)
# output omitted
pairs(ami_data[,-3], col=ifelse(ami_data$GEN, "red", "black"),
                     pch=ifelse(ami_data$GEN, 3, 1))

mlm1 = lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data=ami_data)
summary(mlm1)
# output omitted
head(resid(mlm1))
#          TOT        AMI
# 1  132.82172  161.52769
# 2  -72.00392 -264.35329
# 3 -399.24769 -373.85244
# 4 -382.84730 -247.29456
# 5 -152.39129   15.78777
# 6  366.78644  217.13206
windows()
  plot(mlm1)
# Error: 'plot.mlm' is not implemented yet

## reproducing R's plot.lm() for TOT
rst = rstandard(mlm1)  # standardized residuals
windows()  
  layout(matrix(1:4, nrow=2, byrow=T))
  # plot 1
  plot(resid(mlm1)[,1]~fitted(mlm1)[,1], 
       main="Residuals vs Fitted for TOT", xlab="fitted", ylab="residuals")
  lines(lowess(resid(mlm1)[,1]~fitted(mlm1)[,1]), col="red")
  # plot 2
  plot(sqrt(abs(rst[,1]))~fitted(mlm1)[,1], 
       main="Scale-Location plot for TOT", xlab="fitted", ylab="residuals")
  lines(lowess(sqrt(abs(rst[,1]))~fitted(mlm1)[,1]), col="red")
  # plot 3
  qqnorm(rst[,1], main="qq-plot of residuals TOT")
  qqline(rst[,1])
  # plot 5
  plot(rst[,1]~lm.influence(mlm1)$hat, xlab="Leverage", ylab="residuals",
       main="Residuals vs Leverage for TOT")
  lines(lowess(rst[,1]~lm.influence(mlm1)$hat), col="red")

windows()  
  layout(matrix(1:4, nrow=2, byrow=T))
  # plot of model for TOT vs AMT
  plot(TOT~AMT, ami_data, main="TOT vs AMT w/ data & model", 
       col=ifelse(ami_data$GEN, 2, 1), pch=ifelse(ami_data$GEN, 3, 1))
  abline(coef(mlm1)[-3,1]%*%c(1, apply(ami_data[,c(3,5:7)], 2, mean)),
         coef(mlm1)[3,1], col="blue")
  # plot of model for AMI vs AMT
  plot(AMI~AMT, ami_data, main="AMI vs AMT w/ data & model", 
       col=ifelse(ami_data$GEN, 2, 1), pch=ifelse(ami_data$GEN, 3, 1))
  abline(coef(mlm1)[-3,2]%*%c(1, apply(ami_data[,c(3,5:7)], 2, mean)),
         coef(mlm1)[3,2], col="blue")
  # scatterplot of residuals
  plot(resid(mlm1)[,1], resid(mlm1)[,2], xlab="Residuals for TOT",
       ylab="Residuals for AMI", main="scatterplot of residuals")
  # qq-plot of residuals
  qqplot(resid(mlm1)[,1], resid(mlm1)[,2], xlab="Residuals for TOT",
         ylab="Residuals for AMI", main="qq-plot for residuals")

Related Solutions

Multivariate Multiple Regression in R – Comprehensive Guide

Briefly stated, this is because base-R's manova(lm()) uses sequential model comparisons for so-called Type I sum of squares, whereas car's Manova() by default uses model comparisons for Type II sum of squares.

I assume you're familiar with the model-comparison approach to ANOVA or regression analysis. This approach defines these tests by comparing a restricted model (corresponding to a null hypothesis) to an unrestricted model (corresponding to the alternative hypothesis). If you're not familiar with this idea, I recommend Maxwell & Delaney's excellent "Designing experiments and analyzing data" (2004).

For type I SS, the restricted model in a regression analysis for your first predictor c is the null-model which only uses the absolute term: lm(Y ~ 1), where Y in your case would be the multivariate DV defined by cbind(A, B). The unrestricted model then adds predictor c, i.e. lm(Y ~ c + 1).

For type II SS, the unrestricted model in a regression analysis for your first predictor c is the full model which includes all predictors except for their interactions, i.e., lm(Y ~ c + d + e + f + g + H + I). The restricted model removes predictor c from the unrestricted model, i.e., lm(Y ~ d + e + f + g + H + I).

Since both functions rely on different model comparisons, they lead to different results. The question which one is preferable is hard to answer - it really depends on your hypotheses.

What follows assumes you're familiar with how multivariate test statistics like the Pillai-Bartlett Trace are calculated based on the null-model, the full model, and the pair of restricted-unrestricted models. For brevity, I only consider predictors c and H, and only test for c.

N <- 100                             # generate some data: number of subjects
c <- rbinom(N, 1, 0.2)               # dichotomous predictor c
H <- rnorm(N, -10, 2)                # metric predictor H
A <- -1.4*c + 0.6*H + rnorm(N, 0, 3) # DV A
B <-  1.4*c - 0.6*H + rnorm(N, 0, 3) # DV B
Y <- cbind(A, B)                     # DV matrix
my.model <- lm(Y ~ c + H)            # the multivariate model
summary(manova(my.model))            # from base-R: SS type I
#           Df  Pillai approx F num Df den Df  Pr(>F)    
# c          1 0.06835   3.5213      2     96 0.03344 *  
# H          1 0.32664  23.2842      2     96 5.7e-09 ***
# Residuals 97

For comparison, the result from car's Manova() function using SS type II.

library(car)                           # for Manova()
Manova(my.model, type="II")
# Type II MANOVA Tests: Pillai test statistic
#   Df test stat approx F num Df den Df  Pr(>F)    
# c  1   0.05904   3.0119      2     96 0.05387 .  
# H  1   0.32664  23.2842      2     96 5.7e-09 ***

Now manually verify both results. Build the design matrix $X$ first and compare to R's design matrix.

X  <- cbind(1, c, H)
XR <- model.matrix(~ c + H)
all.equal(X, XR, check.attributes=FALSE)
# [1] TRUE

Now define the orthogonal projection for the full model ($P_{f} = X (X'X)^{-1} X'$, using all predictors). This gives us the matrix $W = Y' (I-P_{f}) Y$.

Pf  <- X %*% solve(t(X) %*% X) %*% t(X)
Id  <- diag(N)
WW  <- t(Y) %*% (Id - Pf) %*% Y

Restricted and unrestricted models for SS type I plus their projections $P_{rI}$ and $P_{uI}$, leading to matrix $B_{I} = Y' (P_{uI} - P_{PrI}) Y$.

XrI <- X[ , 1]
PrI <- XrI %*% solve(t(XrI) %*% XrI) %*% t(XrI)
XuI <- X[ , c(1, 2)]
PuI <- XuI %*% solve(t(XuI) %*% XuI) %*% t(XuI)
Bi  <- t(Y) %*% (PuI - PrI) %*% Y

Restricted and unrestricted models for SS type II plus their projections $P_{rI}$ and $P_{uII}$, leading to matrix $B_{II} = Y' (P_{uII} - P_{PrII}) Y$.

XrII <- X[ , -2]
PrII <- XrII %*% solve(t(XrII) %*% XrII) %*% t(XrII)
PuII <- Pf
Bii  <- t(Y) %*% (PuII - PrII) %*% Y

Pillai-Bartlett trace for both types of SS: trace of $(B + W)^{-1} B$.

(PBTi  <- sum(diag(solve(Bi  + WW) %*% Bi)))   # SS type I
# [1] 0.0683467

(PBTii <- sum(diag(solve(Bii + WW) %*% Bii)))  # SS type II
# [1] 0.05904288

Note that the calculations for the orthogonal projections mimic the mathematical formula, but are a bad idea numerically. One should really use QR-decompositions or SVD in combination with crossprod() instead.

Solved – Multivariate Linear Regression with continuous and discrete explanatory variable

Notice that you are predicting negative prices for small gross areas. Does that make sense (Please come and live in this appartment, I will give you money if you do). I would consider using a log link function.

As for the categorical variable, I would just add indicator variables for the number of bathrooms instead of entering it as a continuous variable.

Best Answer

Related Solutions

Multivariate Multiple Regression in R – Comprehensive Guide

Solved – Multivariate Linear Regression with continuous and discrete explanatory variable

Related Question