Solved – Type III sums of squares

Tags: anova, regression, sums-of-squares

I have a linear regression model with one categorical variable $A$ (male & female) and one continuous variable $B$.

I set up contrast codes in R with options(contrasts=c("contr.sum","contr.poly")).
I now have Type III sums of squares for $A$, $B$, and their interaction (A:B) from drop1(model, .~., test="F").

What I am stuck on is how the sum of squares for $B$ is calculated. I think it is sum((predicted y of the full model - predicted y of the reduced model)^2), where the reduced model would be y~A+A:B. But when I use predict(lm(y~A+A:B)), R returns predicted values that are the same as the full-model predicted values, so the sum of squares would be 0.

(For the sum of squares of $A$, I used a reduced model of y~B+A:B, which is the same as y~A:B.)

Here is example code for randomly generated data:

A<-as.factor(rep(c("male","female"), each=5))
set.seed(1)
B<-runif(10)
set.seed(5)
y<-runif(10)

model<-lm(y~A+B+A:B)

options(contrasts = c("contr.sum","contr.poly"))

#type3 sums of squares
drop1(model, .~., test="F")
#or same result:
library(car)
Anova(lm(y~A+B+A:B),type="III")

#full model
predFull<-predict(model)

#Calculate sum of squares
#SS(A|B,AB)
predA<-predict(lm(y~B+A:B))
sum((predFull-predA)^2) 

#SS(B|A,AB) (???)
predB<-predict(lm(y~A+A:B))
sum((predFull-predB)^2) 
#Sums of squares should be 0.15075 (according to anova table)
#but calculated to be 2.5e-31

#SS(AB|A,B)
predAB<-predict(lm(y~A+B))
sum((predFull-predAB)^2)


#Anova Table (Type III tests)
#Response: y
#             Sum Sq Df F value Pr(>F)
#(Intercept) 0.16074  1  1.3598 0.2878
#A           0.00148  1  0.0125 0.9145
#B           0.15075  1  1.2753 0.3019
#A:B         0.01628  1  0.1377 0.7233
#Residuals   0.70926  6    

Best Answer

I found differences in the estimated regressors between R 2.15.1 and SAS 9.2, but after updating R to version 3.0.1 the results were the same. So, first of all, I suggest you update R to the latest version.

You're using the wrong approach, because you're calculating the sum of squares across two different models, which implies two different design matrices. That gives you entirely different estimates of the regressors that lm() uses to compute the predicted values (the regressors take different values in the two models). SS3 is instead computed from a hypothesis test on the full model: the hypothesis matrix has zeros in the positions of all the conditioning regressors and a 1 in the position of the tested regressor. The computation uses the same design matrix, and the same regressor estimates, as the full model. Remember also that the SS3s are not fully additive: if you sum the estimated SS3s, you do not obtain the model SS (SSM).
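In symbols, this is the general linear hypothesis sum of squares that the SSH_estimate() function below computes:

$$SS_H = (C\hat\beta)^\top \left[ C (X^\top X)^{-} C^\top \right]^{-1} (C\hat\beta), \qquad \hat\beta = (X^\top X)^{-1} X^\top y,$$

where $X$ is the full-model design matrix and $C$ is the (here single-row) hypothesis matrix with a 1 in the position of the tested regressor and zeros elsewhere.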

Here is an R implementation of the general linear hypothesis mathematics used to estimate the SS3 and the regressors.

The values generated by this code exactly match the ones from SAS 9.2 and the results you gave in your code, except that SS3(B|A,AB) is 0.167486 instead of 0.15075. For this reason I again suggest updating your R to the latest available version.

Hope this helps :)

A<-as.factor(rep(c("male","female"), each=5))
set.seed(1)
B<-runif(10)
set.seed(5)
y<-runif(10)


# Create a dummy vector of 0s and 1s
dummy <- as.numeric(A=="male")

# Create the design matrix
R <- cbind(rep(1, length(y)), dummy, B, dummy*B)

# Estimate the regressors
bhat <- solve(t(R) %*% R) %*% t(R) %*% y
yhat <- R %*% bhat
ehat <- y - yhat

# Sum of Squares Total
# SST <- t(y)%*%y - length(y)*mean(y)**2
# Sum of Squares Error
# SSE <- t(ehat) %*% ehat
# Sum of Squares Model
# SSM <- SST - SSE

# used for ginv()
library(MASS)

# Returns the sum of squares for the hypothesis test contained in the C matrix
SSH_estimate <- function(C)
{
    teta <- C%*%bhat
    M <- C %*% ginv(t(R)%*%R) %*% t(C)
    SSH <- t(teta) %*% ginv(M) %*% teta
    SSH
}

# SS(A|B,AB)
# 0.001481682
SSH_estimate(matrix(c(0, 1, 0, 0), nrow=1, ncol=4))
# SS(B|A,AB)
# 0.167486
SSH_estimate(matrix(c(0, 0, 1, 0), nrow=1, ncol=4))
# SS(AB|A,B)
# 0.01627824
SSH_estimate(matrix(c(0, 0, 0, 1), nrow=1, ncol=4))
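
As a cross-check of the "same design matrix" point, here is a minimal sketch (the helper name SS3_drop_column is just for illustration; it reuses the R, y, and ehat objects defined above) that obtains each SS3 as the increase in residual sum of squares when the corresponding column is deleted from the full-model design matrix. For a single coefficient this is algebraically the same quantity that SSH_estimate() returns.

# Delete one column from the FULL design matrix, refit, and take
# SSE(reduced) - SSE(full); the remaining (conditioning) columns are
# exactly those of the full model.
SS3_drop_column <- function(col)
{
    Rr <- R[, -col, drop = FALSE]                  # reduced design matrix
    bhat_r <- solve(t(Rr) %*% Rr) %*% t(Rr) %*% y  # reduced-model coefficients
    ehat_r <- y - Rr %*% bhat_r                    # reduced-model residuals
    drop(t(ehat_r) %*% ehat_r - t(ehat) %*% ehat)  # SSE(reduced) - SSE(full)
}

SS3_drop_column(2)  # SS(A|B,AB)  -- should match SSH_estimate() above
SS3_drop_column(3)  # SS(B|A,AB)
SS3_drop_column(4)  # SS(AB|A,B)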