Solved – Specification and interpretation of interaction terms using glm()

generalized linear model, interaction, interpretation, r

I am fitting a logistic model to data using the glm function in R. I have attempted to specify interaction terms in two ways:

fit1 <- glm(y ~ x*z, family = "binomial", data = myData) 
fit2 <- glm(y ~ x/z, family = "binomial", data = myData)

I have 3 questions:

What is the difference between specifying my interaction terms as x*z compared to x/z?

When I call summary(fit1) the report includes results for the intercept, x, z, and x:z while summary(fit2) only includes results for intercept, x, and x:z.

I did look at Section 11.1 in "An Introduction to R" but couldn't understand the meaning.

How do I interpret the fit equation mathematically? Specifically, how do I interpret the interaction terms formulaically?

Moving to math instead of R, do I interpret the equation as:

logit(y) = (intercept) + (coeff_x)*x + (coeff_z)*z + (coeff_xz)*x*z ?

This interpretation may differ in the two specifications fit1 and fit2. What is the interpretation of each?

Assuming the above interpretation is correct, how do I fit the model of x*(1/z) in R? Do I need to just create another column with these values?

Best Answer

x/z expands to x + x:z, which is why summary(fit2) reports no separate main effect for z. So far I have used this form only to model nested random effects.
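Written out for numeric x and z (an assumption that matches the simulated example in this answer), with $p = P(y = 1)$, the formulas correspond to the following linear predictors. The $\beta$ symbols are notation introduced here for the fitted coefficients, not names used by R.

```latex
\begin{align*}
\text{\texttt{x*z}:}\quad & \operatorname{logit}(p) = \beta_0 + \beta_x\,x + \beta_z\,z + \beta_{xz}\,x z \\
\text{\texttt{x/z}:}\quad & \operatorname{logit}(p) = \beta_0 + \beta_x\,x + \beta_{xz}\,x z \\
\text{\texttt{I(x/z)}:}\quad & \operatorname{logit}(p) = \beta_0 + \beta_1\,\frac{x}{z}
\end{align*}
```

In the x/z model, a unit change in x shifts the log-odds by $\beta_x + \beta_{xz} z$, while z has no effect at all when x = 0.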

set.seed(42)
x <- rnorm(100)
z <- rnorm(100)
y <- sample(c(0,1),100,TRUE)

fit2 <- glm(y ~ x/z, family = "binomial") 
fit3 <- glm(y ~ x + z %in% x, family = "binomial")
identical(summary(fit2)$coefficients,summary(fit3)$coefficients)
#TRUE
fit4 <- glm(y ~ x + x:z, family = "binomial")
identical(summary(fit2)$coefficients,summary(fit4)$coefficients)
#TRUE

fit5 <- glm(y ~ I(x/z), family = "binomial")    
a <- x/z
fit6 <- glm(y ~ a, family = "binomial")
all.equal(summary(fit5)$coefficients,summary(fit6)$coefficients)
#[1] "Attributes: < Component 2: Component 1: 1 string mismatch >"
#which means that only the rownames don't match, but values are identical
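The expansions can also be checked without fitting anything, by asking R which terms it derives from each formula. This is a minimal sketch using only base R; the object names are illustrative.

```r
# Term labels R derives from each formula (no data needed)
labels_star  <- attr(terms(y ~ x * z), "term.labels")
labels_slash <- attr(terms(y ~ x / z), "term.labels")
labels_arith <- attr(terms(y ~ I(x / z)), "term.labels")

print(labels_star)   # "x" "z" "x:z"
print(labels_slash)  # "x" "x:z"
print(labels_arith)  # a single term: the arithmetic ratio protected by I()
```

Only I() makes the slash mean division; inside a bare formula it is the nesting operator.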

Related Solutions

Solved – Interpretation of interaction terms if main effect is insignificant

Rather than saying the relationship is stronger, I think it's more precise to say that weight increases significantly more quickly with height for males than for females. Strength of relationship would be captured by measures like $R^2$, which are affected not only by the rate of increase of one variable with another, but also by the amount of noise in the data. For example, suppose the data were something like this:

maleheight <- rnorm(1000, 70, 3)
femaleheight <- rnorm(1000, 65, 2.5)
maleweight <- maleheight*2.2 + rnorm(1000, 0, 20)
femaleweight <- femaleheight*1.3 + rnorm(1000, 0, 10)
height <- c(maleheight, femaleheight)
weight <- c(maleweight, femaleweight)
male <- c(rep(1, 1000), rep(0, 1000))
data <- data.frame(height, weight, male)

and the model

m1 <- lm(weight ~ height * male, data = data)  # expands to height + male + height:male
summary(m1)

shows your pattern, but the relationship looks stronger for women.
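One way to see the distinction between rate of increase and strength of relationship is to compute the slope and $R^2$ separately for each group. The sketch below regenerates data of the same shape as above; the seed and the names dat, fit_m, fit_f, and pooled are illustrative choices, not part of the original answer. It also shows that the interaction coefficient in the pooled model is the difference between the two group slopes.

```r
set.seed(1)  # arbitrary seed, chosen here only for reproducibility
n <- 1000
male_height   <- rnorm(n, 70, 3)
female_height <- rnorm(n, 65, 2.5)
dat <- data.frame(
  height = c(male_height, female_height),
  weight = c(male_height * 2.2 + rnorm(n, 0, 20),
             female_height * 1.3 + rnorm(n, 0, 10)),
  male   = rep(c(1, 0), each = n)
)

# Separate fits per group: slope measures rate of increase, R^2 measures strength
fit_m <- lm(weight ~ height, data = dat, subset = male == 1)
fit_f <- lm(weight ~ height, data = dat, subset = male == 0)
slopes <- c(male   = unname(coef(fit_m)["height"]),
            female = unname(coef(fit_f)["height"]))
r2 <- c(male   = summary(fit_m)$r.squared,
        female = summary(fit_f)$r.squared)

# Pooled model: the height:male coefficient is the difference in slopes
pooled <- lm(weight ~ height * male, data = dat)
interaction_coef <- unname(coef(pooled)["height:male"])

print(slopes)
print(r2)
print(interaction_coef)
```

Comparing slopes with r2 side by side makes it clear that a steeper slope does not by itself imply a larger $R^2$.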

Solved – Why doesn’t adding variables to the glmnet lasso model improve fit

First off, this looks like a classification problem, so make sure the type.measure option is set to "class", like so:

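# family = "binomial" assumes a two-class y; "class" is not available for the default Gaussian family
fit2=cv.glmnet(x[1:test,], y[1:test], family = "binomial", type.measure = "class")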

Remember that the Lasso loss function being minimized is the sum of the squared residuals (or, for a classification fit, the binomial deviance) plus lambda*(sum of the absolute values of the coefficients, excluding the intercept). So, if you compare the two models at the same lambda value, they will keep approximately the same number of variables with similar magnitudes, because the cost of a large coefficient is similar in both models. However, when you add 2000 variables and want some of them in the model while also keeping your original significant variables, you need a lower lambda so that the fit is more inclusive.
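Written out, the penalized objective looks as follows. The scaling factors $1/(2N)$ and $1/N$ follow the convention in the glmnet documentation and do not change the argument; $\beta_0$ is the unpenalized intercept, $\beta$ the coefficient vector, and $N$ the number of observations.

```latex
% Gaussian response (squared-error loss)
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Two-class response (negative binomial log-likelihood)
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i\bigl(\beta_0 + x_i^{\top}\beta\bigr)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

In both cases the penalty grows with $\lambda$, so a larger $\lambda$ drives more coefficients to exactly zero.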

If some of the variables you are including are indeed significant, then your fit2 may fit worse than fit1 because the 2000 variables you are introducing are valuable in predicting y, but not as valuable as the variables in fit1. So, if the lambdas for both models are similar, the difference comes from sometimes including variables from the 2000 that are good but not as good as the originals they replace (they appear more important to the Lasso because your training sample differs slightly from the population as a whole). With so many new variables added, the probability that at least one of them appears more significant in a random sample than it should is high. In a shrinkage algorithm like the Lasso, this could seriously affect the results. Additionally, if some of the significant variables are highly correlated, then in a random sample some could go to zero if the correlated variable is more prevalent in the sample than in the population.

So, you likely want to switch to the class measure if you haven't already. Beyond that, the search over lambda may not include a small enough value for the fit2 model. Consider creating your own grid for lambda and running cv.glmnet with that grid. Here is an example you can use:

grid = 10^seq(10, -2, length=100)
fit2=cv.glmnet(x[1:test,], y[1:test], family = "binomial", type.measure = "class", lambda = grid)
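Before passing a custom grid to cv.glmnet, it can help to inspect what it covers. The sketch below uses only base R. The second grid, grid_small, is a hypothetical variant with an arbitrarily chosen lower bound of 1e-4, shown only to illustrate how to reach smaller penalties.

```r
# Grid from the answer: 100 log-spaced values from 1e10 down to 1e-2
grid <- 10^seq(10, -2, length.out = 100)
print(range(grid))          # smallest and largest lambda in the grid
print(all(diff(grid) < 0))  # TRUE: the sequence is strictly decreasing

# Hypothetical variant reaching smaller penalties (the bounds are arbitrary)
grid_small <- 10^seq(2, -4, length.out = 100)
print(min(grid_small) < min(grid))  # TRUE: extends below the original grid
```

If the cross-validated minimum lands on the smallest value in the grid, that is a hint the grid should be extended further down.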
