Solved – Creating interaction terms for regression

interaction, regression

The software I'm using for regression analysis doesn't allow for the direct inclusion of function calls or interaction terms in the formula call. Is there anything wrong with creating a new variable that is the product of the two variables I want to interact? I would then include both the original variables and this new variable in the formula call (one is continuous, the other a dummy).

Best Answer

In theory there is no problem with creating an interaction by multiplying the values of two variables componentwise, but in practice there can be, depending on your software. Since this question concerns software of unknown quality--a private package (coded by who knows whom) or, perhaps worse, Excel, as suggested in another answer--this is a legitimate concern.

The problem is that unless the variables have a small range of absolute values near $1$, the values of the interaction can be so incommensurate with those of the original variables that numerical problems arise in least squares procedures.

Numerical problems can be measured and detected through the sensitivity of a least squares solution to small perturbations in the explanatory variables. A standard measure is the condition number of the model matrix: the ratio of its largest singular value to its smallest. The best possible value is $1$, which is attained when the columns of the model matrix are mutually orthogonal and of equal length.
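
For concreteness, here is a quick way to compute it in R (a sketch of my own, separate from the demonstration code at the end; base R's kappa with exact=TRUE returns the same ratio):

set.seed(1)
z <- matrix(rnorm(200 * 3), ncol = 3)  # three nearly orthogonal columns
d <- svd(z)$d                          # singular values
max(d) / min(d)                        # condition number, close to 1
kappa(z, exact = TRUE)                 # the same computation via base R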

Because least squares solutions involve the cross-product $X^\prime X$ of the model matrix $X$, its condition number is the square of that of the model matrix itself. For each power of ten in the condition number of the model matrix, you can therefore expect to lose two decimal digits of precision in the coefficient estimates. Since double-precision calculations carry a little less than $16$ digits (and many computerized least squares algorithms may achieve only $10$ or $12$ digits), expect to run into problems once the condition number exceeds $10^5$ or so. Even worse, such problems might not be obvious. (They become obvious once the condition number exceeds $10^8$, for then all precision is lost.)
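
The squaring is easy to verify directly (again a small sketch of my own, using base R's kappa):

set.seed(2)
X <- matrix(rnorm(200 * 3), ncol = 3)
X[, 3] <- X[, 1] + 0.01 * X[, 3]   # make two columns nearly collinear
kappa(X, exact = TRUE)             # condition number of X
kappa(crossprod(X), exact = TRUE)  # exactly its square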

Consider two variables with typical values around $a$ and $b$. Their product will have typical values near $ab$. When either $|a|$ or $|b|$ is large (compared to $1$), the product's values will be much larger than at least one of the variables. That can create large condition numbers.
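
A back-of-the-envelope check (my own illustration, with values like those in the example below) shows the disparity:

a <- 5e9; b <- 5e9  # typical values of two income-like variables
c(a, b, a * b)      # the interaction, 2.5e19, dwarfs both factors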


Let me demonstrate with a simple example. As a control, consider three independent variables with values uniformly distributed between $10^9$ and $10^{10}$. They could be annual incomes of medium-sized corporations in Euros or dollars, for instance. In this simulated dataset of $200$ observations, their condition number comes out to $4.15$, which is fine.

When we replace the third variable with the product of the first two (their interaction), the condition number blows up to $1.56\times 10^{10}$. Because its square exceeds $10^{20}$, that will wipe out all the information in double precision least squares calculations!

If instead we rescale the first two variables to small values before computing their interaction, this simple expedient causes the condition number to drop all the way to $1.11$: no problem at all. Rescaling is the same as a change of units: for instance, re-express corporate incomes in billions of dollars instead of dollars. (Those experienced in data analysis reflexively express their data in appropriately small, commensurable units and so rarely run into such numerical problems.)

We can also compute the least squares fits in three ways: using the built-in procedure (for reference), using rescaled variables, and using the original values. In this example the latter fails with the error message

system is computationally singular: reciprocal condition number = 4.088e-21

Indeed, this number is the reciprocal square of $1.56 \times 10^{10}$, just as claimed above.
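
A one-line arithmetic check (my addition) confirms it:

signif((1.56e10)^-2, 4)  # 4.109e-21; the small gap from 4.088e-21 is rounding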

The moral is to scale (or standardize) your variables before creating interactions. Then you'll (usually) be fine.


Those familiar with good statistical computing packages may object that those packages routinely standardize the variables for internal calculations. That is only partially true. For instance, this does not take place in some (many?) of the otherwise sophisticated add-on regression packages for R, even though R's workhorse built-in function lm is numerically robust.


This R code reproduces the results reported here. It gives you something to experiment with further if you like. Try it with x.max set to a small number to verify that all the approaches work. Then try it with x.max set to any value greater than $5\times 10^7$ or so. Interestingly, the preliminary scaling automatically costs you up to eight significant figures no matter what value x.max might have, large or small--but it doesn't fail when x.max gets large.

#
# Compute and round a condition number.
#
cn <- function(x) {
  d <- svd(x)$d              # singular values of the matrix
  signif(max(d) / min(d), 3) # ratio of largest to smallest, rounded
}
#
# Create innocuous independent variables.
#
n <- 200       # Number of observations
x.max <- 1e10  # Maximum values
set.seed(17)
x <- matrix(runif(3*n, x.max/10, x.max), ncol=3)
#
# The control case uses three (statistically) independent variables.
#
cat("Condition number for three variables is", cn(x))
#
# Instead make the third variable the product of the first two.
#
interact <- function(x) cbind(x[, 1:2], x[,1]*x[,2])
cat("Condition number for two variables + interaction is", cn(interact(x)))
#
# Demonstrate that standardizing the original two variables cures the problem.
#
cat("Condition number for two rescaled variables + interaction is", 
    cn(interact(scale(x, center=FALSE))))
#------------------------------------------------------------------------------#
#
# Compute the OLS fit with the built-in `lm` function (via QR decomposition).
#
y <- x %*% c(1,2,1)     # y = x[,1] + 2*x[,2] + x[,3]; the third column acts as the error
u <- interact(x)
y.lm <- predict(lm(y ~ u - 1))
#
# Compute the OLS fit directly using scaled variables.
#
u.scale <- interact(scale(x, center=FALSE))
y.hat <- u.scale %*% solve(crossprod(u.scale), crossprod(u.scale, y))
cat ("RMSE error with scaled variables is", signif(sqrt(mean((y.lm-y.hat)^2)), 3))
#
# Compute the OLS fit directly with the original variables.
# This emulates many procedures, including those in Excel and some add-ins
# in `R`.
#
y.hat.direct <- u %*% solve(crossprod(u), crossprod(u, y))
cat("RMSE error with unscaled variables is", signif(mean((y.lm-y.hat.direct)^2), 3))