Controlling for Variables – Is It the Same as Feature Importance and SHAP Values in Regression via XGBoost?

computational-statistics, confounding, controlling-for-a-variable, machine-learning, regression

There are a few posts going over the fact that "controlling for variables" in traditional stats involves building a regression model and including possible covariates in the model.

An example would be this post:
How exactly does one “control for other variables”?

Would fitting a regression with a more advanced algorithm (e.g., xgboost) and then checking the PDP / feature importance / SHAP values be comparable?

Or do these other methods not give you similar conclusions to looking at coefficients in a regression model?

Best Answer

Let's take a generic model: $y = f(x_1,x_2)$

where $y$ is what we're trying to predict, $x_1$ is the variable we want to use to predict $y$, and $x_2$ is some nuisance variable we wish to account for / control for.

In the case of regression (and many other models), we actually do something that is kind of odd when you think about it: we treat $x_1$ and $x_2$ exactly the same. So, when we interpret the relationship between $y$ and $x_1$, simply having $x_2$ included in the model is "controlling for it."

This leads us to a question: what metric can we use to determine the way in which $x_2$ is affecting our model? By the way you phrased your question, I imagine you are thinking that the coefficient for $x_2$ gives us this information. However, it does not. It tells us about the relationship of $y$ and $x_2$ while treating $x_1$ as the nuisance variable. There actually isn't a "built-in" regression metric that gives us this information. To understand this relationship, we'd need to fit a model with just $y$ and $x_1$, then fit the full model, and look at the difference in the coefficient of $x_1$ between the two models. That difference tells us how controlling for $x_2$ changed the model, as the sketch below illustrates.
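Here is a minimal sketch of that comparison, using simulated data and ordinary least squares via statsmodels. The variable names, simulated coefficients, and sample size are illustrative assumptions, not anything from the question.

```python
# Minimal sketch: compare the coefficient of x1 with and without x2 in the model.
# The data-generating process below is purely illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x2 = rng.normal(size=n)                    # nuisance variable
x1 = 0.6 * x2 + rng.normal(size=n)         # x1 is correlated with x2
y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Model with x1 only
m_small = sm.OLS(y, sm.add_constant(x1)).fit()

# Full model with x1 and x2
X_full = sm.add_constant(np.column_stack([x1, x2]))
m_full = sm.OLS(y, X_full).fit()

print("x1 coefficient, x1 only:", m_small.params[1])
print("x1 coefficient, full   :", m_full.params[1])
# The gap between the two estimates shows how "controlling for" x2
# changes the estimated relationship between y and x1.
```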

I'm not an expert in the other methods you mentioned, but based on a quick overview I would say the metrics you named are more like coefficients than measures of "control/nuisance." Each algorithm has its own quirks in how it handles covariance between regressors/predictors. With gradient-boosted trees (xgboost), it has to work through the underlying decision-tree method, which largely ignores covariance since each split selects one variable at a time. I don't think there's an elegant mathematical way to understand how boosted trees handle control variables; again, you'd just need to fit a model with and without the nuisance variable and look at the differences.
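As a rough sketch of that "fit with and without" comparison for a boosted-tree model, the snippet below reuses the simulated x1, x2, y from above. The xgboost and shap libraries are the ones the question mentions, but the hyperparameters and the mean-|SHAP| summary are just illustrative choices, not a prescribed recipe.

```python
# Rough sketch: compare SHAP attributions for x1 when x2 is or isn't in the model.
# Assumes x1, x2, y from the simulated example above; settings are illustrative.
import numpy as np
import xgboost as xgb
import shap

X_small = x1.reshape(-1, 1)
X_full = np.column_stack([x1, x2])

m_small = xgb.XGBRegressor(n_estimators=200, max_depth=3).fit(X_small, y)
m_full = xgb.XGBRegressor(n_estimators=200, max_depth=3).fit(X_full, y)

# SHAP values for each observation under each model
shap_small = shap.TreeExplainer(m_small).shap_values(X_small)
shap_full = shap.TreeExplainer(m_full).shap_values(X_full)

# How much of the prediction is attributed to x1 with and without x2 present?
# The shift plays a role loosely analogous to the change in the x1 coefficient
# between the two regressions above.
print("mean |SHAP(x1)|, x1 only:", np.abs(shap_small[:, 0]).mean())
print("mean |SHAP(x1)|, full   :", np.abs(shap_full[:, 0]).mean())
```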
