Detecting interactions in large logistic regression models

feature selection, interaction, large data, logistic, multiple regression

I have a dataset of a few million observations of a binary response with a low average "success" probability of 1% to 2%. The dataset comprises roughly 20 categorical variables (some with up to 50 categories) and roughly 10 numerical variables. I fitted a main-effects logistic Generalised Linear Model (GLM) as a baseline and a Gradient Boosted Tree (GBT) model. The GBT clearly outperforms the GLM as measured by log loss on a test set. Yet both models seem quite similar when comparing only their marginal effects (judged by eyeballing suitable plots). So one potential reason for the outperformance of the GBT over the GLM may be that it captures interactions. I would like to verify this and ideally find (some of) those interactions.

My question

  1. What are possible ways to find interactions in the GBT model?
  2. More generally: where can I find information on the state of the art with respect to finding interactions, and on what is currently doable and what is not?

My goals are quite pragmatic:

  • I don't need to find ALL interactions; a few "important" ones would be a great start.
  • I do not need to run hypothesis tests on the findings.
  • But methods need to be implementable given "standard" computational resources.

My attempts so far

  • Given the size of the dataset and the number of inputs, any exhaustive search method such as stepwise regression seems futile.
  • The same problem arises with selection by regularization, such as the lasso, particularly since sparse design matrices are not possible due to the numerical inputs.
  • I am aware of, but have not yet tried, Friedman's H-statistic. The problems I see there: it is based on variance decomposition rather than log loss; it is also a kind of exhaustive search, feasible (at best) only for pairwise interactions; and its estimates are based on permutations, while some of the inputs show strong dependence.
  • The dataset is complex, and there is no a priori reason why interactions should be limited to pairs. My success at "guessing" interactions based on general domain knowledge and verifying them by inclusion in the GLM has been limited.

Best Answer

From what I understand, there aren't that many variables compared to observations, and the sheer number of observations can be burdensome for many common approaches. And the goal is to actually find the interactions. It is important to keep in mind that finding three- or even four-way interactions relies heavily on the number of instances of the minority class, since:

  • detecting interactions requires considerably more data than estimating main effects alone. See this SO answer about this.
  • models with a binary response have a suggested maximum number of variables that depends on the sample size of the minority class. See this SO answer.

With all that said, this is how I would approach the problem. It isn't rigorous in and of itself, but it's principled.

Shallow trees

There is some background on using decision trees to find interactions, such as CHAID trees. I wouldn't go after an actual $\chi^2$-computing algorithm, since those tend to be slow. I would:

  1. maybe lump infrequent categories into an "other" level, just for stability
  2. divide my sample into a few sets, perhaps preserving class proportions, perhaps not, depending on the results
  3. fit a shallow decision tree on each set, growing it fully and then pruning it back
  4. compare the most common leaves and see if any pattern emerges

I'd be looking for common variables that end up together, common ranges of continuous variables, etc. This would hint at which variables I should test in an actual logistic regression model. Remember that a terminal leaf is just presenting you with an elaborate indicator variable, i.e. in this leaf live the observations that had $X_1 > k_1 \text{ and } X_2 == 1 \text{ and } X_3 > k_3$. This is just describing an interaction between those three variables.
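The steps above can be sketched as follows, on made-up stand-in data; the split counts, depth, and pruning strength are all assumptions you would tune:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=20_000, n_features=6,
                           weights=[0.98], random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]

# Step 2: divide the sample, here preserving class proportions.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for i, (_, idx) in enumerate(skf.split(X, y)):
    # Step 3: a shallow, pruned tree on each split.
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=200,
                                  ccp_alpha=1e-4, random_state=0)
    tree.fit(X[idx], y[idx])
    # Step 4: print the decision rules; variable combinations that recur
    # across splits are candidate interactions to try in the GLM.
    print(f"--- split {i} ---")
    print(export_text(tree, feature_names=feature_names))
```

Each printed root-to-leaf path is exactly the elaborate indicator variable described above.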

Grouped lasso

OK, I see your angle, but the lasso was literally built to help us find sparse effects: I have a bunch of potential variables in the model, and I want just a few to be included. In this case, specifically, I would work with the group lasso, penalizing two-way interactions and even more strongly three-way interactions, while leaving the main effects unregularized (this is why you need the group lasso). Pick the regularization hyperparameters conservatively, so that you can be more confident that the terms found by the optimization are not noise.

Again I'd split my sample and compare the results between them.

WOE encoding or other target encoding transformation

This is an idea to make all categorical variables numeric and study only numeric interactions: turn each category into a number that is a function of the target prevalence in that category (such as the proportion or the log-odds). To avoid spurious findings I'd add noise to those variables. Read more about it here, here or here. Again, study the statistically significant terms just as a glance at which categorical feature interacts with what, be it other categorical or numerical features. Lasso regularization can be helpful here as well.

The same idea applies here: divide the dataset and see what consistently comes up.
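A minimal sketch of this kind of smoothed, WOE-style encoding with added noise, on made-up data (the smoothing strength `alpha` and the noise scale are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: one categorical feature, rare binary target (~2% positives).
df = pd.DataFrame({
    "cat": rng.choice(list("ABCDE"), size=50_000),
    "y": (rng.random(50_000) < 0.02).astype(int),
})

# Smoothed per-category log-odds (weight-of-evidence-style encoding).
prior = df["y"].mean()
alpha = 20  # smoothing strength toward the global prior
counts = df.groupby("cat")["y"].agg(["sum", "count"])
p = (counts["sum"] + alpha * prior) / (counts["count"] + alpha)
woe = np.log(p / (1 - p))

# Map back and add a little noise to discourage spurious exact splits.
df["cat_woe"] = df["cat"].map(woe) + rng.normal(0, 0.01, len(df))
print(df[["cat", "cat_woe"]].head())
```

With every feature numeric, interaction screening reduces to studying products of numeric columns, e.g. via the lasso expansion above.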

Finally, don't look into higher-degree interactions. Even focusing on three-way interactions is pushing it: with a minority class of 1% out of a few million observations, any higher-order interaction you find is still likely to be noise.

Conclusion

The boosted tree is already doing a lot of this heavy work for you, but it leaves the result inside the black box that is the swarm of trees it averages over. I'm just suggesting a few ideas for exploring interactions more closely. Do compare any of the results with the feature importances gathered from the GBT model to confirm the interaction.

Finally, all my approaches may help you find the interactions that show up consistently across the data replications and help you sort them out. I would still check the benefit of adding them to the final model, be it through cross-validation or through more statistically sound methods such as likelihood-ratio tests. However, I wouldn't expect the GLM to outperform the GBT, since the GBT is literally searching over interactions, and this is very powerful for binary outcomes.
