Multiple Logistic Regression with many coefficients – How to interpret non-significance and do I need to correct

bonferroni · logistic · multiple-regression · regression

This is an extension of a previous question I asked (Logistic Regression – between 2 unrelated treatments?), but it's a separate question entirely, hence the separate post.

Context: I'm testing the effects of various email interventions on getting people to sign up for a financial literacy event, benchmarking them against a default (control) intervention. The outcome variable is sign-up (Yes or No), so it's a binary outcome and I'm running a logistic regression.

For clarity:

  • Control (T0) Email: A default email that the sponsoring agency has been using for years. ("Please sign up!")
  • Treatment 1 (T1) Email: Same email, but adds an extra 'note'. (e.g., "By the way, did you know that…..")
  • Treatment 2 (T2) Email: Same email again, with another type of 'note'.
  • Treatment 3 (T3) Email: Same email again, with another type of 'note'.
  • Treatment 4 (T4) Email: Same email again, with another type of 'note'.

Further to that, I have demographic variables for age, income level, and gender. I don't have a clear idea as to whether the log-odds of sign-up will vary linearly with age or income, so I decided to dummy code all of them. The picture below sums up the variables I've listed.

[Image: summary table of the variables]

As you can see, there are several variables, and after hand-coding dummies for the individual terms (and their interaction terms, for which I'm considering only age and income levels, given the evidence in the literature), I'm looking at ~40 coefficients. The picture below shows a sample (the interaction terms for T2, T3, and T4 are not shown).

[Image: sample of the hand-coded coefficients]

So my questions now are:

  1. Sanity Check: Am I on the 'right path' as far as capturing interaction terms is concerned? Anything I'm missing out?
  2. If I run a multiple logistic regression model, some coefficients will clearly be non-significant. I understand that it is still important to report them for integrity's sake, but for practical purposes, how would I calculate the log-odds? Do I simply pretend that the non-significant coefficients don't exist and use only the significant ones?
  3. Do I need to do any Bonferroni correction, since there are 40 coefficients and hence 40 p-values?

Appreciate the guidance.

Best Answer

Since what you care about is predicting which email type is best suited to a particular demographic, don't worry about "statistical significance"; that answers question 3. Statistical significance is often just a measure of your data-set size: a predictor might be important in practice but still not pass the magical "p < 0.05" threshold on your data.

From that perspective, the answer to question 2 is also simple: use the entire model for predictions, if the model is well validated. With logistic regression, leaving out any predictor associated with outcome can lead to omitted-variable bias.
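To make that concrete, here is a toy sketch (the coefficient values and term names are made up, not taken from your model) showing that a prediction sums every fitted coefficient, significant or not, before applying the inverse-logit:

```python
import math

# Hypothetical fitted coefficients, including a "non-significant" interaction
coefs = {"intercept": -1.2, "T1": 0.40, "age_31_50": 0.15, "T1:age_31_50": -0.05}
# Indicator values for one person: received T1 email, aged 31-50
x = {"intercept": 1, "T1": 1, "age_31_50": 1, "T1:age_31_50": 1}

log_odds = sum(coefs[k] * x[k] for k in coefs)  # keep ALL terms
prob = 1 / (1 + math.exp(-log_odds))            # inverse-logit
print(round(prob, 3))  # 0.332
```

Dropping the "non-significant" interaction term here would shift the predicted probability, which is exactly the omitted-variable problem: the remaining coefficients were estimated jointly with it.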

With respect to question 1, you are essentially on the right path. Note that so far you only seem to be including 2-way interactions among the predictors, and you don't seem to be including a 2-way interaction of gender with age or income. In principle you could include higher levels of interaction, too. Make sure that any such omissions are conscious decisions rather than oversights. That's one advantage of letting software set up your data coding instead of trying to do it by hand.
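As a sketch of the "let software set up your coding" point, assuming hypothetical column names (`signup`, `treatment`, `age_group`, `income_group`, `gender`) and simulated data, statsmodels' formula interface expands all the dummies and interactions for you:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "treatment": rng.choice(["T0", "T1", "T2", "T3", "T4"], n),
    "age_group": rng.choice(["18-30", "31-50", "51+"], n),
    "income_group": rng.choice(["low", "mid", "high"], n),
    "gender": rng.choice(["F", "M"], n),
})
df["signup"] = rng.binomial(1, 0.3, n)  # placeholder outcome

# 'treatment * (age_group + income_group)' expands to all main effects plus
# the treatment-by-age and treatment-by-income interactions, with reference
# levels dropped automatically to keep the design matrix full rank.
fit = smf.logit(
    "signup ~ treatment * (age_group + income_group) + gender", data=df
).fit(disp=0)
print(len(fit.params))  # 26: intercept + 4 + 2 + 2 + 1 + 8 + 8
```

Note that this formula, like your hand-coded version, deliberately omits gender interactions; adding `* gender` terms is a one-line change rather than another round of hand coding.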

If you have values for age and income, it's preferable to treat them as continuous predictors instead of binning them. If you can't assume that they are linearly related to log-odds, methods like regression splines can handle nonlinearity in a flexible way that is informed by the data.
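A sketch of the spline idea, assuming you have raw ages rather than bins (the data here are simulated): `bs()` is the B-spline basis from patsy's formula language, and the model remains an ordinary logistic regression because it is still linear in the coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1500
age = rng.uniform(18, 80, n)
# Simulated sign-up probability that is nonlinear in age
p = 1 / (1 + np.exp(-(-1.0 + 0.001 * (age - 50) ** 2)))
df = pd.DataFrame({"age": age, "signup": rng.binomial(1, p)})

# df=4 requests 4 spline basis columns, letting the data choose the shape
# of the age effect instead of forcing linearity or arbitrary bins
fit = smf.logit("signup ~ bs(age, df=4)", data=df).fit(disp=0)
print(len(fit.params))  # 5: intercept + 4 basis columns
```

With only 4 spline coefficients you often spend fewer degrees of freedom than full dummy coding of age bins crossed with treatments, while capturing more of the real pattern.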

The danger is that with too many interactions you might be overfitting your data. The usual rule of thumb to avoid overfitting with logistic regression is to have about 15 cases in the minority class (presumably the sign-up class in your case) for each coefficient in your model. I count 44 individual and 2-way interaction coefficients beyond the intercept, so you should have 600-700 sign-ups in your data set to handle even that level of model complexity; to evaluate higher-order interactions you would need more sign-ups.
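The arithmetic behind that sample-size estimate is simply:

```python
coefficients = 44           # individual + 2-way interaction terms, beyond the intercept
events_per_coefficient = 15 # rule of thumb for the minority (sign-up) class
required_signups = coefficients * events_per_coefficient
print(required_signups)  # 660, i.e. the 600-700 range quoted above
```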

If you are still in the study-design stage, use that as an estimate of how large a study to conduct. Otherwise, if there aren't enough sign-ups to handle that number of coefficients, scale back your model complexity and use your knowledge of the subject matter to identify the most critical combinations of predictors to evaluate. Alternatively, try a learning method like boosted trees, although the resulting model might not be easy to interpret intuitively.

Frank Harrell's course notes and book are useful resources for principles of regression modeling. An Introduction to Statistical Learning presents other modeling approaches.
