Logistic regression assumptions for a model with many binary independent variables

Tags: assumptions, generalized linear model, interaction, logistic, regression

I am developing a logistic regression model that uses qualitative variables only ($n=990$). My remit is to define the equation that identifies the most relevant characteristics of a survey respondent who is favorable towards Company X. The proposed equation is similar to the following:
\begin{align}
{\rm Fav} = &a*{\rm age} + b*{\rm CoAware} + c*{\rm IssueAware} + d*{\rm readnewspaper} + e*{\rm region} + \\
&f*{\rm income}\ldots
\end{align}
The dependent variable is "Company Favorability" (0 = Unfavorable/Neither | 1 = Favorable). There are currently 25 independent variables, 20 of which are binary IVs that range from highly correlated with the DV (awareness of Company) to not significant (gender). I also have 5 categorical variables that indicate region of the country, age (in categories), party affiliation, income level, and education.

I am almost certain that I need to use a logistic regression model for this approach. However, when I test my assumptions, I am having a very difficult time proving a linear relationship between the dichotomous independent variables and the logit transformation of the DV.

My other problem is that I am somewhat overwhelmed by possible interaction effects. There are 34 possible options using 25 variables, which leads to over 50 million possible combinations.

I have three questions:

  1. Is there a better method to model with a binary dependent variable?
  2. Am I missing something in the assumptions? (i.e., do I indeed need to prove the linear relationship if all of my variables are dichotomous?)
  3. Would it be better to approach this by looking at multicollinearity first, to reduce the number of variables overall, and then look at linear relationships with the logit of the DV?

Best Answer

If all your regressor variables are binary, the linearity assumption is vacuous and can be ignored: a binary variable takes only two values, and any relationship through two points is linear. But you say you have variables like age, which is not binary. For such a variable you can use a spline of age instead of age directly, which leads to GAMs (generalized additive models) or to regression splines. I have found this useful when there are one or a few such variables.
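
To make the point about binary regressors concrete, here is a minimal sketch in plain Python. The counts are hypothetical (chosen only so that they sum to $n=990$) and are not taken from the question. With a single binary regressor $x$, the model $\text{logit}(p) = b_0 + b_1 x$ has two parameters for two groups, so the maximum-likelihood fit reproduces both observed proportions exactly and there is nothing left for a linearity check to detect.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical counts (NOT from the question): favorable respondents and
# group sizes for x = 0 and x = 1. The group sizes sum to 990.
fav0, n0 = 120, 400
fav1, n1 = 354, 590

p0 = fav0 / n0  # observed proportion favorable when x = 0
p1 = fav1 / n1  # observed proportion favorable when x = 1

# For one binary regressor the model is saturated, so the maximum-likelihood
# estimates are simply the group log-odds and their difference.
b0 = logit(p0)
b1 = logit(p1) - logit(p0)

fitted0 = inv_logit(b0)       # fitted probability for x = 0
fitted1 = inv_logit(b0 + b1)  # fitted probability for x = 1

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"x=0: observed {p0:.3f}, fitted {fitted0:.3f}")
print(f"x=1: observed {p1:.3f}, fitted {fitted1:.3f}")
```

With several binary regressors in the same model, linearity in each one is still automatic. What remains an assumption is additivity on the logit scale, and that is the question interaction terms address.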

Then decide which interactions you consider plausible, and start your model with those.
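
As a rough illustration of why a shortlist of plausible interactions is more practical than an exhaustive search, the sketch below counts interaction terms. It assumes each of the 25 variables counts as a single term; categorical variables with several levels would add more. The shortlist uses variable names from the question's equation, but which ones to include is a hypothetical choice, and the `a:b` strings are only labels for the interaction terms.

```python
from itertools import combinations
from math import comb

n_vars = 25  # number of independent variables stated in the question

# Counting each variable as one term (an assumption; multi-level
# categorical variables would contribute more terms than this).
two_way = comb(n_vars, 2)            # 300 pairwise interactions
three_way = comb(n_vars, 3)          # 2300 three-way interactions
all_orders = 2**n_vars - n_vars - 1  # 33554406 interactions of order >= 2

# Hypothetical shortlist of variables judged plausible to interact.
shortlist = ["CoAware", "IssueAware", "readnewspaper", "region"]

# Pairwise interaction labels for the shortlist only.
candidate_terms = [f"{a}:{b}" for a, b in combinations(shortlist, 2)]

print(two_way, three_way, all_orders)
print(candidate_terms)
```

Four shortlisted variables give only six pairwise terms, which is a model that $n=990$ observations can reasonably support.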