Logistic Regression – How to Choose Between Multiple and Univariate Logistic Regression

logistic, multiple-regression, regression, statistical-significance, univariate

I have a data set (~90 cases) with the outcome of a diagnostic test. I have collected factors, determined before the test, that could predict its outcome. Some of the data are binary, some are continuous (lab tests), and one is categorical (the original diagnosis leading to the symptom, with 5 categories).

The statistician ran (in SPSS) a multiple logistic regression on most of the parameters, followed by backwards stepwise selection. This produced a model in which one of the factors was strongly significant (p < 0.001) and the others were not (two of them in the 0.01-0.05 range). The natural question from my colleagues was why this or that possible factor was not included, and whether "these two factors are actually significant, or can we say nearing significance". I must say I don't like interpreting "nearing significance", but I have to work with my colleagues. I asked the statistician to perform another analysis, this time with the full set of factors, and the result was different: only one of the factors (the most significant one) stayed in the new model and is still strongly significant, but there are other factors, also in the 0.01-0.05 range, which brings me back to the "nearing significance" problem. I also tried R and Rcmdr with the same data, and stepwise selection based on AIC or BIC produces yet other results (only the one strong factor remains the same).

Now I see that, given that the various stepwise selection methods produce different models containing insignificant factors, the presence of those factors in any particular model is essentially arbitrary. That's why I would prefer not to include them. Most of all, I don't want to interpret their presence in the model, while my colleagues think I should because they are "nearing significance".

Question 1: To make things simpler, is it possible to run a "battery" of univariate logistic regressions, one per variable? This would solve both my problems: I would have one simple significant model with the most significant variable, and I wouldn't have to interpret "nearing significance" in the others.
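For reference, a minimal sketch of what such a univariate battery would compute, on made-up data with a hand-rolled fit (in practice this would be one call to R's `glm(outcome ~ x, family = binomial)` per predictor; all variable names and numbers below are hypothetical, and the answer below argues against using such screening for variable selection):

```python
import math

def fit_univariate_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent
    on the log-likelihood (a stand-in for glm's maximum likelihood fit)."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p           # gradient w.r.t. the intercept
            g1 += (yi - p) * xi    # gradient w.r.t. the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data: diagnostic test outcome and two candidate predictors.
outcome = [0, 0, 1, 1, 1, 1, 0, 0]
predictors = {
    "lab_value": [1.2, 0.8, 3.1, 1.0, 2.9, 3.5, 1.4, 2.7],
    "age":       [40, 55, 47, 61, 52, 49, 58, 45],
}

for name, x in predictors.items():
    b0, b1 = fit_univariate_logistic(x, outcome)
    print(f"{name}: intercept={b0:.2f}, slope={b1:.2f}")
```

Each fit is separate, so nothing here accounts for correlation between predictors, which is part of why screening like this misleads.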

Question 2: Is there a correction of the significance level for multiple logistic regression, as there is for multiple comparisons, so that I can deal with the "nearing significance" argument?
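On Question 2: the standard multiple-comparison adjustments (Bonferroni, Holm) apply to any collection of p-values, including Wald p-values from a logistic model, though they do not repair the distortions introduced by stepwise selection. A minimal sketch with made-up p-values (in R this is `p.adjust`):

```python
def bonferroni_adjust(pvals):
    """Bonferroni: multiply each p-value by the number of tests (cap at 1)."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm_adjust(pvals):
    """Holm step-down adjustment: uniformly more powerful than Bonferroni,
    still controls the family-wise error rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Hypothetical Wald p-values for four factors in the full model.
pvals = [0.0004, 0.02, 0.04, 0.30]
print(bonferroni_adjust(pvals))  # [0.0016, 0.08, 0.16, 1.0]
print(holm_adjust(pvals))        # [0.0016, 0.06, 0.08, 0.30]
```

With these (invented) numbers, only the strong factor stays below 0.05 after adjustment; the "nearing significance" ones do not, which is one concrete reply to the colleagues' argument.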

Best Answer

It is obviously not correct to rerun models until by chance they give what you expect...

Ideally, the model structure (i.e. selection of predictors, transformations, interactions) is chosen based on a number of points before computing the model:

  • Expert knowledge such as publications and research questions. This includes thinking.

    Example: if you are mainly interested in the effect of a particular variable $X$ on the response $Y$ and you know that $X$ has a strong causal effect on $Z$, then it would be quite stupid to include $Z$ in the model along with $X$, because this would partially hide the effect of $X$ on $Y$. That may be the case in your model.

  • Univariate distributions of variables (e.g. excluding potential predictor "sex" if there is only one male or if most values are missing; log-transform some right-skewed variables with outliers if it makes scientific sense etc.)
  • Bivariate distributions of the predictors (e.g. if both potential predictors "age" and "experience" are highly correlated, it might suffice to include just one of the two)
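The age/experience check in the last bullet can be done before any model is fit. A minimal sketch with hypothetical numbers (in R this is simply `cor(age, experience)`; the 0.8 cutoff is an arbitrary screening threshold, not a rule):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical values: experience tracks age almost perfectly.
age = [25, 32, 41, 50, 58]
experience = [2, 8, 17, 27, 34]

r = pearson(age, experience)
if abs(r) > 0.8:
    print(f"r = {r:.3f}: highly correlated; consider keeping only one")
```

Note that this uses only the predictors, not the response, consistent with the point below about not peeking at the outcome during model specification.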

The hidden message of the above points: don't look at the association between the response variable and the potential predictors at this stage; doing so tends to bias the model toward your expectations. This also answers Question 1: such univariate screening is not suitable for variable selection. It might still be part of the analysis as a complement to the multivariate model, depending on the research question.

The answer to Question 2 depends on the research question or the objective of the analysis:

  • You could, for instance, be interested in testing some specific hypotheses. Then "borderline significance" and the multiple testing problem become an issue.
  • And/or you might want a good predictive model. Then cross-validation (or similar) of the model's performance is of much higher relevance than p values.
  • And/or you might be interested in estimating the effects of some particular predictors.
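For the predictive-model case, the cross-validation bookkeeping is simple even without a library. A sketch of 5-fold splitting for a data set of ~90 cases (the fold count and seed are arbitrary choices):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With 90 cases and 5 folds, each held-out set has 18 cases; the model is
# refit on the other 72 and scored (e.g. log-loss or AUC) on the held-out 18.
for train, test in kfold_indices(90, 5):
    print(len(train), len(test))
```

Importantly, any stepwise selection would have to be repeated inside each training fold; cross-validating only the final, already-selected model gives an optimistic performance estimate.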