Logistic Regression – Interpreting Results of a Binary Logistic Regression

logisticregression

This is a basic question. I have been handed a binary logistic regression. The model has significant terms, but the goodness-of-fit tests indicate that the logit model is not appropriate. The author of the study argues that the goodness-of-fit results do not invalidate the relationship between the dependent variable and the predictors, only the ability of the model to accurately predict outcomes. The argument is that since we were only interested in verifying a relationship and not its magnitude, the result is conclusive.

I'm skeptical of this. Wouldn't it be more appropriate to say that the lack of fit does not necessarily invalidate the relationships? With a different link function, couldn't the observed and expected counts change enough to move some insignificant term to significance or vice versa?

Best Answer

Changing the link function could change your significance. Ignoring issues of multiple testing, though, only changing to a bad link function will make significance go away, and making the significance go away doesn't make the relationship go away. That is, I don't think that for a binary response you can have a significant result that is not real, even if the model is not of the best form.

The terms will only show as significant if they are significantly explaining variance in the response. They may not be being used in the best possible way to do this, but they are doing something.
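As a rough illustration of the point about link functions, here is a minimal sketch on simulated data. The variables `x` and `y`, the sample size, and the data-generating model are all assumptions made up for this example. It fits the same predictor under three links and compares the Wald p-values.

```r
# Simulated data; x, y and the "true" model are illustrative assumptions.
set.seed(1)
n <- 500
x <- rnorm(n)
# The truth uses a complementary log-log link, so the logit link is misspecified.
p <- 1 - exp(-exp(-0.5 + 0.8 * x))
y <- rbinom(n, 1, p)

links <- c("logit", "probit", "cloglog")
# Wald p-value for the slope of x under each link function.
p_values <- sapply(links, function(lk) {
  fit <- glm(y ~ x, family = binomial(link = lk))
  summary(fit)$coefficients["x", "Pr(>|z|)"]
})
print(p_values)
```

In a setup like this, where the relationship is strong, one would expect the slope to be flagged under every link, with the p-values differing only in size. With a weak effect or a small sample, a term near the threshold could plausibly cross it in either direction.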

One way to see this is to think of your logistic regression not as a model in its own right, but just as an arbitrary data transformation. Suppose someone came along and handed you the responses and the binary predictions of the logistic regression (say, thresholded at 0.5). You now just have a single binary predictor for a binary response: a $2 \times 2$ contingency table. There is no "goodness of fit" to worry about. The only possible model is $Y = X$, and the fit must be good since each prediction is either right or wrong (all structure has been removed by construction!). However, by virtue of the fact that the original logistic regression was significant, it must be the case that the contingency table is significant, and therefore $Y$ is related to $X$. Since $X$ is a function of your original predictor variables, it must also be the case that $Y$ is related to those original predictors.
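The thought experiment can be sketched in a few lines of R. This is only an illustration on simulated data, not a proof of the claim. The variable names, the cubic log-odds, and the use of Fisher's exact test on the table are assumptions chosen for the example.

```r
# Simulated data; names and the data-generating model are illustrative assumptions.
set.seed(2)
n <- 400
x <- rnorm(n)
# The true log-odds are cubic in x, so a straight-line logit model is misspecified.
y <- rbinom(n, 1, plogis(x^3))

fit <- glm(y ~ x, family = binomial)

# Treat the fitted model purely as a transformation: threshold at 0.5.
pred <- as.integer(fitted(fit) > 0.5)

# 2 x 2 contingency table of thresholded prediction against response.
tab <- table(predicted = pred, observed = y)
print(tab)

# Test of association in the table; no goodness-of-fit question arises here.
p_table <- fisher.test(tab)$p.value
print(p_table)
```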

Of course, the converse is not true. Predictors that were not found to be significant could turn out to be significant if a better-fitting model were chosen. Also note that the coefficients of the fit are unlikely to be meaningful. A better-fitted model will have very different (most likely larger) strengths of relationship, and may well find that additional predictors are important. It is only the answer to the simple yes/no question "is this predictor related to the dependent variable?" that must hold even with a poor model fit.

(The above should be caveated that stacking models one after the other like that is a bad idea, and is only meant as a thought experiment.)

Related Solutions

Solved – Investigating robustness of logistic regression against violation of linearity of logit

The linearity assumption is so commonly violated in regression that it should be called a surprise rather than an assumption. Like other regression models, the logistic model is not robust to nonlinearity when you falsely assume linearity. Rather than detecting nonlinearity using residuals or omnibus goodness-of-fit tests, it is better to use direct tests. For example, expand continuous predictors using regression splines and do a composite test of all the nonlinear terms. Better still, don't test the terms and just expect nonlinearity. This approach is much better than trying different single-slope transformations such as square root, log, etc., because statistical inference arising after such analyses will be incorrect: it does not have large enough numerator degrees of freedom to account for the transformations that were tried.

Here's an example in R.

library(rms)
# Assumes y, age, blood.pressure, sex and height already exist in the workspace
f <- lrm(y ~ rcs(age, 4) + rcs(blood.pressure, 5) + sex + rcs(height, 4))
# Fits restricted cubic splines in 3 variables with default knots
# 4, 5, 4 knots = 2, 3, 2 nonlinear terms
Function(f)   # display algebraic form of fit
anova(f)      # obtain individual + combined linearity tests
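For readers without the data behind the example above, here is a self-contained sketch of the same idea on simulated data. Note the substitution: it uses base R's `splines::ns` (natural cubic splines) and a likelihood-ratio test in place of `rms::rcs` and the `anova` method for `lrm`. The variable `age` and the U-shaped risk are assumptions invented for illustration.

```r
library(splines)  # ships with base R

# Simulated data; the U-shaped effect of age is an illustrative assumption.
set.seed(3)
n <- 600
age <- runif(n, 20, 80)
y <- rbinom(n, 1, plogis(-1 + 0.004 * (age - 50)^2))

# Model that falsely assumes linearity in the logit.
fit_linear <- glm(y ~ age, family = binomial)
# Natural cubic spline with 3 df: 1 linear term plus 2 nonlinear terms.
fit_spline <- glm(y ~ ns(age, df = 3), family = binomial)

# Composite likelihood-ratio test of the 2 nonlinear terms.
lrt <- anova(fit_linear, fit_spline, test = "Chisq")
print(lrt)
p_nonlinear <- lrt[2, "Pr(>Chi)"]
```

The linear model is nested in the spline model, so the test has 2 degrees of freedom, one per nonlinear term.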

Solved – Evaluating a logistic regression model

There are a great many tests one can apply to inspect a logistic regression model, and the right choice depends on whether one's goal is prediction, classification, variable selection, inference, causal modeling, etc. The Hosmer-Lemeshow test, for instance, assesses model calibration: whether predicted probabilities tend to match the observed frequencies when the data are split by risk deciles. Although the choice of 10 groups is arbitrary, the test has asymptotic results and can be easily modified. The HL test, as well as AUC, have (in my opinion) very uninteresting results when calculated on the same data that was used to estimate the logistic regression model. It's a wonder programs like SAS and SPSS made the frequent reporting of statistics for wildly different analyses the de facto way of presenting logistic regression results. Tests of predictive accuracy (e.g. HL and AUC) are better employed with independent data sets, or (even better) data collected over different periods in time, to assess a model's predictive ability.
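A minimal base-R sketch of evaluating calibration and discrimination on independent data follows. The helper names `simulate_data`, `hl_stat` and `auc`, the simulated data, and the choice of `g` degrees of freedom for the external HL test are assumptions for illustration, not a reference implementation.

```r
# Simulated training and independent test sets; the model is an assumption.
set.seed(4)
simulate_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = rbinom(n, 1, plogis(-0.5 + x)))
}
train <- simulate_data(1000)
test  <- simulate_data(1000)

fit <- glm(y ~ x, family = binomial, data = train)
p_hat <- predict(fit, newdata = test, type = "response")

# Hosmer-Lemeshow-style statistic over g risk groups (g = 10 is the usual,
# arbitrary choice): sum of (observed - expected)^2 / binomial variance.
hl_stat <- function(y, p, g = 10) {
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp <- cut(p, breaks, include.lowest = TRUE)
  observed <- tapply(y, grp, sum)
  expected <- tapply(p, grp, sum)
  size     <- tapply(p, grp, length)
  sum((observed - expected)^2 / (expected * (1 - expected / size)))
}

# AUC via the rank (Mann-Whitney) formula.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

hl_test  <- hl_stat(test$y, p_hat)
# Assumption: g degrees of freedom is a common convention for external data.
hl_p     <- pchisq(hl_test, df = 10, lower.tail = FALSE)
auc_test <- auc(test$y, p_hat)
print(c(HL = hl_test, HL_p = hl_p, AUC = auc_test))
```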

Another point to make is that prediction and inference are very different things. There is no objective way to evaluate prediction: an AUC of 0.65 is very good for predicting very rare and complex events like 1-year breast cancer risk. Similarly, inference can be accused of being arbitrary because the traditional false positive rate of 0.05 is just commonly thrown around.

Your problem description seems to be interested in modeling the effects of the manager-reported "obstacles" in investing, so if I were you I would focus on presenting the model-adjusted associations. Present the point estimates and 95% confidence intervals for the model odds ratios and be prepared to discuss their meaning, interpretation, and validity with others. A forest plot is an effective graphical tool. You must show the frequency of these obstacles in the data as well, and present their mediation by other adjustment variables to demonstrate whether the possibility of confounding was small or large in unadjusted results. I would go further still and explore factors like Cronbach's alpha for consistency among manager-reported obstacles, to determine whether managers tended to report similar problems or whether groups of people tended to identify specific problems.
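A minimal sketch of producing odds ratios with 95% intervals is shown here. The data frame `d`, the variables `obstacle_a`, `obstacle_b` and `invested`, and the effect sizes are hypothetical stand-ins, since the actual survey data are not shown. The intervals are Wald intervals from `confint.default`, one of several reasonable choices.

```r
# Hypothetical data standing in for manager-reported obstacles.
set.seed(5)
n <- 800
d <- data.frame(
  obstacle_a = rbinom(n, 1, 0.4),
  obstacle_b = rbinom(n, 1, 0.3)
)
d$invested <- rbinom(n, 1, plogis(-0.3 - 0.8 * d$obstacle_a + 0.1 * d$obstacle_b))

fit <- glm(invested ~ obstacle_a + obstacle_b, family = binomial, data = d)

# Wald 95% intervals on the log-odds scale, exponentiated to odds ratios.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 2))

# Frequency of each obstacle in the data.
obstacle_freq <- colMeans(d[, c("obstacle_a", "obstacle_b")])
print(round(obstacle_freq, 2))
```

The rows of `or_table` (other than the intercept) are the quantities one would place on a forest plot.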

I think you're a bit too focused on the numbers and not the question at hand. 90% of a good statistics presentation takes place before model results are ever presented.
