Solved – Multinomial logistic regression, weighted logistic regression

binning logistic regression statsmodels

I have a binary predictor with many response variables. The binary predictor was originally continuous but was converted to binary … if the response was $>1000$ then 1, else 0. I would like to have a model in which responses of greater magnitude are more likely to be 1 vs responses of lesser magnitude. I have also thought of splitting the response into more categories … anyone have any ideas?

Best Answer

Whatever the problem is, you should not be binning a continuous response. You didn't give us much context, so advice is difficult to give; please add more context. But you say "there are a lot of legitimate 0 values." Why is that a problem? Maybe because you wanted to log-transform an otherwise positive variable? Then there are many other options, for instance modeling $\log(Y+c)$ for some positive constant $c$ (which could be estimated from the data in a way similar to Box-Cox transforms). Or an extended Box-Cox transform of the form $\frac{(Y+c)^\lambda-1}{\lambda}$ can be used, see Wikipedia or Transforming variables for multiple regression in R. Or you could simply use a glm (generalized linear model) with log link. That log-transforms the (estimated) expectation, not the observations! There are many other possibilities, but tell us more about the context first.
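As a rough illustration of the shifted Box-Cox idea, here is a minimal Python sketch using only the standard library. The function name `shifted_boxcox`, the example values, and the choices `c=1.0` and `lam=0.5` are assumptions made for illustration; in practice $c$ and $\lambda$ would be estimated from the data.

```python
import math

def shifted_boxcox(y, c, lam):
    """Shifted Box-Cox transform ((y + c)**lam - 1) / lam.

    For lam == 0 the limiting form log(y + c) is used.
    """
    if y + c <= 0:
        raise ValueError("y + c must be positive")
    if lam == 0:
        return math.log(y + c)
    return ((y + c) ** lam - 1.0) / lam

# Hypothetical response values, including legitimate zeros.
values = [0.0, 0.0, 3.0, 1500.0]

# Zeros are no obstacle once the shift c > 0 is applied.
transformed = [shifted_boxcox(v, c=1.0, lam=0.5) for v in values]
print(transformed)
```

The transformed values keep the full ordering and magnitude information of the response, which is exactly what thresholding at 1000 throws away.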

One other possibility that merits mention (since it is not too well known) is continuous ordinal regression. This is implemented in R in orm in package rms, and discussed at length in Frank Harrell's book "Regression Modeling Strategies".

Related Solutions

Logistic Regression – How to Resolve Model Non-Convergence Issues

glm() uses an iteratively reweighted least squares algorithm. The algorithm hit the maximum number of allowed iterations before signalling convergence. The default, documented in ?glm.control, is 25. You pass control parameters as a list in the glm call:

delay.model <- glm(BigDelay ~ ArrDelay, data=flights, family=binomial,
                   control = list(maxit = 50))

As @Conjugate Prior says, you seem to be predicting the response with the data used to generate it. You have complete separation as any ArrDelay < 10 will predict FALSE and any ArrDelay >= 10 will predict TRUE. The other warning message tells you that the fitted probabilities for some observations were effectively 0 or 1 and that is a good indicator you have something wrong with the model.

The two warnings can go hand in hand. The likelihood function can be quite flat when some $\hat{\beta}_i$ get large, as in your example. If you allow more iterations, the model coefficients will diverge further if you have a separation issue.
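The divergence under separation can be seen in a small Python sketch. Note the assumptions: the data are a hypothetical toy set that is perfectly separated at zero, and the fit uses plain gradient ascent on the log-likelihood rather than the IRLS algorithm that glm() uses; the function name `fit_logistic` and the step size are made up for illustration.

```python
import math

def fit_logistic(x, y, steps, lr=0.1):
    """Fit intercept a and slope b by plain gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        # Fitted probabilities under the current coefficients.
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the Bernoulli log-likelihood with respect to a and b.
        a += lr * sum(yi - pi for yi, pi in zip(y, p))
        b += lr * sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    return a, b

# Hypothetical toy data: every x < 0 has y = 0 and every x > 0 has y = 1.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [0, 0, 0, 1, 1, 1]

# The slope keeps growing as more iterations are allowed.
slopes = {steps: fit_logistic(x, y, steps)[1] for steps in (100, 1000, 10000)}
for steps, slope in slopes.items():
    print(steps, round(slope, 3))
```

With separated data the slope's gradient is positive at every step, so there is no finite maximum to converge to; raising the iteration limit only lets the estimate drift further while the fitted probabilities approach 0 and 1.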

Solved – Can value of predicted probability from logistic model be greater than one

The OP has explained in comments that they mistakenly used the R glm function without specifying the argument family=binomial, that is, they used the default gaussian family (with identity link). But that is the usual (least squares) linear regression, not logistic regression, and it clearly can give predictions outside the interval $(0,1)$.
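A minimal Python sketch of why this happens, assuming a hypothetical toy data set with a single predictor: an ordinary least squares line fitted to a 0/1 outcome is unbounded, so its predictions at the ends of the predictor range fall below 0 and above 1. The helper name `ols_fit` is made up for illustration.

```python
def ols_fit(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical binary outcome regressed on a single predictor.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 0, 1, 1, 1]

intercept, slope = ols_fit(x, y)

# Predictions from the straight line at the smallest and largest x.
pred_low = intercept + slope * x[0]
pred_high = intercept + slope * x[-1]
print(pred_low, pred_high)
```

A logistic model avoids this because the inverse logit maps any linear predictor into $(0,1)$.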
