How to account for a nonlinear variable in a logistic regression

Tags: logistic, nonlinear-regression

How can I account for a nonlinear independent variable in a logistic regression?

For example, consider this data set:

b1  b2  b3  b4  b5
1   1   0   0   20
0   1   0   0   20
1   0   0   0   13
1   1   1   0   8
1   1   0   1   5
0   1   0   1   4
1   1   0   1   5
0   0   0   0   8
0   0   1   0   13
1   0   0   1   20
1   0   0   0   29
1   1   0   1   40
1   0   0   0   53
0   1   1   0   68

Suppose I gathered data like the above (say, 10,000 rows) and I want to use it to predict future data of the same kind with an equation generated by a logistic regression. My independent variables are b1 through b5. Looking at the data, I can clearly see that b1 through b4 are binary (maybe they are male/female, rent/own, etc.).

However I see that b5 is clearly not binary. Upon graphing it and looking at the numbers I determine that the b5 variable has a bathtub shaped U curve with an equation of b5=(x-4)^2+4.

I understand that using the b5 variable as it is would screw up my logistic regression – since any interval independent variables must have a linear relationship with the dependent variable.

Yet, I must account for the variable b5 in my data set while doing a logistic regression for it to be accurate and predict well (right?).

Given the above data, how would I go about making a predictive model with a logistic regression? Any references online or in books would be greatly appreciated, as I am having trouble finding a good explanation of the above.

Cheers =)

EDIT:

Sorry if my question is unclear.

Consider this: Suppose we had the following data on many people:

Y: Whether they had a stroke before the age of 50 or not
1. Whether they own a pet or not.
2. Whether they graduated college or not.
3. Whether they are male or female.
4. The number of hours they exercise every week.

We want to train a logistic regression model on historical data to predict whether a new person with the same kind of data is likely to have a stroke before the age of 50.

The problem I am having is with the last variable, a continuous ratio variable like "number of hours exercised each week."

Exercising more each week may have diminishing returns on health, and thus on the risk of having a stroke: for example, the risk might decrease at a decreasing rate. Or maybe (totally made up) exercising a little bit is really good for you, exercising a moderate amount is really bad for you, and exercising a lot is really good for you again. We might have a U-shaped curve like (x-4)^2+4, for example, when plotting health against exercise.

I would imagine that having a curved, non-linear independent variable in a logistic regression like the one I described above would cause problems. Am I right? And if so, what kinds of things can you do to still include the variable in the logistic regression? Perhaps a transformation? Is that all?

Best Answer

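As written, your question can't work: y is a 0-1 variable, so it cannot itself follow a curve like $(x-4)^2+4$. In a logistic regression, any such relationship has to be on the scale of the linear predictor.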

If you mean that the linear predictor had a nonlinear relationship with one of the independent variables, that is, $\eta = a + bf(x)$, say, for some nonlinear $f$ (with all other variables held constant), then you can write $x^* = f(x)$ and put $x^*$ in your logistic regression as an independent variable. [In a logistic regression, $\eta = \text{logit}(P[Y=1])$]

This is quite commonly done in linear models and generalized linear models; there's a linear relationship, but it's with a transformed independent variable. Under the usual assumptions you need for a GLM, the transformed variable works perfectly well as a predictor.
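The substitution $x^* = f(x)$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the "true" values `TRUE_A` and `TRUE_B` are made up, $f(x) = (x-4)^2$ is taken from the question, and the small Newton-Raphson fitter is a stand-in for whatever GLM routine you would normally use.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson fit of logit(P[Y=1]) = a + b*x for a single predictor."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated data (assumption): eta = TRUE_A + TRUE_B * (x - 4)^2.
rng = random.Random(0)
TRUE_A, TRUE_B = -2.0, 0.3
x = [rng.uniform(0.0, 8.0) for _ in range(5000)]
x_star = [(xi - 4.0) ** 2 for xi in x]          # x* = f(x)
y = [1 if rng.random() < sigmoid(TRUE_A + TRUE_B * xs) else 0 for xs in x_star]

a_raw, b_raw = fit_logistic(x, y)        # raw x: the U shape is invisible to a line
a_hat, b_hat = fit_logistic(x_star, y)   # transformed x*: linear on the logit scale

print("raw x slope:", b_raw)
print("x* intercept and slope:", a_hat, b_hat)
```

With the raw variable, the fitted slope should come out close to zero because the U shape is roughly symmetric over the simulated range; with $x^*$, the estimates should land near the values used to generate the data.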

Note that if $f$ is known and its coefficient $b$ is also known, you don't put $x^*$ in as a predictor, because $x^{**} = bf(x)$ is then an alternative predictor with coefficient 1. Terms like that come in as offsets (e.g. specified in R by using the offset argument). (In ordinary regression you could let $y^* = y-x^{**}$ instead, for the same effect.)
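As a rough illustration of what an offset does, here is a minimal Python sketch (not the R call itself). It assumes simulated data, a made-up known coefficient `KNOWN_B` and intercept `TRUE_A`, and an intercept-only model; the term $bf(x)$ is added to the linear predictor with its coefficient fixed at 1, and only the intercept is estimated.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_intercept_with_offset(offsets, ys, iters=25):
    """Newton-Raphson fit of logit(P[Y=1]) = a + offset, estimating only a."""
    a = 0.0
    for _ in range(iters):
        score = info = 0.0
        for off, y in zip(offsets, ys):
            p = sigmoid(a + off)   # the offset enters with coefficient fixed at 1
            score += y - p
            info += p * (1.0 - p)
        a += score / info
    return a

# Simulated data (assumption): f(x) = (x - 4)^2 with a known coefficient.
rng = random.Random(1)
TRUE_A, KNOWN_B = -2.0, 0.3
x = [rng.uniform(0.0, 8.0) for _ in range(5000)]
offset = [KNOWN_B * (xi - 4.0) ** 2 for xi in x]     # x** = b * f(x)
y = [1 if rng.random() < sigmoid(TRUE_A + off) else 0 for off in offset]

a_hat = fit_intercept_with_offset(offset, y)
print("estimated intercept:", a_hat)
```

In a real model the other predictors would be estimated alongside the intercept; the point here is only that no coefficient is fitted for the offset term.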

I will assume the coefficient of $f$ is unknown (though you specified it to be 1).

In your particular case $x^* = f(x) = (x-4)^2$. If you were unsure about the "4" there (e.g. if it's just a rough guess or something, rather than a value that's definitely known), then you could instead use two new variables, $x^*_l = x-4$ and $x^*_q = (x-4)^2$, both as predictors, which will capture a general quadratic relationship (with the additional benefit that if the '4' is nearly right, the estimates will be nearly uncorrelated with each other and with the intercept).
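To see why those two variables capture a general quadratic, it may help to expand the linear predictor. Here $b_l$ and $b_q$ denote the coefficients of $x^*_l$ and $x^*_q$ (names chosen for illustration; they do not appear in the question):

$$\eta = a + b_l(x-4) + b_q(x-4)^2 = (a - 4b_l + 16b_q) + (b_l - 8b_q)\,x + b_q x^2$$

Setting $d\eta/dx = b_l + 2b_q(x-4) = 0$ puts the turning point at $x = 4 - b_l/(2b_q)$ (for $b_q \neq 0$), so a fitted $b_l$ close to zero suggests the guess of 4 was about right.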
