Solved – How to account for a nonlinear variable in a logistic regression

How can I account for a nonlinear independent variable in a logistic regression?

For example, consider this data set:

b1  b2  b3  b4  b5
1   1   0   0   20
0   1   0   0   20
1   0   0   0   13
1   1   1   0   8
1   1   0   1   5
0   1   0   1   4
1   1   0   1   5
0   0   0   0   8
0   0   1   0   13
1   0   0   1   20
1   0   0   0   29
1   1   0   1   40
1   0   0   0   53
0   1   1   0   68

Suppose I gathered some data like the above (say 10,000 rows) and I want to use it to predict future data of the same kind with an equation generated by a logistic regression. My independent variables are b1 through b5. Looking at the data, I can see that b1 through b4 are binary (maybe they are male/female, rent/own, etc.).

However, I see that b5 is clearly not binary. Upon graphing it and looking at the numbers, I determine that the b5 variable has a bathtub-shaped U curve with an equation of b5=(x-4)^2+4.

I understand that using the b5 variable as it is would screw up my logistic regression – since any interval independent variables must have a linear relationship with the dependent variable.

Yet, I must account for the variable b5 in my data set while doing a logistic regression for it to be accurate and predict well (right?).

Given the above data, how would I go about making a predictive model with a logistic regression? Any references online or in books would be greatly appreciated, as I am having trouble finding a good explanation of the above.

Cheers =)

EDIT:

Sorry if my question is unclear.

Consider this: Suppose we had the following data on many people:

Y: Whether they had a stroke before the age of 50 or not
1. Whether they own a pet or not.
2. Whether they graduated college or not.
3. Whether they are male or female.
4. The number of hours they exercise every week.

We want to train a logistic regression model based on historical data to predict if someone in the future with the same data will be more apt or not to have a stroke before the age of 50.

The problem I am having is with the last variable. A continuous ratio variable like "number of hours exercised each week."

Exercising more each week may have diminishing returns on health, and thus on the risk of having a stroke. For example, the risk might fall at a decreasing rate as exercise increases. Or maybe (totally made up) exercising a little bit is really good for you, exercising a moderate amount is really bad for you, and exercising a lot is really good for you again. We might have a U-shaped curve like (x-4)^2+4, for example, when plotting health against exercise.

I would imagine that having a curved, non-linear independent variable in a logistic regression like the one I described above would cause problems. Am I right? And if so, what kinds of things can you do to still include the variable in the logistic regression? Perhaps a transformation? Is that all?

Best Answer

As written, your question can't work, since y is a 0-1 variable and you're doing logistic regression.

If you mean that the linear predictor had a nonlinear relationship with one of the independent variables, that is, $\eta = a + bf(x)$, say, for some nonlinear $f$ (with all other variables held constant), then you can write $x^* = f(x)$ and put $x^*$ in your logistic regression as an independent variable. [In a logistic regression, $\eta = \text{logit}(P[Y=1])$]

This is quite commonly done in linear models and generalized linear models; there's a linear relationship, but it's with a transformed independent variable. Under the usual assumptions you need for a GLM, the transformed variable works perfectly well as a predictor.
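As a rough illustration of the transformation idea, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the simulated data, the true values $a=-2$ and $b=1$, the range of $x$, and the helper names `fit_logistic`, `solve` and `log_lik` (a small hand-rolled Newton-Raphson fit, not any particular library's API). It fits one model on raw $x$ and one on $x^* = (x-4)^2$ and compares their log-likelihoods.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def fit_logistic(rows, y, iters=25):
    # Newton-Raphson on the logistic log-likelihood.
    # Each row must start with 1.0 so the first coefficient is the intercept.
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(rows, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    info[i][j] += w * xi[i] * xi[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def log_lik(rows, y, beta):
    # Bernoulli log-likelihood of the fitted model.
    total = 0.0
    for xi, yi in zip(rows, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

# Simulated data (assumed for illustration): logit(P[Y=1]) = a + b * (x - 4)^2.
random.seed(0)
a_true, b_true = -2.0, 1.0
x = [random.uniform(2.0, 6.0) for _ in range(5000)]
y = [1 if random.random() < sigmoid(a_true + b_true * (xi - 4.0) ** 2) else 0
     for xi in x]

raw = [[1.0, xi] for xi in x]                        # x entered as-is
transformed = [[1.0, (xi - 4.0) ** 2] for xi in x]   # x* = f(x)

beta_raw = fit_logistic(raw, y)
beta_tr = fit_logistic(transformed, y)
ll_raw = log_lik(raw, y, beta_raw)
ll_tr = log_lik(transformed, y, beta_tr)

print("raw x:       coefficients", beta_raw, "log-likelihood", ll_raw)
print("transformed: coefficients", beta_tr, "log-likelihood", ll_tr)
```

On data like this, the model using $x^*$ should recover a coefficient near the simulated $b$ and fit noticeably better than the model using raw $x$, whose slope has little to pick up because the relationship is symmetric around 4.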

Note that if $f$ is known and its coefficient $b$ is also known, you don't put $x^*$ in as a predictor, because $x^{**} = bf(x)$ is then an alternative predictor with coefficient 1; terms like that come in as offsets (e.g. specified in R by using the offset argument). (In ordinary regression you could let $y^* = y-x^{**}$ instead, for the same effect.)
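To make the offset idea concrete, here is a small sketch in plain Python, under assumed values: simulated data, a known $b = 1$, a known $f(x) = (x-4)^2$, and an intercept of $-2$ that is treated as the only unknown. The offset enters the linear predictor with its coefficient fixed at 1, so only the intercept is estimated (here by a one-dimensional Newton iteration written out by hand, not a library call).

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Simulated data (assumed for illustration): logit(P[Y=1]) = a + 1 * (x - 4)^2.
random.seed(1)
a_true = -2.0
x = [random.uniform(2.0, 6.0) for _ in range(5000)]
offset = [1.0 * (xi - 4.0) ** 2 for xi in x]   # x** = b * f(x), with b known
y = [1 if random.random() < sigmoid(a_true + o) else 0 for o in offset]

# Newton-Raphson for the intercept alone; the offset is added to the
# linear predictor but has no coefficient to estimate.
a_hat = 0.0
for _ in range(25):
    p = [sigmoid(a_hat + o) for o in offset]
    score = sum(yi - pi for yi, pi in zip(y, p))   # d logL / d a
    info = sum(pi * (1.0 - pi) for pi in p)        # -d^2 logL / d a^2
    a_hat += score / info

print("estimated intercept:", a_hat)
```

If other predictors were present, they would be estimated alongside the intercept in the same way, with the offset still added to the linear predictor unchanged.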

I will assume the coefficient of $f$ is unknown (though you specified it to be 1).

In your particular case $x^* = f(x) = (x-4)^2$. If you were unsure about the "4" there (e.g. if it's just a rough guess or something, rather than a value that's definitely known), then you could instead use two new variables, $x^*_l = x-4$ and $x^*_q = (x-4)^2$, both as predictors, which will capture a general quadratic relationship (with the additional benefit that if the '4' is nearly right, the estimates will be nearly uncorrelated with each other and with the intercept).
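To see why the two centered terms capture a general quadratic, it may help to expand the linear predictor. Here $b_l$ and $b_q$ are labels assumed for the coefficients of $x^*_l$ and $x^*_q$; they are introduced only for this illustration.

$$\eta = a + b_l(x-4) + b_q(x-4)^2 = (a - 4b_l + 16b_q) + (b_l - 8b_q)\,x + b_q x^2$$

So any quadratic in $x$ can be reached by a suitable choice of $a$, $b_l$ and $b_q$. Setting the derivative $b_l + 2b_q(x-4)$ to zero gives a fitted turning point at $x = 4 - b_l/(2b_q)$ (for $b_q \neq 0$), so an estimate of $b_l$ close to zero suggests the '4' was about right.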
