How can I account for a nonlinear independent variable in a logistic regression?
For example, consider this data set:
b1 b2 b3 b4 b5
1 1 0 0 20
0 1 0 0 20
1 0 0 0 13
1 1 1 0 8
1 1 0 1 5
0 1 0 1 4
1 1 0 1 5
0 0 0 0 8
0 0 1 0 13
1 0 0 1 20
1 0 0 0 29
1 1 0 1 40
1 0 0 0 53
0 1 1 0 68
Suppose I gathered data like the above (say, 10,000 rows) and I want to use it to predict future, similar data with an equation generated by a logistic regression. My independent variables are b1 through b5. By looking at the data, I can clearly see that b1 through b4 are binary (maybe they are male/female, rent/own, etc.).
However, I see that b5 is clearly not binary. Upon graphing it and looking at the numbers, I determine that the b5 variable has a bathtub-shaped U curve with an equation of b5=(x-4)^2+4.
I understand that using the b5 variable as it is would screw up my logistic regression, since any interval independent variable must have a linear relationship with the dependent variable.
Yet, I must account for the variable b5 in my data set while doing a logistic regression for it to be accurate and predict well (right?).
Given the above data, how would I go about making a predictive model with a logistic regression? Any references online or in books would be greatly appreciated, as I am having trouble finding a good explanation of the above.
Cheers =)
EDIT:
Sorry if my question is unclear.
Consider this: Suppose we had the following data on many people:
Y: Whether they had a stroke before the age of 50 or not
1. Whether they own a pet or not.
2. Whether they graduated college or not.
3. Whether they are male or female.
4. The number of hours they exercise every week.
We want to train a logistic regression model on historical data to predict whether someone in the future, described by the same variables, is likely to have a stroke before the age of 50.
The problem I am having is with the last variable, a continuous ratio variable like "number of hours exercised each week."
Exercising more each week may have diminishing returns on health, and thus on the risk of having a stroke. For example, the risk might be decreasing at an increasing rate. Or maybe (totally made up) exercising a little bit is really good for you, exercising a moderate amount is really bad for you, and exercising a lot is really good for you again. We might have a U-shaped curve like (x-4)^2+4, for example, when plotting health against exercise.
I would imagine that having a curved, non-linear independent variable in a logistic regression like the one I described above would cause problems. Am I right? And if so, what kinds of things can you do to still include the variable in the logistic regression? Perhaps a transformation? Is that all?
Best Answer
As written, your question can't work, since y is a 0-1 variable and you're doing logistic regression.
If you mean that the linear predictor has a nonlinear relationship with one of the independent variables, that is, $\eta = a + bf(x)$, say, for some nonlinear $f$ (with all other variables held constant), then you can write $x^* = f(x)$ and put $x^*$ in your logistic regression as an independent variable. [In a logistic regression, $\eta = \text{logit}(P[Y=1])$.]
This is quite commonly done in linear models and generalized linear models; there's a linear relationship, but it's with a transformed independent variable. Under the usual assumptions you need for a GLM, the transformed variable works perfectly well as a predictor.
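To make the transformation concrete, here is a minimal sketch in Python using only the standard library. It simulates data whose log-odds are linear in $x^* = (x-4)^2$, then fits a logistic regression on $x^*$. Everything in it is an assumption for illustration: the simulated values `TRUE_A` and `TRUE_B`, the sample size, and the hand-rolled Newton-Raphson fitter `fit_logistic`, which merely stands in for whatever GLM routine you would normally use.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    # Solve the small linear system A x = b by Gaussian elimination
    # with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def fit_logistic(X, y, iters=25):
    # Hypothetical helper: maximum likelihood by Newton-Raphson.
    # X is a list of rows; the first entry of each row is 1.0 (intercept).
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += w * row[i] * row[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Simulated data (made-up parameters): logit(P[Y=1]) = TRUE_A + TRUE_B * (x-4)^2
random.seed(0)
TRUE_A, TRUE_B = -2.0, 0.3
xs = [random.uniform(0, 8) for _ in range(2000)]
x_star = [(x - 4) ** 2 for x in xs]          # the transformed predictor x*
ys = [1 if random.random() < sigmoid(TRUE_A + TRUE_B * s) else 0
      for s in x_star]

# Fit on x*, not on x: the model is linear in the transformed variable.
a_hat, b_hat = fit_logistic([[1.0, s] for s in x_star], ys)
print("intercept estimate:", a_hat)
print("slope on x* estimate:", b_hat)
```

The only modelling step is the line that builds `x_star`; the fit itself is an ordinary logistic regression, which is the point being made above.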
Note that if $f$ is known and its coefficient, $b$, is also known, you don't put $x^*$ in as a predictor, because $x^{**} = bf(x)$ is then an alternative predictor with coefficient 1; such terms come in as offsets (e.g. specified in R by using the `offset` argument). (In ordinary regression you could let $y^* = y-x^{**}$ instead, for the same effect.) I will assume the coefficient of $f$ is unknown (though you specified it to be 1).
In your particular case $x^* = f(x) = (x-4)^2$. If you were unsure about the "4" there (e.g. if it's just a rough guess or something, rather than a value that's definitely known), then you could instead use two new variables, $x^*_l = x-4$ and $x^*_q = (x-4)^2$, both as predictors, which will capture a general quadratic relationship (with the additional benefit that if the '4' is nearly right, the estimates will be nearly uncorrelated with each other and with the intercept).
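To see why the two-variable version captures any quadratic, it may help to expand the linear predictor. The coefficient names $b_l$ and $b_q$ are just labels introduced here for the coefficients of $x^*_l$ and $x^*_q$:

$$
\eta = a + b_l(x-4) + b_q(x-4)^2 = (a - 4b_l + 16b_q) + (b_l - 8b_q)\,x + b_q\,x^2 .
$$

This is an unrestricted quadratic in $x$, so the fit does not depend on the "4" being correct. Setting the derivative $b_l + 2b_q(x-4)$ to zero puts the turning point at $x = 4 - \dfrac{b_l}{2b_q}$ (for $b_q \neq 0$), so an estimate of $b_l$ close to zero suggests the guessed centre of 4 was about right.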