I'm working on a data set modeling road kills (0 = random point, 1 = road kill) as a function of a number of habitat variables. Following Hosmer and Lemeshow, I've examined each continuous predictor variable for linearity, and a couple appear nonlinear. I'd like to try a fractional polynomial transformation for each, also following Hosmer and Lemeshow, and have looked at the R package mfp, but I'm having trouble coming up with (and understanding) the R code that will correctly transform the variable. Can anyone suggest R code that would help me accomplish the concepts on p. 101 – 102 of Hosmer and Lemeshow's Applied Logistic Regression (2000).
Solved – How to implement a fractional polynomial transformation in R for logistic regression
fractional-polynomiallogisticmodelingrregression
Related Solutions
@mpiktas came close in offering a feasible model, however the term that needs to be used for the quadratic in time=t would be I(t^2))
. This is so because in R the formula interpretation of "^" creates interactions and does not perform exponentiation, so the interaction of "t" with "t" is just "t". (Shouldn't this be migrated to SO with an [r] tag?)
For alternatives to this process, which looks to me somewhat dubious for inference purposes, and one which probably fits your interest in using the supportive functions in Harrell's rms/Hmisc packages, see Harrell's "Regression Modeling Strategies". He mentions (but only in passing although he does cite some of his own papers) constructing spline fits to the time scale to model bathtub-shaped hazards. His chapter on parametric survival models describes a variety of plotting techniques for checking proportional hazards assumptions and for examining the linearity of estimated log-hazard effects on the time scale.
Edit: An additional option is to use coxph
's 'tt' parameter described as an "optional list of time-transform functions."
The exposition is obscure but the examples and the discussion on p. 101 make the intentions clear.
Recall that the objective (for the situation with a single continuous covariate $x$) is to generalize logistic regression from the case
$$\text{logit}(y) = \beta_0 + \beta_1 x$$
to a relatively simple nonlinear expression of the form
$$\text{logit}(y) = \beta_0 + \beta_1 F_1(x) + \cdots + \beta_J F_J(x).$$
"Fractional polynomials" [sic] are expressions of the form
$$F(x) = x^p (\log(x))^q$$
for suitably chosen powers $p$ and $q$, with $q$ a natural number and $p$ a real number close to $1$. It is intended that if a high power $q$ of the logarithm is included, then all lower powers $q-1, q-2, \ldots, 1, 0$ will also be included. To be practicable and interpretable, H&L suggest restricting the values of $p$ to the set $P$ = $\{-2, -1, -1/2, 0, 1/2, 1, 2, 3\}$ ($0$ corresponds to $\log$, as usual) and $q$ to the set $\{0,1\}$.
When we limit the fractional polynomial to just two terms ($J=2$), the only possibilities according to these rules are of the form
$$F_1(x) = x^{p_1}, F_2(x) = x^{p_2}$$
for $p_1 \ne p_2$ or
$$F_1(x) = x^p, F_2(x) = x^p\log(x).$$
(The case $p=0$ corresponds to using $F_1(x) = \log(x)$ and $F_2(x) = (\log(x))^2$.)
These possibilities can be uniquely determined by a non-decreasing sequence of $J=2$ elements of $P$. The sequence $(p_1,p_2)$ with $p_2 \gt p_1$ specifies the first kind of fractional polynomial and the sequence $(p_1,p_2) = (p,p)$ specifies the second kind. Because $P$ has eight elements, this gives $\binom{8+1}{2} = 36$ possibilities for $J=2$. For instance, your case of $(-1,-1)$ specifies the model
$$\text{logit}(y) = \beta_0 + \beta_1 \frac{1}{x} + \beta_2 \frac{\log(x)}{x}.$$
(H&L go on to recount an approximate procedure in which partial likelihood ratio tests are used to fit the best model with $J=1$ (there are just eight of these) and then the best model with $J=2$ is fit. Each contributes approximately $2J$ degrees of freedom in the resulting chi-squared test.)
Of course, to be really sure of what R
is doing, you should either look at the source code, or fit the model and plot the predictions against the data, or both.
Best Answer
Here is some R code with an example taken from an example data set included in package
MASS
:(I did not include output). An alternative would be to use splines.