The distribution of the predictors is almost irrelevant in regression, as you are conditioning on their values. Changing to factors is not needed unless there are very few unique values and some of them are not well populated.
But with very skewed predictors the model may fit better after a transformation. I tend to use $\sqrt{x}$ and $x^{\frac{1}{3}}$ because these allow zeros, unlike $\log(x)$. Then, when the sample size allows, I expand the transformed variables in a regression spline to make them fit adequately, e.g.
```r
require(rms)
cuber <- function(x) x^(1/3)
f <- lrm(y ~ rcs(cuber(x1), 4) + rcs(cuber(x2), 4) + rcs(x3, 5) + sex)
```
`rcs` means "restricted cubic spline" (natural spline), and the number after the variable or transformed variable is the number of knots (two more than the number of nonlinear terms in the spline). When you make the distribution more symmetric (here with the cube root), a good fit frequently requires fewer knots.
AIC can help in choosing the number of knots $k$ if you force all variables to have the same number of knots. Below, $k=0$ is the same as linearity after initial transformation.
```r
for (k in c(0, 3:7)) {
  f <- lrm(y ~ rcs(cuber(x1), k) + rcs(cuber(x2), k) + rcs(x3, k) + sex)
  print(AIC(f))
}
```
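If you prefer the AICs collected for comparison rather than printed one by one, a small variant might look like this (the vector name `aics` is mine):

```r
ks <- c(0, 3:7)
aics <- sapply(ks, function(k)
  AIC(lrm(y ~ rcs(cuber(x1), k) + rcs(cuber(x2), k) + rcs(x3, k) + sex)))
names(aics) <- ks
aics  # pick the k with the smallest AIC
```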
I eventually found a comprehensive description of the algorithm for creating a calibration plot in

J. Esarey and A. Pierce, "Assessing Fit Quality and Testing for Misspecification in Binary Dependent Variable Models," Political Analysis 20(4), pp. 480–500, 2012.
The article compares it with classification-based evaluation. Here is a summary of the ideas, together with my comments and R code for creating a calibration plot.
When comparing the probability predicted by the model with the "observed" probability, there is the problem that no probabilities are observed, only zeros and ones, i.e. (non-)occurrences of the response. These values can be smoothed into probabilities by a distance-weighted average in the "neighborhood" of each value, e.g. with a LOESS local regression. The distance defining the "neighborhood" and the weights can be measured in different spaces. Two obvious choices, illustrated in the short sketch after this list, are
- The distance on the link scale, i.e. on $\eta_i=\beta_0 + \langle\vec{\beta},\vec{x}_i\rangle$, where $\vec{x}_i$ holds the predictor values for the $i$-th observation and $\vec{\beta}$ the model parameters.
- The distance on the probability scale, i.e. on $p_i=P(Y=1|\vec{x}_i) = 1 / (1+e^{-\eta_i})$
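In R, these two scales correspond to the two prediction types of a fitted binomial `glm`; a minimal sketch, assuming the Challenger fit `fit` constructed further below:

```r
eta <- predict(fit, type = "link")      # eta_i = beta_0 + <beta, x_i>
p   <- predict(fit, type = "response")  # p_i = 1 / (1 + exp(-eta_i))
stopifnot(isTRUE(all.equal(p, plogis(eta))))  # related by the logistic CDF
```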
A LOESS fit through the points $(\eta_i, y_i)$ or $(p_i, y_i)$ will then yield an estimate $\hat{p}_i$ for each $y_i$, which can be compared to the probability $p_i$ predicted by the model.
There are two caveats, however:
- For degrees greater than zero, the LOESS fit can yield values outside $[0,1]$. For this reason, the first value is missing in both of the above plots: its estimated probability $\hat{p}_i$ is negative. This can easily be corrected by cutting the probabilities off at zero and one.
- LOESS only takes a certain percentage of neighbors into account (parameter `span`).
The above plots have been created with the default `span=0.75`. Esarey & Pierce suggest two different optimization methods and link to a reference implementation in a footnote, but that link is now dead. I have therefore implemented a very simple optimization criterion: minimizing the squared error between $\hat{p}_i$ and $p_i$, i.e. $\sum_i(\hat{p}_i - p_i)^2$. The result on the Challenger Space Shuttle O-ring dataset can be seen here:
I have also included the 95% confidence band for $p_i$ as predicted by the model. Esarey & Pierce also compute the percentage of values that lie outside an 80% confidence interval by means of a parametric bootstrap, but this might be computed more easily directly from the confidence intervals for $p_i$. Here is the code to produce the calibration plot on the link level (right-hand side):
```r
# Challenger Space Shuttle O-ring data: ok vs temp
data <- data.frame(y = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
                         0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1),
                   x = c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69,
                         70, 70, 70, 70, 72, 73, 75, 75, 76, 76,
                         78, 79, 81))
fit <- glm(y ~ x, data = data, family = binomial)

# the optimization criterion for estimating span: squared error
# between the LOESS estimates and the model probabilities
resub.mse <- function(span, y, x, p.model) {
  fit <- loess(y ~ x, family = "gaussian", degree = 1, span = span)
  sum((fit$fitted - p.model)^2)
}

#
# calibration plot on the link level
#
link.model <- predict(fit, data, type = "link", se.fit = TRUE)
p.model <- predict(fit, data, type = "response")  # model probabilities p_i
sort.key <- order(link.model$fit)
x <- link.model$fit[sort.key]
# 95% confidence band for the predicted probability
plot(link.model$fit, data$y, main = "link level")
p.lower <- plogis(link.model$fit - qnorm(1 - 0.05/2) * link.model$se.fit)[sort.key]
p.upper <- plogis(link.model$fit + qnorm(1 - 0.05/2) * link.model$se.fit)[sort.key]
polygon(c(x, rev(x)), c(p.lower, rev(p.upper)), col = "#dddddd", border = NA)
points(link.model$fit, data$y)  # replot points hidden by the polygon
lines(x, plogis(x), col = "red")
# LOESS fit with the span minimizing resub.mse
optim.span <- optimize(resub.mse, c(0.1, 1.0),
                       y = data$y, x = link.model$fit, p.model = p.model)
span <- optim.span$minimum
p.fit <- loess(y ~ x, data = data.frame(y = data$y, x = link.model$fit),
               family = "gaussian", degree = 1, span = span)
p.cutfit <- predict(p.fit, data.frame(x = x))
p.cutfit[p.cutfit < 0] <- 0  # cut the estimates off at zero and one
p.cutfit[p.cutfit > 1] <- 1
lines(x, p.cutfit)
legend("topleft", c("model", sprintf("LOESS (span=%4.2f)", span)),
       col = c("red", "black"), lty = 1)
```
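The corresponding plot on the probability level (left-hand side) differs only in the $x$ coordinate and in the model curve, which becomes the identity line. A minimal sketch, reusing `fit`, `data`, and `resub.mse` from above:

```r
# calibration plot on the probability level
p.model <- predict(fit, data, type = "response")
sort.key <- order(p.model)
px <- p.model[sort.key]
plot(p.model, data$y, main = "probability level")
lines(px, px, col = "red")  # the model curve is the identity line here
optim.span <- optimize(resub.mse, c(0.1, 1.0),
                       y = data$y, x = p.model, p.model = p.model)
p.fit <- loess(y ~ x, data = data.frame(y = data$y, x = p.model),
               family = "gaussian", degree = 1, span = optim.span$minimum)
lines(px, pmin(pmax(predict(p.fit, data.frame(x = px)), 0), 1))
```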
Best Answer
A few newer techniques I have come across for assessing the fit of logistic regression models come from political science journals; one of them is the Esarey & Pierce article cited above.
Both of these techniques purport to replace goodness-of-fit tests (like Hosmer–Lemeshow) and to identify potential misspecification (in particular, non-linearity in the variables included in the equation). They are particularly useful because the typical R-squared measures of fit for such models are frequently criticized.
Both of the above papers utilize predicted probabilities vs. observed outcomes in plots, somewhat avoiding the unclear issue of what a residual is in such models. Examples of residuals would be the contribution to the log-likelihood or Pearson residuals (I believe there are many more, though). Another measure that is often of interest (although not a residual) is the DFBETA: the amount a coefficient estimate changes when an observation is excluded from the model. See the UCLA page on Logistic Regression Diagnostics for examples of this in Stata, along with other potential diagnostic procedures.
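For example, base R already provides DFBETAs for `glm` fits via `dfbeta()`; a minimal sketch, reusing the Challenger `fit` from the previous answer (the 2-standard-deviation cutoff is an arbitrary illustrative choice):

```r
db <- dfbeta(fit)  # one row per observation, one column per coefficient
head(db)
# flag observations whose deletion moves the temperature slope unusually far
which(abs(db[, "x"]) > 2 * sd(db[, "x"]))
```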
I don't have it handy, but I believe J. Scott Long's *Regression Models for Categorical and Limited Dependent Variables* goes into sufficient detail on all of these different diagnostic measures in a simple manner.