I'm doing a logistic regression and created a calibration plot. I also conducted a Hosmer-Lemeshow test and got the corresponding chi-square statistic. Is there any relationship between the calibration plot and the chi-square? Is it reasonable to put the chi-square value on the calibration plot (to show the result)?
Logistic Calibration – Can Hosmer-Lemeshow Chi-Square Statistic Explain Calibration?
binary-data, calibration, hosmer-lemeshow-test, logistic
Related Solutions
Hosmer, D.W., and Lemeshow, S. (1980), "A goodness-of-fit test for the multiple logistic regression model," Communications in Statistics, A10, 1043–1069, show that:
If the model is a logistic regression model, the $p$ parameters are estimated by maximum likelihood, and the $G$ groups are defined on the estimated probabilities, then $X^2$ is asymptotically distributed as $\chi^2(G-p-1)+\sum_{i=1}^{p+1} \lambda_i \chi_i^2(1)$ (Hosmer, Lemeshow, 1980, p. 1052, Theorem 2).
(Note: the necessary conditions are not stated explicitly in Theorem 2 on p. 1052, but they emerge on an attentive reading of the paper and the proof.)
The second term $\sum_{i=1}^{p+1} \lambda_i \chi_i^2(1)$ results from the fact that the grouping is based on estimated (i.e. random) quantities (Hosmer, Lemeshow, 1980, p. 1051).
Using simulations, they showed that this second term can be approximated (in the cases used in the simulation) by a $\chi^2(p-1)$ distribution (Hosmer, Lemeshow, 1980, p. 1060).
Combining these two facts gives a sum of two $\chi^2$ variables, one with $G-p-1$ degrees of freedom and one with $p-1$ degrees of freedom, so that approximately $X^2 \sim \chi^2\big((G-p-1)+(p-1)\big)=\chi^2(G-2)$.
So the answer to the question lies in the occurrence of the 'weighted chi-square term' or in the fact that the groups are defined using estimated probabilities that are themselves random variables.
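To see the $\chi^2(G-2)$ approximation at work, here is a small Monte Carlo sketch. The simulated data, coefficients, and the helper `hl_stat` are my own illustration, not from the paper: under a correctly specified logistic model, the Hosmer-Lemeshow statistic with $G = 10$ groups should have mean near $G-2 = 8$ and variance near $2(G-2) = 16$.

```r
# Monte Carlo sketch (illustrative setup): distribution of the
# Hosmer-Lemeshow statistic under a correctly specified logistic model.
set.seed(1)
G <- 10; n <- 2000; nsim <- 200

# minimal H-L statistic for a binary outcome and fitted probabilities
hl_stat <- function(y, phat, g = 10) {
  qq <- unique(quantile(phat, probs = seq(0, 1, 1/g)))
  bins <- cut(phat, breaks = qq, include.lowest = TRUE)
  obs1 <- tapply(y, bins, sum)        # observed events per group
  exp1 <- tapply(phat, bins, sum)     # expected events per group
  obs0 <- tapply(1 - y, bins, sum)    # observed non-events
  exp0 <- tapply(1 - phat, bins, sum) # expected non-events
  sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
}

stats <- replicate(nsim, {
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(-0.5 + x))   # true model is logistic in x
  fit <- glm(y ~ x, family = binomial)
  hl_stat(y, fitted(fit), G)
})

mean(stats)  # should be near G - 2 = 8
var(stats)   # should be near 2 * (G - 2) = 16
```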
Update 28 July: the below has now been pushed to CRAN in the generalhoslem package, along with the Lipsitz and Pulkstenis-Robinson tests.
Fagerland and Hosmer discuss a generalisation of the Hosmer-Lemeshow test and two other approaches (the Lipsitz test and the Pulkstenis-Robinson tests) in "A goodness-of-fit test for the proportional odds regression model" (2013, Statistics in Medicine) and in "Tests for goodness of fit in ordinal logistic regression models" (2016, Journal of Statistical Computation and Simulation).
I haven't checked properly yet, but as far as I know, they haven't been implemented in R. If that turns out to be the case, I plan to add them to the generalhoslem package.
EDIT: it's probably also worth pointing out Fagerland and Hosmer's recommendation that
because the tests may detect different types of lack of fit, a thorough assessment of goodness of fit requires use of all three approaches.
ANOTHER EDIT: I didn't realise this straight away, but the only difference between the multinomial version of the Hosmer-Lemeshow test and the ordinal version is in the degrees of freedom. Strictly speaking, the test for ordinal response models requires sorting the observations by a weighted ordinal "response score", but Fagerland and Hosmer show in the 2013 article above that this is equivalent to binning the observations in the same way as in the multinomial case.
I haven't properly tested it, but the logitgof function on my GitHub page should now do the trick: https://github.com/matthewjay15/generalhoslem-v1.1.0/blob/1.2.0.9000/logitgof.R Note that if you examine the observed and expected tables produced by this function, the columns may not be in the right order (they will be alphabetical). This shouldn't make any difference to the test statistic, though, as they will correspond to each other.
An example using the ordinal and MASS packages:
library(reshape) # needed by logitgof
logitgof <- function(obs, exp, g = 10, ord = FALSE) {
  DNAME <- paste(deparse(substitute(obs)), deparse(substitute(exp)), sep = ", ")
  yhat <- exp
  # a vector or single-column 'exp' is treated as a binary model
  if (is.null(ncol(yhat)) || ncol(yhat) == 1) {
    mult <- FALSE
  } else {
    mult <- TRUE
  }
  if (mult) {
    METHOD <- if (!ord) {
      "Hosmer and Lemeshow test (multinomial model)"
    } else {
      "Hosmer and Lemeshow test (ordinal model)"
    }
    # bin on 1 - P(first category), using unique quantile cut points
    qq <- unique(quantile(1 - yhat[, 1], probs = seq(0, 1, 1/g)))
    cutyhats <- cut(1 - yhat[, 1], breaks = qq, include.lowest = TRUE)
    dfobs <- data.frame(obs, cutyhats)
    dfobsmelt <- melt(dfobs, id.vars = 2)
    observed <- cast(dfobsmelt, cutyhats ~ value, length)
    observed <- observed[order(c(1, names(observed[, 2:ncol(observed)])))]
    dfexp <- data.frame(yhat, cutyhats)
    dfexpmelt <- melt(dfexp, id.vars = ncol(dfexp))
    expected <- cast(dfexpmelt, cutyhats ~ variable, sum)
    expected <- expected[order(c(1, names(expected[, 2:ncol(expected)])))]
    # sanity check before using the two tables together
    if (ncol(expected) != ncol(observed))
      stop("Observed and expected tables have different numbers of columns. Check you entered the correct data.")
    stddiffs <- abs(observed[, 2:ncol(observed)] - expected[, 2:ncol(expected)]) /
      sqrt(expected[, 2:ncol(expected)])
    chisq <- sum((observed[, 2:ncol(observed)] - expected[, 2:ncol(expected)])^2 /
                   expected[, 2:ncol(expected)])
    if (!ord) {
      PARAMETER <- (nrow(expected) - 2) * (ncol(yhat) - 1)
    } else {
      PARAMETER <- (nrow(expected) - 2) * (ncol(yhat) - 1) + ncol(yhat) - 2
    }
  } else {
    METHOD <- "Hosmer and Lemeshow test (binary model)"
    y <- if (is.factor(obs)) as.numeric(obs) - 1 else obs
    qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g)))
    cutyhat <- cut(yhat, breaks = qq, include.lowest = TRUE)
    observed <- xtabs(cbind(y0 = 1 - y, y1 = y) ~ cutyhat)
    expected <- xtabs(cbind(yhat0 = 1 - yhat, yhat1 = yhat) ~ cutyhat)
    stddiffs <- abs(observed - expected) / sqrt(expected)
    chisq <- sum((observed - expected)^2 / expected)
    PARAMETER <- nrow(expected) - 2
  }
  if (g != nrow(expected))
    warning(paste("Not possible to compute", g, "rows. There might be too few observations."))
  if (any(expected[, 2:ncol(expected)] < 1))
    warning("At least one cell in the expected frequencies table is < 1. Chi-square approximation may be incorrect.")
  PVAL <- 1 - pchisq(chisq, PARAMETER)
  names(chisq) <- "X-squared"
  names(PARAMETER) <- "df"
  structure(list(statistic = chisq, parameter = PARAMETER,
                 p.value = PVAL, method = METHOD, data.name = DNAME,
                 observed = observed, expected = expected, stddiffs = stddiffs),
            class = "htest")
}
library(foreign) # just to download the example dataset
# with the ordinal package
library(ordinal)
ml <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
mod1 <- clm(ses ~ female + write + read, data = ml)
# extract predicted probs for each level of outcome
predprob <- data.frame(id = ml$id, female = ml$female, read = ml$read, write = ml$write)
fv <- predict(mod1, newdata = predprob, type = "prob")$fit
logitgof(ml$ses, fv, ord = TRUE) # set ord to TRUE to run ordinal instead of multinomial test
# with MASS
library(MASS)
mod2 <- polr(ses ~ female + write + read, data = ml)
logitgof(ml$ses, fitted(mod2), ord = TRUE)
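The function also covers the original binary case. As a self-contained sketch of that branch on simulated data (the data-generating model and variable names here are my own illustration, and the statistic is computed directly rather than by calling the package):

```r
# Standalone sketch of the binary Hosmer-Lemeshow test for a glm fit,
# mirroring the binary branch of logitgof above (illustrative data).
set.seed(2)
n <- 500
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 0.5 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)
phat <- fitted(fit)

g <- 10
qq <- unique(quantile(phat, probs = seq(0, 1, 1/g)))   # decile cut points
bins <- cut(phat, breaks = qq, include.lowest = TRUE)
observed <- xtabs(cbind(y0 = 1 - y, y1 = y) ~ bins)    # observed counts
expected <- xtabs(cbind(p0 = 1 - phat, p1 = phat) ~ bins)  # expected counts

chisq <- sum((observed - expected)^2 / expected)
df <- nrow(expected) - 2
pval <- pchisq(chisq, df, lower.tail = FALSE)
c(chisq = chisq, df = df, p.value = pval)
```

Since the model here is correctly specified, the p-value will usually be unremarkable; the point is only to show the mechanics of the binary branch.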
Best Answer
I'd say it's reasonable. The Hosmer-Lemeshow statistic is closely related to the calibration plot: it essentially makes the same comparison of predicted and observed proportions that the calibration plot displays. The relation is not completely direct, however. The Hosmer-Lemeshow binning into groups is not usually shown on the calibration plot (although there is more than one way to set the plot up), so one can't connect the workings of the test to the plot point by point. How close the connection needs to be between things that are presented together is to some extent a matter of taste; I think it's fine to put them together.
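As a concrete sketch of the suggestion, here is a decile-based calibration plot with the Hosmer-Lemeshow chi-square annotated on it. The simulated data and base-R plotting choices are my own illustration, and the decile binning is just one of several possible setups:

```r
# Sketch: calibration plot by deciles of predicted probability, with the
# Hosmer-Lemeshow chi-square printed on the plot (illustrative data).
set.seed(3)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + x))
fit <- glm(y ~ x, family = binomial)
phat <- fitted(fit)

bins <- cut(phat, breaks = unique(quantile(phat, seq(0, 1, 0.1))),
            include.lowest = TRUE)
obs_rate <- tapply(y, bins, mean)     # observed event rate per decile
exp_rate <- tapply(phat, bins, mean)  # mean predicted probability per decile
n_g <- tapply(y, bins, length)

# equivalent grouped form of the H-L statistic
chisq <- sum(n_g * (obs_rate - exp_rate)^2 / (exp_rate * (1 - exp_rate)))
df <- length(n_g) - 2

plot(exp_rate, obs_rate, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)  # perfect-calibration line
legend("topleft", bty = "n",
       legend = sprintf("Hosmer-Lemeshow X2 = %.2f (df = %d, p = %.3f)",
                        chisq, df, pchisq(chisq, df, lower.tail = FALSE)))
```

One advantage of annotating the plot this way is that the plotted points are exactly the groups the statistic is computed from, so the test and the plot are directly connected.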