Hosmer, D.W. and Lemeshow, S. (1980), "A goodness-of-fit test for the multiple logistic regression model", Communications in Statistics, A10, 1043-1069, show that:
If the model is a logistic regression model, the $p$ parameters are
estimated by maximum likelihood, and the $G$ groups are defined on the
estimated probabilities, then $X^2$ is asymptotically distributed as
$\chi^2(G-p-1)+\sum_{i=1}^{p+1} \lambda_i \chi_i^2(1)$
(Hosmer and Lemeshow, 1980, p. 1052, Theorem 2).
(Note: these conditions are not stated explicitly in Theorem 2 on page 1052, but they emerge from a careful reading of the paper and the proof.)
The second term $\sum_{i=1}^{p+1} \lambda_i \chi_i^2(1)$ results from the fact that the grouping is based on estimated - i.e. random - quantities (Hosmer and Lemeshow, 1980, p. 1051).
Using simulations, they showed that the second term can be approximated (in the cases used in the simulation) by a $\chi^2(p-1)$ distribution (Hosmer and Lemeshow, 1980, p. 1060).
Combining these two facts gives a sum of two $\chi^2$ variables,
one with $G-p-1$ degrees of freedom and a second with $p-1$
degrees of freedom, so that $X^2 \sim \chi^2((G-p-1)+(p-1)) = \chi^2(G-2)$.
So the answer to the question lies in the occurrence of the weighted
chi-square term, or equivalently in the fact that the groups are defined using
estimated probabilities, which are themselves random variables.
See also Hosmer Lemeshow (1980) Paper - Theorem 2
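The $\chi^2(G-2)$ result can be illustrated numerically. Below is a minimal sketch (binary case only; the variable names and simulated data are my own, not from the paper) that computes the Hosmer-Lemeshow statistic by hand on data generated from a correctly specified model and compares it with the $\chi^2(G-2)$ reference distribution:

```r
# Minimal sketch: the binary Hosmer-Lemeshow statistic computed by hand,
# referred to its chi-square(G - 2) null distribution.
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))   # data generated from the true model
fit <- glm(y ~ x, family = binomial)
phat <- fitted(fit)

G <- 10
grp <- cut(phat,
           breaks = unique(quantile(phat, seq(0, 1, 1/G))),
           include.lowest = TRUE)      # groups defined on estimated probabilities
o_g <- tapply(y, grp, sum)             # observed events per group
e_g <- tapply(phat, grp, sum)          # expected events per group
n_g <- tapply(y, grp, length)          # group sizes

X2 <- sum((o_g - e_g)^2 / (e_g * (1 - e_g / n_g)))
pval <- pchisq(X2, df = G - 2, lower.tail = FALSE)
```

Since the fitted model is correct here, repeating this over many simulated datasets gives p-values that are roughly uniform, consistent with the $\chi^2(G-2)$ approximation.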
Update 28 July: the below has now been pushed to CRAN in the generalhoslem
package along with the Lipsitz and Pulkstenis-Robinson tests.
Fagerland and Hosmer discuss a generalisation of the Hosmer-Lemeshow test and two other approaches (the Lipsitz test and the Pulkstenis-Robinson tests) in "A goodness-of-fit test for the proportional odds regression model" (2013, Statistics in Medicine) and in "Tests for goodness of fit in ordinal logistic regression models" (2016, Journal of Statistical Computation and Simulation).
I haven't checked properly yet, but as far as I know, they haven't been implemented in R. If that turns out to be the case, I plan to add them to the generalhoslem
package.
EDIT: it's probably also worth pointing out Fagerland and Hosmer's recommendation that
because the tests may detect different types of lack of fit, a thorough assessment of goodness of fit requires use of all three approaches.
ANOTHER EDIT: I didn't realise this straight away, but the only difference between the multinomial version of the Hosmer-Lemeshow test and the ordinal version is in the degrees of freedom. Strictly speaking, the test for ordinal response models requires sorting the observations by a weighted ordinal 'response score', but they show in the 2013 article above that this is equivalent to binning the observations in the same way as in the multinomial case.
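To make the degrees-of-freedom difference concrete: with $g$ groups and $k$ outcome categories, the multinomial test uses $(g-2)(k-1)$ degrees of freedom, while the ordinal test uses $(g-2)(k-1)+(k-2)$. For example, with $g=10$ and $k=3$ this gives $8 \times 2 = 16$ versus $16 + 1 = 17$ (these are exactly the values of PARAMETER computed in the function below).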
I haven't properly tested it, but the logitgof
function on my github page should now do the trick: https://github.com/matthewjay15/generalhoslem-v1.1.0/blob/1.2.0.9000/logitgof.R Note that if you examine the observed and expected tables produced by this function, the columns may not be in the right order (they will be alphabetical). This shouldn't make any difference to the test statistic, though, as they will correspond to each other.
An example using the ordinal and MASS packages:
library(reshape) # needed by logitgof for melt() and cast()

logitgof <- function(obs, exp, g = 10, ord = FALSE) {
    DNAME <- paste(deparse(substitute(obs)), deparse(substitute(exp)), sep = ", ")
    yhat <- exp
    # multinomial/ordinal if the predicted probabilities have > 1 column
    if (is.null(ncol(yhat))) {
        mult <- FALSE
    } else if (ncol(yhat) == 1) {
        mult <- FALSE
    } else {
        mult <- TRUE
    }
    n <- ncol(yhat)
    if (mult) {
        METHOD <- if (!ord) {
            "Hosmer and Lemeshow test (multinomial model)"
        } else {
            "Hosmer and Lemeshow test (ordinal model)"
        }
        # bin observations on 1 - P(first outcome category)
        qq <- unique(quantile(1 - yhat[, 1], probs = seq(0, 1, 1/g)))
        cutyhats <- cut(1 - yhat[, 1], breaks = qq, include.lowest = TRUE)
        dfobs <- data.frame(obs, cutyhats)
        dfobsmelt <- melt(dfobs, id.vars = 2)
        observed <- cast(dfobsmelt, cutyhats ~ value, length)
        observed <- observed[order(c(1, names(observed[, 2:ncol(observed)])))]
        dfexp <- data.frame(yhat, cutyhats)
        dfexpmelt <- melt(dfexp, id.vars = ncol(dfexp))
        expected <- cast(dfexpmelt, cutyhats ~ variable, sum)
        expected <- expected[order(c(1, names(expected[, 2:ncol(expected)])))]
        # validate before computing, so the user gets a helpful error message
        if (ncol(expected) != ncol(observed)) {
            stop("Observed and expected tables have different numbers of columns. Check you entered the correct data.")
        }
        stddiffs <- abs(observed[, 2:ncol(observed)] - expected[, 2:ncol(expected)]) / sqrt(expected[, 2:ncol(expected)])
        chisq <- sum((observed[, 2:ncol(observed)] - expected[, 2:ncol(expected)])^2 / expected[, 2:ncol(expected)])
        if (!ord) {
            PARAMETER <- (nrow(expected) - 2) * (ncol(yhat) - 1)
        } else {
            PARAMETER <- (nrow(expected) - 2) * (ncol(yhat) - 1) + ncol(yhat) - 2
        }
    } else {
        METHOD <- "Hosmer and Lemeshow test (binary model)"
        y <- if (is.factor(obs)) as.numeric(obs) - 1 else obs
        qq <- unique(quantile(yhat, probs = seq(0, 1, 1/g)))
        cutyhat <- cut(yhat, breaks = qq, include.lowest = TRUE)
        observed <- xtabs(cbind(y0 = 1 - y, y1 = y) ~ cutyhat)
        expected <- xtabs(cbind(yhat0 = 1 - yhat, yhat1 = yhat) ~ cutyhat)
        stddiffs <- abs(observed - expected) / sqrt(expected)
        chisq <- sum((observed - expected)^2 / expected)
        PARAMETER <- nrow(expected) - 2
    }
    if (g != nrow(expected)) {
        warning(paste("Not possible to compute", g, "rows. There might be too few observations."))
    }
    if (any(expected[, 2:ncol(expected)] < 1)) {
        warning("At least one cell in the expected frequencies table is < 1. Chi-square approximation may be incorrect.")
    }
    PVAL <- 1 - pchisq(chisq, PARAMETER)
    names(chisq) <- "X-squared"
    names(PARAMETER) <- "df"
    structure(list(statistic = chisq, parameter = PARAMETER, p.value = PVAL,
                   method = METHOD, data.name = DNAME, observed = observed,
                   expected = expected, stddiffs = stddiffs), class = "htest")
}
library(foreign) # just to download the example dataset
# with the ordinal package
library(ordinal)
ml <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
mod1 <- clm(ses ~ female + write + read, data = ml)
# extract predicted probs for each level of outcome
predprob <- data.frame(id = ml$id, female = ml$female, read = ml$read, write = ml$write)
fv <- predict(mod1, newdata = predprob, type = "prob")$fit
logitgof(ml$ses, fv, ord = TRUE) # set ord to TRUE to run ordinal instead of multinomial test
# with MASS
library(MASS)
mod2 <- polr(ses ~ female + write + read, data = ml)
logitgof(ml$ses, fitted(mod2), ord = TRUE)
Best Answer
The Hosmer-Lemeshow test is very sensitive to the number of groups, and changing the number of groups can sometimes completely change the conclusion. My suggestion is to use other tests for comparison as well, and to be careful with the HL test.
This issue is discussed in this paper: https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf
Another published paper proposes alternatives and adjustments to the HL test, and gives recommendations for different situations: http://onlinelibrary.wiley.com/doi/10.1002/sim.5525/full
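To see the sensitivity to the number of groups in practice, here is a small sketch (binary case; the simulated data and helper function are my own) that fits a deliberately misspecified model and computes the HL p-value for several choices of $g$. The p-value, and sometimes the conclusion at a fixed significance level, can change with $g$:

```r
# Sketch: how the HL p-value varies with the number of groups g.
# The quadratic term is deliberately omitted from the fitted model.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x + 0.5 * x^2))  # true model includes x^2
fit <- glm(y ~ x, family = binomial)           # misspecified fit
phat <- fitted(fit)

hl_pval <- function(y, phat, g) {
  grp <- cut(phat,
             breaks = unique(quantile(phat, seq(0, 1, 1/g))),
             include.lowest = TRUE)
  o_g <- tapply(y, grp, sum)       # observed events per group
  e_g <- tapply(phat, grp, sum)    # expected events per group
  n_g <- tapply(y, grp, length)    # group sizes
  X2 <- sum((o_g - e_g)^2 / (e_g * (1 - e_g / n_g)))
  pchisq(X2, df = length(o_g) - 2, lower.tail = FALSE)
}

pvals <- sapply(c(5, 8, 10, 12, 15), function(g) hl_pval(y, phat, g))
```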