You are on the right track: the ROC curve is a common way to evaluate logistic regression models. More often, the Area Under the Receiver Operating Characteristic Curve (AUROC) is used. The advantage is that this measure is a single number and can be compared across validation runs / model setups of your logistic regression.
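As a quick illustration, here is a minimal sketch of computing the AUROC for a fitted logistic regression with the pROC package (the data frame df, outcome y, and predictors x1 and x2 are hypothetical placeholders):
library(pROC)

# hypothetical data frame df with binary outcome y and predictors x1, x2
fit <- glm(y ~ x1 + x2, data = df, family = "binomial")

# predicted probabilities (use held-out data in practice, see below)
p <- predict(fit, type = "response")

roc_obj <- roc(df$y, p, quiet = TRUE)
auc(roc_obj)   # a single number between 0.5 (chance) and 1 (perfect)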
You can, for example, use cross-validation to assess the performance of your model. Because this goodness-of-fit estimate depends heavily on the particular training and test sets, it is common to use many repetitions with different training and test sets. Taking the mean over all repetitions then gives you a fairly stable estimate of your model fit.
There are several packages providing cross-validation approaches in R. Assuming you have a fitted model, you can, for example, use the sperrorest package with the following setup:
nspres <- sperrorest(
  data = data, formula = formula,                  # your data and formula here
  model_fun = glm,
  model_args = list(family = "binomial"),          # logistic regression
  pred_fun = predict,
  pred_args = list(type = "response"),             # predicted probabilities
  smp_fun = partition_cv,                          # random (non-spatial) k-fold partitioning
  smp_args = list(repetition = 1:50, nfold = 10)   # 50 repetitions of 10-fold CV
)

# pooled AUROC over all repetitions, on the training and the test folds
summary(nspres$pooled.err$train.auroc)
summary(nspres$pooled.err$test.auroc)
This will perform a cross-validation with 10 folds and 50 repetitions and give you a summary of the pooled error across all repetitions.
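If you prefer to avoid the package dependency, the same repeated k-fold scheme is straightforward to write by hand. Here is a minimal base-R sketch (pROC is used only for the AUC; the data frame df with binary outcome y is again a hypothetical placeholder):
library(pROC)

set.seed(42)
reps <- 50
k    <- 10
auc_mat <- matrix(NA_real_, nrow = reps, ncol = k)

for (r in seq_len(reps)) {
  # random fold assignment, redrawn for every repetition
  folds <- sample(rep(seq_len(k), length.out = nrow(df)))
  for (f in seq_len(k)) {
    train <- df[folds != f, ]
    test  <- df[folds == f, ]
    fit   <- glm(y ~ ., data = train, family = "binomial")
    p     <- predict(fit, newdata = test, type = "response")
    auc_mat[r, f] <- as.numeric(auc(roc(test$y, p, quiet = TRUE)))
  }
}

mean(auc_mat)   # overall mean test AUROC across folds and repetitions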
You may want to consider a measure of accuracy based on the distance between the fitted line and the observed data. There are several such measures, including the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Square Error (RMSE).
The following is an example in R. The amount of vertical error for model and model2 is the same, but model has zero slope and zero r-squared, while model2 has an obvious slope and a high r-squared. You can compare the MAE, MSE, or RMSE statistics. (Caveat: I am the author of the accuracy function.)
if(!require(rcompanion)){install.packages("rcompanion")}
library(rcompanion)

### A flat relationship: the fitted line has zero slope and zero r-squared
X = c(1,2,3,4,5,6,7,8,9,10)
Y = c(5,6,4,5,5,5,5,4,6,5)

model = lm(Y ~ X)
plot(Y ~ X)

accuracy(list(model), plotit=FALSE)
### Min.max.accuracy MAE MAPE MSE RMSE NRMSE.mean NRMSE.median NRMSE.mean.accuracy NRMSE.median.accuracy Efron.r.squared CV.prcnt
### 0.927 0.4 0.0833 0.4 0.632 0.126 0.126 0.874 0.874 0 12.6
### The same vertical errors around an obvious slope: high r-squared
X = c(1,2,3,4,5,6,7,8,9,10)
Z = X + c(5,6,4,5,5,5,5,4,6,5)

model2 = lm(Z ~ X)
plot(Z ~ X)

accuracy(list(model2), plotit=FALSE)
### Min.max.accuracy MAE MAPE MSE RMSE NRMSE.mean NRMSE.median NRMSE.mean.accuracy NRMSE.median.accuracy Efron.r.squared CV.prcnt
### 0.961 0.4 0.0418 0.4 0.632 0.0602 0.0602 0.94 0.94 0.954 6.02
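For reference, MAE, MSE, and RMSE are simple functions of the residuals, so you can verify the accuracy() output by hand. A minimal sketch for model:
res  <- residuals(model)
mae  <- mean(abs(res))   # Mean Absolute Error
mse  <- mean(res^2)      # Mean Squared Error
rmse <- sqrt(mse)        # Root Mean Square Error
c(MAE = mae, MSE = mse, RMSE = rmse)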
Best Answer
What you are looking for is a goodness-of-fit measure for a statistical model. These measures summarize the discrepancy between the observed values and the values expected under the model. Depending on your setting, you could use the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), etc.
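Both criteria are available in R for fitted models out of the box. A quick sketch on a built-in data set (the model itself is only an illustration):
# logistic regression on a built-in data set, purely for illustration
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

AIC(fit)   # Akaike Information Criterion: lower is better
BIC(fit)   # Bayesian Information Criterion: penalizes extra parameters more heavily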
If you are looking for more than just a goodness-of-fit measure, you could use methods involving model mimicry. Model mimicry is a concept where one model tries to account for the data generated by another model. The better-fitting model of the two normally accounts for its own data and, to a certain extent, also accounts for the data generated by the competing model. (Note that the two models should be competing models.)
Look at the paper where Wagenmakers et al. describe model selection by quantifying model mimicry ("Assessing model mimicry using the parametric bootstrap", Wagenmakers, Ratcliff, Gomez, and Iverson, Journal of Mathematical Psychology, 2004). This intuitive procedure can be easily coded in R, as sketched below.
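The following is not the authors' exact procedure, but a rough parametric-bootstrap sketch of the cross-fitting idea, with two hypothetical competing models (a linear and a quadratic fit to the same data):
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x + rnorm(50)

mA <- lm(y ~ x)            # model A: linear
mB <- lm(y ~ poly(x, 2))   # model B: quadratic

# simulate one data set from a fitted model and record the AIC difference
# between the two candidates refit to it (negative favors the linear model)
delta_aic <- function(gen_model) {
  y_sim <- simulate(gen_model, nsim = 1)[[1]]   # parametric bootstrap sample
  AIC(lm(y_sim ~ x)) - AIC(lm(y_sim ~ poly(x, 2)))
}

# bootstrap distributions of the fit difference under each generating model
dA <- replicate(500, delta_aic(mA))
dB <- replicate(500, delta_aic(mB))

# if the observed difference looks like dA, the data are consistent with the
# simpler model; if it looks like dB, the quadratic term is doing real work
obs <- AIC(mA) - AIC(mB)
c(observed = obs, under_A = mean(dA), under_B = mean(dB))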
Hope it helps!