gui11aume's idea of building a two-stage model is the right way to go. However, one needs to consider the special difficulty of your setup, which is the very strong negative correlation between the debt amount and the probability of making a payment.
The primary issue with building a two-stage model here is that the second model (for predicting the debt amount), when built upon the "non-zeros" only, is fitted on a most likely non-random sample of the population (i.e. of the whole dataset), but the combined model has to be applied to the whole population again. This means that the second model has to make predictions for parts of the data it has never seen before, resulting in a loss of accuracy. This is called sample selection bias (for an overview from an ML perspective I recommend "A Bayesian Network Framework for Reject Inference" by Smith and Elkan).
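To make the issue concrete, here is a minimal sketch of such a two-stage ("hurdle"-style) setup in R. The data frame debts and the columns paid, amount, x1, x2 are hypothetical names used only for illustration, not taken from the original question:

## Hypothetical data: one row per debtor, `paid` is 0/1, `amount` is the amount
## recovered (observed only for payers), x1/x2 are covariates.

# Stage 1: probability of making any payment, fitted on the whole population
stage1 <- glm(paid ~ x1 + x2, family = binomial, data = debts)

# Stage 2: expected amount, fitted on the "non-zeros" (payers) only
stage2 <- glm(amount ~ x1 + x2, family = Gamma(link = "log"),
              data = subset(debts, paid == 1))

# Combined prediction for the whole population: stage 2 now has to extrapolate
# to debtors it has never seen -- this is where the selection bias enters
p.pay    <- predict(stage1, newdata = debts, type = "response")
e.amount <- predict(stage2, newdata = debts, type = "response")
expected <- p.pay * e.amount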
The KDD Cup 98 dealt with a similar problem: one had to predict whether a donor to a veterans' organization was likely to donate again and, if so, how much he or she was likely to donate. In that dataset the probability of donating again was also negatively correlated with the expected amount of money, and the sample selection bias appeared there as well.
The solution that impressed me the most can be found in "Learning and Making Decisions When Costs and Probabilities are Both Unknown" by Bianca Zadrozny and Charles Elkan. They created a cost-sensitive solution based upon the Heckman correction, which is, to my knowledge, the first systematic approach to correcting (sample) selection bias.
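If you want to try the classical Heckman correction directly in R, the sampleSelection package provides a two-step estimator. The snippet below is only a sketch using the same hypothetical debts data as above; it is the textbook Heckman two-step, not the cost-sensitive variant of Zadrozny and Elkan:

## Classical Heckman two-step correction (sketch, hypothetical variable names).
## `z` is an exclusion restriction: a variable believed to affect whether a
## debtor pays at all, but not how much is paid.
library(sampleSelection)
heck <- heckit(selection = paid   ~ x1 + x2 + z,   # selection equation (who pays)
               outcome   = amount ~ x1 + x2,       # outcome equation (payers only)
               data      = debts)
summary(heck)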
Following my own suggestion, I think I have found the solution: for the recalculation of the log-likelihoods it is sufficient to expand the binary observations y and the predicted probabilities to one row per Bernoulli trial:
Pseudo.R2 <- function(object) {
  stopifnot(object$family$family == "binomial")
  object0 <- update(object, ~ 1)              # intercept-only (null) model

  wt <- object$prior.weights                  # binomial totals per row
  y  <- object$y                              # observed proportions (weighted)
  ones  <- round(y * wt)                      # successes per row
  zeros <- wt - ones                          # failures per row

  fv <- object$fitted.values
  if (is.null(object$na.action)) fv0 <- object0$fitted.values
  else fv0 <- object0$fitted.values[-object$na.action]  # object may have missing values

  # expand each aggregated row into its individual 0/1 observations
  resp <- cbind(ones, zeros)
  Y <- apply(resp, 1, function(x) c(rep(1, x[1]), rep(0, x[2])))
  Y <- if (is.list(Y)) unlist(Y) else c(Y)

  # expand the fitted probabilities of both models accordingly
  fv.exp <- apply(cbind(fv, wt), 1, function(x) rep(x[1], x[2]))
  fv.exp <- if (is.list(fv.exp)) unlist(fv.exp) else c(fv.exp)
  fv0.exp <- apply(cbind(fv0, wt), 1, function(x) rep(x[1], x[2]))
  fv0.exp <- if (is.list(fv0.exp)) unlist(fv0.exp) else c(fv0.exp)

  # Bernoulli log-likelihoods of the fitted and the null model
  ll  <- sum(dbinom(x = Y, size = 1, prob = fv.exp,  log = TRUE))
  ll0 <- sum(dbinom(x = Y, size = 1, prob = fv0.exp, log = TRUE))
  n   <- length(Y)

  G2 <- -2 * (ll0 - ll)                       # likelihood-ratio statistic
  McFadden.R2 <- 1 - ll / ll0
  CoxSnell.R2 <- 1 - exp(2 * (ll0 - ll) / n)  # Cox & Snell / maximum-likelihood pseudo R^2
  r2ML.max    <- 1 - exp(ll0 * 2 / n)
  Nagelkerke.R2 <- CoxSnell.R2 / r2ML.max     # Nagelkerke / Cragg & Uhler's pseudo R^2

  c(llh = ll, llhNull = ll0, G2 = G2, McFadden = McFadden.R2,
    r2ML = CoxSnell.R2, r2CU = Nagelkerke.R2)
}
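A quick self-contained check (simulated data, not the cuse example shown below): the same logistic model fitted on grouped counts and on the fully expanded binary data should now give identical pseudo-$R^2$ values.

## Simulated illustration: grouped vs. expanded binary data give the same Pseudo.R2
set.seed(1)
x <- runif(50)                                  # one covariate
n <- sample(5:20, 50, replace = TRUE)           # group sizes
y <- rbinom(50, size = n, prob = plogis(-1 + 2 * x))  # successes per group

fit.grouped <- glm(cbind(y, n - y) ~ x, family = binomial)

xb <- rep(x, times = n)                         # one row per Bernoulli trial
yb <- unlist(mapply(function(s, f) c(rep(1, s), rep(0, f)), y, n - y))
fit.binary <- glm(yb ~ xb, family = binomial)

round(Pseudo.R2(fit.grouped), 4)
round(Pseudo.R2(fit.binary), 4)                 # identical to the grouped fit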
Old results (using pR2 from the pscl package):
> round(pR2(cuse.fit),4)
       llh    llhNull         G2   McFadden       r2ML       r2CU
  -50.7126  -118.6401   135.8552     0.5726     0.9998     0.9998
> round(pR2(cuse.fit2),4)
       llh    llhNull         G2   McFadden       r2ML       r2CU
 -933.9192 -1001.8468   135.8552     0.0678     0.9857     0.9857
> round(pR2(cuse.fit3),4)
       llh    llhNull         G2   McFadden       r2ML       r2CU
 -933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138
New results, using Pseudo.R2:
> round(Pseudo.R2(cuse.fit),4)
       llh    llhNull         G2   McFadden       r2ML       r2CU
 -933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138
> round(Pseudo.R2(cuse.fit2),4)
       llh    llhNull         G2   McFadden       r2ML       r2CU
 -933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138
> round(Pseudo.R2(cuse.fit3),4)
       llh    llhNull         G2   McFadden       r2ML       r2CU
 -933.9192 -1001.8468   135.8552     0.0678     0.0811     0.1138
Now the results are consistent and no longer depend on the level of aggregation (tabulation) of the data. I have notified the maintainer of the pscl package; perhaps he will be interested.
Best Answer
So I figured I'd sum up what I've learned about McFadden's pseudo $R^2$ as a proper answer.
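For reference, McFadden's measure is just a ratio of log-likelihoods (the same quantity computed as McFadden.R2 <- 1 - ll/ll0 in the R function earlier in this thread):
$$\rho^2 = 1 - \frac{\ln \hat L(M_{\text{full}})}{\ln \hat L(M_{\text{null}})},$$
where $\hat L(M_{\text{full}})$ is the maximized likelihood of the fitted model and $\hat L(M_{\text{null}})$ that of the intercept-only model.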
The seminal reference I can find for McFadden's pseudo $R^2$ is McFadden, D. (1974), "Conditional logit analysis of qualitative choice behavior," pp. 105-142 in P. Zarembka (ed.), Frontiers in Econometrics, Academic Press (http://eml.berkeley.edu/~mcfadden/travel.html). Figure 5.5 there shows the relationship between $\rho^2$ and traditional $R^2$ measures from OLS. My interpretation is that larger values of $\rho^2$ (McFadden's pseudo $R^2$) are better than smaller ones.
The interpretation of McFadden's pseudo $R^2$ between 0.2 and 0.4 comes from a book chapter he contributed to: Behavioural Travel Modelling, edited by David Hensher and Peter Stopher, 1979. McFadden contributed Ch. 15, "Quantitative Methods for Analysing Travel Behaviour of Individuals: Some Recent Developments". The discussion of model evaluation (in the context of multinomial logit models) begins on page 306, where he introduces $\rho^2$ (McFadden's pseudo $R^2$). McFadden states: "While the $R^2$ index is a more familiar concept to planners who are experienced in OLS, it is not as well behaved as the $\rho^2$ measure, for ML estimation. Those unfamiliar with $\rho^2$ should be forewarned that its values tend to be considerably lower than those of the $R^2$ index... For example, values of 0.2 to 0.4 for $\rho^2$ represent EXCELLENT fit."
So basically, $\rho^2$ can be interpreted like $R^2$, but don't expect it to be as big. And values from 0.2-0.4 indicate (in McFadden's words) excellent model fit.