R Logistic Regression – Connecting Cumulative Distribution Function (CDF) with Logistic Regression

Tags: cumulative distribution function, logistic, r

Is the probability calculated by a logistic regression model (the one that is logit-transformed) a fit of the cumulative distribution function of successes in the original data (ordered by the X variable)?

EDIT: In other words – how do you plot the probability distribution of the original data that you get when you fit a logistic regression model?

The motivation for the question was Jeff Leek's example of regressing whether the Ravens won a game on their score in that game (from Coursera's Data Analysis course). Admittedly, the problem is artificial (see @FrankHarrell's comment below). Here is his data with a mix of his and my code:

download.file("http://dl.dropbox.com/u/7710864/data/ravensData.rda", 
              destfile="ravensData.rda", method="internal")
load("ravensData.rda")

plot(ravenWinNum~ravenScore, data=ravensData)

[Plot: ravenWinNum vs. ravenScore]

It doesn't seem like good material for logistic regression, but let's try anyway:

logRegRavens <- glm(ravenWinNum ~ ravenScore, data=ravensData, family=binomial)
summary(logRegRavens)
# the beta is not significant

# sort table by ravenScore (X)
rav2 = ravensData[order(ravensData$ravenScore), ]

# plot CDF
plot(sort(ravensData$ravenScore), cumsum(rav2$ravenWinNum)/sum(rav2$ravenWinNum), 
     pch=19, col="blue", xlab="Score", ylab="Prob Ravens Win", ylim=c(0,1), 
     xlim=c(-10,50))
# overplot fitted values (Jeff's)
points(ravensData$ravenScore, logRegRavens$fitted, pch=19, col="red")
# overplot regression curve
curve(1/(1+exp(-(logRegRavens$coef[1]+logRegRavens$coef[2]*x))), -10, 50, add=T)

If I understand logistic regression correctly, R does a pretty bad job at finding the right coefficients in this case.

[Plot: blue cumulative points, red fitted values, and the fitted logistic curve over Score]

  • blue = original data to be fitted, I believe (CDF)
  • red = prediction from the model (fitted data = projection of original data onto regression curve)

SOLVED
– lowess seems to be a good non-parametric estimator of the original data = what is being fitted (thanks @gung). Seeing it allows us to choose the right model, which in this case means adding a squared term to the previous model (@gung)
– Of course, the problem is pretty artificial and modelling it is rather pointless in general (@FrankHarrell)
– in regular logistic regression it is not a CDF but pointwise conditional probabilities – first pointed out by @FrankHarrell; my embarrassing inability to calculate a CDF was also pointed out by @gung.

Best Answer

There are several issues here. The biggest is that the thinking behind the code that generates the blue points in the second plot is confused; I can't make sense of what the logic is supposed to be. The $x$ values are the same as your data, but the $y$ values are the number of wins up to that point on the $x$ axis divided by the total number of wins. That really makes no sense: you aren't even dividing by the total number of games, so you don't have the cumulative probability at that point (although even that wouldn't make sense). This may also make you wonder why the last blue point isn't at $1.0$. Actually it is, but that point is at $x=55$, which is excluded by your xlim=c(-10,50) argument.
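To see this concretely (a sketch, assuming the rav2 data frame from the question is in your workspace), you can print the cumulative proportion at the largest score:

# the cumulative proportion of wins reaches 1.0 at the largest score,
# which lies beyond the xlim=c(-10,50) used in the plot
cum.prop <- cumsum(rav2$ravenWinNum)/sum(rav2$ravenWinNum)
tail(cbind(score=rav2$ravenScore, cum.prop), 1)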

At any rate, if you want to get a non-parametric estimate of what the underlying function might be, you could plot a lowess line:

plot(ravenWinNum~ravenScore, data=ravensData)
lines(with(ravensData, lowess(ravenWinNum~ravenScore)))

[Plot: raw data with the lowess line overlaid]

What is clear from this figure is that there is a curvilinear relationship between the score and the probability of winning (at least in these data; we can debate whether that makes any theoretical sense). If you think about it, this was also clear from the plot of the raw data: there are increasing numbers of wins at both the low and high ends of $x$, and the losses are in the middle. A very simple way to fit a curvilinear relationship is to add a squared term:

lr.Ravens2 <- glm(ravenWinNum~ravenScore+I(ravenScore^2), data=ravensData, 
                  family=binomial)

We can test these models as a whole by assessing their deviance against the null deviance and comparing the result to a chi-squared distribution with the number of degrees of freedom consumed. We do this because we want to think of both the linear term and the squared term together as a unit. This method of testing the model as a whole is analogous to the global $F$-test that comes with a multiple regression model.

# here is the test of the model with the squared term as a whole
1-pchisq(24.435-16.875, df=2)
[1] 0.02282269
# here is the test of your model
1-pchisq(24.435-20.895, df=1)
[1] 0.05990546
# this is the test of the improvement due to adding the squared term
1-pchisq(20.895-16.875, df=1)
[1] 0.04496371
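The deviances plugged in above are the null and residual deviances from the two model summaries (null deviance 24.435; residual deviances 20.895 and 16.875). As a sketch, the same tests can be computed directly from the fitted objects, assuming logRegRavens and lr.Ravens2 from above exist:

# test of the quadratic model as a whole (2 df used)
with(lr.Ravens2, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail=FALSE))
# test of the linear-only model as a whole (1 df used)
with(logRegRavens, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail=FALSE))
# likelihood-ratio test of the improvement from adding the squared term
anova(logRegRavens, lr.Ravens2, test="Chisq")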

I wouldn't make too much of the arbitrary threshold of significance at $.05$, but we see that the initial model is not quite significant by conventional standards. (Given that you have so few data, I would think a higher $\alpha$ is fine.) The model that also includes the squared term is significant, even though it uses up another of your precious few degrees of freedom. The AIC is lower as well: the first model has $AIC=24.895$, whereas the second model has $AIC=22.875$. The last of the tests listed shows that adding the squared term yields a significant improvement in fit.
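If you want to check those AICs directly (again assuming both fitted models exist), a one-liner does it:

# AICs of the two models side by side
AIC(logRegRavens, lr.Ravens2)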

The implication of the curvilinear relationship between $x$ and $y$ is that the function mapping $x_i$ to the conditional probability that $y_i=1$ cannot be equated with a cumulative probability distribution that takes $x$ as its quantiles.

For completeness, here is the same plot with the model's predicted probabilities overlaid:

[Plot: lowess line with the quadratic model's predicted probabilities overlaid]

We see that the predicted probabilities mirror the lowess line reasonably well. Of course, they don't match perfectly, but there is no reason to believe that the lowess line is exactly correct; it just provides a non-parametric approximation of whatever the underlying function is.
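For reference, one way such an overlay might be produced (a sketch, assuming ravensData and lr.Ravens2 from above exist):

# overlay the lowess line and the quadratic model's predicted probabilities
plot(ravenWinNum~ravenScore, data=ravensData, xlab="Score", ylab="Prob Ravens Win")
lines(with(ravensData, lowess(ravenWinNum~ravenScore)))
score.grid <- seq(min(ravensData$ravenScore), max(ravensData$ravenScore), length.out=200)
pred.prob  <- predict(lr.Ravens2, newdata=data.frame(ravenScore=score.grid), type="response")
lines(score.grid, pred.prob, col="red")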


Having addressed the specific example in your question, it is worth addressing the assumption underlying the question more broadly. What is the relationship between logistic regression and a cumulative distribution function (CDF)?

There can be a connection between binomial regression and a CDF. For example, when the probit link is used, the conditional probability of 'success' ($\pi_i = \Pr(Y_i=1 \mid X=x_i)$) is transformed via the inverse of the normal CDF. (For more about that, it may help to read my answer here: difference between logit and probit models.) But there are a couple of points to be made here. First, as @FrankHarrell and @Glen_b note, binomial (e.g., logistic) regression is about modeling conditional probabilities at each point.
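As a small illustration of the probit case (a sketch only; probitRavens is a hypothetical fit to the same artificial data), the fitted probabilities are simply the normal CDF applied to the linear predictor:

# probit link: fitted probabilities are the normal CDF of the linear predictor
probitRavens <- glm(ravenWinNum~ravenScore, data=ravensData, family=binomial(link="probit"))
all.equal(fitted(probitRavens), pnorm(predict(probitRavens, type="link")))  # should be TRUE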

Second, although the fitted function can look somewhat like a CDF when the conditions are just right, it will never actually be a (scaled, shifted, etc.) function of the CDF of your $X$ data. Your own data serve as an example of the fitted function not even looking like a CDF, because you have a polynomial relationship. But the point holds even when the function does look like a CDF. The easiest way to see this is that your last (highest-$x$) point will always be 100% of the way through your data, but the fitted function will never reach $\hat p_i=1.0$; it can only approach that value as $X$ (the variable, not your data) approaches infinity. As a result, you should not think of a function of the CDF of your $x$ data as the values the logistic regression is supposed to approximate.
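To make that last point concrete with these data (a sketch, assuming logRegRavens from above exists): the cumulative proportion of wins is exactly 1 at the largest observed score, but the fitted logistic probability there is strictly below 1:

# fitted logistic probability at the largest observed score is < 1,
# whereas a CDF-style cumulative proportion would be exactly 1 there
max.score <- max(ravensData$ravenScore)
plogis(coef(logRegRavens)[1] + coef(logRegRavens)[2]*max.score)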
