I understand that one of the reasons logistic regression is frequently used for predicting click-through rates on the web is that it produces well-calibrated models. Is there a good mathematical explanation for this?
Solved – Why does logistic regression produce well-calibrated models
Related Solutions
The answer depends entirely on the amount of penalization used. If too little penalization is used, the model will be seen to be overfitted when evaluated on an independent sample; if too much, it will be found to be underfitted. The goal is to solve for the penalty that gets it "just right". Two ways to do this are cross-validation (e.g., 100 repeats of 10-fold cross-validation) or computing the effective AIC and solving for the penalty that optimizes it.
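To make the cross-validation route concrete, here is a minimal sketch (assuming scikit-learn, and a feature matrix `X` with 0/1 outcomes `y` already in hand; the grid and number of repeats are illustrative) of choosing the penalty by repeated cross-validation with a proper scoring rule:

```python
# A minimal sketch of choosing the penalty by repeated cross-validation;
# X (n x p feature matrix) and y (0/1 outcomes) are assumed to exist.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import RepeatedStratifiedKFold

# Repeated k-fold CV approximates the "100 repeats of 10-fold" idea
# (fewer repeats here to keep the example cheap).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

model = LogisticRegressionCV(
    Cs=np.logspace(-4, 4, 25),   # grid of inverse penalty strengths
    cv=cv,
    penalty="l2",
    scoring="neg_log_loss",      # proper scoring rule, sensitive to calibration
    max_iter=5000,
).fit(X, y)

print("selected C (inverse penalty strength):", model.C_[0])
```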
The calibration curve is the plot of x=predicted probability of the event vs. y=actual probability of the event. The actual probability is obtained (in an independent sample, or using the bootstrap to correct the "apparent" calibration) by running a smoother on $(\hat{P}, Y)$, where $Y$ is the vector of binary outcomes. If the calibration curve is linear, it can be summarized by its intercept and slope. When the slope is greater than 1 there is underfitting, and when the slope is less than 1 there is overfitting (regression to the mean; low $\hat{P}$ are too low and high ones are too high).
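As a rough illustration of that plot (assuming held-out predictions `p_hat` and outcomes `y` are available, and using a LOWESS smoother from statsmodels as one possible choice of smoother):

```python
# A rough sketch of a calibration curve: smooth y against the predicted
# probabilities p_hat (both assumed to come from an independent sample).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# lowess returns sorted (x, smoothed y) pairs: x = predicted probability,
# y = smoothed estimate of the actual event probability.
smoothed = lowess(y, p_hat, frac=0.3)

plt.plot(smoothed[:, 0], smoothed[:, 1], label="smoothed calibration")
plt.plot([0, 1], [0, 1], linestyle="--", label="ideal (slope 1)")
plt.xlabel("predicted probability")
plt.ylabel("actual probability")
plt.legend()
plt.show()

# A slope noticeably below 1 in a line fitted through these points would
# signal overfitting; a slope above 1, underfitting.
```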
The Bayesian approach to this has major advantages over what is described above, including:
- the penalty is automatically estimated if you do a reasonable job of specifying a prior distribution for the penalty parameters (usually stated in terms of the reciprocal, i.e., the variance of random effects); a minimal sketch of this follows the list
- once finished, the Bayesian posterior distribution works exactly as it should, whereas the frequentist approach does not give us confidence intervals or hypothesis tests once penalization is used
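As a loose sketch of the first point (assuming PyMC is available; the particular priors and variable names below are illustrative, not a recommendation), the shrinkage scale gets its own prior and is estimated along with the coefficients:

```python
# A minimal sketch (assuming PyMC is installed) of letting the data
# estimate the amount of shrinkage: the prior sd of the coefficients
# plays the role of the inverse penalty and gets its own prior.
import pymc as pm

with pm.Model() as model:
    tau = pm.HalfNormal("tau", sigma=1.0)             # shrinkage scale, estimated
    beta = pm.Normal("beta", mu=0.0, sigma=tau, shape=X.shape[1])
    intercept = pm.Normal("intercept", mu=0.0, sigma=5.0)
    pm.Bernoulli("obs", logit_p=intercept + pm.math.dot(X, beta), observed=y)
    trace = pm.sample(1000, tune=1000)                # full posterior, not a point estimate
```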
Although this question and its first answer seem to be focused on theoretical issues of logistic regression model calibration, the issue of:
How could one ruin the calibration of a logistic regression...?
deserves some attention with respect to real-world applications, for future readers of this page. We shouldn't forget that the logistic regression model has to be well specified, and that misspecification can be particularly troublesome for this type of model.
First, if the log-odds of class membership is not linearly related to the predictors included in the model then it will not be well calibrated. Harrell's chapter 10 on Binary Logistic Regression devotes about 20 pages to "Assessment of Model Fit" so that one can take advantage of the "asymptotic unbiasedness of the maximum likelihood estimator," as @whuber put it, in practice.
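One rough way to probe that linearity assumption in Python (this is only a sketch, not Harrell's restricted-cubic-spline machinery; it assumes a single continuous predictor `x` as a 1-D array and binary outcomes `y`) is to compare a linear fit against a spline expansion with a likelihood-ratio test:

```python
# Sketch: is the log-odds linear in x? Compare a linear logistic fit
# against a B-spline expansion of x via a likelihood-ratio test.
import statsmodels.api as sm
from scipy.stats import chi2
from sklearn.preprocessing import SplineTransformer

X_lin = sm.add_constant(x)                                   # intercept + linear term
spline = SplineTransformer(degree=3, n_knots=5, include_bias=False)
X_spl = sm.add_constant(spline.fit_transform(x.reshape(-1, 1)))

fit_lin = sm.Logit(y, X_lin).fit(disp=0)
fit_spl = sm.Logit(y, X_spl).fit(disp=0)

lr_stat = 2 * (fit_spl.llf - fit_lin.llf)                    # likelihood-ratio statistic
extra_df = X_spl.shape[1] - X_lin.shape[1]
print("p-value for departure from linearity:", chi2.sf(lr_stat, extra_df))
```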
Second, model specification is a particular issue in logistic regression, as it has an inherent omitted-variable bias that can be surprising to those with a background in ordinary linear regression. As that page puts it:
Omitted variables will bias the coefficients on included variables even if the omitted variables are uncorrelated with the included variables.
That page also has a useful explanation of why this behavior is to be expected, with a theoretical explanation for related, analytically tractable, probit models. So unless you know that you have included all predictors related to class membership, you might run into dangers of misspecification and poor calibration in practice.
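A small simulation makes this concrete (the data-generating process below is made up purely for illustration, not taken from the page cited above): `x2` is generated independently of `x1`, yet leaving it out of the model attenuates the estimated coefficient on `x1` toward zero.

```python
# Simulation of omitted-variable attenuation in logistic regression:
# x2 is independent of x1, yet omitting it shrinks the coefficient on x1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                        # independent of x1
p = 1 / (1 + np.exp(-(1.0 * x1 + 2.0 * x2)))   # true coefficients: 1 and 2
y = rng.binomial(1, p)

full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
omit = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

print("coefficient on x1, both predictors:", full.params[1])   # close to 1.0
print("coefficient on x1, x2 omitted:     ", omit.params[1])   # noticeably below 1.0
```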
With respect to model specification, it's quite possible that tree-based methods like random forest, which do not assume linearity over an entire range of predictor values and inherently provide the possibility of finding and including interactions among predictors, will end up with a better-calibrated model in practice than a logistic regression model that does not take interaction terms or non-linearity sufficiently into account. With respect to omitted-variable bias, it's not clear to me whether any method for evaluating class-membership probabilities can deal with that issue adequately.
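As a hedged illustration of that first point (synthetic data whose true log-odds are strongly non-linear in one predictor; the data-generating process is made up for the example), one can compare binned reliability for a linear-terms-only logistic regression and a random forest:

```python
# Rough comparison of calibration when the true log-odds are non-linear:
# a linear-terms-only logistic regression tends to be miscalibrated in the
# tails here, while a random forest can adapt to the shape.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 2))
logit = X[:, 0] ** 3 + X[:, 1] - 0.5            # non-linear in the first predictor
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [("logistic, linear terms only", LogisticRegression(max_iter=1000)),
          ("random forest", RandomForestClassifier(min_samples_leaf=50, random_state=0))]
for name, clf in models:
    p = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    bin_actual, bin_pred = calibration_curve(y_te, p, n_bins=10)
    print(name, "- max |bin prediction - bin event rate|:",
          round(float(np.max(np.abs(bin_pred - bin_actual))), 3))
```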
Best Answer
Yes.
The predicted probability vector $p$ from logistic regression satisfies the matrix equation
$$ X^t(p - y) = 0$$
where $X$ is the design matrix and $y$ is the response vector. This can be viewed as a collection of linear equations, one arising from each column of the design matrix $X$.
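For completeness, this is just the first-order condition for the (unpenalized) maximum-likelihood estimate. With $p_i = \sigma(x_i^t \beta)$, where $\sigma$ is the logistic function, the log-likelihood is

$$ \ell(\beta) = \sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \qquad \frac{\partial \ell}{\partial \beta} = \sum_i (y_i - p_i)\, x_i = X^t (y - p), $$

and setting this gradient to zero gives $X^t(p - y) = 0$.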
Specializing to the intercept column (a column of ones, hence a row of ones in the transposed matrix), the associated linear equation is
$$ \sum_i( p_i - y_i) = 0 $$
so the overall average predicted probability is equal to the average of the response.
More generally, for a binary feature column $x_{ij}$, the associated linear equation is
$$ \sum_i x_{ij}(p_i - y_i) = \sum_{i \mid x_{ij} = 1}(p_i - y_i) = 0$$
so the sum (and hence average) of the predicted probabilities equals the sum of the response, even when specializing to those records for which $x_{ij} = 1$.
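If a numerical sanity check helps (the data-generating process below is made up purely for illustration), an unpenalized fit reproduces these identities to numerical precision:

```python
# Quick numeric check of the identities above: fit an (unpenalized)
# logistic regression and confirm that the average prediction matches the
# event rate overall and within a group defined by a binary feature column.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
x_cont = rng.normal(size=n)
x_bin = rng.binomial(1, 0.4, size=n)           # a binary feature column
p_true = 1 / (1 + np.exp(-(0.5 + x_cont - 1.5 * x_bin)))
y = rng.binomial(1, p_true)

X = np.column_stack([x_cont, x_bin])
# C must be very large so the penalty is effectively off; otherwise the
# score equation, and hence the calibration identity, no longer holds exactly.
clf = LogisticRegression(C=1e10, max_iter=5000).fit(X, y)
p_hat = clf.predict_proba(X)[:, 1]

print("overall:   ", p_hat.mean(), "vs", y.mean())
print("x_bin == 1:", p_hat[x_bin == 1].mean(), "vs", y[x_bin == 1].mean())
```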