Solved – Understanding how good a prediction is, in logistic regression

logistic, prediction

I have fit a logistic regression model with four features and now I am able to use it for prediction.

When I plot the data points from my training sample (i.e. I select two features and plot them on two axes), I see that the points are denser in some regions than in others. I would like my model to return not only the probabilities, but also a number that tells me how good these probabilities are (e.g. a p-value). I imagine that the quality of a prediction depends on the region in which it is made: the denser the sample points in that region, the better the prediction, and conversely.

Is there a way of obtaining a value that tells me how good my prediction is?

Concretely, I am using Python's scikit-learn logistic regression package, although the answer can be more general.

Best Answer

The key thing to check first is the model's calibration, either using the bootstrap to correct for overfitting or using a very large independent sample that was not used for model development or fitting. The best way to assess calibration is with a loess smoother (a nonparametric regression) of the observed outcomes against the predicted probabilities. Once you have established calibration, you can go on to predictive discrimination using the pseudo $R^2$ and Somers' $D_{xy}$ rank correlation coefficient, or a simple translation of it to the $c$-index, also known as the concordance probability or AUROC. The Brier score is an excellent addition to all this.
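To make the quantities named above concrete, here is a minimal sketch in plain Python, not the answerer's own code. The function names and the toy data are made up for illustration. The calibration check uses simple equal-width bins, which is a cruder stand-in for the loess smooth recommended above, and the pseudo $R^2$ and the bootstrap overfitting correction are not shown. The sketch relies on the standard relation $D_{xy} = 2(c - 0.5)$ for a binary outcome.

```python
from itertools import product


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def c_index(y_true, p_pred):
    """Concordance probability (equal to AUROC for a binary outcome).

    Over all (event, non-event) pairs, the fraction in which the event
    received the higher predicted probability; ties count one half.
    """
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    score = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(pos, neg)
    )
    return score / (len(pos) * len(neg))


def somers_dxy(y_true, p_pred):
    """Somers' Dxy obtained from the c-index: Dxy = 2 * (c - 0.5)."""
    return 2.0 * (c_index(y_true, p_pred) - 0.5)


def calibration_table(y_true, p_pred, n_bins=5):
    """Binned calibration check (an assumption: bins instead of loess).

    Returns one (mean predicted probability, observed event fraction, count)
    tuple per non-empty equal-width bin on [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            obs_frac = sum(y for _, y in b) / len(b)
            rows.append((mean_pred, obs_frac, len(b)))
    return rows


# Hypothetical held-out outcomes and predicted probabilities (toy data).
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print("Brier score:", brier_score(y_toy, p_toy))
print("c-index    :", c_index(y_toy, p_toy))
print("Somers' Dxy:", somers_dxy(y_toy, p_toy))
for mean_pred, obs_frac, n in calibration_table(y_toy, p_toy, n_bins=2):
    print(f"mean predicted={mean_pred:.2f}  observed={obs_frac:.2f}  n={n}")
```

On this toy data the Brier score is about 0.2, the $c$-index is 0.75 and $D_{xy}$ is 0.5; in a well-calibrated model the mean predicted probability and the observed fraction in each bin should be close. With scikit-learn, the corresponding built-ins are `sklearn.metrics.brier_score_loss`, `sklearn.metrics.roc_auc_score` and `sklearn.calibration.calibration_curve`, applied to the output of `predict_proba` on data not used for fitting.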