Probability Estimation – How to Evaluate the Quality of Probability Estimation

Tags: classification, machine-learning, probability

I have a classifier that gives its decisions as probability estimates: for each datum it returns a probability $p_j$ for each known class $j$: $p_j(\vec{x})=P(c_{real}=j\mid\vec{x})$.

I have a training dataset, on which I trained the classifier, and a test dataset to test it. In both datasets the data are in the form $(\vec{x_i}, y_i)$, where $\vec{x_i}$ is the feature vector of each datum and $y_i$ is the correct human-assigned class (each datum has exactly one correct class).

I can easily evaluate the quality of classification with PR and ROC curves, but I also want to evaluate the quality of the probability estimates that my classifier produces.

Currently I use the following approach:

1) I distribute all the $p_j$ into 10 bins:

$(0 \ldots 0.1), (0.1 \ldots 0.2), \ldots, (0.9 \ldots 1.0)$.

2) Then, for each bin, I count how many decisions fall into it and how many of them are right: $P_{bin}=\frac{\text{right decisions in bin}}{\text{total decisions in bin}}$

3) Having measured this probability curve, I compute its sum of squared deviations from $f(x)=x$ over the bins and try to find a classifier for which this sum is minimal (see the sketch after this list).
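For reference, a minimal NumPy sketch of the binned procedure above; the names `p_pred` and `correct` are placeholders for your data, and I assume deviations are measured against the bin centers:

```python
import numpy as np

def reliability_curve(p_pred, correct, n_bins=10):
    """Bin predicted probabilities, compute the fraction of right
    decisions per bin, and sum squared deviations from f(x) = x."""
    p_pred = np.asarray(p_pred, dtype=float)
    correct = np.asarray(correct, dtype=float)    # 1 = right decision, 0 = wrong
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each probability to a bin index 0..n_bins-1
    bins = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    centers, fracs, sq_dev = [], [], 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue                              # skip empty bins
        frac = correct[mask].mean()               # right decisions / total in bin
        center = (edges[b] + edges[b + 1]) / 2.0
        centers.append(center)
        fracs.append(frac)
        sq_dev += (frac - center) ** 2            # deviation from the diagonal
    return np.array(centers), np.array(fracs), sq_dev
```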

But I think this is not the best way to evaluate probability estimates. Can you suggest something more reliable? Maybe there are some standard techniques?

P.S. Some probability curves, measured as described above:

[Plot: probability curves measured with the method above]

The text labels on the plot show (#right decisions in bin / #decisions in bin) for one of the measured probability functions.

Best Answer

You can find a related question here.

Your method is quite similar to a standard one, namely the "reliability plot" (also called a reliability diagram).

Say you choose a target class among the two in a binary classification problem (classes 1 and 0). Given $N$ records in the test set indexed by $i$, let $P_i$ be the actual outcome for record $i$: either $1$ or $0$ in a binary problem. Let $p_i$ be the probability the model assigns to class 1. You can then compute the mean squared error, known as the Brier score: $$BS = \frac{1}{N}\sum_{i=1}^N\big(P_i-p_{i}\big)^2$$
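Computing this is essentially a one-liner; a sketch, assuming `y_true` holds the 0/1 outcomes and `p_pred` the model's class-1 probabilities (scikit-learn also ships this as `sklearn.metrics.brier_score_loss`):

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((y_true - p_pred) ** 2))

# e.g. brier_score([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.4]) == 0.1425
```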

This score can be written as the sum of two terms: calibration and refinement. The calibration component captures how well the predicted probabilities match the empirically observed frequencies, while the refinement component captures how well the model discriminates between classes.

If your model outputs just $k$ distinct probabilities, the data are effectively partitioned into $k$ subsets. Let $p_j$ be the probability output for subset $j$, $j=1,\dots,k$. You can compute the empirical probability $r_j$ of the corresponding group $j$ in the test set, i.e. the records for which the model returns $p_j$ as output. If $N_j$ is the total number of such records, $r_j$ is the number of class-1 records among them divided by $N_j$. Then $$C=\frac{1}{N}\sum_{j=1}^{k}N_j(r_j-p_j)^2 \qquad R=\frac{1}{N}\sum_{j=1}^{k}N_jr_j(1-r_j)$$ and $BS = C + R$ exactly.
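As a sketch, this decomposition can be computed by grouping records on the distinct model outputs (the function and variable names below are mine, not standard):

```python
import numpy as np

def brier_decomposition(y_true, p_pred):
    """Split the Brier score into calibration C and refinement R by
    grouping test records on the k distinct predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    n = len(y_true)
    calibration = refinement = 0.0
    for p_j in np.unique(p_pred):                 # the k distinct outputs
        mask = p_pred == p_j
        n_j = mask.sum()                          # N_j records predicted as p_j
        r_j = y_true[mask].mean()                 # empirical class-1 frequency
        calibration += n_j * (r_j - p_j) ** 2
        refinement += n_j * r_j * (1.0 - r_j)
    return calibration / n, refinement / n

# Sanity check: C + R reproduces the Brier score exactly.
# c, r = brier_decomposition(y, p); np.isclose(c + r, brier_score(y, p))
```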

In case your model outputs too many distinct probabilities I think you can set up $k$ bins and proceed in the same way as above.
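A sketch of that binning step, again with hypothetical names; replacing each output by the mean prediction of its bin keeps $C + R$ equal to the Brier score of the binned predictions (though not of the original ones):

```python
import numpy as np

def bin_predictions(p_pred, n_bins=10):
    """Collapse continuous outputs to k values: each prediction is
    replaced by the mean predicted probability of its bin."""
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    binned = np.empty_like(p_pred)
    for b in np.unique(bins):
        binned[bins == b] = p_pred[bins == b].mean()
    return binned

# brier_decomposition(y, bin_predictions(p)) then applies as above.
```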

Your $P_{bin}$ looks similar to $r_j$, but it is not clear how you defined right-decisions-in-bin. If your model outputs probabilities, you have to set a threshold to make a decision before you can talk about right decisions.
