Solved – Confusion matrix, metrics, & joint vs. conditional probabilities

conditional probability, confusion matrix, precision-recall, terminology

In the binary classification/prediction problem we have unknown labels $y\in\{0,1\}$, which we try to predict using an estimator $\hat{y}$. Commonly the performance of an estimator is summarized using a confusion matrix, along with many related performance metrics.

I have not done much work in this area, and reading about things like this, I can definitely sympathize!

To the extent I have thought about these issues, I find it much clearer to think about things in terms of joint/marginal/conditional probabilities. But then things like this make me less certain of my understanding.

So I am looking for an explanation of the relationships between:

  1. Standard confusion-matrix related terminology (see the big table on Wikipedia here)
  2. The various probability distributions of $y$ and $\hat{y}$ (i.e. joint vs. conditional vs. marginal)

Is there something like a "dictionary" to translate between the two descriptions?


Process Note: I have a notion of the answer I am looking for, which I think could be useful to others as well. However I am not fully certain, and could use feedback in any case. So I will be drafting an "answer" below, but hope others more knowledgeable may respond as well. (Also, I do not believe this is a duplicate, but if I missed an existing Q & A please let me know.)

Best Answer

For predicted labels $\hat{y}$ and true labels $y\in\{0,1\}$, the confusion matrix is given by

\begin{array}{c|c:c|c} & y=0 & y=1 & \\ \hline \hat{y}=0 & \mathrm{TN} & \mathrm{FN} & \hat{\mathrm{N}} \\ \hdashline \hat{y}=1 & \mathrm{FP} & \mathrm{TP} & \hat{\mathrm{P}} \\ \hline & \mathrm{N} & \mathrm{P} & (n_{\mathrm{obs}}) \end{array}

where the entries are counts, $\mathrm{N}$ = "Negative", $\mathrm{P}$ = "Positive", $\mathrm{T}$ = "True", and $\mathrm{F}$ = "False".

The confusion matrix proper is contained within the solid-outlined box, to which I have added the column sums ($\mathrm{N}$,$\mathrm{P}$), row sums ($\hat{\mathrm{N}}$,$\hat{\mathrm{P}}$), and total sum ($n_{\mathrm{obs}}$ = number of paired observations).
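To make the layout concrete, here is a minimal Python sketch of tabulating the counts with their margins. The helper `confusion_with_margins` and the example labels are my own illustration (not from the question), but the layout matches the table above: rows index $\hat{y}$, columns index $y$.

```python
import numpy as np

def confusion_with_margins(y, y_hat):
    """Return the 2x2 count matrix [[TN, FN], [FP, TP]] plus its margins."""
    counts = np.zeros((2, 2), dtype=int)
    for pred, true in zip(y_hat, y):
        counts[pred, true] += 1       # rows index y_hat, columns index y
    row_sums = counts.sum(axis=1)     # (N_hat, P_hat): totals of predicted 0/1
    col_sums = counts.sum(axis=0)     # (N, P): totals of true 0/1
    return counts, row_sums, col_sums, counts.sum()

# Hypothetical example with n_obs = 7 pairs: TN = 3, FN = 1, FP = 1, TP = 2
y     = [0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 0, 1, 1, 1]
counts, (N_hat, P_hat), (N, P), n_obs = confusion_with_margins(y, y_hat)
```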

The confusion matrix is essentially an empirical estimate of the joint distribution of $\hat{y}$ and $y$, i.e. when the entries are normalized by $n_{\mathrm{obs}}$ we get

\begin{array}{c|c:c|c} & y=0 & y=1 & \\ \hline \hat{y}=0 & p[\sim\!\hat{y},\sim\!y] & p[\sim\!\hat{y},\phantom{\sim\!}y] & p[\sim\!\hat{y}] \\ \hdashline \hat{y}=1 & p[\phantom{\sim}\,\hat{y},\sim\!y] & p[\phantom{\sim}\,\hat{y},\phantom{\sim\!}y] & p[\phantom{\sim}\,\hat{y}] \\ \hline & p[\phantom{\sim\hat{y}}\sim\!y] & p[\phantom{\sim\hat{y},,}\,y] & (1) \end{array} where I have switched to a Boolean-style notation with $\sim$ = "not".

In the margins of the table (outside the box), the normalized row and column sums are now the marginal probabilities.
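Continuing the sketch above (same hypothetical counts, not from the question), the normalization step is just a division by $n_{\mathrm{obs}}$, and the marginals fall out as row/column sums of the joint table:

```python
import numpy as np

counts = np.array([[3, 1],     # [TN, FN]
                   [1, 2]])    # [FP, TP]
n_obs = counts.sum()
joint = counts / n_obs         # joint[i, j] estimates p[y_hat = i, y = j]
p_y_hat = joint.sum(axis=1)    # marginals (p[~y_hat], p[y_hat])
p_y     = joint.sum(axis=0)    # marginals (p[~y],     p[y])
assert np.isclose(joint.sum(), 1.0)
```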

Within this framework, many of the standard confusion matrix based metrics correspond directly to the various conditional probabilities of the above joint distribution.


If we condition on $\boldsymbol{y}$ the table becomes \begin{array}{|c:c|} \hline p[\sim\!\hat{y}\mid\sim\!y] & p[\sim\!\hat{y}\mid\phantom{\sim\!}y] \\ \hdashline p[\phantom{\sim}\,\hat{y}\mid\sim\!y] & p[\phantom{\sim}\,\hat{y}\mid\phantom{\sim\!}y] \\ \hline \end{array}

where the entries correspond to the metrics \begin{array}{|c:c|} \hline \text{specificity} & \text{miss rate} \\ \hdashline \text{fall-out} & \text{sensitivity (recall)} \\ \hline \end{array} (Note that these metrics can also be referred to by appending "rate" to the corresponding name from the confusion matrix.)
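A quick sketch of this conditioning step, again with the hypothetical example counts from above: conditioning on $y$ amounts to dividing each column of the joint table by the corresponding class marginal $p[y]$.

```python
import numpy as np

joint = np.array([[3, 1],
                  [1, 2]]) / 7           # [[TN, FN], [FP, TP]] / n_obs
cond_on_y = joint / joint.sum(axis=0)    # divide each column by p[y]
specificity = cond_on_y[0, 0]            # p[~y_hat | ~y] = TN / N
miss_rate   = cond_on_y[0, 1]            # p[~y_hat |  y] = FN / P
fall_out    = cond_on_y[1, 0]            # p[ y_hat | ~y] = FP / N
recall      = cond_on_y[1, 1]            # p[ y_hat |  y] = TP / P  (sensitivity)
```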


Alternatively, if we condition on $\boldsymbol{\hat{y}}$ the table becomes \begin{array}{|c:c|} \hline p[\sim\!y\mid\sim\!\hat{y}] & p[\phantom{\sim\!}y\mid\sim\!\hat{y}] \\ \hdashline p[\sim\!y\mid\phantom{\sim}\hat{y}] & p[\phantom{\sim\!}y\mid\phantom{\sim}\hat{y}] \\ \hline \end{array}

where the entries correspond to the metrics \begin{array}{|c:c|} \hline \text{negative predictive value} & \text{false omission rate}^* \\ \hdashline \text{false discovery rate} & \text{positive predictive value (precision)} \\ \hline \end{array} (*This one was not on Wikipedia except in their "big table". I was curious why it was the only one of these conditional probabilities without a more widely used name.)
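And the analogous sketch for conditioning on $\hat{y}$ (same hypothetical counts): each row of the joint table is divided by the corresponding prediction marginal $p[\hat{y}]$.

```python
import numpy as np

joint = np.array([[3, 1],
                  [1, 2]]) / 7                            # [[TN, FN], [FP, TP]] / n_obs
cond_on_y_hat = joint / joint.sum(axis=1, keepdims=True)  # divide each row by p[y_hat]
npv            = cond_on_y_hat[0, 0]   # p[~y | ~y_hat] = TN / N_hat
false_omission = cond_on_y_hat[0, 1]   # p[ y | ~y_hat] = FN / N_hat
fdr            = cond_on_y_hat[1, 0]   # p[~y |  y_hat] = FP / P_hat
precision      = cond_on_y_hat[1, 1]   # p[ y |  y_hat] = TP / P_hat  (PPV)
```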
