It can be difficult to find sources giving precise definitions and good explanations of these concepts. There is one R package at CRAN, woe, with a function woe one can check, and I found this paper, which at least gives precise definitions. So, assume we have a binary response $Y$ and a grouped predictor $x$. As this seems to be used in credit scoring, the binary outcomes are usually called bad and good, which we can encode as 0 and 1. Which is good and which is bad does not matter for the formulas, because they are invariant under switching of the labels. The formulas express a divergence between two distributions: the distribution of $x$-labels among the goods, denoted $g_i/g$, and the distribution of labels among the bads, $b_i/b$ (where $g=\sum_i g_i$ and $b=\sum_i b_i$).
Then we have
$$ \text{woe}_i = \log\left( \frac{g_i/g}{b_i/b} \right)
$$ where $i$ represents the classes defined by $x$. As $\left( \frac{g_i/g}{b_i/b} \right)$ is a ratio of two probabilities, it is a risk ratio (RR). If $\text{woe}_i$ is large and positive, it means that in group $i$ the goods are more frequent than in the full sample (or population, if we have population data); if it is large and negative, the bads are overrepresented. If it is zero, the group has the same distribution as the full sample$^\dagger$.
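For a concrete illustration (numbers invented for this example): if group $i$ contains $30\%$ of all goods but only $10\%$ of all bads, then
$$ \text{woe}_i = \log\left( \frac{0.30}{0.10} \right) = \log 3 \approx 1.10, $$
a large positive value signalling overrepresentation of goods; swapping the two shares gives $\log(1/3) \approx -1.10$.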
Then for information value:
$$ \text{IV} = \sum_i \left( \frac{g_i}{g}-\frac{b_i}{b} \right)\cdot \text{woe}_i
$$
It is not obvious at first glance how to interpret this. It turns out that this is a symmetrized Kullback-Leibler divergence, known as the J-divergence (or Jeffreys divergence). Let us show this. Write $p_i, q_i$ for the two distributions. The Kullback-Leibler divergence (see Intuition on the Kullback–Leibler (KL) Divergence) is given by
$$ \DeclareMathOperator{\KL}{KL}
\KL(p || q)= \sum_i p_i \log\frac{p_i}{q_i}
$$ which is nonnegative, but not symmetric. To symmetrize it, take the sum
\begin{align}
\KL(p || q)+\KL(q || p) &=\sum_i p_i \log\frac{p_i}{q_i}+\sum_i q_i \log\frac{q_i}{p_i}\\[8pt]
&= \sum_i p_i \log\frac{p_i}{q_i} - \sum_i q_i \log\frac{p_i}{q_i}\\[8pt]
&= \sum_i (p_i-q_i) \log\frac{p_i}{q_i}
\end{align}
(where we used that $\log x^{-1} =-\log x$), and with $p_i = g_i/g$ and $q_i = b_i/b$ this can now easily be recognized as the information value $\text{IV}$.
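This identity is easy to check numerically. A minimal R sketch (the two distributions are invented for illustration):

p <- c(0.2, 0.5, 0.3)                   # plays the role of g_i/g (invented)
q <- c(0.4, 0.4, 0.2)                   # plays the role of b_i/b (invented)
KL <- function(p, q) sum(p * log(p/q))  # Kullback-Leibler divergence
IV <- sum((p - q) * log(p/q))           # information value as defined above
all.equal(KL(p, q) + KL(q, p), IV)      # TRUE: IV is the symmetrized KL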
A warning: these concepts seem to be much used in the context of univariate screening of variables for logistic regression models. That is generally not a good idea; for discussion see How come variables with low information values may be statistically significant in a logistic regression?.
A prototype implementation in R to experiment with:
library(tidyverse)

myWoE <- function(data) {  # data: a data frame with columns x (groups) and y (0/1)
    woetab <- data %>%
        group_by(x) %>%
        summarise(total = n(), good = sum(y), bad = sum(1 - y)) %>%
        mutate(gi  = good / sum(good),   # share of all goods in this group
               bi  = bad / sum(bad),     # share of all bads in this group
               woe = log(gi / bi),       # weight of evidence per group
               iv  = (gi - bi) * woe)    # per-group contribution to IV
    woetab
}
Some test data:

test <- data.frame(x = rep(1:5, each = 10),
                   y = rep(rep(0:1, each = 5), 5))  # very uninformative: every group is half good, half bad
test2 <- data.frame(x = rep(1:5, each = 20),
                    y = rbinom(5 * 20, size = 1,
                               prob = rep(seq(from = 1, to = 9, length.out = 5)/10, each = 20)))  # more informative: the good rate varies from 0.1 to 0.9 across groups
Then run and compare the outputs (not included here):
library(woe)
myWoE(test)
woe::woe(test, "x", FALSE, "y", Bad=0, Good=1, C_Bin=5)
myWoE(test2)
woe::woe(test2, "x", FALSE, "y", Bad=0, Good=1, C_Bin=5)
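Note that myWoE returns the per-group contributions in the iv column; the total information value is their sum. A short usage sketch:

myWoE(test)  %>% summarise(IV = sum(iv))   # exactly 0: every group is half good, half bad
myWoE(test2) %>% summarise(IV = sum(iv))   # positive, since the good rate varies across groups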
$\dagger$: This definition of weight of evidence differs from the one used in information theory, used for instance in this classical book by IJ Good and discussed by CS Peirce in this classic 1878 paper. There is some discussion of that here.
Best Answer
I do not think that you can get more intuitive about it than saying once again what it does: it returns $1$ for something that interests you, and $0$ for all other cases.
So if you want to count blue-eyed people, you can use an indicator function that returns one for each blue-eyed person and zero otherwise, and sum the outcomes of the function.
As for probability defined in terms of expectation and the indicator function: if you divide the count (i.e. the sum of ones) by the total number of cases, you get a probability. Peter Whittle, in his books Probability and Probability via Expectation, writes a lot about defining probability like this, and even considers such usage of the expected value and the indicator function one of the most basic aspects of probability theory.
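In R this construction is a one-liner; a small sketch with an invented sample:

eyes <- c("blue", "brown", "blue", "green", "blue")  # invented data
ind  <- as.integer(eyes == "blue")  # indicator: 1 for blue-eyed, 0 otherwise
sum(ind)                            # the count of blue-eyed people: 3
mean(ind)                           # the empirical probability: 0.6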
As for your question in the comment:
Well, yes it is! In fact, in statistics we use indicator functions to create new random variables, e.g. imagine that you have a normally distributed random variable $X$; then you may create a new random variable using an indicator function, say
$$ I_{2<X<3} = \begin{cases} 1 & \text{if} \quad 2 < X < 3 \\ 0 & \text{otherwise} \end{cases} $$
or you may create a new random variable using two Bernoulli distributed random variables $A, B$:
$$ I_{A\ne B} = \begin{cases} 0 & \text{if} \quad A = B, \\ 1 & \text{if} \quad A \ne B \end{cases} $$
...of course, you could just as well use any other function to create new random variables. An indicator function is helpful if you want to focus on some specific event and signal when it happens.
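Both constructions above are easy to simulate. A minimal R sketch (the sample size and the Bernoulli probability $0.5$ are chosen arbitrarily):

set.seed(1)                       # for reproducibility
X  <- rnorm(1e5)                  # a normally distributed random variable
I1 <- as.integer(X > 2 & X < 3)   # the indicator I_{2<X<3}
mean(I1)                          # estimates P(2 < X < 3)
pnorm(3) - pnorm(2)               # the exact value, about 0.0214

A  <- rbinom(1e5, size = 1, prob = 0.5)
B  <- rbinom(1e5, size = 1, prob = 0.5)
I2 <- as.integer(A != B)          # the indicator I_{A != B}
mean(I2)                          # estimates P(A != B), which is 0.5 here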
For a physical indicator function, imagine that you marked one of the faces of a six-sided die with red paint, so you can now count red and non-red outcomes. It is no less random than the die itself, yet it is a new random variable that defines outcomes differently.
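That die is also easy to simulate (a sketch, arbitrarily painting face 6 red):

rolls <- sample(1:6, 1000, replace = TRUE)  # a fair six-sided die
red   <- as.integer(rolls == 6)             # indicator: 1 when the red face shows
mean(red)                                   # close to 1/6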
You may also be interested in reading about the Dirac delta, which is used in probability and statistics as a continuous counterpart to the indicator function.
See also: Why 0 for failure and 1 for success in a Bernoulli distribution?