It can be difficult to find sources giving precise definitions and good explanations of these concepts. There is one R package at CRAN, woe, with a function woe one can check, and I found this paper, which at least gives precise definitions. So, assume we have a binary response $Y$ and a grouped predictor $x$. As this seems to be used in credit scoring, the binary outcomes are usually called bad and good, which we can encode as 0 and 1. Which is good and which is bad does not matter for the formulas, because they are invariant under switching of the labels. The formulas express a divergence between two distributions: the distribution of $x$-labels among the goods, denoted $g_i/g$, and the distribution of labels among the bads, $b_i/b$ (where $g=\sum_i g_i$ and $b=\sum_i b_i$).
Then we have
$$ \text{woe}_i = \log\left( \frac{g_i/g}{b_i/b} \right)
$$ where $i$ represents the classes defined by $x$. As $\left( \frac{g_i/g}{b_i/b} \right)$ is a ratio of two probabilities, it is a risk ratio (RR). If $\text{woe}_i$ is large and positive, it means that in group $i$ the goods are more frequent than in the full sample (or population, if we have population data); if it is large and negative, the bads are overrepresented. If it is zero, the group has the same distribution as the full sample$^\dagger$.
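For a concrete illustration (numbers invented for this example): if group $i$ contains $30\%$ of all goods but only $10\%$ of all bads, then
$$ \text{woe}_i = \log\left( \frac{0.30}{0.10} \right) = \log 3 \approx 1.10, $$
a large positive value signalling overrepresentation of goods; swapping the two shares gives $\log(1/3) \approx -1.10$.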
Then for information value:
$$ \text{IV} = \sum_i \left( \frac{g_i}{g}-\frac{b_i}{b} \right)\cdot \text{woe}_i
$$
It is not obvious at first glance how to interpret this. It turns out that this is a symmetrized Kullback-Leibler divergence, known as the J-divergence (or Jeffreys divergence). Let us show this. Write $p_i, q_i$ for the two distributions. The Kullback-Leibler divergence (see Intuition on the Kullback–Leibler (KL) Divergence) is given by
$$ \DeclareMathOperator{\KL}{KL}
\KL(p || q)= \sum_i p_i \log\frac{p_i}{q_i}
$$ which is nonnegative, but not symmetric. To symmetrize it, take the sum
\begin{align}
\KL(p || q)+\KL(q || p) &=\sum_i p_i \log\frac{p_i}{q_i}+\sum_i q_i \log\frac{q_i}{p_i}\\[8pt]
&= \sum_i p_i \log\frac{p_i}{q_i} - \sum_i q_i \log\frac{p_i}{q_i}\\[8pt]
&= \sum_i (p_i-q_i) \log\frac{p_i}{q_i}
\end{align}
(where we used that $\log x^{-1} =-\log x$), and with $p_i = g_i/g$ and $q_i = b_i/b$ this can now easily be recognized as the information value $\text{IV}$.
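This identity is easy to check numerically. A minimal R sketch (the two distributions are invented for illustration):

p <- c(0.2, 0.5, 0.3)                   # plays the role of g_i/g (invented)
q <- c(0.4, 0.4, 0.2)                   # plays the role of b_i/b (invented)
KL <- function(p, q) sum(p * log(p/q))  # Kullback-Leibler divergence
IV <- sum((p - q) * log(p/q))           # information value as defined above
all.equal(KL(p, q) + KL(q, p), IV)      # TRUE: IV is the symmetrized KL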
A warning: these concepts seem to be much used in the context of univariate screening of variables for logistic regression models. That is generally not a good idea; for discussion see How come variables with low information values may be statistically significant in a logistic regression?.
A prototype implementation in R to experiment with:
library(tidyverse)

myWoE <- function(data) {  # data: a data frame with columns x (groups) and y (0/1)
    woetab <- data %>%
        group_by(x) %>%
        summarise(total = n(), good = sum(y), bad = sum(1 - y)) %>%
        mutate(gi  = good / sum(good),   # share of all goods in this group
               bi  = bad / sum(bad),     # share of all bads in this group
               woe = log(gi / bi),       # weight of evidence per group
               iv  = (gi - bi) * woe)    # per-group contribution to IV
    woetab
}
Some test data:

test <- data.frame(x = rep(1:5, each = 10),
                   y = rep(rep(0:1, each = 5), 5))  # very uninformative: every group is half good, half bad
test2 <- data.frame(x = rep(1:5, each = 20),
                    y = rbinom(5 * 20, size = 1,
                               prob = rep(seq(from = 1, to = 9, length.out = 5)/10, each = 20)))  # more informative: the good rate varies from 0.1 to 0.9 across groups
Then run and compare the outputs (not included here):
library(woe)
myWoE(test)
woe::woe(test, "x", FALSE, "y", Bad=0, Good=1, C_Bin=5)
myWoE(test2)
woe::woe(test2, "x", FALSE, "y", Bad=0, Good=1, C_Bin=5)
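Note that myWoE returns the per-group contributions in the iv column; the total information value is their sum. A short usage sketch:

myWoE(test)  %>% summarise(IV = sum(iv))   # exactly 0: every group is half good, half bad
myWoE(test2) %>% summarise(IV = sum(iv))   # positive, since the good rate varies across groups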
$\dagger$: This definition of weight of evidence differs from the one used in information theory, used for instance in this classical book by IJ Good and discussed by CS Peirce in this classic 1878 paper. There is some discussion of that here.
Best Answer
I do not think that you can get more intuitive about it than saying once again what it does: it returns $1$ for something that interests you, and $0$ for all other cases.
So if you want to count blue-eyed people, you can use an indicator function that returns one for each blue-eyed person and zero otherwise, and sum the outcomes of the function.
As for probability defined in terms of expectation and the indicator function: if you divide the count (i.e. the sum of ones) by the total number of cases, you get a probability. Peter Whittle, in his books Probability and Probability via Expectation, writes a lot about defining probability like this, and even considers such usage of the expected value and the indicator function one of the most basic aspects of probability theory.
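In R this construction is a one-liner; a small sketch with an invented sample:

eyes <- c("blue", "brown", "blue", "green", "blue")  # invented data
ind  <- as.integer(eyes == "blue")  # indicator: 1 for blue-eyed, 0 otherwise
sum(ind)                            # the count of blue-eyed people: 3
mean(ind)                           # the empirical probability: 0.6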
As for your question in the comment:
Well, yes it is! In fact, in statistics we use indicator functions to create new random variables, e.g. imagine that you have a normally distributed random variable $X$; then you may create a new random variable using an indicator function, say
$$ I_{2<X<3} = \begin{cases} 1 & \text{if} \quad 2 < X < 3 \\ 0 & \text{otherwise} \end{cases} $$
or you may create a new random variable using two Bernoulli distributed random variables $A, B$:
$$ I_{A\ne B} = \begin{cases} 0 & \text{if} \quad A = B, \\ 1 & \text{if} \quad A \ne B \end{cases} $$
...of course, you could just as well use any other function to create new random variables. An indicator function is helpful if you want to focus on some specific event and signal when it happens.
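Both constructions above are easy to simulate. A minimal R sketch (the sample size and the Bernoulli probability $0.5$ are chosen arbitrarily):

set.seed(1)                       # for reproducibility
X  <- rnorm(1e5)                  # a normally distributed random variable
I1 <- as.integer(X > 2 & X < 3)   # the indicator I_{2<X<3}
mean(I1)                          # estimates P(2 < X < 3)
pnorm(3) - pnorm(2)               # the exact value, about 0.0214

A  <- rbinom(1e5, size = 1, prob = 0.5)
B  <- rbinom(1e5, size = 1, prob = 0.5)
I2 <- as.integer(A != B)          # the indicator I_{A != B}
mean(I2)                          # estimates P(A != B), which is 0.5 here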
For a physical indicator function, imagine that you marked one of the faces of a six-sided die with red paint, so you can now count red and non-red outcomes. It is no less random than the die itself, yet it is a new random variable that defines outcomes differently.
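That die is also easy to simulate (a sketch, arbitrarily painting face 6 red):

rolls <- sample(1:6, 1000, replace = TRUE)  # a fair six-sided die
red   <- as.integer(rolls == 6)             # indicator: 1 when the red face shows
mean(red)                                   # close to 1/6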
You may also be interested in reading about the Dirac delta, which is used in probability and statistics as a continuous counterpart to the indicator function.
See also: Why 0 for failure and 1 for success in a Bernoulli distribution?