Maybe this simple example will help. I use it when I teach
conditional expectation.
(1) The first step is to think of ${\mathbb E}(X)$ in a new way:
as the best estimate for the value of a random variable $X$ in the absence of any information.
To minimize the squared error
$${\mathbb E}[(X-e)^2]={\mathbb E}[X^2-2eX+e^2]={\mathbb E}(X^2)-2e{\mathbb E}(X)+e^2,$$
we differentiate with respect to $e$ to obtain $2e-2{\mathbb E}(X)$, which vanishes at $e={\mathbb E}(X)$; since the second derivative is $2>0$, this is indeed a minimum.
For example, if I throw a fair die and you have to
estimate its value $X$, according to the analysis above, your best bet is to guess ${\mathbb E}(X)=3.5$.
On specific rolls of the die, this will be an over-estimate or an under-estimate, but in the long run it minimizes the mean square error.
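To see this numerically, here is a small Python sketch (not part of the original argument; it just evaluates the exact mean squared error for a fair die over a grid of guesses):

```python
# Exact mean squared error E[(X - e)^2] for a fair six-sided die,
# as a function of the guess e.
def mse(e):
    values = [1, 2, 3, 4, 5, 6]
    return sum((x - e) ** 2 for x in values) / len(values)

# Among the guesses 1.0, 1.1, ..., 6.0, the minimizer is e = E(X) = 3.5.
best = min((e / 10 for e in range(10, 61)), key=mse)
print(best)  # 3.5
```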
(2) What happens if you do have additional information?
Suppose that I tell you that $X$ is an even number.
How should you modify your estimate to take this new information into account?
The mental process may go something like this: "Hmmm, the possible values were $\lbrace 1,2,3,4,5,6\rbrace$
but we have eliminated $1,3$ and $5$, so the remaining possibilities are $\lbrace 2,4,6\rbrace$.
Since I have no other information, they should be considered equally likely and hence the revised expectation is $(2+4+6)/3=4$".
Similarly, if I were to tell you that $X$ is odd, your revised (conditional) expectation is 3.
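The same mental process can be written as a one-line computation (a Python sketch, assuming the surviving values are equally likely, exactly as argued above):

```python
values = [1, 2, 3, 4, 5, 6]

def cond_exp(predicate):
    # Average of the equally likely values that survive the information.
    survivors = [x for x in values if predicate(x)]
    return sum(survivors) / len(survivors)

print(cond_exp(lambda x: x % 2 == 0))  # 4.0  (given "X is even")
print(cond_exp(lambda x: x % 2 == 1))  # 3.0  (given "X is odd")
```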
(3) Now imagine that I will roll the die and I will tell you the parity of $X$; that is, I will
tell you whether the die comes up odd or even. You should now see that a single numerical response
cannot cover both cases. You would respond "3" if I tell you "$X$ is odd", while you would respond "4" if I tell you "$X$ is even".
A single numerical response is not enough because the particular piece of information that I will give you is itself random.
In fact, your response is necessarily a function of this particular piece of information.
Mathematically, this is reflected in the requirement that ${\mathbb E}(X\ |\ {\cal F})$ must be ${\cal F}$-measurable.
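Concretely, ${\mathbb E}(X\ |\ {\cal F})$ is itself a random variable: a function of the outcome that depends only on its parity. A small Python sketch (my illustration, not part of the original answer) tabulates it and checks that averaging it recovers ${\mathbb E}(X)=3.5$:

```python
# E(X | parity) as a random variable: it assigns to each outcome omega the
# revised estimate, and it depends on omega only through its parity.
values = [1, 2, 3, 4, 5, 6]

def cond_exp_given_parity(omega):
    same_parity = [x for x in values if x % 2 == omega % 2]
    return sum(same_parity) / len(same_parity)

table = {omega: cond_exp_given_parity(omega) for omega in values}
print(table)  # {1: 3.0, 2: 4.0, 3: 3.0, 4: 4.0, 5: 3.0, 6: 4.0}

# Averaging the conditional expectation over all outcomes recovers E(X) = 3.5.
print(sum(table.values()) / len(table))  # 3.5
```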
I think this covers point 1 in your question, and tells you why a single real number is not sufficient.
Also concerning point 2, you are correct in saying that the role of $\cal F$ in ${\mathbb E}(X\ |\ {\cal F})$
is not a single piece of information, but rather tells what possible specific pieces of (random) information may occur.
As the name implies, an indicator random variable indicates something: the value of $I_A$ is $1$ precisely when the event $A$ occurs, and is $0$ when $A$ does not occur (that is, $A^c$ occurs). Think of $I_A$ as a Boolean variable that indicates the occurrence of the event $A$. This Boolean variable has value $1$ with probability $P(A)$ and so its average value is $P(A)$. In terms of long-term frequencies, $I_A$ will have value $1$ on roughly $N\cdot P(A)$ of $N$ trials of the experiment, and the long-term average value of $I_A$ on these $N$ trials will be approximately $P(A)$.
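The long-term frequency claim is easy to check by simulation. Here is a Python sketch (the event $A$ = "the die shows an even number", so $P(A)=1/2$, is my choice of example):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Indicator of the event A = "the die shows an even number"; P(A) = 1/2.
N = 100_000
indicators = [1 if random.randint(1, 6) % 2 == 0 else 0 for _ in range(N)]

# The long-run average of I_A approximates P(A).
print(sum(indicators) / N)  # close to 0.5
```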