Heuristically, the probability density function on $\{x_1, x_2, \ldots, x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge of $\{x_1, x_2, \ldots, x_n\}$, in other words the uniform distribution.
Now, for a more formal proof consider the following:
A probability density function on $\{x_1, x_2, \ldots, x_n\}$ is a set of nonnegative real numbers $p_1, \ldots, p_n$ that add up to 1. Entropy is a continuous function of the $n$-tuples $(p_1, \ldots, p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where entropy is maximized. We want to show this occurs at $(1/n, \ldots, 1/n)$ and nowhere else.
Suppose the $p_j$ are not all equal, say $p_1 < p_2$. (Clearly $n\neq 1$.) We will find a new probability density with higher entropy. It then follows, since entropy is maximized at
some $n$-tuple, that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.
Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 -\varepsilon$. The entropy of $\{p_1 + \varepsilon, p_2 -\varepsilon,p_3,...,p_n\}$ minus the entropy of $\{p_1,p_2,p_3,...,p_n\}$ equals
$$-p_1\log\left(\frac{p_1+\varepsilon}{p_1}\right)-\varepsilon\log(p_1+\varepsilon)-p_2\log\left(\frac{p_2-\varepsilon}{p_2}\right)+\varepsilon\log(p_2-\varepsilon)$$
To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the expression above as
$$-p_1\log\left(1+\frac{\varepsilon}{p_1}\right)-\varepsilon\left(\log p_1+\log\left(1+\frac{\varepsilon}{p_1}\right)\right)-p_2\log\left(1-\frac{\varepsilon}{p_2}\right)+\varepsilon\left(\log p_2+\log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$
Recalling that $\log(1 + x) = x + O(x^2)$ for small $x$, the expression above equals
$$-\varepsilon-\varepsilon\log p_1 + \varepsilon + \varepsilon \log p_2 + O(\varepsilon^2) = \varepsilon\log(p_2/p_1) + O(\varepsilon^2)$$
which is positive when $\varepsilon$ is small enough since $p_1 < p_2$.
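As a quick numerical sanity check (not part of the proof), the following Python sketch verifies that shifting a little mass from the larger probability to the smaller one raises the entropy, and that the increase is roughly $\varepsilon\log(p_2/p_1)$. The particular distribution and value of $\varepsilon$ are arbitrary choices for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = np.array([0.1, 0.4, 0.2, 0.3])   # arbitrary example with p_1 < p_2
eps = 1e-3                            # small positive perturbation

q = p.copy()
q[0] += eps                           # p_1 + eps
q[1] -= eps                           # p_2 - eps

delta = entropy(q) - entropy(p)
print(delta)                          # positive
print(eps * np.log(p[1] / p[0]))      # ~ eps * log(p_2 / p_1), close to delta
```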
A less rigorous proof is the following:
Consider first the following Lemma:
Let $p(x)$ and $q(x)$ be continuous probability density functions on an interval
$I$ in the real numbers, with $p\geq 0$ and $q > 0$ on $I$. We have
$$-\int_I p\log p dx\leq -\int_I p\log q dx$$
if both integrals exist. Moreover, there is equality if and only if $p(x) = q(x)$ for all $x$.
Now, let $p$ be any probability density function on $\{x_1,...,x_n\}$, with $p_i = p(x_i)$. Letting $q_i = 1/n$ for all $i$,
$$-\sum_{i=1}^n p_i\log q_i = \sum_{i=1}^n p_i \log n=\log n$$
which is the entropy of $q$. The (discrete analogue of the) Lemma then gives $h(p) = -\sum_{i=1}^n p_i\log p_i \leq -\sum_{i=1}^n p_i\log q_i = \log n = h(q)$, with equality if and only if $p = q$, that is, if and only if $p$ is uniform.
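The same inequality is easy to check numerically. The snippet below is only an illustration (a random distribution on $n = 5$ points, not a proof): it compares $-\sum_i p_i\log p_i$ with $-\sum_i p_i\log q_i = \log n$ for the uniform $q$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
p = rng.random(n)
p /= p.sum()                 # an arbitrary distribution on n points
q = np.full(n, 1.0 / n)      # the uniform distribution

h_p = -np.sum(p * np.log(p))       # entropy of p
cross = -np.sum(p * np.log(q))     # cross-entropy of p against q, equals log n
print(h_p, cross, np.log(n))       # h_p <= cross == log n
```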
Wikipedia also has a brief discussion of this topic.
Why can't we just compute the standard deviation?
Here's why. Let's compare the formulas for entropy and variance:
- $H(X) = - \sum\limits_x p(x) \, \log p(x) = - \mathbb E \, [ \log p(X) ]$
- $\text{var} (X) = \mathbb E \, \Big[(X - \mathbb E[X])^2 \Big]$
Note that entropy does not depend on the values that $X$ may take, only on the distribution itself, while variance does depend on those values. Moreover, variance requires the variable to be numeric, which is not the case for entropy. Both of these properties make entropy a good candidate for calculating information gain.
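To make the contrast concrete, here is a small illustration (the probabilities and value labellings are arbitrary choices): relabeling the values of $X$ changes the variance but leaves the entropy untouched, since entropy only sees the probabilities.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; depends only on the probabilities p."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def variance(values, p):
    """Variance of a discrete variable taking `values` with probabilities p."""
    mean = np.sum(p * values)
    return np.sum(p * (values - mean) ** 2)

probs = np.array([0.5, 0.3, 0.2])
values_a = np.array([1.0, 2.0, 3.0])      # one labelling of the outcomes
values_b = np.array([10.0, 200.0, -5.0])  # a different labelling, same probabilities

print(entropy(probs))                 # the same regardless of labelling
print(variance(values_a, probs))      # changes with the values
print(variance(values_b, probs))
```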
To get more insights into entropy and other information-theoretic measures, you may read this question on math.SE.
Best Answer
Not quite. That sum is actually over the different states of the variables involved. For example, if both $X$ and $Y$ are binary variables, then, since the sum runs over the three variables $(x_{n+1}, x_n, y_n)$, it will have $2^3 = 8$ terms. In general, if $X$ and $Y$ each have $k$ possible states, the summation will have $k^3$ terms.
The interpretation of the sum over states is not that different from the usual sum over states when calculating entropy or mutual information. The term corresponding to $(x_{n+1}, x_n, y_n)$ represents how much information the particular state $y_n$ provides about the particular future state $x_{n+1}$ of $X$ when the past state of $X$ is $x_n$. For more information, you can read Joe Lizier's work on local information measures.
The source of your confusion could be that if you want to estimate transfer entropy from a given time series, then you (typically) have to loop through the whole time series to estimate $p(x_{n+1}, x_n, y_n)$, usually by counting occurrences of all $(x_{n+1}, x_n, y_n)$ tuples. Then, once you have estimated your PDF, you have to go back to the $t(x|y)$ formula and calculate those $k^3$ terms I mentioned above.
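As a rough sketch of what that estimation loop might look like, the code below uses a plug-in counting estimator for binary series with history length 1; the function name `transfer_entropy` and the toy data are my choices for illustration, not a specific library API.

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate (in nats) of transfer entropy from y to x, history length 1."""
    # Count occurrences of (x_{n+1}, x_n, y_n) tuples along the series.
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))
    total = sum(triples.values())
    p_xyz = {k: v / total for k, v in triples.items()}

    # Marginals needed for the transfer-entropy formula.
    p_xx, p_xy, p_x = Counter(), Counter(), Counter()
    for (x1, x0, y0), p in p_xyz.items():
        p_xx[(x1, x0)] += p
        p_xy[(x0, y0)] += p
        p_x[x0] += p

    # Sum over all observed (x_{n+1}, x_n, y_n) states.
    te = 0.0
    for (x1, x0, y0), p in p_xyz.items():
        te += p * np.log((p * p_x[x0]) / (p_xx[(x1, x0)] * p_xy[(x0, y0)]))
    return te

# Toy example: y is random, and x partially copies the previous value of y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)
x = np.where(rng.random(10_000) < 0.7, np.roll(y, 1), rng.integers(0, 2, size=10_000))
print(transfer_entropy(x, y))   # noticeably positive: y's past informs x's future
print(transfer_entropy(y, x))   # close to zero: x's past says nothing about y's future
```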