Heuristically, the probability density function on $\{x_1, x_2, ..., x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge about $\{x_1, x_2, ..., x_n\}$, in other words, the uniform distribution.
Now, for a more formal proof consider the following:
A probability density function on $\{x_1, x_2, ..., x_n\}$ is a set of nonnegative real numbers $p_1,...,p_n$ that add up to 1. Entropy is a continuous function of the $n$-tuples $(p_1,...,p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where entropy is maximized. We want to show this occurs at $(1/n,...,1/n)$ and nowhere else.
Suppose the $p_j$ are not all equal; say $p_1 < p_2$ (clearly $n\neq 1$). We will find a new probability density with higher entropy. Since entropy attains its maximum at some $n$-tuple, and any $n$-tuple with unequal entries can be improved, it follows that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.
Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 -\varepsilon$. The entropy of $\{p_1 + \varepsilon, p_2 -\varepsilon,p_3,...,p_n\}$ minus the entropy of $\{p_1,p_2,p_3,...,p_n\}$ equals
$$-p_1\log\left(\frac{p_1+\varepsilon}{p_1}\right)-\varepsilon\log(p_1+\varepsilon)-p_2\log\left(\frac{p_2-\varepsilon}{p_2}\right)+\varepsilon\log(p_2-\varepsilon)$$
To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the above expression as
$$-p_1\log\left(1+\frac{\varepsilon}{p_1}\right)-\varepsilon\left(\log p_1+\log\left(1+\frac{\varepsilon}{p_1}\right)\right)-p_2\log\left(1-\frac{\varepsilon}{p_2}\right)+\varepsilon\left(\log p_2+\log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$
Recalling that $\log(1 + x) = x + O(x^2)$ for small $x$, the above expression equals
$$-\varepsilon-\varepsilon\log p_1 + \varepsilon + \varepsilon \log p_2 + O(\varepsilon^2) = \varepsilon\log(p_2/p_1) + O(\varepsilon^2)$$
which is positive when $\varepsilon$ is small enough since $p_1 < p_2$.
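(Not part of the proof, but a quick numerical sanity check of the perturbation argument; the distribution below is made up for illustration.)

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.1, 0.4, 0.3, 0.2])  # p_1 < p_2, so not uniform
for eps in (0.1, 0.05, 0.01):
    q = p.copy()
    q[0] += eps                      # move mass epsilon from p_2 to p_1
    q[1] -= eps
    gain = entropy(q) - entropy(p)
    print(eps, gain, eps * np.log(p[1] / p[0]))  # gain ~ eps*log(p2/p1) for small eps
```

The printed gain stays positive and approaches $\varepsilon\log(p_2/p_1)$ as $\varepsilon$ shrinks.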
A less rigorous proof is the following:
Consider first the following Lemma:
Let $p(x)$ and $q(x)$ be continuous probability density functions on an interval
$I$ in the real numbers, with $p\geq 0$ and $q > 0$ on $I$. We have
$$-\int_I p\log p dx\leq -\int_I p\log q dx$$
if both integrals exist. Moreover, there is equality if and only if $p(x) = q(x)$ for all $x$.
Now, let $p$ be any probability density function on $\{x_1,...,x_n\}$, with $p_i = p(x_i)$. (The discrete analogue of the lemma, with sums in place of integrals, holds by the same argument and is known as Gibbs' inequality.) Letting $q_i = 1/n$ for all $i$,
$$-\sum_{i=1}^n p_i\log q_i = \sum_{i=1}^n p_i \log n=\log n$$
which is the entropy of $q$. Therefore our Lemma says $h(p)\leq h(q)$, with equality if and only if $p$ is uniform.
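Here is a small numerical illustration of the lemma in its discrete form, using random distributions; it is only a sanity check, not a proof.

```python
import numpy as np

def cross_entropy(p, q):
    """-sum_i p_i log q_i in nats; entropy(p) is the special case q = p."""
    return -np.sum(p * np.log(q))

rng = np.random.default_rng(1)
n = 6
q = np.full(n, 1.0 / n)                    # the uniform reference distribution
for _ in range(3):
    p = rng.dirichlet(np.ones(n))          # a random distribution on n points
    print(cross_entropy(p, p), "<=", cross_entropy(p, q), "=", np.log(n))
```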
Wikipedia also has a brief discussion of this.
It depends on what you want to show and on what kind of variable you have:
- categorical variable - it's fine
- discrete but ordinal variable - it's a bit tricky
  - e.g. on a 1-5 scale, having the same probabilities on 1 and 5 is something different from having them on 3 and 4
- continuous variable - it's even more tricky
  - the previous point still applies
  - the choice of coordinates matters (good coordinates are ones that respect the symmetries of the problem, and they do not always exist)
  - changing the bin size changes the entropy, roughly by the logarithm of the bin-width ratio (see the sketch after this list)
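A minimal sketch of the bin-size effect, assuming a standard normal variable and two hypothetical bin widths; halving the bin width raises the discretized entropy by about one bit.

```python
import numpy as np

def binned_entropy(samples, bin_width):
    """Entropy (in bits) of the samples after discretizing into equal-width bins."""
    edges = np.arange(samples.min(), samples.max() + bin_width, bin_width)
    counts, _ = np.histogram(samples, bins=edges)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

x = np.random.default_rng(0).normal(size=100_000)
h_coarse = binned_entropy(x, 0.5)
h_fine = binned_entropy(x, 0.25)            # bins half as wide
print(h_coarse, h_fine, h_fine - h_coarse)  # difference is close to log2(2) = 1 bit
```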
So, I will mostly focus on the categorical variant.
A typical quantity you can use is the Kullback-Leibler divergence, which measures how different your probability distribution $Y$ is from some initial distribution $X$.
$$
D_{KL}(Y||X) = \sum_x P(Y=x) \log \left(\frac{P(Y=x)}{P(X=x)} \right)
$$
It can be interpreted as information gain: how much information you gain when you measure the distribution $Y$ while expecting the distribution $X$. If $X$ is uniform, then the KL divergence is just the entropy of $X$ minus the entropy of $Y$.
As an example, when you expect a coin to be fair $X=(\tfrac{1}{2}, \tfrac{1}{2})$, you toss it and get heads (and you are sure) $Y=(1,0)$, you learn exactly one bit of information.
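A minimal sketch of that computation (the helper name and example vectors are mine, not from the original answer):

```python
import numpy as np

def kl_divergence(p_y, p_x):
    """D_KL(Y || X) in bits; terms with P(Y=x) = 0 contribute nothing."""
    p_y, p_x = np.asarray(p_y, dtype=float), np.asarray(p_x, dtype=float)
    mask = p_y > 0
    return np.sum(p_y[mask] * np.log2(p_y[mask] / p_x[mask]))

x = [0.5, 0.5]   # expected: a fair coin
y = [1.0, 0.0]   # measured: certainly heads
print(kl_divergence(y, x))   # 1.0 bit, matching the coin example above
```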
When it comes to choosing an "uninformed" probability distribution, it depends on the problem.
For discrete case, just take maximum entropy distribution given the constraints.
If there are no constraints, it is simply uniform probability.
For linear constraints (that is, when some averages are fixed) there is a simple recipe for computing such a distribution: the result has an exponential form, $p_i \propto \exp\left(\sum_j \lambda_j f_j(x_i)\right)$, with the multipliers chosen so that the constraints hold (see the sketch below).
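As a concrete illustration of that recipe (the die values and the target mean are hypothetical, in the spirit of Jaynes' dice example): with a single fixed average, the maximum-entropy distribution is $p_i \propto e^{\lambda x_i}$, and $\lambda$ can be found by a one-dimensional root search.

```python
import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7)      # values of a die
target_mean = 4.5        # the fixed average (linear constraint)

def mean_given(lam):
    """Mean of the maximum-entropy distribution p_i proportional to exp(lam * x_i)."""
    w = np.exp(lam * x)
    return (w / w.sum()) @ x

lam = brentq(lambda l: mean_given(l) - target_mean, -5.0, 5.0)  # solve the constraint
p = np.exp(lam * x)
p /= p.sum()
print(lam, p, p @ x)     # p is the constrained maxent distribution, with mean 4.5
```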
If there are a few different models, you can compare them by measuring each against the same $X$. The same works for some ad hoc assumptions (for example, uniform on some set and zero elsewhere).
If you have to normalize it, divide by the entropy of the uninformed probability distribution $X$.
EDIT:
If you just want to tell how concentrated the distribution is, use the entropy of $Y$ (comparing it to the entropy of $X$). In this case, lower means more concentrated.
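A small sketch of both suggestions (the distribution $Y$ below is made up): the normalized divergence and the plain entropy comparison.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

x = np.full(4, 0.25)                      # uninformed (uniform) distribution X
y = np.array([0.7, 0.2, 0.05, 0.05])      # measured distribution Y

kl = np.sum(y * np.log2(y / x))           # D_KL(Y || X); all y_i > 0 here
print("KL(Y||X):", kl)
print("normalized by H(X):", kl / entropy(x))
print("H(Y) =", entropy(y), "vs H(X) =", entropy(x))  # lower H(Y) = more concentrated
```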
Best Answer
Using Lagrange multipliers, we form the Lagrangian:
$$\mathcal{L} = -\sum_i^k p_i \log p_i - \lambda\left ( \sum_i^k p_i - 1 \right )$$
Setting the derivative with respect to each probability $p_i$ to zero,
$$\frac{\partial \mathcal{L}}{\partial p_i} = 0 = -\log p_i - 1 - \lambda \implies $$
$$p_i = e^{-(1+\lambda)}\tag{1}$$
Setting the derivative with respect to $\lambda$ to zero recovers the normalization constraint:
$$\frac{\partial \mathcal{L}}{\partial \lambda} = 0 = - \sum_i^k p_i + 1 \implies$$
$$ \sum_i^k p_i = 1 \tag{2}$$
Substituting equation (1) into equation (2):
$$\sum_i^k e^{-(1+\lambda)} = 1 \implies$$
$$k e^{-(1+\lambda)} = 1 $$
Since $p_i = e^{-(1+\lambda)}$, it follows that
$$p_i = \frac{1}{k}$$
The Shannon Entropy formula now becomes
$$ H = - \sum_i^k \frac{1}{k}\log \frac{1}{k}$$
Since the summand does not depend on $i$, the sum is just $k$ copies of the same term, so
$$H = \frac{k}{k} \log k = \log k$$
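As a cross-check (not part of the derivation), one can maximize the entropy over the probability simplex numerically and confirm that the optimizer is the uniform distribution with value $\log k$; the setup below is just a sketch using SciPy's SLSQP solver.

```python
import numpy as np
from scipy.optimize import minimize

k = 5

def neg_entropy(p):
    """Negative Shannon entropy in nats (clipped to avoid log(0))."""
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

res = minimize(
    neg_entropy,
    x0=np.random.default_rng(0).dirichlet(np.ones(k)),   # random feasible start
    bounds=[(0.0, 1.0)] * k,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
print(res.x)                 # close to the uniform distribution 1/k = 0.2
print(-res.fun, np.log(k))   # maximal entropy equals log k
```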