It depends on what you want to show and on the type of the variable:
- categorical variable - it's fine
- discrete but ordinal variable - it's a bit tricky
  - e.g. on a 1-5 scale, having the same probabilities concentrated on 1 and 5 is something quite different from having them concentrated on 3 and 4
- continuous variable - it's even more tricky
  - the previous argument still applies
  - the choice of coordinates matters (good coordinates are ones respecting the symmetries of the problem, and they do not always exist)
  - the entropy depends on the bin size (halving the bin width adds roughly one bit); see the sketch just below this list
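To make the bin-size point concrete, here is a minimal sketch (the standard normal samples and the particular bin counts are purely illustrative assumptions): discretizing the same continuous data with finer bins systematically increases the plug-in entropy, by roughly one extra bit per halving of the bin width.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)  # continuous data (illustrative)

def histogram_entropy(data, bins):
    """Shannon entropy (in bits) of the empirical bin probabilities."""
    counts, _ = np.histogram(data, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Doubling the number of bins (halving the bin width) adds roughly one bit:
for bins in (16, 32, 64):
    print(bins, round(histogram_entropy(samples, bins), 2))
```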
So I will mostly focus on the categorical case.
A typical quantity you can use is the Kullback-Leibler divergence, which measures how different your probability distribution $Y$ is from some initial (reference) distribution $X$:
$$
D_{KL}(Y||X) = \sum_x P(Y=x) \log \left(\frac{P(Y=x)}{P(X=x)} \right)
$$
It can be interpreted as information gain: having expected the probability distribution $X$, how much information you gained when you actually measured the distribution $Y$. If $X$ is uniform, then the KL divergence is just the entropy of $X$ minus the entropy of $Y$.
As an example, when you expect a coin to be fair, $X=(\tfrac{1}{2}, \tfrac{1}{2})$, then toss it and observe heads with certainty, $Y=(1,0)$, you learn exactly one bit of information.
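A minimal numeric check of this coin example, assuming we measure in bits (base-2 logarithm) and use the usual convention $0 \log 0 = 0$:

```python
import numpy as np

def kl_divergence(p_y, p_x):
    """D_KL(Y||X) in bits, using the convention 0 * log(0/q) = 0."""
    p_y = np.asarray(p_y, dtype=float)
    p_x = np.asarray(p_x, dtype=float)
    mask = p_y > 0
    return np.sum(p_y[mask] * np.log2(p_y[mask] / p_x[mask]))

x = [0.5, 0.5]   # expected (uninformed) distribution: a fair coin
y = [1.0, 0.0]   # measured distribution: heads, with certainty

print(kl_divergence(y, x))  # 1.0 -- exactly one bit of information gained
```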
When it comes to choosing the "uninformed" probability distribution $X$ - it depends on the problem.
For the discrete case, just take the maximum entropy distribution given your constraints.
If there are no constraints, it is simply the uniform distribution.
For linear constraints (that is, when some averages are fixed) there is a simple recipe to compute such a distribution: it has an exponential form, with parameters (Lagrange multipliers) chosen so that the constraints hold; see the sketch below.
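A sketch of that recipe for a single mean constraint (the 1-5 support and the target average of 3.5 are illustrative assumptions, and `scipy` is used only for the one-dimensional root find): the maximum entropy distribution has the form $p_i \propto e^{\lambda x_i}$, and $\lambda$ is tuned until the constraint holds.

```python
import numpy as np
from scipy.optimize import brentq

values = np.arange(1, 6)   # support of a 1-5 scale (illustrative)
target_mean = 3.5          # the fixed average (linear constraint)

def maxent_dist(lam):
    """Exponential-family form p_i proportional to exp(lam * value_i)."""
    w = np.exp(lam * values)
    return w / w.sum()

def mean_gap(lam):
    """How far the candidate distribution's mean is from the constraint."""
    return maxent_dist(lam) @ values - target_mean

# Solve for the Lagrange multiplier that matches the constraint.
lam = brentq(mean_gap, -10, 10)
p = maxent_dist(lam)
print(np.round(p, 3), p @ values)  # the mean equals 3.5 by construction
```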
If there are a few different models, you can compare them by measuring each against the same $X$. The same works for ad hoc assumptions (for example, uniform on some set and zero elsewhere).
If you have to normalize it, divide by the entropy of the uninformed probability distribution $X$.
EDIT:
If you just want to tell how concentrated the distribution is, use the entropy of $Y$ (comparing it to the entropy of $X$). In this case, lower entropy means a more concentrated distribution.
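A small sketch of this concentration check, assuming a uniform uninformed $X$ over five categories (both distributions below are made-up numbers); dividing by the entropy of $X$ gives the normalization mentioned above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

x = np.full(5, 0.2)                          # uninformed: uniform over 5 categories
y = np.array([0.7, 0.1, 0.1, 0.05, 0.05])    # a fairly concentrated distribution

print(entropy(x))               # log2(5), about 2.32 bits
print(entropy(y))               # lower entropy = more concentrated
print(entropy(y) / entropy(x))  # normalized against the uninformed distribution
```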
// So, in view of the above, I have a question: is mutual information another name for information gain? //
No. But mutual information can be expressed in terms of KL divergence (i.e. information gain): http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities
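For completeness, here is a small sketch of that relation (the joint probabilities below are made up): the mutual information of two variables is the KL divergence between their joint distribution and the product of their marginals.

```python
import numpy as np

# Joint distribution of two binary variables (illustrative numbers).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
pa = joint.sum(axis=1, keepdims=True)   # marginal of the first variable
pb = joint.sum(axis=0, keepdims=True)   # marginal of the second variable
indep = pa * pb                         # product of marginals

# Mutual information = KL divergence between the joint and the product of marginals.
mi = np.sum(joint * np.log2(joint / indep))
print(mi)   # positive, because the two variables are dependent
```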
// Next, if max entropy implies high information content, then why do we minimize the entropy of the error between the output signal and the desired signal? Is there any proof which shows that minimizing the entropy of the error, when used as a fitness function, would mean that we are getting close to the true estimate of the unknown parameter? //
Not sure if I fully understand the question, but there are proofs that minimizing KL divergence is the only inference process that satisfies certain axioms one would deem reasonable for uncertain reasoning. I suggest reading "The Uncertain Reasoner's Companion" by Jeff Paris.
KL divergence and entropy have both been shown to be the only measures of information (relative and absolute, respectively) that satisfy 3 axioms one would reasonably expect. Arthur Hobson proved this for KL divergence in "Concepts in Statistical Mechanics" (a very expensive book), and Shannon proved it for entropy (the proof can be found in many information theory books).
The similarity between these 3 axioms and the proofs should hopefully help you understand the similarity in their meaning.
I believe it is this strong mathematical and philosophical foundation of information theory that explains why information-theoretic approaches perform so well and generalize like no other.
Best Answer
Shannon Entropy is a concept related to the distribution of a random variable, not to any particular realization of the r.v. The OP talks about a "non-stationary" signal. This implies that the OP has available a sequence of signals, which can be viewed as a realized sequence of a stochastic process, which is a sequence of random variables.
If the process were (strictly) stationary, then each r.v. would have the same distribution, hence the same entropy, and the specific realization of the process (the data) could be used to form some estimate of this common entropy.
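As a sketch of what such an estimate could look like under the stationarity assumption (the i.i.d. categorical signal below is purely illustrative), one can use the plug-in estimator: count symbol frequencies in the realized sequence and compute the entropy of those empirical frequencies.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stationary signal: i.i.d. draws from a fixed categorical distribution.
signal = rng.choice([0, 1, 2], size=10_000, p=[0.5, 0.3, 0.2])

# Plug-in estimate of the common per-symbol entropy, in bits.
_, counts = np.unique(signal, return_counts=True)
p_hat = counts / counts.sum()
entropy_hat = -np.sum(p_hat * np.log2(p_hat))
print(entropy_hat)   # should be close to the true value of about 1.485 bits
```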
If the stochastic process is not strictly stationary, then each element-random variable of the process may have a different entropy. In that case the theoretical validity of the entropy concept remains - but if non-stationarity is left totally unrestricted, then we do not have a sufficient amount of data to estimate these different entropies.
This is a general issue with non-stationary stochastic processes: it affects attempts to estimate all measures, characteristics, moments, statistics, etc. related to such a process. If we do not somehow restrict the memory and the time-heterogeneity of the process, we won't have enough data to say anything about it.
So any question about Shannon Entropy and non-stationary data should include the assumed restrictions on non-stationarity (assumed based on theory and/or on data assessment), in order to be actually answerable.
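As one illustration of such a restriction (this is only one possible assumption, not the only one), suppose we assume local stationarity: the distribution is allowed to drift, but is treated as roughly constant within a short window, so entropy can be estimated window by window. The drifting binary signal and the window size below are purely illustrative.

```python
import numpy as np

def plug_in_entropy(window):
    """Empirical per-symbol entropy (bits) of one window of the signal."""
    _, counts = np.unique(window, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
# Hypothetical non-stationary signal: the symbol distribution drifts halfway through.
first_half  = rng.choice([0, 1], size=5_000, p=[0.9, 0.1])
second_half = rng.choice([0, 1], size=5_000, p=[0.5, 0.5])
signal = np.concatenate([first_half, second_half])

# Under the assumed local stationarity, estimate entropy window by window.
window_size = 1_000
for start in range(0, signal.size, window_size):
    print(start, round(plug_in_entropy(signal[start:start + window_size]), 3))
```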