The issue is that you are working with a differential entropy for continuous random variables, which doesn't share all the nice properties of Shannon's entropy for discrete random variables and can behave counter to intuition. In particular, differential entropy can be negative!
The following might help to get a feel for what's going on. First, a little derivation. We have that
\begin{align}
H[X + Y, Y] &= H[X + Y \mid Y] + H[Y] = H[X \mid Y] + H[Y] = H[X, Y], \\
H[X + Y, Y] &= H[Y \mid X + Y] + H[X + Y],
\end{align}
so that,
$$H[X + Y] = H[X, Y] - H[Y \mid X + Y].$$
Since Shannon's entropy is always non-negative, in the discrete case the entropy of $X + Y$ is therefore always less than or equal to the entropy of $(X, Y)$, in line with your intuition. What must be happening in your example is that $H[Y \mid X + Y]$ is negative, which is only possible because it is a differential entropy.
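To make this concrete, here is a small numerical sketch of my own, assuming $X$ and $Y$ are independent zero-mean Gaussians with small variances (the differential entropy of a Gaussian with variance $\sigma^2$ is $\tfrac{1}{2}\log(2\pi e \sigma^2)$ nats):

```python
import numpy as np

# Differential entropy (in nats) of a univariate Gaussian with the given variance.
# It is negative whenever the variance is smaller than 1 / (2 * pi * e).
def gaussian_entropy(var):
    return 0.5 * np.log(2 * np.pi * np.e * var)

var_x, var_y = 0.01, 0.01   # X ~ N(0, 0.01), Y ~ N(0, 0.01), independent

h_joint = gaussian_entropy(var_x) + gaussian_entropy(var_y)  # H[X, Y]
h_sum   = gaussian_entropy(var_x + var_y)                    # H[X + Y]
h_cond  = h_joint - h_sum                                    # H[Y | X + Y]

print(f"H[X, Y]      = {h_joint:.3f} nats")
print(f"H[X + Y]     = {h_sum:.3f} nats   (larger than H[X, Y]!)")
print(f"H[Y | X + Y] = {h_cond:.3f} nats  (negative)")
```

With larger variances (say, unit variance) $H[Y \mid X + Y]$ is positive again and the discrete-style inequality $H[X + Y] \le H[X, Y]$ holds.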
If you want a better-behaved measure for continuous random variables, use relative entropy (the KL divergence).
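Relative entropy stays well-behaved because, unlike differential entropy, it remains non-negative for densities as well:
$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx \;\ge\; 0.$$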
// So, in view of the above, I have a question: is mutual information another name for information gain? //
No. But MI can be expressed in terms of KL divergence (i.e. information gain): http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities
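Explicitly,
$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big),$$
i.e. mutual information is the KL divergence between the joint distribution and the product of the marginals.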
// Next, if maximum entropy implies high information content, then why do we minimize the entropy of the error between the output signal and the desired signal? Is there any proof showing that minimizing the entropy of the error, when used as a fitness function, means that we are getting close to the true estimate of the unknown parameter? //
Not sure if I fully understand the question, but there are proofs that minimizing KL is the only inference process satisfying certain axioms one would deem reasonable for uncertain reasoning. I suggest you read "The Uncertain Reasoner's Companion" by Jeff Paris.
KL divergence and entropy have both been shown to be the only measures of information (relative and absolute, respectively) that satisfy 3 axioms one would reasonably expect. Arthur Hobson proved this for KL in "Concepts in Statistical Mechanics" (a very expensive book), and Shannon proved it for entropy (the proof can be found in many information theory books).
The similarity between these 3 axioms and their proofs should help you see how close the two quantities are in meaning.
I believe it is this strong mathematical and philosophical foundation of information theory that explains why information-theoretic approaches perform so well and generalize like no other.
Best Answer
Based on your phrasing, it seems you are equating thermodynamic entropy with information entropy. The concepts are related, but you have to be careful because they are used differently in the two fields.
Shannon entropy measures unpredictability. You are correct that entropy is maximal when the outcome is the most uncertain. An unbiased coin has maximum entropy (among coins), while a coin that comes up heads with probability 0.9 has less entropy. Contrary to your next statement, however, maximum entropy means maximum information content.
Suppose we flip a coin 20 times. If the coin is unbiased, the sequence might look like this:
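    TTHTHHTHTTHHTHTHHTTH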
If the coin comes up Heads with probability 0.9, it might look more like this:
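    HHHHHHHHHHTHHHHHHTHH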
The second signal contains less information. Suppose we encode it using run length encoding, like this:
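    10H 1T 6H 1T 2H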
which we interpret as "10 heads, then 1 tail, then 6 heads, then a tail, then 2 heads". Compare this to the same encoding method applied to the first signal:
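    2T 1H 1T 2H 1T 1H 2T 2H 1T 1H 1T 2H 2T 1H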
We can't compress the signal from the maximum entropy coin as much, because it contains more information.
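To put numbers on this, here is a quick sketch using the usual binary entropy of a coin with $P(\text{heads}) = p$:

```python
import numpy as np

# Shannon entropy, in bits per flip, of a coin that lands heads with probability p.
def coin_entropy(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(coin_entropy(0.5))  # 1.000 bit/flip  -> 20 flips carry 20 bits; no compression possible
print(coin_entropy(0.9))  # 0.469 bits/flip -> 20 flips carry ~9.4 bits on average
```

Shannon's source coding theorem says we cannot compress below the entropy on average, which is exactly why the biased-coin sequence compresses well and the fair-coin sequence does not.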
As for your specific questions:
A thermodynamic system in equilibrium has maximum entropy in the sense that its microstate is maximally uncertain given its macrostate (e.g. its temperature, pressure, etc). From our perspective as observers, this means that our knowledge of its microstate is less certain than when it was not in equilibrium. But the system in equilibrium contains more information because its microstate is maximally unpredictable. The quantity that has decreased is the mutual information between the macrostate and the microstate. This is the sense in which we "lose (mutual) information" when entropy increases. The loss is relative to the observer.
As long as the process is random, each new symbol adds information to the sequence. The symbols are random variables, so each one has a distribution for which we can calculate entropy. The information content of the sequence is measured with joint entropy.
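For i.i.d. symbols the joint entropy is additive, $H(X_1, \ldots, X_n) = n\,H(X_1)$, so every new symbol adds the same amount of information. A quick sketch with the biased coin from above:

```python
import numpy as np

def coin_entropy(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Joint entropy of n i.i.d. flips of the p = 0.9 coin: grows by ~0.469 bits per flip.
for n in (1, 10, 20):
    print(n, n * coin_entropy(0.9))
```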