The entropy of a message is a measurement of how much information it carries.
One way of putting this (per your textbook) is that a message has high entropy if each word (message sequence) carries a lot of information. Another way is to say that if we don't receive the message, we lose a lot of information; i.e., entropy is a measure of how many different things the message could have said. All of these definitions are consistent and, in a sense, the same.
To your first question: the entropy of English is about two bits per letter, whereas a Hindi letter apparently carries about $3$ bits.
The question this measurement answers is essentially the following: take a random sentence in English or Hindi and delete a random letter. On average, how many plausible letters could fill that blank? In English, roughly $2^2=4$; in Hindi, roughly $2^3=8$.
EDIT: the simplest way to explain these measurements is that it would take, on average, $2$ yes/no questions to deduce a missing English letter and $3$ yes/no questions to deduce a missing Hindi letter. Equivalently, on average about twice as many Hindi letters ($2^3=8$) as English letters ($2^2=4$) could plausibly fill a randomly deleted spot in a passage. See also Chris's comment below for another perspective.
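If you want to play with these numbers, here is a minimal Python sketch (my own illustration, not from any source above) that estimates per-letter entropy from single-letter frequencies alone. Because it ignores the context supplied by neighbouring letters, this zeroth-order estimate will come out noticeably higher than the roughly-two-bit figure quoted above; the point is just to see how $2^H$ turns an entropy into an "effective number of candidate letters".

```python
from collections import Counter
from math import log2

def per_letter_entropy(text):
    """Estimate per-letter entropy (bits) from single-letter frequencies only."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * log2(n / total) for n in counts.values())

sample = "the entropy of a message is a measure of how much information it carries"
H = per_letter_entropy(sample)
print(f"estimated entropy: {H:.2f} bits per letter")
print(f"effective number of candidate letters: {2 ** H:.1f}")  # 2^H, as in the EDIT above
```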
For a good discussion of this stuff in the context of language, I recommend taking a look at this page.
As for (2), I don't think I can answer that satisfactorily.
As for (3), there's a lot to be done along the same lines as language. Just as we measure the entropy per word, we could measure the entropy per musical phrase or per base pair. This could give us a way of measuring the importance of damaged or missing DNA, or the number of musically appealing ways to end a symphony. An interesting question to ask about music is: will we ever run out? (video).
Password strength comes down to the following question: how many passwords does a hacker have to guess before he can expect to break in? This is very much answerable via entropy.
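To make that concrete, here is a small sketch (the alphabet size and length are hypothetical, purely for illustration) of how entropy translates into an expected number of guesses for a password chosen uniformly at random:

```python
from math import log2

def password_entropy_bits(alphabet_size, length):
    """Entropy in bits of a password chosen uniformly at random."""
    return length * log2(alphabet_size)

# Hypothetical example: 8 characters drawn from lowercase letters + digits (36 symbols).
bits = password_entropy_bits(36, 8)
expected_guesses = 2 ** (bits - 1)  # on average, about half the space must be searched
print(f"{bits:.1f} bits of entropy; roughly {expected_guesses:.2e} guesses expected")
```

Every extra bit of entropy doubles the expected number of guesses.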
I hope that helps.
Denoting $X=(X_1, X_2, \ldots, X_N)$ and similarly for $Y$, note that, since the channel is memoryless (and the input bits are independent across uses),
$$
\mathbb{P}(X=(x_1,x_2,\ldots,x_N)|Y=(y_1,y_2,\ldots,y_N))=\prod_{i=1}^N \mathbb{P}(X_i=x_i|Y_i=y_i).
$$
From fundamental properties of the entropy, it follows that
$$
\begin{align}
H(X|Y)&=\sum_{i=1}^NH(X_i|Y_i)\\
&=N\,H(X_1|Y_1).
\end{align}
$$
You have probably seen the first equality in the unconditional case; you may want to work out why it also holds for conditional entropy here. The second equality holds because the channel treats each bit the same, so every term in the sum is equal. Now you need to show that $H(X_1|Y_1)=h(p_e)$.
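If it helps, here is a quick numerical sanity check of that last step (assuming, as in the usual version of this exercise, a binary symmetric channel with crossover probability $p_e$ and a uniformly random input bit):

```python
from math import log2

def h(p):
    """Binary entropy function, in bits."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def H_X_given_Y(p_e):
    """H(X_1 | Y_1) for one use of a binary symmetric channel with
    crossover probability p_e, assuming a uniformly random input bit."""
    # Joint distribution P(X_1 = x, Y_1 = y).
    joint = {(x, y): 0.5 * (p_e if x != y else 1 - p_e)
             for x in (0, 1) for y in (0, 1)}
    # Marginal P(Y_1 = y).
    p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    # H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), with P(x|y) = P(x,y)/P(y).
    return -sum(p * log2(p / p_y[y]) for (x, y), p in joint.items())

p_e = 0.11
print(H_X_given_Y(p_e), h(p_e))  # the two values agree
```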
The concept of mutual information seems to capture exactly what you are looking for.
Specifically, you would be looking at $I(X; X+Y)$.
Here I am reading your question as "how much information does observing $X+Y$ give me about $X$". If you meant "how much information remains to be learned about $X$" (i.e., how much entropy remains), then you may want to look at the conditional entropy instead (the two are very much related).