The entropy of a message is a measure of how much information it carries.
One way of saying this (per your textbook) is that a message has high entropy if each word (message sequence) carries a lot of information. Another way of putting it: if we don't receive the message, we lose a lot of information; that is, entropy is a measure of the number of different things that message could have said. These definitions are consistent and, in a sense, the same.
To your first question: the entropy of each letter of the English language is about $2$ bits, whereas a Hindi letter apparently carries about $3$.
The question this measurement answers is essentially the following: take a random sentence in English or Hindi and delete a random letter. On average, how many possible letters might we expect to fit in that blank? In English, about $2^2 = 4$; in Hindi, about $2^3 = 8$.
EDIT: the simplest way to explain these measurements is that it would take, on average, $2$ yes/no questions to deduce a missing English letter and $3$ yes/no questions to deduce a missing Hindi letter. Equivalently, there are on "average" $2^3 = 8$ Hindi letters that could fill in a randomly deleted letter in a Hindi passage, twice as many as the $2^2 = 4$ English letters that could fill a blank in an English passage. See also Chris's comment below for another perspective.
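If you want to see where such numbers come from, here is a minimal sketch in Python (the `letter_entropy` function and the toy sample string are my own illustration, not from any reference). It estimates per-letter entropy from single-letter frequencies alone; since it ignores context between letters, it will come out higher than the $\approx 2$-bit figure, which accounts for longer-range structure:

```python
from collections import Counter
from math import log2

def letter_entropy(text):
    """Estimate per-letter entropy (in bits) from single-letter frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return -sum((k / n) * log2(k / n) for k in counts.values())

# Toy sample; a real estimate would need a large corpus.
sample = "the quick brown fox jumps over the lazy dog"
H = letter_entropy(sample)
print(f"H = {H:.2f} bits/letter; effective alphabet ~ 2^H = {2 ** H:.1f} letters")
```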
For a good discussion of this stuff in the context of language, I recommend taking a look at this page.
As for (2), I don't think I can answer that satisfactorily.
As for (3), there's a lot to be done along the same lines as language. Just as we measure the entropy per word, we could measure the entropy per musical phrase or per base pair. This could give us a way of measuring the importance of damaged or missing DNA, or the number of musically appealing ways to end a symphony. An interesting question to ask about music is whether we will ever run out (video).
Password strength comes down to the following question: how many passwords does an attacker have to guess before they can expect to break in? This is very much answerable via entropy.
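As a rough sketch of that calculation (assuming, purely for illustration, that passwords are drawn uniformly from a fixed character pool, which real passwords are not):

```python
from math import log2

# Hypothetical numbers: a 62-character pool (upper, lower, digits) and length 10.
# Uniform random choice makes this an upper bound on real-world password entropy.
pool_size, length = 62, 10
bits = length * log2(pool_size)   # entropy of one random password
print(f"~{bits:.1f} bits, i.e. {pool_size ** length} equally likely passwords")
# On average an attacker must try about half of them before breaking in.
```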
I hope that helps.
Easy illustrative example:
Take a fair coin. $P({\rm each\ result})=1/2$. By independence, $P({\rm each\ result\ in\ }n{\rm\ tosses})=1/2^n$. The surprise in each coin toss is the same, and the surprise in $n$ tosses is $n\times$(surprise in one toss). The $\log$ does the trick: the surprise of an outcome with probability $p$ is $-\log_2 p$, and the entropy is the mean surprise.
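You can check the additivity numerically; a quick sketch:

```python
from math import log2

p = 0.5                    # fair coin
surprise = -log2(p)        # surprise of one toss: 1 bit
n = 10
# Independence: probabilities multiply, so surprises (their logs) add.
print(-log2(p ** n), n * surprise)   # 10.0 10.0
```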
Best Answer
Both definitions are accurate, although the first is more general, since entropy can be defined in many ways. Entropy is generally used as a measure of the uncertainty we have about a particular event, where an uncertain event is one with several different possible outcomes.
If you have an event (or random variable) with $M$ equiprobable outcomes, then $M$ can reasonably be used as a measure of the uncertainty of the event. If you observed the result of an uncertain event and needed to store that result or transmit it to another party, then the entropy measures how efficiently you can do so.
As a simple example, if an event has $M = 10$ equiprobable outcomes, you can allocate each possible outcome a unique digit between $0$ and $9$. After observing the actual outcome, you can send the result to another party by sending just the digit that corresponds to that outcome. You would only need to send $1$ digit per outcome, so the entropy is $1$ digit per outcome, where a digit is an ordinary base-$10$ symbol.
If, on the other hand, you had $M = 20$ equiprobable outcomes, you would need to send $2$ digits per outcome. If you use digits of base $b$, the entropy can be shown to be $\log_b M$ per outcome (in the case of equiprobable outcomes). It is common to use base $2$, in which case the entropy is measured in bits.
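A small sketch making the $\log_b M$ formula concrete (the function name is just for illustration):

```python
from math import log

def entropy_per_outcome(M, b):
    """Entropy of one of M equiprobable outcomes, in base-b digits."""
    return log(M, b)

print(entropy_per_outcome(10, 10))   # 1.0   -> one decimal digit suffices
print(entropy_per_outcome(20, 10))   # ~1.30 -> round up to 2 digits in practice
print(entropy_per_outcome(20, 2))    # ~4.32 bits
```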
In the case of observing some binary random variable $X$ that follows a distribution $p(x)$, it can be shown that if $n$ observations are made (with $n$ very large), then even though $2^n$ different sequences are possible, with very high probability the observed sequence turns out to be one of only $M = 2^{nH}$ roughly equiprobable sequences. These are called typical sequences. It then follows that $\log_2 2^{nH} = nH$ is the entropy of the observed sequence, and $H$ can be shown to equal $-\sum_x p(x) \log_2 p(x)$.
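A minimal sketch of that binary entropy formula, showing how much smaller the typical set $2^{nH}$ is than the full set of $2^n$ sequences (the values $p = 0.1$ and $n = 1000$ are arbitrary choices of mine):

```python
from math import log2

def H(p):
    """Binary entropy -sum p(x) log2 p(x) of a Bernoulli(p) variable, 0 < p < 1."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

p, n = 0.1, 1000
print(f"H = {H(p):.3f} bits per observation")
print(f"possible sequences: 2^{n}; typical sequences: ~2^{n * H(p):.0f}")
```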