From a technical point of view, here is the argument:
For densities (but the argument is analogous in the discrete case), we write
$$ \pi \left( \theta |y\right) =\frac{f\left( y|\theta \right) \pi \left(\theta \right) }{f(y)}
$$
The normalizing constant is obtained by writing the marginal density of $y$ as a joint density, and then writing the joint as a conditional times a marginal with $\theta$ integrated out:
\begin{align*}
f(y)&=\int f\left( y,\theta \right) d\theta\\
&=\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta
\end{align*}
This constant ensures that the posterior integrates to 1, because
\begin{align*}
\int \pi \left( \theta |y\right) d\theta&=\int\frac{f\left( y|\theta \right) \pi \left(\theta \right) }{\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta}d\theta\\ &=\frac{\int f\left( y|\theta \right) \pi \left(\theta \right) d\theta}{\int f\left( y|\theta \right) \pi \left(\theta \right)d\theta}\\
&=1,
\end{align*}
where we can "take out" the integral in the denominator because, with $\theta$ already integrated out, the denominator is a constant with respect to $\theta$.
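The normalization step above can be sketched numerically. Below is a minimal illustration, assuming made-up data (7 heads in 10 coin tosses) and a uniform prior on a Bernoulli parameter $\theta$; the grid, data, and prior are all invented for the example:

```python
import numpy as np

# Posterior for a Bernoulli probability theta on a grid,
# with invented data (7 heads in 10 tosses) and a uniform prior.
theta = np.linspace(0.001, 0.999, 999)   # grid over (0, 1)
prior = np.ones_like(theta)              # pi(theta): uniform
lik = theta**7 * (1 - theta)**3          # f(y|theta)

dtheta = theta[1] - theta[0]
f_y = np.sum(lik * prior) * dtheta       # f(y) = integral of f(y|theta) pi(theta)
posterior = lik * prior / f_y            # pi(theta|y)

# Dividing by f(y) guarantees the posterior integrates to 1:
print(np.sum(posterior) * dtheta)
```

The final print shows the posterior summing to 1 on the grid, which is exactly the role of the normalizing constant in the derivation.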
As @glen_b pointed out, likelihood is not an inverse probability, as $\theta$ is not a random variable. However, you are correct in that it is a measure of evidential support. One caveat is that, unlike probability, it is not an absolute measure of support (a likelihood of 1, 10, or 1000 has no intrinsic meaning), but a relative measure of support. Generally, this is encoded by forming the likelihood ratio (LR):
$$ LR(\theta):= \frac{L(\theta;x)}{L(\hat{\theta}_{MLE};x)}$$
This ratio always lies between 0 and 1. That is an improvement over the unnormalized likelihood, but we still aren't quite there. It turns out that, for example, $LR=0.15$ is not by itself a useful measure either, since its interpretation depends on the dimension of $\theta$. If $\theta$ is a scalar, then asymptotically $P(LR<0.15) \xrightarrow{n} 0.05$, so it can be used in a probabilistic framework in much the same way as any other test statistic.
However, it can also be used as a purely subjective measure of how "plausible" a parameter value is given the data (read: evidence). Under this non-probabilistic interpretation, we would call any scalar $\theta$ with $LR<0.15$ "implausible" or "unlikely". Now, what if we wanted to port this same subjective assessment to a vector parameter, say $(\theta_1,\theta_2)$? Unfortunately, we cannot keep using $0.15$ as our cutoff for "unlikely" (well, of course you can, but then your inferences in the higher dimension will not be compatible with inferences in the lower dimension; this is a subtle point, and a good article on it was written by one of the strongest proponents of likelihood inference, JK Lindsey). Essentially, compatible inference is achieved by raising the scalar cutoff to the power of the dimension of the vector parameter. For example, if our parameter has dimension 2, then the cutoff compatible with $0.15$ is $0.15^2$.
The above is a very abridged description of modern likelihood. I think your confusion is shown by the following statement you made:
Given that Graham is using an umbrella, there is a 20% chance that it is raining.
This is actually not what a 20% likelihood would tell you. What you stated above is a Bayesian posterior probability, $P(\textrm{Raining}|\textrm{Umbrella})$; what the likelihood is saying is quite the opposite:
$$L(\textrm{Raining}|\textrm{Umbrella}) = P(\textrm{Umbrella}|\textrm{Raining})$$
As you correctly pointed out, a prior probability (and a normalizing constant) is required to turn a likelihood into a probability.
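The umbrella example can be made concrete with toy numbers. All three probabilities below are assumptions for illustration, not values from the original question:

```python
# Toy numbers (assumed purely for illustration):
p_umbrella_given_rain = 0.20   # P(Umbrella|Raining) = L(Raining|Umbrella)
p_umbrella_given_dry = 0.05    # P(Umbrella|not Raining)
p_rain = 0.30                  # prior P(Raining)

# The likelihood of "Raining" is just P(Umbrella|Raining) = 0.20.
# Turning it into a posterior needs the prior and the normalizing constant:
p_umbrella = (p_umbrella_given_rain * p_rain
              + p_umbrella_given_dry * (1 - p_rain))
p_rain_given_umbrella = p_umbrella_given_rain * p_rain / p_umbrella
print(p_rain_given_umbrella)
```

With these numbers the posterior $P(\textrm{Raining}|\textrm{Umbrella})$ comes out to roughly 0.63, quite different from the 0.20 likelihood, which illustrates why the two must not be conflated.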
Best Answer
The answer depends on whether you are dealing with discrete or continuous random variables. So, I will split my answer accordingly. I will assume that you want some technical details and not necessarily an explanation in plain English.
Discrete Random Variables
Suppose that you have a stochastic process that takes discrete values (e.g., the outcomes of tossing a coin 10 times, or the number of customers who arrive at a store in 10 minutes). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., that the probability of the coin landing heads is $p$ and that the coin tosses are independent).
Denote the observed outcomes by $O$ and the set of parameters that describe the stochastic process as $\theta$. Thus, when we speak of probability we want to calculate $P(O|\theta)$. In other words, given specific values for $\theta$, $P(O|\theta)$ is the probability that we would observe the outcomes represented by $O$.
However, when we model a real life stochastic process, we often do not know $\theta$. We simply observe $O$ and the goal then is to arrive at an estimate for $\theta$ that would be a plausible choice given the observed outcomes $O$. We know that given a value of $\theta$ the probability of observing $O$ is $P(O|\theta)$. Thus, a 'natural' estimation process is to choose that value of $\theta$ that would maximize the probability that we would actually observe $O$. In other words, we find the parameter values $\theta$ that maximize the following function:
$L(\theta|O) = P(O|\theta)$
$L(\theta|O)$ is called the likelihood function. Notice that by definition the likelihood function is conditioned on the observed $O$ and that it is a function of the unknown parameters $\theta$.
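The coin-tossing example above can be sketched directly. Assuming invented data of 7 heads in 10 independent tosses, the likelihood $L(p|O) = P(O|p)$ is a function of the unknown $p$, and maximizing it recovers the intuitive estimate $\hat{p} = 7/10$:

```python
import numpy as np
from math import comb

# Invented data: 7 heads observed in n = 10 independent tosses.
heads, n = 7, 10

def likelihood(p):
    # L(p|O) = P(O|p) = C(n, heads) * p^heads * (1-p)^(n-heads)
    return comb(n, heads) * p**heads * (1 - p) ** (n - heads)

# Maximize over a grid of candidate values for p:
p_grid = np.linspace(0.001, 0.999, 999)
p_mle = p_grid[np.argmax(likelihood(p_grid))]
print(p_mle)   # the maximizer is heads/n = 0.7
```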
Continuous Random Variables
In the continuous case the situation is similar with one important difference. We can no longer talk about the probability that we observed $O$ given $\theta$ because in the continuous case $P(O|\theta) = 0$. Without getting into technicalities, the basic idea is as follows:
Denote the probability density function (pdf) associated with the outcomes $O$ as: $f(O|\theta)$. Thus, in the continuous case we estimate $\theta$ given observed outcomes $O$ by maximizing the following function:
$L(\theta|O) = f(O|\theta)$
In this situation, we can no longer technically claim to be finding the parameter value that maximizes the probability of observing $O$; instead, we find the value that maximizes the probability density associated with the observed outcomes $O$.
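A minimal sketch of the continuous case, with an invented normal sample (the data, known $\sigma$, seed, and grid are all assumptions): here $L(\mu|O) = f(O|\mu)$ is the product of normal densities, and its maximizer coincides with the sample mean:

```python
import numpy as np

# Invented data: 200 draws from a normal with known sigma = 2.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)

def log_density(mu, sigma=2.0):
    # log f(O|mu): sum of log N(x_i; mu, sigma^2) terms
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Maximize the log of the density over a grid of candidate means:
mu_grid = np.linspace(3.0, 7.0, 4001)
mu_hat = mu_grid[np.argmax([log_density(m) for m in mu_grid])]
print(mu_hat, x.mean())   # the two agree to grid precision
```

Working with the log of the density is the standard trick here: it turns the product of densities into a sum, which is numerically far more stable, and it has the same maximizer.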