Here is what I think of as the standard explanation. I learned this from The Art of Computer Programming, Section 4.5.3.
Start with some number $z$ whose continued fraction we want to find. I'll review the continued fraction algorithm: Let $a_1 = \lfloor z \rfloor$ and let $r_1 = x-a_1$. Then let $a_2 = \lfloor r_1^{-1} \rfloor$ and let $r_2 = r_1^{-1} - a_2$. Continue in this way to define $r_k$ and $a_k$. The $a_k$ are the coefficients of the continued fraction. Our first task will be to understand the $r_k$.
If $z$ is chosen randomly, then $r_1$ is equally likely to be anywhere in $[0,1)$. But the same is not true of $r_2$. We have $r_2 \in [0,x]$ if and only if $r_1 \in \bigcup_{k=1}^{\infty} [(k+x)^{-1}, k^{-1}]$, so the probability that $r_2$ is in $[0,x]$ is $\sum_{k=1}^{\infty} \left( \tfrac{1}{k} - \tfrac{1}{k+x} \right)$. More generally, let $F_n(x)$ be the probability that $r_n \in [0,x]$, then
$$F_{n+1}(x) = \sum_{k=1}^{\infty} \left( F_n(\tfrac{1}{k}) - F_n(\tfrac{1}{k+x} ) \vphantom{\sum} \right).$$
If we imagine that $\lim_{n\to \infty} F_n(x)$ will exist, then the limiting function $F$ should obey
$$F(x) = \sum_{k=1}^{\infty} \left( F(\tfrac{1}{k}) - F(\tfrac{1}{k+x} ) \vphantom{\sum} \right).$$
One function that obeys this property is $\log (1+x)$. Let's verify this:
$$ \sum_{k=1}^{N} \left( \log \frac{k+1}{k} - \log \frac{k+x+1}{k+x} \right)
\sum_{k=1}^{N} \left( \log (k+1) - \log k - \log (k+x+1) + \log (1+x) \right).$$
This telescope to
$$\log (N+1) - \log 1 - \log (N+x+1) + \log(1+x) = \log \frac{N+1}{N+x+1} + \log (1+x).$$
Sending $N \to \infty$, we get $\log (1+x)$.
This argument was neutral as to what the base of the logarithm was but, since we want $F(1)=1$, we should take log base two.
Of course, I am not going to prove that the limit exists or that it takes this value -- this is a hard theorem! But let's see why it implies the rest. The probability that $a_k = b$ is the probability that $r_{k-1} \in (\tfrac{1}{b+1}, \tfrac{1}{b}]$, or $F_{k-1}(\tfrac{1}{b}) - F_{k-1}(\tfrac{1}{b+1})$. If $F_k$ approaches $\log_2 (1+x)$, then the probability that $a_k$ is $b$ approaches $\log_2 \tfrac{b+1}{b} - \log_2 \tfrac{b+2}{b+1} = \log_2 \tfrac{(b+1)^2}{b(b+2)}$. Note that this is $\approx \tfrac{1}{b^2}$.
Intuitively then, if $g$ is a function from $\mathbb{Z}_{>0}$ to $\mathbb{R}$, we would expect the average value of $g(a_k)$ to be $\sum_{b=1}^{\infty} g(b) \log_2 \tfrac{(b+1)^2}{b(b+2)}$. It wouldn't be a good idea to take $g(x) = x$, since then the sum would diverge. But, if $g(x)=\log x$, then the sum converges to the log of the Khinchin constant, as one would hope for.
There is a subtle switch in the last paragraph. Before the last paragraph, $z$ was a random variable, and I looked at how the probability distribution of $a_k$ behaved as $k \to \infty$. In the last paragraph, and in Khinchin's theorem, I took a single $z$ and asked about the average behavior of $a_k$ for that $z$. This is a hard issue, but ergodic theory has standard tools to justify it.
Best Answer
Here almost all means except on a set of measure zero; see here, for instance. All countable sets have measure zero, but not all measure zero sets are countable; the middle-thirds Cantor set is a well-known example of a set of measure zero whose cardinality is equal to that of $\Bbb R$.