There is no such proof; it's an intuitive statement. A model typically predicts training samples better than test samples, because it learns from the training data, whereas the test data is something the model has never seen. It is still possible for the test error to be lower than the training error, especially when the samples are small.
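A quick simulation can make both points concrete. This is only a sketch, assuming a tiny linear-regression setup in Python/NumPy with arbitrary sample sizes and noise level: on average the test MSE exceeds the training MSE, yet a nontrivial fraction of individual runs show the opposite.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_vs_test_mse(n_train=10, n_test=10, n_sims=2000):
    """Fit ordinary least squares on a small training set and compare
    training MSE with test MSE across many simulated data sets."""
    gaps = []
    for _ in range(n_sims):
        x_tr = rng.normal(size=n_train)
        x_te = rng.normal(size=n_test)
        y_tr = 2.0 * x_tr + rng.normal(size=n_train)      # true slope 2, unit noise
        y_te = 2.0 * x_te + rng.normal(size=n_test)
        slope, intercept = np.polyfit(x_tr, y_tr, deg=1)  # least-squares line
        mse_tr = np.mean((y_tr - (slope * x_tr + intercept)) ** 2)
        mse_te = np.mean((y_te - (slope * x_te + intercept)) ** 2)
        gaps.append(mse_te - mse_tr)
    gaps = np.array(gaps)
    print("mean(test MSE - train MSE):", gaps.mean())               # typically positive
    print("share of runs with test MSE < train MSE:", (gaps < 0).mean())

train_vs_test_mse()
```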
Let the data be $\mathbf{x}=(x_1, \ldots, x_n)$. Write $F(\mathbf{x})$ for the empirical distribution. By definition, for any function $f$,
$$\mathbb{E}_{F(\mathbf{x})}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(x_i).$$
Let the model $M$ have density $e^{f(x)}$ where $f$ is defined on the support of the model. The cross-entropy of $F(\mathbf{x})$ and $M$ is defined to be
$$H(F(\mathbf{x}), M) = -\mathbb{E}_{F(\mathbf{x})}[\log(e^{f(X)})] = -\mathbb{E}_{F(\mathbf{x})}[f(X)] = -\frac{1}{n}\sum_{i=1}^n f(x_i).\tag{1}$$
Assuming $\mathbf{x}$ is a simple random sample, its negative log likelihood is
$$-\log(L(\mathbf{x}))=-\log \prod_{i=1}^n e^{f(x_i)} = -\sum_{i=1}^n f(x_i)\tag{2}$$
by virtue of the properties of logarithms (they convert products to sums).
Expression $(2)$ is a constant $n$ times expression $(1)$. Because loss functions are used in statistics only to compare models, it makes no difference that one is a (positive) constant times the other. It is in this sense that the negative log likelihood "is a" cross-entropy in the quotation.
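A numerical sanity check of this relationship (a sketch in Python; taking a standard normal as the model $M$, so $f(x) = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}x^2$, is just an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)        # the data x_1, ..., x_n
n = len(x)

def f(x):
    """Log density of the model M; here a standard normal, an arbitrary choice."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * x ** 2

cross_entropy = -np.mean(f(x))  # expression (1): -E_{F(x)}[f(X)]
neg_log_lik = -np.sum(f(x))     # expression (2): -sum_i f(x_i)

print(np.isclose(neg_log_lik, n * cross_entropy))  # True: (2) = n * (1)
```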
It takes a bit more imagination to justify the second assertion of the quotation. The connection with squared error is clear, because for a "Gaussian model" that predicts values $p(x)$ at points $x$, the value of $f$ at any such point is
$$f(x; p, \sigma) = -\frac{1}{2}\left(\log(2\pi \sigma^2) + \frac{(x-p(x))^2}{\sigma^2}\right),$$
which, up to sign, is the squared error $(x-p(x))^2$ rescaled by $1/(2\sigma^2)$ and shifted by a function of $\sigma$. One way to make the quotation correct is to assume it does not consider $\sigma$ part of the "model"--$\sigma$ must be determined somehow independently of the data. In that case differences between mean squared errors are proportional to differences between cross-entropies or log-likelihoods, thereby making all three equivalent for model fitting purposes.
(Ordinarily, though, $\sigma = \sigma(x)$ is fit as part of the modeling process, in which case the quotation would not be quite correct.)
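To see the equivalence numerically, here is a sketch (the data, the two candidate sets of predictions, and the fixed $\sigma$ are all made up for illustration). With $\sigma$ held fixed, the difference in negative log likelihood between two sets of predictions equals their difference in mean squared error times $n/(2\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.5                   # sigma is fixed, NOT estimated from the data
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=sigma, size=n)

def neg_log_lik(pred, y, sigma):
    """Negative Gaussian log likelihood with known sigma."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + (y - pred) ** 2 / sigma ** 2)

def mse(pred, y):
    return np.mean((y - pred) ** 2)

pred_a = 2.5 * x                     # two candidate sets of predictions p(x)
pred_b = 3.2 * x

# Differences in NLL are proportional to differences in MSE, factor n / (2 sigma^2)
lhs = neg_log_lik(pred_a, y, sigma) - neg_log_lik(pred_b, y, sigma)
rhs = (n / (2 * sigma ** 2)) * (mse(pred_a, y) - mse(pred_b, y))
print(np.isclose(lhs, rhs))          # True
```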
Here's my understanding of this quote. This is sort of a hand-wavy argument, but still gives some intuition. Let's consider a simple linear layer:
$$y = Wx + b$$
... or equivalently:
$$y_i = x_{1}W_{i,1} + \ldots + x_{n}W_{i,n} + b_i$$
If we focus on a single weight $W_{i,j}$, its value is determined by observing only the two variables $(x_j, y_i)$. If the training data has $N$ rows, there are only $N$ pairs $(x_j, y_i)$ from which $W_{i,j}$ has to learn its correct value. That is a lot of flexibility, and it is what the authors' remark about the weights captures.
In other words, the number of training rows $N$ must be quite large to capture the correct slope without regularization. On the other hand, $b_i$ affects only $y_i$, which means its value can be estimated more reliably from the same number of examples $N$; this is the authors' point about the biases.
In the end, we would like to regularize the parameters that have "more freedom", which is why regularizing $W_{i,j}$ makes more sense than regularizing $b_i$.
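A minimal sketch of how this plays out in code (assuming a single linear layer trained with plain NumPy gradient descent; the L2 coefficient `lam` and the other numbers are arbitrary): the penalty term involves $W$ only, so its gradient is added to the update for $W$ while the bias update is left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 5, 3
X = rng.normal(size=(n, d_in))
Y = rng.normal(size=(n, d_out))

W = rng.normal(scale=0.1, size=(d_in, d_out))
b = np.zeros(d_out)
lr, lam = 0.1, 1e-2                     # learning rate and L2 strength (arbitrary)

for _ in range(500):
    pred = X @ W + b                    # row-vector convention for the linear layer
    err = pred - Y                      # gradient of 0.5 * mean squared error
    grad_W = X.T @ err / n + lam * W    # data term plus L2 penalty on W only
    grad_b = err.mean(axis=0)           # no penalty term for the bias
    W -= lr * grad_W
    b -= lr * grad_b
```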