As noted by whuber, your question was answered to a great extent in his answer to another question on a similar topic.
1) Can we think of $\widehat{p}(x)$ as a kernel density estimator with
the kernel $K()$ defined as the Dirac function at zero?
If you have some kernel $K$ parametrized by bandwidth $h$, then as $h \to 0$ your kernel becomes the Dirac delta. This is one of the standard intuitions behind the Dirac delta, as also described on Wikipedia.
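As a rough illustration of this limiting behaviour, here is a minimal sketch (a made-up sample, using `scipy.stats.gaussian_kde`) showing how the kernel density estimate piles up around the observed points as the bandwidth shrinks:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample; any small data set illustrates the point.
x = np.array([1.0, 2.0, 2.5, 4.0, 5.5])

grid = np.linspace(0, 7, 701)
for h in [1.0, 0.3, 0.05]:
    # bw_method scales the bandwidth: smaller values give narrower kernels.
    kde = gaussian_kde(x, bw_method=h)
    density = kde(grid)
    print(f"h={h:4}: max density {density.max():.2f} "
          f"(spikes grow as h -> 0, approaching a sum of Dirac deltas)")
```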
2) How can we calculate the mean of Dirac function?
It accumulates all probability mass at $\mu$ and is zero elsewhere, so from the definition of expected value the mean is $\mu$.
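Written out using the sifting property of the Dirac delta (a standard identity, added here only to make the step explicit), the calculation is one line:
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\,\delta(x-\mu)\,dx = \mu.$$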
3) What is the intuition behind the above empirical probability
distribution $\widehat{p}(x)$?
It's a discrete distribution: you assume that your variable can take only the values observed in your sample (i.e. the $x$'s), with probabilities equal to the empirical probabilities. If your data are not discrete, then only the empirical cumulative distribution function makes sense as an approximation of the true distribution. This is why histograms and kernel density estimators were developed, so that we can approximate a continuous density function using empirical data.
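A minimal sketch of the empirical CDF, assuming a plain NumPy implementation rather than any particular library:

```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF: fraction of observations <= t."""
    sample = np.asarray(sample)
    return np.mean(sample[None, :] <= np.atleast_1d(t)[:, None], axis=1)

# Hypothetical sample; each observed point carries probability mass 1/m.
x = np.array([1.0, 2.0, 2.5, 4.0, 5.5])
print(ecdf(x, [0.0, 2.2, 6.0]))  # -> [0.  0.4 1. ]
```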
4) If $x=x^{(i)}$ then $\widehat{p}\left(x=x^{(i)}\right)=+\infty$?
The Dirac delta is not a function and is not infinite at $x^{(i)}$. Saying it is infinite there is just a heuristic meant to provide an intuitive understanding of what it is; if it really were an ordinary function taking an infinite value, it would be useless. If you want to learn more, check the Role of Dirac function in particle filters thread, where this is discussed in greater detail.
If you want an intuitive understanding, think of $m$ distributions, each of which integrates to unity and places its probability mass at one of the $x^{(i)}$'s. You create a mixture of those distributions, each appearing with equal mixing proportion $1/m$. So whatever probability mass sat at each $x^{(i)}$, you first divide it by $m$, and there are $m$ such components (one $\delta$ per point), so in total you end up with the same total mass equal to unity, while the mass attached to each individual $x^{(i)}$ is $m$ times smaller than initially. The probability densities at the $x^{(i)}$'s are huge, since the area under each of the points is infinitesimally small (see Can a probability distribution value exceeding 1 be OK? to learn more about probability densities). It gets abstract and dirty, but the take-away message is that this is just a trick for using calculus with discrete values. If you stay in a discrete-only setting, then each of the $x^{(i)}$ points simply has mass $1/m$, so that in total their masses sum to unity by the axioms of probability.
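In symbols (assuming the empirical density in the question is the usual mixture of Dirac deltas, one per observation), the bookkeeping above reads:
$$\widehat{p}(x) = \frac{1}{m}\sum_{i=1}^{m}\delta\!\left(x - x^{(i)}\right), \qquad \int \widehat{p}(x)\,dx = \frac{1}{m}\sum_{i=1}^{m}\int \delta\!\left(x - x^{(i)}\right)dx = \frac{1}{m}\cdot m = 1.$$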
Let the data be $\mathbf{x}=(x_1, \ldots, x_n)$. Write $F(\mathbf{x})$ for the empirical distribution. By definition, for any function $f$,
$$\mathbb{E}_{F(\mathbf{x})}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(x_i).$$
Let the model $M$ have density $e^{f(x)}$ where $f$ is defined on the support of the model. The cross-entropy of $F(\mathbf{x})$ and $M$ is defined to be
$$H(F(\mathbf{x}), M) = -\mathbb{E}_{F(\mathbf{x})}\left[\log\left(e^{f(X)}\right)\right] = -\mathbb{E}_{F(\mathbf{x})}[f(X)] =-\frac{1}{n}\sum_{i=1}^n f(x_i).\tag{1}$$
Assuming $\mathbf{x}$ is a simple random sample, its negative log likelihood is
$$-\log(L(\mathbf{x}))=-\log \prod_{i=1}^n e^{f(x_i)} = -\sum_{i=1}^n f(x_i)\tag{2}$$
by virtue of the properties of logarithms (they convert products to sums).
Expression $(2)$ is a constant $n$ times expression $(1)$. Because loss functions are used in statistics only by comparing them, it makes no difference that one is a (positive) constant times the other. It is in this sense that the negative log likelihood "is a" cross-entropy in the quotation.
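A quick numeric sanity check of this factor-of-$n$ relationship, using a hypothetical Gaussian model with fixed mean and variance (the specific numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.2, -1.1, 0.7, 1.5, -0.3])   # hypothetical sample
mu, sigma = 0.0, 1.0                         # assumed model parameters

log_density = norm.logpdf(x, loc=mu, scale=sigma)   # f(x_i) = log model density

cross_entropy = -np.mean(log_density)    # expression (1)
neg_log_lik   = -np.sum(log_density)     # expression (2)

# (2) equals n times (1), so the two differ only by a positive constant.
print(np.isclose(neg_log_lik, len(x) * cross_entropy))  # True
```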
It takes a bit more imagination to justify the second assertion of the quotation. The connection with squared error is clear, because for a "Gaussian model" that predicts values $p(x)$ at points $x$, the value of $f$ at any such point is
$$f(x; p, \sigma) = -\frac{1}{2}\left(\log(2\pi \sigma^2) + \frac{(x-p(x))^2}{\sigma^2}\right),$$
which is the squared error $(x-p(x))^2$ but rescaled by $1/(2\sigma^2)$ and shifted by a function of $\sigma$. One way to make the quotation correct is to assume it does not consider $\sigma$ part of the "model"--$\sigma$ must be determined somehow independently of the data. In that case differences between mean squared errors are proportional to differences between cross-entropies or log-likelihoods, thereby making all three equivalent for model fitting purposes.
(Ordinarily, though, $\sigma = \sigma(x)$ is fit as part of the modeling process, in which case the quotation would not be quite correct.)
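To spell out the equivalence under that assumption (with $\sigma$ held fixed, this is just an expansion of the formula above, nothing new): for two candidate predictors $p_1$ and $p_2$,
$$H(F(\mathbf{x}), M_1) - H(F(\mathbf{x}), M_2) = \frac{1}{2\sigma^2}\left[\frac{1}{n}\sum_{i=1}^n \left(x_i - p_1(x_i)\right)^2 - \frac{1}{n}\sum_{i=1}^n \left(x_i - p_2(x_i)\right)^2\right],$$
because the $\tfrac{1}{2}\log(2\pi\sigma^2)$ terms cancel; differences in cross-entropy are therefore proportional to differences in mean squared error.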
Best Answer
There is no such proof; it's just an intuitive observation. A model typically predicts the training samples better than the test samples, because it learns from the training data, while the test data is something the model hasn't seen before. It is possible for the test error to be lower than the training error, especially when the samples are small.
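As an illustration only (a toy linear regression on synthetic data, not anything from the original answer), the usual gap and its occasional reversal on small samples can be seen like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_test_mse(n_train=20, n_test=20):
    """Fit a line on a small synthetic sample and compare errors."""
    x_tr, x_te = rng.uniform(0, 1, n_train), rng.uniform(0, 1, n_test)
    y_tr = 2 * x_tr + rng.normal(0, 0.5, n_train)
    y_te = 2 * x_te + rng.normal(0, 0.5, n_test)
    slope, intercept = np.polyfit(x_tr, y_tr, deg=1)
    mse = lambda x, y: np.mean((y - (slope * x + intercept)) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

results = [train_test_mse() for _ in range(1000)]
reversed_frac = np.mean([test < train for train, test in results])
print(f"test error below training error in {reversed_frac:.0%} of small-sample runs")
```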