You can obtain the explicit equation of $\phi$ for the Gaussian kernel via the Taylor series expansion of $e^x$. For notational simplicity, assume $x\in \mathbb{R}^1$:
$$\phi(x) = e^{-x^2/2\sigma^2} \Big[ 1, \sqrt{\frac{1}{1!\sigma^2}}x,\sqrt{\frac{1}{2!\sigma^4}}x^2,\sqrt{\frac{1}{3!\sigma^6}}x^3,\ldots\Big]^T$$
This is also discussed in more detail in these slides by Chih-Jen Lin of NTU (slide 11 specifically). Note that the slides use $\gamma=\frac{1}{2\sigma^2}$ as the kernel parameter.
The equation in the OP only holds for the linear kernel.
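As a sanity check, the expansion can be verified numerically: truncating $\phi$ after enough terms, the explicit inner product should match the kernel value. A minimal sketch (the truncation length 30 is an arbitrary choice):

```python
import math

def phi(x, sigma=1.0, n_terms=30):
    # Truncated explicit feature map for the 1-D Gaussian kernel:
    # phi_n(x) = exp(-x^2 / (2 sigma^2)) * x^n / sqrt(n! sigma^(2n))
    pref = math.exp(-x**2 / (2 * sigma**2))
    return [pref * x**n / math.sqrt(math.factorial(n) * sigma**(2 * n))
            for n in range(n_terms)]

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-(x - y)**2 / (2 * sigma**2))

x, y = 0.7, -0.3
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = gaussian_kernel(x, y)
print(approx, exact)  # agree to high precision
```

The inner product of the truncated features recovers the kernel value because the neglected tail of the series vanishes factorially fast.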
But what I don't understand is why mapping data into a higher-dimensional space (say an infinite-dimensional one, using the RBF kernel) makes it more suitable for linear PCA in the feature space. Is the performance always guaranteed to improve, as compared to PCA applied directly to nonlinear data?
It might be useful if you're interested in classification with a linear method, or in clustering: in the higher-dimensional space, data might be easier to classify linearly, or clusters might be easier to separate. For a simplified, theoretical version of this idea, you could check out Cover's theorem.
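A toy illustration of that idea (hypothetical data; the extra coordinate $x^2+y^2$ is a hand-picked stand-in for what a kernel map does implicitly): two concentric classes that are not linearly separable in the plane become separable by a single threshold after the lift.

```python
import random

random.seed(0)
# Class 0: points inside a small square around the origin (radius^2 <= 0.5).
inner = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(50)]
# Class 1: points in the annulus 1 < x^2 + y^2 < 4.
outer = []
while len(outer) < 50:
    x, y = random.uniform(-2, 2), random.uniform(-2, 2)
    if 1.0 < x * x + y * y < 4.0:
        outer.append((x, y))

# Lift each point to 3-D by appending the feature x^2 + y^2.  A single
# threshold on that coordinate now separates the classes perfectly.
lifted_inner = [x * x + y * y for x, y in inner]
lifted_outer = [x * x + y * y for x, y in outer]
print(max(lifted_inner) < min(lifted_outer))  # True
```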
I tried to visualize the process, but it is hard for me when I use the RBF kernel.
That's not really surprising, considering that the embedding space is infinite-dimensional.
I don't physically or intuitively see why mapping the data into a higher-dimensional space will always work for PCA purposes (for nonlinear data).
That's because it doesn't have to work well. I remember once trying polynomial kernels with SVMs on a problem that should in theory be solvable by them (text classification; polynomial kernels act somewhat like Boolean expressions on one-hot encoded vectors), but in practice the method was worse than using a linear kernel.
Is there any concrete example or explanation of what actually happens (not just an illustrative plot) and why it helps to use the RBF kernel?
I don't know whether it is possible to formulate this for PCA, but for classification with SVMs there exist estimates of the VC dimension (which effectively measures how complex a decision function the classifier can handle). Some example facts can be found in these slides (see "VC Dimension of Support Vector Machines").
Best Answer
I think the key to the magic is smoothness. The long answer that follows simply explains this smoothness; it may or may not be the answer you expect.
Short answer:
Given a positive definite kernel $k$, there exists a corresponding space of functions $\mathcal{H}$, whose properties are determined by the kernel. It turns out that if $k$ is a Gaussian kernel, the functions in $\mathcal{H}$ are very smooth. So a learned function (e.g., a regression function, or principal components in the RKHS as in kernel PCA) is very smooth. Usually the smoothness assumption is sensible for the datasets we want to tackle. This explains why a Gaussian kernel is magical.
Long answer for why a Gaussian kernel gives smooth functions:
A positive definite kernel $k(x,y)$ implicitly defines an inner product $k(x,y)=\left\langle \phi(x),\phi(y)\right\rangle _{\mathcal{H}}$ for a feature vector $\phi(x)$ constructed from your input $x$, where $\mathcal{H}$ is a Hilbert space. The notation $\left\langle \phi(x),\phi(y)\right\rangle $ means the inner product between $\phi(x)$ and $\phi(y)$. For our purpose, you can imagine $\mathcal{H}$ to be the usual Euclidean space, but possibly with an infinite number of dimensions: imagine an ordinary vector that is infinitely long, like $\phi(x)=\left(\phi_{1}(x),\phi_{2}(x),\ldots\right)$. In kernel methods, $\mathcal{H}$ is a space of functions called a reproducing kernel Hilbert space (RKHS). This space has a special property, the "reproducing property": $f(x)=\left\langle f,\phi(x)\right\rangle $. This says that to evaluate $f(x)$, you first construct a feature vector (infinitely long, as mentioned) for $f$, then construct the feature vector $\phi(x)$ for $x$ (also infinitely long); the evaluation of $f(x)$ is given by the inner product of the two. Obviously, in practice, no one constructs an infinitely long vector. Since we only care about the inner product, we just evaluate the kernel $k$ directly. Bypassing the computation of explicit features and directly computing the inner product is known as the "kernel trick".
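For a kernel whose feature map is finite-dimensional, the trick can be spelled out completely. A minimal sketch with the homogeneous polynomial kernel $k(x,y)=(x^\top y)^2$ on $\mathbb{R}^2$, whose explicit features are $\left(x_1^2,\sqrt{2}x_1x_2,x_2^2\right)$ (the test points below are arbitrary):

```python
import math

def poly2_kernel(x, y):
    # k(x, y) = (x . y)^2 for x, y in R^2 -- evaluated without features.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    # Explicit feature map: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly2_kernel(x, y)                                        # kernel trick
rhs = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
print(lhs, rhs)  # both equal 1.0
```

Here both routes give the same number, but the kernel route never materializes $\phi$; with the Gaussian kernel the right-hand side would be an infinite sum, while the left-hand side stays a single exponential.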
What are the features?
I kept saying features $\phi_{1}(x),\phi_{2}(x),\ldots$ without specifying what they are. Given a kernel $k$, the features are not unique, but the inner product $\left\langle \phi(x),\phi(y)\right\rangle $ is uniquely determined. To explain the smoothness of the functions, let us consider Fourier features. Assume a translation-invariant kernel $k$, meaning $k(x,y)=k(x-y)$, i.e., the kernel depends only on the difference of its two arguments. The Gaussian kernel has this property. Let $\hat{k}$ denote the Fourier transform of $k$.
In this Fourier viewpoint, the features of $f$ are given by $f:=\left(\cdots,\hat{f}_{l}/\sqrt{\hat{k}_{l}},\cdots\right)$. This says that the feature representation of your function $f$ is given by its Fourier transform divided by the square root of the Fourier transform of the kernel $k$. The feature representation of $x$ is $\phi(x)=\left(\cdots,\sqrt{\hat{k}_{l}}\exp\left(-ilx\right),\cdots\right)$, where $i=\sqrt{-1}$. One can show that the reproducing property holds (an exercise for readers).
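For the record, the exercise works out directly (sweeping the complex-conjugation details of the inner product under the rug; the exact signs depend on the Fourier convention used):

$$\left\langle f,\phi(x)\right\rangle =\sum_{l}\frac{\hat{f}_{l}}{\sqrt{\hat{k}_{l}}}\,\sqrt{\hat{k}_{l}}\exp\left(-ilx\right)=\sum_{l}\hat{f}_{l}\exp\left(-ilx\right)=f(x),$$

i.e., the $\sqrt{\hat{k}_{l}}$ factors cancel and the inner product reduces to the Fourier series of $f$ evaluated at $x$.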
As in any Hilbert space, all elements belonging to the space must have a finite norm. Let us consider the squared norm of an $f\in\mathcal{H}$:
$ \|f\|_{\mathcal{H}}^{2}=\left\langle f,f\right\rangle _{\mathcal{H}}=\sum_{l=-\infty}^{\infty}\frac{\hat{f}_{l}^{2}}{\hat{k}_{l}}. $
So when is this norm finite, i.e., when does $f$ belong to the space? It is when $\hat{f}_{l}^{2}$ drops faster than $\hat{k}_{l}$, so that the sum converges. Now, the Fourier transform of a Gaussian kernel $k(x,y)=\exp\left(-\frac{\|x-y\|^{2}}{\sigma^{2}}\right)$ is another Gaussian, where $\hat{k}_{l}$ decreases exponentially fast with $l$. So if $f$ is to be in this space, its Fourier transform must drop even faster than that of $k$. This means the function will effectively have only a few low-frequency components with high weights. A signal with only low-frequency components does not "wiggle" much. This explains why a Gaussian kernel gives you a smooth function.
Extra: What about a Laplace kernel?
If you consider a Laplace kernel $k(x,y)=\exp\left(-\frac{\|x-y\|}{\sigma}\right)$, its Fourier transform is a Cauchy distribution, which drops much more slowly than the exponential function in the Fourier transform of a Gaussian kernel. This means a function $f$ will have more high-frequency components. As a result, a function given by a Laplace kernel is "rougher" than one given by a Gaussian kernel.
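The contrast can be seen concretely from the standard one-dimensional closed forms of the two transforms (constants dropped; $\sigma=1$ and the sample frequencies below are arbitrary choices):

```python
import math

sigma = 1.0
# Closed-form Fourier transforms, up to constant factors:
#   Gaussian k(t) = exp(-t^2 / sigma^2)  ->  k_hat(w) ~ exp(-sigma^2 w^2 / 4)
#   Laplace  k(t) = exp(-|t| / sigma)    ->  k_hat(w) ~ 1 / (1 + sigma^2 w^2)
def gauss_hat(w):
    return math.exp(-sigma**2 * w**2 / 4)

def laplace_hat(w):
    return 1.0 / (1.0 + sigma**2 * w**2)

for w in [1, 2, 4, 8]:
    print(w, gauss_hat(w), laplace_hat(w))
# The Gaussian spectrum dies off exponentially in w; the Cauchy-shaped
# Laplace spectrum only polynomially, so the Laplace RKHS admits functions
# with far more high-frequency content.
```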
Regardless of its width, one property of the Gaussian kernel is that it is "universal". Intuitively, this means that given an arbitrary bounded continuous function $g$, there exists a function $f\in\mathcal{H}$ that is close to $g$ (in the sense of $\|\cdot\|_{\infty}$) up to any precision needed. Basically, the Gaussian kernel gives functions that can approximate "nice" (bounded, continuous) functions arbitrarily well. Gaussian and Laplace kernels are universal; a polynomial kernel, for example, is not.
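A small numerical illustration of this approximation power (not a proof): fit an expansion $f(x)=\sum_i \alpha_i k(x,x_i)$ to the continuous but non-smooth target $g(x)=|x|$ by interpolating at a grid of nodes. The bandwidth, grid, and hand-rolled solver here are arbitrary choices for the sketch:

```python
import math

def k(x, y, sigma=0.3):
    # Gaussian kernel, matching the convention used above.
    return math.exp(-(x - y)**2 / sigma**2)

def solve(A, b):
    # Plain Gaussian elimination with partial pivoting (fine for tiny systems).
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

g = lambda x: abs(x)                        # continuous but not smooth at 0
nodes = [i / 5.0 - 1.0 for i in range(11)]  # 11 equispaced nodes in [-1, 1]
K = [[k(a, b) for b in nodes] for a in nodes]
alpha = solve(K, [g(x) for x in nodes])     # interpolation coefficients

f = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, nodes))
# Uniform error of the kernel expansion over a fine grid of [-1, 1].
err = max(abs(f(t / 50.0 - 1.0) - g(t / 50.0 - 1.0)) for t in range(101))
print(err)  # small relative to the scale of g
```

Adding more nodes drives the uniform error down further, which is the practical face of universality.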
In general, you can use anything you like as long as the resulting $k$ is positive definite. Positive definiteness is defined as $\sum_{i=1}^{N}\sum_{j=1}^{N}k(x_{i},x_{j})\alpha_{i}\alpha_{j}\geq 0$ for all $\alpha_{i}\in\mathbb{R}$, all $\{x_{i}\}_{i=1}^{N}$, and all $N\in\mathbb{N}$ (the set of natural numbers). If $k$ is not positive definite, then it does not correspond to an inner product space, and all of the analysis breaks down because you do not even have the space of functions $\mathcal{H}$ mentioned above. Nonetheless, it may still work empirically. For example, the hyperbolic tangent kernel (see number 7 on this page)
$k(x,y) = \tanh(\alpha x^\top y + c)$
which is intended to imitate the sigmoid activation units of neural networks, is only positive definite for some settings of $\alpha$ and $c$. Still, it has been reported to work in practice.
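The definition can be probed numerically: build the kernel matrix on a handful of points and test it for positive semidefiniteness, e.g. via a Cholesky factorization. A sketch (the points and the choice $\alpha=-1$, $c=0$ for the tanh kernel are arbitrary):

```python
import math

def is_psd(K, tol=1e-9):
    # Attempt a Cholesky factorization of K + tol*I; success implies the
    # matrix is (numerically) positive semidefinite.
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                d = s + tol
                if d <= 0:
                    return False
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = s / L[j][j]
    return True

pts = [-2.0, -1.0, 0.0, 0.5, 1.5, 3.0]
gauss = [[math.exp(-(a - b)**2) for b in pts] for a in pts]
tanh_bad = [[math.tanh(-a * b) for b in pts] for a in pts]  # alpha=-1, c=0

print(is_psd(gauss))     # True: Gaussian kernel matrices are always PSD
print(is_psd(tanh_bad))  # False: its diagonal is already negative
```

Failing this check on even one point set proves the kernel is not positive definite; passing it, of course, proves nothing in general.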
What about other kinds of features?
I said the features are not unique. For the Gaussian kernel, another set of features is given by the Mercer expansion; see Section 4.3.1 of the famous Gaussian process book by Rasmussen and Williams. In this case, the features $\phi(x)$ are Hermite polynomials evaluated at $x$.