The best and standard reference is:
Aronszajn, Nachman. "Theory of reproducing kernels." Transactions of
the American Mathematical Society 68.3 (1950): 337-404.
I am not sure how deep your knowledge of functional analysis goes, and the difference between levels of functional analysis is considerable. So I will mention another standard reference:
Smola, Alex J., and Bernhard Schölkopf. Learning with Kernels.
GMD-Forschungszentrum Informationstechnik, 1998.
I am also not sure about your general math background; depending on it, you may be interested in:
Lafferty, John, and Guy Lebanon. "Diffusion kernels on statistical
manifolds." Journal of Machine Learning Research 6.Jan (2005):
129-163.
Your understanding of linear SVMs sounds correct, but there may be some misconceptions about kernelized SVMs.
A kernelized SVM is equivalent to a linear SVM that operates in feature space rather than input space. Conceptually, you can think of this as mapping the data (possibly nonlinearly) into feature space, then using a linear SVM. However, the actual steps taken when using a kernelized SVM don't look like this because the kernel trick is used. Rather than explicitly mapping the data into feature space, the mapping is defined implicitly by the kernel function, which returns the dot product between feature space representations of two data points. Say $x$ and $x'$ are points in input space and $K$ is the kernel function. Then $K(x, x') = \Phi(x) \cdot \Phi(x')$, where $\Phi$ is a mapping from input space to feature space. Because a linear SVM can be formulated in terms of dot products, one can replace these dot products with kernel function evaluations to obtain a linear SVM that operates in feature space, without ever having to compute (or even know) $\Phi$.
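To see the kernel trick concretely, here is a minimal NumPy sketch (function names are mine) checking that the homogeneous degree-2 polynomial kernel $K(x, x') = (x \cdot x')^2$ on 2-D inputs equals the dot product under the explicit map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def poly_kernel(x, xp):
    """Homogeneous degree-2 polynomial kernel: K(x, x') = (x . x')^2."""
    return np.dot(x, xp) ** 2

def phi(x):
    """Explicit feature map for this kernel on 2-D inputs:
    Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

# The kernel returns the feature-space dot product without ever forming Phi.
print(poly_kernel(x, xp))          # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(xp)))     # same value via the explicit map
```

For this kernel the feature map is cheap to write down; the point of the trick is that for other kernels (e.g. the RBF kernel) $\Phi$ is infinite dimensional, yet $K$ is still a one-line computation.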
> You could just introduce some arbitrary new dimensions...but there is no guarantee that this new space is appropriate. So instead, you use a kernel to compute the new space, which takes into account the actual locations of the datapoints, rather than just the space as a whole.
Feature space isn't defined by the data points, but by the kernel function itself. There's no guarantee that any particular kernel function will give linear separability. So, the choice of kernel is an important model selection problem.
> In this new space, it may now be easier to separate out the data, because the new space actually represents similarities between datapoints, whereas in the original space, it just represented the real-world data.
Consider the linear kernel $K(x, x') = x \cdot x'$ which simply computes the dot product in input space. This gives a feature space that's equivalent to input space (up to rotation and reflection, which preserve the dot product). So, an SVM with this kernel would be equivalent to a regular linear SVM, and the above statement can't be true--just framing things in terms of similarities between data points doesn't necessarily increase separability.
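A quick NumPy sketch (variable names are mine) of the parenthetical claim: an orthogonal map $\Phi(x) = Qx$ (rotation/reflection) leaves every dot product, and hence the whole Gram matrix of the linear kernel, unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 points in a 3-D input space

# Linear-kernel Gram matrix: K_ij = x_i . x_j
K = X @ X.T

# An orthogonal Q gives the feature map Phi(x) = Q x ...
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
K_rot = (X @ Q.T) @ (X @ Q.T).T    # Gram matrix of the mapped points

# ... and since Q^T Q = I, the Gram matrix is identical.
print(np.allclose(K, K_rot))       # True
```

Since the SVM sees the data only through the Gram matrix, any two feature maps inducing the same kernel give exactly the same classifier.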
> And the reason why this new space is considered to have potentially infinite dimensions, is that you can have one dimension of the new space for each datapoint in your data; so as your dataset approaches infinite size, the dimensionality of the new dataset also approaches infinity.
Feature space does not have one dimension per datapoint. For example, consider the linear kernel again. Feature space is equivalent to the input space, and therefore has the same number of dimensions. Polynomial kernels induce a finite dimensional feature space, with higher dimensionality than the input space (for degree > 1). Some kernels (such as the RBF kernel) do induce an infinite dimensional feature space. This just means that, in order for $\Phi(x) \cdot \Phi(x')$ to equal $K(x, x')$, $\Phi(x)$ must be infinite dimensional. This is a consequence of the kernel function, rather than the number of data points. However, even if feature space is infinite dimensional, the data can only span a finite-dimensional subspace. The maximum possible dimensionality of this subspace is the number of data points.
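Both points can be checked numerically via Gram-matrix ranks, since the rank of the Gram matrix equals the dimension of the subspace spanned by the mapped data. A small sketch (setup is mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 3
X = rng.normal(size=(n, d))            # n points in a d-dimensional input space

# Linear kernel: feature space == input space, so the Gram matrix
# K = X X^T has rank at most d, no matter how many points there are.
K_lin = X @ X.T
print(np.linalg.matrix_rank(K_lin))    # 3

# RBF kernel: feature space is infinite dimensional, yet the n mapped
# points span at most an n-dimensional subspace; for distinct points
# the Gaussian Gram matrix generically has full rank n.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq_dists / 2.0)
print(np.linalg.matrix_rank(K_rbf))    # 10
```

So the dimensionality of the span is capped both by the kernel (3 for the linear kernel here) and by the number of data points (10 for the RBF kernel here), matching the statement above.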
As the name says, a reproducing kernel Hilbert space is a Hilbert space, so some knowledge of Hilbert spaces/functional analysis comes in handy ... But you might as well start with RKHSs, and then see what you do not understand and what you need to read to cover that.
The usual example of a Hilbert space, $L_2$, has the problem that its members are not functions, but equivalence classes of functions that coincide except on a set of (Lebesgue) measure zero. That way, they always give the same results when integrated ... and that is what $L_2$ spaces can be used for. Members of $L_2$ spaces cannot really be evaluated pointwise, since you can change the value at one point without changing the value of any integral.
So in applications where you really want functions that you can evaluate at individual points (as in approximation theory, regression, ...), RKHSs come in handy, because the defining property is equivalent to the requirement that the evaluation functional $$ E_x(f) = f(x) $$ be continuous in $f$ for each $x$. So you can evaluate the member functions, and replacing $f$ with a nearby function (in the RKHS norm), say $f+\epsilon$, will only change the values a little bit. That is the intuition you asked for.
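The mechanism behind this is the reproducing property: evaluation at $x$ is the inner product with the kernel section $k(\cdot, x)$, i.e. $f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}}$. A minimal NumPy sketch (kernel and coefficients are my choices) checking this for a function of the form $f = \sum_i \alpha_i\, k(\cdot, x_i)$:

```python
import numpy as np

def k(x, y, gamma=0.5):
    """RBF kernel on scalars: k(x, y) = exp(-gamma * (x - y)^2)."""
    return np.exp(-gamma * (x - y) ** 2)

# A member of the RKHS: f = sum_i alpha_i * k(., x_i)
xs    = np.array([0.0, 1.0, 2.5])
alpha = np.array([1.0, -0.5, 2.0])

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, xs))

# Reproducing property: <f, k(., x)>_H = f(x).  For f of the above form,
# the RKHS inner product expands to sum_i alpha_i * k(x_i, x), so point
# evaluation IS an inner product -- hence a continuous functional.
x = 0.7
inner_product = np.dot(alpha, k(xs, x))
print(np.isclose(inner_product, f(x)))  # True
```

Continuity of $E_x$ then follows from Cauchy-Schwarz: $|f(x)| = |\langle f, k(\cdot, x)\rangle| \le \|f\|_{\mathcal{H}}\, \sqrt{k(x, x)}$.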