The RVM places an Automatic Relevance Determination (ARD) prior on the weights in a regularized regression/logistic regression setup. (The ARD prior is just a weak gamma prior on the precision of a Gaussian random variable.) Marginalizing out the weights and maximizing the likelihood of the data with respect to the precisions drives many of the precision parameters to very large values, which pushes the associated weights to zero. If the feature vectors come from a design matrix whose columns correspond to the training examples (e.g., kernel evaluations centered on them), then this strategy selects a small set of examples that predict the target variable well.
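As a rough illustration of this mechanism, here is a minimal sketch using scikit-learn's `ARDRegression` on an RBF kernel design matrix; the data, kernel width, and threshold are illustrative assumptions, not taken from the original discussion, and a full RVM implementation would differ in details (e.g., bias handling and hyperparameter updates).

```python
# A minimal sketch of the ARD/RVM idea (illustrative data, not from the text):
# build a design matrix whose columns are RBF kernels centred on the training
# examples, then let the ARD prior prune most of the associated weights.
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

Phi = rbf_kernel(X, X, gamma=1.0)   # one "feature" (column) per training example

model = ARDRegression()             # gamma hyperprior on each weight's precision
model.fit(Phi, y)

# Most precisions grow large, driving their weights to ~0; the few examples
# with non-negligible weights play the role of "relevance vectors".
relevant = np.flatnonzero(np.abs(model.coef_) > 1e-3)
print(f"{len(relevant)} of {len(X)} examples kept as relevance vectors")
```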
The IVM's strategy is fundamentally different from the RVM's. The IVM is a Gaussian process method that selects a small active set of points from the training set using a greedy criterion (the change in entropy of the posterior GP when a point is included) and then performs standard GP regression/classification on that sparse active set.
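For regression with Gaussian noise, the entropy reduction from including a point is a monotone function of its current posterior variance, so a simplified version of the greedy selection just picks the most uncertain point at each step. The sketch below assumes that setting; the kernel, noise level, and data are illustrative, and the real IVM uses efficient rank-one updates rather than the naive recomputation shown here.

```python
# A simplified sketch of IVM-style greedy selection for GP regression with
# Gaussian noise: at each step, add the point whose inclusion reduces the
# posterior entropy the most, i.e. the point with the largest posterior variance.
import numpy as np

def rbf(A, B, length_scale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def ivm_select(X, n_active, noise=0.1):
    K = rbf(X, X)
    active = []
    var = np.diag(K).copy()                 # prior variances
    for _ in range(n_active):
        i = int(np.argmax(var))             # largest entropy reduction
        active.append(i)
        A = np.array(active)
        # recompute posterior variances given the active set (naive O(m^3) update;
        # the actual IVM maintains these with rank-one updates)
        K_aa = K[np.ix_(A, A)] + noise**2 * np.eye(len(A))
        K_xa = K[:, A]
        var = np.diag(K) - np.einsum('ij,jk,ik->i', K_xa, np.linalg.inv(K_aa), K_xa)
        var[A] = -np.inf                    # never re-select an active point
    return active

X = np.random.default_rng(1).uniform(-3, 3, size=(200, 1))
print(ivm_select(X, n_active=10))
```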
Unlike the SVM, neither the IVM nor the RVM has an obvious geometric interpretation of its relevant or informative vectors. Both algorithms find sparse solutions to regression/classification problems (the SVM and IVM are sparse in the dual, while the RVM is better thought of as sparse in the primal), but they arrive at that sparsity by different routes.
Feature Space
Feature space refers to the $n$-dimensional space in which your variables live (not including the target variable, if one is present). The term appears often in the ML literature because feature extraction is a standard ML task, so all variables are viewed as features. For example, consider a data set with:
Target
- $Y \equiv$ Thickness of car tires after some testing period
Variables
- $X_1 \equiv$ distance travelled in test
- $X_2 \equiv$ time duration of test
- $X_3 \equiv$ amount of chemical $C$ in tires
The feature space is $\mathbf{R}^3$, or more accurately the positive orthant of $\mathbf{R}^3$, since all the $X$ variables can only take positive values. Domain knowledge about tires might suggest that the speed the vehicle was moving at is important, hence we generate another variable, $X_4$ (this is the feature extraction part):
- $X_4 =\frac{X_1}{X_2} \equiv$ the speed of the vehicle during testing.
This extends our old feature space into a new one, the positive orthant of $\mathbf{R}^4$.
Mappings
Furthermore, a mapping in our example is a function, $\phi$, from $\mathbf{R}^3$ to $\mathbf{R}^4$:
$$\phi(x_1,x_2,x_3) = (x_1, x_2, x_3, \frac{x_1}{x_2} )$$
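Written as code, the mapping is just a small function; the sketch below is a direct translation of the formula, with illustrative variable names and an example input that is not from the original.

```python
# A direct translation of the mapping phi above (names and example values are illustrative).
def phi(x1: float, x2: float, x3: float) -> tuple:
    """Map the original three features to the extended feature space,
    appending speed = distance / time as the fourth feature."""
    return (x1, x2, x3, x1 / x2)

# e.g. 120 km travelled over 2 hours with 0.5 units of chemical C
print(phi(120.0, 2.0, 0.5))   # -> (120.0, 2.0, 0.5, 60.0)
```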
Best Answer
The big $\mathcal X$ is the space of all possible values of the features, and the little $x$ is a point in that space.
This is no different from saying $x\in\mathbb R^n$, though $\mathcal X$ is more general (could have categorical features, for instance).
$\mathcal X = \mathbb R \times \mathbb R \times \{\text{dog}, \text{cat}, \text{horse}\}$ is a perfectly acceptable $\mathcal X$ and would represent two features that can take any real number and a third feature that is categorical (with levels of $\text{dog}$, $\text{cat}$, and $\text{horse}$).
(One might argue that the feature space should encode the categorical feature as $0$s and $1$s (e.g., by one-hot encoding the categories), though that is a separate issue.)
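If one does choose that encoding, a minimal sketch with scikit-learn's `OneHotEncoder` might look like the following (the sample values are made up for illustration); the three-level categorical feature becomes three $0/1$ indicator columns, so the encoded feature space sits inside $\mathbb R^2 \times \{0,1\}^3$.

```python
# A minimal sketch of one-hot encoding the categorical third feature
# (the sample values are made up for illustration).
from sklearn.preprocessing import OneHotEncoder

animals = [["dog"], ["cat"], ["horse"], ["dog"]]
enc = OneHotEncoder(categories=[["dog", "cat", "horse"]])
print(enc.fit_transform(animals).toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```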