This is a statistics question, not a programming question, and would be better asked on CrossValidated. At least the LaTeX code gets parsed there automatically :). Also, this is more complicated than what is readily available on that site. I'll give some guidance, but since you say you want to learn how to do things, this won't be the complete answer. (If you don't want to do that, we can locate the cooked answers on the web, too.)
Each draw of the betas relies on the complete data set. If you are doing this with individual y_i's and x_i's one at a time, you are not doing it right. Before you start working with the code, you need to sit down with a piece of paper (letter size or A4, depending on your geography) and derive the posterior distributions of the betas:
- This is given: y|beta is normal with mean x'beta and precision tau
- This is given: prior for beta is normal with mean mu and precision gamma
- Obtain this: the marginal distribution of y, by integrating the betas out (which is easy to do, since the joint distribution of y and beta is multivariate normal, and you can do this by kernel matching: the part that depends on beta is going to be exp[a quadratic form in beta], so you recognize it as the relevant part of a normal density to integrate over; whatever is left after integration should be a normal density in y and the prior parameters)
- Obtain this: the posterior distribution of beta given y, by Bayes' theorem (the likelihood times the prior divided by the marginal; again this should be a moderately complicated combination of exp[jointly quadratic in y and beta])
- Obtain this: the conditional distribution of beta_1 given beta_2 and y, one of the conditionals of the multivariate normal distribution obtained at the previous step.
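For that last step, the standard conditioning formula for a partitioned multivariate normal is worth writing down. Here $m_1, m_2$ and the blocks $S_{ij}$ denote whatever posterior mean and covariance of $(\beta_1, \beta_2)$ given $y$ you derived in the previous step:

$$\beta_1 \mid \beta_2, y \;\sim\; \mathcal{N}\!\left(m_1 + S_{12}S_{22}^{-1}(\beta_2 - m_2),\; S_{11} - S_{12}S_{22}^{-1}S_{21}\right)$$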
You need to know how to manipulate the multivariate normal distribution and get conditional and marginal distributions out of it. Again, if this is over your head, we can find the ready solutions.
Note that you also need a sampler for the variance of regression errors, unless you treat it as known (which is hardly a practical situation). This will be slightly more complicated, as you would need to incorporate another dimension into your integration procedures.
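To make the target concrete, here is a minimal sketch of the finished sampler in Python/NumPy. The simulated data, the weak normal prior on beta, and the conjugate Gamma prior on tau (one common way to handle the error-precision sampler mentioned above) are all my own illustrative choices, not anything given in the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p coefficients (intercept + 2 covariates)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
tau_true = 4.0  # error precision (1 / variance)
y = X @ beta_true + rng.normal(scale=tau_true ** -0.5, size=n)

# Priors: beta ~ N(mu0, gamma^-1 I), tau ~ Gamma(a0, rate=b0)
mu0 = np.zeros(p)
gamma = 0.01          # weak prior precision on beta
a0, b0 = 2.0, 1.0

draws = 2000
betas = np.empty((draws, p))
taus = np.empty(draws)
tau = 1.0
XtX, Xty = X.T @ X, X.T @ y

for t in range(draws):
    # beta | tau, y: multivariate normal obtained by completing the square
    prec = gamma * np.eye(p) + tau * XtX
    cov = np.linalg.inv(prec)
    mean = cov @ (gamma * mu0 + tau * Xty)
    beta = rng.multivariate_normal(mean, cov)
    # tau | beta, y: conjugate Gamma update (numpy's gamma takes a scale)
    resid = y - X @ beta
    tau = rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * resid @ resid))
    betas[t], taus[t] = beta, tau

print(betas[500:].mean(axis=0))  # posterior means, close to beta_true
```

Note that each update uses the whole of `X` and `y` at once, which is the point made above about using the complete data set.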
Here are a few remarks. Perhaps someone else can fill in the details.
1) Basis representations are always a good idea. It's hard to avoid them if you want to actually do something computational with your covariance function. The basis expansion can give you an approximation to the kernel and something to work with. The hope is that you can find a basis that makes sense for the problem you are trying to solve.
2) I'm not quite sure what you mean by a configuration in this question. At least one of the basis functions will need to be a function of $\theta$ for the kernel to depend on $\theta$. So yes, the eigenfunctions will vary with the parameter. They would also vary with different parametrizations.
Typically, the number of basis functions will be (countably) infinite, so the number won't vary with the parameter, unless some values caused the kernel to become degenerate.
I also don't understand what you mean in point 2 about the process having Bayesian prior $w \sim \mathcal{N}(0,\mathrm{diag}[\lambda_1^2, \ldots])$ since $w$ has not been mentioned up until this point. Also, $\mathrm{diag}[\lambda_1^2, \ldots]$ seems to be an infinite-dimensional matrix, and I have a problem with that.
3) Which sets of basis functions form valid kernels? If you're thinking about an eigenbasis, then the functions need to be orthogonal with respect to some measure. There are two requirements. 1) The resulting kernel has to be positive definite – and that's OK if the $\lambda_i$ are positive. And 2) the expansion has to converge. This will depend on the $\lambda_i$, which need to dampen fast enough to ensure the convergence of the series. Convergence will also depend on the domain of the $x$'s.
If the basis functions are not orthogonal then it will be more difficult to show that a covariance defined from them is positive definite. Obviously, in that case you are not dealing with an eigen-expansion, but with some other way of approximating the function of interest.
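The positive-definiteness point can be checked numerically. The toy basis below (sines, orthogonal on $[0, 2\pi]$ with respect to Lebesgue measure) and the weights $\lambda_i = 1/i^2$ are my own illustrative choices; any orthogonal basis with positive, summable weights would do:

```python
import numpy as np

rng = np.random.default_rng(1)

# Truncated eigen-expansion kernel k(x, x') = sum_i lambda_i phi_i(x) phi_i(x')
# with phi_i(x) = sin(i x) and lambda_i = 1/i^2 (positive, decaying fast
# enough for the series to converge on a bounded domain).
def kernel(x, xp, n_terms=20):
    k = 0.0
    for i in range(1, n_terms + 1):
        lam = 1.0 / i ** 2
        k += lam * np.sin(i * x) * np.sin(i * xp)
    return k

# A Gram matrix built from such a kernel is positive semidefinite by
# construction (a sum of rank-one terms lambda_i * phi_i phi_i^T):
xs = rng.uniform(0, 2 * np.pi, size=30)
K = np.array([[kernel(a, b) for b in xs] for a in xs])
print(np.linalg.eigvalsh(K).min())  # no eigenvalue below rounding error
```

With negative $\lambda_i$, or weights that fail to decay, the same construction breaks down, which is exactly the two requirements in point 3.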
However, I don't think people typically start from a bunch of functions and then try to build a covariance kernel from them.
RE: Differentiability of the kernel and differentiability of the basis functions. I don't actually know the answer to this question, but I would offer the following observation.
Functional analysis proceeds by approximating functions (from an infinite dimensional space) by finite sums of simpler functions. To make this work, everything depends on the type of convergence involved. Typically, if you are working on a compact set with strong convergence properties (uniform convergence or absolute summability) on the functions of interest, then you get the kind of intuitive result that you are looking for: the properties of the simple functions pass over to the limit function -- e.g. if the kernel is a differentiable function of a parameter, then the expansion functions must be differentiable functions of the same parameter, and vice-versa. Under weaker convergence properties or non-compact domains, this does not happen. In my experience, there's a counter-example to every "reasonable" idea one comes up with.
Note: To forestall possible confusion from readers of this question, note that the Gaussian expansion of point 1 is not an example of the eigen-expansion of point 2.
Best Answer
Since you seem confused, let me start by stating the problem and taking your questions one by one. You have a sample of 10,000 observations, each described by a feature vector $x\in\mathbb{R}^{31}$. If you want to perform regression using Gaussian radial basis functions, then you are looking for a function of the form $$f(x) = \sum_{j=1}^{m} w_j \, g_j(x; \mu_j, \sigma_j)$$ where the $g_j$ are your basis functions. Specifically, you need to find the $m$ weights $w_j$ so that, for given parameters $\mu_j$ and $\sigma_j$, you minimise the error between $y$ and the corresponding prediction $\hat{y} = f(x)$ – typically you will minimise the least squares error.
What exactly is the $\mu_j$ parameter?
You need to find $m$ basis functions $g_j$. (You still need to determine the number $m$.) Each basis function will have a $\mu_j$ and a $\sigma_j$ (also unknown). The subscript $j$ ranges from $1$ to $m$.
Is the $\mu_j$ a vector?
Yes, it is a point in $\mathbb{R}^{31}$. In other words, it is a point somewhere in your feature space, and a $\mu_j$ must be determined for each of the $m$ basis functions.
I've read that this governs the locations of the basis functions. So is this not the mean of something?
The $j^{th}$ basis function is centered at $\mu_j$. You will need to decide where these locations are. So no, it is not necessarily the mean of anything (but see further down for ways to determine it).
Now for the sigma that "governs the spatial scale". What exactly is that?
$\sigma$ is easier to understand if we turn to the basis functions themselves.
It helps to think of the Gaussian radial basis functions in lower dimensions, say $\mathbb{R}^{1}$ or $\mathbb{R}^{2}$. In $\mathbb{R}^{1}$ the Gaussian radial basis function is just the well-known bell curve. The bell can of course be narrow or wide. The width is determined by $\sigma$ – the larger $\sigma$ is, the wider the bell; the smaller $\sigma$ is, the narrower the bell. In other words, $\sigma$ sets the spatial scale over which the basis function is appreciably different from zero.
You may ask what the purpose of this is. If you think of the bell covering some portion of space (an interval of the line in $\mathbb{R}^{1}$) – a narrow bell will only cover a small part of the line*. Points $x$ close to the centre of the bell have a larger $g_j(x)$ value; points far from the centre have a smaller $g_j(x)$ value. Shrinking $\sigma$ narrows the bell, so a point at a fixed distance from the centre ends up further out in the tail, which reduces its $g_j(x)$ value.
Each basis function converts input vector x into a scalar value
Yes, you are evaluating the basis functions at some point $\mathbf{x}\in\mathbb{R}^{31}$.
$$\exp\left(-\frac{\|\mathbf{x}-\mu_j\|_2^2}{2\sigma_j^2}\right)$$
You get a scalar as a result. The scalar depends on the distance $\|\mathbf{x}-\mu_j\|$ of the point $\mathbf{x}$ from the centre $\mu_j$, and on the scalar $\sigma_j$.
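In code, one basis-function evaluation is a one-liner. This is a direct transcription of the formula above (the test points are my own, chosen only to show the behaviour at and away from the centre):

```python
import numpy as np

# One Gaussian radial basis function g_j evaluated at a point x.
# mu_j is a point in the feature space (here R^31), sigma_j a positive scalar.
def rbf(x, mu_j, sigma_j):
    return np.exp(-np.sum((x - mu_j) ** 2) / (2.0 * sigma_j ** 2))

mu = np.zeros(31)
print(rbf(np.zeros(31), mu, 1.0))    # 1.0 exactly at the centre
print(rbf(np.full(31, 10.0), mu, 1.0))  # essentially 0 far from the centre
```

You can also check the scaling discussion directly: for the same point, a larger $\sigma_j$ yields a larger value, because the point sits less far out in the tail of the wider bell.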
I've seen some implementations that try such values as .1, .5, 2.5 for this parameter. How are these values computed?
This of course is one of the interesting and difficult aspects of using Gaussian radial basis functions. If you search the web you will find many suggestions as to how these parameters are determined. I will outline, in very simple terms, one possibility based on clustering. You can find this and several other suggestions online.
Start by clustering your 10,000 samples (you could first use PCA to reduce the dimensions, followed by k-means clustering). You can let $m$ be the number of clusters you find (typically employing cross-validation to determine the best $m$). Now create a radial basis function $g_j$ for each cluster. For each radial basis function, let $\mu_j$ be the centre (e.g. the mean or centroid) of the cluster, and let $\sigma_j$ reflect the width of the cluster (e.g. its radius). Now go ahead and perform your regression. (This simple description is just an overview – it needs lots of work at each step!)
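The whole recipe can be sketched end to end. To keep it readable I use a small 1-D toy problem instead of 10,000 points in $\mathbb{R}^{31}$, a hand-rolled k-means instead of a library call, a fixed $m$ instead of cross-validation, and a mildly inflated $\sigma_j$ so that neighbouring bells overlap – all of these are my own simplifications:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: a noisy sine curve in 1-D
n, m = 500, 8
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Step 1: cluster the samples (a tiny k-means, a few fixed iterations)
centers = X[rng.choice(n, m, replace=False)]
for _ in range(20):
    labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    for j in range(m):
        if np.any(labels == j):
            centers[j] = X[labels == j].mean(axis=0)

# Step 2: mu_j = cluster centre; sigma_j from the cluster's spread,
# inflated a little so neighbouring bells overlap
sigmas = np.empty(m)
for j in range(m):
    pts = X[labels == j]
    if len(pts):
        sigmas[j] = 2.0 * np.sqrt(((pts - centers[j]) ** 2).sum(-1).mean()) + 1e-3
    else:
        sigmas[j] = 1.0  # fallback for an empty cluster

# Step 3: design matrix of basis-function values, then least squares for w
G = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1)
           / (2.0 * sigmas[None] ** 2))
w, *_ = np.linalg.lstsq(G, y, rcond=None)
y_hat = G @ w
print(np.mean((y - y_hat) ** 2))  # training MSE
```

Each of the three steps hides real decisions (how many clusters, how to set the widths, whether to regularise the least-squares solve), which is what the "lots of work at each step" caveat above refers to.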
*Of course, the bell curve is defined from $-\infty$ to $\infty$, so it will have a value everywhere on the line. However, the values far from the centre are negligible.