As you seem confused, let me start by stating the problem and then take your questions one by one. You have a sample size of 10,000, and each sample is described by a feature vector $x\in\mathbb{R}^{31}$. If you want to perform regression using Gaussian radial basis functions, then you are looking for a function of the form $$f(x) = \sum_{j=1}^{m} w_j \, g_j(x; \mu_j, \sigma_j)$$ where the $g_j$ are your basis functions. Specifically, you need to find the $m$ weights $w_j$ so that, for given parameters $\mu_j$ and $\sigma_j$, you minimise the error between each $y$ and the corresponding prediction $\hat{y} = f(x)$; typically you will minimise the least-squares error.
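To make the least-squares step concrete, here is a minimal Python sketch, assuming the centres $\mu_j$ and widths $\sigma_j$ have already been chosen (the function names are mine, just for illustration):

```python
import numpy as np

def design_matrix(X, centers, sigmas):
    """Evaluate all m Gaussian basis functions at all n samples.

    X: (n, d) samples; centers: (m, d) mu_j; sigmas: (m,) sigma_j.
    Returns G with G[i, j] = g_j(x_i).
    """
    # Squared Euclidean distances between every sample and every centre, shape (n, m)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigmas**2))

def fit_weights(X, y, centers, sigmas):
    """Least-squares weights w_j minimising ||G w - y||^2."""
    G = design_matrix(X, centers, sigmas)
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    return w
```

Prediction at new points is then just `design_matrix(X_new, centers, sigmas) @ w`.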
What exactly is the $\mu_j$ parameter?
You need to find $m$ basis functions $g_j$ (you still need to determine the number $m$). Each basis function will have a $\mu_j$ and a $\sigma_j$ (also unknown). The subscript $j$ ranges from $1$ to $m$.
Is the $\mu_j$ a vector?
Yes, it is a point in $\mathbb{R}^{31}$. In other words, it is a point somewhere in your feature space, and a $\mu_j$ must be determined for each of the $m$ basis functions.
I've read that this governs the locations of the basis functions. So is this not the mean of something?
The $j^{th}$ basis function is centred at $\mu_j$. You will need to decide where these locations are. So no, it is not necessarily the mean of anything (but see further down for ways to determine it).
Now for the sigma that "governs the spatial scale". What exactly is that?
$\sigma$ is easier to understand if we turn to the basis functions themselves.
It helps to think of the Gaussian radial basis functions in lower dimensions, say $\mathbb{R}^{1}$ or $\mathbb{R}^{2}$. In $\mathbb{R}^{1}$ the Gaussian radial basis function is just the well-known bell curve. The bell can of course be narrow or wide. The width is determined by $\sigma$: the larger $\sigma$ is, the wider the bell shape. In other words, $\sigma$ scales the width of the bell. So for $\sigma = 1$ we have no scaling; for small $\sigma$ the bell becomes narrow.
You may ask what the purpose of this is. If you think of the bell covering some portion of space (a line in $\mathbb{R}^{1}$), a narrow bell will only cover a small part of the line*. Points $x$ close to the centre of the bell will have a larger $g_j(x)$ value; points far from the centre will have a smaller $g_j(x)$ value. Scaling has the effect of pushing points effectively further from the centre: as the bell narrows, a fixed point sits further out relative to the bell's width, reducing the value of $g_j(x)$.
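A quick numerical illustration of this scaling effect (the distance and the $\sigma$ values here are arbitrary, chosen just to show the trend): for a point one unit from the centre,

```python
import numpy as np

x_dist = 1.0  # distance ||x - mu|| from the centre
for sigma in (0.5, 1.0, 2.5):
    g = np.exp(-x_dist**2 / (2 * sigma**2))
    print(f"sigma={sigma}: g={g:.4f}")
# sigma=0.5: g=0.1353  (narrow bell: the point is effectively far out)
# sigma=1.0: g=0.6065
# sigma=2.5: g=0.9231  (wide bell: the point still gets a high value)
```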
Each basis function converts the input vector $\mathbf{x}$ into a scalar value
Yes, you are evaluating the basis functions at some point $\mathbf{x}\in\mathbb{R}^{31}$.
$$g_j(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x}-\mu_j\|_2^2}{2\sigma_j^2}\right)$$
You get a scalar as a result. The scalar depends on the distance of the point $\mathbf{x}$ from the centre $\mu_j$, given by $\|\mathbf{x}-\mu_j\|_2$, and on the scale parameter $\sigma_j$.
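As a minimal sketch of that vector-in, scalar-out evaluation (names are mine):

```python
import numpy as np

def g(x, mu, sigma):
    """Gaussian RBF: maps a vector x to a scalar in (0, 1]."""
    return np.exp(-np.sum((x - mu) ** 2) / (2 * sigma**2))

x = np.random.rand(31)      # a sample in R^31
mu = np.random.rand(31)     # a centre in R^31
print(g(x, mu, sigma=1.0))  # a single scalar
```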
I've seen some implementations that try such values as .1, .5, 2.5 for this parameter. How are these values computed?
This of course is one of the interesting and difficult aspects of using Gaussian radial basis functions. If you search the web you will find many suggestions as to how these parameters are determined. I will outline, in very simple terms, one possibility based on clustering. You can find this and several other suggestions online.
Start by clustering your 10,000 samples (you could first use PCA to reduce the dimensions, followed by k-means clustering). You can let $m$ be the number of clusters you find (typically employing cross-validation to determine the best $m$). Now create a radial basis function $g_j$ for each cluster. For each radial basis function, let $\mu_j$ be the centre (e.g. the mean or centroid) of the cluster, and let $\sigma_j$ reflect the width of the cluster (e.g. its radius). Now go ahead and perform your regression. (This simple description is just an overview: it needs lots of work at each step!)
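Here is a rough sketch of that pipeline, assuming scikit-learn is available; the number of PCA components, the number of clusters $m$, the width rule, and the ridge penalty are all placeholders you would tune (e.g. by cross-validation), not recommendations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

# Placeholder data standing in for your 10,000 x 31 problem
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 31))
y = rng.normal(size=10000)

# 1. Optionally reduce dimension before clustering
Z = PCA(n_components=10).fit_transform(X)   # 10 is a placeholder

# 2. Cluster to get m, the centres mu_j, and the widths sigma_j
m = 20                                      # placeholder: tune via cross-validation
km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(Z)
centers = km.cluster_centers_
sigmas = np.array([                         # width = mean distance to the centre
    np.linalg.norm(Z[km.labels_ == j] - centers[j], axis=1).mean()
    for j in range(m)
])
sigmas = np.maximum(sigmas, 1e-6)           # guard against degenerate clusters

# 3. Build the RBF design matrix and regress; ridge guards against
#    ill-conditioning when basis functions overlap heavily
sq_dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-sq_dist / (2 * sigmas**2))
model = Ridge(alpha=1.0).fit(G, y)          # the fitted w_j are in model.coef_
```

A ridge penalty is used here instead of plain least squares only because $G$ can be badly conditioned when clusters overlap; with well-separated clusters, ordinary least squares works too.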
*Of course, the bell curve is defined from $-\infty$ to $\infty$, so it has a value everywhere on the line. However, the values far from the centre are negligible.
Best Answer
I think we'd like the subject narrowed down a bit too; same concerns here about our time and effectiveness. Do edit in more detail if you can. That being said, here's a first attempt at narrowing things down for you somewhat. I hate to do this to you, but I'm still going to end up tossing jargon at you that you might have to read into a bit to understand. (Hovering your cursor over our tags may be enough, hopefully!)
With data that large that's all aimed at predicting one thing, overfitting may pose one of the bigger problems for your predictive model, especially given high multicollinearity among some of your predictors. Using principal-components regression (PCR) should be a good way to handle multicollinearity, assuming you or your software exclude(s) the principal components with trivially small eigenvalues relative to the total sum of eigenvalues. "What's trivial?" may be a difficult question, but if you're lucky, you'll find natural gaps. Rank-order your principal components by eigenvalue, and look for sharp drops in the size of each eigenvalue compared to the next smallest. In a relatively simple scenario, you'd want to use all the principal components before the last big drop in eigenvalue. You'll probably have a lot of relatively high eigenvalues that drop off gradually, but if you're lucky, the scree plot will show a few large eigenvalues followed by a sharp drop,
which (say, three large eigenvalues before the drop) is a relatively clear case for retaining three factors (bifactor analysis aside). Note @Scortchi's comment on this answer though: you'll want to be careful about throwing out PCs that are doing some real predictive work for your model, even if they have really small eigenvalues.
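A minimal sketch of that eigenvalue inspection and the subsequent PCR step, assuming scikit-learn (the retained count `k` is a placeholder you would read off your own scree plot):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Placeholder data: X (n, 31) standardized predictors, y (n,) continuous DV
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 31))
y = rng.normal(size=1000)

pca = PCA().fit(X)
eigvals = pca.explained_variance_               # eigenvalues, largest first
print(np.round(eigvals, 3))
print(np.round(eigvals[:-1] / eigvals[1:], 2))  # ratios: look for a sharp drop

k = 3                                           # placeholder: set from the elbow you see
scores = pca.transform(X)[:, :k]                # retained principal components
pcr = LinearRegression().fit(scores, y)         # regression on the retained PCs
```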
Continuous data matter in fitting a linear model, because binary data generally require logistic regression. If, as it sounds, your DV is continuous, and your ellipses aren't curvilinear, discontinuous, pear-shaped, etc., but just smoothly elliptical like variably elongated footballs (in the American sense), you're probably right to go with a linear model, though I don't know that you can really just eyeball this sort of thing. Run some basic regression diagnostics if you know or can figure out how. If any of your predictors are categorical, PCR probably won't know this, so you'll effectively be using them as approximations of continuous dimensions, which may not be safe, especially if they are nominal, or there are fewer than five (approximate rule of thumb) ordinal categories, or you don't actually have any reason to expect that a normal distribution underlies your system of ordinal categories.
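If you need a starting point for those diagnostics, here is a minimal sketch using statsmodels; the data are synthetic placeholders, and these two tests (Breusch-Pagan for heteroscedasticity, Jarque-Bera for non-normal residuals) are just common examples, not an exhaustive battery:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import jarque_bera

# Placeholder data: X (n, p) predictors, y (n,) continuous DV
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(size=500)

exog = sm.add_constant(X)
fit = sm.OLS(y, exog).fit()

# Heteroscedasticity: a small p-value suggests non-constant error variance
print("Breusch-Pagan p-value:", het_breuschpagan(fit.resid, exog)[1])
# Residual normality: a small p-value suggests non-normal residuals
print("Jarque-Bera p-value:", jarque_bera(fit.resid)[1])
```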
You may want to throw out the relatively useless predictors, which are conventionally identified by $t$-testing the regression coefficients. If the coefficients don't differ significantly from zero, they may be adding more error than information about the DV to your model's predictions. Lots of better ideas on how exactly to test which predictors to retain when you've got so many can be found in the discussion, Is adjusting p-values in a multiple regression for multiple comparisons a good idea? @whuber's suggestion to hold out some data for model validation is particularly straightforward and convincing in ways I think you'll find appealing.
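Both the coefficient $t$-tests and @whuber's hold-out idea are easy to sketch, assuming statsmodels and scikit-learn (the data are placeholders):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Placeholder data: X (n, 31) predictors, y (n,) DV
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 31))
y = rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
fit = sm.OLS(y_tr, sm.add_constant(X_tr)).fit()

# t-test p-values: coefficients not significantly different from zero are
# candidates for removal (subject to the multiple-comparison caveats above)
print(fit.pvalues)

# Hold-out validation: error on data the model never saw
pred = fit.predict(sm.add_constant(X_te))
print("hold-out RMSE:", np.sqrt(np.mean((y_te - pred) ** 2)))
```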
If you were to care enough about what your model looks like, particularly in terms of how those principal components of your predictors organize themselves, you might consider piecing together a structural equation model (SEM) of your own design. If you could model the latent factor structure of your set of predictors manually and accurately before using the latent factors to predict your DV, you could remove measurement error from the factors in advance of doing the predictive modeling with them, and probably gain a better understanding of your model in the process. This could also let you identify mediation among your model's predictors, depending on how you organize it. I don't suppose you'd be inclined to care about that if you're only interested in prediction at the moment (and I don't mean to assume that you're wrong not to care), but if you ever find you need to explain how you're getting those predictions and why you think they're valid, you might have to revisit a lot of this when/if you do start caring. Therefore a little preemptive caring might be advisable, even if there's really no immediate reason! Then again, maybe you'll have more time later and be better able to afford starting over then, if necessary. Your call, your risk. Happy modeling, and may the trashy fiction be ever the result of someone else's work!