Solved – Understanding Gaussian Basis function parameters to be used in linear regression

basis function, machine learning, regression

I'd like to apply the Gaussian basis function in a linear regression implementation. Unfortunately I'm having a hard time understanding a couple of parameters in the basis function, specifically $\mu$ and $\sigma$.

My dataset is a 10,000 x 31 matrix: 10,000 samples and 31 features. I've read that "Each basis function converts input vector x into a scalar value". So I assume x is 1 sample, i.e. a 1 x 31 vector. From here I'm confused. What exactly is the $\mu_j$ parameter? I've read that it governs the locations of the basis functions. So is this not the mean of something? I'm also thrown off by the subscript $j$ (on $\mu$ and $\phi$), which makes me think of the $j$th row. But that doesn't seem to make sense. Is $\mu_j$ a vector? Now for the $\sigma$ that "governs the spatial scale": what exactly is that? I've seen some implementations that try values such as .1, .5, 2.5 for this parameter. How are these values computed? I've been doing research and looking for examples to learn from, but so far I haven't been able to find any. Any help or direction is greatly appreciated! Thank you.

Best Answer

As you are confused, let me start by stating the problem and taking your questions one by one. You have a sample size of 10,000 and each sample is described by a feature vector $x\in\mathbb{R}^{31}$. If you want to perform regression using Gaussian radial basis functions, then you are looking for a function of the form $$f(x) = \sum_{j=1}^{m} w_j \, g_j(x; \mu_j, \sigma_j)$$ where the $g_j$ are your basis functions. Specifically, you need to find the $m$ weights $w_j$ so that, for given parameters $\mu_j$ and $\sigma_j$, you minimise the error between $y$ and the corresponding prediction $\hat{y} = f(x)$ - typically you will minimise the least squares error.
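As a sketch of the model above (assuming NumPy; the function names and the toy numbers are my own, not from any particular library):

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    """Evaluate one Gaussian radial basis function g_j at x; returns a scalar."""
    return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

def f(x, weights, mus, sigmas):
    """Prediction f(x) = sum_j w_j * g_j(x; mu_j, sigma_j)."""
    return sum(w * gaussian_basis(x, mu, s)
               for w, mu, s in zip(weights, mus, sigmas))

# Example: m = 3 basis functions in a 31-dimensional feature space.
rng = np.random.default_rng(0)
x = rng.normal(size=31)            # one sample, a 1 x 31 vector
mus = rng.normal(size=(3, 31))     # one centre mu_j per basis function
sigmas = np.array([0.5, 1.0, 2.0]) # one scale sigma_j per basis function
weights = np.array([1.0, -0.5, 2.0])
y_hat = f(x, weights, mus, sigmas) # a single scalar prediction
```

Note that each `gaussian_basis` call collapses the whole 31-dimensional vector into one number, which is exactly the "converts input vector x into a scalar value" behaviour you quoted.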

What exactly is the Mu subscript j parameter?

You need to find $m$ basis functions $g_j$. (You still need to determine the number $m$.) Each basis function will have a $\mu_j$ and a $\sigma_j$ (also unknown). The subscript $j$ ranges from $1$ to $m$.

Is the $\mu_j$ a vector?

Yes, it is a point in $\mathbb{R}^{31}$. In other words, it is a point somewhere in your feature space, and a $\mu_j$ must be determined for each of the $m$ basis functions.

I've read that this governs the locations of the basis functions. So is this not the mean of something?

The $j^{th}$ basis function is centred at $\mu_j$. You will need to decide where these locations are. So no, it is not necessarily the mean of anything (but see further down for ways to determine it).

Now for the sigma that "governs the spatial scale". What exactly is that?

$\sigma$ is easier to understand if we turn to the basis functions themselves.

It helps to think of the Gaussian radial basis functions in lower dimensions, say $\mathbb{R}^{1}$ or $\mathbb{R}^{2}$. In $\mathbb{R}^{1}$ the Gaussian radial basis function is just the well-known bell curve. The bell can of course be narrow or wide. The width is determined by $\sigma$: the larger $\sigma$ is, the wider the bell shape. In other words, $\sigma$ scales the width of the bell. A small $\sigma$ gives a narrow, sharply peaked bell; a large $\sigma$ gives a wide, flat one.

You may ask what the purpose of this is. If you think of the bell covering some portion of space (a stretch of the line in $\mathbb{R}^{1}$), a narrow bell will only cover a small part of the line*. Points $x$ close to the centre of the bell will have a larger $g_j(x)$ value; points far from the centre will have a smaller $g_j(x)$ value. Shrinking $\sigma$ effectively pushes points further from the centre: as the bell narrows, a point at a fixed distance sits further out in the tails, reducing the value of $g_j(x)$.
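A quick numeric illustration of this (NumPy; the distance and the $\sigma$ values, taken from those you mentioned, are just for illustration) - the basis function value at a fixed distance from the centre as $\sigma$ grows:

```python
import numpy as np

d = 1.0  # fixed distance ||x - mu|| from the centre
for sigma in [0.1, 0.5, 2.5]:
    g = np.exp(-d**2 / (2.0 * sigma**2))
    # small sigma -> narrow bell -> tiny value at distance d;
    # large sigma -> wide bell -> value close to 1
    print(f"sigma = {sigma}: g_j = {g:.6f}")
```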

Each basis function converts input vector x into a scalar value

Yes, you are evaluating the basis functions at some point $\mathbf{x}\in\mathbb{R}^{31}$.

$$\exp\left(-\frac{\|\mathbf{x}-\mu_j\|_2^2}{2\sigma_j^2}\right)$$

You get a scalar as a result. The result depends on the distance of the point $\mathbf{x}$ from the centre $\mu_j$, given by $\|\mathbf{x}-\mu_j\|_2$, and on the scale parameter $\sigma_j$.

I've seen some implementations that try such values as .1, .5, 2.5 for this parameter. How are these values computed?

This of course is one of the interesting and difficult aspects of using Gaussian radial basis functions. If you search the web you will find many suggestions as to how these parameters are determined. I will outline, in very simple terms, one possibility based on clustering. You can find this and several other suggestions online.

Start by clustering your 10,000 samples (you could first use PCA to reduce the dimensionality, followed by k-means clustering). Let $m$ be the number of clusters you find (typically using cross-validation to determine the best $m$). Now create a radial basis function $g_j$ for each cluster. For each radial basis function, let $\mu_j$ be the centre (e.g. the mean or centroid) of the cluster, and let $\sigma_j$ reflect the width of the cluster (e.g. its radius). Now go ahead and perform your regression. (This simple description is just an overview; it needs lots of work at each step!)
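The steps above might look like the following sketch (NumPy only; to keep it self-contained I use a toy 1-D dataset and pick centres from quantiles of the data as a cheap stand-in for PCA + k-means centroids, and the centre-spacing heuristic for $\sigma$ is one common choice among many):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 1-D regression problem (stand-in for the 10,000 x 31 dataset).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Step 1: choose m centres mu_j. Quantiles of the data serve here as a
# cheap stand-in for cluster centroids found by k-means.
m = 10
mus = np.quantile(X, np.linspace(0.05, 0.95, m), axis=0)  # shape (m, 1)

# Step 2: choose sigma_j to reflect the spread of each "cluster"; here,
# the average spacing between neighbouring centres.
sigmas = np.full(m, np.mean(np.diff(mus[:, 0])))

# Step 3: build the design matrix Phi with Phi[i, j] = g_j(x_i).
d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)  # squared distances
Phi = np.exp(-d2 / (2.0 * sigmas**2))

# Step 4: least-squares fit of the weights w, then predict.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
```

With real data you would replace steps 1-2 by an actual clustering of the samples and cross-validate over $m$ (and possibly the $\sigma_j$), as described above.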

*Of course, the bell curve is defined from $-\infty$ to $\infty$, so it will have a value everywhere on the line. However, the values far from the centre are negligible.