The "uniform" in uniform prior doesn't just mean that hits and misses are equally likely. It means that you assume that you have a probability measure on the rates $[0,1]$ and this measure is the uniform measure. For example, it means the chance the true rate is between $0.9$ and $1.0$ is $0.1$. There are other measures on $[0,1]$ which have mean value $1/2$.
For example, if you start with the uniform prior and then observe $1$ hit and $1$ miss, the updated distribution is more concentrated around a rate of $1/2$ than the uniform prior. Instead of a probability of $\Delta r $ for the rate to be between $r$ and $r+\Delta r$, you would estimate the probability to be about $6 r (1-r) \Delta r$. (The factor of $6$ makes the total measure $1$.) This new distribution would also predict that the probability of getting a hit next time is $1/2$. If you observe an additional $h$ hits and $m$ misses, the expected probability of a hit is
$$\frac{h+2}{h+m+4}.$$
This is closer to $1/2$ than $\frac{h+1}{h+m+2}$. This agrees with the fact that you started with a distribution which was more concentrated near $1/2$. Of course, if you include that first hit and miss then there have been $h+1$ hits and $m+1$ misses in total.
The distribution you get from the uniform prior by observing some number of hits and misses is called a beta distribution. The uniform distribution on $[0,1]$ is $\text{Beta}(1,1)$. Beta distributions have the nice property that if you update a beta distribution with an observation, the result is still in that family. From one observation, the $\text{Beta}(a,b)$ distribution is updated either to $\text{Beta}(a+1,b)$ or $\text{Beta}(a,b+1)$. The mean of a $\text{Beta}(a,b)$ distribution is $\frac{a}{a+b}$. It is perfectly reasonable to have a prior distribution which is not a beta distribution, but the answers might not come out as nicely.
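To make the bookkeeping concrete, here is a minimal Python sketch of that conjugate updating (the observation sequence is made up purely for illustration):

```python
from scipy.stats import beta

# Start from the uniform prior Beta(1, 1) and update one observation at a time.
a, b = 1.0, 1.0                   # Beta(a, b); (1, 1) is the uniform distribution
observations = [1, 0, 1, 1, 0]    # 1 = hit, 0 = miss (made-up data)

for x in observations:
    if x == 1:
        a += 1                    # a hit updates Beta(a, b) to Beta(a + 1, b)
    else:
        b += 1                    # a miss updates Beta(a, b) to Beta(a, b + 1)

# The posterior mean a / (a + b) is the predicted probability of a hit next time.
print("posterior mean:", a / (a + b))
# The posterior probability that the true rate lies between 0.9 and 1.0.
print("P(0.9 < rate < 1.0):", beta.cdf(1.0, a, b) - beta.cdf(0.9, a, b))
```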
If I'm very confident that the hit rate is 50/50, I could add 2 instead of 1 to the hits and misses. Or if I'm less confident, I could add 1/2.
Adding larger or smaller pseudo-counts might or might not correspond to confidence. What it really encodes is whether you think the rate is likely to be close to $1/2$. In some situations, you may believe the rate should be close to $0$ or close to $1$, and the first trial should be very informative. You might believe that half of your students know how to solve a type of problem, and that if you give a random student $5$ of these problems, the student is very likely to solve all $5$ or fail to solve any. This might be approximated by a $\text{Beta}(\epsilon,\epsilon)$ distribution so that after $h$ correct and $m$ incorrect, the average value is
$$\frac{h+\epsilon}{h+m+2\epsilon},$$
which is close to $0$ or $1$ after one observation.
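To see how fast that happens, take (as a made-up illustration) $\epsilon = 0.01$: after a single correct answer ($h = 1$, $m = 0$) the average value is
$$\frac{1 + 0.01}{1 + 0 + 0.02} \approx 0.99,$$
so one observation already moves the estimate almost all the way to $1$.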
Overfitting comes from allowing too large a class of models. This gets a bit tricky with models with continuous parameters (like splines and polynomials), but if you discretize the parameters into some number of distinct values, you'll see that increasing the number of knots/coefficients will increase the number of available models exponentially. For every dataset there is a spline and a polynomial that fits precisely, so long as you allow enough coefficients/knots. It may be that a spline with three knots overfits more than a polynomial with three coefficients, but that's hardly a fair comparison.
If you have a low number of parameters and a large dataset, you can be reasonably sure you're not overfitting. If you want to try higher numbers of parameters, you can use cross-validation (within your training data) to find the best number, or you can use a criterion like Minimum Description Length (MDL).
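As a rough sketch of the cross-validation route (the scikit-learn pipeline and the made-up data below are my own illustration, not part of the original answer):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Made-up 1D regression data for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 50)

# Pipeline: polynomial features of a given degree, then ordinary least squares.
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# Cross-validate over the polynomial degree to pick the number of coefficients.
search = GridSearchCV(model, {"polynomialfeatures__degree": range(1, 10)}, cv=5)
search.fit(x, y)
print("best degree:", search.best_params_["polynomialfeatures__degree"])
```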
EDIT: As requested in the comments, an example of how one would apply MDL. First you have to deal with the fact that your data is continuous, so it can't be represented in a finite code. For the sake of simplicity we'll segment the data space into boxes of side $\epsilon$ and instead of describing the data points, we'll describe the boxes that the data falls into. This means we lose some accuracy, but we can make $\epsilon$ arbitrarily small, so it doesn't matter much.
Now the task is to describe the dataset as succinctly as possible with the help of some polynomial. First we describe the polynomial. If it's an $n$-th order polynomial, we just need to store its $n+1$ coefficients, again discretized. We store the value $n$ in a prefix-free code (so we know when to stop reading) and then the $n+1$ coefficient values. With this information a receiver of our code can reconstruct the polynomial. Then we add the rest of the information required to store the dataset. For each data point we give the $x$-value, and then how many boxes up or down the point lies from the polynomial. Both values are stored in prefix-free codes, so that small values need few bits and we don't need delimiters between points. (You can shorten the code for the $x$-values by storing only the increments between consecutive values.)
The fundamental point here is the tradeoff. If I choose a zeroth-order polynomial (like $f(x) = 3.4$), the model is very cheap to store, but for the $y$-values I'm essentially storing each point's distance from the mean. More coefficients give me a better-fitting polynomial (and thus shorter codes for the $y$-values), but I have to spend more bits describing the model. The model that gives you the shortest total code for your data is the best fit by the MDL criterion.
(Note that this is known as 'crude MDL', and there are some refinements you can make to solve various technical issues).
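A rough sketch of how one could score polynomial degrees this way is below; the made-up data, the grid sizes, and the simple $2\log_2(k)$-style integer code are my own simplifications of the scheme described above, not a carefully designed prefix-free code.

```python
import numpy as np

def description_length(x, y, degree, eps=0.01, coef_eps=0.01):
    """Crude MDL score: bits to describe the polynomial plus bits for the residuals."""
    def int_bits(k):
        # Rough stand-in for a prefix-free code length of an integer k.
        return 2 * np.log2(abs(k) + 1) + 1

    coeffs = np.polyfit(x, y, degree)
    # Model cost: the degree plus each coefficient discretized to a grid of size coef_eps.
    model_bits = int_bits(degree) + sum(int_bits(round(c / coef_eps)) for c in coeffs)
    # Data cost: how many eps-boxes each point lies above/below the fitted polynomial.
    # (The x-values cost the same for every degree, so they are omitted from the comparison.)
    residuals = y - np.polyval(coeffs, x)
    data_bits = sum(int_bits(round(r / eps)) for r in residuals)
    return model_bits + data_bits

# Made-up data: pick the degree that gives the shortest total code.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.05, 40)
scores = {d: description_length(x, y, d) for d in range(0, 8)}
print("best degree by crude MDL:", min(scores, key=scores.get))
```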
Best Answer
Thanks to the hints in the comments, I managed to find a suitable solution to my problem. More precisely, I found two that overcome the issue with the splines. But neither of my solutions uses splines anymore.
I think it's worth noting that @Gavin's suggestion would probably also work, but I didn't try it since I'm quite happy with what I have so far.
Gaussian Filter
I applied a 3D Gaussian filter to the data and the results look reasonable, as shown in the figure below.
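For reference, a minimal sketch of that kind of smoothing (the gridding of the points and sigma = 1.5 are made-up choices for illustration, not the exact settings behind the figure):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Made-up 3D point cloud, binned onto a regular grid to get a 3D array of counts.
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
counts, edges = np.histogramdd(points, bins=(30, 30, 30))

# Smooth the 3D grid; sigma is measured in grid cells and is the hand-picked parameter.
smoothed = gaussian_filter(counts, sigma=1.5)
```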
From my point of view, the drawback of this method is that the choice of sigma (the standard deviation of the Gaussian kernel) is somewhat arbitrary, so I settled on the next method instead.
Kernel Density Estimation
Thanks to the hint from @Aksakal I looked into kernel density estimation (KDE), and it seems well suited to the problem at hand. The result is shown in the figure below.
What I like about this method is that the kernel bandwidth can be estimated from the data by cross-validation (in my opinion an advantage over simply picking a value that looks fine).
If anybody is interested in KDE in Python, an excellent comparison of the different implementations can be found here.
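For completeness, here is a minimal sketch of that cross-validated bandwidth selection with scikit-learn (the 3D points and the bandwidth grid are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Made-up 3D point cloud standing in for the real data.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))

# Pick the bandwidth by cross-validation (GridSearchCV maximizes held-out log-likelihood).
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": np.logspace(-1, 1, 20)}, cv=5)
search.fit(points)
kde = search.best_estimator_
print("best bandwidth:", kde.bandwidth)

# score_samples returns log-densities; exponentiate to get the estimated density.
density = np.exp(kde.score_samples(points[:10]))
```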