When you did K-means, presumably you treated the *attributes* at each pixel as a $5$-tuple of real values and clustered them based on Euclidean distance in $\mathbb{R}^5$. To achieve *spatial contiguity* in the clustering, include spatial coordinates among the attributes. If you include (say) the two Cartesian map coordinates, you will effectively be doing the K-means clustering in $\mathbb{R}^7 \cong \mathbb{R}^5 \times \mathbb{R}^2$. I have written this as a Cartesian product to emphasize that there is a tuning parameter available to you: the scale of the last two (spatial) attributes relative to the first five. By rescaling the spatial attributes you can vary how much they influence the clustering. With only a little influence, the result will be noisy; with a lot of influence, you will be performing purely spatial clustering. Experiment to identify an optimal value.
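A minimal sketch of the idea, on hypothetical synthetic data (the band count, image size, `alpha` value, and the plain Lloyd's-algorithm K-means are my own choices for illustration):

```python
import numpy as np

# Hypothetical 5-band image: shape (rows, cols, 5)
rng = np.random.default_rng(0)
rows, cols, bands = 32, 32, 5
img = rng.normal(size=(rows, cols, bands))

# Build the R^7 feature matrix: five spectral values plus two map coordinates
yy, xx = np.mgrid[0:rows, 0:cols]
spectral = img.reshape(-1, bands)
spatial = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)

# Tuning parameter: rescale the spatial attributes relative to the spectral ones.
# Small alpha -> noisy, spectrally driven clusters; large alpha -> purely spatial tiles.
alpha = 0.05
features = np.hstack([spectral, alpha * spatial])

# Plain Lloyd's algorithm (numpy-only stand-in for any K-means implementation)
def kmeans(X, k, iters=50, seed=0):
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

label_img = kmeans(features, k=6).reshape(rows, cols)
```

Sweeping `alpha` is the experiment: at one extreme the spatial columns are ignored, at the other they dominate and the clusters become spatial tiles.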

(This is an approach I have used many times. It is not always as successful as I would like, but it works sufficiently often to be worth looking at in any case.)


**Globally applicable background:**

We live in a causal universe. That is the fundamental premise of physics. If you look at the noise engineering for things like gravitational wave detectors, they can tell when trains pass by miles away due to vibrations in the ground, and they have to account for that in their filtering.

There are no perfect models, only some that are more useful than others. An infinitely perfect representation would have to account for the whole universe - it would comprise a duplicate of the universe. In it, there would be no noise.

Because we have imperfect information, we make a model that is "good enough" and put all other factors into bins that we call "noise". Noise, as its own entity, does not exist. It is a consequence of the weakness of our modeling.

When you say that "the last principal components accumulate noise," what are you really saying? *For ease of nomenclature, let's assume a spatial distribution of data only (and not time).* You are saying that the "noise" (whatever it is) lives in the components of smallest geometric variation, aka highest wave-number. This is a generalization, and while it can be useful, it is hardly an axiom.

The generalization holds when the effort to characterize a feature is directly proportional to its scale. When some interior scale is harder to characterize, the generalization breaks down.

**About SNR:**

Scholarpedia defines the signal-to-noise ratio (SNR) as the ratio of the power of the signal to the power of the noise, but then goes on to say that for images the peak signal-to-noise ratio (PSNR) is used, where the numerator is the square of the peak signal and the denominator is the variance of the noise:

$PSNR = \frac{P_{peak}^{2}}{\sigma_{noise}^{2}}$

Now when I think about signal vs. noise in images, I think of signal as an entire image composed of signal, and noise as derived from an entire picture of "non-image." If you are looking in a photograph for a truck, then the truck is image, and everything but the truck is non-image. Your brain knows truck from non-truck (welcome to having an amazing computer between your ears), but most silicon-based computers don't: they can't determine important vs. non-important. That means that either you have an entire image of non-truck, or you parse the image up by regions to compute the PSNR.
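A small sketch of the region-wise idea, using the PSNR definition above (the block size and synthetic image are my own illustrative choices):

```python
import numpy as np

def psnr_db(reference, noisy):
    """PSNR in dB: squared peak signal over noise variance, per the definition above."""
    noise = noisy - reference
    peak = reference.max()
    return 10.0 * np.log10(peak ** 2 / noise.var())

# Synthetic stand-in: a "clean" image plus additive noise
rng = np.random.default_rng(1)
ref = rng.uniform(0, 255, size=(64, 64))
noisy = ref + rng.normal(0, 5, size=ref.shape)

# Since the computer can't decide truck vs. non-truck, score fixed regions instead:
block = 32
scores = [psnr_db(ref[i:i + block, j:j + block], noisy[i:i + block, j:j + block])
          for i in range(0, 64, block) for j in range(0, 64, block)]
```

Each entry of `scores` is the PSNR of one region; a region full of "non-truck" would simply be one of these blocks.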

Personally, I like to work with PSNR values computed on the error image.

**About this particular SNR**

So you have a 5-dimensional space, you have determined the hyper-ellipsoid of your data, and then you are taking the largest semi-major axis as a presumed coordinate system, setting all the others to zero, and reconstructing your data. At that point you don't really have information about what is signal and what is noise.

If I, personally, were working on this, I would iterate using first the first component, then the first two components, and so on through all of the principal components, doing the following:

- (use singular value decomposition to get the matrices $U$, $\Sigma$, $V$ of singular values and singular vectors)
- set all singular values above my current component count to zero, i.e. on the loop with the first principal component I would set the 2nd through last singular values to zero.
- construct my approximation of the data using the modified singular values: $y' = U \Sigma' V^{T}$
- compute the error image $err = y' - y$
- compute the measure of badness as $C(n) = E(err)^{2} - V(err)$, where $E$ takes the mean error, $V$ computes the variance of the error, and $n$ is the number of components used.

At the end of the iteration I would graph $C$ vs. $n$ and find an appropriate value using "scree plot" logic. I find semi-log plots in the y-axis to be useful.
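The component sweep above can be sketched as follows, on hypothetical data (the data shape is my own choice; $C(n)$ follows the definition as written):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(1000, 5))           # 1000 pixels, 5 attributes

# Singular value decomposition of the data
U, s, Vt = np.linalg.svd(y, full_matrices=False)

C = []
for n in range(1, len(s) + 1):
    s_trunc = np.where(np.arange(len(s)) < n, s, 0.0)  # zero out components above n
    y_approx = U @ np.diag(s_trunc) @ Vt               # y' = U Sigma' V^T
    err = y_approx - y                                 # error image
    C.append(err.mean() ** 2 - err.var())              # C(n) = E(err)^2 - V(err)

# Graph C vs. n with a semi-log y-axis and apply scree-plot logic, e.g.:
# import matplotlib.pyplot as plt; plt.semilogy(range(1, len(s) + 1), np.abs(C))
```

With all components kept the reconstruction is exact, so $C$ collapses to (numerically) zero at $n = 5$; the interesting behavior is where the curve bends before that.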

I have personally dealt with higher-dimensional data whose plot "has a flat spot" in the middle: at the ends the decrease in error with additional components is high, but in the middle it is very low. When I examine useful projections of the reconstructed ($y'$) data, I can tell that at the beginning of the "plateau" the noise is starting to be included in the models. The steep decrease in error after the plateau means that the noise is well enough represented that increasing the number of principal components at that point is as informative for the model as the first few components were.

**Some actual work**

Here is where I went to get multispectral images: (link).

**Interesting next steps**

Interesting next steps include:

- Test compressed sensing, i.e. radical subsampling of both $y$ and $y'$, to determine how few sampled pixels still inform the dimension reduction. If you can randomly sample 20 pixels instead of thousands or millions, then you can make the dimension reduction very fast. The threshold where the sub-sampling becomes useless communicates something about the geographic scale of the information.
- Explore the results of the above on the same image, where the regions for sampling are pre-segregated. Perhaps a compressed-sensing segregation operating on a reduced dimension of the information could be useful.
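A rough sketch of the first experiment, on synthetic low-rank data (the sample sizes, the planted two-component structure, and the principal-angle comparison are my own choices, not from any particular image):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, bands = 100_000, 5
# Low-rank "signal" plus weak noise, so there is structure to recover
y = rng.normal(size=(n_pixels, 2)) @ rng.normal(size=(2, bands)) \
    + 0.1 * rng.normal(size=(n_pixels, bands))

def top_subspace(data, k=2):
    """Orthonormal basis for the top-k principal subspace of the data."""
    _, _, Vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
    return Vt[:k].T

V_full = top_subspace(y)
agreement = {}
for m in (20, 200, 2000):
    sample = y[rng.choice(n_pixels, size=m, replace=False)]
    V_sub = top_subspace(sample)
    # Smallest cosine of the principal angles between the subspaces (1.0 = identical)
    agreement[m] = np.linalg.svd(V_full.T @ V_sub, compute_uv=False).min()
```

Watching where `agreement` degrades as the sample shrinks is one way to locate the threshold at which sub-sampling stops informing the dimension reduction.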


## Best Answer

You want to measure the signal-to-noise ratio on each image. This is akin to asking what the error of a single number is: you don't know. What's the error of five? That doesn't make any sense.

In this case it might be more interesting to find out the SNR of the process. Here's how you might go about that:

Start with a "perfect" image, that is, an image whose noise is so low that it can be considered negligible. Use some very high-quality standard or constructed image for this purpose, like Lenna. That's your signal.

Add some noise. Usually we use Gaussian white noise for this purpose. That's your noise.

Now the combined image (your "noisy image") has a signal-to-noise ratio with some meaning, because you can compare it to the perfect image, e.g. pixel by pixel.

After processing it with your adaptive median filter, your final image (your "processed image") also has a signal-to-noise ratio because, again, you can compare it to your perfect image in the same way. It is now meaningful to ask whether the SNR has gone up or down, and by how much.
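The whole procedure can be sketched end to end. The synthetic "perfect" image and the plain 3x3 median filter below are stand-ins of my own choosing (for the real thing you would use an actual reference image and your adaptive median filter):

```python
import numpy as np

def snr_db(reference, test):
    """SNR in dB, comparing against the known-clean reference pixel by pixel."""
    noise = test - reference
    return 10.0 * np.log10((reference ** 2).mean() / (noise ** 2).mean())

def median3(img):
    """3x3 median filter (numpy-only stand-in for an adaptive median filter)."""
    p = np.pad(img, 1, mode='edge')
    r, c = img.shape
    stack = [p[i:i + r, j:j + c] for i in (0, 1, 2) for j in (0, 1, 2)]
    return np.median(stack, axis=0)

# 1. A "perfect" reference image (smooth synthetic stand-in for a Lenna-class image)
x = np.linspace(0, 1, 128)
clean = np.outer(np.sin(4 * np.pi * x) + 2, np.cos(3 * np.pi * x) + 2)

# 2. Add Gaussian white noise -> the "noisy image"
rng = np.random.default_rng(0)
noisy = clean + rng.normal(0, 0.5, size=clean.shape)

# 3. Process it; both SNRs are meaningful against the same reference
snr_before = snr_db(clean, noisy)
snr_after = snr_db(clean, median3(noisy))
```

Comparing `snr_before` with `snr_after` answers the question of whether the filter helped and by how much, which is the SNR of the process rather than of any single image.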