It is to be expected that "copied" blocks are nearly equal (even more so after the PCA step), so in the lexicographic sort they should appear adjacent or nearby. (Note: this lexicographic order compares the most principal component first, then the next, and so on.) The converse is not true: elements that are adjacent in the lexicographic sort are not necessarily copies, nor even similar.
Here is a very simple example I made up in Octave, with a one-dimensional signal (y)
of size N=200, which has a portion copied (here, from 20-50 to 150-180) and a little noise added. I take a small block size (b=3).
I convert to principal components, sort the rows in lexicographic order (appending the original block position as an extra column first), and compute the distance between adjacent rows. (Note that I'm simplifying a lot here: I'm not discarding components, nor quantizing them, and I'm only comparing adjacent rows, not a neighborhood band.) I then look at the histogram of those distances, and the original offset is clearly visible.
N=200;
b=3;
delay=130;
y = filter([1],[1,-0.8,0.1],rand(1,N)-0.5); % my signal, rather arbitrary
y(20+delay:50+delay) = y(20:50); % a portion is copied
y += (rand(1,N)-0.5)*0.1; % noise added
yy=[y(1:N-2);y(2:N-1);y(3:N)]; % Octave does not have corrmtx (this is hardcoded for b=3!)
[PC, Z, W, TSQ] = princomp (yy'); % PCA
Z(:,b+1)=[1:N-2]'; % append original block position, in an extra column
Z1=sortrows(Z); % sort rows lexicographically
Z2=abs(Z1(1:N-3,b+1)-Z1(2:N-2,b+1)); % compute temporal distances between adjacent rows
hist(Z2); % histogram: should show a peak at delay
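The same pipeline can be sketched in Python/NumPy in a way that is general in b (the yy construction above is hardcoded for b=3). This is only a sketch under the same simplifications; I deliberately use a much smaller noise amplitude than the ±0.05 above, so that copied blocks land exactly adjacent in the sort without needing quantization or a neighborhood search:

```python
import numpy as np

N, b, delay = 200, 3, 130
rng = np.random.default_rng(0)

# Arbitrary smooth signal: the same AR(2) filter of uniform noise as the Octave code
x = rng.uniform(-0.5, 0.5, N)
y = np.zeros(N)
for i in range(N):
    y[i] = x[i] + (0.8 * y[i-1] if i >= 1 else 0.0) - (0.1 * y[i-2] if i >= 2 else 0.0)

y[20 + delay : 51 + delay] = y[20:51]      # a portion is copied
y += rng.uniform(-0.001, 0.001, N)         # tiny noise (the Octave example uses +-0.05)

# All overlapping blocks of length b -- general in b, unlike the yy matrix above
blocks = np.stack([y[i : i + b] for i in range(N - b + 1)])

# PCA scores via SVD of the centered block matrix
centered = blocks - blocks.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T

# Lexicographic sort, most principal component first; 'order' holds the
# original block positions in sorted order
order = np.lexsort(scores.T[::-1])
offsets = np.abs(np.diff(order))           # temporal distance of adjacent rows
# a histogram of 'offsets' should show a peak at 'delay'
```

With the larger noise of the original example, the peak degrades, which is exactly why a real detector would quantize components and search a neighborhood band rather than only adjacent rows.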
Given a model
$$
Y = f(X) + \varepsilon
$$
The signal-to-noise ratio can be defined as (ref. ESL, ch. 10):
$$
\frac{\operatorname{Var}(f(X))}{\operatorname{Var}(\varepsilon)}
$$
To generate data with a specific signal to noise ratio:
signal_to_noise_ratio = 4
data = c(0.47, 0.45, 0.30, 1.15, 0.82, 0.38, 0.51, 1.36, 1.72, 0.36)
noise = rnorm(length(data)) # generate standard normal errors, one per observation
k <- sqrt(var(data)/(signal_to_noise_ratio*var(noise)))
data_wNoise = data + k*noise
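The same calibration can be written in NumPy; this is just a translation of the R snippet above. Because k is computed from the sample variances, the empirically achieved ratio is exact by construction:

```python
import numpy as np

rng = np.random.default_rng(42)
target_snr = 4.0
signal = np.array([0.47, 0.45, 0.30, 1.15, 0.82, 0.38, 0.51, 1.36, 1.72, 0.36])

noise = rng.standard_normal(signal.size)   # standard normal errors
# choose k so that Var(signal) / Var(k * noise) == target_snr
k = np.sqrt(signal.var(ddof=1) / (target_snr * noise.var(ddof=1)))
noisy = signal + k * noise

achieved = signal.var(ddof=1) / (k * noise).var(ddof=1)   # equals target_snr
```

Note this fixes the *sample* variance ratio; the population SNR of the generating process will differ slightly for small samples.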
Best Answer
This will be edited several times, so be patient while it is under construction.
Globally applicable background:
We live in a causal universe; that is the fundamental premise of physics. If you look at the noise engineering for things like gravitational-wave detectors, they can tell when trains pass by miles away due to vibrations in the ground, and they have to account for that in their filtering.
There are no perfect models, only some that are more useful than others. An infinitely perfect representation would have to account for the whole universe - it would comprise a duplicate of the universe. In it, there would be no noise.
Because we have imperfect information, we make a model that is "good enough" and put all other factors into bins that we call "noise". Noise, as its own entity, does not exist. It is a consequence of the weakness of our modeling.
When you say that "the last principal components accumulate noise", what are you really saying? For ease of nomenclature, let's assume a spatial distribution of data only (and not time). You are saying that the "noise" (whatever it is) lives in the components of smallest geometric variation, a.k.a. highest wave-number. This is a generalization, and while it can be useful it is hardly an axiom.
The generalization holds when the effort to characterize a feature is directly proportional to its scale. When some interior scale is harder to characterize, the generalization breaks down.
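As an illustration of that generalization (and only the generalization: this is synthetic data, not the asker's), here is a case where the signal truly lives in an r-dimensional subspace. The trailing principal-component variances then come out near the noise variance sigma^2, while the leading ones carry the signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 2000, 5, 2      # samples, ambient dimension, true signal rank
sigma = 0.1               # noise standard deviation

# "Signal": points confined to a random r-dimensional subspace of R^p
basis = np.linalg.qr(rng.standard_normal((p, r)))[0]   # orthonormal columns
data = rng.standard_normal((n, r)) @ basis.T + sigma * rng.standard_normal((n, p))

# Variances along the principal components, largest first
pc_variances = np.sort(np.linalg.eigvalsh(np.cov(data.T)))[::-1]
# leading r entries carry O(1) signal variance; trailing p-r entries sit
# near sigma**2 = 0.01 -- the "noise accumulates in the last components" picture
```

When the noise is not isotropic, or the signal itself has fine-scale structure, this clean separation disappears, which is the breakdown described above.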
About SNR:
Scholarpedia defines the signal-to-noise ratio (SNR) as the ratio of the power of the signal to the power of the noise, but then goes on to say that for images the peak signal-to-noise ratio (PSNR) is used, where the numerator is the square of the peak signal and the denominator is the variance of the noise:
$ PSNR = \frac {P_{peak}^{2}}{\sigma_{noise}^{2}}$
Now when I think about signal vs. noise in images, I think of the signal as an entire image composed of signal, and the noise as derived from an entire picture of "non-image". If you are looking in a photograph for a truck, then the truck is image, and everything but the truck is non-image. Your brain knows "truck" from non-truck (welcome to having an amazing computer between your ears), but most silicon-based computers don't know truck vs. non-truck; they can't determine important vs. non-important. That means that either you have an entire image of non-truck, or you parse the image up by regions to compute the PSNR.
Personally, I like to work on minimizing error using PSNR-type measures.
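For concreteness, here is a minimal PSNR helper in Python. Conventions vary; this sketch uses the mean squared error in the denominator (which equals the noise variance when the error is zero-mean) and reports decibels, as is common in image work:

```python
import numpy as np

def psnr_db(reference, estimate, peak=None):
    """Peak signal-to-noise ratio in decibels: 10 * log10(peak**2 / MSE)."""
    err = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(err ** 2)
    if peak is None:
        peak = np.abs(reference).max()   # or a fixed range, e.g. 255 for 8-bit images
    return 10.0 * np.log10(peak ** 2 / mse)

# example: constant image with a uniform error of 0.1, peak fixed at 1.0
value = psnr_db(np.ones((8, 8)), np.full((8, 8), 0.9), peak=1.0)   # 20.0 dB
```

Whether `peak` should be the observed maximum or the nominal range of the sensor is exactly the kind of convention choice that has to be stated alongside any reported PSNR.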
About this particular SNR
So you have a 5-dimensional space; you have determined the hyper-ellipsoid of your data, taken the largest semi-major axis as a presumed coordinate direction, set all the others to zero, and reconstructed your data. At that point you don't really have information about what is signal and what is noise.
If I, personally, were working on this I would iterate through the principal components, using first the first component, then the first two components, and so on: for each count n, reconstruct the data from the first n components and compute a reconstruction-error measure C.
At the end of the iteration I would graph C vs. n and find an appropriate value using "scree plot" logic. I find semi-log plots in the y-axis to be useful.
I have personally dealt with higher-dimensional data whose plot "has a flat spot" in the middle: at the ends the decrease in error with additional components is high, but in the middle it is very low. When I examine useful projections of the reconstructed data ($y'$) I can tell that at the beginning of the "plateau" the noise is starting to be included in the model. The steep decrease in error after the plateau means that the noise is by then well enough represented that adding principal components at that point is as informative for the model as the first few components were.
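A sketch of that iteration, on synthetic rank-r data since the real data isn't at hand: reconstruct from the first n components, record an error measure C(n), and look for the elbow or plateau:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, p, r = 500, 8, 3
# rank-r structure plus a little noise, standing in for real data
X = rng.standard_normal((n_samples, r)) @ rng.standard_normal((r, p)) \
    + 0.05 * rng.standard_normal((n_samples, p))

mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)

errors = []
for n in range(1, p + 1):
    Xn = mu + (U[:, :n] * S[:n]) @ Vt[:n]            # first n components only
    errors.append(np.sqrt(np.mean((X - Xn) ** 2)))    # error measure C(n)

# Plot n vs. errors on a semi-log y-axis and apply scree-plot logic:
# for this construction the error drops steeply up to n = r, then
# flattens near the noise floor
```

The RMS reconstruction error here is one reasonable choice of C; a PSNR-style measure, as discussed above, would serve equally well.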
Some actual work
Here is where I went to get multispectral images: (link).
Interesting next steps
(temporary break, to complete when I get more time)