Solved – Comparison of 3 different data sets

dataset, normalization, rms

I have two time-varying datasets that are on different scales (one typically of order of magnitude 1e-3 and the other of order of magnitude 100). I've been working on a process that generates a 3rd dataset. The goal is that the "shape" of this 3rd dataset should resemble that of the first as closely as possible. Unfortunately, the 3rd dataset is again on a different scale, usually within an order of magnitude of the second, and its range can vary quite significantly depending on the parameter values that I use. As I have ~50 sources of datasets and several parameters that I can vary, it's not feasible to compare them all visually, and visual comparison is rather subjective anyway. I'm not trying to do data fitting such as linear regression. For my tests I have access to examples of the first dataset, but in the real world I won't have these, so I'm trying to make "blind" predictions based on various parameters and the 2nd dataset.

I've seen that when comparing two datasets on different scales, the usual approach is to calculate the RMS difference and then divide it by either the reference mean or the reference range.
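For concreteness, the two normalisations mentioned above can be sketched as follows. This is only an illustration: the function names `rmse` and `nrmse` and the sample values are made up here, and the sketch assumes both series have the same length and are sampled at the same times.

```python
import math

def rmse(reference, candidate):
    """Root-mean-square difference between two equal-length series."""
    if len(reference) != len(candidate):
        raise ValueError("series must have the same length")
    squared = [(r - c) ** 2 for r, c in zip(reference, candidate)]
    return math.sqrt(sum(squared) / len(squared))

def nrmse(reference, candidate, mode="range"):
    """RMS difference divided by the reference mean or the reference range."""
    if mode == "mean":
        scale = sum(reference) / len(reference)
    elif mode == "range":
        scale = max(reference) - min(reference)
    else:
        raise ValueError("mode must be 'mean' or 'range'")
    if scale == 0:
        raise ValueError("reference scale is zero; cannot normalise")
    return rmse(reference, candidate) / scale

# Hypothetical sample values, chosen only to exercise the functions.
reference = [1.0, 2.0, 3.0, 4.0]
candidate = [2.0, 3.0, 4.0, 5.0]
print(nrmse(reference, candidate, mode="mean"))
print(nrmse(reference, candidate, mode="range"))
```

Dividing by the mean becomes unstable when the reference mean is close to zero, whereas dividing by the range is sensitive to outliers, so the two variants can rank the same candidates differently.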

I tried dividing each dataset by its mean, and then calculating the RMS differences for the pairs data1 vs. data2 and data1 vs. derived. This seems to give figures that are comparable. However, I now suspect that this approach is fundamentally flawed. When I divide by the mean, the standard deviation is also divided by the mean. Consequently, if my process increases the mean, it will decrease the normalised standard deviation without necessarily improving the match of the shapes.
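The concern above can be illustrated with a small sketch. It compares mean-normalisation against z-scoring (subtract the mean, divide by the standard deviation), which is a different technique from the one I actually used; all function names and data values are made up for the illustration, and the series are assumed to be equal-length and aligned in time.

```python
import math

def mean_normalize(values):
    """Divide every sample by the series mean (the approach described above)."""
    m = sum(values) / len(values)
    return [v / m for v in values]

def zscore(values):
    """Subtract the mean, then divide by the population standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    if sd == 0:
        raise ValueError("a constant series has no shape to compare")
    return [(v - m) / sd for v in values]

def rms_difference(a, b):
    """Root-mean-square difference between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Made-up data: `derived` has exactly the same shape as `data1`,
# but a different scale and an added offset.
shape = [1.0, 2.0, 5.0, 3.0, 4.0]
data1 = [1e-3 * s for s in shape]
derived = [100.0 * s + 500.0 for s in shape]

by_mean = rms_difference(mean_normalize(data1), mean_normalize(derived))
by_zscore = rms_difference(zscore(data1), zscore(derived))
print(f"mean-normalised RMS difference: {by_mean:.4f}")
print(f"z-scored RMS difference:        {by_zscore:.4f}")
```

In this example the mean-normalised figure is clearly non-zero even though the shapes are identical, because the offset raises the mean of `derived` and shrinks its normalised spread. The z-scored figure is zero up to rounding. For z-scored series the squared RMS difference equals 2(1 - r), where r is the Pearson correlation, so this measure carries the same information as the correlation coefficient.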

I've tried Google, and also looked at things like standardisation.

Does anyone have any suggestions how I can fairly compare the 3 datasets?

Best Answer

I'm not sure if my example is exactly relevant, but I had a similar requirement: checking whether two waveforms are similar. A useful technique was diffcumspec, the difference between cumulative spectra.

Standardize both sets of data, calculate the cumulative distribution of each, and then take the difference between the two cumulative curves to arrive at a measure of how different they are. Here's a reference link
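A minimal sketch of one possible reading of these steps: z-score each series, build its empirical cumulative distribution, and report the largest and the average gap between the two curves. This is an assumption about what is meant, not the referenced diffcumspec implementation. The function names are made up, and note that a cumulative distribution of the values ignores the order of the samples in time.

```python
import math
from bisect import bisect_right

def zscore(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    if sd == 0:
        raise ValueError("a constant series cannot be standardized")
    # Round so that values equal up to floating-point noise compare as equal.
    return [round((v - m) / sd, 9) for v in values]

def ecdf(sorted_values, x):
    """Fraction of samples less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def cumulative_difference(a, b):
    """Return (max gap, mean gap) between the two standardized empirical CDFs."""
    za, zb = sorted(zscore(a)), sorted(zscore(b))
    grid = sorted(set(za + zb))  # evaluate both curves at every observed value
    gaps = [abs(ecdf(za, x) - ecdf(zb, x)) for x in grid]
    return max(gaps), sum(gaps) / len(gaps)

# Hypothetical data: `scaled` is `base` rescaled and shifted, `spiky` is not.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = [4.0 * v + 7.0 for v in base]
spiky = [1.0, 1.0, 1.0, 1.0, 10.0]

print(cumulative_difference(base, scaled))
print(cumulative_difference(base, spiky))
```

The maximum gap is the same quantity as the two-sample Kolmogorov-Smirnov statistic applied to the standardized values. If the time ordering matters, a difference between cumulative power spectra, as the name diffcumspec suggests, may be the more appropriate reading.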