Solved – Comparing two datasets

correlation, dataset

I'm not very familiar with statistics, so please bear with me. I have two datasets, each consisting of four attributes:

order, original, prediction, absolute_difference(original-prediction)
  1. order is just the numbers from 1 to n
  2. original is the real measured value
  3. prediction is the predicted value
  4. absolute_difference is the absolute difference between original and prediction.

If the prediction is perfect, then original and prediction should match (and the absolute difference should be 0). The data represent the amount of electrical power consumed by a corporation's office building as a function of time. I want to be able to tell which dataset contains the better prediction and somehow quantify this.

For such tasks correlation should be fine (in this case, the correlation between original and prediction). But I've found that correlation is not a good measure when the data follow anything other than a linear pattern, and that is my case: there are many peaks, repeating cycles, random events on particular days, etc. There are multiple methods for comparing datasets, but I'm not very familiar with them.

My intuitive approach was to calculate the mean of the absolute differences and use this value as the discriminator between datasets: the lower the mean, the better the prediction. Then I realized that the standard deviation can also be calculated from those absolute differences between original and predicted values. The next step would be to pick out all values whose absolute difference between original and predicted value is:

  1. above mean + standard deviation
  2. below mean – standard deviation

This should give me an overview of how many values fall inside this interval and how many don't. The dataset where more values fall inside the interval is the dataset with the better prediction.
Does this make sense?
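
To make the procedure above concrete, here is a minimal Python sketch of it. The function name `summarize_errors` and all the numbers are made up for illustration; they are not taken from my real data.

```python
from statistics import mean, stdev


def summarize_errors(original, prediction):
    """Summarize the absolute prediction errors of one dataset."""
    abs_diff = [abs(o - p) for o, p in zip(original, prediction)]
    mae = mean(abs_diff)   # mean of the absolute differences
    sd = stdev(abs_diff)   # sample standard deviation of the absolute differences
    lower, upper = mae - sd, mae + sd
    # Count the values that fall inside [mean - sd, mean + sd]
    inside = sum(lower <= d <= upper for d in abs_diff)
    return {"mae": mae, "sd": sd, "fraction_inside": inside / len(abs_diff)}


# Hypothetical measurements and two hypothetical predictions of them
original = [10.0, 12.0, 15.0, 11.0, 30.0, 12.0]
prediction_a = [11.0, 12.5, 14.0, 11.5, 22.0, 12.0]
prediction_b = [12.0, 9.0, 18.0, 13.0, 27.0, 14.0]

summary_a = summarize_errors(original, prediction_a)
summary_b = summarize_errors(original, prediction_b)
print("A:", summary_a)
print("B:", summary_b)
```

With these made-up numbers, prediction A has the lower mean absolute difference, while prediction B has more values inside its interval, so the two criteria can disagree. The interval describes how tightly the errors cluster around their own mean, not how small they are.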

PS:
Please also consider the following two cases. Is the method for evaluating the best prediction different, or is it the same for both?

  1. original is the same for both datasets but prediction might vary
  2. original (and logically prediction) vary between datasets

PPS:
A quote from this site says:

The standard deviation is used in conjunction with the mean to
summarise continuous data, not categorical data. In addition, the
standard deviation, like the mean, is normally only appropriate when
the continuous data is not significantly skewed or has outliers.

I'm not sure I understand this correctly: categorical data is something other than discrete data, but discrete data is not continuous either, I suppose. So I'm not sure whether I can use the standard deviation for my purposes.

PPPS:
There are also metrics such as string distance or Mahalanobis distance, which also compare the similarity between sets, but I guess that is not what I want. This leads me to assume that a particular method suits a particular kind of dataset. If so, is there a cheat sheet or rule of thumb that tells me which method is appropriate for a particular dataset?

Best Answer

As far as I understand, you need to determine which prediction method performed better?

If that's correct, then I think you need to do a paired t-test between the absolute_difference columns of the two methods. It might be better to do it on just the difference column.

T-tests are pretty standard, and you can read up on them with a simple Google search.
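
As a rough sketch of what the paired test computes, assuming both methods were evaluated on the same original values so that the errors pair up row by row: the helper name `paired_t_statistic` and the numbers below are hypothetical, and only the Python standard library is used, so this returns the t statistic and degrees of freedom but no p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two error columns."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    # Work on the per-row differences between the two methods' errors
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    # (undefined if all differences are identical, since stdev is then 0)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical absolute_difference columns for two prediction methods
abs_diff_a = [1.0, 0.5, 1.0, 0.5, 8.0, 0.0]
abs_diff_b = [2.0, 3.0, 3.0, 2.0, 3.0, 2.0]

t_stat, dof = paired_t_statistic(abs_diff_a, abs_diff_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A negative t here means method A's errors are smaller on average than method B's; whether that is significant depends on the p-value. If SciPy is available, `scipy.stats.ttest_rel(abs_diff_a, abs_diff_b)` computes the same statistic and also returns the p-value.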

In the case of two different input datasets, the comparison is not valid, as you're not evaluating on common ground.
