Solved – How to calculate median absolute deviation in a cross validation scenario

Tags: cross-validation, median

In a leave-one-out cross-validation scenario I want to measure how well the estimated continuous variable fits the observed variable. I learned from Wikipedia that the median absolute deviation (MAD) could be used.

My question: How is the MAD to be calculated in this scenario? I have two ideas. The first is inspired from the definition of MAD where the center is the median of all deviations:

  1. Set the deviation $D_i = E_i - O_i$ for each corresponding estimated and observed outcome
  2. Calculate the median $M = \operatorname{median}_i(D_i)$
  3. Set $\mathrm{MAD} = \operatorname{median}_i(|D_i - M|)$
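A minimal sketch of this first computation, using NumPy and hypothetical arrays of estimated and observed outcomes (the values are invented for illustration):

```python
import numpy as np

# Hypothetical leave-one-out results: estimated vs. observed outcomes
estimated = np.array([2.1, 3.4, 1.8, 4.0, 2.9])
observed = np.array([2.0, 3.0, 2.0, 4.5, 3.0])

deviations = estimated - observed             # D_i = E_i - O_i
center = np.median(deviations)                # M = median of the deviations
mad = np.median(np.abs(deviations - center))  # MAD around that median
print(mad)
```

Note that the final value measures spread around the median deviation, not the size of the deviations themselves.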

The second one seems more appropriate to me in the context of cross-validation:

  1. Set the deviation $D_i = E_i - O_i$ for each corresponding estimated and observed outcome
  2. Set $\mathrm{MAD} = \operatorname{median}_i(|D_i|)$.

The latter is literally the median of the absolute deviations between the estimated and observed values. However, it does not seem to conform to the definition given by Wikipedia.
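The second computation can be sketched in the same way, again with hypothetical estimated/observed values:

```python
import numpy as np

# Hypothetical leave-one-out results: estimated vs. observed outcomes
estimated = np.array([2.1, 3.4, 1.8, 4.0, 2.9])
observed = np.array([2.0, 3.0, 2.0, 4.5, 3.0])

abs_dev = np.abs(estimated - observed)  # |D_i| = |E_i - O_i|
medae = np.median(abs_dev)              # median of the absolute errors
print(medae)
```

This is sometimes called the median absolute error, which avoids the naming collision with MAD as defined for a univariate sample.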

What is the best solution in my scenario?

BTW: In the Wikipedia article about the mean absolute error I found an interesting comment on this issue:

The mean absolute error is a common measure of forecast error in time series analysis, where the term "mean absolute deviation" is sometimes used in confusion with the more standard definition of mean absolute deviation. The same confusion exists more generally.

Best Answer

Wikipedia (emphasis mine):

For a univariate data set X1, X2, ..., Xn, the MAD is defined as the median of the absolute deviations from the data's median

Since you have observed values in your scenario, you are likely looking for something along the lines of Mean Absolute Error or Root Mean Square Error.

EDIT: (additional clarification from comments below) For a univariate data set, MAD is a measure of variability (i.e., the "average" deviation from the "average" value, using the median as the averaging function).

Computed the first way, it tells you the variability among your errors, which isn't necessarily meaningful as a measure of how well your model fits.

In the latter way, you are computing something very close to MAE, but using median instead of the mean. This is a valid way to determine your goodness of fit, but I don't know why you wouldn't use the more common measures of MAE or RMSE.
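To make the comparison concrete, here is a sketch computing all three measures on the same hypothetical residuals (the error values are invented for illustration):

```python
import numpy as np

# Hypothetical residuals D_i = E_i - O_i from a leave-one-out run
errors = np.array([0.1, 0.4, -0.2, -0.5, -0.1])

mae = np.mean(np.abs(errors))        # mean absolute error
rmse = np.sqrt(np.mean(errors**2))   # root mean square error
medae = np.median(np.abs(errors))    # median absolute error (the second method)
print(mae, rmse, medae)
```

The median-based variant is more robust to a few large errors, while RMSE penalizes them most heavily; which behavior you want depends on how costly outlying errors are in your application.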