[Math] Theory question: How to use Mean Absolute Error properly in a log-scaled linear regression

mathematical modeling · regression · statistics · transformation

First of all, I had a look here and in a couple of other questions: I couldn't find what I am looking for.

So my question is purely theoretical (although I have an example by my hands).

Suppose I have some data $(x_i,y_i)$ for $i=1,..,n$.
Suppose I fit the following models with IID $\epsilon_i \sim N(0, \sigma^2)$ for $i=1,..,n$

  • $M_1: \log(y_i)= \beta_0+\beta_1x_i+\epsilon_i$
  • $M_2: \log(y_i)= \beta_0+\beta_1x_i+\beta_2x_i^2+\epsilon_i$
  • $M_3: \log(y_i)= \beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i$

Now I want to see which of these models is best, so I use the following (maybe weird, but stay with me) method to evaluate their "predictive powers":

  1. Use $(x_i, \log(y_i))$ for $i=1,..,\frac{n}{2}$, to fit $M_1, M_2, M_3$ respectively.
  2. Now use the fitted models ($M_1, M_2, M_3$ respectively) to predict the $y_i$'s using the $x_i$'s from the remaining $\frac{n}{2}$ data points, i.e. for $i = \frac{n}{2}+1, \dots, n$ (careful: predict $y_i$, not $\log(y_i)$).
  3. Use the Mean Absolute Error, $MAE = \frac{2}{n}\sum_{i=\frac{n}{2}+1}^{n}|y_i-\hat{y}_i|$, being careful that the $\hat{y}_i$ are on the original scale of values!
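The three steps above can be sketched in Python (a minimal illustration with synthetic data and model $M_1$ only; the data-generating parameters are my own assumptions, not from the question):

```python
# Sketch of the procedure: fit log(y) ~ x on the first half of the data,
# predict on the second half, back-transform, and compute MAE on the
# original scale. The true parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 5, n)
y = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.3, n))  # log(y) linear in x

# Step 1: fit log(y) on the first half.
half = n // 2
b1, b0 = np.polyfit(x[:half], np.log(y[:half]), deg=1)

# Step 2: predict log(y) on the second half, then exponentiate
# (naive back-transform to the original scale).
y_hat = np.exp(b0 + b1 * x[half:])

# Step 3: MAE on the original scale.
mae = np.mean(np.abs(y[half:] - y_hat))
print(mae)
```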

So now my question:

If I do point $1.$ and fit the three models (hence obtaining estimates for the parameters, their standard errors, etc.) and then use these parameters (respectively, of course!) to predict the responses of the other $x_i$'s:

  1. Will I be predicting $\log(y_i)$'s, right? And if this is true, is it also true that in order to get $\hat{y}_i$'s, instead of $\widehat{\log{(y)}}_i$'s, I should just take the exponential of those terms? So in general, is it true that $\hat{y}_i = e^{\widehat{\log{(y)}}_i}$?
  2. Once I find the three MAE's, how do I judge the models? Should I be looking for the one with smaller MAE?

EDIT

For example, suppose I have $1000$ data points. I use the first $500$ to fit model $M_1$. Once I've fitted it, I can predict new values. Hence I predict the responses for the remaining $500$ $x_i$'s. Of course, the prediction will be given on the logarithmic scale, but I want to calculate the MAE on the original scale.

This is the context of my question, of course I would do this procedure for all the three models and compare the MAEs.
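As a hedged sketch of this comparison, the three models can be fitted as polynomials of degree 1, 2, 3 on the log scale and their hold-out MAEs compared (synthetic data with an assumed quadratic truth, chosen only for illustration):

```python
# Sketch: fit M_1, M_2, M_3 as polynomials in x on the log scale,
# then compare hold-out MAEs on the original scale (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0, 5, n)
# Assumed quadratic truth on the log scale (illustrative only).
y = np.exp(0.5 + 0.8 * x - 0.1 * x**2 + rng.normal(0, 0.2, n))

half = n // 2
maes = {}
for deg in (1, 2, 3):  # M_1, M_2, M_3
    coefs = np.polyfit(x[:half], np.log(y[:half]), deg)
    y_hat = np.exp(np.polyval(coefs, x[half:]))  # back to original scale
    maes[deg] = np.mean(np.abs(y[half:] - y_hat))

print(maes)  # smaller is better under this criterion
```

With a quadratic truth, the degree-1 fit is systematically biased, so its MAE should come out clearly larger than the degree-2 fit's.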

Best Answer

IMO which model is better will depend on many factors.

These include:

  1. The amount of data used to fit each $M_k$
  2. The skewness / spread of the data for each $M_k$ - e.g. checked via box plots.
  3. Plots of the errors (observed vs. expected) for each $M_k$.

These should be done first, in my opinion, since their results indicate which assumptions can be used in each model.

Answering your questions:

Will I be predicting $\log(y_i)$'s, right?

Yes, with what you have written.

Is it also true that in order to get $\hat{y}_i$'s, instead of $\widehat{\log{(y)}}_i$'s, I should just take the exponential of those terms? So in general, is it true that $\hat{y}_i=e^{\widehat{\log{(y)}}_i}$?

Not quite: your first model $M_1$ is defined as

$$\log(y_i)=\beta_0+\beta_1x_i+\epsilon_i$$

so the fitted value on the log scale is $\widehat{\log{(y)}}_i=\hat{\beta}_0+\hat{\beta}_1x_i$, and exponentiating gives

$$e^{\widehat{\log{(y)}}_i}=e^{\hat{\beta}_0}e^{\hat{\beta}_1x_i}$$

Since $E[e^{\epsilon_i}]=e^{\sigma^2/2}>1$, this back-transform estimates the conditional median of $y_i$ rather than its mean; if you want the conditional mean, a common bias correction is to multiply by $e^{\hat{\sigma}^2/2}$.
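For completeness, this is the standard lognormal-mean argument behind the back-transform (my addition, using only the model's normality assumption):

```latex
% Under M_1, conditionally on x_i:
%   \log(y_i) \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2),
% so y_i \mid x_i is lognormal, with
\[
  E[y_i \mid x_i] = e^{\beta_0 + \beta_1 x_i + \sigma^2/2},
  \qquad
  \mathrm{median}(y_i \mid x_i) = e^{\beta_0 + \beta_1 x_i}.
\]
% Hence the naive back-transform targets the median; a bias-corrected
% prediction of the conditional mean is
\[
  \hat{y}_i = e^{\hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\sigma}^2/2}.
\]
```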

Once I find the three MAE's, how do I judge the models? Should I be looking for the one with smaller MAE?

Taking the one with the smallest MAE would make sense; however, I would take the one with the highest $R^2$.

Most importantly, to be able to use any of these models, the fitted coefficients need to be significant. This is typically measured via p-values: depending on the hypothesis being tested, a p-value less than e.g. $0.05$ can be taken to indicate significance.

A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
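A minimal sketch of such a significance check for the slope of $M_1$, assuming SciPy is available (synthetic data; the conventional $0.05$ threshold from the quote above):

```python
# Illustrative significance check for the slope of a log-linear fit,
# using scipy.stats.linregress (synthetic data, for illustration only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
log_y = 1.0 + 0.5 * x + rng.normal(0, 0.3, 200)

res = stats.linregress(x, log_y)   # slope, intercept, rvalue, pvalue, stderr
significant = res.pvalue < 0.05    # reject H0: slope = 0 at the 5% level
print(res.pvalue, significant)
```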

http://www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/
