Solved – Does normalisation affect the values of Mean Squared Error, Mean Absolute Percentage Error, etc.?

Tags: correlation, machine learning, mae, mape, regression

I have a large amount of data, divided into folders and files. Each file has 56 features/columns and around 10,000 rows. The data is normalised (values between -1 and 1) and some of the features have binary values (0,1).

A sample screenshot of the data I am dealing with.

My task is to predict the values of a specific column/label from the rest of the columns/features. I have been given no prior information or domain knowledge whatsoever regarding the features, but I have been told that there is some relationship between the label and the features. So I can't properly identify the most and least important features just by looking at their names.

I have gone through the following steps, and I was hoping someone could answer the questions I have about each of them.

STEP 1: I calculated the correlations for each file. So far so good. This step was just so that I could get some sense of how strongly each feature is related to the others.
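A rough sketch of what I mean by this step (in R; the file name and the label column name "y" are placeholders):

dat <- read.csv("file_001.csv")       # one file: 56 columns, ~10,000 rows
corr <- cor(dat)                      # pairwise Pearson correlations
round(corr[1:5, 1:5], 2)              # inspect a corner of the matrix
sort(corr[, "y"], decreasing = TRUE)  # how strongly each feature correlates with the label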

STEP 2: I computed some summary statistics of the features: the mean, mode, variance, skewness, kurtosis, etc. Again, so far so good.
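In sketch form (base R, reusing dat from above; mode omitted, and skewness/kurtosis computed with simple moment formulas rather than a dedicated package):

summarise_feature <- function(x) {
  m <- mean(x); s <- sd(x)
  c(mean     = m,
    variance = var(x),
    skewness = mean((x - m)^3) / s^3,   # third standardised moment
    kurtosis = mean((x - m)^4) / s^4)   # fourth standardised moment
}
t(sapply(dat, summarise_feature))       # one row of statistics per feature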

STEP 3: I used a simple linear regression model to predict the values of "y" (the label). I calculated the score, the mean absolute percentage error, the mean absolute error, etc. This is where (I think) most of the problems arise.
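Concretely, the fit and the error measures amount to something like this (an in-sample R sketch, again with dat and the label column "y" as placeholders):

fit  <- lm(y ~ ., data = dat)                    # regress the label on all other columns
pred <- predict(fit, dat)                        # in-sample predictions
mse  <- mean((dat$y - pred)^2)                   # mean squared error
mae  <- mean(abs(dat$y - pred))                  # mean absolute error
mape <- 100 * mean(abs((dat$y - pred) / dat$y))  # mean absolute percentage error, in %
r2   <- summary(fit)$r.squared                   # the "score"
c(MSE = mse, MAE = mae, MAPE = mape, R2 = r2)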

My first question is: the calculated mean squared errors for my regression are somewhere around 0.00001 to 0.00005. Why are they so small? Does the normalisation make them this small, given that squaring a number greater than 1 makes it larger, while squaring a number between 0 and 1 makes it smaller? If the values were de-normalised back to their original scale, wouldn't the MSE be larger than the values I am getting?
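To illustrate what I mean about squaring (made-up numbers):

0.01^2   # 1e-04 -- an error of 0.01 on the normalised scale looks tiny once squared
10^2     # 100   -- the same kind of error on a larger original scale blows up
# more generally, rescaling actuals and predictions by a factor c scales the MSE by c^2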

My second question is: the same goes for the Mean Absolute Percentage Error that I calculate for the predicted values. It is around 0.5 (even after multiplying by 100). Why am I getting such "good" results when in reality my figure looks like this:

X-axis: True values. Y-axis: Predicted Values

From the figure, it's evident that I wasn't able to fit the data properly (hence the score of 0.76??), yet my MAE and MAPE are very small. In my opinion, the points should be scattered "along" the line in an elongated fashion, not in the bulky blob shown in this figure.

STEP 4: After applying linear regression to the label column against all the other columns, I was unsatisfied with the results. So, based on the correlations I got in Step 1, I removed the internally correlated features, i.e. those with a pairwise correlation of 0.95 or higher (roughly as in the sketch below). I still did not get satisfactory results, which led me to my next question.
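A minimal sketch of that pruning step (base R, with dat and the label "y" as above):

corr <- cor(dat[, setdiff(names(dat), "y")])   # correlations among the predictors only
corr[lower.tri(corr, diag = TRUE)] <- 0        # consider each pair once
drop <- names(which(apply(abs(corr) >= 0.95, 2, any)))
dat_reduced <- dat[, !(names(dat) %in% drop)]
fit2 <- lm(y ~ ., data = dat_reduced)          # refit on the reduced feature set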

My third question is: why are my predicted values not lying close to the line? What am I doing wrong? Am I (1) over-using the features, i.e. using more features than needed, which makes my model take extra, unnecessary features into consideration? Or (2) are the available features not enough to define my label properly? Are there features missing that I was not provided with, so that my model lacks some key information (just like not taking the crime rate of an area into consideration when predicting house prices)?

References:

Linear Regression in Python; Predict The Bay Area’s Home Prices

Importance of feature selection

Best Answer

I went and simulated some data that qualitatively looked more or less like your point cloud.

require(mvtnorm)                                 # for rmvnorm()
set.seed(1)
# 1000 correlated (actual, forecast) pairs centred at 0.77, mimicking your cloud
foo <- rmvnorm(1000, c(0.77, 0.77), cbind(c(.001, .0009), c(.0009, .001)))
plot(foo, pch = 19, cex = 0.6, xlab = "", ylab = "")
mean((foo[,1] - foo[,2])^2)                      # MSE, treating column 1 as the actuals
# [1] 0.000215418
100 * mean(abs(foo[,1] - foo[,2]) / foo[,1])     # MAPE, in percent
# [1] 1.512891

Scatterplot of the simulated point cloud.

  1. Your MSEs seem to make sense. My simulation gets a somewhat larger one, but that may simply be because you may have more dots in the center of your cloud.

    I can't really answer your question about normalization, because your target variable is not normalized in any meaningful sense: all its values lie between 0.70 and 0.84. If it really were normalized, the full range between -1 and 1 would be used. (And then MAPEs would not make sense, because some actuals would be zero or negative.)

  2. As above, I get a MAPE of about 1.5%, which is not far away from your 0.5%, and the difference may again be because you may have more points in your data cloud.

    Why am I getting such "good" results when in reality my figure looks like this: ... In my opinion, the points should be more scattered "along" the line in an elongated fashion, not in a bulky way as shown in this figure.

    The relationship between forecast error measures and scatterplots of forecasts against actuals is not straightforward. MSEs of course depend on scaling: multiply both actuals and forecasts by 10, and your cloud will look exactly the same except for the axes, but the MSE will be 100 times as large. Add 10 to both forecasts and actuals, and the cloud will again look exactly the same except for the axes, but this time the MAPE will be smaller by a factor of about 10 (see the short demonstration after this list).

    Don't try to relate error measures to scatterplots. It won't work.

  3. We don't know why your forecasts are not better. (We don't even know whether your plot is for a holdout sample or in-sample.) You may be overfitting, or not capturing enough information, or there may simply be residual noise that you cannot capture. If there is information you have not yet leveraged, then plotting residuals against each predictor may suggest possible remedies, like transformations of predictors. Otherwise, I'm afraid that as long as you can't investigate your data more deeply, there is little you can do; see also: How to know that your machine learning problem is hopeless?
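As a short demonstration of both effects, continuing the simulation above:

actual <- foo[, 1]; forecast <- foo[, 2]
mse  <- function(a, f) mean((a - f)^2)
mape <- function(a, f) 100 * mean(abs(a - f) / a)
mse(10 * actual, 10 * forecast) / mse(actual, forecast)  # exactly 100: the cloud is unchanged, the MSE is not
mape(actual, forecast)                                   # about 1.5%
mape(actual + 10, forecast + 10)                         # roughly an order of magnitude smaller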