Using R, I have developed three models:
- linear regression using lm();
- decision tree using rpart();
- k-nearest neighbors using kknn().
I would like to run leave-one-out cross-validation (LOOCV) tests and compare these models. Which error metric would best represent their performance? Would mean absolute percentage error (MAPE) or symmetric MAPE (sMAPE) be appropriate? Please suggest a metric.
For example, when I ran leave-one-out CV tests on the linear regression (LR) and decision tree (DT) models, the sMAPE values were 0.16 and 0.20 respectively, while the R-squared values of LR and DT were 0.85 and 0.92. Here sMAPE is computed as

sMAPE = (1/n) * sum( |predicted - actual| / ((predicted + actual) / 2) )

where n is the number of data points. The DT is a pruned regression tree. The R^2 values are computed on the full data set, which contains 60 data points in total.
Model R^2 sMAPE
LR 0.85 0.16
DT 0.92 0.20
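For reference, LOOCV with sMAPE (as defined above) can be sketched in a few lines of R. This is a minimal illustration, not your exact setup: the data frame `dat` and response column `y` are hypothetical placeholders, and `mtcars` merely stands in for your 60-point data set.

```r
library(rpart)

# sMAPE as defined in the question: mean of |pred - actual| / ((pred + actual) / 2)
smape <- function(actual, predicted) {
  mean(abs(predicted - actual) / ((predicted + actual) / 2))
}

# Leave-one-out CV: refit on n-1 points, predict the held-out point.
loocv_smape <- function(fit_fun, dat) {
  preds <- vapply(seq_len(nrow(dat)), function(i) {
    fit <- fit_fun(dat[-i, ])
    unname(predict(fit, newdata = dat[i, , drop = FALSE]))
  }, numeric(1))
  smape(dat$y, preds)
}

# Illustration on a built-in data set (placeholder for your own data):
dat <- data.frame(y = mtcars$mpg, x1 = mtcars$wt, x2 = mtcars$hp)
loocv_smape(function(d) lm(y ~ x1 + x2, data = d), dat)
loocv_smape(function(d) rpart(y ~ x1 + x2, data = d, method = "anova"), dat)
```

The same `loocv_smape` wrapper works for any model whose `predict` method returns a numeric vector, so the kknn() model can be compared the same way.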
Best Answer
Many metrics exist, and no single one is generally the best; the right choice depends on your problem and your data. Often several metrics can be used. I find it useful to compute both hypothesis tests and several different metrics (RMSE, MAPE, ...) and check whether they give similar results, so that your conclusions are not based on only one metric.
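As a hedged sketch of that advice: compute several metrics on the same set of LOOCV predictions and compare them side by side. The vectors `actual` and `predicted` below are placeholders for your held-out values and LOOCV predictions.

```r
# Compute several error metrics from one pair of actual/predicted vectors,
# so conclusions don't hinge on a single metric.
metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(RMSE  = sqrt(mean(err^2)),
    MAE   = mean(abs(err)),
    MAPE  = mean(abs(err / actual)),
    sMAPE = mean(abs(err) / ((predicted + actual) / 2)))
}

# Placeholder values; substitute your LOOCV outputs for each model.
metrics(actual = c(10, 20, 30), predicted = c(12, 18, 33))
```

If RMSE, MAPE, and sMAPE all rank LR above DT (or vice versa), the ranking is more trustworthy than any one number on its own.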