Solved – How to statistically compare machine learning “regression” models

hypothesis testing, machine learning, model comparison, statistical significance

Let's say that I want to compare the performance of XGBoost vs. a NN, one NN vs. another, or even the same NN at different epochs, for a regression task.

All algorithms are trained and evaluated on the exact same dataset.

My thought is to compare the distributions of the residuals, i.e., set up a hypothesis test such that $H_1: \mu_{xgb} > \mu_{nn}$, or do a t-test, …

Here is an example I was working on …

[Figures: residual distributions of the XGBoost and NN models]

As you can see, both models are similar and both residual distributions are non-normal, but the NN's residuals have a larger variance. I did not know how to compare them, so I selected a paired Wilcoxon signed-rank test, since it does not assume a normal distribution. As expected, the p-value was very low for the alternative that the median residual of XGBoost is less than that of the NN.
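For reference, a minimal sketch in R of the kind of test described; the residual vectors are simulated stand-ins (the names res_xgb and res_nn are hypothetical), not the data behind the plots above.

    # Hypothetical residuals on the same held-out set; simulated
    # stand-ins, not the actual XGBoost/NN residuals shown above.
    set.seed(1)
    res_xgb <- rnorm(1000, sd = 1.0)
    res_nn  <- rnorm(1000, sd = 1.5)

    # Paired Wilcoxon signed-rank test on absolute residuals,
    # H1: XGBoost's absolute errors tend to be smaller than the NN's.
    wilcox.test(abs(res_xgb), abs(res_nn), paired = TRUE, alternative = "less")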

I have no idea if this is kosher, but I could not find anything online.

Also, I was very surprised by how biased both models are in the regions with the most frequent data. By linear-regression standards, both of them would be considered terrible models. I would think Q-Q plots would be a better diagnostic than, e.g., feature importance in the case of XGBoost, if we assume

$$y = f(x, w) + \epsilon$$

where $x$ is the input and $w$ are weights in both models.

Best Answer

Because my last answer was downvoted, I'm going to provide a full example.

You don't want to compare the residuals; you want to compare losses. Let's say that your regression looks like this:

[Figure: scatterplot of the simulated regression data]

Let's compare two models on RMSE: a linear model and a generalized additive model (GAM). Clearly, the linear model will have the larger loss because it has high bias and low variance. Let's take a look at the histogram of the per-observation loss values.

[Figure: histograms of the loss values for the two models]

We have lots of data, so we can use the central limit theorem to help us make inference. When we have "enough" data, the sampling distribution for the mean is normal with expectation equal to the population mean and standard deviation $\sigma/\sqrt{n}$.
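As a quick sanity check (my illustration, not part of the original answer), you can see this in simulation: even when the losses themselves are skewed, means of samples of size $n$ are approximately normal with standard deviation close to $\sigma/\sqrt{n}$.

    # CLT illustration: means of skewed losses are approximately normal.
    set.seed(2)
    loss  <- rexp(1e5)^2                        # a skewed "population" of losses
    means <- replicate(2000, mean(sample(loss, 500)))
    c(sd(means), sd(loss) / sqrt(500))          # nearly equal, as the CLT predicts
    hist(means)                                 # roughly bell-shaped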

So all we have to do is perform a t test on the loss values (and not the residuals), and that will allow us to determine which model has the smaller expected loss.
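For concreteness, here is a minimal sketch of how loss1 and loss2 could be produced; the simulated data, the seed, and the choice of the mgcv package are my assumptions, so the numbers will not exactly match the output below.

    # Sketch only: simulated data standing in for the answer's generated data.
    library(mgcv)

    set.seed(3)
    n <- 1000
    x <- runif(n, -3, 3)
    y <- sin(x) + rnorm(n, sd = 0.5)    # nonlinear ground truth

    fit_lm  <- lm(y ~ x)                # high-bias linear model
    fit_gam <- gam(y ~ s(x))            # flexible GAM

    loss1 <- (y - fitted(fit_lm))^2     # per-observation squared losses
    loss2 <- (y - fitted(fit_gam))^2

    t.test(loss2, loss1)                # Welch two-sample t test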

Using the data I generated:

> t.test(loss2, loss1)

    Welch Two Sample t-test

data:  loss2 and loss1
t = -7.8795, df = 1955, p-value = 5.408e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2717431 -0.1634306
sample estimates:
mean of x mean of y 
0.3761796 0.5937665 

The mean loss of the GAM is 0.38, while the mean loss of the linear model is 0.59. The t test tells us that if the sampling distributions of the mean had the same expectation (that is, if the expected losses for the two models were the same), then a difference in means this large would be incredibly unlikely to arise by chance alone. Thus, we reject the null.

A paired method might help, but usually we have so much data that the loss of power from an unpaired comparison is really not a problem.
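Since both loss vectors come from the same observations, the paired version is a one-line change (reusing the hypothetical loss1 and loss2 from the sketch above):

    # Paired t test on the per-observation loss differences.
    t.test(loss2, loss1, paired = TRUE)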

Does that clarify things?
