Solved – Is the R-squared value appropriate for comparing models?

elastic net, machine learning, neural networks, r-squared, random forest

I'm trying to identify the best model to predict the prices of automobiles, using the prices and features available on automobile classified advertisement sites.

For this I used a couple of models from the scikit-learn library and neural network models from pybrain and neurolab. The approach I have used so far is to run a fixed amount of data through several models (machine learning algorithms) and compare their $R^2$ values, which were calculated with the scikit-learn metrics module.
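For concreteness, here is a minimal sketch of that comparison step using scikit-learn's `r2_score`; the prices and predictions below are invented purely for illustration:

```python
from sklearn.metrics import r2_score

# Hypothetical held-out prices and predictions from two candidate models.
y_true = [11200, 8400, 15300, 6100, 9800]
pred_elastic_net = [10900, 8800, 14700, 6500, 9500]
pred_neural_net = [12600, 7100, 13900, 7800, 8700]

# r2_score returns 1 - SSR/SST; higher is better.
print("Elastic net R^2:", r2_score(y_true, pred_elastic_net))
print("Neural net  R^2:", r2_score(y_true, pred_neural_net))
```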

  1. Is $R^2$ a good method to compare the performance of different models?
  2. Although I got quite acceptable results for models such as elastic net and random forests, I got very poor $R^2$ values for neural network models. So is $R^2$ an appropriate method for evaluating neural networks (or non-linear methods)?

Best Answer

I think the crucial part to consider in answering your question is

> I'm trying to identify the best model to predict the prices of automobiles

because this statement implies something about why you want to use the model. Model choice and evaluation should be based on what you want to achieve with your fitted values.

First, let's recap what $R^2$ does: it computes a scaled measure based on the quadratic loss function, which I am sure you are already aware of. To see this, define the residual $e_i = y_i - \hat{y}_i$ for your $i$-th observation $y_i$ and the corresponding fitted value $\hat{y}_i$. Using the convenient notation $SSR := \sum_{i=1}^N e_i^2$ and $SST := \sum_{i=1}^N (y_i - \bar{y})^2$, $R^2$ is simply defined as $R^2 = 1 - SSR/SST$.
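As a quick numerical check of that identity (the prices and fitted values are made up), $1 - SSR/SST$ computed by hand agrees with scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([11200., 8400., 15300., 6100., 9800.])      # observed prices (invented)
y_hat = np.array([10900., 8800., 14700., 6500., 9500.])  # fitted values (invented)

e = y - y_hat                       # residuals e_i = y_i - y_hat_i
ssr = np.sum(e ** 2)                # SSR, sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)   # SST, total sum of squares
print(1 - ssr / sst)                # same value as r2_score(y, y_hat)
print(r2_score(y, y_hat))
```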

Second, let us see what using $R^2$ for model choice/evaluation means. Suppose we choose from a set of predictions $\hat{Y}_M$ that were generated using a model $M \in \mathcal{M}$, where $\mathcal{M}$ is the collection of models under consideration (in your example, this collection would contain neural networks, random forests, elastic nets, ...). Since $SST$ remains constant across all the models, by maximizing $R^2$ you will choose exactly the model that minimizes $SSR$. In other words, you will choose the $M \in \mathcal{M}$ that produces the minimal squared error loss!
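To illustrate that equivalence, here is a sketch with hypothetical predictions from three candidate models: because $SST$ is identical for every model, the model with the largest $R^2$ is necessarily the one with the smallest $SSR$:

```python
import numpy as np

y = np.array([11200., 8400., 15300., 6100., 9800.])  # observed prices (invented)
preds = {  # hypothetical predictions from the candidate models
    "elastic_net":   np.array([10900., 8800., 14700., 6500., 9500.]),
    "random_forest": np.array([11000., 8300., 15100., 6400., 9600.]),
    "neural_net":    np.array([12600., 7100., 13900., 7800., 8700.]),
}

sst = np.sum((y - y.mean()) ** 2)  # constant across models
for name, y_hat in preds.items():
    ssr = np.sum((y - y_hat) ** 2)
    print(name, "SSR:", ssr, "R^2:", 1 - ssr / sst)
# The model with the smallest SSR is exactly the one with the largest R^2.
```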

Third, let us consider why $R^2$, or equivalently $SSR$, might be interesting for model choice. Traditionally, the squared loss ($L^2$ norm) is used for three reasons: (1) it is easier to compute than least absolute deviations (LAD, the $L^1$ norm) because no absolute value appears in the computation; (2) it punishes fitted values that are far off from the actual value much more than LAD does (in a squared rather than an absolute sense) and thereby makes sure we have fewer extreme outliers; (3) it is symmetric: over- or underestimating the price of a car is considered equally bad.
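Point (2) is easy to see numerically. With one invented outlying residual, the squared loss is dominated by the outlier far more than the absolute loss is:

```python
import numpy as np

# Residuals with one large outlier (invented values).
e = np.array([100., -200., 150., -120., 3000.])

print("outlier's share of L2 loss:", e[-1] ** 2 / np.sum(e ** 2))     # about 0.99
print("outlier's share of L1 loss:", abs(e[-1]) / np.sum(np.abs(e)))  # about 0.84
```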

Fourth (and last), let us see whether this is what you need for your predictions. The point of most interest here is (3) from the last paragraph. Suppose you want to take a neutral stance, and you are neither the buyer nor the seller of a car. Then $R^2$ can make sense: you are impartial, and you wish to punish deviations towards over- and underpricing exactly identically. The same applies if you just want to model the relation between the quantities without wishing to predict unobserved values. Now suppose you are working for a consumer/buyer on a tight budget: in this situation, you might want to punish overestimation of the price in a quadratic sense, but underestimation in an $L^p$ sense, where $1 \leqslant p < 2$. For $p=1$, you would punish in an absolute-deviation sense. This reflects the goals and intentions of the buyer, and biasing the estimation downward might be in their interest. Conversely, you could flip the reasoning if you were modelling the price predictions for the seller. Needless to say, any $L^p$ norm could be chosen to reflect the preferences of the modeller or of the agent you model for. You can also move outside the $L^p$ norms entirely and use a constant, exponential, or log loss on one side and a different loss on the other.
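A sketch of such an asymmetric criterion, under the buyer's perspective described above; the function name `buyer_loss` and all numbers are invented for illustration:

```python
import numpy as np

def buyer_loss(y, y_hat, p=1.0):
    """Punish overestimation quadratically and underestimation in an
    L^p sense (1 <= p < 2), reflecting a buyer on a tight budget."""
    e = y - y_hat
    overestimated = e < 0  # y_hat > y: the model overprices the car
    return np.where(overestimated, e ** 2, np.abs(e) ** p).sum()

y = np.array([11200., 8400., 15300.])      # observed prices (invented)
y_hat = np.array([11800., 8100., 14900.])  # predictions (invented)
print(buyer_loss(y, y_hat, p=1.0))
```

Selecting the model that minimizes such a loss, rather than the one that maximizes $R^2$, builds the buyer's preferences directly into the model choice.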

In summary, model choice/evaluation cannot be considered independently of the model's aim.
