I have experienced a similar issue.
I trained a neural network binary classifier with a cross entropy loss. Here is the cross entropy as a function of epoch; red is the training set and blue is the test set.
When I plotted the accuracy, I was surprised to find that it was higher at epoch 1000 than at epoch 50, even on the test set!
To understand the relationship between cross entropy and accuracy, I dug into a simpler model: logistic regression with one input and one output. In the following, I illustrate this relationship in three special cases.
In general, the parameter where the cross entropy is minimum is not the parameter where the accuracy is maximum. However, we may expect some relationship between cross entropy and accuracy.
[In the following, I assume that you know what cross entropy is and why we use it instead of accuracy to train a model. If not, please read this first: How do I interpret a cross entropy score?]
Illustration 1 This one shows that the parameter where the cross entropy is minimum is not the parameter where the accuracy is maximum, and helps understand why.
Here is my sample data. I have 5 points, and for example input -1 has led to output 0.
Cross entropy.
After minimizing the cross entropy, I obtain an accuracy of 0.6. The cut between class 0 and class 1 is at x = 0.52.
For the 5 values, I obtain respectively a cross entropy of: 0.14, 0.30, 1.07, 0.97, 0.43.
Accuracy.
After maximizing the accuracy on a grid, I obtain many different parameters leading to an accuracy of 0.8. This can be shown directly by selecting the cut x = -0.1; you can also select x = 0.95 to separate the sets.
In the first case, the cross entropy is large. Indeed, the fourth point is far away from the cut, so it has a large cross entropy. Namely, I obtain per-point cross entropies of: 0.01, 0.31, 0.47, 5.01, 0.004.
In the second case, the cross entropy is large too. Here the third point is far away from the cut, so it has a large cross entropy. I obtain per-point cross entropies of: 5e-5, 2e-3, 4.81, 0.6, 0.6.
The $a$ minimizing the cross entropy is 1.27. For this $a$, we can show the evolution of cross entropy and accuracy when $b$ varies (on the same graph).
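To make Illustration 1 concrete, here is a minimal sketch in Python of the two fits compared above. The five data points below are placeholders (the only point stated above is that input -1 gives output 0), and the helper names are mine, so the exact numbers will differ from those reported above:

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data: only (-1, 0) is taken from the text above;
# the other four points are invented for illustration.
x = np.array([-1.0, -0.5, 0.3, 0.8, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_point_cross_entropy(params, x, y):
    """Cross entropy of each point under the logistic model p = sigmoid(a*x + b)."""
    a, b = params
    p = np.clip(sigmoid(a * x + b), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# 1) Minimize the total cross entropy over (a, b).
res = minimize(lambda t: per_point_cross_entropy(t, x, y).sum(), x0=np.zeros(2))
a_ce, b_ce = res.x
cut_ce = -b_ce / a_ce                               # decision boundary p = 0.5 (assumes a positive fitted slope)
acc_ce = np.mean((x > cut_ce) == (y == 1))
print("cross-entropy fit: cut =", round(cut_ce, 2), "accuracy =", acc_ce)
print("per-point cross entropy:", per_point_cross_entropy(res.x, x, y).round(2))

# 2) Maximize the accuracy directly over a grid of cut points.
cuts = np.linspace(-2.0, 2.0, 401)
accs = np.array([np.mean((x > c) == (y == 1)) for c in cuts])
print("best accuracy on the grid:", accs.max(),
      "reached by many cuts, e.g.", cuts[accs == accs.max()][:3].round(2))
```

The point of the second step is that a whole interval of cuts reaches the maximal accuracy, while only one pair $(a, b)$ minimizes the cross entropy.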
Illustration 2 Here I take $n=100$. I took the data as a sample from the logit model with slope $a=0.3$ and intercept $b=0.5$. I selected a seed that shows a large effect, but many seeds lead to similar behavior.
Here, I plot only the most interesting graph. The $b$ minimizing the cross entropy is 0.42. For this $b$, we can show the evolution of cross entropy and accuracy when $a$ varies (on the same graph).
Here is an interesting thing: the plot looks like my initial problem. The cross entropy rises as the selected $a$ becomes large, yet the accuracy continues to rise (and then plateaus).
We should not select the model with this larger accuracy (not least because here we know that the underlying model has $a=0.3$!).
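Here is a small sketch, in the same spirit, of how Illustration 2 can be generated (the seed below is arbitrary, so the exact numbers will differ from my figure):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not the one used for the figure

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n, a, b, rng):
    """Sample (x, y) from the logit model P(y = 1 | x) = sigmoid(a*x + b)."""
    x = rng.normal(size=n)
    y = (rng.random(n) < sigmoid(a * x + b)).astype(float)
    return x, y

def ce_and_acc(a, b, x, y):
    p = np.clip(sigmoid(a * x + b), 1e-12, 1 - 1e-12)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    acc = np.mean((p > 0.5) == (y == 1))
    return ce, acc

# Illustration 2: n = 100, true slope a = 0.3, intercept b = 0.5.
x, y = simulate(100, a=0.3, b=0.5, rng=rng)
for a in [0.1, 0.3, 1.0, 3.0, 10.0]:      # scan a with b held fixed
    ce, acc = ce_and_acc(a, 0.5, x, y)
    print(f"a = {a:5.1f}   cross entropy = {ce:.3f}   accuracy = {acc:.2f}")
```

Rerunning the same sketch with `simulate(10000, a=1.0, b=0.0, rng=rng)` gives the setting of Illustration 3 below, where cross entropy and accuracy move much more closely together.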
Illustration 3 Here I take $n=10000$, with $a=1$ and $b=0$. Now, we can observe a strong relationship between accuracy and cross entropy.
I think that if the model has enough capacity (enough to contain the true model), and if the data are large (i.e. the sample size goes to infinity), then the cross entropy may be minimized where the accuracy is maximized, at least for the logistic model. I have no proof of this; if someone has a reference, please share.
Bibliography: The link between cross entropy and accuracy is interesting and complex, but I could not find articles dealing with it... Studying accuracy is interesting because, despite being an improper scoring rule, everyone can understand its meaning.
Note: I first tried to find an answer on this website; posts dealing with the relationship between accuracy and cross entropy are numerous but have few answers. See: Comparable training and test cross-entropies result in very different accuracies; Validation loss going down, but validation accuracy worsening; Doubt on categorical cross entropy loss function; Interpreting log-loss as percentage...
Best Answer
This answer will mostly focus on $R^2$, but most of this logic extends to other metrics such as AUC and so on.
This question can almost certainly not be answered well for you by readers at CrossValidated. There is no context-free way to decide whether model metrics such as $R^2$ are good or not. At the extremes, it is usually possible to get a consensus from a wide variety of experts: an $R^2$ of almost 1 generally indicates a good model, and of close to 0 indicates a terrible one. In between lies a range where assessments are inherently subjective. In this range, it takes more than just statistical expertise to answer whether your model metric is any good. It takes additional expertise in your area, which CrossValidated readers probably do not have.
Why is this? Let me illustrate with an example from my own experience (minor details changed).
I used to do microbiology lab experiments. I would set up flasks of cells at different levels of nutrient concentration, and measure the growth in cell density (i.e. slope of cell density against time, though this detail is not important). When I then modelled this growth/nutrient relationship, it was common to achieve $R^2$ values of >0.90.
I am now an environmental scientist. I work with datasets containing measurements from nature. If I try to fit the exact same model described above to these ‘field’ datasets, I’d be surprised if the $R^2$ was as high as 0.4.
These two cases involve exactly the same parameters, with very similar measurement methods, models written and fitted using the same procedures - and even the same person doing the fitting! But in one case, an $R^2$ of 0.7 would be worryingly low, and in the other it would be suspiciously high.
Furthermore, we would take some chemistry measurements alongside the biological measurements. Models for the chemistry standard curves would have $R^2$ around 0.99, and a value of 0.90 would be worryingly low.
What leads to these big differences in expectations? Context. That vague term covers a vast area, so let me try to separate it into some more specific factors (this is likely incomplete):
1. What is the payoff / consequence / application?
This is where the nature of your field is likely to be most important. However valuable I think my work is, bumping up my model $R^2$s by 0.1 or 0.2 is not going to revolutionize the world. But there are applications where that magnitude of change would be a huge deal! A much smaller improvement in a stock forecast model could mean tens of millions of dollars to the firm that develops it.
This is even easier to illustrate for classifiers, so I’m going to switch my discussion of metrics from $R^2$ to accuracy for the following example (ignoring the weakness of the accuracy metric for the moment). Consider the strange and lucrative world of chicken sexing. After years of training, a human can rapidly tell the difference between a male and female chick when they are just 1 day old. Males and females are fed differently to optimize meat & egg production, so high accuracy saves huge amounts in misallocated investment in billions of birds. Till a few decades ago, accuracies of about 85% were considered high in the US. Nowadays, the value of achieving the very highest accuracy, of around 99%? A salary that can apparently range as high as 60,000 to possibly 180,000 dollars per year (based on some quick googling). Since humans are still limited in the speed at which they work, machine learning algorithms that can achieve similar accuracy but allow sorting to take place faster could be worth millions.
(I hope you enjoyed the example – the alternative was a depressing one about very questionable algorithmic identification of terrorists).
2. How strong is the influence of unmodelled factors in your system?
In many experiments, you have the luxury of isolating the system from all other factors that may influence it (that’s partly the goal of experimentation, after all). Nature is messier. To continue with the earlier microbiology example: cells grow when nutrients are available but other things affect them too – how hot it is, how many predators there are to eat them, whether there are toxins in the water. All of those covary with nutrients and with each other in complex ways. Each of those other factors drives variation in the data that is not being captured by your model. Nutrients may be unimportant in driving variation relative to the other factors, and so if I exclude those other factors, my model of my field data will necessarily have a lower $R^2$.
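If you want to see this effect numerically, here is a small toy simulation (the variable names and coefficients are invented, not taken from my actual data): growth is driven by nutrients plus two unmodelled factors, and a model using nutrients alone necessarily explains less of the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Toy system: growth depends on nutrients plus other, unmodelled drivers.
nutrients = rng.normal(size=n)
temperature = rng.normal(size=n)      # unmodelled driver
predation = rng.normal(size=n)        # unmodelled driver
growth = 1.0 * nutrients + 2.0 * temperature - 1.5 * predation + rng.normal(size=n)

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X (plus intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print("nutrients only:       R^2 =", round(r_squared(nutrients, growth), 2))
print("all drivers included: R^2 =", round(r_squared(
    np.column_stack([nutrients, temperature, predation]), growth), 2))
```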
3. How precise and accurate are your measurements?
Measuring the concentration of cells and chemicals can be extremely precise and accurate. Measuring (for example) the emotional state of a community based on trending twitter hashtags is likely to be…less so. If you cannot be precise in your measurements, it is unlikely that your model can ever achieve a high $R^2$. How precise are measurements in your field? We probably do not know.
4. Model complexity and generalizability
If you add more factors to your model, even random ones, you will on average increase the model $R^2$ (adjusted $R^2$ partly addresses this). This is overfitting. An overfit model will not generalize well to new data i.e. will have higher prediction error than expected based on the fit to the original (training) dataset. This is because it has fit the noise in the original dataset. This is partly why models are penalized for complexity in model selection procedures, or subjected to regularization.
If overfitting is ignored or not successfully prevented, the estimated $R^2$ will be biased upward i.e. higher than it ought to be. In other words, your $R^2$ value can give you a misleading impression of your model’s performance if it is overfit.
IMO, overfitting is surprisingly common in many fields. How best to avoid this is a complex topic, and I recommend reading about regularization procedures and model selection on this site if you are interested in this.
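As a quick sketch of the first point (the names and sizes below are invented for illustration): fitting ordinary least squares with an increasing number of pure-noise predictors makes the training $R^2$ climb, while the adjusted $R^2$ and the $R^2$ on fresh test data do not.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, n_noise = 50, 50, 20

def make_data(n, rng):
    """One real predictor plus pure-noise predictors."""
    x = rng.normal(size=(n, 1))
    noise = rng.normal(size=(n, n_noise))
    y = 2.0 * x[:, 0] + rng.normal(size=n)
    return np.hstack([x, noise]), y

X_tr, y_tr = make_data(n_train, rng)
X_te, y_te = make_data(n_test, rng)

def fit_ols(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r2(X, y, beta):
    X1 = np.column_stack([np.ones(len(y)), X])
    return 1 - np.sum((y - X1 @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

for p in [1, 6, 11, 21]:                  # number of predictors used (1 real + noise)
    beta = fit_ols(X_tr[:, :p], y_tr)
    train = r2(X_tr[:, :p], y_tr, beta)
    test = r2(X_te[:, :p], y_te, beta)
    adj = 1 - (1 - train) * (n_train - 1) / (n_train - p - 1)
    print(f"{p:2d} predictors: train R^2 = {train:.2f}  adjusted = {adj:.2f}  test R^2 = {test:.2f}")
```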
5. Data range and extrapolation
Does your dataset extend across a substantial portion of the range of X values you are interested in? Adding new data points outside the existing data range can have a large effect on estimated $R^2$, since it is a metric based on the variance in X and Y.
Aside from this, if you fit a model to a dataset and need to predict a value outside the X range of that dataset (i.e. extrapolate), you might find that its performance is lower than you expect. This is because the relationship you have estimated might well change outside the data range you fitted. In the figure below, if you took measurements only in the range indicated by the green box, you might imagine that a straight line (in red) described the data well. But if you attempted to predict a value outside that range with that red line, you would be quite incorrect.
[The figure is an edited version of this one, found via a quick google search for 'Monod curve'.]
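Here is a toy version of that figure in code (a Monod-type saturating curve with made-up parameters): a straight line fitted only inside a narrow range of x looks excellent there, but its predictions outside that range are badly wrong.

```python
import numpy as np

rng = np.random.default_rng(3)

def monod(x, vmax=10.0, k=2.0):
    """Toy Monod-type saturating curve (parameters invented for illustration)."""
    return vmax * x / (k + x)

x_all = np.linspace(0.1, 20.0, 200)
y_all = monod(x_all) + rng.normal(scale=0.2, size=x_all.size)

# Fit a straight line only inside the narrow "green box" of low x values.
inside = x_all < 2.0
slope, intercept = np.polyfit(x_all[inside], y_all[inside], 1)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print("R^2 inside the fitted range:", round(r2(y_all[inside], slope * x_all[inside] + intercept), 2))
print("R^2 when extrapolating     :", round(r2(y_all[~inside], slope * x_all[~inside] + intercept), 2))
print("linear prediction at x = 15:", round(slope * 15 + intercept, 1),
      "  true value ≈", round(monod(15.0), 1))
```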
6. Metrics only give you a piece of the picture
This is not really a criticism of the metrics – they are summaries, which means that they also throw away information by design. But it does mean that any single metric leaves out information that can be crucial to its interpretation. A good analysis takes into consideration more than a single metric.
Suggestions, corrections and other feedback welcome. And other answers too, of course.