Solved – When we plot data and then use nonlinear transformations in a regression model, are we data-snooping?

bias, machine learning, model selection, modeling, regression

I've been reading up on data snooping, and how it can mean the in-sample error does not provide a good approximation of the out-of-sample error.

Suppose we are given a data set $(x_1,y_1),(x_2,y_2),…,(x_n,y_n)$, which we plot, and observe what appears to be a quadratic relationship between the variables. So we make the assumption that
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i,
$$

where $\varepsilon_i$ is a random noise term.

Isn't this data snooping? We have let the data affect our model. So what implications does this have for the coefficients $\beta_0, \beta_1, \beta_2$ that we estimate: can they be considered reliable for making predictions on new inputs?

I ask because there are countless notes, articles, books, etc. on regression that recommend looking at the data and then choosing a model that looks like it will fit it well. For example, here the author has some data, tries a linear model, and upon finding it unsatisfactory, moves to a quadratic model that fits the data better. Similarly, here, people are discussing log transformations and the original poster is given the following advice:

If there is no theory to guide you, graphical exploration of the relationship between the variables, or looking at fitted vs observed plots both ways, will tell you which model is appropriate.

So when we base our model on an observation of the plotted data, is this data snooping or not? If it isn't, then could someone give an explanation why this isn't data snooping?

If it is data snooping, then:

  1. What are the consequences of this on the out-of-sample performance?
  2. What should we do to avoid/overcome the data snooping issue in a regression model so that we will have good out-of-sample performance?

Best Answer

There is a way to estimate the consequences for out-of-sample performance, provided that the decision-making in the modeling can be adequately turned into an automated or semi-automated process: repeat the entire modeling process on multiple bootstrap re-samples of the data set. That is about as close as you can get to estimating the out-of-sample performance of the modeling process itself.

Recall the bootstrap principle.

The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modelled by resampling the sample data and performing inference about a sample from resampled data (resampled → sample). As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference of the 'true' sample from resampled data (resampled → sample) is measurable.
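
As a minimal illustration of that resample → sample analogy, here is a Python sketch that treats the sample mean as the statistic of interest; the data and sample size are hypothetical and only serve to show the mechanics:

```python
# Minimal illustration of the bootstrap principle: the observed sample plays
# the role of the "population", and resamples play the role of new samples.
import numpy as np

rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=100)   # hypothetical observed data

# Resample with replacement and recompute the statistic each time.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(2000)]

print("bootstrap SE of the mean:", np.std(boot_means))
print("analytic SE estimate:    ", sample.std(ddof=1) / np.sqrt(sample.size))
```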

Following that principle, if you repeat the full model building process on multiple bootstrap re-samples of the data, then test each resulting model's performance on the full data set, you have a reasonable estimate of generalizability in terms of how well your modeling process on the full data set might apply to the original population. So, in your example, if there were some quantitative criterion for deciding that quadratic rather than linear modeling of the predictor is to be preferred, then you use that criterion along with all other steps of the modeling on each re-sample.
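
Here is a rough Python sketch of that procedure, assuming the by-eye choice between a linear and a quadratic fit can be stood in for by an AIC comparison; the data, the selection criterion, and the helper names are all hypothetical, not a prescription:

```python
# Sketch of validating a "plot, then choose the degree" modeling process with
# the bootstrap (Efron-style optimism estimate). The AIC-based rule below is an
# assumed, automatable stand-in for whatever decision was made by eye.
import numpy as np

rng = np.random.default_rng(0)

def fit_best_degree(x, y, degrees=(1, 2)):
    """Fit each candidate polynomial degree; return coefficients of the AIC-best fit."""
    n = len(y)
    best = None
    for d in degrees:
        coefs = np.polyfit(x, y, d)
        rss = np.sum((y - np.polyval(coefs, x)) ** 2)
        aic = n * np.log(rss / n) + 2 * (d + 1)   # Gaussian AIC, up to a constant
        if best is None or aic < best[0]:
            best = (aic, coefs)
    return best[1]

def bootstrap_optimism(x, y, n_boot=500):
    """Average gap between each resample's apparent error and its error on the full data."""
    n = len(y)
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample with replacement
        coefs = fit_best_degree(x[idx], y[idx])     # repeat the WHOLE modeling process
        err_in = np.mean((y[idx] - np.polyval(coefs, x[idx])) ** 2)
        err_full = np.mean((y - np.polyval(coefs, x)) ** 2)
        gaps.append(err_full - err_in)
    return np.mean(gaps)                            # estimated optimism of the process

# Hypothetical data with a quadratic relationship
x = rng.uniform(-2, 2, 100)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1, 100)

coefs_full = fit_best_degree(x, y)
apparent_mse = np.mean((y - np.polyval(coefs_full, x)) ** 2)
print("apparent MSE:           ", apparent_mse)
print("optimism-corrected MSE: ", apparent_mse + bootstrap_optimism(x, y))
```

The corrected figure estimates how the whole process (including the model-selection step) is likely to perform out of sample, which is exactly what in-sample error alone overstates.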

It's obviously best to avoid such data snooping. There's no harm in looking at things like distributions of predictors or outcomes on their own. You can look at associations among predictors, with a view toward combining related predictors into single summary measures. You can use knowledge of the subject matter as a guide. For example, if your outcome is strictly positive and has a measurement error that is known to be proportional to the measured value, a log transform makes good sense on theoretical grounds. Those approaches can lead to data transformations that aren't contaminated by looking at predictor-outcome relationships.
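
As a rough Python sketch of such outcome-blind preprocessing: the data below are simulated and hypothetical, the log transform is chosen on theoretical grounds (strictly positive outcome with roughly proportional error), and the predictor summary is built without ever consulting the outcome.

```python
# Outcome-blind preprocessing: a theory-driven log transform of y, plus a
# summary score combining two correlated predictors. Neither step looks at the
# predictor-outcome relationship.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
X = np.column_stack([X, X[:, 0] + 0.1 * rng.normal(size=n)])    # a correlated predictor pair
y = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(0, 0.2, n))         # positive, multiplicative noise

log_y = np.log(y)                                   # transform justified by theory, not by plots of y vs X

# Combine the two correlated predictors into one summary score (uses X only, never y).
summary = PCA(n_components=1).fit_transform(X[:, [0, 3]])
Z = np.column_stack([summary, X[:, 1], X[:, 2]])

model = LinearRegression().fit(Z, log_y)
print(model.intercept_, model.coef_)
```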

Another useful approach is to start with a highly flexible model (provided the model isn't at risk of overfitting) and then pull back toward a more parsimonious one. For example, with a continuous predictor you could start with a spline fit having multiple knots, then run an analysis of variance on nested models with progressively fewer knots to determine how few knots (down to even a simple linear term) give statistically indistinguishable results.
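
A rough sketch of that pull-back idea in Python, using statsmodels/patsy; the B-spline basis with df=5, the single F-test against the nested linear model, and the simulated data are illustrative assumptions (restricted cubic splines, as in Harrell's work, are a common alternative):

```python
# Start with a flexible spline fit, then test whether a simple linear term is
# statistically indistinguishable from it.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(-2, 2, 150)})
df["y"] = 1 + 0.5 * df["x"] + 2 * df["x"] ** 2 + rng.normal(0, 1, 150)

flexible = smf.ols("y ~ bs(x, df=5)", data=df).fit()   # spline with several knots
linear   = smf.ols("y ~ x", data=df).fit()             # simple linear term, nested in the spline span

# A non-significant F-test would suggest the linear term suffices; here the
# quadratic signal should make the flexible fit clearly better.
print(anova_lm(linear, flexible))
```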

Frank Harrell's course notes and book provide detailed guidance on ways to model reliably without data snooping. The bootstrap process above for validating the modeling approach can also be valuable even if you build a model without snooping.
