How to interpret the direction of the Harvey-Collier test and Rainbow test for linearity

Tags: diagnostic, linear, linearity, nonlinearity, r

I implemented both of these tests in R, using the lmtest package. Both tests seem to say the same thing directionally (I think), with very similar p-values very close to 0. But are these tests saying that the underlying regression model is adequately linear, or are they saying just the opposite? I know that the tests have slightly different nuances: the Harvey-Collier test checks the residuals for linearity, while the Rainbow test checks whether the linear fit of the model is adequate even if some underlying relationships are not linear. Any insight on the interpretation of these results is greatly appreciated.

I am posting the results of the tests below:

In R, with the lmtest package:

harvtest(Regression, order.by = NULL)

    Harvey-Collier test

data: Regression
HC = 4.3826, df = 119, p-value = 2.543e-05

raintest(Regression, fraction = 0.5, order.by = NULL, center = NULL)

    Rainbow test

data: Regression
Rain = 1.7475, df1 = 62, df2 = 58, p-value = 0.01664

Best Answer

OK, I can't find good references for the Harvey-Collier test; they almost all appear to be paywalled. However, the intuition behind the rainbow test is easy to describe.

Suppose you're trying to fit a linear model where it's inadequate. Let's use a very simple quadratic model as an example: $X \sim N(1, 1)$ and $Y = X^2 + \epsilon$ with $\epsilon \sim N(0, 0.2)$.
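
For concreteness, here is a minimal R sketch of that setup (the seed, the variable names, and reading $0.2$ as a standard deviation are my own choices, not part of the original example):

set.seed(1)                                # seed chosen arbitrarily, for reproducibility
x <- rnorm(100, mean = 1, sd = 1)          # X ~ N(1, 1)
y <- x^2 + rnorm(100, mean = 0, sd = 0.2)  # Y = X^2 + noise, treating 0.2 as the sd
quad_data <- data.frame(x = x, y = y)
full_fit <- lm(y ~ x, data = quad_data)    # the (misspecified) linear fit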

The idea of the rainbow test is that, when you "zoom in" on the curve $Y = X^2 + \epsilon$ (by looking only at the central data), the curve looks less curvy, more like a line, and so the model's fit improves. For instance, if we fit a linear model on a full dataset of 100 draws from the model above, here's what we get:

[Figure: linear regression on the full dataset]

By contrast, if we restrict to the points within 1 SD of the mean, here's what we get:

[Figure: linear regression on the central subset]

As you can see, the fit improves noticeably, and the restricted data also look visibly more consistent with a linear model.
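
To make that comparison concrete, here is a sketch continuing the simulation above; keeping points within 1 SD of the mean and comparing residual standard errors is my illustration of the idea, not exactly the quantity the rainbow test computes:

central <- subset(quad_data, abs(x - mean(x)) <= sd(x))  # the middle of the data
central_fit <- lm(y ~ x, data = central)
sigma(full_fit)     # residual standard error on the full data (larger)
sigma(central_fit)  # residual standard error on the central subset (smaller)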

On the other hand, if the true model were linear, we wouldn't expect the fit to get much better in this scenario. (It might get a little better, because we'd be fitting to fewer data points, but linear regression would converge to the same model on the restricted data as the full dataset, so in the limit you'd get the same model on both.)

The rainbow test basically quantifies how much better we'd expect the fit to get when we drop the outer data, under the null hypothesis that the true model is linear. If the true model is not linear, then the improvement will be larger than expected under that null.
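
You can check this with lmtest on the simulated data; the genuinely linear response below is my own addition for contrast, and the exact p-values will vary with the seed:

library(lmtest)
raintest(full_fit)   # quadratic truth: small p-value, linearity rejected
y_lin <- 2 * x + rnorm(100, mean = 0, sd = 0.2)  # a truly linear response
lin_fit <- lm(y_lin ~ x)
raintest(lin_fit)    # typically a large p-value: no evidence against linearity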


As for your specific question about the direction of the tests, the documentation for harvtest states:

The Harvey-Collier test performs a t-test (with parameter degrees of freedom) on the recursive residuals. If the true relationship is not linear but convex or concave the mean of the recursive residuals should differ from 0 significantly.

This means that a significant result lets you reject the null hypothesis that the true model is linear.
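
To connect that quote to your output, here is a hedged sketch of the equivalence on the simulated data. It assumes the strucchange package, whose recresid() computes recursive residuals; matching harvtest to a plain t-test is my reading of the documentation:

library(lmtest)
library(strucchange)        # for recresid()
harvtest(full_fit)          # the HC statistic is a t-statistic on df = n - k - 1
t.test(recresid(full_fit))  # t-test that the recursive residuals have mean 0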

Similarly, the documentation for raintest states:

The basic idea of the Rainbow test is that even if the true relationship is non-linear, a good linear fit can be achieved on a subsample in the "middle" of the data. The null hypothesis is rejected whenever the overall fit is significantly worse than the fit for the subsample.

This means that a significant result (rejecting the null) occurs when the fit is substantially better under the range restriction, which is what happens when the model is nonlinear.

So both tests suggest that the true model is not linear.