Solved – Contradictory results between Breusch-Pagan test and Goldfeld-Quandt test in Python

heteroscedasticity, hypothesis-testing, mathematical-statistics, regression, variance

I am reading the regression diagnostics page for statsmodels in Python.

Under the heteroskedasticity tests, they introduce two tests: the Breusch-Pagan test and the Goldfeld-Quandt test.

From my understanding, the null hypothesis of both tests asserts that heteroskedasticity does not exist.
However, on the webpage, the tests give p-values of 0.08794028782673029 and 0.3820295068692507, respectively.
This means that the Breusch-Pagan test suggests that heteroskedasticity exists (at the 10% level), whereas the Goldfeld-Quandt test suggests that it does not.

What is happening here? Why would they give contradictory results?

Best Answer

Because the tests look at different ways in which heteroskedasticity can manifest itself; hence, a given data set may "look" heteroskedastic to one test but not to another.

A bit more specifically, the Breusch-Pagan test (BP) looks at whether the squared residuals can be explained by observed regressors $z_i$, while the Goldfeld-Quandt test (GQ) orders the observations by some variable, splits the sample, and compares the residual variances across the two subsamples. Hence, it is conceivable that BP picks up heteroskedasticity through its relation to a variable that did not serve as the splitting variable in GQ, which GQ then cannot detect.
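In their standard forms (the output below shows the studentized BP variant, which uses the $nR^2$ statistic), the two statistics are:

$$\hat{u}_i^2 = \gamma_0 + z_i'\gamma + v_i, \qquad \mathrm{BP} = nR^2 \;\overset{a}{\sim}\; \chi^2_{\dim(z_i)}$$

$$\mathrm{GQ} = \frac{\hat{\sigma}_2^2}{\hat{\sigma}_1^2} \sim F_{n_2-k,\; n_1-k}$$

where $R^2$ comes from the auxiliary regression of the squared OLS residuals on $z_i$, and $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$ are the residual variances from separate fits on the two subsamples obtained after ordering the data by the splitting variable.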

Here is a little example (code below - in R though, I do not know Python):

[Figure: OLS residuals plotted against x1 (left panel, green) and x2 (right panel, red); the spread of the residuals increases with x1 but is roughly constant across x2.]

The errors are generated so that heteroskedasticity arises from x1: in the left-hand panel, the variance of the residuals increases with x1, but in the right-hand panel it does not vary with x2. So when you run GQ splitting the sample according to x2, the test has nothing to pick up, while splitting by x1 it does. Hence, not only can BP and GQ contradict each other, so can different versions of GQ.

The same behavior can of course be produced with the BP test, depending on the specification of the auxiliary regression; again, see the example code below.

library(lmtest)

n  <- 10000
x1 <- 3 + rnorm(n)
x2 <- rnorm(n)
u  <- x1 * rnorm(n)  # error standard deviation grows with x1 -> heteroskedasticity via x1
y  <- u              # no signal, so the residuals essentially recover u

reg <- lm(y ~ x1 + x2)
par(mfrow = c(1, 2))
plot(x1, resid(reg), cex = .5, col = "green")  # spread fans out with x1
plot(x2, resid(reg), cex = .5, col = "red")    # spread constant across x2

gqtest(reg, order.by = x1) # split according to variable that reveals heteroskedasticity
gqtest(reg, order.by = x2) # split does not reveal heteroskedasticity, leading to higher p values

bptest(reg) 
bptest(reg, varformula = ~x1) # auxiliary regression that can pick up the heteroskedasticity
bptest(reg, varformula = ~x2) # this one cannot - leading to higher p-value

Output:

> gqtest(reg, order.by = x1)

    Goldfeld-Quandt test

data:  reg
GQ = 2.908, df1 = 4997, df2 = 4997, p-value < 2.2e-16
alternative hypothesis: variance increases from segment 1 to 2


> gqtest(reg, order.by = x2)

    Goldfeld-Quandt test

data:  reg
GQ = 1.0519, df1 = 4997, df2 = 4997, p-value = 0.03685
alternative hypothesis: variance increases from segment 1 to 2


> bptest(reg) 

    studentized Breusch-Pagan test

data:  reg
BP = 1214.4, df = 2, p-value < 2.2e-16


> bptest(reg, varformula = ~x1)

    studentized Breusch-Pagan test

data:  reg
BP = 1213.2, df = 1, p-value < 2.2e-16


> bptest(reg, varformula = ~x2) 

    studentized Breusch-Pagan test

data:  reg
BP = 2.0869, df = 1, p-value = 0.1486

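Since the question is about statsmodels, here is a rough Python translation of the experiment above (a sketch only, as I work in R; statsmodels' het_goldfeldquandt and het_breuschpagan are the counterparts of gqtest and bptest, and the exact numbers will differ since the data are random):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_goldfeldquandt

rng = np.random.default_rng(42)
n = 10000
x1 = 3 + rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = x1 * rng.standard_normal(n)  # error sd scales with x1 -> heteroskedasticity

X = sm.add_constant(np.column_stack([x1, x2]))  # columns: const, x1, x2
res = sm.OLS(y, X).fit()

# GQ, ordering by x1 (column 1): detects the heteroskedasticity
print(het_goldfeldquandt(y, X, idx=1, alternative="increasing"))
# GQ, ordering by x2 (column 2): nothing to detect -> higher p-value
print(het_goldfeldquandt(y, X, idx=2, alternative="increasing"))

# BP with different auxiliary regressors; returns (LM stat, LM p-value, F stat, F p-value)
print(het_breuschpagan(res.resid, X))                    # all regressors: detects it
print(het_breuschpagan(res.resid, sm.add_constant(x1)))  # x1 only: detects it
print(het_breuschpagan(res.resid, sm.add_constant(x2)))  # x2 only: higher p-value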
In general, I would say it is to be expected that different widely used tests sometimes give different answers. If they never did, I would expect one of the tests to be superseded by the other, based on considerations such as ease of computation, the reputation of the authors who published the tests, discussion in well-known textbooks, availability of convenient software, etc.