Cointegration – Impact of Series Order in Cointegration Tests

Tags: augmented-dickey-fuller, cointegration, python, statsmodels, time-series

Update 2022-07-13

  • My question could be rephrased as "Why do I get a different result for a cointegration test when I swap the independent and dependent variables?" That thread on this platform has been answered, but it leaves me with the practical problem of what to do when coint(series1, series2) is significant and coint(series2, series1) is not. The magnitude of the difference also came as a surprise; see USO vs. EWC and EWC vs. USO in the updated example below.
  • I have made the example below accessible via Google Colab.
  • I have updated the example to use a different timeframe and different stocks to show how large the difference in p-value can get between coint(series1, series2) and coint(series2, series1).
  • Following a suggestion in the comments, I have expanded the example by running the regressions manually, applying Augmented Dickey-Fuller tests to the spread, and visualizing the residuals/spread.

Summary
I have trouble understanding why the p-value of a cointegration test is not symmetric, i.e., why the p-value of coint(series1, series2) is not equal, or at least very close, to the p-value of coint(series2, series1). Why does the order matter here? I am using statsmodels.tsa.stattools.coint to test for cointegration.

Details
I am analyzing stock price data. I have created a small toy project below to showcase the issue. The last output is a heatmap that shows the p-value of the cointegration test for each pair of stocks. My expectation was a symmetric matrix, or at least p-values that are very close for coint(stock1, stock2) and coint(stock2, stock1). It seems I am missing a fundamental piece of knowledge to understand this behavior.

import itertools

import pandas as pd
import seaborn as sns
import yfinance as yf

from matplotlib import pyplot as plt
from statsmodels.tsa.stattools import coint


stocks = ["SILJ", "USO", "EWA", "EWC"]
df = yf.download(stocks, interval="1d", start='2020-01-01', end='2022-01-01')["Adj Close"] # daily prices
df.head(3)

Table with the first 3 rows of stock price data

df.plot(figsize=(16, 8))

Plot of the price series of stocks.

pairs = list(itertools.combinations(stocks, 2))

# Create a dataframe to hold the heat map of cointegration p-values
df_coint = pd.DataFrame(index=stocks, columns=stocks, dtype=float)
df_coint.fillna(0, inplace=True)

# analyze cointegration in both directions for each pair
for pair in pairs:
    # .loc avoids chained assignment, which newer pandas versions reject
    df_coint.loc[pair[1], pair[0]] = coint(df[pair[0]], df[pair[1]])[1]  # p-value
    df_coint.loc[pair[0], pair[1]] = coint(df[pair[1]], df[pair[0]])[1]  # p-value

fig, ax = plt.subplots(figsize=(16,8)) 
sns.heatmap(df_coint, vmax=.05, cmap="crest", annot=True)
ax.set_title("p-value of the prices being cointegrated")
plt.show()

heatmap of pairwise pvalues for cointegration

User Christoph Hanck suggested in the comments running the linear regressions manually and plotting the residuals. I get the same asymmetry when I run a linear regression and then apply the Augmented Dickey-Fuller test to check whether the spread is stationary. I used sklearn.linear_model.LinearRegression and statsmodels.tsa.stattools.adfuller.

from sklearn.linear_model import LinearRegression
from statsmodels.tsa.stattools import adfuller
pair = ("USO", "EWC")

# Calculate the hedge ratio with a regression on the price data.
# LinearRegression().fit() expects 2D numpy arrays. Series.values gives
# the values as a 1D numpy array, so we use reshape(-1, 1) to turn each
# into a single-column 2D array.
a_0 = df[pair[0]].values.reshape(-1, 1)
a_1 = df[pair[1]].values.reshape(-1, 1)

lr = LinearRegression()
lr.fit(a_0, a_1) # lr.fit(X, y)

hedge_ratio = lr.coef_[0][0]
intercept = lr.intercept_[0]

spread = df[pair[1]] - (df[pair[0]] * hedge_ratio + intercept)

# Augmented Dickey-Fuller test to check if the spread is stationary.
# If yes, the series are cointegrated.
pvalue = adfuller(spread)[1]
print(f"Spread stationary at p_level .05: {pvalue < .05}; pvalue: {pvalue}")

Spread stationary at p_level .05: False; pvalue: 0.7377835885339065

spread.plot(figsize=(16,8))
plt.axhline(spread.mean(), color='black') 
plt.xlabel("time")
plt.legend([f"df[{pair[1]}] - (df[{pair[0]}] * hedge_ratio + intercept)", "average spread"])
plt.show()

spread1

lr = LinearRegression()
lr.fit(a_1, a_0) # lr.fit(X, y)

hedge_ratio = lr.coef_[0][0]
intercept = lr.intercept_[0]

spread = df[pair[0]] - (df[pair[1]] * hedge_ratio + intercept)

# Augmented Dickey-Fuller test to check if the spread is stationary.
# If yes, the series are cointegrated.
pvalue = adfuller(spread)[1]
print(f"Spread stationary at p_level .05: {pvalue < .05}; pvalue: {pvalue}")

Spread stationary at p_level .05: True; pvalue: 0.00023488247539811303

spread.plot(figsize=(16,8))
plt.axhline(spread.mean(), color='black') 
plt.xlabel("time")
plt.legend([f"df[{pair[0]}] - (df[{pair[1]}] * hedge_ratio + intercept)", "average spread"])
plt.show()

spread2

What else I have tried

  • I have run this on many more time series, some with an order of magnitude more data points. The same asymmetry appears.

Best Answer

The described effect is explained by the ordinary least squares (OLS) regression that is run before the Augmented Dickey-Fuller test is conducted (this also happens internally when using statsmodels.tsa.stattools.coint).
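To see that the direction dependence really comes from this OLS step, the two-step idea can be reproduced directly. The sketch below is only an approximation of what coint does internally (it uses plain ADF p-values, whereas coint applies MacKinnon critical values adjusted for estimated residuals and handles lag selection differently); the helper name engle_granger_pvalue is my own, and the data is synthetic so the snippet runs standalone.

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))          # a random walk (unit root)
y = 0.5 * x + rng.normal(scale=5, size=500)  # noisy linear combination of x

def engle_granger_pvalue(dep, indep):
    # Step 1: OLS regression of dep on indep (with intercept)
    resid = sm.OLS(dep, sm.add_constant(indep)).fit().resid
    # Step 2: ADF test on the residuals
    return adfuller(resid)[1]

print(engle_granger_pvalue(y, x))  # y projected on x
print(engle_granger_pvalue(x, y))  # x projected on y: different residuals, different p-value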

I want to share two sources that explain what is happening:

However, if the linear combination (more precisely, the coefficients on the different variables) is not known in advance, it has to be estimated. One way of doing that is by regressing one variable on the others by ordinary least squares. This particular estimation technique yields an effect of "direction"; it appears from the formulation of the estimation problem. The errors of y projected on x are not the same as those of x projected on y.

Source: Why do I get a different result for cointegration test when I swap the independent and dependent variables?
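For the one-regressor case, this direction effect can be made concrete. The two OLS slopes are

$$\hat\beta_{y\mid x} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}, \qquad \hat\beta_{x\mid y} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(y)},$$

so their product is $\hat\beta_{y\mid x}\,\hat\beta_{x\mid y} = r^2 \le 1$. The slopes are reciprocals only when the correlation is perfect ($|r| = 1$); otherwise the two fitted lines, and hence the two residual series handed to the ADF test, genuinely differ.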

Implications of using ordinary least squares: You might have noticed that since we used ordinary least squares (OLS) to find our hedge ratio, we will get a different result depending on which price series we use for the dependent (y) variable, and which one we choose for the independent (x) variable. The different hedge ratios are not simply the inverse of one another, as one might reasonably and intuitively expect.

Source: https://robotwealth.com/practical-pairs-trading/
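This is easy to check numerically. The snippet below reuses the df and pair already defined in the question; numpy.polyfit returns the OLS slope, and the product of the two slopes equals the squared correlation rather than 1:

import numpy as np

b_yx = np.polyfit(df[pair[0]], df[pair[1]], 1)[0]  # slope of EWC regressed on USO
b_xy = np.polyfit(df[pair[1]], df[pair[0]], 1)[0]  # slope of USO regressed on EWC

print(b_yx, 1 / b_xy)  # not the same number
print(b_yx * b_xy)     # equals r**2, i.e. <= 1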

I have made this visible by extending the example from my original question; the code is also on Google Colab. Note how the OLS regression line differs depending on which series is declared the dependent variable and which the independent variable.

spread1

sns.jointplot(x=df[pair[0]], y=df[pair[1]], height=8, kind="reg")
plt.show()

jointplot1

spread2

sns.jointplot(x=df[pair[1]], y=df[pair[0]], height=8, kind="reg")
plt.show()

jointplot2
