I learned about the spurious regression problem during a course at the Booth School of Business. It’s well known among econometricians because it appears in Hamilton’s classic text, but I don’t think it’s known more widely.

A first-order measure of association between two variables $x, y$ is their correlation. Equivalently, we can fit a univariate linear regression to the data:

$$ y = \alpha + \beta x. $$

If we have $N$ independent observations, then under a couple of mild assumptions we get a CLT:

$$ \sqrt{N}\,(\hat\beta - \beta) \Rightarrow \mathcal{N}\!\left(0,\ \sigma^2_{y|x} / \sigma^2_x\right), $$

where $\sigma^2_x = \operatorname{var}(x)$ and $\sigma^2_{y|x} = \operatorname{var}(y - \alpha - \beta x)$.

We can test for association ($\beta \neq 0$) using a standard F-test.
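As a quick sanity check on the i.i.d. case, here is a minimal numpy sketch (the `ols_beta` helper is something I’m defining for illustration, not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_beta(x, y):
    """Slope of the univariate OLS regression y = alpha + beta * x."""
    xc = x - x.mean()
    return np.dot(xc, y - y.mean()) / np.dot(xc, xc)

# With independent observations, beta_hat concentrates around the
# true beta at the sqrt(N) rate promised by the CLT.
N = 100_000
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)  # true beta = 3

beta_hat = ols_beta(x, y)
print(beta_hat)  # close to 3
```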

The independent observation assumption is crucial. Without it, you can get very surprising and unusual behavior.

Consider observations of pairs $(x_t, y_t)$, which are generated from random walks:

$$ x_t = x_{t-1} + u_t, \qquad y_t = y_{t-1} + w_t, $$

where the errors $u_t, w_t \sim \mathcal{N}(0, 1)$ are i.i.d. and independent of each other, and $x_0 = y_0 = 0$, so the two series are independent. Econometricians like to call processes like this “unit root” processes.

Theorem (Hamilton, 1994)

Suppose we have samples $\{(x_0, y_0), \dots, (x_T, y_T)\}$ generated as described above. Then

$$ \hat\beta \Rightarrow \frac{\int_0^1 W_1(r)\, W_2(r)\, dr}{\int_0^1 W_2(r)^2\, dr}, $$

where $W_1, W_2$ are independent Brownian motions [1].


Since the two series are independent, we might expect the regression coefficient to converge to zero, $\hat\beta \to 0$. But the theorem shows it actually converges in distribution to a nondegenerate random quantity. Since the limiting distribution doesn’t have a closed form, I plot a simulation of it below. It is not that unusual for the regression coefficient to converge to a number larger than 1 in absolute value.
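A sketch of how such a simulation might be produced, using only numpy (the `random_walk` and `ols_beta` helpers are names I’m introducing, not library functions):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_walk(T, rng):
    """Unit-root process: cumulative sum of N(0,1) shocks, x_0 = 0."""
    return np.cumsum(rng.normal(size=T))

def ols_beta(x, y):
    """Slope of the univariate OLS regression of y on x."""
    xc = x - x.mean()
    return np.dot(xc, y - y.mean()) / np.dot(xc, xc)

# Regress one independent random walk on another, many times over.
# If the usual CLT applied, these slopes would pile up near zero;
# instead they stay spread out no matter how large T is.
T, reps = 1000, 2000
betas = np.array([
    ols_beta(random_walk(T, rng), random_walk(T, rng))
    for _ in range(reps)
])
print(np.mean(np.abs(betas) > 1))  # a nontrivial fraction exceed 1
```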

image

Figure 1: Simulation of regression coefficient distribution

What does this mean? As the number of samples increases, $\hat\beta$ approaches a nonzero limit with probability one (the limiting distribution is continuous, so it puts no mass at zero). And so in the large-$n$ limit, the F-test will always reject the null hypothesis of no association. As a consequence, we can easily generate a simulation of two independent variables which the F-test says are certainly dependent!

image

Figure 2: Two independent random walks. n=100, R2=0.69, F-test p-value < 2.2e-16.
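Something like the figure above is easy to reproduce. Here is a hedged sketch (seed and sample size are arbitrary; `r_squared` is a helper I’m defining, not a library call):

```python
import numpy as np

rng = np.random.default_rng(7)

def r_squared(x, y):
    """R^2 of the univariate OLS regression of y on x."""
    xc, yc = x - x.mean(), y - y.mean()
    beta = np.dot(xc, yc) / np.dot(xc, xc)
    resid = yc - beta * xc
    return 1.0 - np.dot(resid, resid) / np.dot(yc, yc)

# Two independent random walks of length 100, as in the figure.
T = 100
x = np.cumsum(rng.normal(size=T))
y = np.cumsum(rng.normal(size=T))

# In levels the fit can be spuriously strong; differencing the
# series recovers the independent shocks, and the fit collapses.
r2_levels = r_squared(x, y)
r2_diffs = r_squared(np.diff(x), np.diff(y))
print(r2_levels, r2_diffs)
```

The second number is a useful contrast: the differenced series are genuinely i.i.d., so their $R^2$ stays near zero.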

Cointegration

So what to do? Clearly the F-test is the wrong tool for testing association between two nonstationary time series, since the specificity of the test is asymptotically zero. A better approach is to test for cointegration.

Two processes are cointegrated if they are each marginally nonstationary (unit root) processes, and there exists a constant $\gamma$ such that

$$ e_t = y_t - \gamma x_t $$

is stationary.
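A minimal way to manufacture a cointegrated pair under this definition (all parameter choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy cointegrated pair: x is a random walk, and
# y = gamma * x + (stationary AR(1) noise), so e = y - gamma * x
# is stationary even though x and y are both unit-root processes.
T, gamma = 5000, 2.0
x = np.cumsum(rng.normal(size=T))

e = np.empty(T)
e[0] = 0.0
for t in range(1, T):
    e[t] = 0.8 * e[t - 1] + rng.normal()  # stationary AR(1) noise

y = gamma * x + e

# The spread y - gamma * x has bounded variance; x itself does not.
spread_var = np.var(y - gamma * x)
print(spread_var, np.var(x))
```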

A classic example is bid/ask prices. The bid and the ask are the prices at which a market maker stands ready to buy and to sell, respectively. These prices generally differ; their difference is called the spread. But if they differ by too large an amount, someone else will inevitably enter the market to provide liquidity, and the spread will revert back toward zero. Thus the spread

$$ s_t = p^a_t - p^b_t $$

should be a stationary process.

image

Figure 3: Bid/ask prices and the bid/ask spread for a stock

To test for cointegration, there are two possibilities:

1. If the constant $\gamma$ is known (perhaps in the bid/ask scenario we are willing to assume $\gamma = 1$), compute the residual series $e_t$ and perform the Dickey-Fuller test for a unit root.

2. If the constant $\gamma$ is unknown, estimate $\hat\gamma$ by regression, then run a unit-root test on the fitted residuals. Because $\gamma$ is estimated, the usual Dickey-Fuller critical values no longer apply; this two-step procedure is the Engle-Granger test.
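For intuition, here is a bare-bones sketch of the Dickey-Fuller regression behind option 1, hand-rolled with numpy. A real analysis should use a statistics package, and the cutoff of roughly $-2.9$ mentioned below is only the approximate 5% critical value for the constant-only case:

```python
import numpy as np

def dickey_fuller_stat(e):
    """t-statistic for rho in: diff(e_t) = c + rho * e_{t-1} + err.
    Large negative values are evidence against a unit root."""
    de = np.diff(e)
    lag = e[:-1]
    X = np.column_stack([np.ones_like(lag), lag])
    coef, _, _, _ = np.linalg.lstsq(X, de, rcond=None)
    resid = de - X @ coef
    s2 = np.dot(resid, resid) / (len(de) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(5)
T = 1000
stationary = rng.normal(size=T)        # white noise: no unit root
walk = np.cumsum(rng.normal(size=T))   # unit root

# Roughly, values below about -2.9 reject a unit root at the 5% level
# (the statistic does not follow the usual t distribution).
print(dickey_fuller_stat(stationary))  # strongly negative
print(dickey_fuller_stat(walk))        # typically above the cutoff
```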

image

Figure 4: A spurious correlation


[1] I won’t get into what a Brownian motion is, or what an integral involving one means. You can look at the original derivation in the source.