Solved – Regression with autocorrelated, lagged independent variable

regressiontime series

Dear Cross Validated Community

I have a question about handling dependencies in time-series regression. This is not an urgent issue. However, it would be nice to have a little discussion here – if someone is intereste in this topic and has time to answer.

Here is the situation:
I have land surface temperature (LST) and NDVI data for a 600×600 pixel scene (it does not matter what NDVI is; for an explanation, see [1]). This datasets spans 15 years, one observation every 16th day. Further, I have a classification dataset that shows different types of vegetation, like 'forest', 'crop', 'grassland' etc.

From the NDVI time series, I extracted phenological metrics for every pixel and every year: Start of Season (SoS), End of Season (EoS) and Length of Season (LoS) (See [2] for more detailed information). Let's talk only about SoS, to simplify the problem.

The SoS should not be autocorrelated for the respective pixels, there is only one value per year, but there might be a trend. Now, I want to do a linear regression for every pixel where the dependent variable is the SoS and the independent variable is the LST at the same pixel. Obviousely, the SoS is not only dependent on the LST at the day of year (t), but also on previous values (t-1, t-2, …).

My idea: Investigate the dependencies for the whole dataset to derive a 'global rule'.

  • Let $SOS(p_{i}, s_j)$ be a vector containing the SoS of pixel $p_i$ and season $s_j$.
  • $B$ is a vector containing the coefficients of the regression
  • $LST$ is a matrix containing $LST(p_i, s_j, t_k)$, the LST at a pixel $p_i$, season $s_j$ and time $t_{0-k}$:

Estimating $B$ for time lags $t$, $t-1$,…,$t-k$:

$$\begin{pmatrix}
SoS(p_1, s_1)\\
SoS(p_1, s_2)\\
…\\
SoS(p_n, s_m)
\end{pmatrix}
=
\begin{pmatrix}
1 \ \ \ LST(p_1, s_1, t_0) \ \ \ LST(p_1, s_1, t_{-1}) \ \ \ …\\
1 \ \ \ LST(p_1, s_2, t_0) \ \ \ LST(p_1, s_2, t_{-1}) \ \ \ …\\
…\\
1 \ \ \ LST(p_n, s_m, t_0) \ \ \ LST(p_n, s_m, t_{-1}) \ \ …
\end{pmatrix}
\begin{pmatrix}
B_{0}\\
B_{t}\\
…\\
B_{t-k}
\end{pmatrix}$$

I didn't try this yet. If there is no significance at any time lag, I could convert the model to a mixed effects model, such that the intercept follows a gaussian distribution. Or I could even go further and include the vegetation type as factor, because the dependency on the LST could be different for different vegetation types (I somehow have the feeling that I'm not taking into account the autocorrelation of the LST).

Once I know until which time lag $l$ the dependency is significant, I can use this knowledge to do a regression for every point, where the independent variable is $$LST^* = B_tLST_t+B_{t-1}LST_{t-1}+…+B_{t-l}LST_{t-l}$$

$$SoS=a_0+a_1*LST^*$$

This smells like a time series problem, but I'm dealing with it in a classic regression mindset… The SoS is static for a certain pixel and season, maybe that's why i'm confused.

  • Does this make ANY sense at all?
  • Is there an easyer way to deal with such problems?
  • What do you think in general about this approach?

I know, this is a lot of text. I would also be happy if someone could give me a keyword to look it up myself.

Cheers,
Basil


[1]
Normalized Difference Vegetation Index (NDVI) it is a proxy for vegetation activity or 'greenness'. The index is usually built from remote sensing data (satellites, airplanes), including the RED and the NIR band, where RED is the wavelength of red light (about 660nm) and NIR is the near infrared wavelength (abut 800nm). NDVI = (NIR-RED)/(NIR+RED). The NDVI shows a strong seasonality.

[2]
The values indicate the day of year of the respective event, or the length from SoS to EoS in the case of LoS. They were derived by smoothing the time series and extracting the day of year at the 50% treshold for each year and each pixel.

Best Answer

Ordinary Least Squares estimation is inadequate in the existence of heteroscedasticity- unequal and out-of-diagonal covariance terms among the observations.

In your example, LST is a time-series which you assume to have autoregressive nature. Therefore, when using successive terms from this time series as regressors, you should be aware the OLS may be inefficient and yield over-optimistic error estimates. Two methods that address these issues are:

  1. Generalized Least Squares - This approach employs the observation correlation matrix (or in most practical cases, an estimate of the correlation matrix) and can be intuitively explained as equivalent to performing the regression in a transformed vector space where the observations are uncorrelated.

  2. OLS with robust variance estimation - With this approach, the regression coefficients (B in your example) are calculated the same way as with OLS, but the error terms' variance is estimated in any of the available techniques that make the estimation more robust to the existence of correlations and unequal variances.

Googling "Generalized least squares" or "robust variance estimation" should get you useful resources. In addition, see Gretl User's Guide, chapter 17.