Regression – How to Assess Significant Differences in Statistical Relationships Over Years

multiple regressionregressiontime series

I am measuring the productivity of a plant and several meteorological variables. For ease of discussion, let us reduce the variables to the following where the dependent variable y is the productivity of the plant and x1 is light availibility and x2 is water availibity (and x3 is a dummy variable for phenology). I have six years of measurements with hourly resolution, but I am working with daily averages here.

Now the following situation:

  • The productivity is unnaturally low in year 5.
  • It is likely that this is caused by damage to the plant in year 4 (lateral effect).

In order to "proof" this, I need to show that low productivity in year 5 is not caused by low light or low water availability. In other words I try to answer the following question:
How "likely" is the data of year 5 given the relationship of the variables in the other years (I assume it is unlikely).

I started by building a multiple linear regression with all my observations:

Call:
lm(formula = Productivity ~ Light + Water + Phenology, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5005 -1.0894  0.0600  0.9473  6.6023 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.2889532  0.1731193  -13.22   <2e-16 ***
Light        0.0067407  0.0003174   21.23   <2e-16 ***
Water        0.0749991  0.0072588   10.33   <2e-16 ***
Phenology    6.6646481  0.1365343   48.81   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.886 on 2187 degrees of freedom
  (1 Beobachtung als fehlend gelöscht)
Multiple R-squared:  0.8125,    Adjusted R-squared:  0.8123 
F-statistic:  3160 on 3 and 2187 DF,  p-value: < 2.2e-16

When we look at the results, we can see that the observed productivity is lower in year 5 than predicted.

plot(predict(mod))
points(data$Productivity, col = "red")

prediction

My idea now is to build a model with all observations except year 5 and then use that model to predict year 5. I would probably get a high RMSE or a low R-squared, for example. But that probably wouldn't answer my question about how unlikely the year 5 data is given the relationship of the variables in the other years, would it?. Do you have a better idea?

I am also afraid that the model is just following the phenological cycle of the plant. Should I remove the seasonality from the data? What do you think?

Best Answer

You're on the right track, and you can answer your question using some old-fashioned tools.

In your current model, the intercept reflects the predicted average productivity when Light, Water and Phenology are all at zero. You can make this a little more intuitive by centring these variables to have a mean of zero. Then, the intercept is just the average baseline productivity.

Your hypothesis is that this baseline is lower in year 5. You can test this by creating a new variable, year5, which has a value of 1 for data from year 5 and 0 otherwise, and including it as a predictor. The weight given to this predictor will reflect changes in productivity in year 5 that are not explained by changes in light, water, or phenology (whatever that is).

Take care, though, in how you interpret p-values from any analysis like this. If this pattern in the data is the reason you've decided to do this analysis in the first place (the data generated the hypothesis), this remains an exploratory analysis unless you can find some fresh, independent data to confirm your results on.

Related Question