Selecting the best time-lagged moving average for time series analysis

feature selection, moving average, time series

I am studying the effect of weather on agricultural outputs. I have yield data from one farm over 5 years and a number of weather inputs (rainfall, temperature, soil moisture, etc.) for the entire period. Both yield and inputs are available at the daily level. There are clear seasonal trends that I can account for pretty well using factor variables for months and days of the week.

Because I plan to apply this model to new farms for which I have no past yield data, I am loath to use a traditional ARIMA time series model. For those farms I will have the history of the inputs, but not the history of the output on which an ARIMA model would be based.

I am operating under the assumption that some function of past weather inputs is predictive of current yields. For example, the correlation between yield and the rolling average of the last n days of rainfall increases with n for about 2 months before it starts to decline.
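A minimal sketch of that kind of scan, for illustration only: it computes the correlation between yield and the trailing n-day mean of rainfall for a range of n. The data are synthetic, the names `rolling_mean` and `correlation_by_window` are hypothetical, and the window is assumed to include the current day.

```python
import numpy as np

def rolling_mean(x, n):
    """Trailing n-day mean; the first n-1 entries are NaN (not enough history)."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    csum = np.cumsum(np.insert(x, 0, 0.0))
    out[n - 1:] = (csum[n:] - csum[:-n]) / n
    return out

def correlation_by_window(x, y, windows):
    """Pearson correlation between y and the trailing n-day mean of x, for each n."""
    y = np.asarray(y, dtype=float)
    result = {}
    for n in windows:
        ma = rolling_mean(x, n)
        ok = ~np.isnan(ma)  # skip days without a full window
        result[n] = np.corrcoef(ma[ok], y[ok])[0, 1]
    return result

# Synthetic stand-in for the real data: yield is driven by the mean of the
# last 30 days of rainfall plus noise (an assumption made for this example).
rng = np.random.default_rng(0)
rain = rng.gamma(shape=2.0, scale=3.0, size=1500)
signal = rolling_mean(rain, 30)
signal = np.nan_to_num(signal, nan=np.nanmean(signal))
yield_ = signal + rng.normal(0.0, 0.2, size=1500)

corrs = correlation_by_window(rain, yield_, windows=range(5, 91, 5))
best_n = max(corrs, key=corrs.get)  # window length with the highest correlation
```

Plotting `corrs` against n would show the rise-then-decline pattern described above, with the peak at the window length that best matches the underlying dependence.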

I am testing a number of different machine learning algorithms, namely standard OLS, Random Forest, and the Elastic Net, to make predictions. My main question is: what is the best way to determine the appropriate lag for the input variables? Currently, for each feature, I use the lagged moving average that has the highest correlation with the output.

Would you use a different structure for the random forest since it doesn't have built-in linearity assumptions?

Best Answer

This is only a partial answer, plus a few comments.


"There are clear seasonal trends that I can account for pretty well using factor variables for months and days of the week."

Why account for days of the week? Is there a reason why weather should follow a weekly pattern? On the other hand, the agricultural production might have a weekly pattern if the farmer likes to take a day off on Sunday.

Also, why use monthly seasonality? You would not expect abrupt changes when the calendar month changes, would you? If you want to account for the effect of the moon's cycle (which may be relevant for agriculture), that would require using 29.5-day seasonality.

Thus when it comes to seasonality, I would consider using Fourier terms (perhaps one set for the solar and another set for the lunar calendar) as a compromise between smoothness and flexibility.
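As a rough sketch of what such Fourier terms could look like, the snippet below builds sine and cosine columns for a solar (annual) and a lunar (29.5-day) cycle. The function name `fourier_terms` and the number of harmonics per cycle are assumptions chosen for illustration; in practice the order would be tuned, for example by cross-validation.

```python
import numpy as np

def fourier_terms(t, period, order):
    """Return sin/cos columns for harmonics 1..order of the given period.

    t is a time index in days; the output has shape (len(t), 2 * order).
    """
    t = np.asarray(t, dtype=float)
    cols = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)

days = np.arange(5 * 365)  # five years of daily observations
solar = fourier_terms(days, period=365.25, order=3)  # annual cycle, 3 harmonics (assumed)
lunar = fourier_terms(days, period=29.5, order=1)    # lunar cycle, 1 harmonic (assumed)
seasonal_features = np.hstack([solar, lunar])        # 8 smooth seasonal regressors
```

These columns can replace the month dummies in the design matrix: a low order gives a smooth seasonal curve, and a higher order adds flexibility at the cost of more parameters.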


When using elastic net regression, it is better to include too many lags than too few. The coefficients of the less relevant lags will be penalized towards zero.
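To illustrate the idea, here is a minimal sketch using scikit-learn's `ElasticNetCV` on synthetic data. The helper `lag_matrix`, the maximum lag of 30 days, the `l1_ratio` of 0.5, and the simulated dependence on lags 3 and 10 are all assumptions made for the example, not part of the original problem.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import TimeSeriesSplit

def lag_matrix(x, max_lag):
    """Columns are x lagged by 0..max_lag days; rows without full history are dropped."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.column_stack([x[max_lag - k : n - k] for k in range(max_lag + 1)])

# Synthetic example: the output depends only on lags 3 and 10 of the input.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
max_lag = 30
X = lag_matrix(x, max_lag)                  # deliberately generous set of lags
y = 0.5 * X[:, 3] + 0.3 * X[:, 10] + rng.normal(0.0, 0.1, size=X.shape[0])

# Time-ordered cross-validation chooses the penalty strength.
model = ElasticNetCV(l1_ratio=0.5, cv=TimeSeriesSplit(n_splits=5))
model.fit(X, y)

# Lags ranked by absolute coefficient; irrelevant lags end up at or near zero.
ranked_lags = np.argsort(-np.abs(model.coef_))
```

With real data the same pattern applies per weather variable: stack the lag matrices of all inputs side by side and let the penalty decide which lags matter.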
