Regression – Should a Date Variable Be Used in Regression Analysis?

Tags: r, regression, time series

I'm not used to using variables in the date format in R. I'm just wondering if it is possible to add a date variable as an explanatory variable in a linear regression model. If it's possible, how can we interpret the coefficient? Is it the effect of one day on the outcome variable?

See my gist for an example of what I'm trying to do.

Best Answer

Building on earlier comments on Stack Overflow:

Yes, it makes sense. Here I address the general question and am happy to let R experts fill in the crucial details. In my view, as this is now on Cross Validated, we should not focus too narrowly on the poster's favourite software, important though that is for like-minded people.

Dates in any software, if not already numeric, can be converted to numeric variables, expressed in years, days, milliseconds or whatever since some time origin. The coefficient on a date variable has denominator units which are whatever the units of the date are; the numerator units are those of the response or dependent variable. So with dates measured in days, the coefficient is indeed the change in the response per day. (Non-identity link functions complicate this, naturally.)
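As a minimal sketch of that conversion (in Python rather than R, purely for illustration), the dates, the origin and the response values below are all made up, and `fit_line` is a hypothetical helper for one-predictor least squares. It shows the slope being "response units per day" or "response units per year", depending only on how the date was made numeric.

```python
from datetime import date, timedelta

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: five observations, 10 days apart,
# with a response that rises by exactly 0.5 units per day.
origin = date(2013, 1, 1)  # assumed time origin
dates = [origin + timedelta(days=10 * k) for k in range(5)]
response = [5.0 + 0.5 * 10 * k for k in range(5)]

# Date as a number: days elapsed since the origin.
days = [(d - origin).days for d in dates]
_, slope_per_day = fit_line(days, response)

# The same dates in years: only the units of the slope change.
years = [d / 365.25 for d in days]
_, slope_per_year = fit_line(years, response)

print(slope_per_day)   # change in response per day
print(slope_per_year)  # change in response per year
```

The two slopes describe the same trend; the second is just the first multiplied by the number of days in a year.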

However, interpretation is usually easiest when dates are shifted to an origin that suits the study. Usually, but not necessarily, the origin should be a date within the time period of study or very close to it.

Perhaps the simplest case is linear regression on a date variable in years. Here a regression of some response on dates such as 2000 or 2010 implies an intercept which is the predicted value of the response in year 0. Setting aside the calendrical detail that there was no such year, such an intercept is often absurdly large positive or negative, which is logical but a distraction in interpretation and presentation (even to well-informed audiences).

In a real example from working with undergraduate students, the number of cyclones per year in a certain area was increasing slightly with the date and a linear trend looked a reasonable first stab. The intercept from regression was a large negative number, which caused much puzzlement until it was realised that this was, as always, an extrapolation to year 0. Shifting the origin to 2000 produced results that were much easier to interpret: the slope is unchanged, but the intercept becomes the predicted value for 2000. (Actually, a Poisson regression ensuring positive predictions was even better, but that's a different story.)

Regressing on date - 2000 or whatever is thus a good idea. The substantive details of a study often indicate a good base date, i.e. a new origin.
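To make the effect of shifting the origin concrete, here is a small sketch in Python. The yearly counts are invented for illustration (they are not the cyclone data mentioned above), and `fit_line` is a hypothetical helper for one-predictor least squares.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up yearly counts with a slight upward trend.
years = list(range(2000, 2010))
counts = [12, 13, 12, 14, 15, 14, 16, 15, 17, 17]

# Regress on the raw year: the intercept is an extrapolation to year 0.
intercept_raw, slope_raw = fit_line(years, counts)

# Regress on (year - 2000): the intercept is the fitted value for 2000.
base_year = 2000  # assumed base date for this illustration
shifted_years = [yr - base_year for yr in years]
intercept_shifted, slope_shifted = fit_line(shifted_years, counts)

print(intercept_raw, slope_raw)          # large negative intercept
print(intercept_shifted, slope_shifted)  # intercept near the counts around 2000
```

The fitted line is the same in both cases; only the point at which the intercept is read off has moved, so the two intercepts differ by exactly the slope times 2000.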

The use of other models and/or other predictors doesn't undermine this principle; it just obscures it.

It is also a good idea to graph results using whatever dates are easiest to think about. These may be the original dates; that's not a contradiction, as it is just the same principle of using whatever is easiest to think about.

A little thought shows that the principle is much more general. We are often better off with (age - 20) or some such, to avoid logical but awkward predictions for age 0.

EDIT 21 March 2019 (original 29 Jul 2013): These arguments have been discussed in a Stata context in Cox, N.J. 2015. Species of origin. Stata Journal 15: 574-587.

EDIT 2 (also 4 Dec 2015): @whuber, in comments, also raises the important issue of numerical precision. Often the time units are fine-grained and the resulting dates or date-times are very large numbers, which causes trouble for sums of squares and similar calculations. He gives an example from R. To that we can add (e.g.) that date-times in Stata are milliseconds since the start of 1960. This problem is not at all specific to dates, as it can arise generally with numbers that are very big or very small, but it is worth flagging too.
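A rough sketch of how that precision problem can bite, in Python with ordinary double-precision floats. This is not @whuber's R example; the timestamps are hypothetical and the "shortcut" formula is just one common way of computing a sum of squares about the mean.

```python
# Hypothetical date-times: five readings one second apart, stored as
# milliseconds since some origin and held as double-precision floats.
base_ms = 1.7e12  # assumed magnitude: roughly 54 years' worth of milliseconds
x = [base_ms + 1000.0 * k for k in range(5)]
n = len(x)
mean_x = sum(x) / n

# Shortcut formula: sum(x^2) - n * mean^2. Both terms are near 1e25,
# far beyond the ~16 significant digits a double can hold, so the
# subtraction loses the part of the answer that matters.
naive_sxx = sum(xi * xi for xi in x) - n * mean_x * mean_x

# Centring first keeps every intermediate number small.
centred_sxx = sum((xi - mean_x) ** 2 for xi in x)

# Shifting the origin to the first reading helps in the same way,
# even with the shortcut formula.
shifted = [xi - x[0] for xi in x]
mean_s = sum(shifted) / n
shifted_sxx = sum(s * s for s in shifted) - n * mean_s * mean_s

print(naive_sxx)    # unreliable
print(centred_sxx)  # correct sum of squares about the mean
print(shifted_sxx)  # agrees with the centred value
```

Good regression software centres or otherwise guards against this internally, but moving the origin close to the data removes the risk at source.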