Linear Regression – Using Date as Dummy Variable in Linear Regression Analysis

python, r

I have a question about including the date (year, month) as dummy 'X' variables in a linear regression in Python. The problem arises when I forecast the 'y' variable for a date whose 'year' is new. How should I proceed?

E.g., suppose my only Y variable is sales and my only X variable is 'date'. If X is dummy-coded into 'year' and 'month' (with data from 2016 to 2018) and I forecast one year ahead, which includes 2019, how do I choose the dummy coding for a year that never appeared in the data?

I would also like to understand whether it is good practice to dummy-code the date into binary categorical variables, or whether it should instead be converted to something like a cubic spline.

Best Answer

Don't use the date or the year as a dummy variable. Don't, don't, don't.

Dummy coding is used for categorical data, e.g., car brands or hair colors. Dates and years aren't. They are interval scaled. Interval scaled data should be translated into a single predictor that counts the number of days, years (or seconds) since an arbitrary origin. (The choice of origin will influence your intercept parameter estimate.)
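To make the single-predictor idea concrete, here is a minimal sketch in plain Python. The sales figures, the origin of January 2016, and the helper names `months_since` and `fit_line` are all made up for illustration; in practice you would likely use a regression library instead of the hand-written least-squares fit.

```python
from datetime import date

def months_since(origin: date, d: date) -> int:
    """Number of whole calendar months from origin to d."""
    return (d.year - origin.year) * 12 + (d.month - origin.month)

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical monthly sales for 2016-2018 with a made-up linear trend.
origin = date(2016, 1, 1)  # arbitrary origin; it only shifts the intercept
dates = [date(y, m, 1) for y in (2016, 2017, 2018) for m in range(1, 13)]
t = [months_since(origin, d) for d in dates]
sales = [100.0 + 2.0 * ti for ti in t]

intercept, slope = fit_line(t, sales)

# A date in 2019 is simply a larger value of the same predictor,
# so no new dummy column is needed.
t_new = months_since(origin, date(2019, 6, 1))
forecast = intercept + slope * t_new
```

Because 2019 is just a point further along the same numeric axis, the "unseen year" problem from the question does not arise.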

In forecasting out, it is often good practice to not extrapolate this predictor linearly, but to dampen it. This has the effect of dampening any trend your model fits.
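One possible way to dampen the predictor, sketched below, is to shrink each additional step beyond the last observed time by a factor `phi`. This is an assumption about what "dampen" could look like, not the only option; the function name `damped_time`, the value `phi = 0.9`, and the coefficients are hypothetical.

```python
def damped_time(t_last: float, h: int, phi: float = 0.9) -> float:
    """Time predictor h steps past t_last, with each extra step shrunk by phi.

    The increment is phi + phi**2 + ... + phi**h instead of h, so for
    0 < phi < 1 it levels off at phi / (1 - phi). With phi = 1 this is
    plain linear extrapolation.
    """
    return t_last + sum(phi ** k for k in range(1, h + 1))

# Hypothetical fitted coefficients and last observed time index.
intercept, slope = 100.0, 2.0
t_last = 35  # e.g. December 2018 when counting months from January 2016
h = 12       # forecast 12 months ahead

linear_forecast = intercept + slope * (t_last + h)
damped_forecast = intercept + slope * damped_time(t_last, h)
```

With a positive slope, the damped forecast stays below the linear one, and the gap widens as the horizon `h` grows.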

Even better, take a look at a standard forecasting textbook, like this one.