Is time of day (a predictor in regression) a categorical or a continuous variable?

categorical data, circular statistics, multiple regression, regression

I am trying to perform multiple regression. One of the feature variables is time of day, coded as an integer from 0 to 23. I am confused as to whether I need to use dummy coding or not. Is this a categorical variable or a continuous variable?

Best Answer

It is neither. Or rather, it is whatever your model formula makes it: there are more than two possibilities, and there is not necessarily one correct answer among them!

If you make it categorical, your model will have a separate, independent coefficient (or more precisely, degree of freedom) for each hour of the day. With 24 levels, that could be too many parameters to estimate from limited data, in which case you could divide the day into halves or quarters instead of 24 hourly bins.
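To make the categorical option concrete, here is a minimal sketch of dummy coding in plain Python. The helper names `dummy_code` and `quarter_of_day`, the choice of level 0 as the reference level, and the six-hour quarters are all illustrative assumptions, not something the question specifies.

```python
def dummy_code(level, n_levels=24, reference=0):
    """Return 0/1 indicators for every level except the reference level.

    Dropping one level avoids perfect collinearity with the intercept,
    so 24 hours yield 23 indicator columns.
    """
    if not 0 <= level < n_levels:
        raise ValueError(f"level must be in 0..{n_levels - 1}, got {level}")
    return [1 if level == k else 0 for k in range(n_levels) if k != reference]


def quarter_of_day(hour):
    """Map an hour (0-23) to a quarter of the day (0-3), six hours each.

    The cut points are an arbitrary illustrative choice.
    """
    return hour // 6


# Hourly coding: 23 indicator columns, exactly one of them set for hour 13.
hourly = dummy_code(13)

# Coarser coding: only 3 indicator columns once hours are grouped into quarters.
coarse = dummy_code(quarter_of_day(13), n_levels=4)
```

The coarser coding trades resolution for fewer parameters, which is the compromise described above when data are limited.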

If you make the hours variable numeric, your model will have a linear effect, with magnitude proportional to the hour. You might want to think twice about that: it will cause a discontinuity between 11pm and midnight (23 and 0), which is not realistic for most situations (unless you have a process that accumulates through the day and resets every midnight). Consider instead fitting a periodic formula like $$y \sim A \sin(2\pi h/24) + B \cos(2\pi h/24)$$ where $h$ is the hour (numeric, not categorical) and $A$ and $B$ are the fit coefficients. This is just one of many possible periodic functions, all of which will have no discontinuity.
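The periodic formula can be fit with ordinary least squares once the hour is turned into a sine and a cosine column. The sketch below uses NumPy; the helper name `harmonic_design`, the explicit intercept column, and the synthetic noiseless data are assumptions made for illustration only.

```python
import numpy as np


def harmonic_design(hours, period=24.0):
    """Build a design matrix with columns [1, sin(2*pi*h/period), cos(2*pi*h/period)]."""
    hours = np.asarray(hours, dtype=float)
    angle = 2.0 * np.pi * hours / period
    return np.column_stack([np.ones_like(hours), np.sin(angle), np.cos(angle)])


# Synthetic, noiseless response with a known daily cycle (illustrative values).
hours = np.arange(24)
angle = 2.0 * np.pi * hours / 24.0
y = 10.0 + 3.0 * np.sin(angle) - 2.0 * np.cos(angle)

# Ordinary least squares: the hour enters only through the sin and cos columns.
X = harmonic_design(hours)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, A, B = coef

# A*sin(t) + B*cos(t) equals R*cos(t - phi) with R = hypot(A, B), phi = atan2(A, B),
# so the fitted curve peaks at the hour corresponding to phi.
amplitude = np.hypot(A, B)
peak_hour = (np.arctan2(A, B) * 24.0 / (2.0 * np.pi)) % 24.0

# The fitted curve is continuous across midnight: hour 24 predicts the same as hour 0.
pred_midnight = harmonic_design([0.0]) @ coef
pred_wrapped = harmonic_design([24.0]) @ coef
```

Because the sine and cosine share one period, hour 23 sits next to hour 0 in the model, which removes the jump at midnight that a plain numeric hour would introduce. Adding further pairs of columns at multiples of the base frequency would allow more flexible periodic shapes.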

If a smooth, periodic function $f(h)$ is desired, one especially appealing option could be to find the best such curve using Generalized Additive Modeling (GAM) and cyclic regression splines. GAM is fully nonparametric for univariate functionals, automatically searching a (potentially) infinite-dimensional space of smooth, periodic functions for the one that best describes your data.

The key takeaway here is that numeric vs categorical is better thought of as a modeling choice, not a property of the data, and there are many modeling choices besides just those two. You have to consider your situation and try to find the most appropriate one.