Solved – Linear regression with “hour of the day”

circular statistics, data transformation, linear model, machine learning

I am trying to fit a linear model using "hour of the day" as a predictor. What I'm struggling with is that I've found two possible ways to handle this:

Dummy encoding for every hour of the day
Transform hours into cyclic variable

I don't quite understand the use cases of both approaches and thus I am not certain which one will lead to a better outcome.

The data I'm using is from this Kaggle challenge. The goal is to predict NYC taxi fares. The given attributes are pickup and dropoff coordinates, pickup datetime, passenger count and the fare amount.
I extracted the hour of the day to account for possible congestion and am trying to incorporate it into my model. I should also probably mention that I'm pretty inexperienced.

Best Answer

Dummy encoding would destroy any proximity measure (and ordering) among hours. For example, the distance between 1 PM and 9 PM would be the same as the distance between 1 PM and 1 AM. It would be harder for the model to express something like "around 1 PM".

Even leaving them as is, i.e. as numbers in 0-23, would be a better approach than dummy encoding in my opinion. But this has a catch as well: 00:01 and 23:59 would be seen as very distant, although they are only two minutes apart. To remedy this, your second listed approach, i.e. cyclic variables, is used. Cyclic variables map hours onto a circle (like a 24-h mechanical clock) so that the ML algorithm can see the neighbours of individual hours.
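To make the distance argument concrete, here is a minimal sketch that compares the three encodings. The helper names (one_hot, cyclic, dist) are made up for this illustration, and Euclidean distance is only one possible notion of proximity.

```python
import math

def one_hot(hour):
    """24-element indicator vector for an integer hour in 0..23."""
    return [1.0 if i == hour else 0.0 for i in range(24)]

def cyclic(hour):
    """Map an hour (possibly fractional) onto the unit circle."""
    angle = 2 * math.pi * hour / 24
    return [math.sin(angle), math.cos(angle)]

def dist(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Dummy encoding: every pair of distinct hours is equally far apart.
d_dummy_13_21 = dist(one_hot(13), one_hot(21))  # 1 PM vs 9 PM
d_dummy_13_1 = dist(one_hot(13), one_hot(1))    # 1 PM vs 1 AM

# Raw 0-23 numbers: 23:00 and 00:00 look 23 units apart.
d_raw_23_0 = abs(23 - 0)

# Cyclic encoding: 23:00 and 00:00 are neighbours, 1 PM and 1 AM are opposite.
d_cyc_23_0 = dist(cyclic(23), cyclic(0))
d_cyc_13_1 = dist(cyclic(13), cyclic(1))

print(d_dummy_13_21, d_dummy_13_1, d_raw_23_0, d_cyc_23_0, d_cyc_13_1)
```

With the cyclic encoding, the distance grows smoothly from adjacent hours up to a maximum for hours twelve apart, which is the "clock face" behaviour described above.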

Related Solutions

Categorical Data – Is Hour of Day a Categorical Variable?

Depending on what you want to model, hours (and many other attributes like seasons) are actually ordinal cyclic variables. In case of seasons you can consider them to be more or less categorical, and in case of hours you can model them as continuous as well.

However, using hours in your model in a form that does not account for cyclicity will not be fruitful. Instead, try to come up with some kind of transformation. For hours, you could use a trigonometric transformation:

xhr = sin(2*pi*hr/24)
yhr = cos(2*pi*hr/24)

Thus you would instead use xhr and yhr for modelling. See this post for example: Use of circular predictors in linear regression.
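As a rough sketch of how xhr and yhr would enter an ordinary least squares fit, the following uses NumPy on synthetic data. The data-generating process (a response peaking around 18:00) is an assumption made purely for illustration and is not taken from the taxi data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): the response peaks around 18:00.
hr = rng.integers(0, 24, size=500)
y = 10.0 + 3.0 * np.cos(2 * np.pi * (hr - 18) / 24) + rng.normal(0.0, 0.5, size=500)

# The two cyclic predictors from the formulas above.
xhr = np.sin(2 * np.pi * hr / 24)
yhr = np.cos(2 * np.pi * hr / 24)

# Ordinary least squares for y ~ b0 + b1*xhr + b2*yhr.
X = np.column_stack([np.ones(len(hr)), xhr, yhr])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

# b1*sin(t) + b2*cos(t) equals A*cos(t - phi), so the fitted
# amplitude and peak hour can be read off the two coefficients.
amplitude = np.hypot(b1, b2)
peak_hour = (np.arctan2(b1, b2) * 24 / (2 * np.pi)) % 24

print(b0, amplitude, peak_hour)
```

Both columns are needed: a single sine or cosine term fixes where the peak is, while the pair lets the regression choose the phase. A single sin/cos pair can only represent one smooth daily wave, so patterns with several peaks per day would need more terms.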

Solved – Including time of day in a linear regression model

I think a partially linear modeling framework may be suitable for your problem. If you focus on one flower at a time, note that both the flower data and the air temperature data exhibit strong temporal cycles which peak at roughly the same time. So the simplest partially linear model you could consider for one flower would look like this:

FT_h = beta0 + beta1*AT_h + m(h) + epsilon_h,

where FT_h is the flower temperature for the chosen flower at hour h, AT_h is the air temperature at hour h, m() is a smooth, unknown function meant to capture the temporal cycles you see in the temperature data and epsilon_h is an unknown error term. Here, h = 1, 2, 3, ..., H is an index which counts how many hours you have represented in total in your flower data. In other words, this index counts your hours from the first to the last. If you have 9,000 hours represented in your data, for example, then H = 9,000. In this model, beta1 represents the hourly effect of air temperature on flower temperature, after controlling for temporal effects.
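A rough sketch of this model on synthetic data follows. Note the swap: instead of a nonparametric smoother for m(), it approximates m(h) with K pairs of sine/cosine harmonics of the 24-hour period, which assumes the temporal effect repeats identically every day. All numbers (30 days, true beta1 = 0.4, K = 3) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

H = 24 * 30                       # 30 days of hourly observations (synthetic)
h = np.arange(1, H + 1)           # running hour index 1, 2, ..., H
phase = 2 * np.pi * h / 24

# Synthetic series sharing a daily cycle; the true beta1 is 0.4.
AT = 15 + 5 * np.sin(phase - 2.0) + rng.normal(0, 1.0, H)
m = 4 * np.sin(phase - 2.2) + 1.5 * np.cos(2 * phase)
FT = 2.0 + 0.4 * AT + m + rng.normal(0, 0.5, H)

# Stand-in for m(h): K sine/cosine pairs with periods 24/k hours.
# K plays the role of the smoothness choice (larger K = wigglier m).
K = 3
harmonics = [f(k * phase) for k in range(1, K + 1) for f in (np.sin, np.cos)]

# Design matrix: intercept, air temperature, then the harmonic columns.
X = np.column_stack([np.ones(H), AT] + harmonics)
coef, *_ = np.linalg.lstsq(X, FT, rcond=None)
beta1_hat = coef[1]

print(beta1_hat)
```

In this setup beta1 is only identifiable from the part of air temperature that the harmonics do not explain, which is why the choice of smoothness for m() matters so much for the estimate.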

The model can be expanded by adding a linear effect for incident solar radiation (ISR):

FT_h = beta0 + beta1*AT_h + m(h) + beta2*ISR_h + epsilon_h.

If you wanted to throw in wind direction as well, you could code this variable as taking the values North, South, East, West (or add variations like North-East, North-West, etc.) and include it in your model using dummy variables. For example, if you only code this variable as taking the values North, South, East or West, the flower-specific model could be expressed as:

FT_h = beta0 + beta1*AT_h + m(h) + beta2*ISR_h +
       beta3*NorthDummy_h + beta4*EastDummy_h + beta5*WestDummy_h +
       epsilon_h,

where South is treated as the reference direction against which all others will be compared and NorthDummy_h is set to 1 if wind direction was North at hour h and 0 otherwise, EastDummy_h is set to 1 if wind direction was East at hour h and 0 otherwise and WestDummy_h is set to 1 if wind direction was West at hour h and 0 otherwise.
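The dummy coding with South as the reference level can be sketched as follows. The function name wind_dummies and the sample observations are hypothetical, used only to show the coding scheme.

```python
def wind_dummies(direction):
    """Return (NorthDummy, EastDummy, WestDummy); South is the reference level."""
    levels = ("North", "East", "West")
    if direction not in levels + ("South",):
        raise ValueError(f"unexpected wind direction: {direction!r}")
    return tuple(1 if direction == level else 0 for level in levels)

# Hypothetical hourly wind directions.
observed = ["North", "South", "East", "West", "South"]
rows = [wind_dummies(d) for d in observed]

for d, row in zip(observed, rows):
    print(d, row)
```

South maps to all zeros, so its effect is absorbed into the intercept and each dummy coefficient is read as a difference from South. Including a fourth dummy for South alongside the intercept would make the columns linearly dependent.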

The challenging aspects of these models are:

1. The need to estimate the (unknown) degree of smoothness of the (unknown) temporal effect m() carefully, given that this is just a nuisance effect and the real interest is in estimating beta1;
2. The possibility that the error terms epsilon_h might be temporally correlated, which in turn can affect how item 1. above is addressed.

Many years ago, I conducted research on this very topic; see, for example, http://www.ghement.ca/217.pdf. However, I have not stayed current on the topic, so it's possible there have been several advances in ways to handle item 1.

Intuitively, the temporal signal seen in the data is really strong while the air temperature signal is likely tiny by comparison. So you need to find the right balance when determining the degree of smoothness of the temporal effect, so as not to throw the baby out with the bathwater.

If you are interested in comparing effects of air temperature across flowers, you can expand the model even further. But I would start small to make sure I get a handle first on the simpler, flower-specific models.
