Solved – Assigning more weight to more recent observations in regression

r, random forest, regression, time series

How do I assign more weight to more recent observations in R?

I assume this is a commonly asked question, but I have had a hard time figuring out exactly how to implement it. I have searched a lot and have been unable to find a good practical example.

In my example I have a large dataset collected over time. I want to apply some sort of exponential weighting so that more recent rows count for more. That is, I would have an exponential function saying that observations in 2015 are ___ times more important to training the model than observations in 2012.

My dataset variables contain a mix of categorical and numerical values and my target is a numerical value – if that matters.

I would like to try this out using models such as GBM or random forest, ideally with the caret package.

Update to the question

I appreciate the response given below on how to exponentially decay the weight by the time distance between two dates.

However, when it comes to training this model in caret, how exactly do the weights factor in? The weight in each training row is a function of the distance between some point in the future and when that observation historically occurred.

Do the weights come into play only during prediction? If they come into play during training, wouldn't that cause all sorts of problems, since the cross-validation folds would have varying weights and the model would be trying to predict something that may actually have occurred before the data it was trained on?

Best Answer

How do I assign more weight to more recent observations in R?

I assume you have a timestamp associated with each observation. You can compute a variable timeElapsed = modelingTime - observationTime, then apply a simple exponential function, W = K*exp(-timeElapsed/T), where K is a scaling constant and T is the time constant of the decay function. W then serves as the case weight: the larger T is, the more slowly older observations lose influence.
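As a minimal sketch of that formula in base R: the data frame `obs`, the reference date, and the one-year time constant (named `T_decay` here so it does not mask R's built-in `T`) are all made-up values for illustration, not something taken from the question.

```r
# Hypothetical example data: one row per observation, each with a date.
obs <- data.frame(
  observationTime = as.Date(c("2012-06-30", "2013-06-30",
                              "2014-06-30", "2015-06-30")),
  y = c(10, 12, 11, 15)
)

modelingTime <- as.Date("2015-12-31")  # assumed "now" for the model
K            <- 1                      # scaling constant
T_decay      <- 365                    # time constant in days (assumed)

# Age of each observation in days, relative to modelingTime.
timeElapsed <- as.numeric(difftime(modelingTime, obs$observationTime,
                                   units = "days"))

# Exponentially decaying case weight: W = K * exp(-timeElapsed / T).
obs$W <- K * exp(-timeElapsed / T_decay)

# Relative importance of the 2015 row versus the 2012 row.
ratio_2015_vs_2012 <- obs$W[4] / obs$W[1]
```

With these assumed values the two dates are 1095 days (three time constants) apart, so the ratio is exp(3), roughly 20: the 2015 row counts about twenty times as much as the 2012 row. K cancels out of that ratio, so only T controls how strongly recency is favoured. If a half-life is easier to reason about, one option is to set T_decay to the half-life divided by log(2).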

To the best of my knowledge, caret's train() accepts a weights argument: a numeric vector of case weights with one entry per observation (so its length equals the number of training rows). The weights are used while fitting the model, and only by those models that support case weights.
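A rough sketch of how this could look with caret follows. The data are simulated, the column names are invented, and `method = "lm"` is used only as a lightweight stand-in that accepts case weights; a GBM method could be substituted on the assumption that it also honours them. The `"timeslice"` resampling scheme is one way to address the concern about folds that predict the past from the future, since each training window precedes its holdout window.

```r
library(caret)

set.seed(1)
n <- 200

# Simulated data, ordered from oldest to newest (hypothetical columns).
d <- data.frame(
  obsDate = seq(as.Date("2012-01-01"), by = "week", length.out = n),
  x1      = rnorm(n),
  x2      = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
d$y <- 2 * d$x1 + as.numeric(d$x2) + rnorm(n)

# Exponentially decaying case weights, one per row.
modelingTime <- max(d$obsDate)
T_decay      <- 365  # time constant in days (assumed)
w <- exp(-as.numeric(difftime(modelingTime, d$obsDate,
                              units = "days")) / T_decay)

# Rolling-origin resampling: train on an expanding window of past rows,
# then evaluate on the next 20 rows.
ctrl <- trainControl(method = "timeslice",
                     initialWindow = 100,
                     horizon = 20,
                     fixedWindow = FALSE)

# The weights enter the fit itself; caret subsets them alongside the
# rows of each resample.
fit <- train(x = d[, c("x1", "x2")], y = d$y,
             method = "lm",
             weights = w,
             trControl = ctrl)
```

The weights play no role at prediction time: `predict(fit, newdata)` only applies the fitted model. Note that in this sketch the weights are computed once against the latest date, so the earlier slices see weights that are all scaled down by a common factor within the slice; for models where only relative weights matter, that should not change the fit.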