Solved – Predict longitudinal data with machine learning in R

caret, machine learning, panel data, prediction, r

I am currently working on a prediction model where the data is longitudinal (panel) data. The data contains multiple companies, with a response variable and several explanatory variables observed over multiple years.

I would like to predict the response variable and to test the importance of each explanatory variable. I have made some predictions using the plm package in R, but I would like to make predictions using machine learning algorithms. Does anyone know which models I could use, and where I can find more material on this topic? Are there models available in the caret package that can deal with this longitudinal data?

Many thanks in advance!

My data looks like this:

    data <- read.table(header = TRUE, 
               stringsAsFactors = FALSE, 
               text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
               1 2.5 2000 1 2
               1 4 2001 3 1
               1 3 2002 5 7
               2 1 2000 3 2
               2 2.4 2001 0 4
               2 6 2002 2 9
               3 10 2000 8 3")

Best Answer

There are a few approaches you could take.

First, you could project out the fixed effects and then run ridge or lasso:

    library(lfe)
    library(glmnet)

    # Design matrix without an intercept (demeaning would zero it out anyway);
    # data[, -1] drops the CompanyNumber column so it isn't used as a covariate
    mm <- model.matrix(ResponseVariable ~ 0 + ., data = data[, -1])

    # Demean within companies, i.e. project out the company fixed effects
    xdm <- demeanlist(mm, list(factor(data$CompanyNumber)))
    ydm <- demeanlist(data$ResponseVariable, list(factor(data$CompanyNumber)))

    ridge <- cv.glmnet(y = ydm, x = xdm, alpha = 0)
    lasso <- cv.glmnet(y = ydm, x = xdm, alpha = 1)

You'd then go back and calculate the fixed effects (with getfe() on a felm fit, or by hand) to make the prediction.
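For instance, a minimal sketch of the by-hand route (beta_hat, ybar, xbar, and alpha are illustrative names; the mm and ridge objects come from the code above):

    # Sketch: recover each company's fixed effect by hand from the ridge fit,
    # alpha_i = mean(y_i) - mean(x_i) %*% beta_hat
    beta_hat <- as.vector(coef(ridge, s = "lambda.min"))[-1]  # drop glmnet's intercept
    id    <- factor(data$CompanyNumber)
    ybar  <- tapply(data$ResponseVariable, id, mean)
    xbar  <- apply(mm, 2, function(col) tapply(col, id, mean))
    alpha <- ybar - drop(xbar %*% beta_hat)

    # A prediction for company i with a new covariate row xnew would then be
    # alpha[i] + sum(xnew * beta_hat)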

If your dataset is small, like in the example, you could make model matrices of all squares and cross-products, and so on.
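For instance, a sketch with model.matrix(), where ~ .^2 expands all two-way interactions and I() adds the squared terms (mm2 and the particular expansion are illustrative):

    # Sketch: main effects, all pairwise interactions (.^2), and squared terms
    mm2 <- model.matrix(ResponseVariable ~ 0 + .^2
                        + I(ExplanatoryVariable1^2)
                        + I(ExplanatoryVariable2^2),
                        data = data[, -1])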

Also, if your data is small, you could simply put your cross-sectional unit into the design matrix of the ridge regression as a factor: (penalized) least squares dummy variables (LSDV). It turns out that L2-penalized LSDV is equivalent to random effects. If you don't care about unbiased parameter estimates, you should always prefer random effects to fixed effects.
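A sketch of penalized LSDV with glmnet, reusing the toy data's column names (X and lsdv_ridge are illustrative names):

    # Sketch: put the company dummies straight into the design matrix and let
    # the ridge penalty shrink them toward a common mean
    X <- model.matrix(ResponseVariable ~ 0 + factor(CompanyNumber)
                      + ExplanatoryVariable1 + ExplanatoryVariable2,
                      data = data)
    # On the tiny toy data you'd need a smaller nfolds (e.g. nfolds = 3)
    lsdv_ridge <- cv.glmnet(y = data$ResponseVariable, x = X, alpha = 0)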

You could also simply ignore the cross-sectional unit:

    library(randomForest)
    rf <- randomForest(ResponseVariable ~ . - CompanyNumber - Year, data = data)

Here you'd want to take Year out, because a random forest can't extrapolate beyond the range of its training data; I assume you're making predictions for the next period. You could consider detrending your data before putting it into the random forest.
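For instance, a sketch that residualizes the response on Year with a linear fit and grows the forest on the residuals (Detrended and rf_detrended are illustrative names):

    # Sketch: strip a linear time trend, then fit the forest on the residuals
    trend <- lm(ResponseVariable ~ Year, data = data)
    data$Detrended <- resid(trend)
    rf_detrended <- randomForest(
        Detrended ~ . - CompanyNumber - Year - ResponseVariable,
        data = data)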

Finally, there is an experimental package here that projects out fixed effects from the top level of a neural network. It might not yet be very reliable, however.
