Solved – Machine Learning methods / Regression Trees for Longitudinal/Panel Count Data

cartcount-datamachine learningoverdispersionpanel data

I have a data set where the dependent or the response variable is a non-negative and integer variable and exhibits over-dispersion where the variance is greater than the mean. Below is the figure

enter image description here

There exists methods such as Poisson, Negative Binomial (NB) Regression etc to model this kind of data. However, I have been searching for some machine learning or regression tree methods for panel count data which can provide alternatives for Poisson or NB. There exists regression tree methods for panel data such as REEMtree: Regression trees with random effects for longitudinal (panel) data by Sela and Siminoff. But this procedure implements a linear mixed effects model along with the non-parametric procedure of tree building based on CART methodology by Breiman and I am not sure if this is right method to model count data.

My query is, does there exist any machine learning or regression tree procedure to model count/over-dispersed data?

Best Answer

The glmertree package on R-Forge (https://R-Forge.R-project.org/R/?group_id=261) extends the REEM tree approach in two directions: First, the response variable can come from the exponential family (including Gaussian and Poisson distributions among others). Second, the tree can be employed to not only learn a segment-wise constant mean but also include segment-specific regression slopes. Our working paper introducing the method is available from RePEc at http://EconPapers.RePEc.org/RePEc:inn:wpaper:2015-10

Having said that, however, the spread in your data is so large that I heavily doubt you will need a count data regression. I would start out by using log(y) as the response (or log(y + 0.5) if there are zeros) and try to build a flexible model for a continuous response.

Related Question