Solved – Applying machine learning techniques to panel data

machine learning, panel data, recurrent neural network, supervised learning

I have a panel dataset in which I observe 1,500 companies and the many individuals who work for those companies over multiple periods. I have explanatory variables at both the individual level (e.g. race, age, education) and the company level (e.g. company age, R&D investment, advertising spending, industry), so the explanatory variables are of different types: continuous, categorical, and binary. In this dataset, the same individual might work for more than one company at the same time (some of them are consultants). My dependent variable is sales per year.

Using this data, I want to predict the dependent variable and test the importance of each explanatory variable. Does anyone know which models would be most suitable, and where I could find reliable material on this topic? I was thinking about applying an RNN to the panel data (how would I do that?), but I am also open to other suggestions.

I know that ML and econometrics still do not talk to each other much when it comes to causality, but do you know of any recent papers or developments related to this issue?

Best Answer

I don't believe I can offer what you're looking for, but the first step is to use the repeated individual_id as a grouping variable so that each individual falls in exactly one partition. For example, if you're using k-fold cross-validation, an individual should show up in only one fold and not be spread across the others. Otherwise the model is evaluated on people it has already seen during training, and the validation score will look better than it really is.
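To make the partitioning idea concrete, here is a minimal sketch in plain Python. The function names and the toy `individual_id` list are made up for illustration; if you use scikit-learn, its `GroupKFold` splitter is the off-the-shelf way to get the same guarantee.

```python
from collections import defaultdict


def grouped_folds(group_ids, n_folds=5):
    """Assign row indices to folds so that all rows sharing a group id
    end up in the same fold.

    Groups are placed largest-first into whichever fold currently holds
    the fewest rows, which keeps fold sizes roughly balanced.
    """
    rows_by_group = defaultdict(list)
    for row, gid in enumerate(group_ids):
        rows_by_group[gid].append(row)

    folds = [[] for _ in range(n_folds)]
    for gid in sorted(rows_by_group, key=lambda g: -len(rows_by_group[g])):
        smallest = min(folds, key=len)
        smallest.extend(rows_by_group[gid])
    return folds


def train_test_splits(group_ids, n_folds=5):
    """Yield (train_rows, test_rows) index lists, one pair per fold."""
    folds = grouped_folds(group_ids, n_folds)
    for k, test_rows in enumerate(folds):
        train_rows = [r for j, fold in enumerate(folds) if j != k for r in fold]
        yield sorted(train_rows), sorted(test_rows)


# Hypothetical panel: one row per (individual, company, year) observation.
individual_id = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "e", "e", "f"]

for train_rows, test_rows in train_test_splits(individual_id, n_folds=3):
    train_people = {individual_id[r] for r in train_rows}
    test_people = {individual_id[r] for r in test_rows}
    # No individual appears on both sides of the split.
    assert train_people.isdisjoint(test_people)
```

Grouping by individual still works when someone is employed by several companies at once: all of that person's rows, whichever company they belong to, travel together into one fold.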

As for which machine learning algorithms to try, that ultimately depends on the data. In my experience, though, your best results will probably come from some sort of boosted tree model such as LightGBM or XGBoost. You will then have to decide how to encode the categorical variables, for which I recommend the category_encoders library if you're using Python.
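As one example of what an encoding choice looks like, here is a rough sketch of smoothed mean (target) encoding written in plain Python. This is only an illustration of the idea: the function names, the toy `industry`/sales values, and the simple additive smoothing formula are all my own assumptions, and category_encoders' own `TargetEncoder` uses a different smoothing scheme. The important part is that the mapping is learned from the training rows only and then applied to held-out rows.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean of the target for each category level."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each level's mean toward the global mean; rare levels shrink most.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode levels with the learned values; unseen levels get the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Hypothetical training rows: company industry and yearly sales.
train_industry = ["tech", "tech", "retail", "retail", "retail", "energy"]
train_sales = [120.0, 100.0, 40.0, 50.0, 60.0, 90.0]

mapping, global_mean = fit_target_encoding(train_industry, train_sales, smoothing=2.0)

# "pharma" never appeared in training, so it falls back to the global mean.
encoded_test = apply_target_encoding(["tech", "pharma"], mapping, global_mean)
```

With grouped cross-validation, the encoder should be refit inside each training fold; fitting it on the full dataset first would leak target information into the validation rows.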

I'm sure there are interesting and novel ideas around RNNs, but to be honest I don't think this problem is suited to that type of algorithm. This sounds like a classic regression problem to me.
