I am working with a dataset of flight records and modeling flight delay. I have variables for the origin and destination airports, but each of them has about 300 categories. I am thinking about grouping the less significant airports, but I am not sure how to do that. Should I:
- Group the airports with the fewest records.
- Group airports with similar average flight delay.
Or maybe I shouldn't group at all? For the record, I have about 5 million entries in total.
Best Answer
IIUC, you want to model flight delay as a function of the attributes of the origin and destination airport. But I am not sure whether you have many categorical variables or just one categorical variable (per airport) which could contain many different categories. I presume that your situation is the former.
You could, of course, fit a model with all features. For example, deep neural networks can be built with many hundreds of input variables. But this also depends very much on your variables: if you need to do a lot of dummy coding, the number of input variables increases further. For instance, dummy coding two airport variables with about 300 categories each yields roughly 600 indicator columns.
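To make the dummy-coding point concrete, here is a minimal sketch using pandas. The toy data and the column names (`origin`, `dest`, `delay`) are assumptions for illustration, not taken from your dataset; with real data the same call would produce one column per distinct airport in each variable.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
flights = pd.DataFrame({
    "origin": ["JFK", "LAX", "ORD", "JFK"],
    "dest": ["LAX", "ORD", "JFK", "ORD"],
    "delay": [12.0, -3.0, 25.0, 7.0],
})

# Dummy coding creates one indicator column per distinct category.
# sparse=True keeps memory use manageable when there are many categories.
dummies = pd.get_dummies(flights[["origin", "dest"]], sparse=True)

n_dummy_columns = dummies.shape[1]
print(n_dummy_columns)  # 3 origin airports + 3 destination airports
```

With millions of rows and hundreds of airports, a sparse representation like this is usually what keeps the encoded matrix practical to store.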
Rather than grouping, it is probably a good idea to first check which variables are the important ones. There are many methods you could use here. A very popular one is to fit a random forest and inspect its feature importances. Scikit-learn has an implementation available.
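As a rough sketch of the random forest approach, the snippet below fits scikit-learn's `RandomForestRegressor` and ranks features by its `feature_importances_` attribute. The data is synthetic and the feature names (`dep_hour`, `distance`, `noise`) are hypothetical stand-ins; you would replace them with your own (encoded) variables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for flight records; feature names are hypothetical.
X = pd.DataFrame({
    "dep_hour": rng.integers(0, 24, n),
    "distance": rng.uniform(100, 3000, n),
    "noise": rng.normal(size=n),
})
# In this toy setup the delay depends only on the departure hour.
y = 2.0 * X["dep_hour"] + rng.normal(scale=1.0, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per input column, sorted high to low.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

One caveat: impurity-based importances can favor features with many distinct values, so it may be worth cross-checking with `sklearn.inspection.permutation_importance` on held-out data.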