Machine Learning – How to Encode Categorical Variables with Hundreds of Levels

categorical data, categorical-encoding, machine learning, many-categories

Although this problem is probably overdone and heavily researched, I want to use machine learning to solve problems related to basketball (predicting which team will win a game, whether a certain shot will go in, etc.). To do that, I want to encode information about the line-ups, because the starting line-ups of both teams have a huge impact on a team's performance and ultimately its ability to win a game (e.g. the Cavs' chance of winning a game is significantly lower without LeBron, and much higher if Kevin Durant is not playing).

Given that the player name (or some unique ID identifying the player) is a categorical variable, how do I encode it in a format suitable for prediction tasks? Ideally, I do not want to create a ton of binary variables, one for each player in the NBA (each representing whether the given player will play or not). I was thinking of tackling this problem with Python/scikit-learn, which I usually feed numpy arrays. Is it possible to modify my data so that I can still use numpy/scikit-learn for the prediction task?

Similarly, another variable would be the opposing team (LAL, GSW, BOS, etc.). There are significantly fewer choices than there are players, of course, but still 29 possible opponents, which is a considerable number. While I could create one binary variable per team, is there another option? If I map each team to a number, is that legitimate? If so, are there any downsides? If not, can you expand on why? I am somewhat reluctant because it is unclear what these numbers would actually represent.

I have read a few articles about encoding categorical variables, and it seems like a fuzzy area. I did not see any clear/obvious answers and could always use the guidance and expertise. If you have any ideas, suggestions, or relevant videos/blogs/papers, please let me know.

Best Answer

I've seen feature hashing and embeddings mentioned in the comments. Apart from those, you can try clustering the players (grouping player IDs into a smaller number of clusters) if you have some additional data about them.
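To make the feature hashing idea concrete, here is a minimal sketch in plain Python (standard library only). The function name `hash_lineup`, the parameter `n_features`, and the player IDs are hypothetical and chosen for illustration. The idea is that every player ID is hashed into one of a fixed number of columns, so the vector length does not grow with the number of players. scikit-learn offers a `FeatureHasher` class in `sklearn.feature_extraction` for the same purpose; the code below is only meant to show the mechanism.

```python
import hashlib


def hash_lineup(player_ids, n_features=32):
    """Map a line-up (any number of player IDs) to a fixed-length vector.

    Each ID is hashed to a column index; a second hash bit picks a sign
    so that colliding players tend to cancel rather than always add up.
    """
    vec = [0.0] * n_features
    for pid in player_ids:
        digest = hashlib.sha256(str(pid).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


# Hypothetical player IDs, for illustration only.
home_lineup = ["player_23", "player_35", "player_30", "player_11", "player_0"]
features = hash_lineup(home_lineup, n_features=64)
```

The resulting list can be wrapped in `numpy.array` and fed to an estimator as usual. The trade-off is that different players may collide in the same column, so `n_features` controls a balance between vector size and collisions.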

Another approach that is suitable for categorical data with many levels is mean encoding.

Mean encoding (also sometimes called target encoding) consists of replacing each category with the mean of the target for that category (for example, in a regression problem, if a feature has categories 0 and 1, then category 0 is encoded by the mean of the response over the examples in category 0, and so on). There are some answers on this site that provide more detail. I also encourage you to watch this video if you want to learn more about how it works and how you can implement it (there are several ways to do mean encoding, and each has its pros and cons).
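As an illustration of the definition above, here is a minimal sketch of mean encoding in plain Python. The function names, the `smoothing` parameter, and the toy data are assumptions made for this example, not something taken from the linked material. The smoothing term pulls categories with few observations toward the overall mean, which is one common variant; setting `smoothing=0` gives the plain per-category mean.

```python
from collections import defaultdict


def fit_mean_encoding(categories, targets, smoothing=10.0):
    """Learn a category -> encoded value mapping from training data.

    encoded = (sum of targets in category + smoothing * global mean)
              / (count in category + smoothing)
    """
    prior = sum(targets) / len(targets)  # global mean of the target
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, prior


def apply_mean_encoding(categories, encoding, prior):
    """Encode categories; unseen categories fall back to the global mean."""
    return [encoding.get(cat, prior) for cat in categories]


# Toy data: opponent team and whether the game was won (1) or lost (0).
opponents = ["GSW", "GSW", "LAL", "LAL", "LAL", "BOS"]
won = [0, 0, 1, 1, 0, 1]
encoding, prior = fit_mean_encoding(opponents, won, smoothing=2.0)
encoded = apply_mean_encoding(["LAL", "BOS", "NYK"], encoding, prior)
```

The mapping should be fitted on training data only and then applied to validation or test data, since the encoded values are derived from the target.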

In Python you can do mean encoding yourself (some approaches are shown in the video series I linked), or you can try the Category Encoders package from scikit-learn-contrib.
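One way of doing it yourself, sketched here under stated assumptions, is out-of-fold mean encoding: each training row is encoded using target means computed on the other folds only, so a row's own target never leaks into its feature. The function name, the interleaved fold assignment, and the toy data are illustrative choices and not necessarily the scheme used in the video; for games ordered in time, a time-based split may be more appropriate.

```python
from collections import defaultdict


def out_of_fold_mean_encoding(categories, targets, n_folds=5):
    """Encode each row with category means computed on the other folds."""
    n = len(targets)
    if n_folds < 2 or n < 2:
        raise ValueError("need at least 2 folds and 2 rows")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Interleaved folds: row i belongs to fold i % n_folds.
        train = [i for i in range(n) if i % n_folds != fold]
        holdout = [i for i in range(n) if i % n_folds == fold]
        prior = sum(targets[i] for i in train) / len(train)
        sums = defaultdict(float)
        counts = defaultdict(int)
        for i in train:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        for i in holdout:
            cat = categories[i]
            # Categories unseen in the training folds get the fold's prior.
            encoded[i] = sums[cat] / counts[cat] if counts[cat] else prior
    return encoded


# Toy data, for illustration only.
teams = ["A", "A", "B", "B"]
wins = [1, 0, 1, 1]
team_encoded = out_of_fold_mean_encoding(teams, wins, n_folds=2)
```

For test data, the encoding would typically be computed from the full training set instead.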