Using categorical features for a regression problem in deep learning

categorical data, deep learning, neural networks, regression

I have a dataset with 131 features. My goal is to estimate a value from these features using deep learning (a regression problem). However, 5 of my features take only the values 0 or 1. How can I use these categorical variables in my model for a regression problem?

Should I remove them, or handle them some other way?

I would really appreciate any help.

Best Answer

In general, binary data types are used to represent membership in particular categories. Suppose you have some data in a row-column format, so that the columns are features and rows are observations. Perhaps you're interested in the relationship between height and gender. So your data look like

Height    Gender
6 ft      Male
5 ft 4 in Female

And so on. Clearly, there's no natural numerical representation of "male" and "female." But another way to look at the problem is as a question of membership: "is this person male?" or, equivalently, "is this person female?" In this way, we can take the categories "male" and "female" and translate them into binary, numerical quantities (conventionally, $1$ and $0$). Which category we treat as $1$ and which as $0$ is irrelevant from a mathematical standpoint.

Importantly, it's standard practice to expand a variable with $k$ categories into $k-1$ binary columns. With all $k$ columns, the indicators in each row sum to $1$, which makes them linearly dependent with an intercept column of all $1$s and makes it impossible to uniquely estimate the coefficients. I don't know what the particular details of your so-called "deep learning" regression are, but it probably has something like an intercept.
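To make the $k-1$ rule concrete, here is a minimal sketch in plain Python. The `colors` data and the `one_hot` helper are hypothetical, invented purely for illustration; they are not from the question's dataset. The sketch shows that a full set of $k$ indicator columns always sums to the intercept column, while dropping one level removes that dependence.

```python
# Hypothetical toy data: one categorical feature with k = 3 levels.
colors = ["red", "green", "blue", "green", "red", "blue"]
levels = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot(value, categories):
    """Return one 0/1 indicator per category (illustrative helper)."""
    return [1 if value == c else 0 for c in categories]


# Full encoding: k indicator columns per row.
full = [one_hot(c, levels) for c in colors]

# Dummy encoding: drop the first level ('blue'), which becomes the
# reference category, leaving k - 1 columns.
dummy = [row[1:] for row in full]

# With all k columns, the indicators in every row sum to 1, which is
# exactly the intercept column: a perfect linear dependence.
full_sums = [sum(row) for row in full]

# With k - 1 columns the row sums vary, so the indicators no longer
# reproduce the intercept column.
dummy_sums = [sum(row) for row in dummy]

print(full_sums)
print(dummy_sums)
```

A feature that only ever takes the values 0 and 1 is the $k = 2$ case of this scheme, so it is already a single dummy column and needs no further expansion.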

You've asked @Sheep:

> Do you know the underlying process for these kind of variables?

But this is a fundamentally unanswerable question without actually doing the analysis. Run the regression and find out!
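As a rough sketch of what "run the regression and find out" could look like, the snippet below compares an intercept-only baseline against a least-squares fit on a single 0/1 feature. The data are made up for illustration, and a plain linear fit stands in for the deep learning model, which is an assumption on my part. With one binary regressor, the least-squares intercept is the mean of the 0 group and the slope is the difference in group means.

```python
from statistics import mean

# Hypothetical toy data: a 0/1 feature and a numeric target.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2.0, 2.5, 1.5, 2.0, 4.0, 4.5, 3.5, 4.0]


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


# Baseline: intercept-only model, which predicts the overall mean.
baseline_pred = [mean(y)] * len(y)

# Least-squares fit of y = b0 + b1 * x with a 0/1 regressor:
# b0 is the mean of the x == 0 group, b1 the difference in group means.
b0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
b1 = mean(yi for xi, yi in zip(x, y) if xi == 1) - b0
model_pred = [b0 + b1 * xi for xi in x]

baseline_mse = mse(y, baseline_pred)
model_mse = mse(y, model_pred)

print(baseline_mse, model_mse)
```

If the error barely changes when the binary feature is included, the feature is probably not informative for the target. This comparison is in-sample only; with real data the same comparison would be made on held-out observations.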

> Does it make sense to estimate a variable (my target variable) based on categorical variables?

Maybe. If the two quantities are completely unrelated -- for example, political party in power in a given territory and number of hurricanes over the Atlantic Ocean -- then it probably wouldn't make sense. Instead of looking at your research as a purely rote, quantitative exercise, I would encourage you to think critically about what your data represent, and what the underlying causal mechanism is. Is there a biological process at work? Are people acting according to their self-interest? What do we know about the climate that causes hurricanes?
