Solved – Using categorical features for regression problem in deep learning

categorical data, deep learning, neural networks, regression

I have a dataset with 131 features. My goal is to estimate a value from these features using deep learning (a regression problem). However, 5 of my features take only the values 0 or 1. How can I include these categorical variables in my regression model?

Should I remove them, or what?

I'd really appreciate any help.

Best Answer

In general, binary data types are used to represent membership in particular categories. Suppose you have some data in a row-column format, so that the columns are features and rows are observations. Perhaps you're interested in the relationship between height and gender. So your data look like

Height    Gender
6 ft      Male
5 ft 4 in Female

And so on. Clearly, there's not a great numerical representation of "male" and "female." But another way to look at the problem is as a question of membership, to answer the question "is this person male?" or, equivalently, "is this person female?" In this way, we can take the categories "male" and "female" and translate them into binary, numerical quantities (conventionally, $1$ and $0$). Which category we choose to treat as $1$ and which as $0$ is irrelevant from a mathematical standpoint.

Importantly, it's standard practice to expand a variable with $k$ categories into $k-1$ binary columns. The full set of $k$ indicator columns sums to $1$ in every row, which is exactly an intercept column of all $1$s, so keeping all $k$ makes the columns linearly dependent and makes it impossible to estimate the coefficients uniquely. I don't know the particular details of your "deep learning" regression, but it probably has something like an intercept.
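As a minimal sketch of the $k$ to $k-1$ expansion described above, the following plain-Python helper builds the indicator columns by hand. The function name `dummy_encode` and the sample `gender` list are illustrative assumptions, not part of the question's dataset. A feature that already takes only the values 0 and 1 is in this form already and can be passed to the model as-is.

```python
def dummy_encode(values, drop_first=True):
    """Expand a categorical column into 0/1 indicator columns.

    With drop_first=True, k categories yield k-1 columns; the dropped
    category (first in sorted order) becomes the all-zeros baseline.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical sample column, for illustration only.
gender = ["Male", "Female", "Female", "Male"]

# k = 2 categories -> k - 1 = 1 binary column ("Female" is the baseline).
encoded = dummy_encode(gender)
print(encoded)  # {'Male': [1, 0, 0, 1]}

# Keeping all k columns makes every row sum to 1, which duplicates
# an intercept column of all 1s (linear dependence).
full = dummy_encode(gender, drop_first=False)
row_sums = [sum(col[i] for col in full.values()) for i in range(len(gender))]
print(row_sums)  # [1, 1, 1, 1]
```

The encoded columns can then sit alongside the continuous features as ordinary numeric inputs.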

You've asked @Sheep

Do you know the underlying process for these kind of variables?

But this is a fundamentally unanswerable question without actually doing the analysis. Run the regression and find out!

Does it make sense to estimate a variable( my target variable) based on categorical variables?

Maybe. If the two quantities are completely unrelated -- for example, political party in power in a given territory and number of hurricanes over the Atlantic Ocean -- then it probably wouldn't make sense. Instead of looking at your research as a purely rote, quantitative exercise, I would encourage you to think critically about what your data represent, and what the underlying causal mechanism is. Is there a biological process at work? Are people acting according to their self-interest? What do we know about the climate that causes hurricanes?

Related Solutions

Solved – Summarization of text documents (legal domain) using deep learning techniques

By themselves, deep networks are not summarization algorithms, but they can help find key sentences that together form a summary. The network is a model of the text you are likely to see in the corpus. The sentences you want in a summary should be typical of the corpus but different from each other, so that a small set of sentences "covers" most of the corpus. So you want sentences that individually have a high probability according to the network, but that also have low pointwise mutual information (PMI) with each other. PMI can be estimated by the network as well. I do not know how well this approach would compare with state-of-the-art summarization.
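For reference, the pointwise mutual information between two sentences is usually defined as follows. The symbols $s_i$, $s_j$ and $p$ are notation introduced here for illustration, with $p$ standing for whatever probabilities the network assigns:

$$\mathrm{PMI}(s_i, s_j) = \log \frac{p(s_i, s_j)}{p(s_i)\,p(s_j)}$$

A value near zero means the two sentences are close to statistically independent, so neither is redundant given the other.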

Solved – Deep Learning for Regression

For standard regression I recommend using a multi-layer perceptron (MLP) with a mean squared error (MSE) loss function. The model can be defined in Keras as follows:

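from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Input(shape=(13,)))  # 13 input features
model.add(Dense(20, kernel_initializer='random_normal', activation='relu'))
model.add(Dense(10, kernel_initializer='random_normal', activation='relu'))
model.add(Dense(1, kernel_initializer='random_normal'))  # linear output for regression

model.compile(loss='mean_squared_error', optimizer='adam')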
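To make the MSE loss concrete, here is a small plain-Python sketch of the quantity being minimized. The helper name and the sample numbers are made up for illustration, and this is not how Keras computes the loss internally.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions, for illustration only.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

# Squared errors are 1, 0 and 4, so the mean is 5/3.
loss = mean_squared_error(y_true, y_pred)
print(loss)
```

Because the errors are squared, a single large miss contributes more to the loss than several small ones.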

See this article for a complete example. For a TensorFlow implementation, consider the following code.
