Solved – How to deal with mixed data type in deep neural network

data preprocessing, machine learning, mixed type data, neural networks

My dataset has 300 numeric features, each ranging from 1 to 500. In addition, I have 1000 binary categorical features (0 or 1); around 90% of their values are 0, so they are fairly sparse.

Before training a deep neural network, I typically standardize all numeric values in my dataset. My questions are:

With these mixed data types, how should I standardize the data? Should I standardize only the numeric variables and leave the categorical variables as they are, or standardize everything?
If I have only the categorical variables (0 and 1), should I run the model on them as they are or normalize them? What if the data are very sparse?
What if the categorical variables are count data (from 0 to 10)? Does it make sense to standardize them?

Best Answer

With these mixed data types, how should I standardize the data? Should I standardize only the numeric variables and leave the categorical variables as they are, or standardize everything?

As far as I know, you should either normalize or standardize the whole feature vector. If you standardize the numeric values and leave the categorical variables as they are, the two groups of features end up on different scales, which could cause a large variance within the vector. Another option is to standardize the numeric values and normalize the categorical values.
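The second option above (standardize the numeric features, normalize the categorical ones) can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the function names and the toy columns are hypothetical, and in practice a library such as NumPy or scikit-learn (`StandardScaler`, `MinMaxScaler`) would typically do this work.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # Constant column: every value equals the mean.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Rescale a column linearly so its observed values span [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to rescale.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical toy columns: one numeric feature in 1..500, one binary feature.
numeric_col = [1.0, 250.0, 500.0, 120.0]
binary_col = [0, 0, 1, 0]

scaled_numeric = standardize(numeric_col)  # zero mean, unit variance
scaled_binary = min_max(binary_col)        # stays 0/1, now as floats
```

Note that min-max scaling leaves a 0/1 column unchanged apart from the conversion to floats, so in this sketch the sparsity of the binary features is preserved.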

If I have only the categorical variables (0 and 1), should I run the model on them as they are or normalize them? What if the data are very sparse?

AND

What if the categorical variables are count data (from 0 to 10)? Does it make sense to standardize them?

If the values fall outside the range 0-1, it is better to normalize them to the range 0-1. Binary 0/1 features are already in that range, so they can be used as they are.
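For count features with a known range such as 0 to 10, the normalization to 0-1 might look like the sketch below. The helper name `scale_to_unit_range` and the sample counts are hypothetical. Passing the bounds explicitly, rather than reading them from the data, is an assumption of this sketch; it keeps the scaling identical for training data and for any data seen later.

```python
def scale_to_unit_range(values, lower, upper):
    """Map values from the known range [lower, upper] onto [0, 1]."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    span = upper - lower
    return [(v - lower) / span for v in values]


# Hypothetical count feature whose values are known to lie in 0..10.
counts = [0, 3, 10, 7]
scaled_counts = scale_to_unit_range(counts, lower=0, upper=10)
```

With these bounds a count of 0 maps to 0.0 and a count of 10 maps to 1.0, so zeros stay zero and the feature remains sparse if it was sparse before.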

Note: these observations might prove to be incorrect. The final decision depends on trial and error, so the best approach is to experiment with the data.
