Solved – How to distinguish continuous and categorical variables based on the number of unique values

I have a very large dataset containing 600,000 rows and 400 columns. Most of the variables are anonymous and named Vi, i=1,2,…,400, and all of them are either int or float. To work with the dataset, I have to identify which variables are categorical and which are continuous. Because I do not know the meaning of the features, the only clue I have is the number of unique values. So I ran the following code:

# Print the number of distinct values in each column of X_train
for f in X_train.columns:
    print(X_train[f].nunique())

And the output I got was:

As you can see, some of them are obvious: variables with fewer than 10 unique values I would definitely treat as categorical, and variables with more than 100,000 unique values are obviously continuous.

The hard part is deciding on the threshold. Variables with between 50 and 500 unique values look continuous, but considering that my dataset is very large, even if a variable has 500 categories, I think it would also be reasonable to treat it as categorical.

Does anyone have any good suggestions?
Thank you in advance!

Best Answer

Without some underlying knowledge of the data source, the meaning of each feature, or at least the distribution from which the data comes, it isn't possible to deterministically identify the categorical variables (and I'm assuming you do indeed mean categorical and not discrete, as in the comment above). Thus, you'll have to find some empirical metric by which to separate categorical from continuous variables. Some of these techniques could include:

Number of unique values, as you have already done
Average number of occurrences of each unique value
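
Both metrics can be computed in one pass over the columns. The following is only a minimal sketch, assuming the data is a pandas DataFrame like the X_train in the question; the helper name summarize_columns, the extra all_integer flag, and the toy data are illustrative assumptions, not part of the original question or answer.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column heuristics for guessing categorical vs. continuous.

    Hypothetical helper: the name and the chosen metrics are illustrative.
    """
    rows = []
    for name in df.columns:
        col = df[name].dropna()  # ignore missing values
        n_unique = col.nunique()
        rows.append({
            "column": name,
            # Metric 1: number of distinct values.
            "n_unique": n_unique,
            # Metric 2: mean number of rows sharing each distinct value.
            "avg_occurrences": len(col) / n_unique if n_unique else float("nan"),
            # Extra hint (an assumption): True when every value is a whole number.
            "all_integer": bool((col % 1 == 0).all()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data standing in for X_train (made-up values).
toy = pd.DataFrame({
    "V1": [1, 2, 3, 1, 2, 3, 1, 2],
    "V2": [0.11, 0.52, 0.93, 1.34, 1.75, 2.16, 2.57, 2.98],
})
summary = summarize_columns(toy)
print(summary)
```

A column with few distinct values that each occur many times (like V1 here) is a plausible categorical candidate, while a column where almost every value occurs once (like V2) looks continuous. These are hints only; they cannot prove what a column means.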

I think an important question to ask yourself given the total lack of feature labels is "does it matter that some data is categorical and some data is not?" Lots of machine learning and data analysis techniques can accommodate both types of variables (decision trees, for example). One potential issue could be if integer values have been used to encode membership in some non-numerical classes (e.g. dog = 1, cat = 2, frog = 3, ...), since a model that treats those codes as numbers will assume an ordering and spacing between classes that do not exist. If you're worried about this, I'd say you should treat every variable below some threshold of unique values as categorical (treating a continuous feature as categorical has no disastrous negative effect beyond some added model generalization error). If you're not worried about this and think that even the categorical features represent real-numbered values, treat everything as a continuous variable.
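
The threshold rule described above could be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame; the function name mark_categorical and the default cutoff of 50 are arbitrary assumptions that would need tuning for a real dataset.

```python
import pandas as pd


def mark_categorical(df: pd.DataFrame, max_unique: int = 50) -> pd.DataFrame:
    """Return a copy in which low-cardinality columns use the 'category' dtype.

    Hypothetical helper: max_unique is an arbitrary cutoff, not a recommended value.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    for name in out.columns:
        if out[name].nunique() <= max_unique:
            out[name] = out[name].astype("category")
    return out


# Toy data standing in for X_train (made-up values).
demo = pd.DataFrame({
    "V1": [1, 2, 3, 1, 2, 3],
    "V2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
typed = mark_categorical(demo, max_unique=3)
print(typed.dtypes)
```

Keeping the cutoff as a parameter makes it easy to rerun the analysis with several thresholds and check how sensitive the downstream model is to the choice.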

In conclusion, this is a pretty canonical example of "garbage in garbage out" in that most data analyses and machine learning problems require intelligent feature selection and domain-conscious treatment of the inputs to produce good results. If at all possible, return to your data source and try to get more information.
