Solved – How to distinguish continuous and categorical variables based on the number of unique values

categorical data, categorical-encoding, continuous data, high-dimensional

I have a very large dataset containing 600,000 rows and 400 columns. Most of the variables are anonymized and named Vi, i = 1, 2, …, 400, and all of them are either int or float. To work with the dataset, I need to determine which variables are categorical and which are continuous. Because I do not know the meaning of the features, I think the only clue I have is the number of unique values. So I ran the following code:

# Print the number of distinct values in each column of X_train (a pandas DataFrame)
for f in X_train.columns:
    print(X_train[f].nunique())

The output is:

20902
36807
57145
18683
24821
7311
8727
41838
39847
23028
22774
10437
10530
217850
90375
3126
7
1787
3
3
3
3
3
3
3
3
4
39974
51727
54282
157077
101
9288
77
101
13
14
14
5854
10938
2338
5775
2426
4650
2366
1509
5971
4729
2207
6813
6674
80299
172652
82646
176011
3444
2125
70656
3
25
43
3
1231
1476
205
3
3
4
1253
1103
3
3
1328
1260
319
1657
5
13553
1108
1597
1216
1199
5
3
60
61
3
219
881
89
12332
4444
641
5529
11
62
33
11377
114
76
119
500
74
332
4
8
8
7
52
6
8
32
5
4
9
49
55
649
9
7
5
10
10
10
641
688
2651
77
2340
19
24
50
77
26
365
115655
49
2552
31
62
8
9
15
20
17
2836
3451
215
2240
2282
8
32
2747
49
104
522
394
93
101
81
9
17
79

As you can see, some cases are obvious: variables with fewer than 10 unique values I would definitely treat as categorical, and variables with more than 100,000 unique values are clearly continuous.

The hard part is deciding on the threshold. Variables with between 50 and 500 unique values look continuous, but considering how large my dataset is, I think even a variable with 500 categories could reasonably be treated as categorical.

Does anyone have any good suggestions?
Thanks in advance!

Best Answer

Without some underlying knowledge of the data source, the meaning of each feature, or at least the distribution from which the data come, it isn't possible to deterministically identify the categorical variables (and I'm assuming you do indeed mean categorical and not discrete, as raised in a comment on the question). Thus, you'll have to find some empirical metric by which to separate the categorical variables from the continuous ones. Some candidate metrics:

  1. Number of unique values, as you have already done
  2. Average number of occurrences of each unique value

I think an important question to ask yourself, given the total lack of feature labels, is "does it matter that some data is categorical and some is not?" Many machine learning and data analysis techniques can accommodate both types of variables (decision trees, for example). One potential issue arises if integer values have been used to encode membership in non-numerical classes (e.g. dog = 1, cat = 2, frog = 3, ...), because the numeric order of the codes then carries no meaning. If you're worried about this, I'd say you should treat every variable with fewer unique values than some threshold as categorical: treating a continuous feature as categorical has no disastrous effect, and the main cost is increased model generalization error. If you're not worried about this and think that even the categorical features represent real-numbered values, treat everything as a continuous variable.
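The threshold rule described above might look like the sketch below. The cutoff of 50, the function name split_by_cardinality, and the toy data are assumptions chosen only for illustration; the right cutoff depends on the dataset and the model.

```python
import pandas as pd

# Hypothetical cutoff; nothing in the question fixes this value.
MAX_CATEGORIES = 50


def split_by_cardinality(df, max_categories=MAX_CATEGORIES):
    """Return (categorical_cols, continuous_cols) using a unique-count cutoff."""
    n_unique = df.nunique()
    categorical = n_unique[n_unique <= max_categories].index.tolist()
    continuous = n_unique[n_unique > max_categories].index.tolist()
    return categorical, continuous


# Toy data standing in for X_train.
demo = pd.DataFrame({
    "V1": [0.1 * i for i in range(100)],  # 100 distinct values
    "V2": [i % 3 for i in range(100)],    # 3 distinct values
})
cat_cols, cont_cols = split_by_cardinality(demo)

# Cast the low-cardinality columns to pandas' category dtype so that
# downstream code does not interpret the integer codes as ordered numbers.
typed = demo.astype({col: "category" for col in cat_cols})
print(typed.dtypes)
```

Keeping the cutoff as a parameter makes it easy to compare model performance under several thresholds instead of committing to one guess.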

In conclusion, this is a pretty canonical example of "garbage in, garbage out": most data analyses and machine learning problems require intelligent feature selection and domain-conscious treatment of the inputs to produce good results. If at all possible, return to your data source and try to get more information.
