Solved – statistical data types: Difference between (categorical) ordinal and (numerical) discrete data

Tags: categorical-data, data-transformation, machine-learning

I just found this explanation for the different statistical data types, but I'm still not sure what the difference between ordinal categorical and numerical discrete variables is.

The example for ordinal data given in this link was the rating of a restaurant in stars. But they define numerical discrete data as follows:

Discrete data represent items that can be counted; they take on possible values that can be listed out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on to infinity (making it countably infinite). For example, the number of heads in 100 coin flips takes on values from 0 through 100 (finite case), but the number of flips needed to get 100 heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get to that 100th heads). Its possible values are listed as 100, 101, 102, 103, . . . (representing the countably infinite case).

In my opinion, the star rating of a restaurant fits this definition perfectly: it's a finite number of stars, and I can count them. So why is this ordinal data? Is this example just wrong? If so, could anybody provide another example of ordinal data?

I want to understand this because I want to convert this data to numerical features. I can convert nominal categorical data to one-hot vectors, but I'm not sure what the right way is to convert categorical ordinal features.
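Not part of the original question, but to make the two encodings concrete, here is a minimal sketch (the data, column names, and the 1/2/3 spacing are all illustrative assumptions) contrasting a one-hot encoding for a nominal feature with an order-preserving integer mapping for an ordinal one, using pandas:

```python
import pandas as pd

# Hypothetical example data: a nominal feature (cuisine) and an
# ordinal feature (star rating as text labels).
df = pd.DataFrame({
    "cuisine": ["italian", "thai", "italian"],
    "rating": ["one star", "three stars", "two stars"],
})

# Nominal: one-hot encode, since the categories have no order.
onehot = pd.get_dummies(df["cuisine"], prefix="cuisine")

# Ordinal: map each label to an integer that preserves the order.
# The equal spacing (1, 2, 3) is an assumption -- the true "distances"
# between ratings are unknown, which is exactly the ordinal caveat.
order = {"one star": 1, "two stars": 2, "three stars": 3}
df["rating_encoded"] = df["rating"].map(order)

print(onehot.columns.tolist())        # ['cuisine_italian', 'cuisine_thai']
print(df["rating_encoded"].tolist())  # [1, 3, 2]
```

The key design choice is that the ordinal mapping keeps the ranking but invents the spacing; some models are insensitive to that assumption (e.g. tree-based methods, which only use order), while distance-based models will treat the chosen spacing as meaningful.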

Best Answer

Indeed, nothing shows the deficiencies of a classification system better than applying it to the real world.

The difference between ordinal and discrete numerical data is not always as clear cut as we'd like, and this is exactly what you're running into here.

The difference between the star rating of the restaurants and the coin flips they use as an example of discrete data is whether the distances between the values are meaningful. For example: while everyone would agree that a two-star restaurant is better than a one-star restaurant, giving a restaurant two stars does not imply that you like it twice as much as the one-star restaurant. Similarly, people will not necessarily agree that the "quality difference" between a four- and a five-star restaurant is the same as that between, say, a one- and a two-star restaurant, if only because the five-star category absorbs all restaurants at the top.

In contrast two coin flips is exactly twice as many as one coin flip, and the difference between four and five coin flips is the exact same difference as between one and two coin flips.

A consequence of this difference is that descriptive statistics like the mean are of doubtful validity for ordinal data (although they are often computed anyway), while they are perfectly reasonable for discrete numeric data.
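As a small illustration of that point (the numbers below are made up for the example): for ordinal data only the ordering is trustworthy, so order-based statistics like the median are safe, whereas for discrete numeric data the mean is fully meaningful:

```python
import statistics

# Ordinal: star ratings. The median needs only the ordering of the
# values, so it is safe here; the mean would implicitly assume that
# the gaps between star levels are equal.
stars = [1, 5, 5, 4, 2]
print(statistics.median(stars))  # 4

# Discrete numeric: heads counted in repeated runs of 100 coin flips.
# Here the mean is meaningful, because 2 heads really are twice 1 head.
heads_per_trial = [48, 52, 50]
print(statistics.mean(heads_per_trial))  # 50
```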
