Is count data numerical or categorical?

categorical data, clustering, data transformation, python

I am working on road network data (CSV), and my data has 6 features. Two of these features are continuous and three of them are categorical. One of the features is the number of lanes of a street, i.e. how many lanes a road has: one road has two lanes, another has four, and so on. The following image shows a road with three lanes (the image is only there to illustrate what I mean; my data is a CSV file).

[Image: a road with three lanes]

I have three questions as follows:

  1. Is the number of lanes categorical or numerical? Some resources I have read treat count data as categorical, while others treat it as numerical. Could you please tell me which applies to the number of lanes?

  2. What approaches do you suggest for clustering data with mixed types (quantitative and qualitative)?

  3. Can I transform the continuous data to categorical data and then perform a clustering analysis?

Best Answer

The word numerical means 'consisting of numbers' ('expressed in or counted by numbers' in one dictionary). Counts are clearly numerical. Indeed, they have a meaningful zero, and '6' is literally twice as much as '3' and three times as much as '2', and so forth (3 bricks + 3 bricks = 6 bricks, so 6 bricks is twice as many bricks as 3 bricks). So if you're considering Stevens' typology, counts are arguably ratio scale to boot.

What matters more is how you see it coming into your model.

If you're reading a book that tells you how to treat variables based only on whether they are 'categorical' or not, you may sometimes be led into poor choices of analysis.
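To make "how it comes into your model" concrete, here is a minimal sketch comparing two ways of feeding a lane count to a distance-based method. The lane values, the `one_hot` helper and the use of Euclidean distance are assumptions chosen for illustration; they are not taken from the asker's data or method.

```python
import math

# Hypothetical lane counts, for illustration only.
lanes = [1, 2, 3, 4, 6]
levels = sorted(set(lanes))

def one_hot(value, levels):
    """Encode a value as an indicator vector over the observed levels."""
    return [1.0 if value == level else 0.0 for level in levels]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Treated as a number: distance grows with the difference in lane count.
numeric_2_vs_3 = euclidean([2], [3])
numeric_2_vs_6 = euclidean([2], [6])

# Treated as unordered categories: all distinct values are equally far apart.
onehot_2_vs_3 = euclidean(one_hot(2, levels), one_hot(3, levels))
onehot_2_vs_6 = euclidean(one_hot(2, levels), one_hot(6, levels))

print("numeric:", numeric_2_vs_3, numeric_2_vs_6)
print("one-hot:", onehot_2_vs_3, onehot_2_vs_6)
```

In the numeric encoding a 6-lane road is farther from a 2-lane road than a 3-lane road is; in the one-hot encoding both are equally distant. Which behaviour is appropriate depends on what the analysis is for.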

You can bin variables. You can forget the bin boundaries and make them into (ordered) categories. You can even ignore the ordering. Every one of those steps results in a loss of information, and in many cases the introduction of bias, so the larger question is not whether you can, but whether you should. If you step back from "must use clustering" for a moment, what are you trying to achieve with it?
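As a rough illustration of the information lost by binning, the sketch below bins a made-up continuous feature. The values, the bin boundaries and the labels are all assumptions for the example, not anything from the asker's data.

```python
import bisect

# Hypothetical continuous feature (say, segment length in metres).
lengths = [12.0, 48.5, 51.0, 99.0, 101.0, 250.0]
edges = [50.0, 100.0]                  # assumed bin boundaries
labels = ["short", "medium", "long"]   # assumed category names

def bin_index(x, edges):
    """Index of the bin containing x: the number of boundaries <= x."""
    return bisect.bisect_right(edges, x)

binned = [labels[bin_index(x, edges)] for x in lengths]

# Two values only 2.5 apart end up in different categories...
gap_split = lengths[2] - lengths[1]
# ...while two values 36.5 apart become indistinguishable.
gap_merged = lengths[1] - lengths[0]

print(binned)
print("split apart despite gap of", gap_split)
print("merged despite gap of", gap_merged)
```

Once only the labels are kept, the original spacing cannot be recovered, and if the labels are then treated as unordered, the fact that "medium" lies between "short" and "long" is lost as well.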