Solved – How should I classify stores based on the demographics of their customers

categorical data, classification, clustering

I've got a dataset of demographic details of store customers and which store they (most frequently) visit. I would like to categorize the stores based on their customers.

To clarify: the goal is to create clusters of shops based on the characteristics of the customers who visit them. In other words, the aim is to group together shops that have a similar clientele.

I have around 7,000 customer records, distributed (unevenly) across about 50 stores. Most of the customer data is categorical, but there are a couple of continuous variables. How should I go about categorizing the different stores?

Best Answer

You have to aggregate the data to the level of the 50 stores, so that each store becomes a single row describing its customers. Then you can apply your clustering algorithm to these aggregated data.

Regarding the categorical variables, I would not use the modes. I would recode all the categorical variables into binary 0/1 variables and compute their means per store. For example, if a variable equals 1 when a customer is a man and 0 otherwise, its mean is the proportion of men among the customers of a particular shop. Set up your data as follows: if a categorical variable has two categories, recode it into a single 0/1 variable; if it has more than two categories, create a binary (0/1) variable for each category of the original variable.
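The recoding and aggregation described above can be sketched in plain Python. This is only an illustration under assumptions: the toy records, the column names (`store`, `gender`, `age_group`, `income`) and the helper `store_profiles` are all made up for the example and are not from the question's dataset.

```python
from collections import defaultdict

# Hypothetical toy records; the real data has ~7,000 customers and ~50 stores.
customers = [
    {"store": "A", "gender": "male", "age_group": "18-34", "income": 30.0},
    {"store": "A", "gender": "female", "age_group": "35-54", "income": 50.0},
    {"store": "B", "gender": "female", "age_group": "55+", "income": 40.0},
    {"store": "B", "gender": "female", "age_group": "18-34", "income": 60.0},
]

CATEGORICAL = ["gender", "age_group"]  # assumed column names
CONTINUOUS = ["income"]                # assumed column name


def store_profiles(records, categorical, continuous, key="store"):
    """Return one row per store: category proportions and continuous means."""
    # Categories observed for each categorical variable.
    levels = {var: sorted({r[var] for r in records}) for var in categorical}
    # Two categories -> a single 0/1 indicator; more -> one indicator each.
    indicators = {
        var: (lv[1:] if len(lv) == 2 else lv) for var, lv in levels.items()
    }

    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for r in records:
        store = r[key]
        counts[store] += 1
        for var in categorical:
            for level in indicators[var]:
                # Adding the 0/1 indicator; its mean is a proportion.
                sums[store][f"{var}={level}"] += 1.0 if r[var] == level else 0.0
        for var in continuous:
            sums[store][var] += r[var]

    return {
        store: {col: total / counts[store] for col, total in cols.items()}
        for store, cols in sums.items()
    }


profiles = store_profiles(customers, CATEGORICAL, CONTINUOUS)
for store, row in sorted(profiles.items()):
    print(store, row)
```

Each row of `profiles` is then one observation for whatever clustering algorithm you choose. If that algorithm is distance-based, the continuous means will probably need rescaling first, since they are not on the same 0 to 1 scale as the proportions.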
