Clustering customers by their orders sequence patterns

Tags: clustering, sequence analysis, time series, traminer

I have a dataset of customer orders. Example:

Customer_1 07.06.2017 Order_1 Product_1
Customer_1 15.06.2017 Order_2 Product_2
Customer_1 01.09.2017 Order_2 Product_1
Customer_2 07.05.2017 Order_3 Product_3
Customer_2 07.06.2017 Order_4 Product_2
Customer_2 25.09.2017 Order_5 Product_3
Customer_2 05.12.2017 Order_5 Product_1
....
Customer_N

How can I cluster these customers by behavior? The dataset looks like a time series, but I am struggling to find the right approach. Each customer's history has a different length, so I can't use simple clustering algorithms.

My main aim is to distinguish different customer behaviors: to find people who have started buying more frequently, who have changed their product preferences (started buying other products), or who tried products that were new to them and then went back to their previous products. How can I cluster such patterns of behavior?

Best Answer

Your data are timestamped event sequences. One way to cluster your customers is to compute the pairwise dissimilarities between the sequences and then pass the resulting matrix to any clustering procedure that accepts a dissimilarity matrix as input.

You can compute the pairwise dissimilarities with the optimal matching method for event sequences (OME; see Ritschard et al., 2013), which is implemented in the TraMineRextras R package, a companion of the TraMineR package.

Below I illustrate how to get the dissimilarity matrix for your two example sequences. We first need to create a TraMineR event sequence object, which requires numeric ids and dates stored as integers, so we make these transformations first. I use Product as the event and ignore Order (it is not clear to me what it represents).

library(TraMineRextras)

d <- c(
"Customer_1", "07.06.2017", "Order_1", "Product_1",
"Customer_1", "15.06.2017", "Order_2", "Product_2",
"Customer_1", "01.09.2017", "Order_2", "Product_1",
"Customer_2", "07.05.2017", "Order_3", "Product_3",
"Customer_2", "07.06.2017", "Order_4", "Product_2",
"Customer_2", "25.09.2017", "Order_5", "Product_3",
"Customer_2", "05.12.2017", "Order_5", "Product_1"
)
md <- matrix(d, nrow = 7, ncol=4, byrow=TRUE)
md <- as.data.frame(md)
## strip the "Customer_" prefix to get numeric ids
md[,1] <- as.integer(gsub("Customer_", "", md[,1]))
## convert dates to integers (days since 1970-01-01)
md[,2] <- as.integer(as.Date(md[,2], format ="%d.%m.%Y"))
names(md) <- c("Id","Timestamp","Order","Product")
md
##   Id Timestamp   Order   Product
## 1  1     17324 Order_1 Product_1
## 2  1     17332 Order_2 Product_2
## 3  1     17410 Order_2 Product_1
## 4  2     17293 Order_3 Product_3
## 5  2     17324 Order_4 Product_2
## 6  2     17434 Order_5 Product_3
## 7  2     17505 Order_5 Product_1

## Creating the event sequence object
eseq <- seqecreate(id=md$Id, timestamp=md$Timestamp, event=md$Product)
## event sequences with number indicating time intervals in days
eseq
## [1] 17324-(Product_1)-8-(Product_2)-78-(Product_1)                  
## [2] 17293-(Product_3)-31-(Product_2)-110-(Product_3)-71-(Product_1)

Now compute the dissimilarities between the sequences with OME:

## you may have to play with the parameters idcost and vparam
idcost <- rep(1,3)
diss <- seqedist(eseq, idcost = idcost, vparam = .01)
diss
##           [,1]      [,2]
## [1,] 0.0000000 0.7307344
## [2,] 0.7307344 0.0000000

You can then cluster your sequences by passing the diss matrix to a hierarchical clustering method (e.g., the hclust function) or to a partitioning around medoids method (see, e.g., the WeightedCluster package, which is specifically designed for sequences). Note that you may have to supply diss as a distance object, as.dist(diss).
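
As a rough sketch of that last step: with only two customers there is nothing to cluster, so the example below uses a made-up 4 x 4 dissimilarity matrix standing in for the output of seqedist on a larger data set. The values, the average linkage, and the choice of two clusters are all assumptions for illustration. It uses hclust and cutree from base R and pam from the cluster package (a recommended package shipped with R) rather than WeightedCluster.

```r
# Toy dissimilarity matrix for four hypothetical customers (made-up values,
# standing in for the output of seqedist() on a larger data set).
ids <- paste0("Customer_", 1:4)
diss <- matrix(c(0.0, 0.2, 0.8, 0.9,
                 0.2, 0.0, 0.9, 0.8,
                 0.8, 0.9, 0.0, 0.3,
                 0.9, 0.8, 0.3, 0.0),
               nrow = 4, byrow = TRUE,
               dimnames = list(ids, ids))

# hclust() expects a "dist" object, hence as.dist()
d <- as.dist(diss)

# Hierarchical clustering (average linkage is an arbitrary choice here),
# then cut the tree into two groups (k = 2 is an assumption)
hc <- hclust(d, method = "average")
groups_hc <- cutree(hc, k = 2)

# Partitioning around medoids on the same dissimilarities
library(cluster)
pm <- pam(d, k = 2, diss = TRUE)
groups_pam <- pm$clustering

print(groups_hc)
print(groups_pam)
```

In practice the number of clusters would have to be chosen from the data, for example by inspecting the dendrogram with plot(hc) or by comparing cluster quality measures across several values of k.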
