The cluster centroid, i.e., the theoretical true center sequence that minimizes the sum of distances to all sequences in the cluster, is generally a virtual object defined as a mix of states at each position (just as the average of integer values can take non-integer values).
TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Muller, 2011, "Discrepancy Analysis of State Sequences", Sociological Methods and Research, Vol. 40(3), pp. 471-510).
The distance to the center is returned by the disscenter function. To get the distance to the center from the sequence with the highest silhouette in each cluster, we first retrieve the indexes of those sequences.
## Looking for the index of the first sequence with max
## silhouette in each cluster
fclust <- factor(clust4)
levclust <- levels(fclust)
imax.sil <- rep(NA, length(levclust))
for (i in 1:length(levclust)) {
  max.sil <- max(sil[fclust == levclust[i]])
  imax.sil[i] <- which(sil == max.sil & fclust == levclust[i])[1]
}
## Computing distance to center
d.to.ctr <- disscenter(mvad.dist, group = fclust,
                       weights = mvad$weight)[imax.sil]
names(d.to.ctr) <- fclust[imax.sil]
d.to.ctr
Now, you may also consider comparing the sequence with the maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
You get a plot of the medoid of each cluster with seqrplot:
seqrplot(mvad.seq, group = fclust, dist.matrix = mvad.dist,
         criteria = "centrality", nrep = 1)
Alternatively, you can retrieve the index numbers of the medoids and then print or plot them as follows:
icenter <- disscenter(mvad.dist, group = clust4,
                      medoids.index = "first", weights = mvad$weight)
print(mvad.seq[icenter,], format="SPS")
seqiplot(mvad.seq[icenter,])
You could also compute the distances to the medoids with seqdist by setting, for instance, refseq = icenter[1] for the distance to the medoid of the first cluster.
Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures of the medoids (see Gabadinho, Ritschard, Studer and Muller, 2011, "Extracting and Rendering Representative Sequences", in Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds), Knowledge Discovery, Knowledge Engineering and Knowledge Management, Communications in Computer and Information Science (CCIS), Vol. 128, pp. 94-106, Springer-Verlag).
library(TraMineRextras)
seqrep.grp(mvad.seq, group = fclust, mdis = mvad.dist,
           criteria = "centrality", nrep = 1, ret = "both")
Hope this helps.
I will state what I think you are asking. If I have misunderstood your question, please comment and I will delete this answer.
I think that you are saying that you have some text data. Cosine is usually used to measure similarity of documents, but the similarity matrix can be converted to a distance/dissimilarity measure, and it sounds like you have done that. You used this to perform clustering and want to visualize the results to see if the clustering makes sense and possibly gain some insight from the clusters. But all you have is very high-dimensional text (which is hard to plot) and a distance matrix. How can you get a useful visualization?
One common way to get a plot that shows clusters is to run principal components analysis on your data and then project the data onto the first two principal components. The two-dimensional data can be plotted. The x-y coordinates are in terms of the principal components, which are linear combinations of the original dimensions. This can be hard to interpret.
There are several other good methods to go from a distance matrix to a low-dimensional representation of your data suitable for graphing. The methods try to create a representation (probably 2-dimensional for graphing) that preserves the distance relations stored in the distance matrix. Of course, it is not generally possible to do this exactly, but still these methods can produce useful visualizations.
I will point you to two such methods: Multi-dimensional Scaling (MDS) and t-distributed Stochastic Neighbor Embedding (tSNE).
Both can produce useful results from a distance matrix. Both have easy-to-use implementations in R and presumably other languages.
Both MDS and tSNE use optimization methods to construct a two-dimensional representation of the data and so are not even as simple as the linear combinations of dimensions that you get from PCA. Because of this, the two dimensions that are produced cannot generally be interpreted in terms of the original dimensions. They preserve the distance between points, but not the meaning of the dimensions.
I believe that the picture that you copied from the Code Project k-means page was merely meant to be illustrative of what happens when the original data has two dimensions, where the process is easier to understand. In that picture, the x and y are the x and y of the original data. A different example from the Code Project is closer to your use. It clusters words using cosine similarity and then creates a two-dimensional plot. The axes there are simply labeled x[,1] and x[,2]. The two coordinates were created by tSNE. Thus, you cannot really interpret the coordinates themselves. But there is reason to think that the relationships between the words are preserved as much as possible in reducing this to two dimensions.
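As a minimal sketch of the MDS route, here is what the distance-matrix-to-plot pipeline looks like in base R (the data and cluster labels below are invented purely for illustration; substitute your own distance matrix and cluster assignments):

```r
# Toy example: project a distance matrix to 2D with classical MDS.
# The data and clustering are made up for illustration only.
set.seed(1)
x <- matrix(rnorm(40 * 5), nrow = 40)       # 40 "documents", 5 features
d <- dist(x)                                # any dissimilarity matrix works here
coords <- cmdscale(d, k = 2)                # classical MDS: 40 x 2 coordinates
clusters <- kmeans(x, centers = 3)$cluster  # stand-in for your cluster labels
plot(coords, col = clusters, pch = 19,
     xlab = "MDS dim 1", ylab = "MDS dim 2")
```

For the tSNE alternative, the Rtsne package accepts a distance matrix directly (Rtsne(as.matrix(d), is_distance = TRUE)); as noted above, in either case the two resulting axes have no interpretation in terms of the original dimensions.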
Best Answer
If you just want to say that there is a statistically significant dependence between the occurrence of the event and the sign of the change in the stock market, you can model it the following way: create two binary variables for each day of your observation period:
$X_t=1$ if the stock market rose on day $t$, 0 otherwise
$Y_t=1$ if the event occurred on day $t$, 0 otherwise
You can then check the independence of variables $X$ and $Y$ using a chi-squared test.
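In R, the test is a one-liner once the two indicators are tabulated; the data below are simulated just to show the mechanics (the event rate and the link between $X$ and $Y$ are made up):

```r
# Simulated illustration of the chi-squared independence test.
# X = 1 if the market rose on day t, Y = 1 if the event occurred on day t.
set.seed(42)
n <- 500
Y <- rbinom(n, 1, 0.3)                        # event indicator (invented rate)
X <- rbinom(n, 1, ifelse(Y == 1, 0.65, 0.5))  # market direction, linked to Y
tab <- table(X, Y)                            # 2 x 2 contingency table
chisq.test(tab)                               # small p-value suggests dependence
```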
If you want to get more fancy, you can also fit a linear regression model, regress the magnitude of stock market changes on $Y$, and use the t-test p-value of $Y$'s coefficient. If you want to account for the number of occurrences on each day (rather than just 0-1), you can do a Poisson regression of the occurrence counts, using the S&P change as the explanatory variable.
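Both regression variants can be sketched as follows, again on simulated data (the effect size and event rate are invented; with real data you would use your observed daily changes and counts):

```r
# Sketch of the two regression approaches, on simulated data.
set.seed(7)
n <- 500
events <- rpois(n, 0.4)           # daily event counts (invented rate)
Y <- as.integer(events > 0)       # 0-1 event indicator
change <- 0.2 * Y + rnorm(n)      # daily market change, linked to Y

# Linear regression: t-test on Y's coefficient tests whether the
# event shifts the mean daily change.
summary(lm(change ~ Y))$coefficients

# Poisson regression: event counts explained by the market change.
summary(glm(events ~ change, family = poisson))$coefficients
```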
NB: neither the chi-squared test nor the regressions take into account the time-series aspect of the data; they work as if all days in the dataset were randomly drawn from the whole period. Therefore they do not account for the fact that the occurrence of the event might also depend on its having occurred on previous days.