The cluster centroid, i.e., the theoretical true center sequence that minimizes the sum of distances to all sequences in the cluster, is generally a virtual object defined as a mix of states at each position (just as the average of integer values can take non-integer values).
TraMineR does not compute such virtual centers. However, it can compute the distance to the virtual center (for the formula used, see Studer, Ritschard, Gabadinho and Muller, 2011, "Discrepancy Analysis of State Sequences", Sociological Methods and Research, Vol. 40(3), pp. 471-510).
The distance to the center is returned by the disscenter function. To get the distance to the center from the sequence with the highest silhouette in each cluster, we first retrieve the indexes of those sequences.
## Looking for the index of the first sequence with max
## silhouette in each cluster
fclust <- factor(clust4)
levclust <- levels(fclust)
imax.sil <- rep(NA, length(levclust))
for (i in 1:length(levclust)) {
  max.sil <- max(sil[fclust == levclust[i]])
  imax.sil[i] <- which(sil == max.sil & fclust == levclust[i])[1]
}
## Computing distance to center
d.to.ctr <- disscenter(mvad.dist, group = fclust,
                       weights = mvad$weight)[imax.sil]
names(d.to.ctr) <- fclust[imax.sil]
d.to.ctr
Now, you may also consider comparing the sequence with the maximum silhouette value to the medoid, i.e., the sequence in the data with the smallest sum of distances to the other sequences in the cluster.
You get a plot of the medoid of each cluster with seqrplot:
seqrplot(mvad.seq, group = fclust, dist.matrix = mvad.dist,
         criteria = "centrality", nrep = 1)
Alternatively, you can retrieve the index numbers of the medoids and then print or plot them as follows:
icenter <- disscenter(mvad.dist, group = clust4,
                      medoids.index = "first", weights = mvad$weight)
print(mvad.seq[icenter,], format="SPS")
seqiplot(mvad.seq[icenter,])
You could also compute the distances to the medoids with seqdist by setting, for instance, refseq = icenter[1] for the distance to the medoid of the first cluster.
Using the seqrep.grp function from the TraMineRextras package, you get representativeness quality measures of the medoids (see Gabadinho, Ritschard, Studer and Muller, 2011, "Extracting and Rendering Representative Sequences", in Fred, A., Dietz, J.L.G., Liu, K. & Filipe, J. (eds), Knowledge Discovery, Knowledge Engineering and Knowledge Management, Communications in Computer and Information Science (CCIS), Vol. 128, pp. 94-106, Springer-Verlag).
library(TraMineRextras)
seqrep.grp(mvad.seq, group = fclust, mdis = mvad.dist,
           criteria = "centrality", nrep = 1, ret = "both")
Hope this helps.
I will state what I think you are asking. If I have misunderstood your question, please comment and I will delete this answer.
I think that you are saying that you have some text data. Cosine is usually used to measure similarity of documents, but the similarity matrix can be converted to a distance/dissimilarity measure, and it sounds like you have done that. You used this to perform clustering and want to visualize the results to see if the clustering makes sense and possibly gain some insight from the clusters. But all you have is very high-dimensional text (which is hard to plot) and a distance matrix. How can you get a useful visualization?
One common way to get a plot that shows clusters is to run principal components analysis on your data and then project the data onto the first two principal components. The two-dimensional data can be plotted. The x-y coordinates are in terms of the principal components, which are linear combinations of the original dimensions. This can be hard to interpret.
There are several other good methods to go from a distance matrix to a low-dimensional representation of your data suitable for graphing. The methods try to create a representation (probably 2-dimensional for graphing) that preserves the distance relations stored in the distance matrix. Of course, it is not generally possible to do this exactly, but still these methods can produce useful visualizations.
I will point you to two such methods: Multi-dimensional Scaling (MDS) and t-distributed Stochastic Neighbor Embedding (tSNE).
Both can produce useful results from a distance matrix. Both have easy-to-use implementations in R and presumably other languages.
Both MDS and tSNE use optimization methods to construct a two-dimensional representation of the data and so are not even as simple as the linear combinations of dimensions that you get from PCA. Because of this, the two dimensions that are produced cannot generally be interpreted in terms of the original dimensions. They preserve the distance between points, but not the meaning of the dimensions.
I believe that the picture that you copied from the Code Project k-means page was merely meant to be illustrative of what happens when the original data has two dimensions, where the process is easier to understand. In that picture, the x and y are the x and y of the original data. A different example from the Code Project is closer to your use. It clusters words using cosine similarity and then creates a two-dimensional plot. The axes there are simply labeled x[,1] and x[,2]. The two coordinates were created by tSNE. Thus, you cannot really interpret the coordinates themselves. But there is reason to think that the relationships between the words are preserved as much as possible in reducing this to two dimensions.
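As a minimal sketch of the MDS route, here is what the distance-matrix-to-plot pipeline looks like in base R (the data and cluster labels below are invented purely for illustration; substitute your own distance matrix and cluster assignments):

```r
# Toy example: project a distance matrix to 2D with classical MDS.
# The data and clustering are made up for illustration only.
set.seed(1)
x <- matrix(rnorm(40 * 5), nrow = 40)       # 40 "documents", 5 features
d <- dist(x)                                # any dissimilarity matrix works here
coords <- cmdscale(d, k = 2)                # classical MDS: 40 x 2 coordinates
clusters <- kmeans(x, centers = 3)$cluster  # stand-in for your cluster labels
plot(coords, col = clusters, pch = 19,
     xlab = "MDS dim 1", ylab = "MDS dim 2")
```

For the tSNE alternative, the Rtsne package accepts a distance matrix directly (Rtsne(as.matrix(d), is_distance = TRUE)); as noted above, in either case the two resulting axes have no interpretation in terms of the original dimensions.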
Best Answer
If you just want to say that there is a statistically significant dependence between the occurrence of the event and the sign of the change in the stock market, you can model it the following way: create two binary variables for each day of your observation period:
$X_t=1$ if the stock market rose on day $t$, 0 otherwise
$Y_t=1$ if the event occurred on day $t$, 0 otherwise
You can then check the independence of variables $X$ and $Y$ using a chi-squared test.
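In R, the test is a one-liner once the two indicators are tabulated; the data below are simulated just to show the mechanics (the event rate and the link between $X$ and $Y$ are made up):

```r
# Simulated illustration of the chi-squared independence test.
# X = 1 if the market rose on day t, Y = 1 if the event occurred on day t.
set.seed(42)
n <- 500
Y <- rbinom(n, 1, 0.3)                        # event indicator (invented rate)
X <- rbinom(n, 1, ifelse(Y == 1, 0.65, 0.5))  # market direction, linked to Y
tab <- table(X, Y)                            # 2 x 2 contingency table
chisq.test(tab)                               # small p-value suggests dependence
```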
If you want to get more fancy, you can also fit a linear regression model, regress the magnitude of stock market changes on $Y$, and use the t-test p-value of $Y$'s coefficient. If you want to account for the number of occurrences on each day (rather than just 0-1), you can do a Poisson regression of the occurrence counts, using the S&P change as the explanatory variable.
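Both regression variants can be sketched as follows, again on simulated data (the effect size and event rate are invented; with real data you would use your observed daily changes and counts):

```r
# Sketch of the two regression approaches, on simulated data.
set.seed(7)
n <- 500
events <- rpois(n, 0.4)           # daily event counts (invented rate)
Y <- as.integer(events > 0)       # 0-1 event indicator
change <- 0.2 * Y + rnorm(n)      # daily market change, linked to Y

# Linear regression: t-test on Y's coefficient tests whether the
# event shifts the mean daily change.
summary(lm(change ~ Y))$coefficients

# Poisson regression: event counts explained by the market change.
summary(glm(events ~ change, family = poisson))$coefficients
```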
NB: neither the chi-squared test nor the regressions take into account the time-series aspect of the data; they work as if all days in the dataset were randomly drawn from the whole period. Therefore they do not account for the fact that the occurrence of the event might also depend on its having occurred on previous days.