You can measure how much correlation with the original ordering remains (or, more precisely, how much randomness has accumulated) using the Shannon entropy of the differences in face value between all pairs of adjacent cards.
Here's how to compute it, for a randomly shuffled deck of 52 cards. You start by looping once through the entire deck, and building a sort of histogram. For each card position $i=1,2,...,52$, calculate the difference in face value $\Delta F_{i} = F_{i+1} - F_{i}$. To make this more concrete, let's say that the card in the $(i+1)$th position is the king of spades, and the card in the $i$th position is the four of clubs. Then we have $F_{i+1} = 51$ and $F_{i} = 3$ and $\Delta F_{i} = 51-3 = 48$. When you get to $i=52$, it's a special case; you loop around back to the beginning of the deck again and take $\Delta F_{52} = F_{1} - F_{52}$. If you end up with negative numbers for any of the $\Delta F$'s, add 52 to bring the face value difference back into the range 1-52.
You will end up with a set of face value differences for 52 pairs of adjacent cards, each one falling into an allowed range from 1-52; count the relative frequency of these using a histogram (i.e., a one-dimensional array) with 52 elements. The histogram records a sort of "observed probability distribution" for the deck; you can normalize this distribution by dividing the counts in each bin by 52. You will thus end up with a series of variables $p_{1}, p_{2}, ... p_{52}$ where each one may take on a discrete range of possible values: {0, 1/52, 2/52, 3/52, etc.} depending upon how many pairwise face value differences ended up randomly in a particular bin of the histogram.
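To make the wrap-around difference and normalization steps concrete, here is a minimal R sketch; it uses a random permutation from `sample()` as a stand-in for whatever deck ordering you want to evaluate, and mirrors the variable naming in the full simulation code below:
ncard <- 52
# A random permutation stands in for the deck ordering being evaluated
shuffleorder <- sample(ncard)
# Adjacent face value differences, wrapping around from position 52 back to 1
dF <- diff(c(shuffleorder, shuffleorder[1]))
# Add 52 to any negative differences (periodic boundary condition)
dF <- ifelse(dF < 1, dF + ncard, dF)
# Normalized histogram: p_1, ..., p_52
p <- tabulate(dF, nbins = ncard) / ncard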
Once you have the histogram, you can calculate the Shannon entropy for a particular shuffle iteration as $$E = -\sum_{k=1}^{52} p_{k} \ln(p_{k})$$ where any bin with $p_{k} = 0$ contributes zero (the usual convention $0 \ln 0 = 0$). I have written a small simulation in R to demonstrate the result. The first plot shows how the entropy evolves over the course of 20 shuffle iterations. A value of 0 is associated with a perfectly ordered deck; larger values signify a deck which is progressively more disordered or decorrelated. The second plot shows a series of 20 facets, each containing a plot similar to the one originally included with the question, showing shuffled card order vs. initial card order. The 20 facets in the second plot correspond to the 20 iterations in the first plot and share the same color coding, so you can get a visual feel for what level of Shannon entropy corresponds to how much randomness in the sort order. The simulation code that generated the plots is appended at the end.
library(ggplot2)
# Number of cards
ncard <- 52
# Number of shuffles to plot
nshuffle <- 20
# Parameter between 0 and 1 to control randomness of the shuffle
# Setting this closer to 1 makes the initial correlations fade away
# more slowly, setting it closer to 0 makes them fade away faster
mixprob <- 0.985
# Make data frame to keep track of progress
shuffleorder <- NULL
startorder <- NULL
iteration <- NULL
shuffletracker <- data.frame(shuffleorder, startorder, iteration)
# Initialize cards in sequential order
startorder <- seq(1,ncard)
shuffleorder <- startorder
entropy <- rep(0, nshuffle)
# Loop over each new shuffle
for (ii in 1:nshuffle) {
    # Append previous results to data frame
    iteration <- rep(ii, ncard)
    shuffletracker <- rbind(shuffletracker, data.frame(shuffleorder,
                                                       startorder, iteration))
    # Calculate pairwise value difference histogram
    freq <- rep(0, ncard)
    for (ij in 1:ncard) {
        if (ij == 1) {
            idx <- shuffleorder[1] - shuffleorder[ncard]
        } else {
            idx <- shuffleorder[ij] - shuffleorder[ij-1]
        }
        # Impose periodic boundary condition
        if (idx < 1) {
            idx <- idx + ncard
        }
        freq[idx] <- freq[idx] + 1
    }
    # Sum over frequency histogram to compute entropy
    for (ij in 1:ncard) {
        if (freq[ij] == 0) {
            x <- 0
        } else {
            p <- freq[ij] / ncard
            x <- -p * log(p, base=exp(1))
        }
        entropy[ii] <- entropy[ii] + x
    }
    # Shuffle the cards to prepare for the next iteration
    lefthand <- shuffleorder[floor((ncard/2)+1):ncard]
    righthand <- shuffleorder[1:floor(ncard/2)]
    ij <- 0
    ik <- 0
    while ((ij+ik) < ncard) {
        if ((runif(1) < mixprob) & (ij < length(lefthand))) {
            ij <- ij + 1
            shuffleorder[ij+ik] <- lefthand[ij]
        }
        if ((runif(1) < mixprob) & (ik < length(righthand))) {
            ik <- ik + 1
            shuffleorder[ij+ik] <- righthand[ik]
        }
    }
}
# Plot entropy vs. shuffle iteration
iteration <- seq(1, nshuffle)
output <- data.frame(iteration, entropy)
print(qplot(iteration, entropy, data=output, xlab="Shuffle Iteration",
ylab="Information Entropy", geom=c("point", "line"),
color=iteration) + scale_color_gradient(low="#ffb000",
high="red"))
# Plot gradually de-correlating sort order
dev.new()
print(qplot(startorder, shuffleorder, data=shuffletracker, color=iteration,
xlab="Start Order", ylab="Shuffle Order") + facet_wrap(~ iteration,
ncol=4) + scale_color_gradient(low="#ffb000", high="red"))
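As an aside, the per-iteration histogram and entropy calculation above can be condensed into a single vectorized helper; this is just a sketch, and it should agree with the loop-based version up to floating-point rounding:
shannon_entropy <- function(order) {
    n <- length(order)
    # Adjacent differences with periodic boundary condition
    dF <- diff(c(order, order[1]))
    dF <- ifelse(dF < 1, dF + n, dF)
    # Normalized histogram, dropping empty bins (0 * log(0) treated as 0)
    p <- tabulate(dF, nbins = n) / n
    p <- p[p > 0]
    -sum(p * log(p))
}
# e.g. shannon_entropy(1:52) returns 0 for a perfectly ordered deck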
Here's a simulated example of two prices that are very highly correlated ($\rho = 0.9875$). When you attempt to predict the price change in one using the lagged value of the other, very little of the variation in the price change is explainable:
. clear
. set seed 12092021
. set obs 102
Number of observations (_N) was 0, now 102.
. gen t = _n
. tsset t
Time variable: t, 1 to 102
Delta: 1 unit
. gen p1 = 1 + 3*t + rnormal(0,5)
. gen p2 = 3 + 2*t + rnormal(0,10)
. corr p1 p2
(obs=102)
| p1 p2
-------------+------------------
p1 | 1.0000
p2 | 0.9875 1.0000
. reg FD.p2 p1
Source | SS df MS Number of obs = 101
-------------+---------------------------------- F(1, 99) = 0.01
Model | .727541841 1 .727541841 Prob > F = 0.9436
Residual | 14322.4337 99 144.671048 R-squared = 0.0001
-------------+---------------------------------- Adj R-squared = -0.0100
Total | 14323.1613 100 143.231613 Root MSE = 12.028
------------------------------------------------------------------------------
FD.p2 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
p1 | .0009672 .0136392 0.07 0.944 -.0260959 .0280303
_cons | 1.665843 2.420693 0.69 0.493 -3.137338 6.469024
------------------------------------------------------------------------------
. reg FD.p1 p2
Source | SS df MS Number of obs = 101
-------------+---------------------------------- F(1, 99) = 0.01
Model | .683934381 1 .683934381 Prob > F = 0.9171
Residual | 6210.52068 99 62.7325321 R-squared = 0.0001
-------------+---------------------------------- Adj R-squared = -0.0100
Total | 6211.20461 100 62.1120461 Root MSE = 7.9204
------------------------------------------------------------------------------
FD.p1 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
p2 | -.0013704 .0131245 -0.10 0.917 -.0274123 .0246715
_cons | 3.260085 1.574913 2.07 0.041 .1351165 6.385054
------------------------------------------------------------------------------
Here FD. is the lead of the first difference, so $FD.p_t = p_{t+1} - p_t$ is the next period's price change.
The $R^2$ (a.k.a. R-squared) of both models is essentially zero, so very little of the variation in tomorrow's price change is explained by today's price. This illustrates the intuition that, knowing what you know today, you cannot act on this correlation to make money tomorrow.
You can play around with variations on this approach (using the lagged price change as a predictor, non-linear models, adding more data, more noise, or adding trends), with essentially the same results.
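If you prefer R to Stata, a rough translation of the same experiment might look like the following; the random draws (and hence the exact numbers) will not match the Stata output above, but the qualitative result should:
set.seed(12092021)
t <- 1:102
p1 <- 1 + 3 * t + rnorm(102, 0, 5)
p2 <- 3 + 2 * t + rnorm(102, 0, 10)
cor(p1, p2)                           # very high contemporaneous correlation
# Regress tomorrow's price change on today's level of the other price
fd_p2 <- diff(p2)                     # p2[t+1] - p2[t]
summary(lm(fd_p2 ~ p1[-length(p1)]))  # R-squared near zero
fd_p1 <- diff(p1)
summary(lm(fd_p1 ~ p2[-length(p2)]))  # likewise near zero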
You might object that my toy example is flawed because the high correlation is contemporaneous, so if you knew p1 today, you could predict p2 today. I think that is wrong for the following reason. Suppose the DGP is as above, but unknown to you. You are an executive at company 1, and you learn that your CEO had been falsifying earnings and pinching bottoms. The news will become public shortly and lower p1. You can’t short your own stock without a vacation at Club Fed. Should you short the stock of company 2 if you know the correlation between p1 and p2 is ~1? I think that would be a terrible idea. This is what makes the correlation spurious and why that matters.
You could also have a causal relationship but no correlation. When a house has air-conditioning with a preset desired temperature, there will be a strong positive, non-spurious correlation between the amount of electricity used by the AC and the temperature outside. But there will be no correlation between the amount of electricity consumed and the inside temperature. The outside temperature and the inside temperature will also be uncorrelated. The last two are spurious non-correlations in my mind. But all three correlations are valid (though "valid" has no formal definition in statistics), since a correlation is just a transformation of the data.
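To make that concrete, here is a toy simulation of the air-conditioning example; the functional forms are invented purely for illustration and only reproduce the correlational pattern, not the thermostat's control mechanism itself:
set.seed(1)
outside <- runif(365, 20, 40)    # daily outside temperature
inside <- rnorm(365, 22, 0.2)    # thermostat holds the inside near 22 degrees
# Electricity use rises with the outside-inside gap, plus noise
kwh <- 2 * (outside - inside) + rnorm(365, 0, 3)
cor(kwh, outside)     # strongly positive
cor(kwh, inside)      # essentially zero
cor(outside, inside)  # essentially zero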
This is all to say that a strong correlation is not necessary for a causal dependence to exist. And it is certainly not sufficient. Even the sign of the causal relationship can differ from the sign of the correlation. This matters for using correlations to do things out in the real world (i.e., interventions). Nor is this just an issue with time series data; it can happen with cross-sectional observational data as well.
Best Answer
No amount of imputation, time series analysis, GARCH models, interpolation, extrapolation, or other fancy algorithms will do anything to create information where it does not exist (although they can create that illusion ;-). The history of Y's price before X went public is useless for assessing their subsequent correlation.
Sometimes (often preparatory to an IPO) analysts use internal accounting information (or records of private stock transactions) to retrospectively reconstruct hypothetical prices for X's stock before it went public. Conceivably such information could be used to enhance estimates of correlation, but given the extremely tentative nature of such backcasts, I doubt the effort would be of any help except initially when there are only a few days or weeks of prices for X available.