Weighted Data – Is There Such a Thing as a Weighted Correlation?

correlation, pooling, prediction, weighted-data

I have some interesting data on the most popular streamed musical artists, broken down by location into about 200 congressional districts. I want to see if it's possible to poll a person on his or her musical preferences and determine whether he or she "listens like a Democrat" or "listens like a Republican." (Naturally this is lighthearted, but there's real entropy in the data!)

I have data on about 100 artists, plus the average percentage votes for Republicans and Democrats in each district over the past three election cycles. So I ran a correlation on each artist to see which ones were most disproportionately listened-to as a function of vote share for Democrats. Those correlations run from about -0.3 to 0.3 for any given artist, with plenty in the middle that have little or no predictive power.

I have two questions: First, the overall number of streams per district varies widely. Right now, I'm correlating percentage of all streams per district belonging to, say, Beyonce, against the percentage of votes cast for Democrats. But total streams in one district might be in the millions, while another is in the low 100,000s. Do I need to weight the correlation somehow to account for this?

Second, I'm curious how to combine these correlations into a composite guess as to the user's politics. Let's say I take the 20 artists with the highest absolute correlations (positive and negative), ten in each direction, and poll a user on how much he or she likes each artist. So I have up or down votes on each artist, plus the correlation with politics for all 20 artists. Is there a standard way to combine these correlations into a single estimate? (I'm thinking of something like the NYTimes' famous dialect quiz, which combined the regional probabilities for 25 questions into a heat map. But in this case, I just need a single value for how Democratic or Republican one's taste in music is.)

Thank you!

Best Answer

The formula for the weighted Pearson correlation is easy to find on the web, StackOverflow, and Wikipedia, and it is implemented in several R packages (e.g. psych or weights) and in Python's statsmodels package. It is calculated like the regular correlation, but using weighted means,

$$ m_X = \frac{\sum_i w_i x_i}{\sum_i w_i}, ~~~~ m_Y = \frac{\sum_i w_i y_i}{\sum_i w_i} $$

weighted variances,

$$ s_X = \frac{\sum_i w_i (x_i - m_X)^2}{ \sum_i w_i}, ~~~~ s_Y = \frac{\sum_i w_i (y_i - m_Y)^2}{ \sum_i w_i} $$

and weighted covariance

$$ s_{XY} = \frac{\sum_i w_i (x_i - m_X)(y_i - m_Y)}{ \sum_i w_i} $$

With these in hand, you can compute the weighted correlation

$$ \rho_{XY} = \frac{s_{XY}}{\sqrt{s_X s_Y}} $$
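As an illustration, the formulas above translate almost line by line into code. The following is a minimal sketch in plain Python; the function name `weighted_corr` and the four-district data are made up for the example, and the weights are assumed to be the total streams per district.

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    # Weighted means m_X and m_Y
    m_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    m_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted variances s_X and s_Y
    s_x = sum(wi * (xi - m_x) ** 2 for wi, xi in zip(w, x)) / total
    s_y = sum(wi * (yi - m_y) ** 2 for wi, yi in zip(w, y)) / total
    # Weighted covariance s_XY
    s_xy = sum(wi * (xi - m_x) * (yi - m_y)
               for wi, xi, yi in zip(w, x, y)) / total
    return s_xy / math.sqrt(s_x * s_y)

# Hypothetical data: one artist's share of streams (%), Democratic vote
# share (%), and total streams for four districts.
artist_share = [0.8, 1.1, 1.5, 2.0]
dem_share = [40.0, 45.0, 55.0, 60.0]
total_streams = [2_000_000, 150_000, 900_000, 300_000]

print(weighted_corr(artist_share, dem_share, total_streams))
```

Because the weights appear in both the numerator and the denominator of every term, only their relative sizes matter; with all weights equal, the result reduces to the ordinary Pearson correlation.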

As for your second question, as I understand it, you would have the correlations between political orientation and preference for each of the twenty artists, plus a user's binary answers about his or her preference for each artist, and you want some kind of aggregate measure that combines them.

Let's start with averaging correlations. There are multiple methods for averaging probabilities, but there don't seem to be so many approaches to averaging correlations. One thing that could be done is to use Fisher's $z$-transformation as described on MathOverflow, i.e.

$$ \bar\rho = \tanh \left(\frac{\sum_{j=1}^K \tanh^{-1}(\rho_j)}{K} \right) $$

The transformation reduces the skewness of the sampling distribution of the correlation coefficient and makes it closer to normal. This procedure was also described by Bushman and Wang (1995) and Corey, Dunlap, and Burke (1998).

Next, notice that if $r = \mathrm{cor}(X,Y)$, then $-r = \mathrm{cor}(-X,Y) = \mathrm{cor}(X,-Y)$, so a positive correlation of musical preference with some political orientation is the same as a negative correlation of musical dislike with that orientation, and the other way around.

Now, let's define $r_j$ as the correlation of preference for the $j$-th artist with some political orientation, and $x_{ij}$ as the $i$-th user's preference for the $j$-th artist, where $x_{ij} = 1$ for preference and $x_{ij} = -1$ for dislike. You can define your final estimate as

$$ \bar r_i = \tanh \left(\frac{\sum_{j=1}^K \tanh^{-1}(r_j x_{ij})}{K} \right) $$

i.e. compute the average correlation after flipping the sign of each correlation according to whether the user likes or dislikes the artist. This procedure gives you the average "correlation" between the user's preferences and political orientation, which, like a regular correlation, ranges from $-1$ to $1$.
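The estimate $\bar r_i$ can be sketched in a few lines of plain Python. The function name `composite_score` and the example correlations and votes below are hypothetical; the correlations are assumed to lie strictly between $-1$ and $1$, since $\tanh^{-1}$ is undefined at the endpoints.

```python
import math

def composite_score(correlations, votes):
    """Average sign-adjusted correlations on Fisher's z scale.

    correlations: r_j for each artist (strictly between -1 and 1)
    votes: x_ij for one user, +1 for "like" and -1 for "dislike"
    """
    z_values = [math.atanh(r * v) for r, v in zip(correlations, votes)]
    return math.tanh(sum(z_values) / len(z_values))

# Hypothetical correlations of four artists with Democratic vote share,
# and one user's up/down votes on those artists.
r = [0.30, 0.25, -0.28, -0.22]
user_votes = [1, 1, -1, 1]

print(composite_score(r, user_votes))
```

A user who likes every "Democratic" artist and dislikes every "Republican" one gets the largest positive score, and reversing all the votes gives the same score with the opposite sign.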

But...

Don't you think that all of this is overkill for something that is basically a multiple regression problem? Instead of all the weighting and averaging, you could simply use weighted multiple regression (linear if you predict the degree of preference in either direction, logistic if you predict a binary preference), with weights based on the sizes of the subsamples. You would use the musical preference for each artist as a predictor and, in the end, plug in the user's preferences to make a prediction. This approach is simpler and more statistically elegant. It also estimates relative weights for the artists, whereas averaging the correlations does not account for their relative "impact" on the final score. Moreover, regression takes the base rate (the default political orientation) into account, while averaging correlations does not: if the vast majority of the population prefers party $A$, you should be less eager to predict $B$, and regression accounts for that through the intercept. The only problem is multicollinearity, but averaging correlations ignores it rather than dealing with it.
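To make the regression idea concrete, here is a minimal sketch of weighted linear least squares in plain Python, solving the normal equations $(X^\top W X)\beta = X^\top W y$ directly. Everything in it is an assumption for illustration: the function names, the two-artist data, and the choice of total streams as weights. In practice you would use a statistics package, and how to map a user's up/down votes onto the scale of the predictors is left open here.

```python
def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        rest = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - rest) / m[i][i]
    return beta

def weighted_least_squares(rows, y, w):
    """Return [intercept, slope_1, ...] minimising sum_i w_i * residual_i^2."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    p = len(design[0])
    xtwx = [[sum(wi * d[j] * d[k] for wi, d in zip(w, design))
             for k in range(p)] for j in range(p)]
    xtwy = [sum(wi * d[j] * yi for wi, d, yi in zip(w, design, y))
            for j in range(p)]
    return solve_linear_system(xtwx, xtwy)

def predict(beta, predictors):
    """Predicted response for one set of predictor values."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], predictors))

# Hypothetical data: stream shares (%) of two artists in five districts,
# Democratic vote share (%), and total streams used as weights.
shares = [(0.8, 1.9), (1.1, 1.6), (1.5, 1.2), (2.0, 0.9), (1.3, 1.4)]
dem_share = [40.0, 45.0, 55.0, 60.0, 50.0]
total_streams = [2_000_000, 150_000, 900_000, 300_000, 600_000]

beta = weighted_least_squares(shares, dem_share, total_streams)
print(beta)
print(predict(beta, (1.4, 1.3)))
```

The first coefficient is the intercept, which plays the role of the base rate mentioned above, and each remaining coefficient is the fitted weight for one artist.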


Bushman, B.J., & Wang, M.C. (1995). A procedure for combining sample correlation coefficients and vote counts to obtain an estimate and a confidence interval for the population correlation coefficient. Psychological Bulletin, 117(3), 530.

Corey, D.M., Dunlap, W.P., & Burke, M.J. (1998). Averaging correlations: Expected values and bias in combined Pearson rs and Fisher's z transformations. The Journal of General Psychology, 125(3), 245-261.
