Weighted Data – Is There Such a Thing as a Weighted Correlation?

correlation, pooling, prediction, weighted-data

I have some interesting data on the most popular musical artists streamed, divided by location into about 200 congressional districts. I want to see if it's possible to poll a person on his or her musical preferences and determine whether he or she "listens like a Democrat" or "listens like a Republican." (Naturally this is lighthearted, but there's real entropy in the data!)

I have data on about 100 artists, plus the average percentage votes for Republicans and Democrats in each district over the past three election cycles. So I ran a correlation on each artist to see which ones were most disproportionately listened-to as a function of vote share for Democrats. Those correlations run from about -0.3 to 0.3 for any given artist, with plenty in the middle that have little or no predictive power.

I have two questions: First, the overall number of streams per district varies widely. Right now, I'm correlating percentage of all streams per district belonging to, say, Beyonce, against the percentage of votes cast for Democrats. But total streams in one district might be in the millions, while another is in the low 100,000s. Do I need to weight the correlation somehow to account for this?

Second, I'm curious how to combine these correlations into a composite guess as to the user's politics. Let's say I take the 20 artists with the highest absolute correlations (positive and negative), ten in each direction, and poll a user on how much he or she likes each artist. So I have up or down votes on each artist, plus the correlation with politics for all 20 artists. Is there a standard way to combine these correlations into a single estimate? (I'm thinking of something like the NYTimes' famous dialect quiz, where it combined the regional probabilities for 25 questions into a heat map. But in this case, I just need a single value for how Democratic or Republican one's taste in music is.)

Thank you!

Best Answer

The formula for the weighted Pearson correlation can easily be found on the web, on StackOverflow, and on Wikipedia, and it is implemented in several R packages (e.g. psych or weights) and in Python's statsmodels package. It is calculated like the regular correlation, but using weighted means,

$$ m_X = \frac{\sum_i w_i x_i}{\sum_i w_i}, ~~~~ m_Y = \frac{\sum_i w_i y_i}{\sum_i w_i} $$

weighted variances,

$$ s_X = \frac{\sum_i w_i (x_i - m_X)^2}{ \sum_i w_i}, ~~~~ s_Y = \frac{\sum_i w_i (y_i - m_Y)^2}{ \sum_i w_i} $$

and weighted covariance

$$ s_{XY} = \frac{\sum_i w_i (x_i - m_X)(y_i - m_Y)}{ \sum_i w_i} $$

having all this you can easily compute the weighted correlation

$$ \rho_{XY} = \frac{s_{XY}}{\sqrt{s_X s_Y}} $$
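The formulas above can be written out directly. The following is only a minimal sketch in plain Python; the function name `weighted_corr` and the district figures are made up for illustration, and total streams per district are assumed to be the weights.

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    # Weighted means m_X and m_Y
    m_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    m_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted variances s_X, s_Y and weighted covariance s_XY
    s_x = sum(wi * (xi - m_x) ** 2 for wi, xi in zip(w, x)) / total
    s_y = sum(wi * (yi - m_y) ** 2 for wi, yi in zip(w, y)) / total
    s_xy = sum(wi * (xi - m_x) * (yi - m_y) for wi, xi, yi in zip(w, x, y)) / total
    return s_xy / math.sqrt(s_x * s_y)

# Hypothetical data: one artist's share of streams, the Democratic vote
# share, and total streams per district (used as weights).
artist_share = [0.012, 0.020, 0.015, 0.030, 0.008]
dem_share = [0.45, 0.60, 0.50, 0.70, 0.40]
total_streams = [150_000, 2_000_000, 400_000, 3_500_000, 120_000]

r_weighted = weighted_corr(artist_share, dem_share, total_streams)
r_unweighted = weighted_corr(artist_share, dem_share, [1] * len(dem_share))
print(round(r_weighted, 3), round(r_unweighted, 3))
```

With all weights equal, the function reduces to the ordinary Pearson correlation, and multiplying every weight by the same constant leaves the result unchanged.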

As for your second question, as I understand it, you would have the correlations between political orientation and preference for each of the twenty artists, plus a user's binary answers about his or her preferences, and you want some kind of aggregate measure of them.

Let's start with averaging correlations. There are multiple methods for averaging probabilities, but there don't seem to be so many approaches to averaging correlations. One thing that could be done is to use Fisher's $z$-transformation as described on MathOverflow, i.e.

$$ \bar\rho = \tanh \left(\frac{\sum_{j=1}^K \tanh^{-1}(\rho_j)}{K} \right) $$

The transformation reduces the skewness of the sampling distribution of the correlation coefficient and makes it closer to normal. This procedure was also described by Bushman and Wang (1995) and Corey, Dunlap, and Burke (1998).
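As a rough illustration of this averaging, here is a short Python sketch using only the standard library. The name `fisher_mean` and the example correlations are hypothetical.

```python
import math

def fisher_mean(correlations):
    """Average correlations on Fisher's z scale, then map back with tanh."""
    # atanh is only defined for values strictly between -1 and 1
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))

rs = [0.30, 0.10, -0.20, 0.25]  # hypothetical per-artist correlations
print(fisher_mean(rs))
```

For correlations as small as the ones in the question (roughly -0.3 to 0.3), the result will be close to the plain arithmetic mean; the two diverge more noticeably when some correlations are large.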

Next, you have to notice that if $r = \mathrm{cor}(X,Y)$, then $-r = \mathrm{cor}(-X,Y) = \mathrm{cor}(X,-Y)$, so positive correlation of musical preference with some political orientation is the same as negative correlation of musical dislike to such political orientation, and the other way around.

Now, let's define $r_j$ as the correlation between preference for the $j$-th artist and some political orientation, and $x_{ij}$ as the $i$-th user's preference for the $j$-th artist, where $x_{ij} = 1$ for preference and $x_{ij} = -1$ for dislike. You can define your final estimate as

$$ \bar r_i = \tanh \left(\frac{\sum_{j=1}^K \tanh^{-1}(r_j x_{ij})}{K} \right) $$

i.e. compute an average correlation in which the sign of each $r_j$ is kept for preferred artists and flipped for disliked ones. This procedure gives you the average "correlation" between the user's preferences and political orientation which, like a regular correlation, ranges from $-1$ to $1$.
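A minimal Python sketch of this estimate, assuming votes are coded as $+1$ (like) and $-1$ (dislike); the name `taste_score` and the numbers are invented for illustration.

```python
import math

def taste_score(correlations, votes):
    """Average r_j * x_ij on Fisher's z scale; votes are +1 (like) or -1 (dislike)."""
    if len(correlations) != len(votes):
        raise ValueError("need one vote per correlation")
    # Multiplying by the vote flips the sign of r_j for disliked artists
    z_values = [math.atanh(r * x) for r, x in zip(correlations, votes)]
    return math.tanh(sum(z_values) / len(z_values))

# Hypothetical correlations of four artists with Democratic vote share,
# and one user's up/down votes on those artists.
r = [0.30, 0.25, -0.28, -0.22]
user_votes = [1, -1, -1, 1]

score = taste_score(r, user_votes)
print(score)  # above 0 leans Democratic, below 0 leans Republican
```

A user who gives exactly the opposite votes gets the same score with the opposite sign.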

But...

Don't you think that all of this is overkill for something that is basically a multiple regression problem? Instead of all the weighting and averaging, you could simply use weighted multiple regression (logistic or linear, depending on whether you predict a binary preference or a degree of preference in either direction), where the weights are based on the sizes of the subsamples. You would use musical preference for each artist as a predictor. In the end, you'll use the user's preferences to make predictions. This approach is simpler and more statistically elegant. It also applies relative weights to the artists, while averaging the correlations doesn't correct for their relative "impact" on the final score. Moreover, regression takes into consideration the base rate (or default political orientation), while averaging correlations does not. Imagine that the vast majority of the population prefers party $A$: this should make you less eager to predict $B$, and regression accounts for that by including the intercept. The only problem is multicollinearity, but when averaging correlations you ignore it rather than dealing with it.
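To make the regression idea concrete, here is a small sketch of weighted least squares (the linear case only) in plain Python, solving the weighted normal equations by Gauss-Jordan elimination. The function names, the two-artist setup, and the district figures are all assumptions for illustration; in practice you would more likely use an existing regression routine from R or statsmodels.

```python
def weighted_least_squares(rows, targets, weights):
    """Fit y ~ b0 + b1*x1 + ... by minimizing sum_i w_i * (y_i - fit_i)^2."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    p = len(design[0])
    # Build the normal equations (X'WX) b = X'Wy as an augmented matrix
    aug = [[0.0] * (p + 1) for _ in range(p)]
    for xi, yi, wi in zip(design, targets, weights):
        for a in range(p):
            for b in range(p):
                aug[a][b] += wi * xi[a] * xi[b]
            aug[a][p] += wi * xi[a] * yi
    # Gauss-Jordan elimination with partial pivoting
    # (fails with a division by zero if predictors are perfectly collinear)
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                for c in range(col, p + 1):
                    aug[k][c] -= factor * aug[col][c]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def predict(coefs, row):
    """Intercept plus the dot product of the slopes with one row of predictors."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))

# Hypothetical districts: percent of streams for two artists, percent of
# votes for Democrats, and total streams used as regression weights.
artist_pct = [(1.0, 3.0), (2.0, 2.5), (1.5, 1.0), (3.0, 2.0), (0.8, 3.5), (2.5, 1.2)]
dem_pct = [44.0, 55.0, 52.0, 63.0, 41.0, 60.0]
streams = [150_000, 2_000_000, 400_000, 3_500_000, 120_000, 900_000]

coefs = weighted_least_squares(artist_pct, dem_pct, streams)
print(coefs)                       # [intercept, slope_artist_1, slope_artist_2]
print(predict(coefs, (2.0, 2.0)))  # predicted Democratic vote share, in percent
```

The intercept plays the role of the base rate mentioned above, and each slope is the artist's contribution after accounting for the other artist.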

Bushman, B.J., & Wang, M.C. (1995). A procedure for combining sample correlation coefficients and vote counts to obtain an estimate and a confidence interval for the population correlation coefficient. Psychological Bulletin, 117(3), 530.

Corey, D.M., Dunlap, W.P., & Burke, M.J. (1998). Averaging correlations: Expected values and bias in combined Pearson rs and Fisher's z transformations. The Journal of General Psychology, 125(3), 245-261.

Related Solutions

Solved – On the use of weighted correlations in aggregated survey data

I imagine this is history by now, but just in case...

1) Yes, this seems appropriate. Your research question must be "are teacher attitudes/behaviours at a school related to student attitudes/behaviours at that school?" If this is your question, a school is the appropriate unit of analysis (and there would be no way to match up individual teachers to students anyway).

I would just add caveats on the use of Pearson's correlation coefficient, unrelated to the question of the unit of analysis or sampling strategy. The correlation coefficient cannot pick up non-linear relationships, can be misleading to interpret, is easily distorted by a few outliers, and classical inference based on it depends on Normality (which won't hold exactly with your proportion data, although it may be a reasonable approximation). At a minimum I would carefully use graphical methods to check that this is a sensible approach and there is not a better way of inferring the relationship between the two variables.

2) I don't think you need to weight the data, but I would certainly try it (and hope it doesn't change the results). If you do, I would weight by your sample size in each school, not by the enrollment size. The reason is about estimation rather than either your unit of analysis or any need to "weight to population". You only have an estimate of the true teacher and student responses in each school, drawn from your finite sample. For schools where you had a larger sample, you are more confident in your estimate, so it would be good if they were taken more seriously in fitting your correlation or linear regression.

Solved – Does normalisation affect the values of Mean Squared Error, Mean Absolute Percentage Error etc.

I went and simulated some data that qualitatively looked more or less like your point cloud.

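library(mvtnorm)  # provides rmvnorm()
set.seed(1)
# Simulate 1000 strongly correlated pairs centred at (0.77, 0.77)
foo <- rmvnorm(1000, c(0.77, 0.77), cbind(c(.001, .0009), c(.0009, .001)))
plot(foo, pch = 19, cex = 0.6, xlab = "", ylab = "")
# MSE between the two columns
mean((foo[, 1] - foo[, 2])^2)
# [1] 0.000215418
# MAPE in percent, relative to the first column
100 * mean(abs(foo[, 1] - foo[, 2]) / foo[, 1])
# [1] 1.512891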

Your MSEs seem to make sense. My simulation gets a somewhat larger one, but that may simply be because you may have more dots in the center of your cloud.

I can't really answer your question about normalization, because your target variable is not normalized in any meaningful sense. All the values are between 0.70 and 0.84. If this were normalized, then the range between -1 and 1 would be completely used. (And then MAPEs would not make sense.)
As above, I get a MAPE of about 1.5%, which is not far away from your 0.5%, and the difference may again be because you may have more points in your data cloud.

Why am I getting such "good" results when in reality my figure looks like this: ... In my opinion, the points should be more scattered "along" the line in an elongated fashion, not in a bulky way as shown in this figure.

The relationship between forecast error measures and scatterplots between forecasts and actuals is not straightforward. MSEs of course depend on scaling - multiply both actuals and forecasts by 10, and your cloud will look exactly the same, except for the axes, but the MSE will be 100 times as large. Add 10 to both forecasts and actuals, and the cloud will again look exactly the same, except for the axes, but this time, the MAPE will be smaller by a factor of about 10.
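The scaling and shifting argument is easy to check numerically. Here is a small plain-Python sketch with made-up actuals and forecasts in the same 0.70 to 0.84 range; the helper names `mse` and `mape` are just for illustration.

```python
def mse(actuals, forecasts):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)

def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)) / len(actuals)

actuals = [0.72, 0.75, 0.78, 0.80, 0.83]    # hypothetical values
forecasts = [0.73, 0.74, 0.80, 0.79, 0.82]  # hypothetical values

# Multiply everything by 10: MSE grows 100-fold, MAPE is unchanged
scaled_a = [10 * a for a in actuals]
scaled_f = [10 * f for f in forecasts]
print(mse(actuals, forecasts), mse(scaled_a, scaled_f))
print(mape(actuals, forecasts), mape(scaled_a, scaled_f))

# Add 10 to everything: MSE is unchanged, MAPE shrinks by roughly an order of magnitude
shifted_a = [a + 10 for a in actuals]
shifted_f = [f + 10 for f in forecasts]
print(mse(actuals, forecasts), mse(shifted_a, shifted_f))
print(mape(actuals, forecasts), mape(shifted_a, shifted_f))
```

In all three versions the scatterplot of forecasts against actuals would look the same apart from the axis labels.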

Don't try to relate error measures to scatterplots. It won't work.
We don't know why your forecasts are not better. (We don't even know whether your plot is for a holdout sample or in-sample.) You may be overfitting, or not capturing enough information, or there may simply be residual noise that you cannot capture. If there is information in there that you have not yet leveraged, then plotting residuals against each predictor may suggest possible remedies, like transformations of predictors. Otherwise, I'm afraid that as long as you can't investigate your data more deeply, there is little you can do. See also the related thread "How to know that your machine learning problem is hopeless?"
