[Math] Weighing correlation by sample size

correlation, statistical-inference

I'm a scholar in the humanities trying to not be a complete idiot about statistics. I have a problem relevant to some philological articles I'm writing. To avoid introducing the obscure technicalities of my field I'll recast this as a simple fictional "archaeology" problem.

In the Valley of Witches there are 29 tombs. Each contains an assortment of coins and gemstones. Some of the coins are gold coins and some of the gemstones are sapphires.

There is a hypothesis in the field which predicts that the proportion of gold coins to total coins should correlate positively with the proportion of sapphires to total gemstones. Let's call this Angmar's prediction.

I would like to test Angmar's prediction for the dataset below. If I run a straightforward Pearson correlation on all 29 data points I get a correlation very close to zero (0.01). This looks bad for Angmar – but is it the whole story?

Some of the data points are clearly better than others. Tomb 1 has 46 gems and 990 coins. That seems to be a much more solid data point than Tomb 29, which has only 4 gems and 80 coins. In the dataset below I've arranged the tombs in order of "size", defined as the geometric mean of total gemstones and total coins. Now, if we only look at the 13 largest tombs we get a correlation of 0.67. This looks good for Angmar after all. If we include 25 tombs, all but the 4 smallest ones, we still have a correlation of 0.37.
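For anyone who wants to reproduce these numbers, here is a minimal sketch of the cut-off experiment (Python; the array names are my own, and the values are copied from the table at the end of the question, already ordered from largest to smallest tomb). Running it should give, at least approximately, the 0.01 / 0.37 / 0.67 figures quoted above.

```python
import numpy as np
from scipy.stats import pearsonr

# Counts per tomb, ordered from largest to smallest
# (size = geometric mean of total gems and total coins).
sapphires = np.array([44, 35, 21, 23, 14, 13, 12, 8, 11, 6, 9, 4, 9, 7, 7,
                      9, 6, 7, 5, 4, 4, 6, 5, 4, 4, 1, 2, 1, 4])
gems      = np.array([46, 41, 25, 25, 18, 17, 14, 13, 12, 6, 10, 6, 10, 10, 7,
                      9, 7, 7, 5, 4, 4, 6, 6, 4, 4, 2, 2, 3, 4])
gold      = np.array([33, 3, 13, 12, 2, 6, 3, 3, 4, 17, 8, 3, 10, 12, 6,
                      8, 3, 5, 7, 10, 15, 3, 5, 4, 5, 15, 8, 2, 1])
coins     = np.array([990, 761, 558, 368, 426, 350, 418, 318, 269, 503,
                      286, 454, 255, 250, 351, 218, 251, 246, 304, 336,
                      274, 175, 174, 174, 150, 218, 201, 108, 80])

sapphire_ratio = sapphires / gems   # x
gold_ratio     = gold / coins       # y

# Plain Pearson correlation when only the k largest tombs are kept.
for k in (29, 25, 13):
    r, p = pearsonr(sapphire_ratio[:k], gold_ratio[:k])
    print(f"largest {k:2d} tombs: r = {r:+.2f} (p = {p:.2f})")
```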

[Plot: correlation as a function of the number of tombs included]

It seems reasonable to consider only large tombs or to exclude small ones, but there is no non-arbitrary way to decide where to put the cut-off. And it seems wrong to throw any data away.

My question: Is there a way to make use of all the data and calculate some sort of properly weighted correlation?

My attempted answer: There are functions for weighted correlation out there (I've used this) – but what should I weight by? If I weight by total gems I get 0.28. If I weight by total coins I get 0.16. Either seems reasonable, but ideally I would make use of both. If I weight by the product of total gems and total coins I get a correlation of 0.47. Is this a legitimate method?
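To make this concrete, the weighted Pearson correlation can be computed directly from a weighted covariance matrix; here is a sketch (the helper is just the textbook formula, not any particular package's function), shown on the five largest tombs only for brevity. Running it on the full table should give roughly the 0.28 / 0.16 / 0.47 figures above.

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Pearson correlation with observation weights w (via the weighted covariance matrix)."""
    c = np.cov(x, y, aweights=w)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

# Five largest tombs only, as an illustration (full table omitted for brevity).
sapphire_ratio = np.array([44/46, 35/41, 21/25, 23/25, 14/18])
gold_ratio     = np.array([33/990, 3/761, 13/558, 12/368, 2/426])
gems           = np.array([46, 41, 25, 25, 18])
coins          = np.array([990, 761, 558, 368, 426])

for name, w in [("total gems", gems),
                ("total coins", coins),
                ("gems * coins", gems * coins)]:
    print(f"weight = {name:12s}: r_w = {weighted_pearson(sapphire_ratio, gold_ratio, w):+.2f}")
```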

To be clear – it's not that I want to gin up as large a correlation as possible – I have publishable data any which way. I just want to get this right.

Edit 1: There is no particular reason to think that the relationship should be linear. A rank correlation solution might also make sense.

Edit 2: We've settled on a rank correlation but the weighting formula is still unclear to me. Summing the sample sizes gives an intuitively wrong result when one sample size is much larger than the other. But the geometric mean of the sample sizes also gives an intuitively wrong result for big numbers: a hundred zillion coins should not weigh a hundred times as heavily as a zillion coins. What might intuitively work in a case like that would be to use the sum of the widths of the (binomial) confidence intervals. Or maybe simply the reciprocal of the sum of the reciprocals – like with parallel resistors. But that's something I just pulled out of my behind. I don't feel on solid ground yet and additional answers would be very much appreciated.
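For concreteness, here is how the candidate weights from this edit compare on a few (gems, coins) pairs. The Wald-interval width is just one way to read "size of the confidence interval", and the assumption p = 0.5 is a worst case I picked for illustration.

```python
import numpy as np

def candidate_weights(n_gems, n_coins, p=0.5, z=1.96):
    """Candidate weights for a tomb with n_gems gems and n_coins coins."""
    total    = n_gems + n_coins                    # sum of the sample sizes
    geo_mean = np.sqrt(n_gems * n_coins)           # geometric mean
    parallel = 1 / (1 / n_gems + 1 / n_coins)      # reciprocal of the sum of reciprocals
    # Combined width of the two Wald 95% confidence intervals; wider interval -> smaller weight.
    ci_width = z * np.sqrt(p * (1 - p) / n_gems) + z * np.sqrt(p * (1 - p) / n_coins)
    return total, geo_mean, parallel, 1 / ci_width

for n_gems, n_coins in [(4, 80), (46, 990), (2, 198), (100, 100)]:
    s, g, h, w = candidate_weights(n_gems, n_coins)
    print(f"gems={n_gems:3d} coins={n_coins:3d}  sum={s:5.0f}  geo={g:6.1f}  "
          f"parallel={h:5.2f}  1/CI={w:5.2f}")
```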

The dataset is as follows. It is based on real data:

$$\begin{array}{c|c|c|c|c|c|c} \text{Tomb number} & \text{Sapphires} & \text{Total gems} & \text{Sapphire ratio} & \text{Gold coins} & \text{Total coins} & \text{Gold ratio}\\ \hline
\text{Tomb 1} & 44 & 46 & 0.96 & 33 & 990 & 0.03\\
\text{Tomb 2} & 35 & 41 & 0.85 & 3 & 761 & 0.00\\
\text{Tomb 3} & 21 & 25 & 0.84 & 13 & 558 & 0.02\\
\text{Tomb 4} & 23 & 25 & 0.92 & 12 & 368 & 0.03\\
\text{Tomb 5} & 14 & 18 & 0.78 & 2 & 426 & 0.00\\
\text{Tomb 6} & 13 & 17 & 0.76 & 6 & 350 & 0.02\\
\text{Tomb 7} & 12 & 14 & 0.86 & 3 & 418 & 0.01\\
\text{Tomb 8} & 8 & 13 & 0.62 & 3 & 318 & 0.01\\
\text{Tomb 9} & 11 & 12 & 0.92 & 4 & 269 & 0.01\\
\text{Tomb 10} & 6 & 6 & 1.00 & 17 & 503 & 0.03\\
\text{Tomb 11} & 9 & 10 & 0.90 & 8 & 286 & 0.03\\
\text{Tomb 12} & 4 & 6 & 0.67 & 3 & 454 & 0.01\\
\text{Tomb 13} & 9 & 10 & 0.90 & 10 & 255 & 0.04\\
\text{Tomb 14} & 7 & 10 & 0.70 & 12 & 250 & 0.05\\
\text{Tomb 15} & 7 & 7 & 1.00 & 6 & 351 & 0.02\\
\text{Tomb 16} & 9 & 9 & 1.00 & 8 & 218 & 0.04\\
\text{Tomb 17} & 6 & 7 & 0.86 & 3 & 251 & 0.01\\
\text{Tomb 18} & 7 & 7 & 1.00 & 5 & 246 & 0.02\\
\text{Tomb 19} & 5 & 5 & 1.00 & 7 & 304 & 0.02\\
\text{Tomb 20} & 4 & 4 & 1.00 & 10 & 336 & 0.03\\
\text{Tomb 21} & 4 & 4 & 1.00 & 15 & 274 & 0.05\\
\text{Tomb 22} & 6 & 6 & 1.00 & 3 & 175 & 0.02\\
\text{Tomb 23} & 5 & 6 & 0.83 & 5 & 174 & 0.03\\
\text{Tomb 24} & 4 & 4 & 1.00 & 4 & 174 & 0.02\\
\text{Tomb 25} & 4 & 4 & 1.00 & 5 & 150 & 0.03\\
\text{Tomb 26} & 1 & 2 & 0.50 & 15 & 218 & 0.07\\
\text{Tomb 27} & 2 & 2 & 1.00 & 8 & 201 & 0.04\\
\text{Tomb 28} & 1 & 3 & 0.33 & 2 & 108 & 0.02\\
\text{Tomb 29} & 4 & 4 & 1.00 & 1 & 80 & 0.01\end{array}$$

Best Answer

To answer this interesting question properly, three issues need to be considered. The first is whether weighting is appropriate at all. The problem of exploring the relationship between two variables while taking into account a third, weighting variable is common in statistical research. For example, we could be interested in assessing the correlation between age and the value of a certain blood parameter in a sample of subjects, where for some subjects the blood parameter value is the average of multiple measurements. In that case we could choose to give more importance to values representing averages than to those representing single measurements, on the grounds that they are less affected by within-subject variability and can be considered more "reliable". The size or number of observations is not the only possible weighting variable: we can decide to weight, say, by time of observation (e.g., giving more importance to recent observations than to old ones because they are more relevant to the present situation), by the standard deviation of values (as correctly noted in the comments) in samples with aggregated data, by the order of preferences when one of the variables is a rank, and so on.

In the context described by the OP, considering the size discrepancies, a weighted analysis is fully appropriate. The usefulness of this choice is also underlined by the type of variables involved (proportions), whose precision is well known to be highly sensitive to small sample sizes. This is a classical problem in power calculations for studies on proportions, and we can visualize it by recalling that the sample size required to estimate a proportion with a specified level of confidence and precision is given by the formula $\displaystyle N=\frac{Z_\alpha^2 p(1-p)}{e^2}$, where $Z_\alpha$ is the value from the standard normal distribution corresponding to our predefined $\alpha$ error (e.g., $Z=1.96$ if we want a 95% CI), $p$ is the expected "true" proportion in the underlying population, and $e$ is the desired level of precision. As a result of this inverse relation, small sample sizes can be associated with very high levels of imprecision. For example, suppose we draw a sample from a population where the true underlying proportion is $50\%$ and observe a proportion $p$. The precision of this observed proportion at $\alpha=0.05$ (i.e., the half-width of the range within which $p$ would fall 95% of the time over repeated samples of that size) is $\pm5\%$ for a sample of $385$ observations, but worsens to $\pm10\%$ for a sample of $97$ observations and to $\pm20\%$ (clearly unacceptable) for a sample of $25$ observations. These considerations show that caution is required when dealing with proportions based on small samples. In our case the problem is most evident for the gems, since half of the tombs contain fewer than $10$ of them. Under these conditions, weighting is clearly recommended.
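The $\pm5\%$ / $\pm10\%$ / $\pm20\%$ figures can be verified by inverting the same formula, i.e. computing $e = Z_\alpha\sqrt{p(1-p)/N}$ for the three sample sizes; a quick check:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion p estimated from n observations."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (385, 97, 25):
    print(f"N = {n:3d}: precision = +/-{margin_of_error(n):.0%}")   # 5%, 10%, 20%
```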

The second issue is the choice of weighting variable. As stated above, weighting can be performed according to different variables, the choice of which depends on several factors, including the purpose of the study, the underlying distribution, the type of data aggregation, and so on. In our case, we want a weighting variable that reflects the reliability of the observed proportions. Given the considerations above, and the marked impact of sample size on the precision of a proportion, the size of each observation (here, the number of gems and the number of coins in each tomb) is an appropriate choice. Weighting by standard deviation, which is correctly done in many cases of aggregated data, is less appropriate in this context, since here we have no aggregated data (and even if we had, we could not assume that the distribution of values within each tomb is normal). To quantify the size of each tomb, the geometric mean of the number of gems and the number of coins is preferable to the arithmetic mean. The geometric mean better reflects the fact that, to be reliable, an observation needs a precise proportion of both gems and coins, so a balance between the two sample sizes is advantageous for our analysis. To illustrate: if tomb $i$ has $2$ gems and $198$ coins, and tomb $j$ has $100$ gems and $100$ coins, the overall reliability of the observation $(x_i,y_i)$ (where $x$ and $y$ are the sapphire and gold proportions, respectively) is probably inferior to that of the observation $(x_j,y_j)$. The geometric mean captures this, giving a size of $19.9$ in the first case and of $100$ in the second, whereas the arithmetic mean misses it and gives a size of $100$ in both cases.

The third issue is identifying the most appropriate measure of correlation. Here, the most important choice is between parametric and nonparametric measures. Several assumptions must be satisfied before applying the classical Pearson correlation, the typical parametric measure: 1) the variables must be continuous; 2) the variables must be approximately normally distributed; 3) outliers (observations that lie at an abnormal distance from the other data) have to be minimized or removed; 4) the data have to be homoscedastic (i.e., the variance around the line of fit has to be approximately constant as we move along the line); 5) a linear relationship must be plausible (this is usually checked by visual inspection of scatterplots). We can use specific tests to check these assumptions, but looking at the data shown in the OP it seems highly unlikely that all of them are adequately satisfied. This suggests that a nonparametric measure of correlation is to be preferred in this case.
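For those who prefer a formal check to eyeballing, the normality assumption, for instance, can be tested with a Shapiro-Wilk test; a small sketch (shown on the five largest tombs only as a pattern; in practice one would use all 29 rows):

```python
import numpy as np
from scipy.stats import shapiro

# Normality check (assumption 2) on the sapphire ratios; pattern only -- use all 29 values in practice.
sapphire_ratio = np.array([44/46, 35/41, 21/25, 23/25, 14/18])
stat, p = shapiro(sapphire_ratio)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.3f}  (small p suggests non-normality)")
# Linearity and homoscedasticity (assumptions 4-5) are usually judged from a scatterplot
# of sapphire_ratio against gold_ratio rather than from a formal test.
```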

The most commonly used nonparametric correlation coefficients are Spearman's R, Kendall's Tau, and Goodman-Kruskal's Gamma. All of these avoid the assumption problems of parametric tests, since they only require that the observations can be ranked along two ordered series. Spearman's R can be interpreted as the Pearson correlation coefficient computed from ranks, so it conveys a similar message in terms of variability accounted for. Kendall's Tau is equivalent to Spearman's R in terms of statistical power, but its value has a different interpretation, since it represents a probability: it is the difference between the probability that, for any pair of observations ($x_i, y_i$ and $x_j, y_j$), the ranks of the two variables are in the same order (i.e., $x_i>x_j$ and $y_i>y_j$, or $x_i<x_j$ and $y_i<y_j$) and the probability that they are in the opposite order. Goodman-Kruskal's Gamma is essentially Kendall's Tau with a different treatment of ties (pairs with identical values are excluded from its computation), which makes it preferable when the data contain many tied values.
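For reference, the unweighted versions of the first two are available directly in scipy (Goodman-Kruskal's Gamma is not, but it can be built from the counts of concordant and discordant pairs); a sketch on a subset of the tomb proportions:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Five largest tombs only, as a pattern; in practice use all 29 rows of the table.
sapphire_ratio = np.array([44/46, 35/41, 21/25, 23/25, 14/18])
gold_ratio     = np.array([33/990, 3/761, 13/558, 12/368, 2/426])

rho, p_rho = spearmanr(sapphire_ratio, gold_ratio)
tau, p_tau = kendalltau(sapphire_ratio, gold_ratio)
print(f"Spearman R = {rho:+.2f} (p = {p_rho:.2f}),  Kendall Tau = {tau:+.2f} (p = {p_tau:.2f})")
```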

In summary, a sensible choice for this analysis could be a nonparametric measure (e.g., Spearman's R) weighted by size, where size is calculated as the geometric mean of the number of gems and the number of coins. I have not tested whether this analysis, applied to the tomb data, yields a significant correlation, but it surely represents a very "robust" approach.
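One way to implement "Spearman's R weighted by size" is to rank the two proportions and feed the ranks, together with the geometric-mean weights, into a weighted Pearson formula. This is one reasonable construction rather than a standard named estimator; a sketch (again shown on the five largest tombs as a pattern):

```python
import numpy as np
from scipy.stats import rankdata

def weighted_pearson(x, y, w):
    """Pearson correlation with observation weights w (via the weighted covariance matrix)."""
    c = np.cov(x, y, aweights=w)
    return c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

def weighted_spearman(x, y, w):
    """Weighted Pearson correlation of the (average-tie) ranks of x and y."""
    return weighted_pearson(rankdata(x), rankdata(y), w)

# Five largest tombs only, as a pattern; in practice use all 29 rows of the table.
sapphire_ratio = np.array([44/46, 35/41, 21/25, 23/25, 14/18])
gold_ratio     = np.array([33/990, 3/761, 13/558, 12/368, 2/426])
gems           = np.array([46, 41, 25, 25, 18])
coins          = np.array([990, 761, 558, 368, 426])

size = np.sqrt(gems * coins)   # geometric mean of the two sample sizes
print(f"size-weighted Spearman = {weighted_spearman(sapphire_ratio, gold_ratio, size):+.2f}")
```

Note that the ranking step itself ignores the weights; only the correlation of the ranks is weighted, which is the simplest reading of a "weighted rank correlation".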
