Solved – How to calculate ONE number from Pearson correlation distance of more than two variables

correlation, mathematical-statistics, r

Pearson correlation distance:$$d_{cor}(x,y)=1- \frac{ \sum \limits_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}) }{ \sqrt{ \sum \limits_{i=1}^{n} (x_i-\bar{x})^2 \sum \limits_{i=1}^{n} (y_i-\bar{y})^2 } }$$

I'm using package 'factoextra' in R to calculate correlation distance measures. This is the tutorial. The dataset contains 4 continuous variables (Murder, Assault, UrbanPop, Rape), and here is the Pearson correlation distance output:
[Image: Pearson correlation distance output]

My question is: how can the correlation distance of 4 variables be ONE exact number between 0 and 2? Is every distance value (e.g. Texas-Iowa) a weighted combination of 4 distances (Murder, Assault, UrbanPop, Rape)? I couldn't find the documentation for the R function. What would be the rational explanation for this?

Best Answer

This sounds right. The Pearson distance, as you have written above, is defined as $d_p = 1 - r$, where $r=\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$ is the Pearson correlation coefficient. Since the Pearson correlation coefficient falls within the range $[-1, 1]$, the Pearson distance lies in the interval $[0, 2]$. The function get_dist in the package lets you choose the method used to compute the distances in the distance matrix above (i.e. it must be one of "euclidean", "manhattan", "minkowski", "pearson", etc.). You get these values because you chose "pearson", so it computes the distance using the formula for $d_p$.

Here we are not finding the correlation distance between the 4 variables. We find a distance between each pair of states and use those distances to explore possible clusters. Each state has data for these 4 continuous variables. To find the Pearson distance between two states, as plotted above: take $x_1,x_2,x_3,x_4$ to be the observed values of the 4 variables for state 1, and $y_1,y_2,y_3,y_4$ those for state 2. Substitute these into the Pearson distance formula with $n=4$, where $\bar{x}$ and $\bar{y}$ are each state's mean across its own 4 values, and you get the single number shown for that pair in the plot above.
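To make the state-by-state calculation concrete, here is a minimal sketch in base R. It assumes the data are the built-in USArrests dataset (which has the four variables named in the question) and that the columns are standardised first with scale(); both are assumptions on my part. It does not call factoextra at all, so treat the last line as an illustration of the idea and not as a statement of how get_dist is implemented.

```r
# USArrests ships with base R: 50 states x 4 variables.
# Standardising the columns first is an assumption made for this sketch.
df <- scale(USArrests)

x <- df["Texas", ]  # 4 values: Murder, Assault, UrbanPop, Rape
y <- df["Iowa", ]   # the same 4 variables for the second state

# Pearson distance written out from the formula, with n = 4
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
d_manual <- 1 - num / den

# The same quantity using the built-in correlation function
d_cor <- 1 - cor(x, y)

# All pairs at once: cor() correlates columns, so transpose to put states in columns
d_all <- as.dist(1 - cor(t(df)))
```

Both d_manual and d_cor are a single number for the Texas-Iowa pair, and the same value sits in the Texas/Iowa cell of as.matrix(d_all).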

(If you chose Euclidean distance instead, the distance between the two states would be $\sqrt{\sum_{i=1}^{4}(x_i-y_i)^2}$.)
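For comparison, a similar sketch for the Euclidean case, taken here as the square root of the summed squared differences, which is what base R's dist() computes. As before, using the built-in USArrests data and standardising with scale() are assumptions made for illustration.

```r
# Same assumed setup: built-in USArrests, columns standardised
df <- scale(USArrests)

x <- df["Texas", ]
y <- df["Iowa", ]

# Euclidean distance between the two states over the 4 variables
d_manual <- sqrt(sum((x - y)^2))

# The same pair via base R's dist()
d_builtin <- dist(df[c("Texas", "Iowa"), ], method = "euclidean")
```

Again the result is one number per pair of states, just measured differently from the Pearson distance.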

Hope this helps!