Solved – How to calculate ONE number from Pearson correlation distance of more than two variables

correlationmathematical-statisticsr

Pearson correlation distance:$$d_{cor}(x,y)=1- \frac{ \sum \limits_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}) }{ \sqrt{ \sum \limits_{i=1}^{n} (x_i-\bar{x})^2 \sum \limits_{i=1}^{n} (y_i-\bar{y})^2 } }$$

I'm using package 'factoextra' in R to calculate correlation distance measures. This is the tutorial. The dataset contains 4 continuous variables (Murder, Assault, UrbanPop, Rape), and here is the Pearson correlation distance output:

My question is, how can be the correlation distance of 4 variables ONE exact number between 0 and 2? Maybe every distance value (Texas-Iowa) are the output is the weighted value of 4 distances (Murder, Assault, UrbanPop, Rape), aren't they? I couldn't find the documentation of the R function. What would be the rational explanation to this problem?

Best Answer

This sounds right. The Pearson distance, as you have written above, is defined as $d_p = 1 - r$, where $r=\frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$ is the Pearson correlation coefficient. As the Pearson correlation coefficient falls within the range $[-1, 1]$, then the Pearson distance lies somewhere in the interval $[0, 2]$. The function get_dist in the package allows you to choose the method to obtain the distances in the distance matrix above (i.e. must be one of "euclidean", "manhattan","minkowski", "pearson" etc. You get these values as you've chosen "pearson", so it computes the distance using the formula for $d_p$.

Here we are not finding the correlation distance between the 4 variables. We find a distance between each pair of states, and explore possible clusters. Each state has data for these 4 continuous variables. To find the Pearson distance between two states, and plot as shown above: take the $x_1,x_2,x_3,x_4$ values representing the observed values for the 4 variables for state 1, and $y_1,y_2,y_3,y_4$ for state 2. Use their values in the Pearson distance formula and you get the plot above.

(If you chose Euclidean distance instead, the distance between the two states would be $\sum_1^4(x_i-y_i)^2$.)

Hope this helps!

Related Solutions

Solved – Calculate log-likelihood “by hand” for generalized nonlinear least squares regression (nlme)

Let's start with the simpler case where there is no correlation structure for the residuals:

fit <- gnls(model=model,data=data,start=start)
logLik(fit)

The log likelihood can then be easily computed by hand with:

N <- fit$dims$N
p <- fit$dims$p
sigma <- fit$sigma * sqrt((N-p)/N)
sum(dnorm(y, mean=fitted(fit), sd=sigma, log=TRUE))

Since the residuals are independent, we can just use dnorm(..., log=TRUE) to get the individual log likelihood terms (and then sum them up). Alternatively, we could use:

sum(dnorm(resid(fit), mean=0, sd=sigma, log=TRUE))

Note that fit$sigma is not the "less biased estimate of $\sigma^2$" -- so we need to make the correction manually first.

Now for the more complicated case where the residuals are correlated:

fit <- gnls(model=model,data=data,start=start,correlation=correlation)
logLik(fit)

Here, we need to use the multivariate normal distribution. I am sure there is a function for this somewhere, but let's just do this by hand:

N <- fit$dims$N
p <- fit$dims$p
yhat <- cbind(fitted(fit))
R <- vcv(tree, cor=TRUE)
sigma <- fit$sigma * sqrt((N-p)/N)
S <- diag(sigma, nrow=nrow(R)) %*% R %*% diag(sigma, nrow=nrow(R))
-1/2 * log(det(S)) - 1/2 * t(y - yhat) %*% solve(S) %*% (y - yhat) - N/2 * log(2*pi)

Solved – Instability of one-pass algorithm for correlation coefficient

You can experience problems whenever a term like $\sum y_i^2$ or $\sum x_iy_i$ is very large, and yet close to the second term, potentially leading to a large loss in digits of accuracy when almost all of the significant digits cancel. In the case of the variance, it happens when the standard deviation is small compared to the mean.

It's possible to construct one-pass forms for all the terms under the $\sqrt{}$ signs that don't suffer this sort of problem.

There's an example calculation for a variance given here. Similar calculations for covariance can be done.

Best Answer

Related Solutions

Solved – Calculate log-likelihood “by hand” for generalized nonlinear least squares regression (nlme)

Solved – Instability of one-pass algorithm for correlation coefficient

Related Question