Solved – What correlation coefficient and graph are appropriate for this data

Tags: data visualization, ranks, spearman-rho

I'm not in the statistics field. I conducted a case study and collected the data shown in the table below:


I would like to find the correlation coefficient for the data in these two tables (between NOA and HVOC, and between NOA and HVOL). I conducted the case study on source code.

I measured software metrics named "NOA" and "HVOL" for every method/function before I modified the source code. Then, after modifying the code, I measured the same metrics again for every method.

The NOA Diff field in the table is NOA (after modifying the code) minus NOA (before modifying the code), that is, "NOA Diff = NOA(after) - NOA(before)". The HVOC metric was treated the same way: HVOC Diff = HVOC(after) - HVOC(before).
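
The difference columns described above amount to a per-method subtraction. A minimal sketch in Python, with made-up method names and metric values (none of these numbers come from the table, and `metric_diff` is a name invented for the example):

```python
# Hypothetical NOA values per method, before and after the modification.
noa_before = {"parse": 3, "render": 5, "save": 2}
noa_after = {"parse": 4, "render": 5, "save": 1}


def metric_diff(before, after):
    """Return after - before for every method measured in both versions."""
    return {name: after[name] - before[name] for name in before if name in after}


noa_diff = metric_diff(noa_before, noa_after)
print(noa_diff)
```

The same function would apply unchanged to the HVOC measurements.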

My questions are:

What type of correlation coefficient should I use?
What kind of graph should I create to illustrate my data?
The table above contains all the data; I mean it is a population, not a sample. Can I use a method that is meant for a sample?
Is Spearman meant for non-normally distributed data?

Best Answer

To echo everyone else: MORE DETAILS ABOUT YOUR DATA. Please give a qualitative description of what your independent and dependent variables are.

EDIT: Yes this is confusing; hopefully it's cleared up now.

In general, you probably want to avoid using sample statistics to estimate population parameters if you have the population data. Sample statistics are estimates of population parameters, so the methods used to compute them always have less power than their population-parameter counterparts. Of course, most of the time you have to use sample statistics because you don't have complete population data.

In your case, however you slice it, inferring anything about a population from a case study is dubious because case studies are, by definition, case by case. You could make an inference about the case on which you collected data, but how useful is that? Maybe in your case it is.

Either way, forget about whether or not you can/should use a sample method when you have the population data. You don't have population data if it's a case study. Also, sample vs. population has to do with making inferences. You do not need to worry about sample vs. population methods if all you want is a correlation coefficient, because it is a purely descriptive statistic.
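
To illustrate the last point: a correlation coefficient can be computed directly from whatever data are at hand, with no inferential step. A minimal sketch in Python, using hypothetical difference scores rather than the poster's table; the function name `pearson_r` is invented for the example. The divisor (n or n - 1) cancels between numerator and denominator, so the value does not depend on whether the data are treated as a sample or a population.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (description only, no test)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical difference scores, NOT the poster's data.
noa_diff = [1, 0, -1, 2, 0]
hvoc_diff = [12.0, 1.5, -8.0, 20.0, -0.5]

r = pearson_r(noa_diff, hvoc_diff)
print(round(r, 3))
```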

Your fourth bullet point is completely unintelligible. Please clear that up if you would like people to help you with it.

@mpiktas A Spearman rank correlation is NOT the proper correlation coefficient to use here. To use that test, all data must be ranked and discrete (unless two or more values compete for a rank), i.e., they must be ordinal data. Maybe the HVOC table could be analyzed via Spearman's $\rho$; however, the poster must provide more information before that conclusion can be drawn.
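
The parenthetical about values competing for a rank refers to ties. As an illustration only, the sketch below assumes the common convention of giving tied values the average of the positions they occupy, and then computes Spearman's $\rho$ as the Pearson correlation of the two rank vectors. The function names and the input values are made up for the example.

```python
import math


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


tied = average_ranks([0, 2, 0, 5])          # the two zeros share rank 1.5
rho = spearman_rho([1, 2, 3, 4], [3, 4, 2, 1])
print(tied, rho)
```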

@whuber Yes, all data are discrete when represented on a computer; however, in this case it seems that BB01 was referring to the scale of measurement, not the electronic representation of numbers.

Related Solutions

Solved – Significance test on the difference of Spearman’s correlation coefficient

The paper you cite explains the method in the following terms:

[...] we show the statistical significance of the difference between the performance of ESA-Wikipedia (March 26, 2006 version) and that of other algorithms by using Fisher's z-transformation (Press, Teukolsky, Vetterling, & Flannery, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1997, Section 14.5).

I suggest you follow that reference, or have a look at the Wikipedia page on the Spearman coefficient for details.
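
As a rough sketch of the textbook form of this test (not necessarily the exact procedure used in the paper): each coefficient is transformed with z = atanh(r), and the difference is divided by its approximate standard error. The version below assumes the two coefficients come from independent samples; if both are computed on the same items, that assumption does not hold and the sketch would not apply directly. The function name and the input values are hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic and two-sided p-value for H0: the two correlations are equal.

    Assumes independent samples of sizes n1 and n2 (both larger than 3).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher's z-transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical coefficients and sample sizes, for illustration only.
z, p = fisher_z_difference(0.75, 65, 0.60, 65)
print(z, p)
```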

Solved – Interpreting the Spearman’s Rank Correlation Coefficient output in R. What is ‘S’

S is the test statistic, which is the sum of all squared rank differences. To make this more concrete, assume we have the following data:

v1 <- c(1, 2, 3, 4)
v2 <- c(3, 4, 2, 1)

Now, we get the ranks.

# v1  rank(v1)  v2  rank(v2)  d = |rank(v2) - rank(v1)|  d^2
#  1         1   3         3                          2    4
#  2         2   4         4                          2    4
#  3         3   2         2                          1    1
#  4         4   1         1                          3    9

The sum over all d^2 is 4 + 4 + 1 + 9 = 18. (Another example can be found here: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide-2.php)

We find the same thing in R:

test <- cor.test(v1,v2,method="spearman")
test$statistic #S is 18
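
For data without ties, S is tied to the coefficient itself through the standard formula for Spearman's $\rho$. Plugging in n = 4 and S = 18 from the example:

$$\rho = 1 - \frac{6S}{n(n^2 - 1)} = 1 - \frac{6 \cdot 18}{4 \cdot 15} = -0.8$$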

S is derived from random variables and can therefore be assumed to follow a distribution (such as a t-distribution or a normal distribution). Depending on that distribution and its parameters, you can say how likely it is to observe this (or a more extreme) value. That probability is your p-value.

In most cases I would say that the majority of readers are happy with the correlation coefficient, the p-value and the number of cases ("n"). But that is a separate question that would fit better at https://academia.stackexchange.com/.
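
For a very small n without ties, the distribution of S under the null hypothesis can also be listed exhaustively, since every ordering of the ranks is then equally likely. The sketch below is written in Python rather than R and is only an illustration; the function name is invented, and doubling the smaller tail is one common convention for a two-sided p-value, assumed here.

```python
from itertools import permutations


def exact_upper_tail(s_observed, n):
    """Count orderings of n untied ranks whose S is at least s_observed."""
    base = range(1, n + 1)
    count = 0
    total = 0
    for perm in permutations(base):
        # S for this ordering: sum of squared rank differences.
        s = sum((a - b) ** 2 for a, b in zip(base, perm))
        total += 1
        if s >= s_observed:
            count += 1
    return count, total


count, total = exact_upper_tail(18, 4)
p_upper = count / total
# The upper tail is the smaller one for S = 18, so double it.
p_two_sided = min(1.0, 2 * p_upper)
print(count, total, p_two_sided)
```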
