Solved – R Matrix correlation non-numeric

correlation, r

Dear R users more advanced than I,
I am having a problem creating and interpreting some data. Here is the question; please let me know which parts are not clear.

I have some data that I have imported and attached in R. Let's call the data set Fun_D. Two of its columns are non-numeric and look like this in my data frame:

             V1OS  V2Browser
Person 1: Windows     Opera
Person 2:     Mac    Safari
Person 3:   Linux    Chrome

etc.

Here V1 is the vector of the different OSs and V2 is the vector of the different browsers. I've got 100 rows of data like this; it doesn't stop at just 3.

Now, finally, the question. I want to see a correlation matrix (with values from 0 to 1) between the OSs and the browsers. Say, for example, that more people on Windows use Opera than people on Mac do; based on that, we should see a higher correlation number between Windows and Opera than between Mac and Opera. Does this make sense?

I've used the Goodman and Kruskal test with some success in the past, but I don't know whether it is correct to use it here, or how I would even interpret its results in this case.

Please let me know what I can clarify as I greatly appreciate any advice you can provide,

Thank you very much

Sam

Best Answer

Your data is not suitable for either ANOVA/Kruskal-Wallis or correlation analysis, because both variables are nominal categories rather than numbers or ranks. What you need is a Chi-square test for independence to determine whether the operating system is related to browser preference at all.
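As a minimal sketch of such a test, the block below builds a made-up stand-in for Fun_D (the counts are invented purely for illustration, and the column names V1OS and V2Browser are taken from the question) and passes the cross-tabulation to chisq.test(). The standardised residuals are one possible way to see which OS/browser pairs drive the association.

```r
# Invented counts for illustration only; they are not the asker's data.
counts <- matrix(c(20,  5, 15,
                    3, 20,  7,
                    5,  2, 23),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(OS = c("Windows", "Mac", "Linux"),
                                 Browser = c("Opera", "Safari", "Chrome")))

# Expand the counts into one row per person, mimicking the layout of Fun_D.
long  <- as.data.frame(as.table(counts))
Fun_D <- long[rep(seq_len(nrow(long)), long$Freq), c("OS", "Browser")]
names(Fun_D) <- c("V1OS", "V2Browser")

# Cross-tabulate the two categorical columns and test for independence.
tab <- table(Fun_D$V1OS, Fun_D$V2Browser)
res <- chisq.test(tab)

res$p.value   # small value: evidence that OS and browser are related
res$expected  # expected counts under independence
res$stdres    # standardised residuals: large |value| marks an unusual cell
```

A small p-value only says that the two variables are associated somewhere in the table; it does not by itself rank individual OS/browser pairs.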

To study specific relations, you could then make something like a hierarchical graph (or a tree) where all OS's (level 1 nodes) are linked to all of your browsers (level 2 nodes). The weight of an edge on the graph would indicate the strength of a connection between an OS and a browser. If you represent this graph as a matrix, you can easily normalise your connection strength in the unit interval [0,1] by dividing every element in your matrix by its largest element.
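The matrix form of that graph can be sketched in a few lines. The counts below are invented for illustration; with real data they would come from something like table(Fun_D$V1OS, Fun_D$V2Browser). Dividing by the largest cell gives the [0,1] weights described above, and the row-wise shares from prop.table() are shown as an alternative scaling (an addition here, not part of the answer) that is not distorted when one OS simply has more users.

```r
# Invented counts for illustration only; rows are OSs, columns are browsers.
counts <- matrix(c(20,  5, 15,
                    3, 20,  7,
                    5,  2, 23),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(OS = c("Windows", "Mac", "Linux"),
                                 Browser = c("Opera", "Safari", "Chrome")))

# Edge weights in [0, 1]: divide every cell by the largest cell.
weights <- counts / max(counts)

# Alternative scaling: share of each browser within each OS (rows sum to 1).
shares <- prop.table(counts, margin = 1)

weights["Windows", "Opera"] > weights["Mac", "Opera"]  # TRUE for these counts
round(weights, 2)
round(shares, 2)
```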

Related Solutions

Solved – how can i compare two groups of data

From reading your previous post, I see that you have two groups of 15 subjects, each with multiple observations (3 each). Each subject appears in both groups, except subject 15, who has no observations in group 1.

So, basically, you have a paired design. One way to test whether Group 1 and Group 2 differ is a Wilcoxon test. Note that the call below does not set paired = TRUE, so it runs the unpaired Wilcoxon rank sum test (as its output header confirms) rather than a paired signed rank test. In R, this can be done using the following code:

df<- structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                              1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                              1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                              1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                              2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                              2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Subject = c(1L, 
                                                                                                   1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
                                                                                                   6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L, 
                                                                                                   11L, 12L, 12L, 12L, 13L, 13L, 13L, 14L, 14L, 14L, 1L, 1L, 1L, 
                                                                                                   2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 
                                                                                                   7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L, 11L, 
                                                                                                   12L, 12L, 12L, 13L, 13L, 13L, 14L, 14L, 14L, 15L, 15L, 15L), 
                    Value = c(29.89577946, 29.51885854, 29.77429604, 33.20695108, 
                              32.09027292, 31.90909894, 30.88358173, 30.67547731, 30.82494595, 
                              31.70128247, 31.57217504, 31.61359752, 30.51371055, 30.42241945, 
                              30.44913954, 26.90850496, 0, 0, 0, 0, 0, 28.94047335, 29.27188604, 
                              29.78511206, 28.18475423, 27.54266717, 26.99873401, 29.26941344, 
                              28.50457189, 28.78050443, 31.39038527, 31.19237052, 30.74053275, 
                              28.68618888, 28.42109545, 28.58222544, 28.99337177, 29.31797, 
                              28.4541501, 28.18475423, 27.54266717, 26.99873401, 28.07576794, 
                              28.96344894, 28.48358437, 27.02527663, 27.1308483, 26.96091103, 
                              27.04019758, 27.51900858, 28.14559621, 26.83569136, 26.90724462, 
                              26.82675, 0, 0, 0, 27.62449786, 26.82335228, 26.66925534, 
                              0, 25.81254792, 26.61666776, 26.12545858, 0, 0, 0, 0, 0, 
                              28.84580419, 29.11003424, 29.24723895, 28.72919768, 29.70673437, 
                              29.31274377, 30.73133587, 30.44805655, 30.61561583, 27.06896964, 
                              27.04249553, 27.15990629, 31.54738209, 31.51643714, 31.8055509, 
                              31.291867, 31.89146186, 31.65812735)), .Names = c("Group", 
                                                                                "Subject", "Value"), class = "data.frame", row.names = c(NA, 
                                                                                                                                         -87L))



df$Value[df$Value == 0] <- NA

df[is.na(df$Value),] ## missing data

table(df$Group, df$Subject) ## check to see if all groups have equal obs


## perform Wilcoxon rank sum test (unpaired: paired = TRUE is not set)
wilcox.test(Value ~ Group, data = df[df$Subject != 15, ]) ## omit the 15th patient

Wilcoxon rank sum test with continuity correction

data:  Value by Group
W = 900, p-value = 0.0006732
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(x = c(29.89577946, 29.51885854, 29.77429604,  :
                               cannot compute exact p-value with ties

## we can reject the null hypothesis that both groups are equal

From the R documentation,

If exact p-values are available, an exact confidence interval is obtained by the algorithm described in Bauer (1972), and the Hodges-Lehmann estimator is employed. Otherwise, the returned confidence interval and point estimate are based on normal approximations. These are continuity-corrected for the interval but not the estimate (as the correction depends on the alternative).
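For comparison, a paired signed rank test could look like the sketch below. The two vectors are hypothetical per-subject values invented for illustration, not the data above; with replicate observations, each subject would first have to be reduced to one value per group (for example with aggregate()), and subjects missing from either group dropped. Setting conf.int = TRUE requests the interval and estimate that the quoted documentation describes.

```r
# Hypothetical per-subject values (invented; one value per subject and group).
g1 <- c(29.7, 32.4, 30.8, 31.6, 30.5, 29.3, 27.6, 28.9)
g2 <- c(28.5, 27.0, 27.6, 26.9, 26.7, 29.1, 29.2, 30.6)

# paired = TRUE tests the within-subject differences g1 - g2;
# conf.int = TRUE also returns a confidence interval and point estimate.
res <- wilcox.test(g1, g2, paired = TRUE, conf.int = TRUE)

res$method     # names a signed rank test, not a rank sum test
res$statistic  # V: sum of the ranks of the positive differences
res$p.value
res$conf.int
```

Because these differences contain no ties or zeros, R can compute an exact p-value here, so the "cannot compute exact p-value with ties" warning does not appear.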
