Solved – Correlation between two variables measured on a “strongly agree” to “strongly disagree” scale

ordinal-data, p-value, r, survey

The questions on a survey asked:

Do you actively participate in a study group?
Do you think the class is going too quickly?

For both, the responses are one of the following: strongly agree, agree, neutral, disagree, strongly disagree

So I want to analyze whether there is a correlation between being in a study group and thinking the class is going too quickly.

So I have two columns in a data frame. The first column is labeled "group" and the second is "fast". I have converted the group variable into a numerical variable, like so: strongly agree = 5, agree = 4, neutral = 3, disagree = 2, strongly disagree = 1

So now I have a data frame with two columns, one full of numbers and the other still full of the original answers ("strongly agree", "agree", etc).

I have found the mean "group" score for each response option of "fast". Now I just need to see whether the relationship is statistically significant, but I am clueless. How exactly should I calculate the p-value for this? I have tried several methods, but the p-value seems wrong.

Sorry if this is easy stuff. I think I have made this way more complicated in my mind than it should be, and I appreciate any help.

quickly <- CSExperiencesAllWithHeaders$CEQuickly
groups <- CSExperiencesAllWithHeaders$CEStudyGroup

levels(groups) <- (c(levels(groups), 5, 4, 3, 2, 1))
groups[groups == "strongly agree"] <- 5
groups[groups == "agree"] <- 4
groups[groups == "neutral"] <- 3
groups[groups == "disagree"] <-2
groups[groups == "strongly disagree"] <- 1
groups[groups == ""] <- NA
groups[groups == "N/A"] <- NA
quickly[quickly == "N/A"] <- NA
quickly[quickly == ""] <- NA

groups <- factor(groups)
quickly <- factor(quickly)
analysis3 <- data.frame(groups,quickly)
analysis3 <- na.omit(analysis3)
analysis3$groups <- as.numeric(as.character(analysis3$groups))

sagree2 <- subset(analysis3, quickly  == "strongly agree")
agree2 <- subset(analysis3, quickly == "agree")
neutral2 <- subset(analysis3, quickly == "neutral")
disagree2 <- subset(analysis3, quickly == "disagree")
sdisagree2 <- subset(analysis3, quickly == "strongly disagree")

meansagree2 <- mean(sagree2$groups)
meanagree2 <- mean(agree2$groups)
meanneutral2 <- mean(neutral2$groups)
meandisagree2 <- mean(disagree2$groups)
meansdisagree2 <- mean(sdisagree2$groups)

barplot(c(meansagree2, meanagree2, meanneutral2, meandisagree2, 
          meansdisagree2),
        main = "Those Who Think Class is Too Quick: In Study Groups?",
        names.arg=c("Strongly Agree","Agree","Neutral","Disagree", 
                    "Strongly Disagree"),
        xlab = "Class too Quick?",
        ylab = "In a Study Group?")

All this code creates the following data frame (I only took the top of the data frame, since the real one has over 1000 rows):

    groups  quickly
1   5   'strongly disagree'
2   4   'strongly agree'
3   1   'disagree'
4   1   'disagree'
5   4   'strongly disagree'
6   2   'strongly disagree'
7   1   'neutral'
8   2   'disagree'
9   1   'strongly disagree'
10  2   'strongly disagree'
11  1   'strongly disagree'
12  2   'neutral'
13  5   'disagree'
14  2   'disagree'
15  4   'neutral'
16  2   'disagree'
17  5   'disagree'
18  5   'neutral'
19  4   'strongly disagree'
20  2   'strongly disagree'
21  3   'disagree'
22  1   'strongly disagree'
23  4   'strongly agree'
24  1   'strongly disagree'
26  5   'strongly disagree'
27  1   'strongly disagree'
28  5   'disagree'
29  5   'agree'

This is what I get when I use the dput function:

structure(list(groups = c(5, 4, 1, 1, 4, 2, 1, 2, 1, 2, 1, 2,
5, 2, 4, 2, 5, 5, 4, 2, 3, 1, 4, 1, 5, 1, 5, 5, 5, 5), quickly = structure(c(5L,
4L, 2L, 2L, 5L, 5L, 3L, 2L, 5L, 5L, 5L, 3L, 2L, 2L, 3L, 2L, 2L,
3L, 5L, 5L, 2L, 5L, 4L, 5L, 5L, 5L, 2L, 1L, 2L, 3L), .Label = c("agree",
"disagree", "neutral", "strongly agree", "strongly disagree"), class = "factor"),
qui_fact = structure(c(5L, 1L, 4L, 4L, 5L, 5L, 3L, 4L, 5L,
5L, 5L, 3L, 4L, 4L, 3L, 4L, 4L, 3L, 5L, 5L, 4L, 5L, 1L, 5L,
5L, 5L, 4L, 2L, 4L, 3L), .Label = c("strongly agree", "agree",
"neutral", "disagree", "strongly disagree"), class = "factor"),
qui_num = c(5, 1, 4, 4, 5, 5, 3, 4, 5, 5, 5, 3, 4, 4, 3,
4, 4, 3, 5, 5, 4, 5, 1, 5, 5, 5, 4, 2, 4, 3)), .Names = c("groups",
"quickly", "qui_fact", "qui_num"), na.action = structure(c(25L,
31L, 37L, 38L, 86L, 91L, 148L, 209L, 270L, 280L, 285L, 328L,
338L, 340L, 410L, 424L, 456L, 460L, 461L, 480L, 568L, 587L, 593L,
596L, 599L, 600L, 607L, 621L, 658L, 700L, 717L, 731L, 758L, 776L,
827L, 837L, 849L, 862L, 864L, 896L, 899L, 909L, 921L, 946L, 963L,
966L, 977L, 994L, 1007L, 1012L, 1074L, 1079L), .Names = c("25",
"31", "37", "38", "86", "91", "148", "209", "270", "280", "285",
"328", "338", "340", "410", "424", "456", "460", "461", "480",
"568", "587", "593", "596", "599", "600", "607", "621", "658",
"700", "717", "731", "758", "776", "827", "837", "849", "862",
"864", "896", "899", "909", "921", "946", "963", "966", "977",
"994", "1007", "1012", "1074", "1079"), class = "omit"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 26L, 27L, 28L, 29L,
30L, 32L), class = "data.frame")

Best Answer

As you have ordinal factors, means are not so useful. You could use a $\chi^2$ test and/or the Spearman correlation to check whether the two variables are associated.

Commands:

chisq.test(analysis3$groups, analysis3$quickly)

After converting your "quickly" strings to a factor, reordering its levels and converting them to a numeric vector, you can apply the Spearman correlation:

analysis3$qui_fact <- as.factor(analysis3$quickly)
levels(analysis3$qui_fact) # alphabetical levels
# reorder as needed; this order codes "strongly agree" as 1, the reverse of groups,
# so the sign of rho is flipped (use c(5,2,3,1,4) to code it as 5, like groups)
analysis3$qui_fact <- factor(analysis3$qui_fact, levels(analysis3$qui_fact)[c(4,1,3,2,5)])
analysis3$qui_num <- as.numeric(analysis3$qui_fact)
cor.test(analysis3$groups, analysis3$qui_num, alt="two.sided", method="spearman", conf.level=.99)
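As a rough, self-contained sketch of the two tests suggested above, the following uses made-up data rather than the survey itself. The object names (lik, toy, groups_num, quickly_num) are hypothetical. It assumes both items share the same five labels and codes both in the same direction, so that a positive rho means agreement on one item goes with agreement on the other. The simulated p-value for the $\chi^2$ test and the Kendall alternative are optional choices, not part of the original answer.

```r
# Toy data (made up for illustration); both items use the same five labels
lik <- c("strongly disagree", "disagree", "neutral", "agree", "strongly agree")
set.seed(42)
toy <- data.frame(
  groups  = sample(lik, 200, replace = TRUE),
  quickly = sample(lik, 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Code both items in the same direction:
# strongly disagree = 1, ..., strongly agree = 5
toy$groups_num  <- as.integer(factor(toy$groups,  levels = lik))
toy$quickly_num <- as.integer(factor(toy$quickly, levels = lik))

# Cross-tabulate, then test for association; with sparse cells a
# Monte Carlo p-value avoids relying on the chi-squared approximation
tab <- table(toy$groups_num, toy$quickly_num)
chi <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

# Rank correlations; exact = FALSE because Likert data are heavily tied
rho <- cor.test(toy$groups_num, toy$quickly_num,
                method = "spearman", exact = FALSE)
tau <- cor.test(toy$groups_num, toy$quickly_num,
                method = "kendall", exact = FALSE)

# Collect the three p-values side by side
c(chisq = chi$p.value, spearman = rho$p.value, kendall = tau$p.value)
```

Note that the $\chi^2$ test ignores the ordering of the categories, whereas the rank correlations use it, so the two kinds of test can disagree.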

Related Solutions

Solved – Visualizing Likert responses using R or SPSS

Using `irutils`

I came across this package some months ago.

As of commit 0573195c07 on GitHub, the code won't work with a grouping= argument. Let's go for Friday's debugging session.

Start by downloading a zipped version from GitHub. You'll need to hack the R/likert.R file, specifically the likert and plot.likert functions. First, in likert, cast() is used but the reshape package is never loaded (although there's an import(reshape) instruction in the NAMESPACE file); you can load it yourself beforehand. Second, there's an incorrect instruction for fetching item labels, where a stray i is left dangling around line 175. This has to be fixed as well, e.g. by replacing all occurrences of likert$items[,i] with likert$items[,1]. Then you can install the package the way you usually do on your machine. On my Mac, I did

% tar -czf irutils.tar.gz jbryer-irutils-0573195
% R CMD INSTALL irutils.tar.gz

Then, with R, try the following:

library(irutils)
library(reshape)

# Simulate some data (82 respondents x 66 items)
resp <- data.frame(replicate(66, sample(1:5, 82, replace=TRUE)))
resp <- data.frame(lapply(resp, factor, ordered=TRUE, 
                          levels=1:5, 
                          labels=c("Strongly disagree","Disagree",
                                   "Neutral","Agree","Strongly Agree")))
grp <- gl(2, 82/2, labels=LETTERS[1:2]) # say equal group size for simplicity

# Summarize responses by group
resp.likert <- likert(resp, grouping=grp)

That should just work, but the visual rendering will be awful because of the high number of items. It works without grouping (e.g., plot(likert(resp))), though.


I would thus suggest reducing your dataset to smaller subsets of items. E.g., using 12 items,

plot(likert(resp[,1:12], grouping=grp))

I get a 'readable' stacked barchart. You can probably process them afterwards. (Those are ggplot2 objects, but you won't be able to arrange them on a single page with gridExtra::grid.arrange() because of readability issues!)


Alternative solution

I would like to draw your attention to another package, HH, which can plot Likert scales as diverging stacked barcharts. We could reuse the above code as shown below:

resp.likert <- likert(resp)
detach(package:irutils)
library(HH)
plot.likert(resp.likert$results[,-6]*82/100, main="")

but that will complicate things a bit because we need to convert frequencies to counts, subset the likert object produced by irutils, detach package, etc. So let's start again with fresh (counts) statistics:

plot.likert(t(apply(resp, 2, table)), main="", as.percent=TRUE,
            rightAxisLabels=NULL, rightAxis=NULL, ylab.right="", 
            positive.order=TRUE)


To use a grouping variable, you'll need to work with an array of numerical values.

# compute responses frequencies separately by grp
resp.array <- array(NA, dim=c(66, 5, 2))
resp.array[,,1] <- t(apply(subset(resp, grp=="A"), 2, table))
resp.array[,,2] <- t(apply(subset(resp, grp=="B"), 2, table))
dimnames(resp.array) <- list(NULL, NULL, group=levels(grp))
plot.likert(resp.array, layout=c(2,1), main="")

This will produce two separate panels, but it fits on a single page.
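A variation on the array construction, sketched in base R only and on a smaller simulated data set (20 respondents, 4 items), keeps the item, response and group names in the dimnames instead of setting them to NULL. The object name lv and the reduced sizes are assumptions made for illustration; whether the labelled array renders more nicely in plot.likert has not been checked here.

```r
set.seed(1)
lv <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")

# Simulated data (20 respondents x 4 items), two groups of 10
resp <- data.frame(replicate(4, sample(1:5, 20, replace = TRUE)))
resp <- data.frame(lapply(resp, factor, levels = 1:5, labels = lv))
grp  <- gl(2, 10, labels = c("A", "B"))

# One items x responses count matrix per group, with dimnames kept
resp.array <- array(NA_integer_,
                    dim = c(ncol(resp), length(lv), nlevels(grp)),
                    dimnames = list(item = names(resp), response = lv,
                                    group = levels(grp)))
for (g in levels(grp)) {
  resp.array[, , g] <- t(sapply(resp[grp == g, , drop = FALSE], table))
}

# Sanity check: each item accounts for all 10 respondents in each group
apply(resp.array, c(1, 3), sum)
```

Filling the array in a loop over levels(grp) also avoids hard-coding the number of groups.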


Edit 2016-6-3

As of now, likert is available as a separate package. You no longer need to load the reshape library or to detach irutils and reshape.

Solved – Visualization of binned frequency distribution in R

This kind of plot could be generated with geom_rect.

Your data:

names <- read.csv("http://samswift.org/files/app_c.csv")
sum50 <- tapply(names$count, (seq_along(names$count)-1) %/% 50, sum)
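To see what the binning line does, here is a tiny sketch with made-up counts and a bin size of 2 instead of 50 (the vectors count and bin are hypothetical stand-ins for names$count and the index expression above):

```r
# Made-up counts standing in for names$count; bin size 2 instead of 50
count <- c(4, 3, 2, 1, 1)

# Integer division of the zero-based position gives the bin index: 0 0 1 1 2
bin <- (seq_along(count) - 1) %/% 2

# Sum the counts within each bin; the result is named by bin index
sums <- tapply(count, bin, sum)
sums
```

The three bins sum to 7, 3 and 1, and the names "0", "1" and "2" are what the later code turns into range labels.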

First, we need additional variables:

The cumulative sum:

cum <- rev(cumsum(rev(sum50)))

Put all into a data frame. The variables start and stop indicate where the rectangles should begin and end, respectively:

data <- data.frame(sum = sum50,
                   names = paste(as.numeric(names(sum50)) * 50 + 1,
                                 as.numeric(names(sum50)) * 50 + 50, sep = "-"),
                   start = c(cum[-1], 0),
                   stop = cum, stringsAsFactors = FALSE)
data$names[nrow(data)] <- paste(as.numeric(names(sum50)[length(sum50)]) * 50 + 1,
                                as.numeric(names(sum50)[length(sum50)]) * 50 + 
                                                          nrow(names) %% 50, sep = "-")

The variable center is the center between start and stop position:

data$center <- (data$stop - data$start)/2 + data$start
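A small worked example, using made-up bin totals, may help to show why these variables produce squares. The names start_pos and stop_pos are hypothetical stand-ins for the start and stop columns above.

```r
sums <- c(7, 3, 1)                      # made-up bin totals

# Reverse cumulative sum: total of this bin and all bins after it
cum <- rev(cumsum(rev(sums)))           # 11 4 1

start_pos <- c(cum[-1], 0)              # 4 1 0
stop_pos  <- cum                        # 11 4 1
center    <- (stop_pos - start_pos) / 2 + start_pos  # 7.5 2.5 0.5

# Each rectangle's width equals its bin total, which is also its height
all(stop_pos - start_pos == sums)
```

Since the width (stop minus start) and the height (sum) are equal for every bin, coord_equal() draws each rectangle as a square.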

For this example, I use the first five rows:

data <- data[1:5, ]

Plot:

library(ggplot2)

ggplot(data, aes(xmin = start, xmax = stop, ymin = 0, ymax = sum)) +
  geom_rect(fill = NA, colour = "black") +
  scale_x_reverse("bin", breaks = data$center, labels = data$names) +
  coord_equal() # because we want squares


The same code also works for the complete data set; in that case you should consider using only a subset of the x-axis labels.
