See this question: Analyzing Likert scales
Agresti does a lot of this ordinal data analysis (e.g., "Analysis of Ordinal Categorical Data").
For your particular problem, I would suggest looking at three methods: multiple hypothesis testing (http://en.wikipedia.org/wiki/Multiple_comparisons), mixed effects models (http://en.wikipedia.org/wiki/Mixed_model; package `lme4`, function `lmer()` in R), and cumulative link mixed models (http://cran.r-project.org/web/packages/ordinal/vignettes/clmm2_tutorial.pdf; package `ordinal`, function `clmm()` in R).
In general, I wouldn't recommend doing traditional multiple testing, since that assumes the data are ratio (rather than ordinal, as you have). If you want to make that assumption, though, you can simply test which questions have an average response different from the center of the Likert scale, and then apply a correction to account for the fact that you did 9+6+7+2+4+2 tests.
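For illustration, here is a minimal sketch of that approach in R, assuming a 7-point scale centered at 4 and a list `question_responses` holding one numeric vector of responses per question (both the midpoint and the object name are my assumptions, not from the original data):

# one-sample t-tests against the assumed scale midpoint, then a Holm
# correction for the 30 (= 9+6+7+2+4+2) tests
midpoint <- 4  # assumed center of a 7-point Likert scale
pvals <- sapply(question_responses, function(x) t.test(x, mu = midpoint)$p.value)
p.adjust(pvals, method = "holm")  # multiplicity-adjusted p-values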
For the mixed effects models, treat each group of questions separately ("utility of the program", etc.). Treat each question as a random effect (there is a population of possible questions you could have chosen, and you happened to pick these 9 questions about utility), and treat the respondent as a random effect (there is a population of possible people whose opinions you want to gather, and you happened to sample this group). Hence the model is $y_{ij}=\mu + a_i + b_j + e_{ij}$, where $y_{ij}$ is the response of person $i$ to question $j$, $a_i$ is the random effect due to person $i$ (you have 16 people), $b_j$ is the random effect due to question $j$ (you have 9 questions in the group "utility"), and $e_{ij}$ is the error capturing how much person $i$'s response to question $j$ differed from the model.
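To make the setup concrete, here is a small simulation from this model (purely illustrative: the variance components and the 7-point scale are my assumptions), producing a data frame in the format used below:

set.seed(1)
n_person <- 16; n_question <- 9
mu <- 4                                   # assumed center of a 7-point scale
a <- rnorm(n_person, sd = 0.5)            # a_i: person effects
b <- rnorm(n_question, sd = 0.3)          # b_j: question effects
dat <- expand.grid(person = factor(seq_len(n_person)),
                   question = factor(seq_len(n_question)))
e <- rnorm(nrow(dat), sd = 0.7)           # e_ij: residual error
dat$response <- pmin(7, pmax(1, round(mu + a[dat$person] + b[dat$question] + e)))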
Using the `lme4` package, you can estimate $\mu$ and test whether it is significantly different from the center of the Likert scale. Using the `ordinal` package, you can do this more carefully, taking into account that your data are ordinal rather than ratio, but you lose some of the interpretability of the linear mixed effects model.
Those packages use a sort of funny notation. Suppose your data are in a data frame called `dat` with columns `response`, `question`, and `person`. Then you can implement this as follows:
require(lme4)
# linear mixed model: fixed intercept (mu) plus crossed random effects
lmer(response ~ 1 + (1 | question) + (1 | person), data = dat)

require(ordinal)
# cumulative link mixed model: the response is treated as an ordered factor
clmm(ordered(response) ~ 1 + (1 | question) + (1 | person), data = dat)
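To then test whether $\mu$ differs from the scale midpoint, one possibility (a sketch, again assuming the midpoint is 4) is to check whether a confidence interval for the fixed intercept excludes that value:

fit <- lmer(response ~ 1 + (1 | question) + (1 | person), data = dat)
# profile confidence interval for the fixed intercept (the estimate of mu);
# if it excludes the assumed midpoint of 4, mu differs significantly from it
confint(fit, parm = "(Intercept)")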
First thing: if you are going to somehow average or combine responses to multiple questions, you are assuming that the answers to all the questions measure the same underlying latent variable (e.g., in this case, perhaps "user's satisfaction with the new web interface"). If the answers to these questions are not all related (e.g., does a person's answer to one question predict their answer to the other questions?), then you cannot combine them.
Typically, when people create a new questionnaire, they examine whether the new questionnaire is indeed useful (reliability, consistency, validity, etc.). For example, check out:
http://www.brighamandwomens.org/medical_professionals/career/cfdd/mentoring%20resources/surveydesign.pdf
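For example, a common internal-consistency check is Cronbach's alpha; here is a minimal sketch using the `psych` package, assuming the nine "utility" items are the columns of a data frame `utility_items` (a hypothetical name):

require(psych)
# Cronbach's alpha: do the items hang together well enough to be combined?
alpha(utility_items)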
For questionnaires using the Likert scale that have passed these checks, I have published papers in academic journals using the sum or mean of the answers (with statistics such as regression performed on those values). If we assume that your questionnaire has been designed correctly and that all the items measure some latent factor of "user's satisfaction with the new web interface", then a mean of, say, 4.5 (halfway between 'strongly agree' and 'somewhat agree') suggests that the person falls somewhere between 'strongly agree' and 'somewhat agree' in being "satisfied with the new web interface."
In other words, the mean of all the scales is the value of the latent variable "user's satisfaction with the new web interface" for that person. If certain assumptions are met (e.g., see below), the central limit theorem allows the latent variable to be on a continuous scale instead of an ordinal one.
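In R, that per-person value is just the row mean over the scale's items (continuing the hypothetical `utility_items` data frame from above):

# one score per respondent: the mean of that person's answers on the scale
satisfaction <- rowMeans(utility_items)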
Don't use the mode or median.
P.S. A snippet from Wikipedia:
"Responses to several Likert questions may be summed providing that all questions use the same Likert scale and that the scale is a defensible approximation to an interval scale, in which case the Central Limit Theorem allows treatment of the data as interval data measuring a latent variable.[citation needed] If the summed responses fulfill these assumptions, parametric statistical tests such as the analysis of variance can be applied. Typical cutoffs for thinking that this approximation will be acceptable is a minimum of 4 and preferably 8 items in the sum.[12][13]"
Best Answer
The "margin of error" of 38% is computed using a formula for 0/1 results that are obtained independently and randomly from a large population. None of these apply. The analogous formula for the present case (with 1..7 results obtained from a small population assuming random non-response) would involve a sample standard deviation with a finite population correction. It would not be very helpful in sorting out the confusion. Instead, let's explore the concept starting from its foundations.
The purpose of a margin of error is to indicate how much the population might differ from the sample, assuming that the data are a random sample of the population.
The problem we have is we don't know what the four missing respondents would have said. We have to cover all possible cases that are consistent with the data.
An interesting possibility is that seven of the 11 CEOs would give a reply of 7, two of them would reply with 5, and the other two with 1: this is (hypothetically) the population. How consistent are the data with this scenario? In this case, the chance of observing six 7's and one 5 in a random sample of seven respondents is
$$\frac{\binom{7}{6}\binom{2}{1}\binom{2}{0}}{\binom{11}{7}} = \frac{7 \times 2 \times 1}{330} \approx 4.24\%.$$
I imagine that observing seven 7's ($1/330$ chance) or five 7's and two 5's ($21/330$ chance) would also be considered a "very satisfied" overall rating. The total chance of observing such a rating in the sample would therefore equal $(14 + 1 + 21)/330 \approx 10.9\%$.
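These are multivariate hypergeometric probabilities, and the arithmetic can be verified directly in R:

# chance of six 7's and one 5 in a random sample of 7 from the
# hypothetical population of seven 7's, two 5's, and two 1's
choose(7, 6) * choose(2, 1) * choose(2, 0) / choose(11, 7)  # 14/330, about 4.24%
# total chance of a "very satisfied" sample: seven 7's, six 7's and one 5,
# or five 7's and two 5's
(choose(7, 7) + choose(7, 6) * choose(2, 1) +
   choose(7, 5) * choose(2, 2)) / choose(11, 7)             # 36/330, about 10.9%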
This is about the worst situation that is consistent with the recorded observations, in the sense that it is the most pessimistic population for which the conclusion drawn from our sample ("very satisfied") still has at least a 5% chance of arising from a random selection. A reasonable way to express the "margin of error", then, is to note that as many as two of the CEOs, but not more than that, might have been extremely dissatisfied.
It's a good idea to go further with the analysis, because we have no evidence to support the assumption that the seven respondents actually are a representative sample. Most likely they are not. Perhaps, for example, four of the CEOs would have replied with 1's but chose not to respond because they did not want to reveal their dissatisfaction: the missingness is not at random.
A frank and thoughtful exposition of the results would bring up both these points when assessing the credibility of the results and their applicability to the entire population. Its conclusions would necessarily be tentative. They might be stated thus:
One advantage of this plain exposition is that it makes no unnecessary technical demands on the reader (or the writer) by invoking a "margin of error" formula that would need to be explained and interpreted, and that might well be incorrect because it does not account for the small population size or the possibility of non-random response.
The techniques illustrated here apply to similar problems with small population sizes, non-random responses, or qualitative ordinal scales.