The "margin of error" of 38% is computed using a formula for 0/1 results that are obtained independently and randomly from a large population. None of these apply. The analogous formula for the present case (with 1..7 results obtained from a small population assuming random non-response) would involve a sample standard deviation with a finite population correction. It would not be very helpful in sorting out the confusion. Instead, let's explore the concept starting from its foundations.
The purpose of a margin of error is to indicate how much the population might differ from the sample, assuming that the data are a random sample of the population.
The problem we have is we don't know what the four missing respondents would have said. We have to cover all possible cases that are consistent with the data.
An interesting possibility is that seven of the 11 CEOs would give a reply of 7, two of them would reply with 5, and the other two with 1: this is (hypothetically) the population. How consistent are the data with this scenario? In this case, the chance of observing six 7's and one 5 at random is
$$\frac{\binom{7}{6}\binom{2}{1}\binom{2}{0}}{\binom{11}{7}} = \frac{7 \times 2 \times 1}{330} \approx 4.24\%.$$
I imagine that observing seven 7's ($1/330$ chance) or five 7's and two 5's ($21/330$ chance) would also be considered a "very satisfied" overall rating. The total chance of observing this rating in the sample would therefore equal $(14 + 1 + 21)/330 \approx 10.9\%$.
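These counts are easy to double-check; here is a small sketch in R using the same counting argument (the population of eleven scores is the hypothetical one just described):

```r
## Hypothetical population: seven 7's, two 5's, two 1's; a sample of 7 of the 11.
## Chance of drawing exactly six 7's and one 5:
choose(7, 6) * choose(2, 1) * choose(2, 0) / choose(11, 7)   # ~ 0.0424

## Adding the other "very satisfied" samples (seven 7's; five 7's and two 5's):
(choose(7, 6) * choose(2, 1) + choose(7, 7) + choose(7, 5) * choose(2, 2)) /
  choose(11, 7)                                              # ~ 0.109
```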
This is about the worst situation that is consistent with the recorded observations in the sense that the conclusion made from our sample ("very satisfied") has at least a 5% chance of arising from a random selection. A reasonable way to express the "margin of error," then, is to note that as many as two of the CEOs, but not more than that, might have been extremely dissatisfied.
It's a good idea to go further with the analysis, because we have no evidence to support the assumption that the seven respondents actually are a representative sample. In fact, most likely they are not. Perhaps, for example, four of the CEOs would have replied with 1's but chose not to respond because they did not want to reveal their dissatisfaction: the missingness is not at random.
A frank and thoughtful exposition of the results would bring up both these points when assessing the credibility of the results and their applicability to the entire population. Its conclusions would necessarily be tentative. They might be stated thus:
In this self-selected sample of 7 of the 11 CEOs, six reported being "very satisfied" and one "somewhat satisfied" with their bonus. Nothing is known about what the remaining four CEOs feel. We can conclude that as a group, the majority of the 11 CEOs were highly satisfied, but there may be anywhere from none through a significant minority (four) who were dissatisfied.
One advantage of this plain exposition is that it makes no unnecessary technical demands on the reader (or the writer) by invoking a "margin of error" formula which would need to be explained and interpreted and might well be incorrect by not accounting for the small population size or possibility of non-random response.
The techniques illustrated here apply to similar problems with small population sizes, non-random responses, or qualitative ordinal scales.
This answer is based on another answer of mine, slightly adapted to your question.
If you're sure your items are all measuring the same latent construct, you could use a partial credit model to account for differences in response scaling across all items. If the items with four-point Likert scale (polytomous) measurements are all on the exact same scale though, you might be better off using a rating scale model of the polytomous items and a separate, probably more basic item response theory model for the binary items. John Michael Linacre and Benjamin D. Wright posted some discussions of the differences between partial credit and rating scale models over at rasch.org that might give you a better sense of what you'd be dealing with if you go the item response theory route here.
Some latent variable analysis programs will let you set certain thresholds to be equal across certain items and leave another item's threshold freely estimated. You might be able to blend the partial credit and rating scale models this way by setting your polytomous items' thresholds (each item will have three) to be equal across items, and estimating the binary items' single threshold independently of the polytomous items. Depending on your theory about the binary items, they could all have the same threshold as each other, or they could each have their own, or maybe somewhere between those two extremes...but I'm not exactly sure this is all you'd need to do to have the best of both worlds.
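To make that blend concrete, here is one possible sketch using the `mirt` package in R (my choice of software, not something the approach requires; the count of 5 binary items and the `item_responses` data frame are hypothetical, while the 12 four-point items follow the numbers below):

```r
library(mirt)

## Hypothetical mixed-format data: 12 four-point Likert items followed by
## 5 binary items, all assumed to reflect a single latent construct.
## "rsm" fits a rating scale model (thresholds shared across those items);
## "Rasch" on the binary items leaves each item's single threshold free.
itemtypes <- c(rep("rsm", 12), rep("Rasch", 5))
fit_blend <- mirt(item_responses, model = 1, itemtype = itemtypes)

## For comparison, a partial credit model ("Rasch" on the polytomous items)
## lets every four-point item keep its own three thresholds.
fit_pcm <- mirt(item_responses, model = 1, itemtype = rep("Rasch", 17))

coef(fit_blend, simplify = TRUE)
```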
The simple, "classical test theory" approach that weighs every item equally would probably have you just standardize all the items and average the $z$-scores, but I don't think that's a good idea, because four-point Likert scales may not approximate a continuous dimension well enough (and a binary item definitely won't; it might not even make sense), though the average of 12 polytomous items might be approximately continuous enough. I've seen it suggested that each item's Likert scale should have at least five items to approximate a continuous distribution, and at least five Likert scale items should measure the same scale if their simple sum / average is to approximate a continuous dimension. (Can't remember where, but I can look it up and edit it in if you want a source but can't find one yourself; just comment!)
If you're not sure your items are all measuring the same latent construct, I'm afraid you have other things to worry about; see these questions:
Best Answer
See this question: Analyzing Likert scales
Agresti covers a lot of this kind of ordinal data analysis (e.g., in "Analysis of Ordinal Categorical Data").
For your particular problem, I would suggest looking at three methods: multiple hypothesis testing (http://en.wikipedia.org/wiki/Multiple_comparisons), mixed effects models (http://en.wikipedia.org/wiki/Mixed_model; package `lme4`, function `lmer()` in R), and cumulative link mixed models (http://cran.r-project.org/web/packages/ordinal/vignettes/clmm2_tutorial.pdf; package `ordinal`, function `clmm()` in R).

In general, I wouldn't recommend doing traditional multiple testing, since that assumes the data is ratio (rather than ordinal, like you have). If you want to make that assumption though, you can just test to see which questions have an average response different from the center of the Likert scale, and then use a correction to take into account the fact that you did 9+6+7+2+4+2 tests.
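If you did go that multiple-testing route, a minimal sketch in R might look like this (it assumes a long-format data frame `dat` with `response` and `question` columns, as used for the mixed models below, and a 5-point scale whose center is 3):

```r
## One one-sample t-test per question against the assumed scale center (3),
## followed by a multiplicity correction for the 30 = 9+6+7+2+4+2 tests.
pvals <- sapply(split(dat$response, dat$question),
                function(x) t.test(x, mu = 3)$p.value)
p.adjust(pvals, method = "holm")
```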
For the mixed effects models, use random effects and treat each group of questions separately ("utility of the program", etc.). Treat each question as a random effect (there is a population of possible questions you could have chosen, and you happened to pick these 9 questions about utility), and treat the respondent as a random effect (there is a population of possible people whose opinions you want to gather, and you happened to sample this group). Hence, the model is $y_{ij}=\mu + a_i + b_j + e_{ij}$, where $y_{ij}$ is the response of person $i$ to question $j$, $a_i$ is the random effect due to person $i$ (you have 16 people), $b_j$ is the random effect due to question $j$ (you have 9 questions in the group "utility"), and $e_{ij}$ is the error of how much person $i$'s response to question $j$ differed from the model.
Using the `lme4` package, you can estimate $\mu$ and test if it is significantly different from the center of the Likert scale. Using the `ordinal` package, you can do this more carefully, taking into account that your data is ordinal instead of ratio, but you lose some of the interpretability of the linear mixed effects model.

Those packages use a sort of funny notation. Suppose your data is in a dataframe called `dat` with columns `response`, `question`, and `person`. Then you can implement this as follows:
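Here is a minimal sketch of those two fits (it assumes the intercept-only model above and, for the intercept comparison, a 1-5 scale whose center is 3):

```r
library(lme4)
library(ordinal)

## Linear mixed model y_ij = mu + a_i + b_j + e_ij, with crossed random
## intercepts for person and question.
fit_lmm <- lmer(response ~ 1 + (1 | person) + (1 | question), data = dat)
summary(fit_lmm)
## Compare the interval for "(Intercept)" against the scale's center
## (3 is an assumed center for a 1-5 scale).
confint(fit_lmm)

## Cumulative link mixed model: treats the response as ordinal.
## clmm() needs the response coded as an ordered factor.
dat$response <- ordered(dat$response)
fit_clmm <- clmm(response ~ 1 + (1 | person) + (1 | question), data = dat)
summary(fit_clmm)
```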