I used a t-test for my Likert-scale data with 115 observations, based on an answer to a similar question (see gung, 12 June 2016). Can somebody help me find a credible source to justify my decision? I have spent many hours on this with no result.
Solved – Using t test for discrete data
discrete-data, hypothesis-testing, t-test
Related Solutions
Unless there is a huge imbalance resulting in almost no Promoters or no Detractors, a t-test should work fine.
Specifically, the NPS method reduces the data to a set of $-1,0,1$ values (representing "Detractors," "Passives," and "Promoters," respectively). In a given dataset $\mathcal{S}$ of $n$ values let the count of the value $x$ be $n_x.$ The NPS is the mean value,
$$NPS_{\,\mathcal{S}} = \frac{1}{n}\left(-n_{-1} + 0\cdot n_0 + n_1\right) = \frac{n_1 - n_{-1}}{n}$$
and its sample variance is an adjusted mean squared difference
$$s_\mathcal{S}^2 = \frac{1}{n-1}\left(n_{-1}(-1-NPS_\mathcal{S})^2 + n_0(0-NPS_\mathcal{S})^2 + n_1(1-NPS_\mathcal{S})^2\right).$$
As explained at https://stats.stackexchange.com/a/18609/919, the square of the standard error (there referred to as "margin of error") is the sample variance divided by the sample size,
$$\operatorname{se}_\mathcal{S}^2 = \frac{s^2_\mathcal{S}}{n}.$$
Given two such sets of data to compare, say $A$ and $B$, the difference in their NPSes is $NPS_A-NPS_B$ and the squared standard error of that difference is $\operatorname{se}_A^2 + \operatorname{se}_B^2.$ The Student $t$ statistic is the ratio of the difference to its standard error,
$$t = \frac{NPS_A - NPS_B}{\sqrt{\operatorname{se}_A^2 + \operatorname{se}_B^2}}.$$
Because we have assumed a situation where there are some Promoters or some Detractors, the denominator is nonzero, so $t$ is well-defined. The only remaining issue is how to interpret it.
When the size of $t$ is "large," we say the difference in NPS is "significant" and conclude there is some cause for this difference other than sampling error. The only issue is determining how large is "large." The Student t-test uses quantiles of a Student $t$ distribution with $n_A - 1 + n_B - 1 = n_A + n_B - 2$ degrees of freedom to determine what is a "large" value of $t$ for any given level of statistical risk $\alpha$ you care to specify. This risk is the chance that two random samples from populations with equal NPSes will produce a "large" value of $t,$ thereby leading you to conclude, incorrectly, that there is a difference in NPS.
The "critical value," or threshold value to determine what "large" means, is the $1-\alpha/2$ quantile of the appropriate Student $t$ distribution.
Let's work an example. Suppose group $A$ has $n_{-1}=2$ Detractors, $n_0=8$ Passives, and $n_1=10$ Promoters for a total of $n=20.$ Its NPS is $NPS_A = (-2 + 10)/20 = 0.4$ (the same as $40\%$ if you prefer to express values as percents) and its variance is $$s^2_A = (2(-1-0.4)^2 + 8(0-0.4)^2 + 10(1-0.4)^2)/19 = 0.463.$$
Similarly, let group $B$ have $5$ Detractors, $20$ Passives, and $5$ Promoters, for a total of $30.$ The balance of Detractors and Promoters shows $NPS_B$ is zero. Its variance is $s_B^2=0.345.$ Thus, the t statistic for comparing these groups is
$$t = \frac{0.4 - 0} {\sqrt{0.463/20 + 0.345/30}} = 2.15.$$
Its size is $|t|=2.15.$ To determine how large this is, we refer to the Student $t$ distribution with $20-1 + 30-1 = 48$ degrees of freedom. It assigns a chance of $3.7\%$ to a value this large. This is the "p-value" of the t-test. If your risk threshold is only, say, $\alpha=5\%,$ then because the p-value is less than the threshold you will conclude this is a significant difference. If your risk threshold is smaller, say $\alpha=1\%,$ then because the p-value is greater than the threshold you will not conclude the observed difference in the samples is significant evidence of a real difference in the population represented by those samples.
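The formulas above translate directly into code. Here is a minimal sketch, using `scipy` for the Student $t$ distribution, that reproduces the worked example (the function name `nps_t_test` is my own, not standard):

```python
# Two-sample t-test for a difference in NPS, computed from the counts of
# Detractors, Passives, and Promoters, as in the formulas above.
from math import sqrt
from scipy import stats

def nps_t_test(det_a, pas_a, pro_a, det_b, pas_b, pro_b):
    """Return (t, p) for the difference in NPS between groups A and B."""
    def summarize(d, p0, p1):
        n = d + p0 + p1
        nps = (p1 - d) / n                       # mean of the -1/0/+1 values
        var = (d * (-1 - nps) ** 2
               + p0 * (0 - nps) ** 2
               + p1 * (1 - nps) ** 2) / (n - 1)  # sample variance
        return n, nps, var

    na, nps_a, var_a = summarize(det_a, pas_a, pro_a)
    nb, nps_b, var_b = summarize(det_b, pas_b, pro_b)
    se = sqrt(var_a / na + var_b / nb)           # se of the difference
    t = (nps_a - nps_b) / se
    df = na - 1 + nb - 1
    p = 2 * stats.t.sf(abs(t), df)               # two-sided p-value
    return t, p

# Group A: 2 Detractors, 8 Passives, 10 Promoters;
# Group B: 5 Detractors, 20 Passives, 5 Promoters.
t, p = nps_t_test(2, 8, 10, 5, 20, 5)
print(round(t, 2), round(p, 3))  # t ≈ 2.15, p ≈ 0.037
```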
Simulation studies indicate the use of the Student $t$ distribution works well when each group has at least 20 people. It also works when there are huge differences in NPS between the groups, where the conclusion to make is obvious. For smaller groups with similar NPSes or where there are extreme imbalances, you should mistrust the p-value. In such circumstances conduct a permutation test or collect more data.
For greater insight, pay attention to the variances: even when the groups have comparable NPSes and those do not differ "significantly," if one of the groups has a much larger variance you might want to take that polarization of your customers into consideration. For instance, a group of $20$ Passives and another group comprised of $10$ Detractors and $10$ Promoters will have identical NPSes of $0,$ whence a $t$ statistic of $0$ (which is never "significant" for any $\alpha$), yet there is a clear difference in how those groups are reacting to your product. This failure to account for the variance in evaluating customers is, IMHO, the chief drawback of using the NPS.
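The polarization example can be put in numbers. Both groups below have $NPS = 0$ (so $t = 0$), yet their variances differ sharply:

```python
# NPS and sample variance for two groups with identical NPS but very
# different polarization: all Passives vs. an even Detractor/Promoter split.
def nps_and_variance(scores):
    n = len(scores)
    nps = sum(scores) / n
    var = sum((x - nps) ** 2 for x in scores) / (n - 1)
    return nps, var

all_passives = [0] * 20
polarized = [-1] * 10 + [1] * 10

print(nps_and_variance(all_passives))  # (0.0, 0.0)
print(nps_and_variance(polarized))     # (0.0, 20/19 ≈ 1.05)
```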
I think there are several challenges to consider.
In terms of how to visualize, the most accurate choice would be a mosaic plot or a stacked barplot (which are practically the same in this case, but it may be easier to produce a stacked barplot in Excel or SPSS than a mosaic plot).
It might also be helpful to convert the Likert scale to a numerical (1-5) scale and draw a boxplot for each of the four categories of your second question. Since boxplots are based on percentiles, the meaning of the boxplot can be reasonably consistent with the type of data you present (depending on how the quantiles are calculated when dealing with midpoints).
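If you have Python available instead of Excel or SPSS, the stacked barplot can be sketched with matplotlib. The category names and counts below are placeholders, not data from the question:

```python
# Stacked barplot of Likert responses (rows 1..5) across four categories,
# with each bar normalized to show the share of responses.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

categories = ["Cat 1", "Cat 2", "Cat 3", "Cat 4"]
counts = np.array([          # rows: Likert responses 1..5
    [4, 2, 6, 1],
    [8, 5, 7, 3],
    [12, 10, 9, 6],
    [9, 14, 5, 12],
    [2, 9, 3, 8],
])
shares = counts / counts.sum(axis=0)  # normalize each bar to 100%

bottom = np.zeros(len(categories))
for i, row in enumerate(shares, start=1):
    plt.bar(categories, row, bottom=bottom, label=f"Likert {i}")
    bottom += row                     # stack the next response on top
plt.ylabel("Share of responses")
plt.legend()
plt.savefig("stacked_likert.png")
```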
In terms of how to analyse, there are different questions you can ask. The simplest is "is there a correlation between the two?", which can easily be answered by computing the Pearson correlation on the ranks of the numerical values of your scales. This actually gives you the Spearman correlation measure (the correlation of the ranks). The ranking is important for cases where you have ties (for example, the vector $1,2,2,4$ becomes the ranks $1, 2.5, 2.5, 4$).
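The rank-then-Pearson computation can be sketched with `scipy`, which also shows the tie-averaged ranks; the 1-5 responses below are made up:

```python
# Pearson correlation on tie-averaged ranks equals the Spearman correlation.
from scipy import stats

x = [1, 2, 2, 4, 5, 3, 4, 2]
y = [2, 1, 3, 4, 4, 3, 5, 2]

# Ties get the average of the ranks they span,
# e.g. [1, 2, 2, 4] -> [1, 2.5, 2.5, 4].
rx = stats.rankdata(x)
ry = stats.rankdata(y)

pearson_on_ranks = stats.pearsonr(rx, ry)[0]
spearman = stats.spearmanr(x, y)[0]
print(abs(pearson_on_ranks - spearman) < 1e-9)  # True
```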
The Wilcoxon test is relevant if you want to ask whether the ranks of one measure differ from those of the other measure. But from your question, that doesn't sound like an interesting question here. You can also use the chi-square test for a similar question, but its power will probably be lower.
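Both tests are a few lines in `scipy`; the paired 1-5 responses below are invented for illustration:

```python
# Wilcoxon signed-rank test on paired Likert responses, and a chi-square
# test of independence on their cross-tabulation.
import numpy as np
from scipy import stats

q1 = np.array([1, 2, 2, 4, 5, 3, 4, 2, 3, 5, 1, 4])
q2 = np.array([2, 1, 3, 4, 4, 3, 5, 2, 4, 5, 2, 3])

# Do the paired responses differ systematically in rank?
w_stat, w_p = stats.wilcoxon(q1, q2)

# Chi-square of independence on the 5x5 contingency table of the two items.
table = np.zeros((5, 5))
for a, b in zip(q1, q2):
    table[a - 1, b - 1] += 1
chi2, chi_p, dof, _ = stats.chi2_contingency(table)
print(w_p, chi_p)
```

With as few respondents as this sketch uses, the chi-square approximation is poor (many expected cell counts fall below 5), which illustrates the power concern mentioned above.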
Best Answer
I'll add some comments from a non-statistician perspective.
First idea: can you redo your analysis with something more appropriate for your data, and therefore more satisfying for the reviewer? Would that be easier than mounting a defense of the t-test?
Second: it is usually important to be clear about whether you are talking about a Likert item (e.g. a single question on, say, a 1-to-5 scale) or a Likert scale (e.g. the sum of a whole bunch of Likert items, which has a wide range of values). The former is less likely to meet the t-test's assumptions about the distribution of the data, and you are usually better off treating it as ordinal data.
Perhaps the best defense is to explain that your data met the assumptions of the test you used (to a reasonable degree): that you looked at the distribution of each group and that it was reasonably normal or t-like; that the groups showed homogeneity of variance (if that's an assumption of the t-test you used); and any other assumptions that should be made explicit.
Another line of defense would be to find good published examples in your field. That doesn't necessarily mean the approach is correct, but it might hold some weight with the reviewer.
Finally, it might help to be explicit about assumptions that make your data interval and continuous in nature. Specifically, that you are assuming that the numbers in the Likert items are evenly spaced, and that they represent an underlying continuous distribution.