Suppose the population, from which we assume you are sampling randomly, contains proportions $p_1$ of promoters, $p_0$ of passives, and $p_{-1}$ of detractors, with $p_1+p_0+p_{-1}=1$. To model the NPS, imagine filling a large hat with a huge number of tickets (one for each member of your population) labeled $+1$ for promoters, $0$ for passives, and $-1$ for detractors, in the given proportions, and then drawing $n$ of them at random. The sample NPS is the average value on the tickets that were drawn. The true NPS is computed as the average value of all the tickets in the hat: it is the expected value (or expectation) of the hat.
A good estimator of the true NPS is the sample NPS. The sample NPS also has an expectation. It can be considered to be the average of all the possible sample NPS's. This expectation happens to equal the true NPS. The standard error of the sample NPS is a measure of how much the sample NPS's typically vary between one random sample and another. Fortunately, we do not have to compute all possible samples to find the SE: it can be found more simply by computing the standard deviation of the tickets in the hat and dividing by $\sqrt{n}$. (A small adjustment can be made when the sample is an appreciable proportion of the population, but that's not likely to be needed here.)
For example, consider a population of $p_1=1/2$ promoters, $p_0=1/3$ passives, and $p_{-1}=1/6$ detractors. The true NPS is
$$\mbox{NPS} = 1\times 1/2 + 0\times 1/3 + (-1)\times 1/6 = 1/3.$$
The variance is therefore
$$\eqalign{
\mbox{Var(NPS)} &= (1-\mbox{NPS})^2\times p_1 + (0-\mbox{NPS})^2\times p_0 + (-1-\mbox{NPS})^2\times p_{-1}\\
&=(1-1/3)^2\times 1/2 + (0-1/3)^2\times 1/3 + (-1-1/3)^2\times 1/6 \\
&= 5/9.
}$$
The standard deviation is the square root of this, about equal to $0.75.$
In a sample of, say, $324$, you would therefore expect to observe an NPS around $1/3 \approx 33\%$ with a standard error of $0.75/\sqrt{324} \approx 4.1\%$.
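To make the arithmetic concrete, here is a minimal Python sketch of these calculations for the example population (the variable names are my own choices, not part of any standard API):

```python
# Compute the true NPS, the SD of the "tickets in the hat," and the standard
# error of the sample NPS for the worked example (population known exactly).
import math

p = {1: 1/2, 0: 1/3, -1: 1/6}   # proportions of promoters, passives, detractors
n = 324                          # sample size

nps = sum(value * prob for value, prob in p.items())               # expectation: 1/3
var = sum(prob * (value - nps) ** 2 for value, prob in p.items())  # variance: 5/9
se = math.sqrt(var) / math.sqrt(n)                                 # ~0.0414, i.e. ~4.1%

print(f"NPS = {nps:.4f}, SD = {math.sqrt(var):.4f}, SE = {se:.4f}")
```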
You don't, in fact, know the standard deviation of the tickets in the hat, so you estimate it by using the standard deviation of your sample instead. When divided by the square root of the sample size, it estimates the standard error of the NPS: this estimate is the margin of error (MoE).
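As an illustration of that substitution, here is a hedged simulation sketch: draw one random sample from the example population and form the MoE from the sample standard deviation (the seed and names are arbitrary choices of mine):

```python
# Estimate the MoE from a single random sample when the population SD is unknown.
import numpy as np

rng = np.random.default_rng(0)
n = 324
tickets = rng.choice([1, 0, -1], size=n, p=[1/2, 1/3, 1/6])  # one random sample

sample_nps = tickets.mean()
moe = tickets.std(ddof=1) / np.sqrt(n)   # sample SD / sqrt(n): the margin of error

# The sample NPS should land near 1/3 and the MoE near the true SE of ~0.041.
print(f"sample NPS = {sample_nps:.3f}, MoE = {moe:.4f}")
```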
Provided you observe substantial numbers of each type of customer (typically, about 5 or more of each will do), the distribution of the sample NPS will be close to Normal. This implies you can interpret the MoE in the usual ways. In particular, about 2/3 of the time the sample NPS will lie within one MoE of the true NPS and about 19/20 of the time (95%) the sample NPS will lie within two MoEs of the true NPS. In the example, if the margin of error really were 4.1%, we would have 95% confidence that the survey result (the sample NPS) is within 8.2% of the population NPS.
Each survey will have its own margin of error. To compare two such results you need to account for the possibility of error in each. When survey sizes are about the same, the standard error of their difference can be found by a Pythagorean theorem: take the square root of the sum of their squares. For instance, if one year the MoE is 4.1% and another year the MoE is 3.5%, then roughly figure a margin of error around $\sqrt{3.5^2+4.1^2} \approx 5.4\%$ for the difference in those two results. In this case, you can conclude with 95% confidence that the population NPS changed from one survey to the next provided the difference in the two survey results is 10.8% or greater.
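The Pythagorean combination is a one-liner; here is a small arithmetic check in Python using the numbers from the example (nothing beyond the formula above):

```python
# Combine two independent margins of error to get the MoE of the difference.
import math

moe_1, moe_2 = 4.1, 3.5   # MoEs (in percent) of two independent surveys
moe_diff = math.sqrt(moe_1**2 + moe_2**2)   # ~5.4%

# Twice this MoE is the rough 95% threshold for a "real" change.
print(f"MoE of the difference: {moe_diff:.1f}%; 95% threshold: {2*moe_diff:.1f}%")
```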
When comparing many survey results over time, more sophisticated methods can help, because you have to cope with many separate margins of error. When the margins of error are all pretty similar, a crude rule of thumb is to consider a change of three or more MoEs as "significant." In this example, if the MoEs hover around 4%, then a change of around 12% or larger over a period of several surveys ought to get your attention and smaller changes could validly be dismissed as survey error. Regardless, the analysis and rules of thumb provided here usually provide a good start when thinking about what the differences among the surveys might mean.
Note that you cannot compute the margin of error from the observed NPS alone: it depends on the observed numbers of each of the three types of respondents. For example, if almost everybody is a "passive," the survey NPS will be near $0$ with a tiny margin of error. If the population is polarized equally between promoters and detractors, the survey NPS will still be near $0$ but will have the largest possible margin of error (equal to $1/\sqrt{n}$ in a sample of $n$ people).
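To illustrate that dependence on composition, here is a short Python sketch contrasting the two extremes just described (the helper `nps_and_moe` is my own, using the sample-variance estimate with the usual $n-1$ adjustment, so the polarized MoE comes out slightly above $1/\sqrt{n}$):

```python
# Two samples with the same NPS (zero) but very different margins of error.
import math

def nps_and_moe(n_detr, n_pass, n_prom):
    """Sample NPS and its estimated standard error from the three counts."""
    n = n_detr + n_pass + n_prom
    nps = (n_prom - n_detr) / n
    var = (n_detr*(-1 - nps)**2 + n_pass*(0 - nps)**2
           + n_prom*(1 - nps)**2) / (n - 1)
    return nps, math.sqrt(var / n)

print(nps_and_moe(1, 98, 1))    # nearly all passives: NPS 0, tiny MoE
print(nps_and_moe(50, 0, 50))   # fully polarized: NPS 0, MoE near 1/sqrt(100) = 0.1
```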
The problem is that, as you say, this is a very poorly designed experiment. You have no control group of sick people who didn't get medication; no group of sick people who got Type 1 but not Type 2; and no group who got Type 2 and not Type 1. I think that no amount of statistics will let you reliably test your second and third hypotheses. For example, if you find that their protein levels have changed after they get Type 2 treatment, you will have no way of deciding if the change comes from a delayed effect from Type 1, or just a general natural effect from time. So I won't offer any suggestions for testing those hypotheses as any result will be misleading.
You can test your first hypothesis if and only if you are confident that people do not get better without treatment. You could not conclude this from your experiment, so you would need to know it from other experience, e.g. clinical experience with this illness showing that people do not recover naturally. I have no idea whether this is realistic.
Assuming the condition in the above paragraph is correct, I would measure the difference in the sick people's protein levels at the end of the experiment (after they got both treatments) from their protein levels at the beginning (when they turned up sick but before getting any treatment).
First, look for evidence that the protein levels have increased over this period. This would be a one-sided t test based on the differences (hopefully improvements) measured above, comparing their mean to zero.
The second part of your hypothesis was that the improvement brings the sick people up to the level of the well people. Assume there is no controversy about the fact that the illness reduces protein levels in the first place (as this wasn't one of the hypotheses you wanted to check). In this case, compare the average protein level in the sick group at the end of the experiment with the average protein level in the well group. Again, this is a one-sided t test (assuming protein levels are normally distributed), but this time comparing the two group averages (as opposed to the previous paragraph, where the average improvement was compared to zero).
I don't think the set of measurements after treatment 1 but before treatment 2 can tell us anything.
You will find it easier to analyse this in R than in Matlab, I think - R has many more statistical functions built in and ready to go for the user. However, if my answer above is right, you only need to do t-tests, which are pretty straightforward. I would advocate some graphical data analysis as well - if only to check for plausibility, outliers, and distributions - which will certainly be easier in R.
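Although I suggested R above, the two tests are equally easy anywhere; here is a sketch in Python with scipy, on entirely made-up data (the protein levels, group sizes, and effect size are invented purely for illustration):

```python
# Hedged sketch of the two one-sided t tests described above, on fabricated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(50, 5, size=25)          # sick patients at intake (fabricated)
after = before + rng.normal(8, 4, size=25)   # same patients after both treatments
well = rng.normal(60, 5, size=30)            # healthy comparison group (fabricated)

# Test 1: did protein levels increase? One-sided paired test of the differences vs 0.
t1, p1 = stats.ttest_1samp(after - before, 0, alternative="greater")

# Test 2: do treated patients still fall below the well group? One-sided two-sample test.
t2, p2 = stats.ttest_ind(after, well, alternative="less")

print(f"improvement test: p = {p1:.4g}; comparison with well group: p = {p2:.4g}")
```

(The `alternative=` keyword requires a reasonably recent scipy.)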
Best Answer
Unless there is a huge imbalance resulting in almost no Promoters or no Detractors, a t-test should work fine.
Specifically, the NPS method reduces the data to a set of $-1,0,1$ values (representing "Detractors," "Passives," and "Promoters," respectively). In a given dataset $\mathcal{S}$ of $n$ values let the count of the value $x$ be $n_x.$ The NPS is the mean value,
$$NPS_{\,\mathcal{S}} = \frac{1}{n}\left((-1)\,n_{-1} + 0\cdot n_0 + 1\cdot n_1\right) = \frac{n_1 - n_{-1}}{n}$$
and its sample variance is an adjusted mean squared difference
$$s_\mathcal{S}^2 = \frac{1}{n-1}\left(n_{-1}(-1-NPS_\mathcal{S})^2 + n_0(0-NPS_\mathcal{S})^2 + n_1(1-NPS_\mathcal{S})^2\right).$$
As explained at https://stats.stackexchange.com/a/18609/919, the square of the standard error (there referred to as "margin of error") is the sample variance divided by the sample size,
$$\operatorname{se}_\mathcal{S}^2 = \frac{s^2_\mathcal{S}}{n}.$$
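These three formulas translate directly into code; here is a small Python sketch (the function name `nps_stats` is my own):

```python
# Sample NPS, sample variance, and squared standard error from the three counts.
import math

def nps_stats(n_detr, n_pass, n_prom):
    n = n_detr + n_pass + n_prom
    nps = (-n_detr + 0*n_pass + n_prom) / n              # mean of the -1/0/+1 values
    s2 = (n_detr*(-1 - nps)**2 + n_pass*(0 - nps)**2
          + n_prom*(1 - nps)**2) / (n - 1)               # sample variance
    return nps, s2, s2 / n                               # (NPS, s^2, se^2)

nps_a, s2_a, se2_a = nps_stats(2, 8, 10)   # group A of the worked example below
print(f"NPS_A = {nps_a}, s^2_A = {s2_a:.3f}")
```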
Given two such sets of data to compare, say $A$ and $B$, the difference in their NPSes is $NPS_A-NPS_B$ and the squared standard error of that difference is $\operatorname{se}_A^2 + \operatorname{se}_B^2.$ The Student $t$ statistic is the ratio of the difference to its standard error,
$$t = \frac{NPS_A - NPS_B}{\sqrt{\operatorname{se}_A^2 + \operatorname{se}_B^2}}.$$
Because we have assumed a situation in which the responses are not all identical, at least one of the sample variances is positive, so the denominator is nonzero and $t$ is well-defined. The only issue is how to interpret it.
When the size of $t$ is "large," we say the difference in NPS is "significant" and conclude there is some cause for this difference other than sampling error. The only issue concerns the determination of how large is "large." The Student t-test uses quantiles of a Student t distribution with $n_A-1 + n_B-1$ degrees of freedom to determine what is a "large" value of $t$ for any given level of statistical risk $\alpha$ you care to specify. This risk is the chance that two random samples from populations with equal NPSes will produce a "large" value of $t,$ thereby causing you incorrectly to conclude there's a difference in NPS.
The "critical value," or threshold value to determine what "large" means, is the $1-\alpha/2$ quantile of the appropriate Student $t$ distribution.
Let's work an example. Suppose group $A$ has $n_{-1}=2$ Detractors, $n_0=8$ Passives, and $n_1=10$ Promoters for a total of $n=20.$ Its NPS is $NPS_A = (-2 + 10)/20 = 0.4$ (the same as $40\%$ if you prefer to express values as percents) and its variance is $$s^2_A = (2(-1-0.4)^2 + 8(0-0.4)^2 + 10(1-0.4)^2)/19 = 0.463.$$
Similarly, let group $B$ have $5$ Detractors, $20$ Passives, and $5$ Promoters, for a total of $30.$ The balance of Detractors and Promoters shows $NPS_B$ is zero. Its variance is $s_B^2=0.345.$ Thus, the t statistic for comparing these groups is
$$t = \frac{0.4 - 0} {\sqrt{0.463/20 + 0.345/30}} = 2.15.$$
Its size is $|t|=2.15.$ To determine how large this is, we refer to the Student $t$ distribution with $20-1 + 30-1 = 48$ degrees of freedom. It assigns a chance of $3.7\%$ to a value this large. This is the "p-value" of the t-test. If your risk threshold is only, say, $\alpha=5\%,$ then because the p-value is less than the threshold you will conclude this is a significant difference. If your risk threshold is smaller, say $\alpha=1\%,$ then because the p-value is greater than the threshold you will not conclude the observed difference in the samples is significant evidence of a real difference in the population represented by those samples.
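As a check on this worked example, here is a short Python sketch reproducing the computation end to end (it assumes scipy's `t.sf` for the upper-tail probability):

```python
# Reproduce the worked example: t statistic, degrees of freedom, two-sided p-value.
import math
from scipy.stats import t as t_dist

nps_a, s2_a, n_a = 0.4, 0.463, 20   # group A summary from the example
nps_b, s2_b, n_b = 0.0, 0.345, 30   # group B summary from the example

t_stat = (nps_a - nps_b) / math.sqrt(s2_a/n_a + s2_b/n_b)   # ~2.15
df = (n_a - 1) + (n_b - 1)                                   # 48
p_value = 2 * t_dist.sf(abs(t_stat), df)                     # ~0.037, the 3.7% above

print(f"t = {t_stat:.2f}, df = {df}, p = {p_value:.3f}")
```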
Simulation studies indicate the use of the Student $t$ distribution works well when each group has at least 20 people. It also works when there are huge differences in NPS between the groups, where the conclusion to make is obvious. For smaller groups with similar NPSes or where there are extreme imbalances, you should mistrust the p-value. In such circumstances conduct a permutation test or collect more data.
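For completeness, here is a rough sketch of such a permutation test on the example data (the implementation details are my own, not a standard recipe): shuffle the pooled $-1/0/+1$ values between the two groups many times and see how often a difference in NPS at least as large as the observed one arises by chance.

```python
# Permutation test of the difference in NPS between the two example groups.
import numpy as np

rng = np.random.default_rng(2)
group_a = np.repeat([-1, 0, 1], [2, 8, 10])   # group A: 2 Detractors, 8 Passives, 10 Promoters
group_b = np.repeat([-1, 0, 1], [5, 20, 5])   # group B: 5 Detractors, 20 Passives, 5 Promoters

observed = group_a.mean() - group_b.mean()     # 0.4
pooled = np.concatenate([group_a, group_b])

count = 0
n_perm = 10_000
for _ in range(n_perm):
    rng.shuffle(pooled)                        # random relabeling of the two groups
    diff = pooled[:len(group_a)].mean() - pooled[len(group_a):].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_perm = count / n_perm
print(f"permutation p-value ~ {p_perm:.3f}")   # expect something near the t-test's 0.037
```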
For greater insight, pay attention to the variances: even when the groups have comparable NPSes and those do not differ "significantly," if one of the groups has a much larger variance you might want to take that polarization of your customers into consideration. For instance, a group of $20$ Passives and another group comprised of $10$ Detractors and $10$ Promoters will have identical NPSes of $0,$ whence a $t$ statistic of $0$ (which is never "significant" for any $\alpha$), yet there is a clear difference in how those groups are reacting to your product. This failure to account for the variance in evaluating customers is, IMHO, the chief drawback of using the NPS.