I think A and E aren't a good combination, because A says you should pick Mercy and E says you should pick Hope.
A and D have the virtue of advocating the same choice. But let's examine the line of reasoning in D in further detail, since that seems to be the source of the confusion. The probability of success follows the same ordering at both hospitals, with type A surgeries the most likely to succeed and type E the least likely. If we collapse over (i.e., ignore) the hospitals, the marginal probability of success for each surgery type is:
| Type | A   | B   | C   | D   | E   | All |
|------|-----|-----|-----|-----|-----|-----|
| Prob | .81 | .78 | .56 | .21 | .08 | .52 |
Because E is much less likely to be successful, it is reasonable to imagine that it is more difficult (although in the real world, other possibilities exist as well). We can extend that line of thinking to the other four types. Now let's look at what proportion of each hospital's total surgeries are of each type:
| Type  | A   | B   | C   | D   | E   |
|-------|-----|-----|-----|-----|-----|
| Mercy | .08 | .39 | .06 | .44 | .03 |
| Hope  | .09 | .54 | .23 | .09 | .05 |
What we notice here is that Hope does more of the easier surgeries A-C (especially B and C), and fewer of the harder surgeries like D. Type E is uncommon at both hospitals but, for what it's worth, Hope actually does a higher percentage of them. Nonetheless, the Simpson's Paradox effect here is mostly driven by types B-D (not by column E, as answer choice D suggested).
Simpson's Paradox occurs because the surgeries vary in difficulty (in general) and also because the N's differ. It is the differing base rates of the different types of surgeries that make this counterintuitive. What is happening would be easy to see if both hospitals did exactly the same number of each type of surgery. We can simulate that by taking each success probability and multiplying by 100, which adjusts for the differing frequencies:
| Type  | A  | B  | C  | D  | E | All |
|-------|----|----|----|----|---|-----|
| Mercy | 81 | 79 | 60 | 21 | 9 | 250 |
| Hope  | 80 | 76 | 51 | 14 | 4 | 225 |
Now, because both hospitals did 100 of each surgery (500 total), the answer is obvious: Mercy is the better hospital.
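To make the reversal concrete, here is a short Python sketch using only the numbers from the two tables above (the per-type success rates and the case mixes; the raw overall rates below are computed from them):

```python
# Numbers taken from the tables above.
types = "ABCDE"
mix = {      # proportion of each hospital's surgeries, by type
    "Mercy": {"A": .08, "B": .39, "C": .06, "D": .44, "E": .03},
    "Hope":  {"A": .09, "B": .54, "C": .23, "D": .09, "E": .05},
}
success = {  # per-type success probability at each hospital
    "Mercy": {"A": .81, "B": .79, "C": .60, "D": .21, "E": .09},
    "Hope":  {"A": .80, "B": .76, "C": .51, "D": .14, "E": .04},
}

def overall_rate(h):
    """Success rate weighted by the hospital's actual case mix."""
    return sum(mix[h][t] * success[h][t] for t in types)

def adjusted_rate(h):
    """Success rate if the hospital did equal numbers of each type."""
    return sum(success[h][t] for t in types) / len(types)

for h in ("Mercy", "Hope"):
    print(f"{h}: raw {overall_rate(h):.3f}, adjusted {adjusted_rate(h):.3f}")
```

The raw (case-mix-weighted) rate comes out higher for Hope even though Mercy's rate is higher for every individual surgery type, which is exactly the paradox; the equal-weights adjustment reverses the ranking.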
In your question, you state that you don't know what "causal Bayesian networks" and "back door tests" are.
Suppose you have a causal Bayesian network. That is, a directed acyclic graph whose nodes represent propositions and whose directed edges represent potential causal relationships. You may have many such networks for each of your hypotheses. There are three ways to make a compelling argument about the strength or existence of an edge $A \stackrel?\rightarrow B$.
The easiest way is an intervention. This is what the other answers are suggesting when they say that "proper randomization" will fix the problem. You randomly force $A$ to take different values and you measure $B$. If you can do that, you're done, but you can't always do that. In your example, it may be unethical to give people ineffective treatments for deadly diseases, or patients may have some say in their treatment, e.g., they may choose the less harsh treatment B when their kidney stones are small and less painful.
The second way is the front door method. You want to show that $A$ acts on $B$ via $C$, i.e., $A\rightarrow C \rightarrow B$. If you assume that $C$ is potentially caused by $A$ but has no other causes, and you can measure that $C$ is correlated with $A$, and $B$ is correlated with $C$, then you can conclude evidence must be flowing via $C$. The original example: $A$ is smoking, $B$ is cancer, $C$ is tar accumulation. Tar can only come from smoking, and it correlates with both smoking and cancer. Therefore, smoking causes cancer via tar (though there could be other causal paths that mitigate this effect).
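In symbols, this is Pearl's front-door adjustment formula (under the assumptions just stated: $C$ is caused only by $A$, and $A$ acts on $B$ only through $C$):

$$P(B \mid \mathrm{do}(A=a)) = \sum_{c} P(c \mid a) \sum_{a'} P(B \mid a', c)\, P(a')$$

The inner sum over $a'$ is what removes any confounding of the $C \rightarrow B$ leg by the back-door path through $A$.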
The third way is the back door method. You want to show that $A$ and $B$ aren't correlated because of a "back door", e.g., a common cause: $A \leftarrow D \rightarrow B$. Since you have assumed a causal model, you merely need to block all of the paths (by observing variables and conditioning on them) along which evidence can flow up from $A$ and back down to $B$. It can be tricky to block these paths, but Pearl gives a clear algorithm that tells you which variables you have to observe to block them.
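In symbols, if a set of observed variables $D$ blocks every back-door path from $A$ to $B$, the interventional distribution is identified by the back-door adjustment formula:

$$P(B \mid \mathrm{do}(A=a)) = \sum_{d} P(B \mid a, d)\, P(d)$$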
gung is right that with good randomization, confounders won't matter. Since we're assuming that intervening on the hypothetical cause (treatment) is not allowed, any common cause of the hypothetical cause (treatment) and the effect (survival), such as age or kidney stone size, will be a confounder. The solution is to take the right measurements to block all of the back doors. For further reading see:
Pearl, Judea. "Causal diagrams for empirical research." Biometrika 82.4 (1995): 669-688.
To apply this to your problem, let us first draw the causal graph. (Treatment-preceding) kidney stone size $X$ and treatment type $Y$ are both causes of success $Z$. $X$ may also be a cause of $Y$ if doctors assign treatment based on kidney stone size. Clearly there are no other causal relationships among $X$, $Y$, and $Z$: $Y$ comes after $X$, so it cannot be $X$'s cause, and similarly $Z$ comes after both $X$ and $Y$.
Since $X$ is a common cause, it should be measured. It is up to the experimenter to determine the universe of variables and potential causal relationships. For every experiment, the experimenter measures the necessary "back door variables" and then calculates the marginal probability distribution of treatment success for each configuration of variables. For a new patient, you measure the variables and follow the treatment indicated by the marginal distribution. If you can't measure everything or you don't have a lot of data but know something about the architecture of the relationships, you can do "belief propagation" (Bayesian inference) on the network.
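As a concrete sketch, here is the back-door adjustment computed in Python on the often-cited kidney stone counts (Charig et al., 1986); the exact counts are for illustration and are not from this thread. $X$ is stone size, $Y$ is treatment, $Z$ is success:

```python
# counts[(treatment, size)] = (successes, patients)
# Illustrative numbers: the classic kidney stone data (Charig et al., 1986).
counts = {
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}
n_total = sum(n for _, n in counts.values())  # 700 patients

def p_size(size):
    """Marginal P(X = size), pooled over treatments."""
    return sum(n for (_, s), (_, n) in counts.items() if s == size) / n_total

def naive_rate(treatment):
    """P(Z = success | Y = treatment): ignores the back door through X."""
    k = sum(k for (t, _), (k, _) in counts.items() if t == treatment)
    n = sum(n for (t, _), (_, n) in counts.items() if t == treatment)
    return k / n

def backdoor_rate(treatment):
    """Back-door adjustment: sum over x of P(Z | Y, X=x) * P(X=x)."""
    return sum((k / n) * p_size(s)
               for (t, s), (k, n) in counts.items() if t == treatment)
```

The naive comparison favors treatment B, but adjusting for stone size (the back-door variable) reverses the conclusion in favor of A: the same Simpson's paradox reversal, resolved by conditioning on the confounder.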
Simpson's paradox is an extreme form of confounding in which the apparent sign of the correlation is reversed; you haven't said whether that is the situation here.
I can see at least three possibilities here: heterogeneity between the subgroups, the reduction in sample size within each subgroup, and poor definition of the subgroups that presupposes the results. Ignoring the third, both of the first two can have an impact: in my experience it is often the small sample size that leads to non-significance in the smaller subgroup, and heterogeneity that causes the whole group to produce a significant result while the large subgroup does not.
That was an over-generalisation - each case will have its own issues.