In general, if you have any suspicion that your errors are heteroskedastic, you should use robust standard errors. The fact that your estimates become non-significant when you don't use robust SEs suggests (but does not prove) the need for robust SEs! These SEs are "robust" to the bias that heteroskedasticity can cause in the model-based standard errors of a generalized linear model.
This situation is a little different, though, in that you're layering them on top of Poisson regression.
Poisson regression has the well-known property that it forces the variance to equal the mean, whether or not the data support that. Before considering robust standard errors, I would try a negative binomial regression, which does not suffer from this problem. There is a test (see the comments) to help determine whether the resulting change in standard errors is significant.
I do not know for sure whether the change you're seeing (moving to robust SEs narrows the CI) implies under-dispersion, but it seems likely. Note that negative binomial regression only accommodates over-dispersion; quasi-Poisson can accommodate under-dispersion as well. Take a look at the appropriate model and see what you get in that setting.
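If it helps to see the mechanics, here is a minimal sketch (in Python with statsmodels, an assumed toolchain since none is given) comparing model-based Poisson SEs, sandwich ("robust") SEs, and a negative binomial fit on deliberately over-dispersed counts:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)
mu = np.exp(0.5 + 0.3 * x)
# Over-dispersed counts: negative binomial draws with mean mu, variance > mu
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()
poisson_robust = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")
negbin = sm.NegativeBinomial(y, X).fit(disp=False)

print(poisson.bse)         # model-based SEs: too small under over-dispersion
print(poisson_robust.bse)  # sandwich SEs: typically wider on these data
print(negbin.bse[:2])      # NB models the extra dispersion directly
```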
The observation that, for data drawn from a contaminated Gaussian distribution, you get better estimates of the parameters describing the bulk of the data by using the $\text{mad}$ rather than the sample standard deviation was originally made by Gauss (Walker, 1931). Here
$$\text{mad}(x)=1.4826\times\text{med}|x-\text{med}(x)|,$$
where $1.4826=(\Phi^{-1}(0.75))^{-1}$ is a consistency factor designed to ensure that
$$\text{E}(\text{mad}(x)^2)=\text{Var}(x)$$
when $x$ is uncontaminated.
I cannot think of any reason not to use the $\text{med}$ instead of the sample mean in this case. The lower efficiency (at the Gaussian!) of the $\text{mad}$ can be a reason not to use the $\text{mad}$ in your example. However, there exist equally robust and more efficient alternatives to the $\text{mad}$. One of them is the $Q_n$, which has many other advantages besides. It is very insensitive to outliers (in fact nearly as insensitive as the $\text{mad}$), but, contrary to the $\text{mad}$, it is not built around an estimate of location and does not assume that the distribution of the uncontaminated part of the data is symmetric. Like the $\text{mad}$, it is based on order statistics, so it is always well defined even when the underlying distribution of your sample has no moments, and it has a simple explicit form. Even more than for the $\text{mad}$, I see no reason to use the sample standard deviation instead of the $Q_n$ in the example you describe (see Rousseeuw and Croux, 1993 for more on the $Q_n$).
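To see these estimators side by side, here is a small Python sketch (assuming numpy and scipy; the `qn_scale` helper below is a bare-bones implementation of the $Q_n$ definition from Rousseeuw and Croux (1993), using only the asymptotic Gaussian consistency factor and ignoring their finite-sample corrections):

```python
import numpy as np
from scipy.stats import median_abs_deviation

def qn_scale(x):
    # Q_n (Rousseeuw & Croux, 1993): the k-th order statistic of the
    # pairwise distances |x_i - x_j|, i < j, with k = C(h, 2) and
    # h = n//2 + 1, times the Gaussian consistency factor 2.2219.
    x = np.sort(np.asarray(x))
    n = len(x)
    diffs = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * np.partition(diffs, k - 1)[k - 1]

rng = np.random.default_rng(0)
n = 1000
# 90% N(0,1) bulk, 10% N(0,10) contamination
x = np.where(rng.random(n) < 0.9, rng.normal(0, 1, n), rng.normal(0, 10, n))

print(np.std(x))                                # inflated by the outliers
print(median_abs_deviation(x, scale="normal"))  # ~1: tracks the bulk
print(qn_scale(x))                              # ~1: no location estimate needed
```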
As for your last question, in the specific case where $x\sim\Gamma(\nu,\lambda)$ (with shape $\nu$ and scale $\lambda$),
$$\text{med}(x)\approx\lambda(\nu-1/3)$$
and
$$\text{mad}(x)\approx\lambda\sqrt{\nu}$$
(in both cases the approximations become good when $\nu>1.5$) so that
$$\hat{\nu}=\left(\frac{\text{med}(x)}{\text{mad}(x)}\right)^2$$
and
$$\hat{\lambda}=\frac{\text{mad}(x)^2}{\text{med}(x)}$$
See Chen and Rubin (1986) for a complete derivation.
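As a quick numerical check of these formulas (a Python sketch, assuming numpy and scipy):

```python
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(1)
nu, lam = 10.0, 2.0  # true shape and scale of the Gamma
x = rng.gamma(shape=nu, scale=lam, size=100_000)

med = np.median(x)
mad = median_abs_deviation(x, scale="normal")  # 1.4826 * med|x - med(x)|

nu_hat = (med / mad) ** 2   # approx. nu - 2/3, from med ~ lam*(nu - 1/3)
lam_hat = mad ** 2 / med    # approx. lam
print(nu_hat, lam_hat)
```

Note that plugging $\text{med}\approx\lambda(\nu-1/3)$ and $\text{mad}\approx\lambda\sqrt{\nu}$ into $\hat{\nu}$ gives $\nu-2/3+1/(9\nu)$, so $\hat{\nu}$ carries a small downward bias even where the approximations hold.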
- Chen, J. and Rubin, H. (1986). Bounds for the difference between median and mean of Gamma and Poisson distributions. Statistics & Probability Letters, 4, 281–283.
- Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association, 88(424), 1273–1283.
- Walker, H. (1931). Studies in the History of the Statistical Method. Baltimore, MD: Williams & Wilkins Co., pp. 24–25.
Because assuming normal errors is effectively the same as assuming that large errors do not occur! The normal distribution has such light tails that errors outside $\pm 3$ standard deviations have very low probability, and errors outside $\pm 6$ standard deviations are effectively impossible. In practice, that assumption is seldom true. When analyzing small, tidy datasets from well-designed experiments, this might not matter much if we do a good analysis of residuals. With data of lesser quality, it can matter much more.
When using likelihood-based (or Bayesian) methods, the effect of this normality assumption (as said above, effectively a "no large errors" assumption!) is to make the inference non-robust. The results of the analysis are too heavily influenced by the large errors! This must be so, since assuming "no large errors" forces our methods to interpret the large errors as small errors, and that can only happen by moving the mean-value parameter to make all the errors smaller. One way to avoid that is to use so-called "robust methods"; see http://web.archive.org/web/20160611192739/http://www.stats.ox.ac.uk/pub/StatMeth/Robust.pdf
But Andrew Gelman will not go for this, since robust methods are usually presented in a highly non-Bayesian way. Using $t$-distributed errors in likelihood/Bayesian models is a different way to obtain robust methods, as the $t$-distribution has heavier tails than the normal and so allows for a larger proportion of large errors. The degrees-of-freedom parameter $\nu$ should be fixed in advance, not estimated from the data, since such estimation will destroy the robustness properties of the method (*) (it is also a very difficult problem: the likelihood function for $\nu$ can be unbounded, leading to very inefficient, even inconsistent, estimators).
If, for instance, you think (or fear) that as many as one in ten observations might be "large errors" (beyond 3 sd), then you could use a $t$-distribution with 2 degrees of freedom (for $t_2$, the probability of an error beyond 3 scale units is about $0.095$, roughly one in ten), increasing that number if the believed proportion of large errors is smaller.
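To make the "fix $\nu$ in advance" recipe concrete, here is a minimal sketch of maximum likelihood for a linear model with independent $t_2$ errors (in Python with numpy/scipy; the setup and variable names are illustrative, not from any particular robust-regression library):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# ~10% gross errors: mostly N(0,1) noise, contaminated with N(0,10)
noise = np.where(rng.random(n) < 0.9, rng.normal(0, 1, n), rng.normal(0, 10, n))
y = 1.0 + 2.0 * x + noise

NU = 2.0  # degrees of freedom fixed in advance, not estimated

def negloglik(theta):
    beta, log_scale = theta[:2], theta[2]
    resid = y - X @ beta
    return -stats.t.logpdf(resid, df=NU, scale=np.exp(log_scale)).sum()

res = optimize.minimize(negloglik, x0=np.zeros(3), method="BFGS")
print(res.x[:2])  # close to (1, 2) despite the gross errors
```

An ordinary least-squares fit on the same data would be far noisier, since the squared-error loss lets the gross errors dominate.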
I should note that what I have said above is for models with independent $t$-distributed errors. There have also been proposals to use a multivariate $t$-distribution (whose components are not independent) as the error distribution. That proposal is heavily criticized in the paper "The emperor's new clothes: a critique of the multivariate $t$ regression model" by T. S. Breusch, J. C. Robertson and A. H. Welsh, Statistica Neerlandica (1997), Vol. 51, No. 3, pp. 269-286, where they show that the multivariate $t$ error distribution is empirically indistinguishable from the normal. But that criticism does not affect the independent $t$ model.
(*) One reference stating this is Venables & Ripley's Modern Applied Statistics with S (4th edition, p. 110).