I fully agree that Pearl's tone is arrogant, and his characterisation of "statisticians" is simplistic and monolithic. Also, I don't find his writing particularly clear.
However, I think he has a point.
Causal reasoning was not part of my formal training (MSc): the closest I got to the topic was an elective course in experimental design, where any causal claim required me to physically control the environment. Pearl's book Causality was my first exposure to a refutation of this idea. Obviously I can't speak for all statisticians and curricula, but from my own perspective I subscribe to Pearl's observation that causal reasoning is not a priority in statistics.
"It is true that statisticians sometimes control for more variables than is strictly necessary, but this rarely leads to error (at least in my experience)."
This is also a belief that I held after graduating with an MSc in statistics in 2010.
However, it is deeply incorrect. When you control for a common effect (called "collider" in the book), you can introduce selection bias. This realization was quite astonishing to me, and really convinced me of the usefulness of representing my causal hypotheses as graphs.
EDIT: I was asked to elaborate on selection bias. This topic is quite subtle, so I highly recommend perusing the edX MOOC on Causal Diagrams, a very nice introduction to graphs which has a chapter dedicated to selection bias.
For a toy example, to paraphrase this paper cited in the book: Consider the variables A=attractiveness, B=beauty, C=competence. Suppose that B and C are causally unrelated in the general population (i.e., beauty does not cause competence, competence does not cause beauty, and beauty and competence do not share a common cause). Suppose also that any one of B or C is sufficient for being attractive, i.e. A is a collider. Conditioning on A creates a spurious association between B and C.
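The toy example can be checked with a small simulation. This is only a sketch under assumptions that are mine, not the paper's: B and C are independent coin flips with probability 0.5 each, and A is simply "B or C". The function names are made up for illustration.

```python
import random

def simulate(n=100_000, p_b=0.5, p_c=0.5, seed=0):
    """Draw (A, B, C) triples; B and C are independent, A = B or C."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        b = rng.random() < p_b   # beauty (assumed probability)
        c = rng.random() < p_c   # competence, drawn independently of b
        a = b or c               # either one is sufficient for attractiveness
        rows.append((a, b, c))
    return rows

def p_c_given_b(rows, b_value):
    """Estimate P(C = 1 | B = b_value) within the given rows."""
    matching = [c for (_, b, c) in rows if b == b_value]
    return sum(matching) / len(matching)

rows = simulate()
attractive = [r for r in rows if r[0]]  # condition on the collider A

print("everyone:   P(C|B=1) = %.3f, P(C|B=0) = %.3f"
      % (p_c_given_b(rows, True), p_c_given_b(rows, False)))
print("attractive: P(C|B=1) = %.3f, P(C|B=0) = %.3f"
      % (p_c_given_b(attractive, True), p_c_given_b(attractive, False)))
```

In the full sample the two conditional probabilities are both close to 0.5, so B tells you nothing about C. Among attractive people, everyone who is not beautiful must be competent by construction, so B and C become negatively associated even though neither causes the other.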
A more serious example is the "birth weight paradox", according to which a mother's smoking (S) during pregnancy seems to decrease the mortality (M) of the baby, if the baby is underweight (U). The proposed explanation is that birth defects (D) also cause low birth weight, and also contribute to mortality. The corresponding causal diagram is { S -> U, D -> U, U -> M, S -> M, D -> M } in which U is a collider; conditioning on it introduces the spurious association. The intuition behind this is that if the mother is a smoker, the low birth weight is less likely to be due to a defect.
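The same mechanism can be reproduced by exact enumeration over the diagram { S -> U, D -> U, U -> M, S -> M, D -> M }. Every probability below is invented purely for illustration (none comes from the birth weight literature), and smoking is deliberately made harmful in every stratum of D and U.

```python
from itertools import product

P_D = 0.05  # assumed prevalence of birth defects, independent of smoking

def p_u(s, d):
    """Assumed P(U = 1 | S = s, D = d): both smoking and defects lower birth weight."""
    return {(0, 0): 0.05, (1, 0): 0.30, (0, 1): 0.60, (1, 1): 0.80}[(s, d)]

def p_m(s, d, u):
    """Assumed P(M = 1 | S, D, U): smoking always adds 0.01 to mortality."""
    return 0.01 + 0.01 * s + 0.30 * d + 0.02 * u

def mortality(s, u=None):
    """P(M = 1 | S = s), or P(M = 1 | S = s, U = u) when u is given."""
    num = den = 0.0
    for d, uu in product((0, 1), repeat=2):
        if u is not None and uu != u:
            continue  # keep only the conditioned-on birth weight stratum
        weight = (P_D if d else 1 - P_D) * (p_u(s, d) if uu else 1 - p_u(s, d))
        num += weight * p_m(s, d, uu)
        den += weight
    return num / den

print("all babies:  smokers %.4f, non-smokers %.4f" % (mortality(1), mortality(0)))
print("underweight: smokers %.4f, non-smokers %.4f"
      % (mortality(1, u=1), mortality(0, u=1)))
```

With these made-up numbers, smoking raises mortality overall, yet among underweight babies the smokers' babies show lower mortality. Conditioning on U makes a defect a more likely explanation for low weight when the mother does not smoke, and defects carry most of the mortality risk here.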
Best Answer
This is a broad question, but given that the Box, Hunter and Hunter quote is true, I think it comes down to:
1. The quality of the experimental design
2. The quality of the implementation of the design
3. The quality of the model to accurately reflect the design
At the risk of stating the obvious I'll try to hit on the key points of each:
Experimental design is a large sub-field of statistics, but in its most basic form I think it comes down to the fact that when making causal inference we ideally start with identical units that are monitored in identical environments, differing only in the treatment they are assigned. Any systematic differences between groups after assignment are then logically attributable to the treatment (we can infer cause). But the world isn't that nice: units differ prior to treatment, and environments during experiments are not perfectly controlled. So we "control what we can and randomize what we can't", which helps to ensure that there won't be systematic bias due to the confounders that we controlled or randomized. One problem is that experiments tend to be difficult (sometimes impossible) and expensive, so a large variety of designs have been developed to extract as much information as possible, in as carefully controlled a setting as possible, given the costs. Some of these are quite rigorous (e.g. in medicine the double-blind, randomized, placebo-controlled trial) and others less so (e.g. various forms of 'quasi-experiments').
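As a rough illustration of why randomization protects against systematic bias, here is a small simulation sketch. The setup is entirely hypothetical: a true treatment effect of 2, one pre-treatment "baseline" difference between units that also drives the outcome, and a self-selection rule in which units with a higher baseline are more likely to end up treated.

```python
import random

TRUE_EFFECT = 2.0  # hypothetical treatment effect

def diff_in_means(randomize, n=20_000, seed=1):
    """Treated-minus-control difference in mean outcome for one simulated study."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        baseline = rng.gauss(0, 1)  # units differ before treatment
        if randomize:
            t = rng.random() < 0.5  # coin flip, unrelated to baseline
        else:
            t = baseline + rng.gauss(0, 1) > 0  # assignment depends on baseline
        y = 3.0 * baseline + TRUE_EFFECT * t + rng.gauss(0, 1)
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

randomized_estimate = diff_in_means(randomize=True)
self_selected_estimate = diff_in_means(randomize=False)

print("true effect:            %.2f" % TRUE_EFFECT)
print("randomized estimate:    %.2f" % randomized_estimate)
print("self-selected estimate: %.2f" % self_selected_estimate)
```

Under randomization the baseline is balanced across groups on average, so the simple difference in means lands near the true effect. Under self-selection the treated group starts out with a higher baseline, and the same estimator attributes that pre-existing difference to the treatment.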
Implementation is also a big issue, and one that statisticians generally don't think about, though we should. In applied statistical work I can recall instances where 'effects' found in the data were spurious results of inconsistent data collection or handling. I also wonder how often information on true causal effects of interest is lost due to these issues (I believe students in the applied sciences generally have little to no training in the ways that data can become corrupted, but I'm getting off topic here).
Model quality is another large technical subject, and another necessary step in objective causal inference. To a certain degree this is taken care of because the design crowd develops designs and models together (since inference from a model is the goal, the attributes of the estimators drive design). But this only gets us so far, because in the 'real world' we end up analysing experimental data from non-textbook designs. Then we have to think hard about things like the appropriate controls, how they should enter the model, what the associated degrees of freedom should be, whether assumptions are met and, if not, how to adjust for violations, how robust the estimators are to any remaining violations, and...
Anyway, hopefully some of the above helps in thinking about considerations in making causal inference from a model. Did I forget anything big?