"Essentially, all models are wrong, but some are useful."
— Box, George E. P.; Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley. ISBN 0471810339.
What exactly is the meaning of the above phrase?
I'm not sure there are standard practices for generating synthetic data - it's used so heavily in so many different aspects of research that purpose-built data seems to be a more common and arguably more reasonable approach.
For me, the best practice is not to construct the data set so that it works well with the model. That's part of the research stage, not part of the data-generation stage. Instead, the data should be designed to reflect the data-generating process. For example, for simulation studies in epidemiology, I always start from a large hypothetical population with a known distribution, and then simulate study sampling from that population, rather than generating "the study population" directly.
Based on our discussion below, here are two examples of simulated data I've made:
Simulations like the latter are very common when examining the impact of study recruitment methods, statistical approaches to controlling for covariates, etc.
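To make that two-stage approach concrete, here is a minimal Python sketch that first builds a large hypothetical population with a known outcome model and then simulates (non-random) study recruitment from it. All distributions, parameter values, and variable names are invented for illustration.

```python
# Sketch of population-first simulation: build a known population,
# then simulate study recruitment from it. All parameters are invented.
import numpy as np

rng = np.random.default_rng(42)

# Step 1: a large hypothetical population with known structure.
N = 1_000_000
age = rng.normal(50, 12, N)                     # known covariate distribution
exposure = rng.binomial(1, 0.3, N)
risk = 1 / (1 + np.exp(-(-3.0 + 0.03 * age + 0.7 * exposure)))
disease = rng.binomial(1, risk)                 # known outcome model

# Step 2: simulate study recruitment from that population.
# Recruitment probability depends on age, mimicking a realistic,
# non-random sampling process.
p_recruit = 1 / (1 + np.exp(-(-4.0 + 0.04 * age)))
recruited = rng.random(N) < p_recruit
study_age = age[recruited]
study_exposure = exposure[recruited]
study_disease = disease[recruited]

print(f"population size: {N}, study sample: {recruited.sum()}")
```

Because the population-level parameters are known exactly, any bias introduced by the recruitment step can be measured directly against the truth.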
The cited article seems to be based on fears that statisticians "will not be an intrinsic part of the scientific team, and the scientists will naturally have their doubts about the methods used" and that "collaborators will view us as technicians they can steer to get their scientific results published." My comments on the questions posed by @rvl come from the perspective of a non-statistician biological scientist who has been forced to grapple with increasingly complicated statistical issues as I moved from bench research to translational/clinical research over the past few years. Question 5 is clearly answered by the multiple answers now on this page; I'll go in reverse order from there.
4) It doesn't really matter whether an "exact model" exists, because even if it does I probably won't be able to afford to do the study. Consider this issue in the context of the discussion: Do we really need to include “all relevant predictors?” Even if we can identify "all relevant predictors," there will still be the problem of collecting enough data to provide the degrees of freedom to incorporate them all reliably into the model. That's hard enough in controlled experimental studies, let alone retrospective or population studies. Maybe in some types of "Big Data" that's less of a problem, but it is for me and my colleagues. There will always be the need to "be smart about it," as @Aksakal put it in an answer on that page.
In fairness to Prof. van der Laan, he doesn't use the word "exact" in the cited article, at least in the version presently available online from the link. He talks about "realistic" models. That's an important distinction.
Then again, Prof. van der Laan complains that "Statistics is now an art, not a science," which is more than a bit unfair on his part. Consider the way he proposes to work with collaborators:
... we need to take the data, our identity as a statistician, and our scientific collaborators seriously. We need to learn as much as possible about how the data were generated. Once we have posed a realistic statistical model, we need to extract from our collaborators what estimand best represents the answer to their scientific question of interest. This is a lot of work. It is difficult. It requires a reasonable understanding of statistical theory. It is a worthy academic enterprise!
The application of these scientific principles to real-world problems would seem to require a good deal of "art," as with work in any scientific enterprise. I've known some very successful scientists, many more who did OK, and some failures. In my experience the difference seems to be in the "art" of pursuing scientific goals. The result might be science, but the process is something more.
3) Again, part of the issue is terminological; there's a big difference between an "exact" model and the "realistic" models that Prof. van der Laan seeks. His claim is that many standard statistical models are sufficiently unrealistic to produce "unreliable" results. In particular: "Estimators of an estimand defined in an honest statistical model cannot be sensibly estimated based on parametric models." Those are matters for testing, not opinion.
His own work clearly recognizes that exact models aren't always possible. Consider this manuscript on targeted maximum likelihood estimators (TMLE) in the context of missing outcome variables. It's based on an assumption of outcomes missing at random, which may never be testable in practice: "...we assume there are no unobserved confounders of the relationship between missingness ... and the outcome." This is another example of the difficulty in including "all relevant predictors." A strength of TMLE, however, is that it does seem to help evaluate the "positivity assumption" of adequate support in the data for estimating the target parameter in this context. The goal is to come as close as possible to a realistic model of the data.
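As a concrete, hedged illustration of that kind of positivity check (this is not the manuscript's actual procedure; the data and names below are invented), one can estimate the treatment probabilities and inspect their extremes:

```python
# Toy positivity check: fit P(A = 1 | W) and look at the tails of the
# fitted probabilities. Data and names are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 10_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-3.0 * W)))   # treatment depends strongly on W

g_hat = LogisticRegression().fit(W.reshape(-1, 1), A)
probs = g_hat.predict_proba(W.reshape(-1, 1))[:, 1]
print("propensity quantiles:",
      np.round(np.quantile(probs, [0.0, 0.01, 0.5, 0.99, 1.0]), 3))
```

Fitted probabilities piling up near 0 or 1 signal covariate regions with almost no support for one treatment arm, which is exactly where estimates of the target parameter become unstable.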
2) TMLE has been discussed on Cross Validated previously. I'm not aware of widespread use on real data. As of today, Google Scholar shows 258 citations of what seems to be the initial report, but at first glance none appear to be on large real-world data sets. The Journal of Statistical Software article on the associated R package shows only 27 Google Scholar citations today. That should not, however, be taken as evidence about the value of TMLE. Its focus on obtaining reliable, unbiased estimates of the actual "estimand" of interest, which is often a problem with plug-in estimates derived from standard statistical models, does seem potentially valuable.
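To show the "targeting" idea itself, here is a minimal TMLE sketch for the average treatment effect of a binary treatment on a binary outcome, using simulated data. This is only a toy version, not the published implementation (which typically pairs the targeting step with Super Learner fits); every name and parameter value below is invented for illustration.

```python
# Minimal TMLE sketch for the ATE with a binary outcome, on simulated data.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
n = 5000

# Simulate a confounder W, treatment A, and binary outcome Y.
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.4 * W))                  # treatment depends on W
Y = rng.binomial(1, expit(-0.5 + 0.8 * A + 0.6 * W))

X = np.column_stack([A, W])

# Step 1: initial (possibly misspecified) outcome model Q(A, W).
Q_fit = LogisticRegression().fit(X, Y)
Q_AW = Q_fit.predict_proba(X)[:, 1]
Q_1W = Q_fit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
Q_0W = Q_fit.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]

# Step 2: propensity score g(W) = P(A = 1 | W).
g_fit = LogisticRegression().fit(W.reshape(-1, 1), A)
g_W = g_fit.predict_proba(W.reshape(-1, 1))[:, 1]

# Step 3: "clever covariate" and intercept-free logistic fluctuation
# with the initial fit as an offset.
H = A / g_W - (1 - A) / (1 - g_W)
eps = sm.GLM(Y, H.reshape(-1, 1), offset=logit(Q_AW),
             family=sm.families.Binomial()).fit().params[0]

# Step 4: updated (targeted) predictions under A = 1 and A = 0.
Q1_star = expit(logit(Q_1W) + eps / g_W)
Q0_star = expit(logit(Q_0W) - eps / (1 - g_W))

plug_in = np.mean(Q_1W - Q_0W)      # naive g-computation plug-in
tmle = np.mean(Q1_star - Q0_star)   # targeted estimate of the ATE
print(f"plug-in: {plug_in:.3f}, TMLE: {tmle:.3f}")
```

The fluctuation step nudges the initial outcome model in precisely the direction that reduces bias for this particular estimand, which is the sense in which TMLE "targets" the parameter of interest rather than the whole regression function.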
1) The statement: "a statistical model that makes no assumptions is always true" seems to be intended as a straw man, a tautology. The data are the data. I assume that there are laws of the universe that remain consistent from day to day. The TMLE method presumably contains assumptions about convexity in the search space, and as noted above its application in a particular context might require additional assumptions.
Even Prof. van der Laan would agree that some assumptions are necessary. My sense is that he would like to minimize the number of assumptions and to avoid those that are unrealistic. Whether that truly requires giving up on parametric models, as he seems to claim, is the crucial question.
Best Answer
I think its meaning is best analyzed by looking at it in two parts:
"All models are wrong" that is, every model is wrong because it is a simplification of reality. Some models, especially in the "hard" sciences, are only a little wrong. They ignore things like friction or the gravitational effect of tiny bodies. Other models are a lot wrong - they ignore bigger things. In the social sciences, we ignore a lot.
"But some are useful" - simplifications of reality can be quite useful. They can help us explain, predict and understand the universe and all its various components.
This isn't just true in statistics! Maps are a type of model; they are wrong. But good maps are very useful. Examples of other useful but wrong models abound.
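As a toy numerical illustration of "wrong but useful": a projectile model that ignores air resistance is wrong, yet often close enough to be handy. The parameter values below are invented for the example.

```python
# Compare the frictionless closed-form range with a crude numerical
# integration that includes linear air drag. All parameters are invented.
import numpy as np

g = 9.81                               # gravitational acceleration, m/s^2
v0, angle = 30.0, np.radians(45)
vx, vy = v0 * np.cos(angle), v0 * np.sin(angle)

# Simple model: closed-form range with no drag.
range_ideal = v0**2 * np.sin(2 * angle) / g

# "Truer" model: Euler integration with linear air drag.
k = 0.05                               # drag coefficient per unit mass (assumed)
x = y = 0.0
dt = 1e-3
while y >= 0.0:
    vx += -k * vx * dt
    vy += (-g - k * vy) * dt
    x += vx * dt
    y += vy * dt

print(f"no-drag range: {range_ideal:.1f} m, with drag: {x:.1f} m")
```

The frictionless formula misses the truth by a predictable margin, but it is far easier to reason with - which is exactly the trade-off Box was pointing at.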