Solved – Are all models useless? Is any exact model possible — or useful

machine learning, maximum likelihood, modeling, nonparametric, targeted-maximum-likelihood

This question has been festering in my mind for over a month. The February 2015 issue of Amstat News contains an article by Berkeley Professor Mark van der Laan that scolds people for using inexact models. He states that by using models, statistics is then an art rather than a science. According to him, one can always use "the exact model" and that our failure to do so contributes to a "lack of rigor … I fear that our representation in data science is becoming marginalized."

I agree that we are in danger of becoming marginalized, but the threat usually comes from those who claim (sounding a lot like Professor van der Laan, it seems) that they are not using some approximate method, but whose methods are in fact far less rigorous than are carefully applied statistical models — even wrong ones.

I think it is fair to say that Prof van der Laan is rather scornful of those who repeat Box's oft-used quote, "all models are wrong, but some are useful." Basically, as I read it, he says that all models are wrong, and all are useless. Now, who am I to disagree with a Berkeley professor? On the other hand, who is he to so cavalierly dismiss the views of one of the real giants in our field?

In elaborating, Dr van der Laan states that "it is complete nonsense to state that all models are wrong, … For example, a statistical model that makes no assumptions is always true." He continues: "But often, we can do much better than that: We might know that the data are the result of $n$ independent identical experiments." I do not see how one can know that except in very narrow random-sampling or controlled experimental settings. The author points to his work in targeted maximum likelihood learning and targeted minimum loss-based learning, which "integrates the state of the art in machine learning/data-adaptive estimation, all the incredible advances in causal inference, censored data, efficiency and empirical process theory while still providing formal statistical inference." Sounds great!

There are also some statements I agree with. He says that we need to take our work, our role as a statistician, and our scientific collaborators seriously. Hear, hear! It is certainly bad news when people routinely use a logistic regression model, or whatever, without carefully considering whether it is adequate for answering the scientific question or whether it fits the data. And I do see plenty of such abuses in questions posted in this forum. But I also see effective and valuable uses of inexact models, even parametric ones. And contrary to what he says, I have seldom been "bored to death by another logistic regression model." Such is my naivety, I guess.

So here are my questions:

  1. What useful statistical inferences can be made using a model that makes no assumptions at all?
  2. Is there a case study that applies targeted maximum likelihood to important, real data? Are these methods widely used and accepted?
  3. Are all inexact models indeed useless?
  4. Is it possible to know that you have the exact model other than in trivial cases?
  5. If this is too opinion-based and hence off-topic, where can it be discussed? Because Dr van der Laan's article definitely does need some discussion.

Best Answer

The cited article seems to be based on fears that statisticians "will not be an intrinsic part of the scientific team, and the scientists will naturally have their doubts about the methods used" and that "collaborators will view us as technicians they can steer to get their scientific results published." My comments on the questions posed by @rvl come from the perspective of a non-statistician biological scientist who has been forced to grapple with increasingly complicated statistical issues as I moved from bench research to translational/clinical research over the past few years. Question 5 is clearly answered by the multiple answers now on this page; I'll go in reverse order from there.

4) It doesn't really matter whether an "exact model" exists, because even if it does I probably won't be able to afford to do the study. Consider this issue in the context of the discussion: Do we really need to include “all relevant predictors?” Even if we can identify "all relevant predictors" there will still be the problem of collecting enough data to provide the degrees of freedom to incorporate them all reliably into the model. That's hard enough in controlled experimental studies, let alone retrospective or population studies. Maybe in some types of "Big Data" that's less of a problem, but it is for me and my colleagues. There will always be the need to "be smart about it," as @Aksakal put it in an answer on that page.

In fairness to Prof. van der Laan, he doesn't use the word "exact" in the cited article, at least in the version presently available online from the link. He talks about "realistic" models. That's an important distinction.

Then again, Prof. van der Laan complains that "Statistics is now an art, not a science," which is more than a bit unfair on his part. Consider the way he proposes to work with collaborators:

"... we need to take the data, our identity as a statistician, and our scientific collaborators seriously. We need to learn as much as possible about how the data were generated. Once we have posed a realistic statistical model, we need to extract from our collaborators what estimand best represents the answer to their scientific question of interest. This is a lot of work. It is difficult. It requires a reasonable understanding of statistical theory. It is a worthy academic enterprise!"

The application of these scientific principles to real-world problems would seem to require a good deal of "art," as with work in any scientific enterprise. I've known some very successful scientists, many more who did OK, and some failures. In my experience the difference seems to be in the "art" of pursuing scientific goals. The result might be science, but the process is something more.

3) Again, part of the issue is terminological; there's a big difference between an "exact" model and the "realistic" models that Prof. van der Laan seeks. His claim is that many standard statistical models are sufficiently unrealistic to produce "unreliable" results. In particular: "Estimators of an estimand defined in an honest statistical model cannot be sensibly estimated based on parametric models." Those are matters for testing, not opinion.

His own work clearly recognizes that exact models aren't always possible. Consider this manuscript on targeted maximum likelihood estimators (TMLE) in the context of missing outcome variables. It's based on an assumption of outcomes missing at random, which may never be testable in practice: "...we assume there are no unobserved confounders of the relationship between missingness ... and the outcome." This is another example of the difficulty in including "all relevant predictors." A strength of TMLE, however, is that it does seem to help evaluate the "positivity assumption" of adequate support in the data for estimating the target parameter in this context. The goal is to come as close as possible to a realistic model of the data.
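To make the "positivity" idea concrete, here is a minimal sketch of a crude support check. It is not TMLE and not anything from the manuscript; it only assumes a single discrete covariate, a 0/1 indicator of whether the outcome was observed, and a cutoff of 0.05. The function name `flag_positivity`, the argument names, and the cutoff are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def flag_positivity(strata, observed, threshold=0.05):
    """Crude positivity check (illustrative assumption, not part of TMLE).

    strata   : covariate stratum label for each subject
    observed : 1 if the subject's outcome was observed, 0 if missing
    threshold: hypothetical cutoff below which support is called poor

    Returns {stratum: (proportion_observed, flagged)}.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_observed, n_total]
    for w, d in zip(strata, observed):
        counts[w][0] += int(d)
        counts[w][1] += 1

    report = {}
    for w, (n_obs, n_tot) in counts.items():
        p = n_obs / n_tot
        # A stratum with almost no observed outcomes gives the data little
        # support for estimating the target parameter in that stratum.
        report[w] = (p, p < threshold)
    return report


if __name__ == "__main__":
    strata = ["a", "a", "b", "b", "b", "b"]
    observed = [1, 0, 0, 0, 0, 0]
    for w, (p, flagged) in sorted(flag_positivity(strata, observed).items()):
        print(w, round(p, 3), "poor support" if flagged else "ok")
```

A flagged stratum does not say the missing-at-random assumption fails; it only says that, even if the assumption holds, the data carry little information there.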

2) TMLE has been discussed on Cross Validated previously. I'm not aware of widespread use on real data. Google Scholar showed today 258 citations of what seems to be the initial report, but at first glance none seemed to be on large real-world data sets. The Journal of Statistical Software article on the associated R package only shows 27 Google Scholar citations today. That should not, however, be taken as evidence about the value of TMLE. Its focus on obtaining reliable unbiased estimates of the actual "estimand" of interest, often a problem with plug-in estimates derived from standard statistical models, does seem potentially valuable.

1) The statement: "a statistical model that makes no assumptions is always true" seems to be intended as a straw man, a tautology. The data are the data. I assume that there are laws of the universe that remain consistent from day to day. The TMLE method presumably contains assumptions about convexity in the search space, and as noted above its application in a particular context might require additional assumptions.
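For a sense of what can be inferred under only the weaker assumption quoted in the question (that the data come from $n$ independent identical experiments, with no parametric form), one standard example is a distribution-free confidence band for the cumulative distribution function based on the Dvoretzky-Kiefer-Wolfowitz inequality. The sketch below is my own illustration, not something from the article; the function name `dkw_band` and the default `alpha` are arbitrary choices.

```python
import math
from bisect import bisect_right


def dkw_band(sample, alpha=0.05):
    """Distribution-free (1 - alpha) band for the CDF of i.i.d. data.

    Uses the DKW inequality: with probability at least 1 - alpha,
    sup_x |F_n(x) - F(x)| <= eps, where eps = sqrt(ln(2 / alpha) / (2 n)).
    The only assumption is that the observations are i.i.d.

    Returns (eps, rows), each row being (x, lower, ecdf, upper)
    at a distinct observed value x.
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

    rows = []
    for x in sorted(set(xs)):
        f = bisect_right(xs, x) / n  # empirical CDF: share of values <= x
        rows.append((x, max(f - eps, 0.0), f, min(f + eps, 1.0)))
    return eps, rows


if __name__ == "__main__":
    eps, rows = dkw_band([2.1, 0.4, 3.3, 1.7, 2.9, 0.8, 1.2, 2.5])
    print("half-width:", round(eps, 3))
    for x, lo, f, hi in rows:
        print(x, round(lo, 3), round(f, 3), round(hi, 3))
```

The band is valid for any underlying distribution, but with small $n$ it is wide, which is the practical price of assuming so little.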

Even Prof. van der Laan would agree that some assumptions are necessary. My sense is that he would like to minimize the number of assumptions and to avoid those that are unrealistic. Whether that truly requires giving up on parametric models, as he seems to claim, is the crucial question.
