Solved – “Targeted Maximum Likelihood Estimation”

censoring, estimation, mathematical-statistics, nonparametric, targeted-maximum-likelihood

I'm trying to understand some papers by Mark van der Laan. He's a theoretical statistician at Berkeley working on problems that overlap significantly with machine learning. One problem for me (besides the deep math) is that he often ends up describing familiar machine learning approaches using a completely different terminology. One of his main concepts is "Targeted Maximum Likelihood Estimation" (TMLE).

TMLE is used to analyze censored observational data from a non-controlled experiment in a way that allows effect estimation even in the presence of confounding factors. I strongly suspect that many of the same concepts exist under other names in other fields, but I don't yet understand it well enough to match it directly to anything.

An attempt to bridge the gap to "Computational Data Analysis" is here:

Entering the Era of Data Science: Targeted Learning and the Integration of Statistics and Computational Data Analysis

And an introduction for statisticians is here:

Targeted Maximum Likelihood Based Causal Inference: Part I

From the second:

In this article, we develop a particular targeted maximum likelihood
estimator of causal effects of multiple time point interventions. This
involves the use of loss-based super-learning to obtain an initial
estimate of the unknown factors of the G-computation formula, and
subsequently, applying a target-parameter specific optimal fluctuation
function (least favorable parametric submodel) to each estimated
factor, estimating the fluctuation parameter(s) with maximum
likelihood estimation, and iterating this updating step of the initial
factor till convergence. This iterative targeted maximum likelihood
updating step makes the resulting estimator of the causal effect
double robust in the sense that it is consistent if either the initial
estimator is consistent, or the estimator of the optimal fluctuation
function is consistent. The optimal fluctuation function is correctly
specified if the conditional distributions of the nodes in the causal
graph one intervenes upon are correctly specified.

In his terminology, "super learning" is ensemble learning with a theoretically sound non-negative weighting scheme. But what does he mean by "applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor"?

Or, breaking it into three distinct questions: does TMLE have a parallel in machine learning? What is a "least favorable parametric submodel"? And what is a "fluctuation function" in other fields?

Best Answer

I agree that van der Laan has a tendency to invent new names for already existing ideas (e.g. the super-learner), but TMLE is not one of them as far as I know. It is actually a very clever idea, and I have seen nothing from the Machine Learning community which looks similar (although I might just be ignorant). The ideas come from the theory of semiparametric-efficient estimating equations, which is something that I think statisticians think much more about than ML people.

The idea essentially is this. Suppose $P_0$ is a true data generating mechanism, and interest is in a particular functional $\Psi(P_0)$. Associated with such a functional is often an estimating equation

$$ \sum_i \varphi(Y_i \mid \theta) = 0, $$

where $\theta = \theta(P)$ is determined in some way by $P$, and contains enough information to identify $\Psi$. $\varphi$ will be such that $E_{P} \varphi(Y \mid \theta) = 0$. Solving this equation in $\theta$ may, for example, be much easier than estimating all of $P_0$. This estimating equation is efficient in the sense that any efficient estimator of $\Psi(P_0)$ is asymptotically equivalent to one which solves this equation. (Note: I'm being a little loose with the term "efficient", since I'm just describing the heuristic.) The theory behind such estimating equations is quite elegant, with this book being the canonical reference. This is where one might find standard definitions of "least favorable submodels"; these aren't terms van der Laan invented.
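To make this concrete, consider a standard example (my own illustration, not something stated in the original post): take $\Psi(P)$ to be the average treatment effect of a binary treatment $A$ on an outcome $Y$ given covariates $W$, with observed data $O = (W, A, Y)$,

$$ \Psi(P) = E_P\big[ E_P(Y \mid A = 1, W) - E_P(Y \mid A = 0, W) \big]. $$

The associated efficient estimating function (the efficient influence function) is

$$ \varphi(O \mid \theta) = \frac{2A - 1}{g(A \mid W)} \big( Y - \bar Q(A, W) \big) + \bar Q(1, W) - \bar Q(0, W) - \Psi, $$

with $\bar Q(a, W) = E_P(Y \mid A = a, W)$, $g(a \mid W) = P(A = a \mid W)$, and $\theta = (\bar Q, g, \Psi)$. Setting the sample average of $\varphi$ to zero and solving for $\Psi$ gives the familiar augmented inverse-probability-weighted (double robust) estimator.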

However, estimating $P_0$ using machine learning techniques will not, in general, satisfy this estimating equation. Estimating, say, the density of $P_0$ is an intrinsically difficult problem, perhaps much harder than estimating $\Psi(P_0)$, but machine learning techniques will typically go ahead and estimate $P_0$ with some $\hat P$, and then use the plug-in estimate $\Psi(\hat P)$. van der Laan would criticize this estimator as not being targeted, and hence possibly inefficient; it may not even be $\sqrt n$-consistent at all! Nevertheless, van der Laan recognizes the power of machine learning, and knows that estimating the effects he is interested in will ultimately require some density estimation. But he doesn't care about estimating $P_0$ itself; the density estimation is only done for the purpose of getting at $\Psi$.
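Continuing the hypothetical ATE example above, the plug-in estimator is the G-computation estimate

$$ \Psi(\hat P) = \frac{1}{n} \sum_{i=1}^n \Big( \hat{\bar Q}(1, W_i) - \hat{\bar Q}(0, W_i) \Big), $$

which in general does not solve the efficient estimating equation: whatever bias the machine-learned $\hat{\bar Q}$ carries passes straight through to the estimate of $\Psi$.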

The idea of TMLE is to start with the initial density estimate $\hat p$ and then consider a new model like this:

$$ \hat p_{1, \epsilon}(y) = \frac{\hat p(y) \exp(\epsilon \, \varphi(y \mid \theta))}{\int \hat p(u) \exp(\epsilon \, \varphi(u \mid \theta)) \, du}, $$

where $\epsilon$ is called a fluctuation parameter. Now we do maximum likelihood on $\epsilon$. If it happens to be the case that $\epsilon = 0$ is the MLE, then one can easily verify by taking the derivative (the calculation is sketched below) that $\hat p$ solves the efficient estimating equation, and hence is efficient for estimating $\Psi$! On the other hand, if $\epsilon \ne 0$ at the MLE, we have a new density estimator $\hat p_1 = \hat p_{1, \hat \epsilon}$ which fits the data better than $\hat p$ (after all, we did MLE, so it has a higher likelihood). Then, we iterate this procedure and look at

$$ \hat p_{2, \epsilon} \propto \hat p_{1, \hat \epsilon} \exp(\epsilon \, \varphi(y \mid \theta)), $$

and so on, until we get something, in the limit, which satisfies the efficient estimating equation.
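To spell out the derivative check mentioned above: the log of the fluctuated density is $\log \hat p_{1, \epsilon}(y) = \log \hat p(y) + \epsilon \, \varphi(y \mid \theta) - \log \int \hat p(u) \exp(\epsilon \, \varphi(u \mid \theta)) \, du$, so the score of the sample log-likelihood at $\epsilon = 0$ is

$$ \frac{\partial}{\partial \epsilon} \sum_i \log \hat p_{1, \epsilon}(Y_i) \bigg|_{\epsilon = 0} = \sum_i \varphi(Y_i \mid \theta) - n \, E_{\hat p} \, \varphi(Y \mid \theta). $$

If $\epsilon = 0$ is the MLE this score vanishes, and since $\theta = \theta(\hat p)$ makes $E_{\hat p} \, \varphi(Y \mid \theta) = 0$, the efficient estimating equation $\sum_i \varphi(Y_i \mid \theta) = 0$ holds.

For a computational picture, here is a minimal sketch (my own, not from the answer or from van der Laan's software) of a targeted update for the ATE example with a binary outcome. It uses the common logistic-fluctuation formulation with a "clever covariate", which plays the role of the least favorable submodel for this parameter rather than the exponential tilt written above; for this particular target a single fluctuation step already solves the efficient estimating equation, so no further iteration is needed. The simulated data, the variable names, and the use of plain logistic regressions in place of the super learner are all illustrative assumptions.

```python
# A minimal sketch of a targeted update (TMLE) for the average treatment
# effect E[Y(1)] - E[Y(0)] with a binary outcome. Illustrative only:
# variable names, the simulated data, and the plain logistic regressions
# (standing in for the super learner) are all my own choices.
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# --- simulate confounded observational data ---------------------------
n = 5000
W = rng.normal(size=(n, 2))                                # confounders
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))  # treatment
Y = rng.binomial(1, expit(-0.5 + A + 0.6 * W[:, 0] - 0.4 * W[:, 1]))

# --- step 1: initial estimates of Qbar(A, W) = E[Y | A, W] and g(W) = P(A = 1 | W)
Q_model = LogisticRegression().fit(np.column_stack([A, W]), Y)
g_model = LogisticRegression().fit(W, A)

Qbar_A = Q_model.predict_proba(np.column_stack([A, W]))[:, 1]
Qbar_1 = Q_model.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
Qbar_0 = Q_model.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]
g_hat = g_model.predict_proba(W)[:, 1]

# Untargeted plug-in (G-computation) estimate, for comparison
ate_plugin = np.mean(Qbar_1 - Qbar_0)

# --- step 2: fluctuate along the least favorable submodel -------------
# The "clever covariate" H carries the treatment-mechanism part of the
# efficient influence function.
H = A / g_hat - (1 - A) / (1 - g_hat)

# MLE of the fluctuation parameter epsilon: logistic regression of Y on H
# with the initial fit entering as an offset (no intercept).
flux = sm.GLM(Y, H.reshape(-1, 1), family=sm.families.Binomial(),
              offset=logit(Qbar_A)).fit()
eps = flux.params[0]

# --- step 3: update the initial fit and plug in ------------------------
Qbar_1_star = expit(logit(Qbar_1) + eps / g_hat)
Qbar_0_star = expit(logit(Qbar_0) - eps / (1 - g_hat))
ate_tmle = np.mean(Qbar_1_star - Qbar_0_star)

print("plug-in estimate:", ate_plugin)
print("TMLE estimate:   ", ate_tmle)
```

Swapping the two logistic regressions for any flexible learner (the super learner, a random forest, etc.) leaves the targeting step unchanged; that separation of the initial fit from the targeted update is the whole point of the procedure.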
