Solved – How does an estimator that minimizes a weighted sum of squared bias and variance fit into decision theory

bias, decision-theory, frequentist, loss-functions, risk

Okay – my original message failed to elicit a response, so let me put the question differently. I will start by explaining my understanding of estimation from a decision theoretic perspective. I have no formal training and it would not surprise me if my thinking is flawed in some way.

Suppose we have some loss function $L(\theta,\hat\theta(x))$. The expected loss is the (frequentist) risk:

$$R(\theta,\hat\theta)=\int L(\theta,\hat\theta(x))\,\mathcal{L}(\theta;x)\,dx,$$

where $\mathcal{L}(\theta;x)$ is the likelihood (the sampling density of $x$ given $\theta$); and the Bayes risk is the prior-expected frequentist risk:

$$r(\pi,\hat\theta)=\int R(\theta,\hat\theta)\,\pi(\theta)\,d\theta=\int\!\!\int L(\theta,\hat\theta(x))\,\mathcal{L}(\theta;x)\,\pi(\theta)\,dx\,d\theta,$$

where $\pi (\theta)$ is our prior.

In general, we find the estimator $\hat\theta$ that minimizes $r$, and all of this works out nicely; moreover, Fubini's theorem applies, so we can reverse the order of integration and minimize the posterior expected loss pointwise in $x$, meaning that the value $\hat\theta(x)$ chosen for the observed $x$ does not depend on the values the estimator would take at other, unobserved $x$. This way the likelihood principle isn't violated and we can feel good about being Bayesian and so on.

For example, given the familiar squared error loss, $L(\theta,\hat\theta(x))=(\theta-\hat\theta(x))^2$, the frequentist risk is the mean squared error, i.e. the sum of squared bias and variance, and the Bayes risk is that sum averaged over the prior, whose minimization is equivalent to minimizing the a posteriori expected loss.
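To spell out the decomposition I am relying on (standard, but worth writing once):

$$\mathbb{E}_\theta\big[(\hat\theta(X)-\theta)^2\big]
=\big(\mathbb{E}_\theta[\hat\theta(X)]-\theta\big)^2
+\mathbb{E}_\theta\big[(\hat\theta(X)-\mathbb{E}_\theta[\hat\theta(X)])^2\big],$$

since the cross term $2\big(\mathbb{E}_\theta[\hat\theta(X)]-\theta\big)\,\mathbb{E}_\theta\big[\hat\theta(X)-\mathbb{E}_\theta[\hat\theta(X)]\big]$ vanishes.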

This seems sensible to me so far (although I could be quite wrong); but in any case, things make far less sense to me for some other objectives. For example, suppose that instead of minimizing the sum of equally-weighted squared bias and variance, I want to minimize an unequally-weighted sum – that is, I want the $\hat\theta(x)$ that minimizes:

$$(\mathbb{E}_\theta[\hat\theta(X)]-\theta)^2+k\,\mathbb{E}_\theta[(\hat\theta(X)-\mathbb{E}_\theta[\hat\theta(X)])^2],$$

where $k$ is some positive real constant (other than 1).
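For concreteness, here is the sort of numerical search I have in mind, in a toy setup of my own (nothing below is essential to the question): $X_1,\dots,X_n\sim N(\theta,\sigma^2)$ estimated by a shrinkage rule $\hat\theta_c(x)=c\bar x$, whose bias and variance are available in closed form, so the weighted objective can be minimized over $c$.

```python
from scipy.optimize import minimize_scalar

# Toy setup (illustrative only): X_1,...,X_n ~ N(theta, sigma^2),
# estimator family theta_hat_c(x) = c * xbar.
theta, sigma2, n, k = 2.0, 4.0, 10, 3.0

def weighted_objective(c):
    bias_sq = (c * theta - theta) ** 2   # (E_theta[theta_hat_c] - theta)^2
    variance = c ** 2 * sigma2 / n       # Var_theta(theta_hat_c) = c^2 * sigma^2 / n
    return bias_sq + k * variance

res = minimize_scalar(weighted_objective, bounds=(0.0, 1.5), method="bounded")
# Closed-form optimum for comparison: c* = theta^2 / (theta^2 + k * sigma^2 / n)
print(res.x, theta**2 / (theta**2 + k * sigma2 / n))
```

The point is only that the minimization itself is mechanical once the sampling distribution of $\hat\theta_c$ is fixed.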

I typically refer to a sum like this as an "objective function" although it may be that I'm using that term incorrectly. My question is not about how to find a solution–finding the $\hat\theta(x)$ that minimize this objective function is doable numerically–rather, my question is twofold:

  1. Can such an objective function fit into the decision theory paradigm? If not, is there another framework in which it does fit? If yes, how so? It seems like the associated loss function would have to be a function of $\theta$, $\hat\theta(x)$, and $\mathbb{E}_\theta[\hat\theta(X)]$, which, because of the expectation, is (I think) not a legitimate loss function.

  2. Such an objective function violates the likelihood principle because any given estimate $\hat\theta(x_{j})$ depends on all the other estimates $\hat\theta(x_{i\neq j})$ (even hypothetical ones). Nevertheless, there are occasions when trading an increase in error variance for a reduction in bias is desirable. Given such a goal, is there a way to conceptualize the problem so that it conforms to the likelihood principle?

I'm assuming that I have failed to understand some fundamental concepts about decision theory / estimation / optimization. Thanks in advance for any answers and please assume I know nothing as I have no training in this area or mathematics more generally. Additionally, any suggested references (for the naive reader) are appreciated.

Best Answer

This is a fairly interesting and novel question! At a formal level, using the frequentist risk function $$(\mathbb{E}_\theta[\hat\theta(X)]-\theta)^2+k\,\mathbb{E}_\theta[(\hat\theta(X)-\mathbb{E}_\theta[\hat\theta(X)])^2]$$ means using (for instance) the loss function defined as $$L(\theta,\hat{\theta})=(\mathbb{E}_\theta[\hat\theta(X)]-\theta)^2+k\,(\hat\theta-\mathbb{E}_\theta[\hat\theta(X)])^2,$$ since there is no reason to prohibit expectations like $\mathbb{E}_\theta[\hat\theta(X)]$ from appearing in a loss function. That they depend on the whole distribution of $\hat{\theta}(X)$ may seem odd, but that distribution is itself determined by $\theta$, so the resulting loss is a function of $\theta$, $\hat{\theta}$, and the distribution of $\hat{\theta}(X)$.
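As a quick sanity check (a Monte Carlo sketch of my own; the toy model $\hat\theta(X)=c\bar X$ with normal data is purely illustrative), averaging this loss over the sampling distribution of $\hat\theta(X)$ does recover the weighted risk above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sampling model (illustrative only): theta_hat(X) = c * Xbar with
# X_i ~ N(theta, sigma^2), so E_theta[theta_hat] = c*theta and
# Var_theta(theta_hat) = c^2 * sigma^2 / n.
theta, sigma2, n, k, c = 2.0, 4.0, 10, 3.0, 0.8
draws = c * rng.normal(theta, np.sqrt(sigma2 / n), size=1_000_000)

mu = c * theta                                      # E_theta[theta_hat(X)]
loss = (mu - theta) ** 2 + k * (draws - mu) ** 2    # the loss proposed above
risk = (mu - theta) ** 2 + k * c ** 2 * sigma2 / n  # bias^2 + k * variance
print(loss.mean(), risk)                            # agree up to Monte Carlo error
```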

I can perfectly forecast the objection that a loss function $L(\theta,\delta)$ is in principle a function of a state of nature, $\theta$, and of an action, $\delta$, taking values for instance in the parameter space $\Theta$, and hence involves no distributional assumption whatsoever. That is correct from a game-theoretic perspective. But given that this is statistical decision theory, where a decision $\delta$ depends on the observation $x$ of a random variable $X$, I see no reason why the generalisation in which the loss function depends on the distribution of $X$, indexed by $\theta$, could not be considered. That it may violate the likelihood principle is not of direct concern for decision theory and does not prevent the formal derivation of a Bayes estimator.