Solved – Minimizing variance of an estimator under sampling cost penalty

Tags: loss-functions, optimization, sample-size, sampling, value-of-information

I have an estimator $t$, whose variance depends on the size $n$ of
my sample $x_{1:n}$:

$$
\text{Var}(t(x_{1:n})) = f(n).
$$

Suppose that the form of $f(n)$ is known. I would like to determine
the optimal sample size $n$, given that every sample has a cost $c$.

I was considering:

a) Minimizing a loss function of the form:

$$
\text{Loss}(n) = f(n) + n \times c.
$$

b) Plotting $f(n)$ against $n \times c$ to try to determine a good sample size $n$
visually.
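Option a) is easy to sketch numerically. As a purely hypothetical illustration, suppose $t$ is a sample mean, so that $f(n) = \sigma^2/n$ (the specific form and the values of $\sigma^2$ and $c$ below are assumptions, not part of the question):

```python
import math

# Hypothetical assumption: t is a sample mean, so f(n) = sigma2 / n.
sigma2 = 4.0   # variance of a single draw (assumed known)
c = 0.1        # cost per sample, in the same units as f(n)

def loss(n):
    """Loss(n) = f(n) + n * c."""
    return sigma2 / n + c * n

# Grid search over candidate sample sizes.
best_n = min(range(1, 10_001), key=loss)

# For f(n) = sigma2 / n, setting the derivative to zero
# gives the closed-form minimizer n* = sqrt(sigma2 / c).
closed_form = math.sqrt(sigma2 / c)
print(best_n, closed_form)
```

The grid answer sits next to the closed-form optimum; the caveat about summing incompatible units applies unchanged, since the code simply assumes $f(n)$ and $c$ are already on a common scale.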

In my case $c$ is the time needed to simulate $x_i$, for example 0.1 s.
I don't think that minimizing a function where different units are summed
together makes much sense, so I was wondering whether there is any
way to translate CPU time into something more sensible.

Some additional information about my application: $t(x_{1:n})$ is the estimate of a gradient, which is calculated at each step of a Stochastic Steepest Descent algorithm. So if $n$ is small the algorithm is fast but unstable, while if $n$ is large it is slow but stable. So far I have just experimented with many values of $n$, but maybe somebody knows a better way of determining $n$.
Thanks a lot!

Best Answer

You are asking a question about the expected value of sample information.

First, I suggest that you not define your loss function in terms of variance, at least not without further examination. Instead, identify why you care about error in your estimation procedure. What are the different ways of being wrong? What are the consequences of being wrong in each of those ways? Can you quantify those consequences in terms of units of value that make sense to you?

Then define your loss function as cost-of-testing + E[cost-of-error]. To describe losses in this way, you will need to specify a common unit of measure in which both kinds of loss are denominated. Explicitly or implicitly, your loss function will need to incorporate a specification of the "price" at which you would willingly trade off costs of testing for costs of error. This quantification may seem artificial: it may be that there is no obvious common unit in which both kinds of losses might be measured. Too bad: any loss function will define such terms of trade, either explicitly or implicitly. (In terms of intermediate microeconomics: a level set of the loss function defines an indifference curve in the space of arguments; the price is the slope of this indifference curve.) You are better off making this choice consciously, rather than as the logical consequence of an un-examined ad hoc assumption.

In my opinion (and I am right about this), a problem in statistical decision theory (which this is) is handled most naturally in a Bayesian framework, in which we treat unobserved parameters as random variables. Formally, let $\theta$ denote the true value of the variable whose value you're trying to estimate, e.g., the argument at which your search algorithm would find its true minimum. For any other value $\hat{\theta}$ in your parameter space, let $C(\hat{\theta}, \theta)$ denote the cost of error when $\hat{\theta}$ is used as an estimated value and $\theta$ is the true value. It could be that the cost of error depends only on the size of the difference (somehow defined) between the estimated and true values, e.g., $C(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$, where the minus sign designates some measure of distance in parameter space.

Let $f(\theta)$ describe your beliefs about the likelihood that $\theta$ might take on various different values, expressed as a probability distribution over the parameter space. Let $f(\theta \mid x_{1:n})$ describe the posterior probability over $\theta$ that would obtain if your sample consisted of the values $x_{1:n}$. After you've done your sampling, you can compute your expected cost of error as a function of the sample you drew: $E[C \mid x_{1:n}] = \int \int C(\hat{\theta}, \theta) f(\hat{\theta} \mid x_{1:n}) \, d\hat{\theta} \, f(\theta \mid x_{1:n}) \, d\theta$, where both integrals are taken over the parameter space. Your total expected loss is then $E[C \mid x_{1:n}] + cn$.
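The double integral above can be approximated by Monte Carlo once a concrete model is chosen. The sketch below assumes a conjugate-normal setup (prior $\theta \sim N(0, \tau^2)$, observations $x_i \mid \theta \sim N(\theta, \sigma^2)$) with squared-error cost; all numeric values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conjugate-normal model: theta ~ N(0, tau2), x_i | theta ~ N(theta, sigma2).
tau2, sigma2, c = 1.0, 4.0, 0.1
n = 10
theta_true = rng.normal(0.0, np.sqrt(tau2))
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

# The posterior is N(mu_post, v_post) by standard conjugacy.
v_post = 1.0 / (1.0 / tau2 + n / sigma2)
mu_post = v_post * (x.sum() / sigma2)

# Monte Carlo version of the double integral with C = squared error:
# draw theta-hat and theta independently from the posterior, average the cost.
theta_hat = rng.normal(mu_post, np.sqrt(v_post), size=100_000)
theta = rng.normal(mu_post, np.sqrt(v_post), size=100_000)
expected_cost = np.mean((theta_hat - theta) ** 2)  # analytically 2 * v_post here

total_loss = expected_cost + c * n
print(expected_cost, total_loss)
```

With squared-error cost and two independent posterior draws, the expected cost works out to twice the posterior variance, so the Monte Carlo estimate can be checked against that closed form.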

If you are allowed to choose $n$ using a stopping rule, rather than specifying a choice in advance, you can employ dynamic programming to choose $n$ optimally. This answer to a previous question should offer useful guidance.

If you have to specify $n$ in advance, then you would employ preposterior analysis. Intuitively, you choose $n$ so that in expectation, the marginal benefit of the last test just offsets the cost of the last test. Formally, let $f(\theta)$ describe your prior beliefs about the distribution of $\theta$, before any sampling has been done. Let $l(x_{1:n} \mid \theta)$ denote the likelihood of drawing a specific sample $x_{1:n}$, if $\theta$ were the true value of the parameter. Then before you've taken your first sample, you can already formulate an expression for your expected cost of error:

$\int \int E[C \mid \mathbf{x}] \, l(\mathbf{x} \mid \theta) \, f(\theta) \, d\mathbf{x} \, d\theta.$

Here, the inner integral is taken over the sample space and the outer integral over the parameter space. You would choose $n$ to minimize the above expression plus $cn$, the cost of testing.
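In the conjugate-normal case the preposterior calculation collapses to a closed form, because with squared-error cost and $\hat{\theta}$ taken to be the posterior mean, the expected cost of error equals the posterior variance, which does not depend on the data at all. A minimal sketch, with all numeric values assumed for illustration:

```python
# Hypothetical conjugate-normal setup: theta ~ N(0, tau2), x_i | theta ~ N(theta, sigma2).
tau2, sigma2, c = 1.0, 4.0, 0.1

def preposterior_loss(n):
    # With squared-error cost and theta-hat = the posterior mean, the
    # expected cost of error is the posterior variance. Here it is
    # data-free, so the preposterior double integral collapses to:
    v_post = 1.0 / (1.0 / tau2 + n / sigma2)
    return v_post + c * n

# Choose n in advance to minimize expected cost of error plus cost of testing.
n_star = min(range(1, 1001), key=preposterior_loss)
print(n_star)
```

For models without this conjugate structure, the posterior variance does depend on the data, and the double integral has to be approximated (e.g., by simulating samples from the prior predictive distribution) rather than collapsed analytically.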

Choosing $n$ dynamically is better, if possible, because you can equate the marginal benefit of the last test to its cost in every realization, rather than merely in expectation.