Solved – Finding the input variable to which output is most sensitive

anova, correlation, optimization

I have a physical system which takes a number of inputs $x_i$ and produces an output $Y$ (an error measure).

$$
Y = f(x_1, x_2, x_3, \ldots, x_{1000})
$$

The function $f()$ can be evaluated by running a compute-intensive simulation of a model.

I want to find the $x_i$ to which $Y$ is most sensitive. In practice, I want to optimize the values of the few input variables that would give the maximum return (in terms of improving the system performance).

One idea is to randomly perturb each $x_i$ by a small amount around its existing value and record the output, repeat the experiment a few hundred times, compute the correlation between each $x_i$ and $Y$, and pick the inputs with the highest correlation.
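To make the idea concrete, here is a minimal sketch of that perturb-and-correlate screening. Everything named here is hypothetical: `toy_model` is a cheap stand-in for the real simulation, and the step size, run count, and number of inputs are arbitrary illustration values. Keep in mind that with $n$ runs, chance correlations of order $1/\sqrt{n}$ are expected, so with 1000+ inputs and a few hundred runs only fairly strong effects would stand out.

```python
import math
import random


def toy_model(x):
    # Hypothetical stand-in for the expensive simulation:
    # inputs 0 and 3 dominate, inputs 5 and up have a weak effect.
    return 5.0 * x[0] - 3.0 * x[3] + 0.1 * sum(x[5:])


def pearson(a, b):
    # Plain Pearson correlation coefficient between two equal-length lists.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def screen_by_correlation(model, x0, n_runs=200, step=0.05, seed=0):
    # Perturb all inputs at once by a small uniform amount around x0,
    # evaluate the model, and score each input by |corr(x_i, Y)|.
    rng = random.Random(seed)
    samples, outputs = [], []
    for _ in range(n_runs):
        x = [xi + rng.uniform(-step, step) for xi in x0]
        samples.append(x)
        outputs.append(model(x))
    scores = [
        abs(pearson([s[i] for s in samples], outputs))
        for i in range(len(x0))
    ]
    ranking = sorted(range(len(x0)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


x0 = [1.0] * 20  # hypothetical nominal operating point
ranking, scores = screen_by_correlation(toy_model, x0)
print(ranking[:5])
```

On this toy problem the two dominant inputs (indices 0 and 3) should come out at the top of the ranking. Note that correlation only captures roughly linear, local effects around the nominal point.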

I am wondering if there is a more formal method to achieve this.

One important constraint in my particular problem is that each model evaluation requires a computationally intensive simulation of about 10 minutes, and the number of inputs $x_i$ is between $1000$ and $2000$.

Best Answer

I think you could look into the field of sensitivity analysis: https://en.wikipedia.org/wiki/Sensitivity_analysis. In your case, I would suggest computing the Sobol' indices (https://en.wikipedia.org/wiki/Variance-based_sensitivity_analysis).

These indices represent the fraction of the output variance attributable to a single variable and/or a set of variables. Several R packages compute first- and second-order indices quite efficiently by using specific sampling designs.
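For intuition, the first-order index of input $x_i$ is $S_i = \mathrm{Var}(E[Y \mid x_i]) / \mathrm{Var}(Y)$. Below is a rough sketch of a Monte Carlo "pick-freeze" (Saltelli-style) estimator of $S_i$, assuming independent inputs uniform on $[0, 1]$. The function `toy_model` and the sample size are hypothetical choices for illustration; this is not what any particular R package does internally.

```python
import random


def toy_model(x):
    # Hypothetical cheap test function: x[2] has no influence at all.
    return x[0] + 2.0 * x[1]


def first_order_sobol(model, d, n=20000, seed=0):
    # Two independent sample matrices A and B, inputs uniform on [0, 1].
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(d)] for _ in range(n)]
    B = [[rng.random() for _ in range(d)] for _ in range(n)]
    fA = [model(a) for a in A]
    fB = [model(b) for b in B]

    # Output mean and variance estimated from both sample sets.
    all_y = fA + fB
    mean = sum(all_y) / len(all_y)
    var = sum((y - mean) ** 2 for y in all_y) / (len(all_y) - 1)

    indices = []
    for i in range(d):
        acc = 0.0
        for a, b, ya, yb in zip(A, B, fA, fB):
            # Row of A with only coordinate i replaced by B's value.
            ab = a[:i] + [b[i]] + a[i + 1:]
            # Centering f(B) leaves the expectation unchanged
            # and reduces the Monte Carlo noise.
            acc += (yb - mean) * (model(ab) - ya)
        indices.append(acc / n / var)
    return indices


S = first_order_sobol(toy_model, d=3)
print([round(s, 2) for s in S])
```

For this toy function the exact values are $S_0 = 0.2$, $S_1 = 0.8$ and $S_2 = 0$, so the estimates should land close to those. The cost is $n(d + 2)$ model evaluations, which is why applying this directly to a 10-minute simulation with a thousand inputs is not realistic.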

In your case, as the number of affordable model evaluations is pretty small and the number of inputs is large, you could look into surrogate-based sensitivity analysis (see for instance https://doi.org/10.1016/j.apm.2013.01.019): take a well-behaved initial design (a Latin hypercube or another space-filling design) and, based on these evaluations, build a surrogate model (using Kriging). The cheap surrogate is then used in place of the simulation for the intensive computations, and can give some insightful results.
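As an illustration of the first step only, here is a small sketch of a Latin hypercube design in plain Python. The helper names `latin_hypercube` and `scale_design`, and the bounds used, are made up for this example. In practice you would more likely rely on an existing implementation (for example `scipy.stats.qmc.LatinHypercube`) and then fit the Kriging / Gaussian process surrogate on the resulting runs, which is not shown here.

```python
import random


def latin_hypercube(n_points, d, seed=0):
    # Each dimension is cut into n_points equal strata on [0, 1);
    # every stratum is used exactly once per dimension.
    rng = random.Random(seed)
    design = [[0.0] * d for _ in range(n_points)]
    for j in range(d):
        strata = list(range(n_points))
        rng.shuffle(strata)
        for i in range(n_points):
            # Random position inside the assigned stratum.
            design[i][j] = (strata[i] + rng.random()) / n_points
    return design


def scale_design(design, lower, upper):
    # Map the unit-cube design onto the actual input ranges.
    return [
        [lo + u * (hi - lo) for u, lo, hi in zip(row, lower, upper)]
        for row in design
    ]


unit_design = latin_hypercube(n_points=10, d=4)
# Hypothetical bounds: each input varies between 0.9 and 1.1.
design = scale_design(unit_design, lower=[0.9] * 4, upper=[1.1] * 4)
print(len(design), len(design[0]))
```

Each row of `design` would be one simulation run; the pairs (row, simulated output) are what the surrogate is trained on.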

Be aware, however, that because of the high number of inputs, building an accurate surrogate will probably require a lot of initial runs. A usual rule of thumb is to take $10d$ initial design points, where $d$ is the input dimension. With $d = 1000$ and 10 minutes per run, that is 10,000 simulations, or roughly 69 days of serial compute.