If new data is expensive or time-consuming to generate, surrogate-model based optimization may be a good approach.
That is:
1. fit a suitable, data-driven model to your data
2. optimize the surrogate model to derive a new candidate solution
3. evaluate the candidate solution (that is, measure the true purity)
4. update the model with the new data you just obtained
5. repeat steps 2-4 until a stopping criterion is reached (budget depleted, target purity attained); a bare-bones R sketch of this loop follows below
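For instance, with a fake true_purity function and a simple quadratic lm surrogate standing in for the real measurement and whatever model you end up choosing:

true_purity <- function(x) -(x - 0.6)^2 + 0.9                           # fake "expensive" measurement
x_obs <- c(0.1, 0.4, 0.8); y_obs <- true_purity(x_obs)                  # pre-existing data
for (k in 1:5) {
  dat <- data.frame(x = x_obs, y = y_obs)
  surrogate <- lm(y ~ poly(x, 2), data = dat)                           # step 1: fit the surrogate
  obj <- function(x) -predict(surrogate, newdata = data.frame(x = x))   # we maximize predicted purity
  cand <- optimize(obj, interval = c(0, 1))$minimum                     # step 2: optimize the surrogate
  y_new <- true_purity(cand)                                            # step 3: evaluate candidate for real
  x_obs <- c(x_obs, cand); y_obs <- c(y_obs, y_new)                     # step 4: update the data
}                                                                       # step 5: repeat until budget is spent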
Which model is suitable depends on your problem. For something non-linear, neural networks (as noted by usεr11852), support vector machines or Gaussian process regression (a.k.a. Kriging) might work. Bear in mind, though, that the computational effort for training such a model grows considerably if your data set is rather large.
Kriging has the advantage of providing an uncertainty estimate alongside its prediction. This allows you to compute the Expected Improvement (EI) of a candidate solution.
While this does not exactly "integrate these 2 steps" (modeling + optimization), EI provides a means to elegantly balance exploitation vs. exploration in a surrogate-model based optimization framework.
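As an illustration, here is a minimal sketch using the DiceKriging and DiceOptim packages; the one-dimensional function f below is a toy stand-in for the real experiment, and the example minimizes it (for maximizing purity you would simply flip the sign of the response):

library(DiceKriging)
library(DiceOptim)
f <- function(x) (6 * x - 2)^2 * sin(12 * x - 4)     # toy "expensive" function
X <- data.frame(x = c(0.05, 0.3, 0.55, 0.8, 0.95))   # pre-existing design points
y <- apply(X, 1, f)                                  # pre-existing responses
fit  <- km(design = X, response = y)                 # fit the Kriging surrogate
cand <- max_EI(fit, lower = 0, upper = 1)            # candidate maximizing Expected Improvement
cand$par                                             # next point to evaluate for real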
I suggest reading Engineering Design via Surrogate Modelling by Forrester et al. (2008). Since you already have pre-existing data, you can probably skip Chapter 1 (sampling plans) and focus on Chapters 2 and 3.
Regarding your constraints:
If the constraints themselves are inexpensive to evaluate on the fly, just respect them in step 2 above (as you already suggested in your comment). If the constraints are expensive to evaluate, you may consider replacing them with a model, too.
Similar to EI, you can use the uncertainty estimate of a Kriging model to compute the probability of feasibility.
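For example, if a constraint g(x) <= 0 is modeled with a Kriging model from DiceKriging (g_fit below is a hypothetical model fitted to your constraint observations over a design with a column x), the probability of feasibility at a point is just a Gaussian tail probability:

pred <- predict(g_fit, newdata = data.frame(x = 0.5), type = "UK")
pof  <- pnorm((0 - pred$mean) / pred$sd)   # P[g(x) <= 0] under the predictive distribution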
Using the answer given by @crayfish and a detailed answer on how to construct constraints here, I was able to come up with a solution.
#define F and S here
F = c(10,10,5)
S = c(8,8,9,8,4)
# loss_fun: sum of squared errors between F and the prediction, to be minimized
loss_fun <- function(A){
  P = matrix(A, nrow = n, ncol = m, byrow = TRUE)  # proportion matrix (n x m)
  F2 = as.vector(P %*% S)                          # predicted values of F: F2_i = sum_j P_ij * S_j
  E = F - F2                                       # error
  return(sum(E * E))
}
n = length(F)
m = length(S)
# Initial solution (theta); constrOptim needs a starting point strictly inside the feasible region
P_init = rep(0.1, n * m)
# Creating constraints for constrOptim (ui %*% theta - ci >= 0):
# - first n rows:   -sum_j p_ij >= -1  (each row of P sums to at most 1)
# - next n*m rows:   p_ij >= 0         (all proportions non-negative)
vi = matrix(0, nrow = n * (m + 1), ncol = n * m)
for (i in 1:n) {
  for (j in 1:(n * m)) {
    if (j <= m * i & j > m * (i - 1)) vi[i, j] = -1
  }
}
for (i in (n + 1):(n * m + n)) {
  for (j in 1:(n * m)) {
    if ((i - n) == j) vi[i, j] = 1
  }
}
myci = c(rep(-1, n), rep(0, n * m))  # right-hand sides of the constraints above
# check that the initial value is strictly inside the feasible region (all entries must be > 0)
vi %*% P_init - myci
# run the optimization (with grad = NULL, constrOptim falls back to Nelder-Mead)
z = constrOptim(P_init, loss_fun, NULL, ui = vi, ci = myci)
# result, reshaped the same way as P inside loss_fun
P_final = matrix(z$par, nrow = n, ncol = m, byrow = TRUE)
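A quick sanity check on the result, comparing the fit against F and verifying the constraints:

P_final %*% S      # predicted values, should be close to F
rowSums(P_final)   # each row sum should be <= 1
range(P_final)     # all proportions should be >= 0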
Since the construction of the constraints may not be easy to follow from the code alone, here is a visualization of the constraints for n = 3 and m = 4:
-1. p11 + -1. p12 + -1. p13 + -1. p14 >= -1
-1. p21 + -1. p22 + -1. p23 + -1. p24 >= -1
-1. p31 + -1. p32 + -1. p33 + -1. p34 >= -1
p11 >= 0
p12 >= 0
p13 >= 0
p14 >= 0
p21 >= 0
p22 >= 0
p23 >= 0
p24 >= 0
p31 >= 0
p32 >= 0
p33 >= 0
p34 >= 0
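If you want to generate a listing like this directly from vi and myci (handy for double-checking the matrix construction), a small helper along these lines works; the name pretty_constraints is just an illustrative choice:

pretty_constraints = function(ui, ci) {
  vars = paste0("p", rep(1:n, each = m), rep(1:m, times = n))  # p11, p12, ..., pnm
  for (r in 1:nrow(ui)) {
    nz = which(ui[r, ] != 0)
    cat(paste(ui[r, nz], vars[nz], sep = "*", collapse = " + "), ">=", ci[r], "\n")
  }
}
pretty_constraints(vi, myci)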
Best Answer
Both packages, alabama and Rsolnp, contain "[i]mplementations of the augmented lagrange multiplier method for general nonlinear optimization", as the Optimization task view says, and are quite reliable and robust. They can handle equality and inequality constraints, again defined as (nonlinear) functions.
I have worked with both packages. Sometimes, constraints are a bit easier to formulate with Rsolnp, whereas alabama appears to be a bit faster at times.
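To connect this with the constrOptim code above, here is a sketch of the same problem with alabama::auglag, reusing loss_fun, n, m and P_init from that example; auglag expects the inequality constraints as a function hin() that is non-negative at feasible points:

library(alabama)
hin <- function(A) {
  P = matrix(A, nrow = n, ncol = m, byrow = TRUE)
  c(1 - rowSums(P), A)   # row sums of P at most 1, all proportions non-negative
}
z2 = auglag(par = P_init, fn = loss_fun, hin = hin)
P_final2 = matrix(z2$par, nrow = n, ncol = m, byrow = TRUE)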
There is also the package Rdonlp2, which relies on an external software library that is well known in the optimization community. Unfortunately, its license status is a bit uncertain at the moment.