Solved – Multivariate Gaussian processes | Bayesian optimization with multiple features (parameter optimization)

bayesian optimization, gaussian process, hyperparameter, optimization, python

My goal is very simple: I would like to optimize 3 parameters in my algorithm, and I would like to do this via multivariate Gaussian processes.

The problem is that I understand how it works for 1 feature/hyperparameter, but I don't know how to use it with multiple features/hyperparameters.

What I have so far:

  • create $x_{1:N}$ $\rightarrow$ ($x_1$ to $x_N$ over the range of interest, e.g. -7 to 3)
  • create $\mu = 0_{n}$ $\rightarrow$ (a zero vector with $n$ entries)
  • create $K_{N \times N}$ $\rightarrow$ an $N \times N$ matrix $K$, via $k(x,x') = \exp\left(-\frac{(x-x')^2}{2}\right)$
  • $K = LL^{T}$ $\rightarrow$ find the Cholesky decomposition of $K$
  • $f_{i} \sim \mathcal{N}\big(0_{n},K\big)$ $\rightarrow$ draw samples from a Gaussian distribution with mean $0$ and covariance $K$.
    In practice, draw standard zero-one Gaussian variables and multiply them by $L$ (the "square root" of $K$):
    $f_{i} = L\,u, \quad u \sim \mathcal{N}\big(0_{n},I\big)$
    (a short sketch of this sampling step follows below)
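
A minimal sketch of that last sampling step (assuming the same squared-exponential kernel with unit length scale and the example range -7 to 3):

import numpy as np

n = 100                                    # number of grid points
x = np.linspace(-7, 3, n).reshape(-1, 1)   # x_1 ... x_N on the range of interest

# squared-exponential kernel k(x, x') = exp(-(x - x')^2 / 2)
K = np.exp(-0.5 * (x - x.T)**2)

# Cholesky factor of K (a small jitter keeps the decomposition numerically stable)
L = np.linalg.cholesky(K + 1e-8*np.eye(n))

# f_i ~ N(0_n, K): draw u ~ N(0, I) and set f = L u
u = np.random.randn(n, 3)    # three independent standard-normal draws
f_prior = np.dot(L, u)       # each column is one sample from the GP prior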

— code —

The true unknown function, which we are trying to approximate (and whose lowest point we want to find):

import numpy as np
import matplotlib.pyplot as plt

f = lambda x: np.sin(0.9*x).flatten()

The kernel

def kernel(a, b):
    """ GP squared exponential kernel """
    kernelParameter = 0.5
    sqdist = np.sum(a**2,1).reshape(-1,1) + np.sum(b**2,1) - 2*np.dot(a, b.T)
    return np.exp(-.5 * (1/kernelParameter) * sqdist)
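
Note that this kernel already accepts inputs with more than one column: if a has shape (N, d) and b has shape (M, d), the pairwise squared distances are summed over all d features and an N x M matrix comes back. A quick check with made-up 3-feature points:

# made-up points with 3 features each, just to check the shapes
A = np.array([[-5.0, 0.1, 2.0],
              [ 2.2, 1.0, 0.3]])        # shape (2, 3)
B = np.array([[-7.0, 0.0, 0.0],
              [ 0.0, 0.5, 1.0],
              [ 3.0, 1.0, 2.0]])        # shape (3, 3)
print(kernel(A, B).shape)               # -> (2, 3): one entry per pair of points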

Set some parameters:

  • the first 2 starting points
  • prev = value of the previous best point
  • strike = convergence counter (when we pick the same point 3 times in a row, we stop)

knowledge = np.array([[-5],[2.2]])
prev = 999
strike = 0
print()

Loop until convergence:

for i in range(2,30):
    N = len(knowledge)         # number of training points.
    n = 100         # number of test points.
    s = 0.0000005    # noise variance


    # Noisy versions of the function evaluated at the points we know so far
    # (our starting points / accumulated knowledge)
    X = knowledge
    y = f(X) + s*np.random.randn(N)



    #Gaussian process
    K = kernel(X, X)
    L = np.linalg.cholesky(K + s*np.eye(N))


    # points we're going to make predictions at.
    Xtest = np.linspace(-7, 3, n).reshape(-1,1)

    # compute the mean at our test points.
    Lk = np.linalg.solve(L, kernel(X, Xtest))
    mu = np.dot(Lk.T, np.linalg.solve(L, y))

    # compute the variance at our test points.
    K_ = kernel(Xtest, Xtest)
    s2 = np.diag(K_) - np.sum(Lk**2, axis=0)
    stdv = np.sqrt(s2)   # posterior standard deviation (renamed so the noise variance s is not overwritten)



    # PLOTS:
    plt.figure(1)
    plt.clf()

    plt.plot(X, y, 'b+')
    plt.plot(Xtest, f(Xtest), 'b-')

    plt.gca().fill_between(Xtest.flat, mu-3*stdv, mu+3*stdv, color="#000000")
    plt.plot(Xtest, mu, 'r--', lw=2)

    plt.title('Mean predictions plus 3 st.deviations')
    plt.axis([-7, 3, -3, 3])
    plt.savefig('predictive.png', bbox_inches='tight')

    plt.show()


    #######
    # Get best next point
    # Take the lowest value --> lowest error
    # If strike == 3 then stop and we have our best value
    #######
    min_index = np.argmin(mu - 3*stdv)   # index of the lowest confidence-bound value
    x_value = Xtest[min_index]
    knowledge = np.append(knowledge, [x_value],axis=0)
    f_value = f(x_value[0])[0]
    if strike == 3:
        print(x_value)
        break

    if prev == f_value:
        strike += 1
    else:
        strike = 0
        prev = f_value
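
For reference, the mean and variance computed inside the loop are the standard GP predictive equations, evaluated via the Cholesky factor $L$ instead of an explicit matrix inverse:

$$\mu_* = K_*^T (K + sI)^{-1}\mathbf{y}, \qquad \sigma_*^2 = \operatorname{diag}(K_{**}) - \operatorname{diag}\big(K_*^T (K + sI)^{-1} K_*\big)$$

where $K = k(X,X)$, $K_* = k(X, X_{\text{test}})$ and $K_{**} = k(X_{\text{test}}, X_{\text{test}})$.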

You can see the plots here:

[plot: mean predictions plus 3 standard deviations over the search range]

But as you can see, this is only for 1 parameter/feature.
What I would like to know is: how can I do the exact same thing for 2, 3, 4, … features? (I need 3: X1, X2, X3.)

Does anyone have experience with this? Most of the documents and blogs I have read show this (multivariate Gaussian processes / Bayesian optimization) for just 1 hyperparameter, not for e.g. 7 hyperparameters.

Also, I don't have a real closed-form function like

f = lambda x1, x2, x3: np.sin(0.9*x1)**np.cos(x2) - np.sin(-x3)

I have a whole algorithm that produces one value at the end, and that value has to lie as close as possible to 0.
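
Since the real objective is a black-box algorithm rather than a closed-form expression, one option (a hypothetical sketch; run_my_algorithm stands in for whatever your pipeline is) is to wrap it so the Gaussian process only ever sees a scalar score, e.g. the absolute distance from 0:

# hypothetical wrapper: run_my_algorithm is your black-box pipeline
def objective(params):
    x1, x2, x3 = params
    result = run_my_algorithm(x1, x2, x3)   # produces the single final value
    return abs(result)                      # minimize the distance from 0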

Best Answer

It makes no difference: you just need to have your kernel accept two input vectors (with the size of your hyperparameter space) rather than two scalars. As long as your objective function and your kernel are scalar-valued, all of the math remains the same. I.e., if your kernel was:

$$\exp\left(-\frac{(x-x')^2}{2\sigma}\right)$$

you could now use:

$$\exp\left(-\frac{|\mathbf{x}-\mathbf{x'}|^2}{2\sigma}\right)$$

This form assumes that you believe your points are correlated isotropically, i.e. that your function behaves similarly in all input dimensions. If that isn't the case, you can use the matrix form with some known covariance:

$$\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{x'})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{x'})\right)$$

This is of course not the only form you can use; your kernel need not be Gaussian in the first place! You can use any function that takes two vectors from your input space and outputs a scalar as your kernel, as long as it is positive definite.
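
As a minimal sketch of how the 1-D code above carries over (assuming the isotropic squared-exponential kernel from the answer, hypothetical starting points, and a placeholder objective standing in for the real algorithm's distance from 0), the training array simply gets one column per hyperparameter and everything else stays the same:

import numpy as np

def kernel(a, b, length_scale=0.5):
    """Isotropic squared-exponential kernel for inputs of shape (N, d)."""
    sqdist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
    return np.exp(-0.5 * sqdist / length_scale)

# placeholder objective: a stand-in for the real algorithm's |output|
objective = lambda p: abs(np.sin(0.9*p[0]) * np.cos(p[1]) - np.sin(-p[2]))

# two starting points, now with 3 features each (hypothetical values)
knowledge = np.array([[-5.0, 0.1, 2.0],
                      [ 2.2, 1.0, 0.3]])
y = np.array([objective(p) for p in knowledge])

# test points: a coarse 3-D grid, flattened to shape (n_test, 3)
g = np.linspace(-7, 3, 15)
Xtest = np.array(np.meshgrid(g, g, g)).reshape(3, -1).T

# GP posterior, exactly as in the 1-D code above
s = 5e-7                                           # noise variance
K = kernel(knowledge, knowledge)
L = np.linalg.cholesky(K + s*np.eye(len(knowledge)))
Lk = np.linalg.solve(L, kernel(knowledge, Xtest))
mu = np.dot(Lk.T, np.linalg.solve(L, y))
s2 = 1.0 - np.sum(Lk**2, axis=0)                   # diag(k(x, x)) = 1 for this kernel
stdv = np.sqrt(np.maximum(s2, 0))

# pick the next point to evaluate with the same lower-bound rule
next_point = Xtest[np.argmin(mu - 3*stdv)]         # a 3-vector (x1, x2, x3)
print(next_point)

An anisotropic (ARD) variant would simply divide each input column by its own length scale before computing sqdist, which corresponds to the matrix form with a diagonal $\mathbf{\Sigma}^{-1}$ from the answer.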