# Stable Bayesian optimization


## Abstract

Tuning the hyperparameters of machine learning models is important for their performance. Bayesian optimization has recently emerged as a *de-facto* method for this task. Hyperparameter tuning is usually performed by looking at model performance on a validation set, and Bayesian optimization is used to find the hyperparameter set corresponding to the best model performance. However, in many cases the function representing the model performance on the validation set contains several spurious sharp peaks due to limited datapoints. In such cases, Bayesian optimization has a tendency to converge to these sharp peaks instead of other, more stable peaks. When a model trained using these hyperparameters is deployed in the real world, its performance suffers dramatically. We address this problem through a novel stable Bayesian optimization framework. We construct two new acquisition functions that help Bayesian optimization avoid converging to sharp peaks. We conduct a theoretical analysis and guarantee that Bayesian optimization using the proposed acquisition functions prefers stable peaks over unstable ones. Experiments with synthetic function optimization and hyperparameter tuning for support vector machines show the effectiveness of our proposed framework.

## Keywords

Bayesian optimization, Gaussian process, Stable Bayesian optimization, Acquisition function

## 1 Introduction

Bayesian optimization has recently emerged as a *de-facto* method to tune complex machine learning algorithms [21]. In tuning, the goal is to train a classifier at the right complexity so that it neither overfits nor underfits. Performance on a validation set is used as an indicator of the fitting, and it is expected to peak at the hyperparameters corresponding to the right complexity and exhibit lower values at other hyperparameters. Thus, to tune a machine learning algorithm, Bayesian optimization is employed in the pursuit of the peak validation set performance. However, in some situations, especially when the training or the validation dataset is small, spurious peaks appear along the performance surface (e.g., Fig. 1). These peaks tend to be distributed randomly over low-performance regions. They are characteristically different from the peak corresponding to the right complexity in two ways: (a) they tend to be narrow, and (b) they vanish when tested on a large test dataset, whereas the right peak remains stable. Due to the latter difference, a Bayesian optimization method that does not explicitly avoid these spurious peaks may converge to one of them and can result in a badly tuned system with inexplicably low performance during real-world deployment. To the best of our knowledge, we are the first to identify and analyze this issue of spurious peaks and its serious downsides.

The existence of multiple peaks with different widths along an optimization surface is prevalent in many real-world systems. For some of them, the end result of optimization can be dramatically affected by whether the optimization has converged to a wide peak or a narrow peak. For example, in alloy design [23], one of the main goals is to find the mixing proportion of a set of elements with the highest value of a physical property (e.g., strength, ductility). However, alloy making is an imprecise process. Due to the impurities in the raw material, the elements can never be mixed at exactly the desired proportion. Therefore, if the desired proportion lies at a narrow peak, the performance of the alloy would not be stable when made repeatedly, as even a small difference in impurities could result in a dramatic loss in performance. Hence, being able to avoid narrow peaks in favor of more stable peaks is a critical factor of success for several different applications of Bayesian optimization. Unfortunately, until now the various downsides of reaching a narrow peak in the optimization of physical systems and processes have never been identified and addressed.

Bayesian optimization, in its simplest form, consists of a Gaussian process (GP) [17] to maintain a distribution on the objective function based on the observations so far, and a mechanism to select the next query point based on an optimistic exploration strategy. This strategy is implemented through *acquisition functions*, which can be of different types, such as expected improvement (EI) [14] or GP-UCB [20]. Based on the predictive distribution of the Gaussian process, EI computes the expected improvement over the current best observation, and the location offering the highest improvement is chosen as the next query point. GP-UCB finds the location of the highest peak of a function by judiciously combining both the mean and the variance of the GP prediction. Apart from hyperparameter tuning, Bayesian optimization has also been used for optimal sensor placement [5], gait design [12], optimal path planning [13], etc. While this simple strategy is powerful for many applications, there have been recent attempts to make it more widely applicable by making it feasible in high dimension [4], and adding the ability to perform transfer learning [10], multi-objective optimization [11], batch optimization [1, 16], etc. Convergence analyses of Bayesian optimization for EI [3], EI with noisy measurements [22], and GP-UCB [20] guarantee convergence to the optimum of the objective function. However, none of these methods differentiate based on the stability of peaks and can, in principle, converge to any of them if there are multiple peaks with the same height but different stability. Thus, finding an optimum where the function value is stable, while avoiding regions where function values exhibit undesired fluctuations, remains an open problem.

To address the issues with spurious peaks, we propose two new acquisition functions for Bayesian optimization that actively seek stable peaks of the objective function. Based on our definition of stability, we show that it is possible to measure the stability of a peak by subjecting the underlying Gaussian process model to input perturbation. Under input perturbation, the predictive distribution of the Gaussian process changes: at any peak, the mean decreases and the variance increases. More importantly, for two peaks of the same height, the narrower peak will have a lower mean and a higher variance than the other peak. Furthermore, we show that the variance can be effectively decomposed as a sum of two parts: (a) the *epistemic variance* due to the limited number of samples, and (b) the *aleatoric variance* arising from the interaction between the curvature of the function and the input perturbation. The narrower a peak is, the higher the aleatoric variance will be around that peak. Therefore, the aleatoric variance can be used as an effective measure of the instability of a peak. *Two acquisition functions* are proposed in line with GP-UCB and EI that, while exploiting the usual combination of mean and variance, also penalize instability. Theoretically, we prove that under mild assumptions, when two peaks are of the same height, the proposed acquisition functions always favor the more stable peak. We compare our method with a standard Bayesian optimization implementation on both synthetic function optimization and real-world hyperparameter tuning. For the synthetic case, we create a function that has both stable regions and spurious regions; experiments with this function show that our proposed method converges to stable regions more often than the baseline. For real-world applications, we demonstrate tuning the hyperparameters of support vector machines on two real datasets. Experimental results clearly demonstrate that our proposed method converges to a stable peak, whereas standard Bayesian optimization converges to an unstable peak, and hence the SVMs tuned by our method perform better on test sets.

Given these concerns about the stability of optimization, our proposed method can be applied widely to real-world problems, especially in industrial settings. In practice, it is almost impossible to set the parameters of machines precisely. Depending on the process/plant settings, the outcome of industrial processes can be noticeably different even with a small modification in parameters. By choosing stable peaks rather than unstable ones, our proposed method allows us to find a favorable set of parameters and minimize the effects of instability, ultimately leading to robust designs and the desired products.

## 2 Background

Bayesian optimization seeks the optimum of an expensive blackbox function *f*. This optimization problem can be formally defined as:

$$\mathbf {x}^{*}=\mathop {\mathrm {argmax}}_{\mathbf {x}\in \mathcal {X}}f(\mathbf {x})$$

Although *f* is a blackbox function without a closed-form expression, it can be evaluated at any point \(\mathbf {x}\) in the domain. Given a few input and output values of the function *f*, Bayesian optimization iteratively suggests samples for evaluation to find the optimum value of the function. Unlike common convex optimizers, it does not require the gradient of the function. Bayesian optimization uses all the information available from the observations \(\mathbf {x}\) and \(f(\mathbf {x})\) for reasoning, rather than relying only on local gradients. Thus, it is useful for optimizing expensive blackbox functions.

Bayesian optimization consists of two main components. The first is a meta model that can be evaluated at any point with uncertainty. This meta model uses prior knowledge about the cost function, such as smoothness, together with the known datapoints to update our beliefs about the function. There are plenty of choices for this model, such as Gaussian processes, Bayesian neural networks [19], and random forests [2]. The second component of a Bayesian optimization algorithm is an acquisition function that suggests where to evaluate the function next. Using this function, the original problem becomes the optimization of a far less expensive non-convex function. The acquisition function maintains a trade-off between exploration (where the posterior distribution has high uncertainty) and exploitation (where the objective function has a high predictive value). This technique minimizes the number of cost function evaluations. In this work, we use a Gaussian process as the meta model, and upper confidence bound (UCB) and expected improvement (EI) as the acquisition functions.
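As a concrete illustration of these two components (a toy sketch, not the authors' implementation), the loop below refits a GP surrogate after each evaluation and picks the next query by maximizing a UCB acquisition over a candidate grid. The test function, kernel length scale, and `kappa` value are all illustrative choices:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayesian_optimization(f, bounds, n_iter=20, kappa=2.0, seed=0):
    """Minimal 1-D Bayesian optimization: GP meta model + UCB acquisition."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(3, 1))            # small initial design
    y = np.array([f(x[0]) for x in X])
    grid = np.linspace(lo, hi, 500).reshape(-1, 1)  # candidate query points
    for _ in range(n_iter):
        # Component 1: refit the meta model (a GP) to all observations so far
        gp = GaussianProcessRegressor(kernel=RBF(0.5), alpha=1e-4).fit(X, y)
        mu, sigma = gp.predict(grid, return_std=True)
        # Component 2: the acquisition trades off exploitation (mu) against
        # exploration (sigma); query where it is maximal
        x_next = grid[np.argmax(mu + kappa * sigma)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmax(y), 0], y.max()

# toy objective with its maximum at x = 1
x_best, y_best = bayesian_optimization(lambda x: -(x - 1.0) ** 2, (-2.0, 4.0))
```

With only 23 evaluations of the objective, the loop localizes the maximum; the expensive function is queried only where the acquisition deems it worthwhile.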

### 2.1 Gaussian process

### 2.2 Acquisition functions

An acquisition function is constructed from the Gaussian process posterior after *t* observations. Typically, the acquisition function is defined such that a high value of it *potentially* leads to a high value of the objective function *f*. The trade-off between exploring highly uncertain regions and exploiting promising areas is also represented in the acquisition function. In this section, we discuss two popular choices for the acquisition function: upper confidence bound and expected improvement.

#### 2.2.1 Upper confidence bound

GP-UCB combines the predictive mean and standard deviation to decide where to evaluate *f* next. Srinivas et al. [20] proved that if \(\kappa _{t}=2\log \left( t^{2}2\pi ^{2}/3\delta \right) +2d\log \left( t^{2}dbr\sqrt{\log \left( 4da/\delta \right) }\right) \), GP-UCB achieves an upper bound on the cumulative regret \(\mathcal {R}_{T}=\sum _{t=1}^{T}\left( f\left( \mathbf {x}^{*}\right) -f\left( \mathbf {x}_{t}\right) \right) \) of order \(\mathcal {O}\left( \sqrt{T\gamma _{T}\kappa _{T}}\right) \forall T\ge 1\) with probability at least \(1-\delta \), where \(\gamma _{T}\) is the maximum information gain after *T* iterations, the search space is \([0,r]^{d}\) with some \(r>0\), and \(a,b>0\) are constants related to the function smoothness.
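The theoretical schedule for \(\kappa_{t}\) above can be computed directly. The sketch below uses illustrative values for \(d\), \(\delta\), \(a\), \(b\), and \(r\) (in practice, much smaller exploration weights are typically used than this theoretical bound suggests):

```python
import numpy as np

def gp_ucb_kappa(t, d=1, delta=0.1, a=1.0, b=1.0, r=1.0):
    """Theoretical exploration weight kappa_t from Srinivas et al. [20].
    The constants d, delta, a, b, r are illustrative placeholders here."""
    term1 = 2 * np.log(t ** 2 * 2 * np.pi ** 2 / (3 * delta))
    term2 = 2 * d * np.log(t ** 2 * d * b * r * np.sqrt(np.log(4 * d * a / delta)))
    return term1 + term2

# kappa_t grows with t: exploration is weighted more heavily over time
kappas = [gp_ucb_kappa(t) for t in range(1, 6)]
```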

#### 2.2.2 Expected improvement

#### 2.2.3 Maximizing the acquisition function

In order to find the point where we next evaluate \(f(\mathbf {x})\), we have to maximize the acquisition function. Unlike the original objective function, the two acquisition functions of Eq. (4) and Eq. (6) are cheap to evaluate. They can be maximized using standard optimization techniques such as local optimizers, sequential quadratic programming, or global optimizers such as DIRECT [8].
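A multi-start local optimizer is one common lightweight alternative to DIRECT for this inner problem. The sketch below (with an illustrative toy acquisition function) maximizes a cheap acquisition using L-BFGS-B restarts from random points:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_starts=10, seed=0):
    """Multi-start L-BFGS-B maximization of a cheap acquisition function."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    best_x, best_val = None, -np.inf
    for x0 in rng.uniform(lo, hi, size=(n_starts, len(lo))):
        # minimize the negated acquisition from each random start
        res = minimize(lambda x: -acq(x), x0,
                       bounds=list(zip(lo, hi)), method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val

# toy acquisition with its maximum at x = 2
x_star, v = maximize_acquisition(lambda x: np.exp(-((x[0] - 2.0) ** 2)),
                                 [(-5.0, 5.0)])
```

Multiple restarts guard against the non-convexity of the acquisition surface; each individual L-BFGS-B run may stall in a flat region or at a local maximum.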

## 3 The proposed framework

We present two new acquisition functions for Bayesian optimization designed to maximize a blackbox function such that maxima in stable regions are preferred over maxima in relatively unstable regions. We first discuss the notion of stability and then describe how a Gaussian process model is modified in the presence of perturbation in the input variables. Next, we use the predictive distribution of the modified Gaussian process to formulate two novel acquisition functions: STABLE-UCB and STABLE-EI. We theoretically analyze the proposed acquisition functions and prove that they are guaranteed to take higher values in more stable regions; thus, Bayesian optimization using these acquisition functions has a higher tendency to sample from more stable regions. Finally, we present an algorithm summarizing the proposed stable Bayesian optimization.

### 3.1 Stability of Gaussian process prediction

Let \(\mathcal {D}_{t}\) denote the set of *t* observations of the function *f*. Using \(\mathcal {D}_{t}\), for a new input \(\mathbf {x}\), the predictive distribution of the corresponding output \(y=f(\mathbf {x})\) is given as

We refer to this as the *perturbation-free* case. We also use \(m_{t}(\mathbf {x},\varvec{\Sigma }_{\mathbf {x}})\) and \(v_{t}(\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\) to denote the mean and variance of the predictive distribution \(p(y|\mathcal {D}_{t},\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\); then, with the Gaussian approximation, we can write

In the following, we utilize the above analysis to define two novel acquisition functions to propose a stable Bayesian optimization framework.

### 3.2 Stable Bayesian optimization

Having a closed-form expression for the predictive mean and variance as in (9) and (10) provides the tractability required to formulate an acquisition function for ‘stable’ Bayesian optimization. In the expression for the predictive variance in (10), we note that the variance \(v_{t}(\mathbf {x},{\varvec{\Sigma }}_{\mathbf {x}})\) has two components: (1) the *epistemic* variance (uncertainty) term \(\sigma _{t}^{2}(\mathbf {x})\), arising from our lack of knowledge about the function value, mainly due to the finite set of observations, and (2) the *aleatoric* variance term \(\sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})\) [further detailed in (11)], arising from the inherent variation in the function around \(\mathbf {x}\). We associate the notion of stability with this aleatoric variance, which takes higher values in regions where the function varies faster. In the remainder of this section, we use this property to define two new acquisition functions that yield stable Bayesian optimization, resulting in a solution where the function value is robust to small perturbations.
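The decomposition can be illustrated with a Monte Carlo surrogate for the closed forms in (9)–(11): fit a GP to a toy function with one wide and one narrow peak (both of height 1), then perturb the query point with Gaussian noise. The function, length scale, and perturbation width below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# toy objective: wide stable peak at x = -2, narrow spurious peak at x = 2
f = lambda x: np.exp(-0.5 * (x + 2.0) ** 2) + np.exp(-50.0 * (x - 2.0) ** 2)

X = np.linspace(-4, 4, 80).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=RBF(0.3), optimizer=None,
                              alpha=1e-5).fit(X, f(X.ravel()))

def perturbed_stats(x, sigma_x=0.15, n=2000, seed=0):
    """Monte Carlo surrogate for the closed forms: predictive mean/variance
    of the GP when the query x is perturbed by N(0, sigma_x^2)."""
    rng = np.random.default_rng(seed)
    xs = (x + sigma_x * rng.standard_normal(n)).reshape(-1, 1)
    mu, sd = gp.predict(xs, return_std=True)
    m = mu.mean()                   # analogue of m_t(x, Sigma_x)
    v_epistemic = (sd ** 2).mean()  # analogue of sigma_t^2(x)
    v_aleatoric = mu.var()          # variation induced by the curvature of f
    return m, v_epistemic, v_aleatoric

m_wide, _, a_wide = perturbed_stats(-2.0)
m_narrow, _, a_narrow = perturbed_stats(2.0)
```

As the text predicts, the narrow peak shows a lower perturbed mean and a much larger aleatoric variance than the wide peak of equal height, which is exactly the signal the STABLE acquisition functions exploit.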

#### 3.2.1 The STABLE-UCB acquisition function:

Denoting the epistemic and aleatoric variances at iteration *t* by \(\sigma _{t}^{2}(\mathbf {x})\) and \(\sigma _{t,a}^{2}(\mathbf {x},\varSigma _{\mathbf {x}})\), respectively, we define the STABLE-UCB acquisition function as:

where \(\kappa _{t}\) is a *t*-dependent weight that balances exploitation and exploration, and \(\lambda >0\) is a fixed weight that sets our penalty on instability. In the above formulation, our intuition is to *penalize* points where the function varies fast with a small change in \(\mathbf {x}\). In our implementation, to balance the epistemic and the aleatoric variance, we set \(\lambda \) equal to \(\kappa _{t}\).
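Since Eq. (12) is not reproduced above, the following is one plausible reading of it based on the surrounding text (an assumption, not the authors' exact formula): a UCB score on the perturbed mean plus epistemic exploration, minus an aleatoric instability penalty.

```python
import numpy as np

def stable_ucb(m, sigma_epi, sigma_alea, kappa, lam=None):
    """Plausible STABLE-UCB form: reward the perturbed mean m, explore via the
    epistemic std sigma_epi, penalize the aleatoric std sigma_alea.
    lam defaults to kappa, mirroring the choice lambda = kappa_t in the text."""
    lam = kappa if lam is None else lam
    return m + np.sqrt(kappa) * sigma_epi - lam * sigma_alea

# two peaks of near-equal height with equal epistemic uncertainty:
# the stable one has low aleatoric variance, the spurious one high
a_stable = stable_ucb(m=0.98, sigma_epi=0.1, sigma_alea=0.02, kappa=2.0)
a_unstable = stable_ucb(m=0.95, sigma_epi=0.1, sigma_alea=0.30, kappa=2.0)
```

With this form, the stable peak scores higher even though the two peaks are nearly the same height, which is the behavior Theorem 1 formalizes.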

#### 3.2.2 The STABLE-EI acquisition function:

The STABLE-EI acquisition function is defined by adding the input perturbation at iteration *t* to the improvement function:

where \(y^{+}\) is the best observed function value up to the *t*th iteration.

#### 3.2.3 Theoretical analysis:

In this section, we analyze the proposed acquisition functions to provide theoretical guarantees that the acquisition functions \(a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) and \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\) indeed prefer less sharp peaks of the function \(f(\mathbf {x})\).

## Definition 1

(Identical data topology): Any two points \(\mathbf {x}\), \(\mathbf {x}'\) are said to have identical data topology if for each observation \(\mathbf {x}_{i}\), there exists another observation \(\mathbf {x}_{i'}\) such that \(||\mathbf {x}-\mathbf {x}_{i}||=||\mathbf {x}'-\mathbf {x}_{i'}||\).

A consequence of the identical data topology is that for points \(\mathbf {x}\), \(\mathbf {x}'\), any distance-based kernel induces Gram matrices that are equal up to a permutation. With an increasing set of observations, it is not difficult to achieve identical data topology approximately.

## Lemma 1

If \(\mathbf {x}\), \(\mathbf {x}'\) are the two highest peaks in the support of a function *f* such that \(|f(\mathbf {x})-f(\mathbf {x}')|<\eta _{0}\) for small \(\eta _{0}\), and *f* locally varies faster around \(\mathbf {x}'\) compared to \(\mathbf {x}\) in a small \(h_{0}\)-neighborhood, i.e., \(|\frac{f(\mathbf {x+h})-f(\mathbf {x})}{f(\mathbf {x'+h})-f(\mathbf {x}')}|<1\), \(\forall h\in (-h_{0},h_{0})\), then under certain mild assumptions, the following relations hold true:

## Proof

To avoid favoring either peak, let us assume that there are sufficiently many observations around both \(\mathbf {x}\), \(\mathbf {x}'\) so that the two points have identical data topology. Under this mild assumption, for each observation \(\mathbf {x}_{i}\) we have a paired observation \(\mathbf {x}_{i'}\) such that \(||\mathbf {x}-\mathbf {x}_{i}||=||\mathbf {x}'-\mathbf {x}_{i'}||\). This implies that the covariance values \(k(\mathbf {x},\mathbf {x}_{i})=k(\mathbf {x}',\mathbf {x}_{i'})\) and \(k_{1}(\mathbf {x},\mathbf {x}_{i})=k_{1}(\mathbf {x}',\mathbf {x}_{i'})\). By definition, \({{\varvec{\beta }}}=\mathbf {K}^{-1}\mathbf {y}\). Since the peak at \(\mathbf {x}'\) is sharper than the peak at \(\mathbf {x}\), we have \(y_{i'}\le y_{i}\) and therefore \(\beta _{i'}\le \beta _{i}\). Hence, \(\sum _{i=1}^{t}\beta _{i}k(\mathbf {x},\mathbf {x}_{i})k_{1}(\mathbf {x},\mathbf {x}_{i})\ge \sum _{i'=1}^{t}\beta _{i'}k(\mathbf {x}',\mathbf {x}_{i'})k_{1}(\mathbf {x}',\mathbf {x}_{i'})\), or \(m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\).

Next, we also note that due to the identical data topology assumption around both peaks, we have equal epistemic uncertainties, *i.e.,* \(\sigma _{t}(\mathbf {x},\varSigma _{\mathbf {x}})=\sigma _{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) by definition of \(\sigma _{t}(\mathbf {x})\) in (3).

Finally, we show that \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\le \sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\). For this, consider the aleatoric variance term in (11). As above, we have the following relations: \(k(\mathbf {x},\mathbf {x}_{i})=k(\mathbf {x}',\mathbf {x}_{i'})\), \(k_{1}(\mathbf {x},\mathbf {x}_{i})=k_{1}(\mathbf {x}',\mathbf {x}_{i'})\), \(\beta _{i'}\le \beta _{i}\) and additionally, \(k_{2}(\mathbf {x},\bar{\mathbf {x}}_{ij})=k_{2}(\mathbf {x}',\bar{\mathbf {x}}_{i'j'})\) . Using these relations, it is straightforward to show that \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\le \sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\). \(\square \)

Next, we state and prove our key results for the newly proposed acquisition functions.

## Theorem 1

(STABLE-UCB case) If \(\mathbf {x}\), \(\mathbf {x}'\) are the two highest peaks in the support of a function *f* such that \(|f(\mathbf {x})-f(\mathbf {x}')|<\eta _{0}\) for small \(\eta _{0}\), and *f* locally varies faster around \(\mathbf {x}'\) compared to \(\mathbf {x}\) in a small \(h_{0}\)-neighborhood, i.e., \(|\frac{f(\mathbf {x+h})-f(\mathbf {x})}{f(\mathbf {x'+h})-f(\mathbf {x}')}|<1\), \(\forall h\in (-h_{0},h_{0})\), the acquisition function in (12) satisfies the relation: \(a_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge a_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) under certain mild assumptions.

## Proof

## Theorem 2

(STABLE-EI case) If \(\mathbf {x}\), \(\mathbf {x}'\) are the two highest peaks in the support of a function *f* such that \(|f(\mathbf {x})-f(\mathbf {x}')|<\eta _{0}\) for small \(\eta _{0}\), and *f* locally varies faster around \(\mathbf {x}'\) compared to \(\mathbf {x}\) in a small \(h_{0}\)-neighborhood, i.e., \(|\frac{f(\mathbf {x+h})-f(\mathbf {x})}{f(\mathbf {x'+h})-f(\mathbf {x}')}|<1\), \(\forall h\in (-h_{0},h_{0})\), the acquisition function in (13) satisfies the relation: \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge b_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) under certain mild assumptions.

## Proof

*i.e.,* \(b_{t}(\mathbf {x},\varSigma _{\mathbf {x}})\ge b_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\). From the definition of \(z_{t}\) in (13) and Lemma 1, we have

## Remark

Theorems 1 and 2 above cover an important case: when the peaks in both stable and unstable regions are approximately equal in height, a Bayesian optimization algorithm using the acquisition functions in (12) and (13) will prefer the peak from the stable region. When a peak of the unstable region is higher than a peak of the stable region, the two terms \(m_{t}(\mathbf {x},\varSigma _{\mathbf {x}})-m_{t}(\mathbf {x}',\varSigma _{\mathbf {x}})\) and \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})-\sigma _{t,a}(\mathbf {x}',\varSigma _{\mathbf {x}})\) act against each other, and their net difference decides whether the algorithm suggests a point from the stable region or the unstable region. Since the parameters \(\lambda \) and \(\omega \) are user-specified, sufficiently large values of them always guarantee the suggestion of the stable peak. When a peak of the unstable region is lower than a peak of the stable region, both standard and stable Bayesian optimization select the stable peak.

#### 3.2.4 Computational complexity and convergence analysis:

In this section, we discuss the computational complexity and the convergence of the proposed stable Bayesian optimization algorithm.

*Computational complexity* Since the difference between our stable Bayesian optimization algorithm and the standard one is the acquisition function, we will focus our attention on the complexity analysis of acquisition function computation. In the standard Bayesian optimization algorithm, the complexity of UCB and EI for *T* observed datapoints is \(\mathcal {O}(T^{3})\). In our proposed acquisition functions (12) and (13), the complexity of computing the mean \(m_{T}(\mathbf {x},\varSigma _{\mathbf {x}})\), the epistemic variance \(\sigma _{T}(\mathbf {x},\varSigma _{\mathbf {x}})\) and the aleatoric variance \(\sigma _{T,a}(\mathbf {x},\varSigma _{\mathbf {x}})\) for *T* observations are all \(\mathcal {O}(T^{3})\). Therefore, our proposed algorithm has the same computational complexity as the standard Bayesian optimization algorithm.

*Convergence* From the definition of the acquisition functions in (12) and (13), the main difference between the proposed acquisition functions and UCB/EI is the additional aleatoric variance term \(\sigma _{t,a}(\mathbf {x},\varSigma _{\mathbf {x}})\). After a sufficiently large number of iterations \(T_{0}\), the Gaussian process models the function \(f(\mathbf {x})\) fairly accurately and the aleatoric variance becomes nearly a fixed function. The addition of the aleatoric variance term to the acquisition function can then be interpreted as a constrained optimization of the blackbox function \(f(\mathbf {x})\) under the constraint that the aleatoric variance is smaller than a specified value. This problem has been well studied theoretically and shown to be effective in practice by Gelbart et al. [6].

## 4 Experiments

In this section, we experiment on a set of synthetic and real datasets to demonstrate the efficacy of our stable Bayesian optimization. Experiments with the synthetic dataset show the behavior of our proposed method on a known, complex function with multiple sharp peaks and one stable peak. We also conduct experiments with several hyperparameter tuning problems to show the utility of our method in real-world applications.

### 4.1 Baseline method and evaluation measures

We compare stable Bayesian optimization using the proposed acquisition functions (STABLE-UCB and STABLE-EI) with standard Bayesian optimization using the UCB acquisition function (BO-UCB) and the EI acquisition function (BO-EI), respectively. On *synthetic* data, we compare STABLE-UCB and STABLE-EI with the corresponding baselines in two aspects: ‘the maximum value found’ and ‘the number of times an algorithm visits around the highest stable peak’^{1} with respect to the number of iterations. On *real* data, we show the performance of stable Bayesian optimization and standard Bayesian optimization on both validation and test sets. For a fair comparison, we compare STABLE-UCB with standard UCB and STABLE-EI with standard EI.

### 4.2 Experiments with the synthetic function

#### 4.2.1 Data generation:

#### 4.2.2 Experimental results:

We initialize Bayesian optimization with 2 random observations. Figure 3b illustrates the value of the STABLE-UCB acquisition function and the aleatoric variance after 30 iterations. In the unstable region, the STABLE-UCB acquisition function has a smaller value than in the stable region, due to the high aleatoric variance capturing the instability. We observe similar results for the STABLE-EI acquisition function.
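The exact synthetic function is not reproduced in this section; the following toy analogue, with one wide stable peak and a few narrow spurious spikes at assumed locations, mirrors the structure described (one stable region, several unstable ones):

```python
import numpy as np

def synthetic_objective(x, seed=7):
    """Toy analogue of the synthetic benchmark: a wide stable peak at x = 2
    plus narrow spurious spikes at randomly chosen centers (all illustrative)."""
    rng = np.random.default_rng(seed)
    stable = np.exp(-0.5 * ((x - 2.0) / 1.0) ** 2)       # wide, stable peak
    spikes = sum(np.exp(-0.5 * ((x - c) / 0.05) ** 2)    # narrow spurious peaks
                 for c in rng.uniform(-8.0, -3.0, size=4))
    return stable + spikes

xs = np.linspace(-10.0, 10.0, 4001)
ys = synthetic_objective(xs)
```

A standard optimizer that happens to sample near a spike sees a value rivaling the stable peak, yet a perturbation of a few hundredths in `x` destroys it; the stable peak at `x = 2` tolerates perturbations an order of magnitude larger.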

### 4.3 Experiments with hyperparameter tuning problems

#### 4.3.1 Dataset:

We use the letter and glass classification datasets from the UCI machine learning repository.^{2} The letter dataset contains 20,000 datapoints describing image characteristics of the 26 capital letters in the English alphabet. Since spurious peaks occur mostly when *the training set or the validation set has a limited number of datapoints*, we sample only 200 datapoints from the letter dataset. The glass dataset consists of 214 datapoints represented using 10 features related to glass properties. Both datasets are divided into a training set, a validation set and a test set. The validation set accuracy is used as the objective for optimization, and the test set accuracy is used as the performance measure.

#### 4.3.2 Experimental results with support vector machine:

Support vector machine (SVM) is a popular machine learning algorithm for classification problems. The two main hyperparameters of an SVM with the RBF kernel are *C* and \(\gamma \), which represent the misclassification trade-off and the RBF kernel parameter, respectively. We apply both stable Bayesian optimization and standard Bayesian optimization to tune *C* and \(\gamma \). In the experiments, the objective function \(f(\mathbf {x})\) is the validation set accuracy and the vector \(\mathbf {x}\) represents the hyperparameters *C* and \(\gamma \). Performance on the test set is used to compare the proposed method and the baseline.
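A minimal sketch of this objective is shown below, using scikit-learn's digits data as a stand-in for the UCI datasets (an assumption; the paper's data splits and search ranges are not given here) and parameterizing \(C\) and \(\gamma\) in log space:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative objective f(x) = validation accuracy of an RBF-kernel SVM.
# The digits dataset stands in for the UCI letter/glass data used in the paper.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

def f(log2_C, log2_gamma):
    """Blackbox objective: train an SVM at (C, gamma) and score on validation."""
    clf = SVC(C=2.0 ** log2_C, gamma=2.0 ** log2_gamma, kernel="rbf")
    clf.fit(X_train, y_train)
    return clf.score(X_val, y_val)  # validation accuracy in [0, 1]

acc = f(0.0, -11.0)  # one example query of the blackbox
```

Each evaluation of `f` requires a full SVM training run, which is exactly why a sample-efficient optimizer such as Bayesian optimization is used on this surface rather than a dense grid search.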

Figure 5 shows the peaks converged to by our proposed STABLE-UCB and by BO-UCB over 30 different initializations. As seen from the figure, BO-UCB converges to spurious peaks considerably more often than STABLE-UCB. This behavior leads to the accuracy results shown in Fig. 6. Figure 6a shows the performance of STABLE-UCB and BO-UCB on the validation set. We note that this is a multi-class classification task, hence a random classifier would have a mean accuracy of only \(1/26=0.0385\). After 120 iterations, STABLE-UCB’s best accuracy on the validation set is 0.35, whereas BO-UCB’s best is 0.375. However, when we move to the test set and compare the two methods using the hyperparameters optimized on the validation set, we find that STABLE-UCB’s performance is higher than BO-UCB’s (see Fig. 6b). After 120 iterations, STABLE-UCB’s performance remains high at 0.44, whereas BO-UCB reaches only 0.41. Figure 6c shows the performance of STABLE-EI and BO-EI on the validation set of the letter dataset. After 140 iterations, STABLE-EI reaches 0.35, whereas BO-EI reaches 0.36. However, on the test set, the performance of STABLE-EI remains at 0.43, whereas BO-EI’s best accuracy is 0.41 after 140 iterations (see Fig. 6d).

We observed similar behavior of the two algorithms on the glass dataset (Fig. 7). On the validation set, BO-UCB and BO-EI converge to a higher accuracy (both 0.58) than STABLE-UCB and STABLE-EI (both 0.56); however, on the test set, stable Bayesian optimization stays above 0.56, whereas standard Bayesian optimization drops below 0.52.

Our experiments with SVM hyperparameter tuning demonstrate that spurious peaks indeed abound when training and validation sets are small. The proposed stable Bayesian optimization successfully reduces convergence to such peaks.

## 5 Conclusion

We proposed a stable Bayesian optimization framework to find stable solutions for hyperparameter tuning. We discussed the notion of stability and presented a modified Gaussian process model in the presence of noisy inputs. We constructed two novel acquisition functions based on the epistemic and aleatoric variances of the modified Gaussian process estimates. The aleatoric variance becomes high in unstable regions around spurious narrow peaks and thus offers a way to guide the function optimization toward stable regions. We theoretically showed that our proposed acquisition functions favor stable regions over unstable ones. We discussed the computational complexity and convergence of our proposed algorithm. Through experiments with both synthetic function optimization and hyperparameter tuning for an SVM classifier, we demonstrated the utility of our proposed framework.

The proposed stable Bayesian optimization has advantages over standard Bayesian optimization when the number of datapoints is limited. Thus, it can be applied in real-world domains such as health care, bioinformatics, and materials design. It also opens promising directions for future work, such as privacy-preserving stable Bayesian optimization and multi-objective stable Bayesian optimization.


## Acknowledgements

This research was partially funded by the Australian Government through the Australian Research Council (ARC) and the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. Prof. Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

## Compliance with ethical standards

## Conflicts of interest

All the authors declare that they have no conflict of interest.

## References

1. Azimi, J., Fern, A., Fern, X.Z.: Batch Bayesian optimization via simulation matching. Adv. Neural Inf. Process. Syst. **1**, 109–117 (2010)
2. Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010)
3. Bull, A.D.: Convergence rates of efficient global optimization algorithms. J. Mach. Learn. Res. **12**, 2879–2904 (2011)
4. Chen, B., Castro, R., Krause, A.: Joint optimization and variable selection of high-dimensional Gaussian processes. arXiv preprint arXiv:1206.6396 (2012)
5. Garnett, R., Osborne, M.A., Roberts, S.J.: Bayesian optimization for sensor set selection. In: IPSN (2010)
6. Gelbart, M.A., Snoek, J., Adams, R.P.: Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607 (2014)
7. Girard, A., Murray-Smith, R.: Gaussian processes: prediction at a noisy input and application to iterative multiple-step ahead forecasting of time-series. In: Murray-Smith, R., Shorten, R. (eds.) Switching and Learning in Feedback Systems, pp. 158–184. Springer, Berlin (2005)
8. Jones, D.R., Perttunen, C.D., Stuckman, B.E.: Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl. **79**(1), 157–181 (1993)
9. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Glob. Optim. **13**(4), 455–492 (1998)
10. Joy, T.T., Rana, S., Gupta, S.K., Venkatesh, S.: Flexible transfer learning framework for Bayesian optimisation. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 102–114. Springer, Berlin (2016)
11. Laumanns, M., Ocenasek, J.: Bayesian optimization algorithms for multi-objective optimization. In: PPSN (2002)
12. Lizotte, D.J., Wang, T., Bowling, M.H., Schuurmans, D.: Automatic gait optimization with Gaussian process regression. IJCAI **7**, 944–949 (2007)
13. Martinez-Cantin, R., et al.: A Bayesian exploration–exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Auton. Robots **27**(2), 93–103 (2009)
14. Mockus, J., Tiesis, V., Zilinskas, A.: The application of Bayesian methods for seeking the extremum. Towards Glob. Optim. **2**(117–129), 2 (1978)
15. Nguyen, T.D., Gupta, S., Rana, S., Venkatesh, S.: Stable Bayesian optimization. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 578–591. Springer, Berlin (2017)
16. Nguyen, V., Rana, S., Gupta, S.K., Li, C., Venkatesh, S.: Budgeted batch Bayesian optimization. In: 16th IEEE International Conference on Data Mining (ICDM), pp. 1107–1112. IEEE (2016)
17. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
18. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: NIPS, pp. 2951–2959 (2012)
19. Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., Adams, R.: Scalable Bayesian optimization using deep neural networks. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2171–2180 (2015)
20. Srinivas, N., Krause, A., Seeger, M., Kakade, S.M.: Gaussian process optimization in the bandit setting: no regret and experimental design. In: ICML (2010)
21. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: ACM SIGKDD (2013)
22. Wang, Z., de Freitas, N.: Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758 (2014)
23. Xue, D., et al.: Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. **7**, 11241 (2016)