1 Introduction

Increasingly, users of machine learning tools are non-experts who require off-the-shelf solutions. The machine learning community has greatly aided such users by making available a wide variety of sophisticated learning algorithms and feature selection methods through open source packages, such as WEKA [15] and mlr [7]. Such packages ask a user to make two kinds of choices: selecting a learning algorithm and customizing it by setting hyperparameters (which also control feature selection, if applicable). It can be challenging to make the right choice when faced with these degrees of freedom, leaving many users to select algorithms based on reputation or intuitive appeal, and/or to leave hyperparameters set to default values. Of course, adopting this approach can yield performance far worse than that of the best method and hyperparameter settings.

This suggests a natural challenge for machine learning: given a dataset, automatically and simultaneously choosing a learning algorithm and setting its hyperparameters to optimize empirical performance. We dub this the combined algorithm selection and hyperparameter optimization (CASH) problem; we formally define it in Sect. 4.3. There has been considerable past work separately addressing model selection, e.g., [1, 6, 8, 9, 11, 24, 25, 33], and hyperparameter optimization, e.g., [3, 4, 5, 14, 23, 28, 30]. In contrast, despite its practical importance, we were surprised to find that only limited variants of the CASH problem have been addressed in the literature; furthermore, these consider a fixed and relatively small number of parameter configurations for each algorithm (see, e.g., [22]).

A likely explanation is that it is very challenging to search the combined space of learning algorithms and their hyperparameters: the response function is noisy and the space is high dimensional, involves both categorical and continuous choices, and contains hierarchical dependencies (e.g., the hyperparameters of a learning algorithm are only meaningful if that algorithm is chosen; the algorithm choices in an ensemble method are only meaningful if that ensemble method is chosen; etc.). Another related line of work is on meta-learning procedures that exploit characteristics of the dataset, such as the performance of so-called landmarking algorithms, to predict which algorithm or hyperparameter configuration will perform well [2, 22, 26, 32]. While the CASH algorithms we study in this chapter start from scratch for each new dataset, these meta-learning procedures exploit information from previous datasets, which may not always be available.

In what follows, we demonstrate that CASH can be viewed as a single hierarchical hyperparameter optimization problem, in which even the choice of algorithm itself is considered a hyperparameter. We also show that—based on this problem formulation—recent Bayesian optimization methods can obtain high quality results in reasonable time and with minimal human effort. After discussing some preliminaries (Sect. 4.2), we define the CASH problem and discuss methods for tackling it (Sect. 4.3). We then define a concrete CASH problem encompassing a wide range of learners and feature selectors in the open source package WEKA (Sect. 4.4), and show that a search in the combined space of algorithms and hyperparameters yields better-performing models than standard algorithm selection and hyperparameter optimization methods (Sect. 4.5). More specifically, we show that the recent Bayesian optimization procedures TPE [4] and SMAC [16] often find combinations of algorithms and hyperparameters that outperform existing baseline methods, especially on large datasets.

This chapter is based on two previous papers, published in the proceedings of KDD 2013 [31] and in the Journal of Machine Learning Research (JMLR) in 2017 [20].

2 Preliminaries

We consider learning a function \(f: \mathcal {X} \mapsto \mathcal {Y}\), where \(\mathcal {Y}\) is either finite (for classification), or continuous (for regression). A learning algorithm A maps a set \(\{d_1, \dots , d_n\}\) of training data points \(d_i = ({\mathbf {x}}_i,y_i) \in \mathcal {X} \times \mathcal {Y}\) to such a function, which is often expressed via a vector of model parameters. Most learning algorithms A further expose hyperparameters λ ∈ Λ, which change the way the learning algorithm \(A_{\boldsymbol {\lambda }}\) itself works. For example, hyperparameters are used to describe a description-length penalty, the number of neurons in a hidden layer, the number of data points that a leaf in a decision tree must contain to be eligible for splitting, etc. These hyperparameters are typically optimized in an “outer loop” that evaluates the performance of each hyperparameter configuration using cross-validation.

2.1 Model Selection

Given a set of learning algorithms \(\mathcal {A}\) and a limited amount of training data \(\mathcal {D} = \{({\mathbf {x}}_1,y_1),\dots ,({\mathbf {x}}_n,y_n)\}\), the goal of model selection is to determine the algorithm \(A^* \in \mathcal {A}\) with optimal generalization performance. Generalization performance is estimated by splitting \(\mathcal {D}\) into disjoint training and validation sets \(\mathcal {D}_{\text{train}}^{(i)}\) and \(\mathcal {D}_{\text{valid}}^{(i)}\), learning functions f i by applying A to \(\mathcal {D}_{\text{train}}^{(i)}\), and evaluating the predictive performance of these functions on \(\mathcal {D}_{\text{valid}}^{(i)}\). This allows for the model selection problem to be written as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} A^* \in \operatorname*{\mathrm{argmin}}_{A \in \mathcal{A}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}(A, \mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)}), \end{array} \end{aligned} $$

where \(\mathcal {L}(A, \mathcal {D}_{\text{train}}^{(i)}, \mathcal {D}_{\text{valid}}^{(i)})\) is the loss achieved by A when trained on \(\mathcal {D}_{\text{train}}^{(i)}\) and evaluated on \(\mathcal {D}_{\text{valid}}^{(i)}\).

We use k-fold cross-validation [19], which splits the training data into k equal-sized partitions \(\mathcal {D}_{\text{valid}}^{(1)}, \dots , \mathcal {D}_{\text{valid}}^{(k)}\), and sets \(\mathcal {D}_{\text{train}}^{(i)} = \mathcal {D} \setminus {} \mathcal {D}_{\text{valid}}^{(i)}\) for i = 1, …, k.
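To make this objective concrete, the following minimal sketch assumes scikit-learn, with a small set of classifiers and the digits dataset standing in for the learners and benchmark datasets discussed later in this chapter; it estimates each algorithm's k-fold cross-validation loss and returns the minimizer.

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def select_model(candidates, X, y, k=10):
    # Estimate each algorithm's generalization error by k-fold cross-validation
    # and return the algorithm minimizing the mean loss, as in the objective above.
    losses = {name: 1.0 - cross_val_score(make_clf(), X, y, cv=k).mean()
              for name, make_clf in candidates.items()}
    return min(losses, key=losses.get), losses

X, y = load_digits(return_X_y=True)
candidates = {"decision_tree": DecisionTreeClassifier,
              "knn": KNeighborsClassifier,
              "naive_bayes": GaussianNB}
best, losses = select_model(candidates, X, y)
print(best, losses)

Here each candidate is run with its default hyperparameters; choosing those hyperparameters well is the subject of the next subsection.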

2.2 Hyperparameter Optimization

The problem of optimizing the hyperparameters λ ∈ Λ of a given learning algorithm A is conceptually similar to that of model selection. Some key differences are that hyperparameters are often continuous, that hyperparameter spaces are often high dimensional, and that we can exploit correlation structure between different hyperparameter settings λ1, λ2 ∈ Λ. Given n hyperparameters λ1, …, λn with domains Λ1, …, Λn, the hyperparameter space Λ is a subset of the cross-product of these domains: Λ ⊂ Λ1 × ⋯ × Λn. This subset is often strict, such as when certain settings of one hyperparameter render other hyperparameters inactive. For example, the parameters determining the specifics of the third layer of a deep belief network are not relevant if the network depth is set to one or two. Likewise, the parameters of a support vector machine’s polynomial kernel are not relevant if we use a different kernel instead.

More formally, following [17], we say that a hyperparameter λ i is conditional on another hyperparameter λ j, if λ i is only active if hyperparameter λ j takes values from a given set \(V_i(j) \subsetneq \Lambda _j\); in this case we call λ j a parent of λ i. Conditional hyperparameters can in turn be parents of other conditional hyperparameters, giving rise to a tree-structured space [4] or, in some cases, a directed acyclic graph (DAG) [17]. Given such a structured space Λ, the (hierarchical) hyperparameter optimization problem can be written as:

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\boldsymbol{\lambda}}^* \in \operatorname*{\mathrm{argmin}}_{{\boldsymbol{\lambda}} \in {\boldsymbol{\Lambda}}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}(A_{{\boldsymbol{\lambda}}}, \mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)}). \end{array} \end{aligned} $$
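As an illustration of such a tree-structured space, the following sketch samples an SVM configuration in which the polynomial degree is only active when the polynomial kernel is chosen; the parameter names and ranges are illustrative and not the spaces used by Auto-WEKA.

import random

def sample_svm_config(rng=random):
    # Parent hyperparameter: the kernel choice.
    config = {"C": 10 ** rng.uniform(-3, 3),            # log-uniform prior
              "kernel": rng.choice(["linear", "poly", "rbf"])}
    # Conditional (child) hyperparameters, active only for certain parent values.
    if config["kernel"] == "poly":
        config["degree"] = rng.randint(2, 5)
    if config["kernel"] in ("poly", "rbf"):
        config["gamma"] = 10 ** rng.uniform(-4, 1)
    return config

print(sample_svm_config())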

3 Combined Algorithm Selection and Hyperparameter Optimization (CASH)

Given a set of algorithms \(\mathcal {A} = \{A^{(1)}, \dots , A^{(k)}\}\) with associated hyperparameter spaces Λ(1), …, Λ(k), we define the combined algorithm selection and hyperparameter optimization (CASH) problem as computing

$$\displaystyle \begin{aligned} \begin{array}{rcl} {} {A^*}_{{\boldsymbol{\lambda}}^*} \in \operatorname*{\mathrm{argmin}}_{A^{(j)} \in \mathcal{A}, {\boldsymbol{\lambda}} \in {\boldsymbol{\Lambda}}^{(j)}} \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}(A^{(j)}_{\boldsymbol{\lambda}}, \mathcal{D}_{\text{train}}^{(i)}, \mathcal{D}_{\text{valid}}^{(i)}). \end{array} \end{aligned} $$
(4.1)

We note that this problem can be reformulated as a single combined hierarchical hyperparameter optimization problem with parameter space Λ = Λ(1) ∪ ⋯ ∪ Λ(k) ∪ {λr}, where λr is a new root-level hyperparameter that selects between algorithms A(1), …, A(k). The root-level parameters of each subspace Λ(i) are made conditional on λr being instantiated to A(i).
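A minimal sketch of this reformulation follows, assuming scikit-learn with two classifiers standing in for the algorithm set and plain random sampling standing in for a proper optimizer; all parameter ranges are illustrative.

import random
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sample_cash_config(rng=random):
    # lambda_r: the root-level hyperparameter selecting the algorithm.
    algo = rng.choice(["svm", "random_forest"])
    if algo == "svm":
        # Subspace for the SVM, active only when lambda_r = "svm".
        return algo, {"C": 10 ** rng.uniform(-3, 3), "gamma": 10 ** rng.uniform(-4, 1)}
    # Subspace for the random forest, active only when lambda_r = "random_forest".
    return algo, {"n_estimators": rng.randint(10, 200), "max_depth": rng.randint(2, 20)}

def cv_loss(algo, params, X, y, k=5):
    clf = SVC(**params) if algo == "svm" else RandomForestClassifier(**params)
    return 1.0 - cross_val_score(clf, X, y, cv=k).mean()   # CV misclassification rate

X, y = load_digits(return_X_y=True)
results = []
for _ in range(20):                                   # tiny random-search budget
    algo, params = sample_cash_config()
    results.append((cv_loss(algo, params, X, y), algo, params))
print(min(results, key=lambda r: r[0]))               # best (loss, algorithm, params)

Bayesian optimization methods such as SMAC replace the random sampling above with model-based search, as described next.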

In principle, problem (4.1) can be tackled in various ways. A promising approach is Bayesian Optimization [10], and in particular Sequential Model-Based Optimization (SMBO) [16], a versatile stochastic optimization framework that can work with both categorical and continuous hyperparameters, and that can exploit hierarchical structure stemming from conditional parameters. SMBO (outlined in Algorithm 1) first builds a model \(\mathcal {M}_{\mathcal {L}}\) that captures the dependence of loss function \(\mathcal {L}\) on hyperparameter settings λ (line 1 in Algorithm 1). It then iterates the following steps: use \(\mathcal {M}_{\mathcal {L}}\) to determine a promising candidate configuration of hyperparameters λ to evaluate next (line 3); evaluate the loss c of λ (line 4); and update the model \(\mathcal {M}_{\mathcal {L}}\) with the new data point (λ, c) thus obtained (lines 5–6).

Algorithm 1 SMBO

1: initialise model \(\mathcal {M}_{\mathcal {L}}\); \(\mathcal {H} \gets \emptyset \)
2: while time budget for optimization has not been exhausted do
3:   λ ← candidate configuration from \(\mathcal {M}_{\mathcal {L}}\)
4:   Compute \(c = \mathcal {L}(A_{{\boldsymbol {\lambda }}}, \mathcal {D}_{\text{train}}^{(i)}, \mathcal {D}_{\text{valid}}^{(i)})\)
5:   \(\mathcal {H} \gets \mathcal {H} \cup \left \{({\boldsymbol {\lambda }},c)\right \}\)
6:   Update \(\mathcal {M}_{\mathcal {L}}\) given \(\mathcal {H}\)
7: end while
8: return λ from \(\mathcal {H}\) with minimal c

In order to select its next hyperparameter configuration λ using model \(\mathcal {M}_{\mathcal {L}}\), SMBO uses a so-called acquisition function \(a_{\mathcal {M}_{\mathcal {L}}}:{\boldsymbol {\Lambda }} \mapsto \mathbb {R}\), which uses the predictive distribution of model \(\mathcal {M}_{\mathcal {L}}\) at arbitrary hyperparameter configurations λ ∈ Λ to quantify (in closed form) how useful knowledge about λ would be. SMBO then simply maximizes this function over Λ to select the most useful configuration λ to evaluate next. Several well-studied acquisition functions exist [18, 27, 29]; all aim to automatically trade off exploitation (locally optimizing hyperparameters in regions known to perform well) versus exploration (trying hyperparameters in a relatively unexplored region of the space) in order to avoid premature convergence. In this work, we maximized the positive expected improvement (EI) attainable over a given incumbent loss c min [27]. Let c(λ) denote the loss of hyperparameter configuration λ. Then, the positive improvement function over c min is defined as

$$\displaystyle \begin{aligned}I_{c_{min}}({\boldsymbol{\lambda}}) := \max\{c_{min}-c({\boldsymbol{\lambda}}),0\}.\end{aligned}$$

Of course, we do not know c(λ). We can, however, compute the expectation of this improvement with respect to the current model \(\mathcal {M}_{\mathcal {L}}\):

$$\displaystyle \begin{aligned}\mathbb{E}_{\mathcal{M}_{\mathcal{L}}}[I_{c_{min}}({\boldsymbol{\lambda}})] = \int_{-\infty}^{c_{min}} \max\{c_{min}-c,0\}\cdot p_{\mathcal{M}_{\mathcal{L}}}(c \mid {\boldsymbol{\lambda}}) \; dc. \end{aligned} $$
(4.2)

We briefly review the SMBO approach used in this chapter.

3.1 Sequential Model-Based Algorithm Configuration (SMAC)

Sequential model-based algorithm configuration (SMAC) [16] supports a variety of models p(c ∣ λ) to capture the dependence of the loss function c on hyperparameters λ, including approximate Gaussian processes and random forests. In this chapter we use random forest models, since they tend to perform well with discrete and high-dimensional input data. SMAC handles conditional parameters by instantiating inactive conditional parameters in λ to default values for model training and prediction. This allows the individual decision trees to include splits of the kind “is hyperparameter λi active?”, allowing them to focus on active hyperparameters. While random forests are not usually treated as probabilistic models, SMAC obtains a predictive mean \(\mu _{\boldsymbol {\lambda }}\) and variance \({\sigma _{\boldsymbol {\lambda }}}^2\) of p(c ∣ λ) as frequentist estimates over the predictions of its individual trees for λ; it then models \(p_{\mathcal {M}_{\mathcal {L}}}(c \mid {\boldsymbol {\lambda }})\) as a Gaussian \({\mathcal {N}}(\mu _{\boldsymbol {\lambda }}, {\sigma _{\boldsymbol {\lambda }}}^2)\).
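The following sketch illustrates this frequentist mean/variance estimate using scikit-learn's random forest, which is only a stand-in for SMAC's own forest implementation (the latter differs, e.g., in its split criteria and its handling of conditional parameters); configurations are assumed to be encoded as numeric vectors, with inactive conditional hyperparameters set to defaults.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_predictive_distribution(forest, configs):
    # Mean and variance over the individual trees' predictions, which SMAC
    # then treats as a Gaussian N(mu_lambda, sigma_lambda^2).
    per_tree = np.stack([tree.predict(configs) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.var(axis=0)

# Toy history of (encoded configuration, observed loss) pairs.
rng = np.random.default_rng(0)
past_configs = rng.uniform(size=(50, 3))
past_losses = past_configs[:, 0] ** 2 + 0.1 * rng.normal(size=50)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(past_configs, past_losses)
mu, var = rf_predictive_distribution(forest, rng.uniform(size=(5, 3)))
print(mu, np.sqrt(var))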

SMAC uses the expected improvement criterion defined in Eq. 4.2, instantiating c min to the loss of the best hyperparameter configuration measured so far. Under SMAC’s predictive distribution \(p_{\mathcal {M}_{\mathcal {L}}}(c \mid {\boldsymbol {\lambda }}) = {\mathcal {N}}(\mu _{\boldsymbol {\lambda }}, {\sigma _{\boldsymbol {\lambda }}}^2)\), this expectation is the closed-form expression

$$\displaystyle \begin{aligned} \mathbb{E}_{\mathcal{M}_{\mathcal{L}}}[I_{c_{min}}({\boldsymbol{\lambda}})] = \sigma_{{\boldsymbol{\lambda}}} \cdot [u \cdot \Phi(u) + \varphi(u)], \end{aligned}$$

where \(u = \frac {c_{\min }-\mu _{{\boldsymbol {\lambda }}}}{\sigma _{{\boldsymbol {\lambda }}}}\), and φ and Φ denote the probability density function and cumulative distribution function of a standard normal distribution, respectively [18].
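In code, this closed-form expression is straightforward to evaluate from the surrogate's predictive mean and standard deviation; the small epsilon guarding against zero predicted variance is an assumed implementation detail, not taken from the text.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, c_min, eps=1e-12):
    # EI(lambda) = sigma * [u * Phi(u) + phi(u)], with u = (c_min - mu) / sigma.
    sigma = np.maximum(sigma, eps)
    u = (c_min - mu) / sigma
    return sigma * (u * norm.cdf(u) + norm.pdf(u))

print(expected_improvement(mu=np.array([0.20, 0.35]),
                           sigma=np.array([0.05, 0.10]),
                           c_min=0.25))

SMBO evaluates this acquisition function at many candidate configurations and selects a maximizer as the next configuration to run.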

SMAC is designed for robust optimization under noisy function evaluations, and as such implements special mechanisms to keep track of its best known configuration and assure high confidence in its estimate of that configuration’s performance. This robustness against noisy function evaluations can be exploited in combined algorithm selection and hyperparameter optimization, since the function to be optimized in Eq. (4.1) is a mean over a set of loss terms (each corresponding to one pair of \(\mathcal {D}_{\text{train}}^{(i)}\) and \(\mathcal {D}_{\text{valid}}^{(i)}\) constructed from the training set). A key idea in SMAC is to make progressively better estimates of this mean by evaluating these terms one at a time, thus trading off accuracy and computational cost. In order for a new configuration to become a new incumbent, it must outperform the previous incumbent in every comparison made: considering only one fold, two folds, and so on up to the total number of folds previously used to evaluate the incumbent. Furthermore, every time the incumbent survives such a comparison, it is evaluated on a new fold, up to the total number available, meaning that the number of folds used to evaluate the incumbent grows over time. A poorly performing configuration can thus be discarded after considering just a single fold.
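A simplified sketch of this race between a challenger configuration and the incumbent is given below; evaluate_fold is a hypothetical callback returning the loss of a configuration on a given fold, and SMAC's actual intensification mechanism is considerably more involved.

from statistics import mean

def challenge_incumbent(challenger, incumbent, inc_losses, evaluate_fold, k_max):
    # inc_losses holds the incumbent's losses on the folds it has been evaluated on.
    chal_losses = []
    for i in range(len(inc_losses)):
        chal_losses.append(evaluate_fold(challenger, i))
        # The challenger must be at least as good on one fold, two folds, and so on;
        # a poorly performing configuration is thus discarded after as little as one fold.
        if mean(chal_losses) > mean(inc_losses[: i + 1]):
            # The incumbent survives and is evaluated on one additional fold.
            if len(inc_losses) < k_max:
                inc_losses.append(evaluate_fold(incumbent, len(inc_losses)))
            return incumbent, inc_losses
    return challenger, chal_losses    # the challenger becomes the new incumbent

# Toy usage with a deterministic fold evaluator (loss = configuration value + fold offset).
folds = [0.01, -0.02, 0.00, 0.02, -0.01]
def toy_eval(config, fold):
    return config + folds[fold]
print(challenge_incumbent(0.25, 0.30, [0.30 + folds[0]], toy_eval, k_max=5))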

Finally, SMAC also implements a diversification mechanism to achieve robust performance even when its model is misled, and to explore new parts of the space: every second configuration is selected at random. Because of the evaluation procedure just described, this requires less overhead than one might imagine.

4 Auto-WEKA

To demonstrate the feasibility of an automatic approach to solving the CASH problem, we built Auto-WEKA, which solves this problem for the learners and feature selectors implemented in the WEKA machine learning package [15]. Note that while we have focused on classification algorithms in WEKA, there is no obstacle to extending our approach to other settings. Indeed, another successful system that uses the same underlying technology is auto-sklearn [12].

Fig. 4.1 shows all supported learning algorithms and feature selectors, along with the number of hyperparameters of each. Meta-methods take a single base classifier and its parameters as input, and ensemble methods can take any number of base learners as input. We allowed the meta-methods to use any base learner with any hyperparameter settings, and allowed the ensemble methods to use up to five learners, again with any hyperparameter settings. Not all learners are applicable to all datasets (e.g., due to a classifier’s inability to handle missing data). For a given dataset, our Auto-WEKA implementation automatically considers only the subset of applicable learners. Feature selection is run as a preprocessing phase before building any model.

Fig. 4.1 Learners and methods supported by Auto-WEKA, along with the number of hyperparameters |Λ|. Every learner supports classification; starred learners also support regression

The algorithms in Fig. 4.1 have a wide variety of hyperparameters, which take values from continuous intervals, from ranges of integers, and from other discrete sets. We associated either a uniform or a log-uniform prior with each numerical parameter, depending on its semantics. For example, we set a log-uniform prior for the ridge regression penalty, and a uniform prior for the maximum depth of a tree in a random forest. Auto-WEKA works with continuous hyperparameter values directly, up to the precision of the machine. We emphasize that this combined hyperparameter space is much larger than a simple union of the base learners’ hyperparameter spaces, since the ensemble methods allow up to five independent base learners. The meta- and ensemble methods, as well as feature selection, contribute further to the total size of Auto-WEKA’s hyperparameter space.
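As a small sketch of what these priors mean in practice, the following samples a numeric hyperparameter either uniformly or log-uniformly; the ranges are illustrative and not Auto-WEKA's actual ones.

import math
import random

def sample_numeric(low, high, log_scale=False, rng=random):
    # Log-uniform prior for scale-type parameters (e.g., a ridge penalty);
    # uniform prior for parameters such as a maximum tree depth.
    if log_scale:
        return 10 ** rng.uniform(math.log10(low), math.log10(high))
    return rng.uniform(low, high)

ridge_penalty = sample_numeric(1e-8, 1e2, log_scale=True)   # illustrative range
max_depth = round(sample_numeric(1, 20))                    # illustrative range
print(ridge_penalty, max_depth)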

Auto-WEKA uses the SMAC optimizer described above to solve the CASH problem and is available to the public through the WEKA package manager; the source code can be found at https://github.com/automl/autoweka and the official project website is at http://www.cs.ubc.ca/labs/beta/Projects/autoweka. For the experiments described in this chapter, we used Auto-WEKA version 0.5. More recent versions achieve similar results; we did not replicate the full set of experiments because of the large computational cost.

5 Experimental Evaluation

We evaluated Auto-WEKA on 21 prominent benchmark datasets (see Table 4.1): 15 sets from the UCI repository [13]; the ‘convex’, ‘MNIST basic’ and ‘rotated MNIST with background images’ tasks used in [5]; the appetency task from the KDD Cup ’09; and two versions of the CIFAR-10 image classification task [21] (CIFAR-10-Small is a subset of CIFAR-10, where only the first 10,000 training data points are used rather than the full 50,000). Note that in the experimental evaluation, we focus on classification. For datasets with a predefined training/test split, we used that split. Otherwise, we randomly split the dataset into 70% training and 30% test data. We withheld the test data from all optimization methods; it was only used once, in an offline analysis stage, to evaluate the models found by the various optimization methods.

Table 4.1 Datasets used; Num. Discr. and Num. Cont. refer to the number of discrete and continuous attributes of elements in the dataset, respectively

For each dataset, we ran Auto-WEKA with each hyperparameter optimization algorithm with a total time budget of 30 h. For each method, we performed 25 runs of this process with different random seeds and then—in order to simulate parallelization on a typical workstation—used bootstrap sampling to repeatedly select four random runs and report the performance of the one with best cross-validation performance.
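The bootstrap simulation of four parallel runs can be sketched as follows, assuming that per-run cross-validation and test losses are available (placeholder values below, drawn with replacement as the term bootstrap suggests); the 100,000 bootstrap samples match the protocol described in the caption of Table 4.2.

import numpy as np

def simulate_parallel_runs(cv_loss, test_loss, n_parallel=4, n_bootstrap=100_000, seed=0):
    # For each bootstrap sample, draw n_parallel runs from the 25 available runs
    # and report the test loss of the run with the best cross-validation loss,
    # then average over all bootstrap samples.
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(cv_loss), size=(n_bootstrap, n_parallel))
    best = idx[np.arange(n_bootstrap), cv_loss[idx].argmin(axis=1)]
    return test_loss[best].mean()

# Placeholder results for 25 runs of one optimizer on one dataset.
rng = np.random.default_rng(1)
cv_loss, test_loss = rng.uniform(0.1, 0.4, size=25), rng.uniform(0.1, 0.4, size=25)
print(simulate_parallel_runs(cv_loss, test_loss))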

In early experiments, we observed a few cases in which Auto-WEKA’s SMBO method picked hyperparameters that had excellent training performance, but turned out to generalize poorly. To enable Auto-WEKA to detect such overfitting, we partitioned its training set into two subsets: 70% for use inside the SMBO method, and 30% of validation data that we only used after the SMBO method finished.

5.1 Baseline Methods

Auto-WEKA aims to aid non-expert users of machine learning techniques. A natural approach that such a user might take is to perform 10-fold cross validation on the training set for each technique with unmodified hyperparameters, and select the classifier with the smallest average misclassification error across folds. We will refer to this method applied to our set of WEKA learners as Ex-Def; it is the best choice that can be made for WEKA with default hyperparameters.

For each dataset, the second and third columns in Table 4.2 present the best and worst “oracle performance” of the default learners when trained on all the training data and evaluated on the test set. We observe that the gap between the best and worst learner was huge, e.g., misclassification rates of 4.93% vs. 99.24% on the Dorothea dataset. This suggests that some form of algorithm selection is essential for achieving good performance.

Table 4.2 Performance on both 10-fold cross-validation and test data. Ex-Def and Grid Search are deterministic. Random search had a time budget of 120 CPU hours. For Auto-WEKA, we performed 25 runs of 30 h each. We report results as mean loss across 100,000 bootstrap samples simulating 4 parallel runs. We determined test loss (misclassification rate) by training the selected model/hyperparameters on the entire 70% training data and computing accuracy on the previously unused 30% test data. Bold face indicates the lowest error within a block of comparable methods, where the difference was statistically significant

A stronger baseline we will use is an approach that in addition to selecting the learner, also sets its hyperparameters optimally from a predefined set. More precisely, this baseline performs an exhaustive search over a grid of hyperparameter settings for each of the base learners, discretizing numeric parameters into three points. We refer to this baseline as grid search and note that—as an optimization approach in the joint space of algorithms and hyperparameter settings—it is a simple CASH algorithm. However, it is quite expensive, requiring more than 10,000 CPU hours on each of Gisette, Convex, MNIST, Rot MNIST +  BI, and both CIFAR variants, rendering it infeasible to use in most practical applications. (In contrast, we gave Auto-WEKA only 120 CPU hours.)
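For a single base learner, the grid baseline's three-point discretization of numeric hyperparameters can be sketched as follows; scikit-learn's SVC and the log-spaced ranges are illustrative stand-ins for the WEKA learners and their grids.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Three points per numeric hyperparameter; the full baseline repeats this for
# every base learner and takes the overall best combination.
grid = ParameterGrid({"C": np.logspace(-2, 2, 3), "gamma": np.logspace(-4, 0, 3)})
results = [(1.0 - cross_val_score(SVC(**params), X, y, cv=10).mean(), params)
           for params in grid]
print(min(results, key=lambda r: r[0]))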

Table 4.2 (columns four and five) shows the best and worst “oracle performance” on the test set across the classifiers evaluated by grid search. Comparing these performances to the default performance obtained using Ex-Def, we note that in most cases, even WEKA’s best default algorithm could be improved by selecting better hyperparameter settings, sometimes rather substantially: e.g., in the CIFAR-10 small task, grid search offered a 13% reduction in error over Ex-Def.

It has been demonstrated in previous work that, holding the overall time budget constant, grid search is outperformed by random search over the hyperparameter space [5]. Our final baseline, random search, implements such a method, picking algorithms and hyperparameters at random and computing their performance on the 10 cross-validation folds until it exhausts its time budget. For each dataset, we first used 750 CPU hours to compute the cross-validation performance of randomly sampled combinations of algorithms and hyperparameters. We then simulated runs of random search by sampling combinations without replacement from these results until 120 CPU hours had been consumed, and returning the sampled combination with the best performance.
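A sketch of that simulation follows, assuming the pre-computed pool is available as (CPU hours, cross-validation loss, configuration) triples; the values below are placeholders.

import random

def simulate_random_search(pool, budget_hours=120, rng=random):
    # Sample evaluations without replacement from the pre-computed pool until
    # the simulated CPU-hour budget is exhausted; return the best one seen.
    shuffled = list(pool)
    rng.shuffle(shuffled)
    spent, best = 0.0, None
    for cpu_hours, cv_loss, config in shuffled:
        if spent + cpu_hours > budget_hours:
            break
        spent += cpu_hours
        if best is None or cv_loss < best[0]:
            best = (cv_loss, config)
    return best

# Placeholder pool standing in for the 750 CPU hours of pre-computed evaluations.
pool = [(random.uniform(0.5, 3.0), random.uniform(0.1, 0.5), {"config_id": i})
        for i in range(500)]
print(simulate_random_search(pool))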

5.2 Results for Cross-Validation Performance

The middle portion of Table 4.2 reports our main results. First, we note that grid search over the hyperparameters of all base-classifiers yielded better results than Ex-Def in 17/21 cases, which underlines the importance not only of choosing the right algorithm but also of setting its hyperparameters well.

However, we note that we gave grid search a very large time budget (often in excess of 10,000 CPU hours for each dataset, and in total more than 10 CPU years), meaning that it would often be infeasible to use in practice.

In contrast, we gave each of the other methods only 4 × 30 CPU hours per dataset; nevertheless, they still yielded substantially better performance than grid search, outperforming it in 14/21 cases. Random search outperforms grid search in 9/21 cases, highlighting that even exhaustive grid search with a large time budget is not always the right thing to do. We note that sometimes Auto-WEKA’s performance improvements over the baselines were substantial, with relative reductions of the cross-validation loss (in this case the misclassification rate) exceeding 10% in 6/21 cases.

5.3 Results for Test Performance

The results just shown demonstrate that Auto-WEKA is effective at optimizing its given objective function; however, this is not sufficient to allow us to conclude that it fits models that generalize well. As the number of hyperparameters of a machine learning algorithm grows, so does its potential for overfitting. The use of cross-validation substantially increases Auto-WEKA’s robustness against overfitting, but since its hyperparameter space is much larger than that of standard classification algorithms, it is important to carefully study whether (and to what extent) overfitting poses a problem.

To evaluate generalization, we determined a combination of algorithm and hyperparameter settings A λ by running Auto-WEKA as before (cross-validating on the training set), trained A λ on the entire training set, and then evaluated the resulting model on the test set. The right portion of Table 4.2 reports the test performance obtained with all methods.

Broadly speaking, similar trends held as for cross-validation performance: Auto-WEKA outperforms the baselines, with grid search and random search performing better than Ex-Def. However, the performance differences were less pronounced: grid search only yields better results than Ex-Def in 15/21 cases, and random search in turn outperforms grid search in 7/21 cases. Auto-WEKA outperforms the baselines in 15/21 cases. Notably, on 12 of the 13 largest datasets, Auto-WEKA outperforms our baselines; we attribute this to the fact that the risk of overfitting decreases with dataset size. Sometimes, Auto-WEKA’s performance improvements over the other methods were substantial, with relative reductions of the test misclassification rate exceeding 16% in 3/21 cases.

As mentioned earlier, Auto-WEKA only used 70% of its training set during the optimization of cross-validation performance, reserving the remaining 30% for assessing the risk of overfitting. At any point in time, Auto-WEKA’s SMBO method keeps track of its incumbent (the hyperparameter configuration with the lowest cross-validation misclassification rate seen so far). After its SMBO procedure has finished, Auto-WEKA extracts a trajectory of these incumbents from it and computes their generalization performance on the withheld 30% validation data. It then computes the Spearman rank correlation coefficient between the sequence of training performances (evaluated by the SMBO method through cross-validation) and this generalization performance.
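This check can be sketched with scipy's Spearman rank correlation; the two incumbent-loss trajectories below are placeholders.

from scipy.stats import spearmanr

# Incumbent losses as seen by SMBO (cross-validation on the 70% split) and on
# the withheld 30% validation data (placeholder trajectories).
cv_trajectory = [0.31, 0.27, 0.24, 0.22, 0.21]
validation_trajectory = [0.33, 0.30, 0.26, 0.25, 0.26]

rho, p_value = spearmanr(cv_trajectory, validation_trajectory)
print(rho)   # a coefficient close to 1 indicates little overfitting during optimization

A low coefficient would signal that improvements in cross-validation loss no longer translate into improvements on unseen data.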

6 Conclusion

In this work, we have shown that the daunting problem of combined algorithm selection and hyperparameter optimization (CASH) can be solved by a practical, fully automated tool. This is made possible by recent Bayesian optimization techniques that iteratively build models of the algorithm/hyperparameter landscape and leverage these models to identify new points in the space that deserve investigation.

We built a tool, Auto-WEKA, that draws on the full range of learning algorithms in WEKA and makes it easy for non-experts to build high-quality classifiers for given application scenarios. An extensive empirical comparison on 21 prominent datasets showed that Auto-WEKA often outperformed standard algorithm selection and hyperparameter optimization methods, especially on large datasets.

6.1 Community Adoption

Auto-WEKA was the first method to use Bayesian optimization to automatically instantiate a highly parametric machine learning framework at the push of a button. Since its initial release, it has been adopted by many users in industry and academia; the 2.0 line, which integrates with the WEKA package manager, has been downloaded more than 30,000 times, averaging more than 550 downloads a week. It is under active development, with new features added recently and in the pipeline.