Automated Machine Learning, pp. 177–219
Analysis of the AutoML Challenge Series 2015–2018
Abstract
The ChaLearn AutoML Challenge (the authors are listed in alphabetical order of last name, except the first author, who did most of the writing, and the second author, who produced most of the numerical analyses and plots) (NIPS 2015 – ICML 2016) consisted of six rounds of a machine learning competition of progressive difficulty, subject to limited computational resources. It was followed by a one-round AutoML challenge (PAKDD 2018). The AutoML setting differs from former model selection/hyperparameter selection challenges, such as the one we previously organized for NIPS 2006: the participants aim to develop fully automated and computationally efficient systems, capable of being trained and tested without human intervention, with code submission. This chapter analyzes the results of these competitions and provides details about the datasets, which were not revealed to the participants. The solutions of the winners are systematically benchmarked over all datasets of all rounds and compared with canonical machine learning algorithms available in scikit-learn. All materials discussed in this chapter (data and code) have been made publicly available at http://automl.chalearn.org/.
10.1 Introduction
Until about ten years ago, machine learning (ML) was a discipline little known to the public. For ML scientists, it was a “seller’s market”: they were producing hosts of algorithms in search of applications and were constantly looking for new interesting datasets. Large internet corporations accumulating massive amounts of data, such as Google, Facebook, Microsoft and Amazon, have popularized the use of ML, and data science competitions have engaged a new generation of young scientists in this wake. Nowadays, governments and corporations keep identifying new applications of ML, and with the increased availability of open data, we have switched to a “buyer’s market”: everyone seems to be in need of a learning machine. Unfortunately, however, learning machines are not yet fully automatic: it is still difficult to figure out which software applies to which problem, how to horseshoe-fit data into a software, and how to select (hyper-)parameters properly. The ambition of the ChaLearn AutoML challenge series is to channel the energy of the ML community to reduce step by step the need for human intervention in applying ML to a wide variety of practical problems.

The scope of the AutoML challenge series was restricted to:

- Supervised learning problems (classification and regression).
- Feature vector representations.
- Homogeneous datasets (same distribution in the training, validation, and test sets).
- Medium-size datasets of less than 200 MB.
- Limited computer resources, with execution times of less than 20 min per dataset on an 8-core x86_64 machine with 56 GB RAM.

Within this scope, the datasets varied along the following dimensions of difficulty:

- Different data distributions: the intrinsic/geometrical complexity of the dataset.
- Different tasks: regression, binary classification, multi-class classification, multi-label classification.
- Different scoring metrics: AUC, BAC, MSE, F_{1}, etc. (see Sect. 10.4.2).
- Class balance: balanced or unbalanced class proportions.
- Sparsity: full matrices or sparse matrices.
- Missing values: presence or absence of missing values.
- Categorical variables: presence or absence of categorical variables.
- Irrelevant variables: presence or absence of additional irrelevant variables (distractors).
- Number P_{tr} of training examples: small or large number of training examples.
- Number N of variables/features: small or large number of variables.
- Ratio P_{tr}/N of the training data matrix: P_{tr} ≫ N, P_{tr} = N, or P_{tr} ≪ N.
This challenge series started with the NIPS 2006 “model selection game”^{1} [37], where the participants were provided with a machine learning toolbox based on the Matlab toolkit CLOP [1], built on top of the “Spider” package [69]. The toolkit provided a flexible way of building models by combining preprocessing, feature selection, classification and post-processing modules, also enabling the building of ensembles of classifiers. The goal of the game was to build the best hypermodel: the focus was on model selection, not on the development of new algorithms. All problems were feature-based binary classification problems. Five datasets were provided. The participants had to submit the schema of their model. The model selection game confirmed the effectiveness of cross-validation (the winner invented a new variant called cross-indexing) and emphasized the need to focus more on search effectiveness, with the deployment of novel search techniques such as particle swarm optimization.
New in the 2015/2016 AutoML challenge, we introduced the notion of “task”: each dataset was supplied with a particular scoring metric to be optimized and a time budget. We initially intended to vary the time budget widely from dataset to dataset in an arbitrary way. We ended up fixing it to 20 min for practical reasons (except for Round 0, where the time budget ranged from 100 to 300 s). However, because the datasets varied in size, this put pressure on the participants to manage their allotted time. Other elements of novelty included the freedom of submitting any Linux executable. This was made possible by using automatic execution on the open-source platform Codalab.^{2} To help the participants, we provided a starting kit in Python based on the scikit-learn library [55].^{3} This induced many of them to write a wrapper around scikit-learn. This was the strategy of the winning entry “auto-sklearn” [25, 26, 27, 28].^{4} Following the AutoML challenge, we organized a “beat auto-sklearn” game on a single dataset (madeline), in which the participants could provide hyperparameters “by hand” to try to beat auto-sklearn. But nobody could beat auto-sklearn! Not even its designers. The participants could submit a JSON file describing a scikit-learn model and its hyperparameter settings via a graphical interface. This interface allows researchers who want to compare their search methods with auto-sklearn to use the exact same set of hypermodels.
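For illustration, such a JSON description might look as follows; the field names and hyperparameter values here are hypothetical, not the actual schema used by the “beat auto-sklearn” interface:

```json
{
  "model": "sklearn.ensemble.RandomForestClassifier",
  "hyperparameters": {
    "n_estimators": 500,
    "max_features": "sqrt",
    "min_samples_leaf": 2
  }
}
```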
A large number of satellite events, including bootcamps, summer schools, and workshops, were organized in 2015/2016 around the AutoML challenge.^{5} The AutoML challenge was part of the official selection of the competition program of IJCNN 2015 and 2016, and the results were discussed at the AutoML and CiML workshops at ICML and NIPS in 2015 and 2016. Several publications accompanied these events: in [33] we describe the details of the design of the AutoML challenge.^{6} In [32] and [34] we review milestone and final results presented at the ICML 2015 and 2016 AutoML workshops. The 2015/2016 AutoML challenge had 6 rounds introducing 5 datasets each. We also organized a follow-up event for the PAKDD conference 2018,^{7} in only 2 phases, with 5 datasets in the development phase and 5 datasets in the final “blind test” round.
Going beyond the former published analyses, this chapter presents systematic studies of the winning solutions on all the datasets of the challenge and conducts comparisons with commonly used learning machines implemented in scikitlearn. It provides unpublished details about the datasets and reflective analyses.
This chapter is in part based on material that has appeared previously [32, 33, 34, 36]. It is complemented by a 46-page online appendix that can be accessed from the book’s webpage: http://automl.org/book.
10.2 Problem Formalization and Overview
10.2.1 Scope of the Problem
This challenge series focuses on supervised learning in ML and, in particular, solving classification and regression problems, without any further human intervention, within given constraints. To this end, we released a large number of datasets preformatted in given feature representations (i.e., each example consists of a fixed number of numerical coefficients; more in Sect. 10.3).
The distinction between input and output variables is not always made in ML applications. For instance, in recommender systems, the problem is often stated as making predictions of missing values for every variable rather than predicting the values of a particular variable [58]. In unsupervised learning [30], the purpose is to explain data in a simple and compact way, possibly involving inferred latent variables (e.g., class membership produced by a clustering algorithm).
We consider only the strict supervised learning setting where data present themselves as identically and independently distributed inputoutput pairs. The models used are limited to fixedlength vectorial representations, excluding problems of time series prediction. Text, speech, and video processing tasks included in the challenge have been preprocessed into suitable fixedlength vectorial representations.
The difficulty of the proposed tasks lies in the data complexity (class imbalance, sparsity, missing values, categorical variables). The testbed is composed of data from a wide variety of domains. Although there exist ML toolkits that can tackle all of these problems, it still requires considerable human effort to find, for a given combination of dataset, task, evaluation metric, and computational constraint, the methods and hyperparameter settings that maximize performance. The challenge presented to the participants is to create the perfect black box that removes this need for human interaction, alleviating the shortage of data scientists in the coming decade.
10.2.2 Full Model Selection
We refer to participant solutions as hypermodels to indicate that they are built from simpler components. For instance, for classification problems, participants might consider a hypermodel that combines several classification techniques such as nearest neighbors, linear models, kernel methods, neural networks, and random forests. More complex hypermodels may also include preprocessing, feature construction, and feature selection modules.

Generally, a predictive model has:

- a set of parameters α = [α_{0}, α_{1}, α_{2}, …, α_{n}];
- a learning algorithm (referred to as trainer), which serves to optimize the parameters using training data;
- a trained model (referred to as predictor) of the form y = f(x) produced by the trainer;
- a clear objective function J(f), which can be used to assess the model’s performance on test data.
The problem setting also lends itself to using ensemble methods, which let several “simple” models vote to make the final decision [15, 16, 29]. In this case, the parameters θ may be interpreted as voting weights. For simplicity, we lump all parameters into a single vector, but more elaborate structures, such as trees or graphs, can be used to define the hyperparameter space [66].
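As a minimal sketch, a hypermodel of this kind can be assembled with scikit-learn (the library of the starting kit). The particular modules and hyperparameter values below are illustrative assumptions, not the winners’ settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one challenge dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A hypermodel: preprocessing + feature selection + an ensemble of base learners.
hypermodel = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # missing-value handling
    ("scale", StandardScaler()),                    # normalization
    ("select", SelectKBest(f_classif, k=10)),       # feature selection
    ("clf", VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=100, random_state=0))],
        voting="soft")),                            # voting weights act as ensemble parameters
])

score = cross_val_score(hypermodel, X, y, cv=5).mean()
```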
10.2.3 Optimization of Hyperparameters
Everyone who has worked with data has had to face some common modeling choices: scaling, normalization, missing-value imputation, variable coding (for categorical variables), variable discretization, degree of nonlinearity and model architecture, among others. ML has managed to reduce the number of hyperparameters and to produce black boxes to perform tasks such as classification and regression [21, 40]. Still, any real-world problem requires at least some preparation of the data before it can be fitted into an “automatic” method, hence requiring some modeling choices. There has been much progress on end-to-end automated ML for more complex tasks such as text, image, video, and speech processing with deep-learning methods [6]. However, even these methods have many modeling choices and hyperparameters.
While producing models for a diverse range of applications has been a focus of the ML community, little effort has been devoted to the optimization of hyperparameters. Common practices, including trial and error and grid search, may lead to overfitting models for small datasets or underfitting models for large datasets. By overfitting we mean producing models that perform well on training data but poorly on unseen data, i.e., models that do not generalize. By underfitting we mean selecting too simple a model, which does not capture the complexity of the data and hence performs poorly both on training and test data. Despite well-optimized off-the-shelf algorithms for optimizing parameters, end-users are still responsible for organizing their numerical experiments to identify the best of a number of models under consideration. Due to lack of time and resources, they often perform model/hyperparameter selection with ad hoc techniques. Ioannidis and Langford [42, 47] examine fundamental, common mistakes, such as poor construction of training/test splits, inappropriate model complexity, hyperparameter selection using test sets, misuse of computational resources, and misleading test metrics, which may invalidate an entire study. Participants must avoid these flaws and devise systems that can be blind-tested.
An additional twist of our problem setting is that code is tested with limited computational resources. That is, for each task, an arbitrary limit on execution time is fixed and a maximum amount of memory is provided. This places a constraint on the participants to produce a solution in a given time, and hence to optimize the model search from a computational point of view. In summary, participants have to jointly address the problem of overfitting/underfitting and the problem of efficient search for an optimal solution, as stated in [43]. In practice, the computational constraints have turned out to be far more challenging for the participants than the problem of overfitting. Thus the main contributions have been to devise novel efficient search techniques with cutting-edge optimization methods.
10.2.4 Strategies of Model Search
Most practitioners use heuristics such as grid search or uniform sampling to sample θ space, and use k-fold cross-validation as the upper-level objective J_{2} [20]. In this framework, the optimization of θ is not performed sequentially [8]. All the parameters are sampled along a regular scheme, usually in linear or log scale. This leads to a number of possibilities that increases exponentially with the dimension of θ. k-fold cross-validation consists of splitting the dataset into k folds; (k − 1) folds are used for training and the remaining fold is used for testing; eventually, the average of the test scores obtained on the k folds is reported. Note that some ML toolkits currently support cross-validation. There is a lack of principled guidelines to determine the number of grid points and the value of k (with the exception of [20]), and there is no guidance for regularizing J_{2}, yet this simple method is a good baseline approach.
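This baseline can be sketched in a few lines with scikit-learn; the grid values below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Regular sampling in log scale; the grid size grows exponentially with the
# number of hyperparameters (here 4 x 4 = 16 candidates).
grid = {"C": np.logspace(-2, 1, 4), "gamma": np.logspace(-3, 0, 4)}

# J2 = mean k-fold cross-validation score (k = 5).
search = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
best_theta, best_j2 = search.best_params_, search.best_score_
```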
Efforts have been made to optimize continuous hyperparameters with bi-level optimization methods, using either the k-fold cross-validation estimator [7, 50] or the leave-one-out estimator as the upper-level objective J_{2}. The leave-one-out estimator may be efficiently computed, in closed form, as a by-product of training only one predictor on all the training examples (e.g., virtual leave-one-out [38]). The method was improved by adding a regularization of J_{2} [17]. Gradient descent has been used to accelerate the search, by making a local quadratic approximation of J_{2} [44]. In some cases, the full J_{2}(θ) can be computed from a few key examples [39, 54]. Other approaches minimize an approximation or an upper bound of the leave-one-out error, instead of its exact form [53, 68]. Nevertheless, these methods are still limited to specific models and continuous hyperparameters.
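As a worked example of virtual leave-one-out: for ridge regression, the LOO residuals follow in closed form from a single fit via the hat matrix H = X(XᵀX + λI)⁻¹Xᵀ, since the LOO residual of example i is the ordinary residual divided by (1 − H_ii). A minimal numpy sketch, with an illustrative value of the hyperparameter λ:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
lam = 1.0  # regularization hyperparameter (illustrative value)

# Train ridge regression once on all training data.
A = X.T @ X + lam * np.eye(X.shape[1])
H = X @ np.linalg.solve(A, X.T)        # hat matrix
residuals = y - H @ y

# Virtual leave-one-out: closed-form LOO residuals from the single fit.
loo_residuals = residuals / (1.0 - np.diag(H))
loo_mse = np.mean(loo_residuals ** 2)  # upper-level objective J2(lambda)

# Check against an explicit leave-one-out fit for the first example.
Xr, yr = X[1:], y[1:]
w = np.linalg.solve(Xr.T @ Xr + lam * np.eye(5), Xr.T @ yr)
assert np.isclose(loo_residuals[0], y[0] - X[0] @ w)
```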
An early attempt at full model selection was the pattern search method, which uses k-fold cross-validation for J_{2}. It explores the hyperparameter space by steps of the same magnitude, and when no change in any parameter further decreases J_{2}, the step size is halved and the process is repeated until the steps are deemed sufficiently small [49]. Escalante et al. [24] addressed the full model selection problem using Particle Swarm Optimization, which optimizes a problem by having a population of candidate solutions (particles) and moving these particles around the hyperparameter space using each particle’s position and velocity. k-fold cross-validation is also used for J_{2}. This approach retrieved the winning model in ∼76% of the cases. Overfitting was controlled heuristically with early stopping, and the proportion of training and validation data was not optimized. Although progress has been made in experimental design to reduce the risk of overfitting [42, 47], in particular by splitting data in a principled way [61], to our knowledge, no one has addressed the problem of optimally splitting data.
While regularizing the second level of inference is a recent addition to the frequentist ML community, it has been an intrinsic part of Bayesian modeling via the notion of hyperprior. Some methods of multi-level optimization combine importance sampling and Markov chain Monte Carlo [2]. The field of Bayesian hyperparameter optimization has rapidly developed and yielded promising results, in particular by using Gaussian processes to model generalization performance [60, 63]. But Tree-structured Parzen Estimator (TPE) approaches, which model P(x|y) and P(y) rather than modeling P(y|x) directly [9, 10], have been found to outperform GP-based Bayesian optimization for structured optimization problems with many hyperparameters, including discrete ones [23]. The central idea of these methods is to fit J_{2}(θ) to a smooth function in an attempt to reduce variance, and to estimate the variance in regions of the hyperparameter space that are under-sampled, to guide the search towards regions of high variance. These methods are inspirational, and some of the ideas can be adopted in the frequentist setting. For instance, the random-forest-based SMAC algorithm [41], which has helped speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions, has also been found to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms [23]. We also notice that Bayesian optimization methods are often combined with other techniques such as meta-learning and ensemble methods [25] in order to gain advantage in some challenge settings with a time limit [32]. Some of these methods consider the two-level optimization jointly and take time cost as a critical guidance for hyperparameter search [45, 64].
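The flavor of these methods can be conveyed by a toy sketch: a Gaussian-process surrogate is fit to noisy evaluations of J_{2}(θ), and a lower-confidence-bound acquisition steers the search toward regions that are promising (low predicted mean) or under-sampled (high predicted variance). The objective below is a synthetic stand-in for a real cross-validation surface, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def j2(theta):
    """Stand-in for a noisy cross-validation error surface over a single
    1-D hyperparameter (assumption: smooth, with a minimum near 0.7)."""
    return (theta - 0.7) ** 2 + 0.05 * rng.normal()

def gp_posterior(X, y, Xs, length=0.3, noise=1e-3):
    # Squared-exponential Gaussian-process posterior mean and variance.
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks, Kss = k(X, Xs), k(Xs, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, np.maximum(var, 0)

# Start from a few random evaluations, then repeatedly evaluate the point
# minimizing a lower confidence bound (mean - 2 * std): the smooth surrogate
# reduces variance, and the std term explores under-sampled regions.
thetas = list(rng.uniform(-2, 2, size=3))
scores = [j2(t) for t in thetas]
grid = np.linspace(-2, 2, 201)
for _ in range(10):
    mean, var = gp_posterior(np.array(thetas), np.array(scores), grid)
    t_next = grid[np.argmin(mean - 2 * np.sqrt(var))]
    thetas.append(t_next)
    scores.append(j2(t_next))

best_theta = thetas[int(np.argmin(scores))]
```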
Besides Bayesian optimization, several other families of approaches exist in the literature and have gained much attention with the recent rise of deep learning. Ideas borrowed from reinforcement learning have recently been used to construct optimal neural network architectures [4, 70]. These approaches formulate the hyperparameter optimization problem in a reinforcement learning flavor, with, for example, states being the actual hyperparameter setting (e.g., the network architecture), actions being adding or deleting a module (e.g., a CNN layer or a pooling layer), and the reward being the validation accuracy. They can then apply off-the-shelf reinforcement learning algorithms (e.g., REINFORCE, Q-learning, Monte Carlo Tree Search) to solve the problem. Other architecture search methods use evolutionary algorithms [3, 57]. These approaches consider a set (population) of hyperparameter settings (individuals), and modify (mutate and reproduce) and eliminate unpromising settings according to their cross-validation score (fitness). After several generations, the overall quality of the population increases. One important common point of reinforcement learning and evolutionary algorithms is that they both deal with the exploration-exploitation trade-off. Despite the impressive results, these approaches require a huge amount of computational resources, and some (especially evolutionary algorithms) are hard to scale. Pham et al. [56] recently proposed weight sharing among child models to speed up the process considerably [70] while achieving comparable results.
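The population / selection / mutation loop of an evolutionary search can be sketched over a hypothetical two-hyperparameter space (tree depth and learning rate); the fitness function below is a synthetic stand-in for a cross-validation score:

```python
import random

random.seed(0)

def fitness(individual):
    """Stand-in for a cross-validation score (assumption: higher is better,
    peaked at depth = 6 and learning rate = 0.1)."""
    depth, lr = individual
    return -((depth - 6) ** 2) - 100 * (lr - 0.1) ** 2

def mutate(individual):
    # Small random perturbations, clipped to the valid hyperparameter ranges.
    depth, lr = individual
    return (max(1, depth + random.choice([-1, 0, 1])),
            min(1.0, max(0.01, lr * random.choice([0.5, 1.0, 2.0]))))

# Each generation keeps the fittest half (exploitation) and refills the
# population with mutated copies of the survivors (exploration).
population = [(random.randint(1, 12), random.uniform(0.01, 1.0)) for _ in range(8)]
for generation in range(15):
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(4)]

best = max(population, key=fitness)
```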
Note that splitting the problem of parameter fitting into two levels can be extended to more levels, at the expense of extra complexity: the need for a hierarchy of data splits to perform multiple or nested cross-validation [22], insufficient data to train and validate at the different levels, and an increase in the computational load.
Typical example of a multi-level inference algorithm. The top-level algorithm Validation({GridCV(Kridge, MSE), GridCV(Neural, MSE)}, MSE) is decomposed into its elements recursively. Calling the method “train” on it using data D_{TrVa} results in a function f, which is then tested with test(f, MSE, D_{Te}). The notation [.]_{CV} indicates that results are averages over multiple data splits (cross-validation). NA means “not applicable”. A model family \(\mathcal {F}\) of parameters α and hyperparameters θ is represented as f(θ, α). We depart from the usual convention of putting hyperparameters last: the hyperparameters are listed in decreasing order of inference level. \(\mathcal {F}\), thought of as a bottom-level algorithm, does not perform any training: train(f(θ, α)) just returns the function f(x; θ, α)
Level | Algorithm | Fixed parameters | Varying parameters | Optimization performed | Data split
NA | f | All | None | Performance assessment (no inference) | D_{Te}
4 | Validation | None | All | Final algorithm selection using validation data | D = [D_{Tr}, D_{Va}]
3 | GridCV | Model index i | θ, γ, α | 10-fold CV on regularly sampled values of θ | D_{Tr} = [D_{tr}, D_{va}]_{CV}
2 | Kridge(θ); Neural(θ) | i, θ | γ, α | Virtual LOO CV to select regularization parameter γ | \(D_{tr}=[D_{tr}^{\backslash \{d\}}, d]_{CV}\)
1 | Kridge(θ, γ); Neural(θ, γ) | i, θ, γ | α | Matrix inversion or gradient descent to compute α | D_{tr}
0 | Kridge(θ, γ, α); Neural(θ, γ, α) | All | None | NA | NA
In summary, many authors focus only on the efficiency of search, ignoring the problem of overfitting the second level objective J_{2}, which is often chosen to be kfold crossvalidation with an arbitrary value for k. Bayesian methods introduce techniques of overfitting avoidance via the notion of hyperpriors, but at the expense of making assumptions on how the data were generated and without providing guarantees of performance. In all the prior approaches to full model selection we know of, there is no attempt to treat the problem as the optimization of a regularized functional J_{2} with respect to both (1) modeling choices and (2) data split. Much remains to be done to jointly address statistical and computational issues. The AutoML challenge series offers benchmarks to compare and contrast methods addressing these problems, free of the inventor/evaluator bias.
10.3 Data
Datasets of the 2015/2016 AutoML challenge. C: number of classes; Cbal: class balance; Sparse: sparsity; Miss: fraction of missing values; Cat: categorical variables; Irr: fraction of irrelevant variables; Pte, Pva, Ptr: number of examples of the test, validation, and training sets, respectively; N: number of features; Ptr/N: aspect ratio of the dataset
Datasets of the 2018 AutoML challenge. All tasks are binary classification problems. The metric is the AUC for all tasks. The time budget is also the same for all datasets (1200 s). Phase 1 was the development phase and phase 2 the final “blind test” phase
Some datasets were obtained from public sources, but they were reformatted into new representations to conceal their identity, except for the final round of the 2015/2016 challenge and the final phase of the 2018 challenge, which included completely new data.
In the 2015/2016 challenge, data difficulty progressively increased from round to round. Round 0 introduced five (public) datasets from previous challenges illustrating the various difficulties encountered in subsequent rounds:
- Novice: Binary classification problems only. No missing data; no categorical features; moderate number of features (<2,000); balanced classes. Challenge lies in dealing with sparse and full matrices, presence of irrelevant variables, and various Ptr/N.
- Intermediate: Binary and multi-class classification problems. Challenge lies in dealing with unbalanced classes, number of classes, missing values, categorical variables, and up to 7,000 features.
- Advanced: Binary, multi-class, and multi-label classification problems. Challenge lies in dealing with up to 300,000 features.
- Expert: Classification and regression problems. Challenge lies in dealing with the entire range of data complexity.
- Master: Classification and regression problems of all difficulties. Challenge lies in learning from completely new datasets.
The datasets of the 2018 challenge were all binary classification problems. Validation partitions were not used because of the design of this challenge, even when they were available for some tasks. The three reused datasets had similar difficulty to those of rounds 1 and 2 of the 2015/2016 challenge. However, the seven new datasets introduced difficulties that were not present in the former challenge, most notably extreme class imbalance, the presence of categorical features, and a temporal dependency among instances that could be exploited by participants to develop their methods.^{8} The datasets from both challenges are downloadable from http://automl.chalearn.org/data.
10.4 Challenge Protocol
In this section, we describe the design choices we made to ensure the thoroughness and fairness of the evaluation. As previously indicated, we focus on supervised learning tasks (classification and regression problems), without any human intervention, within given time and computer resource constraints (Sect. 10.4.1), and given a particular metric (Sect. 10.4.2), which varies from dataset to dataset. During the challenges, the identity and description of the datasets were concealed (except in the very first round or phase, where sample data were distributed) to avoid the use of domain knowledge and to push participants to design fully automated ML solutions. In the 2015/2016 AutoML challenge, the datasets were introduced in a series of rounds (Sect. 10.4.3), alternating periods of code development (Tweakathon phases) and blind tests of code without human intervention (AutoML phases). Either results or code could be submitted during development phases, but code had to be submitted to be part of the AutoML “blind test” ranking. In the 2018 edition of the AutoML challenge, the protocol was simplified: there was only one round in two phases, a development phase in which 5 datasets were released for practice purposes, and a final “blind test” phase with 5 new datasets that had never been used before.
10.4.1 Time Budget and Computational Resources
The Codalab platform provides computational resources shared by all participants. We used up to 10 compute workers processing the queue of submissions made by participants in parallel. Each compute worker was equipped with 8 x86_64 cores. Memory was increased from 24 to 56 GB after round 3 of the 2015/2016 AutoML challenge. For the 2018 AutoML challenge, computing resources were reduced, as we wanted to motivate the development of more efficient yet effective AutoML solutions. We used 6 compute workers processing the queue of submissions in parallel. Each compute worker was equipped with 2 x86_64 cores and 8 GB of memory.
To ensure fairness, when a code submission was evaluated, a compute worker was dedicated to processing that submission only, and its execution time was limited to a given time budget (which could vary from dataset to dataset). The time budget was provided to the participants with each dataset in its info file. It was generally set to 1200 s (20 min) per dataset, for practical reasons, except in the first phase of the first round. However, the participants did not know this ahead of time, and therefore their code had to be capable of managing any given time budget. The participants who submitted results instead of code were not constrained by the time budget, since their code was run on their own platform. This was potentially advantageous for entries counting towards the Final phases (immediately following a Tweakathon). Participants wishing to also enter the AutoML (blind testing) phases, which required submitting code, could submit both results and code (simultaneously). When results were submitted, they were used as entries in the ongoing phase. They did not need to be produced by the submitted code; i.e., if a participant did not want to share personal code, he/she could submit the sample code provided by the organizers together with his/her results. The code was automatically forwarded to the AutoML phases for “blind testing”. In AutoML phases, result submission was not possible.
The participants were encouraged to save and submit intermediate results so we could draw learning curves. This possibility was not exploited during the challenge, but we study learning curves in this chapter to evaluate the ability of algorithms to attain good performance quickly.
10.4.2 Scoring Metrics
The scores are computed by comparing submitted predictions to reference target values. For each sample i, i = 1 : P (where P is the size of the validation set or of the test set), the target value is a continuous numeric coefficient y_{i} for regression problems, a binary indicator in {0, 1} for two-class problems, or a vector of binary indicators [y_{il}] in {0, 1} for multi-class or multi-label classification problems (one per class l). The participants had to submit prediction values matching the target values as closely as possible, in the form of a continuous numeric coefficient q_{i} for regression and two-class problems, and a vector of numeric coefficients [q_{il}] in the range [0, 1] for multi-class or multi-label classification problems (one per class l).
The provided starting kit contains an implementation in Python of all scoring metrics used to evaluate the entries. Each dataset has its own scoring criterion specified in its info file. All scores are normalized such that the expected value of the score for a random prediction, based on class prior probabilities, is 0 and the optimal score is 1. Multilabel problems are treated as multiple binary classification problems and are evaluated using the average of the scores of each binary classification subproblem.
The scoring metrics used in the challenges are the following (their exact formulas are implemented in the starting kit):

- R^{2}: coefficient of determination, based on the mean squared error, used for regression problems.
- ABS: a coefficient similar to R^{2}, based on the mean absolute error.
- BAC: balanced accuracy, i.e., the average of the class-wise accuracies. For binary classification problems, the class-wise accuracy is the fraction of correct class predictions when q_{i} is thresholded at 0.5, for each class. For multi-label problems, the class-wise accuracy is averaged over all classes. For multi-class problems, the predictions are binarized by selecting the class with maximum prediction value \(\arg \max _l q_{il}\) before computing the class-wise accuracy.
- AUC: area under the ROC curve.
- F1 score: the harmonic mean of precision and recall.
- PAC: probabilistic accuracy, based on the cross-entropy (or log loss).

Note that the normalization of R^{2}, ABS, and PAC uses the average target value q_{i} = 〈y_{i}〉 or q_{il} = 〈y_{il}〉. In contrast, the normalization of BAC, AUC, and F1 uses a random prediction of one of the classes with uniform probability.

Only R^{2} and ABS are meaningful for regression; we compute the other metrics for completeness by replacing the target values with binary values after thresholding them in the mid-range.
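As a small worked example of this normalization, consider BAC on a binary problem; the helper functions below are illustrative, not the challenge’s actual scoring code:

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the two class-wise accuracies for a binary problem
    (predictions thresholded at 0.5)."""
    pos = [p >= 0.5 for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p < 0.5 for t, p in zip(y_true, y_pred) if t == 0]
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))

def normalized(score, random_score, optimal_score=1.0):
    """Rescale so a random prediction scores 0 and a perfect one scores 1."""
    return (score - random_score) / (optimal_score - random_score)

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [0.9, 0.8, 0.2, 0.6, 0.7, 0.1]

bac = balanced_accuracy(y_true, y_pred)       # 1.0 on positives, 2/3 on negatives
norm_bac = normalized(bac, random_score=0.5)  # a uniform random binary prediction has BAC 0.5
```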
10.4.3 Rounds and Phases in the 2015/2016 Challenge
Phases of round n in the 2015/2016 challenge. For each dataset, one labeled training set and two unlabeled sets (a validation set and a test set) are provided for testing
Phase in round [n] | Goal | Duration | Submissions | Data | Leaderboard scores | Prizes
* AutoML[n] | Blind test of code | Short | NONE (code migrated) | New datasets, not downloadable | Test set results | Yes
Tweakathon[n] | Manual tweaking | Months | Code and/or results | Datasets downloadable | Validation set results | No
* Final[n] | Results of Tweakathon revealed | Short | NONE (results migrated) | NA | Test set results | Yes
Submissions were made in Tweakathon phases only. The results of the latest submission were shown on the leaderboard, and that submission automatically migrated to the following phase. In this way, the code of participants who abandoned before the end of the challenge had a chance to be tested in subsequent rounds and phases. New participants could enter at any time. Prizes were awarded in the phases marked with an asterisk (*), during which there was no submission. To participate in phase AutoML[n], code had to be submitted in Tweakathon[n−1].
In order to encourage participants to try GPUs and deep learning, a GPU track sponsored by NVIDIA was included in Round 4.
To participate in the Final[n], code or results had to be submitted in Tweakathon[n]. If both code and (wellformatted) results were submitted, the results were used for scoring rather than rerunning the code in Tweakathon[n] and Final[n]. The code was executed when results were unavailable or not well formatted. Thus, there was no disadvantage in submitting both results and code. If a participant submitted both results and code, different methods could be used to enter the Tweakathon/Final phases and the AutoML phases. Submissions were made only during Tweakathons, with a maximum of five submissions per day. Immediate feedback was provided on the leaderboard on validation data. The participants were ranked on the basis of test performance during the Final and AutoML phases.
We provided baseline software using the ML library scikit-learn [55]. It uses ensemble methods, which improve over time by adding more base learners. Other than the number of base learners, the default hyperparameter settings were used. The participants were not obliged to use the Python language nor the main Python script we gave as an example. However, most participants found it convenient to use the main Python script, which managed the sparse format, the anytime learning settings, and the scoring metrics. Many limited themselves to searching for the best model in the scikit-learn library. This shows the importance of providing a good starting kit, but also the danger of biasing results towards particular solutions.
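The "anytime" behavior described above, an ensemble that improves over time by adding base learners with default hyperparameters, can be sketched in scikit-learn with a warm-started forest that keeps growing until the time budget is exhausted. This is our own illustrative reconstruction, not the actual starting kit code; the budget value is a stand-in for the challenge's 1200 s.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a challenge dataset.
X, y = make_classification(n_samples=500, random_state=0)

budget = 2.0  # seconds; the challenge used 1200 s per dataset
start = time.time()
clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
while time.time() - start < budget:
    clf.fit(X, y)            # warm_start: refitting only adds new trees
    clf.n_estimators += 10   # grow the ensemble on the next pass

proba = clf.predict_proba(X)[:, 1]
```

The key design point is that interrupting the loop at any moment still leaves a usable (if smaller) ensemble, which is exactly what anytime learning requires.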
10.4.4 Phases in the 2018 Challenge
The 2015/2016 AutoML challenge was very long and few teams participated in all rounds. Further, even though there was no obligation to participate in previous rounds to enter new rounds, new potential participants felt they would be at a disadvantage. Hence, we believe it is preferable to organize recurrent yearly events, each with their own workshop and publication opportunity. This provides a good balance between competition and collaboration.
In 2018, we organized a single round of AutoML competition in two phases. In this simplified protocol, the participants could practice on five datasets during the first (development) phase, by either submitting code or results. Their performances were revealed immediately, as they became available, on the leaderboard.
The last submission of the development phase was automatically forwarded to the second phase: the AutoML “blind test” phase. In this second phase, which was the only one counting towards the prizes, the participants’ code was automatically evaluated on five new datasets on the Codalab platform. The datasets were not revealed to the participants. Hence, submissions that did not include code capable of being trained and tested automatically were not ranked in the final phase and could not compete towards the prizes.
We provided the same starting kit as in the AutoML 2015/2016 challenge, but the participants also had access to the code of the winners of the previous challenge.
10.5 Results
This section provides a brief description of the results obtained during both challenges, explains the methods used by the participants and their elements of novelty, and provides an analysis of post-challenge experiments conducted to answer specific questions on the effectiveness of model search techniques.
10.5.1 Scores Obtained in the 2015/2016 Challenge
The 2015/2016 challenge lasted 18 months (December 8, 2014 to May 1, 2016). By the end of the challenge, practical solutions were obtained and opensourced, such as the solution of the winners [25].
Results of the 2015/2016 challenge winners. <R> is the average rank over all five datasets of the round and it was used to rank the participants. <S> is the average score over the five datasets of the round. UP is the percent increase in performance between the average performance of the winners in the AutoML phase and the Final phase of the same round. The GPU track was run in Round 4. Team names are abbreviated as follows: aad = aad_freiburg, djaj = djajetic, marc = marc.boulle, tadej = tadejs, abhi = abhishek4, ideal = ideal.intel.analytics, mat = matthias.vonrohr, lisheng = lise_sun, asml = amsl.intel.com, jlr44 = backstreet.bayes, post = postech.mlg_exbrain, ref = reference
There is still room for improvement by manual tweaking and/or additional computational resources, as revealed by the significant differences remaining between Tweakathon and AutoML (blind testing) results (Table 10.5 and Fig. 10.3b). In Round 3, all but one participant failed to turn in working solutions during blind testing, because of the introduction of sparse datasets. Fortunately, the participants recovered, and, by the end of the challenge, several submissions were capable of returning solutions on all the datasets of the challenge. But learning schemes can still be optimized because, even discarding Round 3, there is a 15–35% performance gap between AutoML phases (blind testing with computational constraints) and Tweakathon phases (human intervention and additional compute power). The GPU track, offered in Round 4 only, provided a platform for trying Deep Learning methods. This allowed the participants to demonstrate that, given additional compute power, deep learning methods were competitive with the best solutions of the CPU track. However, no Deep Learning method was competitive given the limited compute power and time budget offered in the CPU track.
10.5.2 Scores Obtained in the 2018 Challenge
Results of the 2018 challenge winners. Each phase was run on five different datasets. We show the winners of the AutoML (blind test) phase and for comparison their performances in the Feedback phase. The full tables can be found at https://competitions.codalab.org/competitions/17767
Performance in this challenge was slightly lower than that observed in the previous edition. This was due to the difficulty of the tasks (see below) and the fact that the feedback phase included three deceptive datasets out of five (associated with tasks from previous challenges, but not necessarily similar to the datasets used in the blind test phase). We decided to proceed this way to emulate a realistic AutoML setting. Although the task was harder, several teams succeeded in returning submissions performing better than chance.
The winner of the challenge was the same team that won the 2015/2016 AutoML challenge: AAD Freiburg [28]. The 2018 challenge helped to incrementally improve the solution devised by this team in the previous challenge. Interestingly, the second-placed team in the challenge proposed a solution that is similar in spirit to that of the winning team. For this challenge, there was a triple tie for third place; prizes were split among the tied teams. Among the winners, two teams used the starting kit. Most of the other teams used either the starting kit or the solution open-sourced by the AAD Freiburg team in the 2015/2016 challenge.
10.5.3 Difficulty of Datasets/Tasks

Categorical variables and missing data. Few datasets had categorical variables in the 2015/2016 challenge (ADULT, ALBERT, and WALDO), and not very many variables were categorical in those datasets. Likewise, very few datasets had missing values (ADULT and ALBERT) and those included only a few missing values. So neither categorical variables nor missing data presented a real difficulty in this challenge, though ALBERT turned out to be one of the most difficult datasets because it was also one of the largest ones. This situation changed drastically for the 2018 challenge where five out of the ten datasets included categorical variables (RL, PM, RI, RH and RM) and missing values (GINA, PM, RL, RI and RM). These were among the main aspects that caused the low performance of most methods in the blind test phase.

Large number of classes. Only one dataset had a large number of classes (DIONIS, with 355 classes). This dataset turned out to be difficult for participants, particularly because it is also large and has unbalanced classes. However, datasets with a large number of classes are not well represented in this challenge. HELENA, which has the second largest number of classes (100), did not stand out as a particularly difficult dataset. In general, though, multiclass problems were found to be more difficult than binary classification problems.

Regression. We had only four regression problems: CADATA, FLORA, YOLANDA, PABLO.

Sparse data. A significant number of datasets had sparse data (DOROTHEA, FABERT, ALEXIS, WALLIS, GRIGORIS, EVITA, FLORA, TANIA, ARTURO, MARCO). Several of them turned out to be difficult, particularly ALEXIS, WALLIS, and GRIGORIS, large datasets in sparse format that caused memory problems when they were introduced in Round 3 of the 2015/2016 challenge. We later increased the amount of memory on the servers, and similar datasets introduced in later phases caused fewer difficulties.

Large datasets. We expected the ratio of the number N of features over the number P_{tr} of training examples to be a particular difficulty (because of the risk of overfitting), but modern machine learning algorithms are robust against overfitting. The main difficulty was rather the product N × P_{tr}. Most participants attempted to load the entire dataset in memory and to convert sparse matrices into full matrices. This took a long time and then caused performance losses or program failures. Large datasets with N × P_{tr} > 20 · 10^6 include ALBERT, ALEXIS, DIONIS, GRIGORIS, WALLIS, EVITA, FLORA, TANIA, MARCO, GINA, GUILLERMO, PM, RH, RI, RICCARDO, and RM. These overlap significantly with the sparse datasets listed above. For the 2018 challenge, all datasets in the final phase exceeded this threshold, which is why the code of several teams failed to finish within the time budget. Only ALBERT and DIONIS were "truly" large (few features, but over 400,000 training examples).

Presence of probes: Some datasets had a certain proportion of distractor features or irrelevant variables (probes). Those were obtained by randomly permuting the values of real features. Two thirds of the datasets contained probes: ADULT, CADATA, DIGITS, DOROTHEA, CHRISTINE, JASMINE, MADELINE, PHILIPPINE, SYLVINE, ALBERT, DILBERT, FABERT, JANNIS, EVITA, FLORA, YOLANDA, ARTURO, CARLO, PABLO, and WALDO. This, in part, allowed us to make datasets that were in the public domain less recognizable.
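The probe construction described above, permuting the values of real features so the probes keep realistic marginal distributions but carry no signal, can be sketched as follows. The function name is ours, for illustration only.

```python
import numpy as np

def add_probes(X, n_probes, rng=None):
    """Append `n_probes` distractor features, each obtained by randomly
    permuting the values of a randomly chosen real feature. The probes
    have the same marginal distribution as real features but are
    statistically independent of the target."""
    rng = np.random.default_rng(rng)
    cols = rng.integers(0, X.shape[1], size=n_probes)
    probes = np.stack([rng.permutation(X[:, c]) for c in cols], axis=1)
    return np.hstack([X, probes])
```

Because a probe is a permutation of a genuine column, simple univariate statistics cannot distinguish it from the original; only its (lack of) association with the target gives it away.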

Type of metric: We used six metrics, as defined in Sect. 10.4.2. The distribution of tasks in which they were used was not uniform: BAC (11), AUC (6), F1 (3), and PAC (6) for classification, and R2 (2) and ABS (2) for regression. This is because not all metrics lend themselves naturally to all types of applications.

Time budget: Although in round 0 we experimented with giving different time budgets for the various datasets, we ended up assigning 1200 s (20 min) to all datasets in all other rounds. Because the datasets varied in size, this put more constraints on large datasets.

Class imbalance: This was not a difficulty found in the 2015/2016 datasets. However, extreme class imbalance was the main difficulty for the 2018 edition. Imbalance ratios of 1:10 or lower were present in the RL, PM, RH, RI, and RM datasets; in the last of these, class imbalance was as extreme as 1:1000. This was the reason why the performance of teams was low.
For the datasets used in the 2018 challenge, the tasks' difficulty was clearly associated with extreme class imbalance, inclusion of categorical variables, and high dimensionality in terms of N × P_{tr}. However, for the 2015/2016 challenge datasets we found that it was generally difficult to guess what makes a task easy or hard, except for dataset size, which pushed participants to the frontier of the hardware capabilities and forced them to improve the computational efficiency of their methods. Binary classification problems (and multilabel problems) are intrinsically "easier" than multiclass problems, for which "guessing" has a lower probability of success. This partially explains the higher median performance in Rounds 1 and 3, which are dominated by binary and multilabel classification problems. There are not enough datasets illustrating each of the other types of difficulty to draw further conclusions. In our analysis, we distinguish two notions of task difficulty:
 1.
Intrinsic difficulty, linked to the amount of noise or the signal to noise ratio. Given an infinite amount of data and an unbiased learning machine \(\hat {F}\) capable of identifying F, the prediction performances cannot exceed a given maximum value, corresponding to \(\hat {F}=F\).
 2.
Modeling difficulty, linked to the bias and variance of estimators \(\hat {F}\), in connection with the limited amount of training data and limited computational resources, and the possibly large number or parameters and hyperparameters to estimate.
Evaluating the intrinsic difficulty is impossible unless we know F. Our best approximation of F is the winners' solution. We therefore use the winners' performance as an estimator of the best achievable performance. This estimator may have both bias and variance: it is possibly biased because the winners may be underfitting the training data; it may have variance because of the limited amount of test data. Underfitting is difficult to test. Its symptoms may be that the variance or the entropy of the predictions is lower than that of the target values.
Evaluating the modeling difficulty is also impossible unless we know F and the model class. In the absence of knowledge on the model class, data scientists often use generic predictive models, agnostic with respect to the data generating process. Such models range from very basic models that are highly biased towards "simplicity" and smoothness of predictions (e.g., regularized linear models) to highly versatile unbiased models that can learn any function given enough data (e.g., ensembles of decision trees). To indirectly assess modeling difficulty, we resorted to using the difference in performance between the method of the challenge winner and that of (a) the best of four "untuned" basic models (taken from classical techniques provided in the scikit-learn library [55] with default hyperparameters) or (b) Selective Naive Bayes (SNB) [12, 13], a highly regularized model (biased towards simplicity) providing a very robust and simple baseline.
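The comparison in (a) above can be sketched as follows: score four "untuned" basic models with default hyperparameters and take the gap between the best of them and the winners' score as a proxy for modeling difficulty. A public dataset and a placeholder winner score stand in for the challenge data here; the four model families match those discussed later in the chapter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; in the study this would be a challenge dataset.
X, y = load_breast_cancer(return_X_y=True)

basics = {
    "NB": GaussianNB(),
    "SGD-linear": SGDClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
# Default hyperparameters throughout: these are deliberately "untuned".
scores = {name: cross_val_score(m, X, y, cv=3).mean() for name, m in basics.items()}

best_basic = max(scores.values())
winner_score = 0.99  # placeholder for the winners' performance on this task
modeling_difficulty = winner_score - best_basic
```

A large gap suggests the task rewards sophisticated model search; a small gap suggests even untuned baselines get close to the best achievable performance.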

LandmarkDecisionTree: performance of a decision tree classifier.

Landmark1NN: performance of a nearest neighbor classifier.

SkewnessMin: the minimum over the skewness of all features. Skewness measures the asymmetry of a distribution. A positive skewness value means that the distribution has a longer tail on the right.
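The three meta-features above can be computed as follows. This is our own sketch following the descriptions in the text; the exact auto-sklearn implementation may differ in details such as the cross-validation scheme.

```python
from scipy.stats import skew
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset for computing the meta-features.
X, y = load_iris(return_X_y=True)

# LandmarkDecisionTree: performance of a decision tree classifier.
landmark_dt = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X, y, cv=3).mean()
# Landmark1NN: performance of a nearest neighbor classifier.
landmark_1nn = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                               X, y, cv=3).mean()
# SkewnessMin: min over skewness of all features.
skewness_min = skew(X, axis=0).min()
```

Landmark features are cheap probes of how "easy" a dataset is for simple learners, which is why they leak performance information when used for meta-learning, as noted below.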
10.5.4 Hyperparameter Optimization
10.5.5 Metalearning
One question is whether metalearning [14] is possible, that is, learning to predict whether a given classifier will perform well on future datasets (without actually training it), based on its past performances on other datasets. We investigated whether it is possible to predict which basic method will perform best based on the metalearning features of autosklearn (see the online appendix). We removed the "Landmark" features from the set of meta-features because those are performances of basic predictors (albeit rather poor ones with many missing values), which would lead to a form of "data leakage".

NB: Naive Bayes

SGDlinear: Linear model (trained with stochastic gradient descent)

KNN: Knearest neighbors

RF: Random Forest
The features that are most predictive all have to do with "ClassProbability" and "PercentageOfMissingValues", indicating that class imbalance and/or a large number of classes (in a multiclass problem) and the percentage of missing values might be important. However, there is a high chance of overfitting, as indicated by an unstable ranking of the best features under resampling of the training data.
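The meta-learning experiment described above can be sketched as follows: one row of meta-features per dataset, a label giving the best basic method on that dataset, and a classifier trained to predict the best method for a new dataset. The data here are random stand-ins (the real experiment used the challenge datasets and autosklearn's meta-features), so only the mechanics, including the resampling-based stability check, are illustrated.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
meta_features = rng.normal(size=(30, 10))   # 30 "datasets", 10 meta-features
best_method = rng.integers(0, 4, size=30)   # 0=NB, 1=SGD-linear, 2=KNN, 3=RF

# Predict the best basic method from meta-features.
meta_clf = RandomForestClassifier(random_state=0).fit(meta_features, best_method)
importances = meta_clf.feature_importances_  # which meta-features matter most

# Stability check hinted at in the text: refit on a bootstrap resample of
# the training "datasets" and compare the importance ranking.
idx = rng.integers(0, 30, size=30)
alt_clf = RandomForestClassifier(random_state=1).fit(meta_features[idx],
                                                     best_method[idx])
```

If the ranking of `importances` changes substantially between refits, conclusions about which meta-features drive the prediction should be treated with caution, which is the overfitting concern raised above.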
10.5.6 Methods Used in the Challenges
A brief description of methods used in both challenges is provided in the online appendix, together with the results of a survey on methods that we conducted after the challenges. In light of the overview of Sect. 10.2 and the results presented in the previous section, we may wonder whether a dominant methodology for solving the AutoML problem has emerged and whether particular technical solutions were widely adopted. In this section we call "model space" the set of all models under consideration. We call "basic models" (elsewhere also called "simple models", "individual models", or "base learners") the members of a library of models from which hyper-models or model ensembles are built.
Ensembling: dealing with overfitting and anytime learning
Ensembling is the big winner of the AutoML challenge series, since it was used by over 80% of the participants and by all the top-ranking ones. While a few years ago the hottest issue in model selection and hyperparameter optimization was overfitting, nowadays the problem seems to have been largely avoided by using ensembling techniques. In the 2015/2016 challenge, we varied the ratio of the number of training examples over the number of variables (Ptr∕N) by several orders of magnitude. Five datasets had a ratio Ptr∕N lower than one (dorothea, newsgroup, grigoris, wallis, and flora), a case lending itself particularly to overfitting. Although Ptr∕N is the most predictive variable of the median performance of the participants, there is no indication that the datasets with Ptr∕N < 1 were particularly difficult for the participants (Fig. 10.5). Ensembles of predictors have the additional benefit of addressing in a simple way the "anytime learning" problem by progressively growing a bigger ensemble of predictors, improving performance over time. All trained predictors are usually incorporated in the ensemble. For instance, if cross-validation is used, the predictors of all folds are directly incorporated in the ensemble, which saves the computational time of retraining a single model on the best selected hyperparameters and may yield more robust solutions (though slightly more biased due to the smaller sample size). The approaches differ in the way they weigh the contributions of the various predictors. Some methods use the same weight for all predictors (this is the case of bagging methods such as Random Forest and of Bayesian methods that sample predictors according to their posterior probability in model space). Some methods assess the weights of the predictors as part of learning (this is the case of boosting methods, for instance). One simple and effective method to create ensembles of heterogeneous models was proposed by [16].
It was used successfully in several past challenges, e.g., [52], and is the method implemented by the aad_freiburg team, one of the strongest participants in both challenges [25]. The method consists in cycling several times over all trained models and incorporating into the ensemble, at each cycle, the model that most improves the performance of the ensemble. Models vote with weight 1, but they can be incorporated multiple times, which de facto results in weighting them. This method makes it possible to recompute the model weights very quickly if cross-validated predictions are saved. Moreover, the method allows optimizing the ensemble for any metric by post-fitting the predictions of the ensemble to the desired metric (an aspect which was important in this challenge).
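The greedy forward-selection-with-replacement procedure of [16] described above can be sketched as follows: at each cycle, tentatively add each model to the current ensemble and keep the one that most improves the metric. Picking a model several times effectively weights it. This is a minimal sketch, not the aad_freiburg implementation.

```python
import numpy as np

def ensemble_selection(preds, y_true, metric, n_cycles=10):
    """Greedy ensemble selection with replacement.
    `preds` has shape (n_models, n_samples): saved (cross-validated)
    predictions of each model. Returns the de facto model weights."""
    ensemble_sum = np.zeros(preds.shape[1])
    counts = np.zeros(preds.shape[0], dtype=int)
    for _ in range(n_cycles):
        # Score each candidate ensemble (current ensemble + one model).
        scores = [metric(y_true, (ensemble_sum + p) / (counts.sum() + 1))
                  for p in preds]
        best = int(np.argmax(scores))
        ensemble_sum += preds[best]
        counts[best] += 1         # a model may be picked multiple times
    return counts / counts.sum()

# Toy usage: two models, an accuracy-like metric on probabilities.
y = np.array([0, 1, 1, 0])
preds = np.array([[0.1, 0.9, 0.8, 0.2],    # good model
                  [0.9, 0.1, 0.2, 0.8]])   # bad model
acc = lambda t, p: np.mean((p >= 0.5).astype(int) == t)
weights = ensemble_selection(preds, y, acc, n_cycles=5)
```

Because only saved predictions are combined, re-running the selection under a different metric is cheap, which is how the ensemble can be post-fitted to each task's scoring metric.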
Model evaluation: crossvalidation or simple validation
Evaluating the predictive accuracy of models is a critical and necessary building block of any model selection or ensembling method. Model selection criteria computed from the predictive accuracy of basic models evaluated on training data, by training a single time on all the training data (possibly at the expense of minor additional calculations), such as performance bounds, were not used at all, as was already the case in previous challenges we organized [35]. Cross-validation was widely used, particularly K-fold cross-validation. However, basic models were often "cheaply" evaluated on just one fold to allow quickly discarding non-promising areas of model space. This technique is used more and more frequently to help speed up search. Another speed-up strategy is to train on a subset of the training examples and monitor the learning curve. The "freeze-thaw" strategy [64] halts training of models that do not look promising on the basis of the learning curve, but may restart training them at a later point. This was used, e.g., by [48] in the 2015/2016 challenge.
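The "cheap first pass" strategy described above can be sketched as a two-stage evaluation: score every candidate on a single held-out fold, discard the unpromising ones, then run full K-fold cross-validation only on the survivors. The candidates and the survival threshold below are illustrative choices, not those of any particular team.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)]

# Stage 1: cheap evaluation on a single fold.
train_idx, val_idx = next(KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X))
cheap = [clone(m).fit(X[train_idx], y[train_idx]).score(X[val_idx], y[val_idx])
         for m in candidates]

# Stage 2: full 5-fold CV only for candidates at or above the median.
survivors = [m for m, s in zip(candidates, cheap) if s >= np.median(cheap)]
full = [cross_val_score(m, X, y, cv=5).mean() for m in survivors]
```

The trade-off is the usual one: single-fold estimates are noisy, so a strict threshold risks discarding a good model, while a loose one erodes the speed-up.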
Model space: Homogeneous vs. heterogeneous
An unsettled question is whether one should search a large or small model space. The challenge did not allow us to give a definite answer to this question. Most participants opted for searching a relatively large model space, including a wide variety of models found in the scikit-learn library. Yet, one of the strongest entrants (the Intel team) submitted results simply obtained with a boosted decision tree (i.e., consisting of a homogeneous set of weak learners/basic models). Clearly, it suffices to use just one machine learning approach that is a universal approximator to be able to learn anything, given enough training data. So why include several? It is a question of rate of convergence: how fast we climb the learning curve. Including stronger basic models is one way to climb the learning curve faster. Our post-challenge experiments (Fig. 10.9) reveal that the scikit-learn version of Random Forest (an ensemble of homogeneous basic models, namely decision trees) does not usually perform as well as the winners' version, hinting that there is a lot of know-how in the Intel solution, which is also based on ensembles of decision trees, that is not captured by a basic ensemble of decision trees such as RF. We hope that more principled research will be conducted on this topic in the future.
Search strategies: Filter, wrapper, and embedded methods
With the availability of powerful machine learning toolkits like scikit-learn (on which the starting kit was based), the temptation is great to implement all-wrapper methods to solve the CASH (or "full model selection") problem. Indeed, most participants went that route. Although a number of ways of optimizing hyperparameters with embedded methods for several basic classifiers have been published [35], they each require changing the implementation of the basic methods, which is time-consuming and error-prone compared to using already debugged and well-optimized library versions of the methods. Hence practitioners are reluctant to invest development time in the implementation of embedded methods. A notable exception is the software of marc.boulle, which offers a self-contained, hyperparameter-free solution based on Naive Bayes, which includes recoding of variables (grouping or discretization) and variable selection. See the online appendix.
Multilevel optimization
Another interesting issue is whether multiple levels of hyperparameters should be considered for reasons of computational effectiveness or overfitting avoidance. In the Bayesian setting, for instance, it is quite feasible to consider a hierarchy of parameters/hyperparameters and several levels of priors/hyperpriors. However, it seems that for practical computational reasons, in the AutoML challenges, the participants used a shallow organization of hyperparameter space and avoided nested cross-validation loops.
Time management: Exploration vs. exploitation tradeoff
With a tight time budget, efficient search strategies must be put into place to monitor the exploration/exploitation tradeoff. To compare strategies, we show in the online appendix learning curves for two top ranking participants who adopted very different methods: Abhishek and aad_freiburg. The former uses heuristic methods based on prior human experience while the latter initializes search with models predicted to be best suited by metalearning, then performs Bayesian optimization of hyperparameters. Abhishek seems to often start with a better solution but explores less effectively. In contrast, aad_freiburg starts lower but often ends up with a better solution. Some elements of randomness in the search are useful to arrive at better solutions.
Preprocessing and feature selection
The datasets had intrinsic difficulties that could be in part addressed by preprocessing or special modifications of algorithms: sparsity, missing values, categorical variables, and irrelevant variables. Yet it appears that among the top-ranking participants, preprocessing was not a focus of attention. They relied on the simple heuristics provided in the starting kit: replacing missing values by the median and adding a missingness indicator variable, and one-hot encoding of categorical variables. Simple normalizations were used. The irrelevant variables were ignored by two thirds of the participants and no use of feature selection was made by top-ranking participants. The methods used that involve ensembling seem to be intrinsically robust against irrelevant variables. More details from the fact sheets are found in the online appendix.
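The simple heuristics listed above, median imputation with a missingness indicator for numeric columns, one-hot encoding for categorical columns, and a simple normalization, can be expressed with current scikit-learn tools as a sketch (the starting kit predates some of these APIs; column indices here are illustrative).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = [0, 1]   # illustrative: which columns are numeric
cat_cols = [2]      # illustrative: which columns are categorical

preprocessor = ColumnTransformer([
    # Median imputation + missingness indicator + normalization.
    ("num", make_pipeline(
        SimpleImputer(strategy="median", add_indicator=True),
        StandardScaler()), num_cols),
    # Impute then one-hot encode categorical columns.
    ("cat", make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore")), cat_cols),
])

# Toy data: two numeric columns (each with a missing value) and one
# binary categorical column.
X_demo = np.array([[1.0, np.nan, 0.0],
                   [2.0, 3.0, 1.0],
                   [np.nan, 4.0, 0.0]])
X_out = preprocessor.fit_transform(X_demo)
# -> 2 imputed numeric cols + 2 missingness indicators + 2 one-hot cols
```

The missingness indicator matters: it lets downstream ensemble methods exploit the fact that a value was missing, rather than silently treating the imputed median as observed.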
Unsupervised learning
Despite the recent renewal of interest in unsupervised learning spurred by the Deep Learning community, in the AutoML challenge series unsupervised learning was not widely used, except for classical dimensionality reduction techniques such as ICA and PCA. See the online appendix for more details.
Transfer learning and meta learning
To our knowledge, only aad_freiburg relied on metalearning to initialize their hyperparameter search. To that end, they used datasets from OpenML. The number of datasets released and the diversity of tasks did not allow the participants to perform effective transfer learning or metalearning.
Deep learning
The type of computational resources available in AutoML phases ruled out the use of Deep Learning, except in the GPU track. However, even in that track, the Deep Learning methods did not come out ahead. One exception is aad_freiburg, who used Deep Learning in the Tweakathons of Rounds 3 and 4 and found it to be helpful for the datasets ALEXIS, TANIA, and YOLANDA.
Task and metric optimization
There were four types of tasks (regression, binary classification, multiclass classification, and multilabel classification) and six scoring metrics (R2, ABS, BAC, AUC, F1, and PAC). Moreover, class balance and the number of classes varied a lot across classification problems. Only moderate effort was put into designing methods optimizing specific metrics. Rather, generic methods were used and the outputs post-fitted to the target metrics by cross-validation or through the ensembling method.
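Post-fitting a generic model's outputs to the target metric can be as simple as keeping a probabilistic predictor and choosing, on held-out predictions, the decision threshold that maximizes the metric (here F1). This is a minimal sketch of the idea, not a specific team's code.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_score, metric=f1_score):
    """Pick the decision threshold maximizing `metric` on held-out
    predictions; the model itself is left untouched."""
    thresholds = np.unique(y_score)
    scores = [metric(y_true, (y_score >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Held-out targets and predicted probabilities from some generic model.
y = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.45, 0.5, 0.9])
t = best_threshold(y, y_score)
```

The same post-fitting pattern applies to other metrics: only the final decision rule is tuned per metric, so one trained model can serve tasks scored with BAC, F1, or AUC alike.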
Engineering
One of the big lessons of the AutoML challenge series is that most methods fail to return results in all cases: the issue is not returning a "good" result, but returning "any" reasonable result at all. Reasons for failure include "out of time" and "out of memory", as well as various other failures (e.g., numerical instabilities). We are still very far from having "basic models" that run on all datasets. One of the strengths of autosklearn is that it ignores models that fail and generally finds at least one that returns a result.
Parallelism
The computers made available had several cores, so in principle, the participants could make use of parallelism. One common strategy was just to rely on numerical libraries that internally use such parallelism automatically. The aad_freiburg team used the different cores to launch in parallel model search for different datasets (since each round included five datasets). These different uses of computational resources are visible in the learning curves (see the online appendix).
10.6 Discussion
 1.
Was the provided time budget sufficient to complete the tasks of the challenge? We drew learning curves as a function of time for the winning solution of aad_freiburg (autosklearn, see the online appendix). This revealed that for most datasets, performances still improved well beyond the time limit imposed by the organizers. Although for about half the datasets the improvement was modest (no more than 20% of the score obtained at the end of the imposed time limit), for some datasets the improvement was very large (more than 2× the original score). The improvements are usually gradual, but sudden performance improvements can occur. For instance, for WALLIS, the score doubled suddenly at 3× the time limit imposed in the challenge. As also noted by the authors of the autosklearn package [25], it has a slow start but in the long run gets performances close to the best method.
 2.
Are there tasks that were significantly more difficult than others for the participants? Yes, there was a very wide range of difficulties for the tasks as revealed by the dispersion of the participants in terms of average (median) and variability (third quartile) of their scores. MADELINE, a synthetic dataset featuring a very nonlinear task, was very difficult for many participants. Other difficulties that caused failures to return a solution included large memory requirements (particularly for methods that attempted to convert sparse matrices to full matrices), and short time budgets for datasets with a large number of training examples and/or features or with many classes or labels.
 3.
Are there metafeatures of datasets and methods providing useful insight to recommend certain methods for certain types of datasets? The aad_freiburg team used a subset of 53 metafeatures (a superset of the simple statistics provided with the challenge datasets) to measure similarity between datasets. This allowed them to conduct hyperparameter search more effectively by initializing the search with settings identical to those selected for similar datasets previously processed (a form of metalearning). Our own analysis revealed that it is very difficult to predict the predictors' performances from the metafeatures, but it is possible to predict relatively accurately which "basic method" will perform best. With LDA, we could visualize how datasets group in two dimensions and show a clean separation between datasets "preferring" Naive Bayes, linear SGD, KNN, or RF. This deserves further investigation.
 4.
Does hyperparameter optimization really improve performance over using default values? The comparison we conducted reveals that optimizing hyperparameters rather than choosing default values for a set of four basic predictive models (K-nearest neighbors, Random Forests, linear SGD, and Naive Bayes) is generally beneficial. In the majority of cases (but not always), hyperparameter optimization (hyperopt) results in better performances than default values. Hyperopt sometimes fails because of time or memory limitations, which leaves room for improvement.
 5.
How do winners' solutions compare with basic scikit-learn models? They compare favorably. For example, basic models whose hyperparameters have been optimized do not generally yield results as good as running autosklearn. However, a basic model with default HP sometimes outperforms the same model tuned by autosklearn.
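The default-versus-tuned comparison from point 4 above can be sketched as follows for one of the four basic models (Random Forest): score the default configuration, then run a small randomized hyperparameter search and compare. The dataset, search space, and budget are illustrative stand-ins for the post-challenge experimental setup.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = load_digits(return_X_y=True)

# Baseline: default hyperparameters (n_estimators reduced for speed).
default_score = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=3).mean()

# Small randomized search over an illustrative hyperparameter space.
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions={"max_depth": randint(2, 20),
                         "min_samples_split": randint(2, 10)},
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)
tuned_score = search.best_score_
```

As the chapter observes, tuning is usually but not always a win: with a small search budget, `tuned_score` can occasionally fall below `default_score`, since defaults are themselves well-chosen settings.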
10.7 Conclusion
We have analyzed the results of several rounds of AutoML challenges.
Our design of the first AutoML challenge (2015/2016) was satisfactory in many respects. In particular, we attracted a large number of participants (over 600), attained results that are statistically significant, and advanced the state of the art in automated machine learning. Publicly available libraries have emerged as a result of this endeavor, including autosklearn.
In particular, we designed a benchmark with a large number of diverse datasets, with large enough test sets to separate topranking participants. It is difficult to anticipate the size of the test sets needed, because the error bars depend on the performances attained by the participants, so we are pleased that we made reasonable guesses. Our simple ruleofthumb “N = 50/E” where N is the number of test samples and E the error rate of the smallest class seems to be widely applicable. We made sure that the datasets were neither too easy nor too hard. This is important to be able to separate participants. To quantify this, we introduced the notion of “intrinsic difficulty” and “modeling difficulty”. Intrinsic difficulty can be quantified by the performance of the best method (as a surrogate for the best attainable performance, i.e., the Bayes rate for classification problems). Modeling difficulty can be quantified by the spread in performance between methods. Our best problems have relatively low “intrinsic difficulty” and high “modeling difficulty”. However, the diversity of the 30 datasets of our first 2015/2016 challenge is both a feature and a curse: it allows us to test the robustness of software across a variety of situations, but it makes metalearning very difficult, if not impossible. Consequently, external metalearning data must be used if metalearning is to be explored. This was the strategy adopted by the AAD Freiburg team, which used the OpenML data for meta training. Likewise, we attached different metrics to each dataset. This contributed to making the tasks more realistic and more difficult, but also made metalearning harder. In the second 2018 challenge, we diminished the variety of datasets and used a single metric.
With respect to task design, we learned that the devil is in the details. The challenge participants solve exactly the task proposed, to the point that their solution may not be adaptable to seemingly similar scenarios. In the case of the AutoML challenge, we pondered whether the metric of the challenge should be the area under the learning curve or one point on the learning curve (the performance obtained after a fixed maximum computational time has elapsed). We ended up favoring the second solution for practical reasons. Examining the learning curves of some participants after the challenge, it is quite clear that the two problems are radically different, particularly with respect to strategies balancing "exploration" and "exploitation". This prompted us to think about the differences between "fixed-time" learning (the participants know the time limit in advance and are judged only on the solution delivered at the end of that time) and "any-time" learning (the participants can be stopped at any time and asked to return a solution). Both scenarios are useful: the first is practical when models must be delivered continuously at a rapid pace, e.g., for marketing applications; the second is practical in environments where computational resources are unreliable and interruption may be expected (e.g., people working remotely via an unreliable connection). This will influence the design of future challenges.
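The difference between the two candidate metrics can be made concrete with a toy sketch; the learning curves and the trapezoidal integration below are our own illustration, not the challenge scoring code:

```python
# Hypothetical learning curves: score as a function of elapsed time.
# An "exploiter" climbs fast and plateaus; an "explorer" starts slowly
# but ends higher.
times = [0.1, 0.25, 0.5, 1.0]         # fractions of the time budget
exploiter = [0.70, 0.74, 0.75, 0.76]
explorer = [0.40, 0.55, 0.70, 0.80]

def area_under_curve(t, s):
    """Trapezoidal area under the learning curve (the "any-time" view)."""
    return sum((t[i + 1] - t[i]) * (s[i + 1] + s[i]) / 2
               for i in range(len(t) - 1))

def final_score(t, s):
    """Performance at the end of the time budget (the "fixed-time" view)."""
    return s[-1]

# The fixed-time metric rewards the explorer (0.80 vs. 0.76), while the
# any-time metric rewards the exploiter, which accumulated good
# performance early on.
```

The same pair of systems is thus ranked in opposite order by the two metrics, which is why the choice between them shapes the participants' strategies.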
The two versions of the AutoML challenge we have run differ in the difficulty of transfer learning. In the 2015/2016 challenge, round 0 introduced a sample of all types of data and difficulties (types of targets, sparse data or not, missing data or not, categorical variables or not, more examples than features or not). Then each round ramped up the difficulty: the code of the participants was blind-tested on data one notch harder than in the previous round. Hence transfer was quite hard. In the 2018 challenge, we had two phases, each with five datasets of similar difficulty, and each dataset of the first phase was matched with one corresponding dataset on a similar task. As a result, transfer was made simpler.
Concerning the starting kit and baseline methods, we provided code that ended up being the basis of the solutions of the majority of participants (with notable exceptions from industry, such as Intel and Orange, who used their own "in-house" packages). Thus, we can question whether the software provided biased the approaches taken. Indeed, all participants used some form of ensemble learning, similarly to the strategy used in the starting kit. However, it can be argued that this is a "natural" strategy for this problem. In general, the question of providing enough starting material to the participants without biasing the challenge in a particular direction remains a delicate issue.
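As a sketch of what such an ensembling strategy looks like, here is a minimal greedy forward selection in the spirit of the ensemble selection method of Caruana et al.; the function and the toy data are our own illustration, not the starting kit code:

```python
import numpy as np

def greedy_ensemble_selection(predictions, y_true, n_iterations=3):
    """Repeatedly add (with replacement) the model whose inclusion most
    improves the accuracy of the averaged ensemble prediction.
    `predictions` is a list of per-model probability vectors."""
    ensemble = []
    for _ in range(n_iterations):
        best_score, best_model = -1.0, None
        for i in range(len(predictions)):
            averaged = np.mean([predictions[j] for j in ensemble + [i]], axis=0)
            score = np.mean((averaged > 0.5) == y_true)  # accuracy
            if score > best_score:
                best_score, best_model = score, i
        ensemble.append(best_model)
    return ensemble  # model indices, with multiplicity

# Toy example: three models' predicted probabilities on five samples.
y = np.array([1, 0, 1, 1, 0])
preds = [np.array([0.9, 0.4, 0.6, 0.7, 0.3]),
         np.array([0.6, 0.6, 0.2, 0.8, 0.4]),
         np.array([0.4, 0.1, 0.9, 0.6, 0.7])]
selected = greedy_ensemble_selection(preds, y)
```

Allowing repeated selection of the same model acts as an implicit weighting of the ensemble members, one reason this simple scheme is robust in practice.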
From the point of view of challenge protocol design, we learned that it is difficult to keep teams focused for an extended period of time and through many challenge phases. We attracted a large number of participants (over 600) over the whole course of the AutoML challenge, which lasted over a year (2015/2016) and was punctuated by several events (such as hackathons). However, it may be preferable to organize yearly events punctuated by workshops. This is a natural way of balancing competition and cooperation, since workshops are a place of exchange, and participants are naturally rewarded by the recognition they gain via the system of scientific publications. As a confirmation of this conjecture, the second instance of the AutoML challenge (2017/2018), lasting only 4 months, attracted nearly 300 participants.
One important novelty of our challenge design was code submission. Having the code of the participants executed on the same platform under rigorously similar conditions is a great step towards fairness and reproducibility, as well as towards ensuring the viability of solutions from the computational point of view. We required the winners to release their code under an open-source licence to claim their prizes. This was a sufficient incentive to obtain several software publications as the "product" of the challenges we organized. In our second challenge (AutoML 2018), we used Docker. Distributing Docker images makes it possible for anyone downloading the code of the participants to easily reproduce the results without stumbling over installation problems due to inconsistencies in computer environments and libraries. Still, the hardware may differ, and we find that, in post-challenge evaluations, changing computers may yield significant differences in results. Hopefully, with the proliferation of affordable cloud computing, this will become less of an issue.
The AutoML challenge series is only beginning. Several new avenues are under study. For instance, we are preparing the NIPS 2018 Lifelong Machine Learning challenge, in which participants will be exposed to data whose distribution slowly drifts over time. We are also looking at a challenge on automated machine learning that will focus on transfer from similar domains.
Footnotes
8. In the RL, PM, RH, RI and RM datasets, instances were chronologically sorted; this information was made available to participants and could be used for developing their methods.
9. Examples of sparse datasets were provided in round 0, but they were of smaller size.
10. Independently and Identically Distributed samples.
11. We use sklearn 0.16.1 and auto-sklearn 0.4.0 to mimic the challenge environment.
12. We set the loss of SGD to 'log' in scikit-learn for these experiments.
Acknowledgements
Microsoft supported the organization of this challenge and donated the prizes and cloud computing time on Azure. This project received additional support from the Laboratoire d'Informatique Fondamentale (LIF, UMR CNRS 7279) of the University of Aix-Marseille, France, via the LabeX Archimede program, the Laboratoire de Recherche en Informatique of Paris-Sud University, and INRIA-Saclay as part of the TIMCO project, as well as support from the Paris-Saclay Center for Data Science (CDS). Additional computer resources were provided generously by J. Buhmann, ETH Zürich. This work has been partially supported by the Spanish project TIN2016-74946-P (MINECO/FEDER, UE) and the CERCA Programme/Generalitat de Catalunya. The datasets released were selected among 72 datasets that were donated (or formatted using data publicly available) by the co-authors and by: Y. Aphinyanaphongs, O. Chapelle, Z. Iftikhar Malhi, V. Lemaire, C.J. Lin, M. Madani, G. Stolovitzky, H.J. Thiesen, and I. Tsamardinos. Many people provided feedback on early designs of the protocol and/or tested the challenge platform, including: K. Bennett, C. Capponi, G. Cawley, R. Caruana, G. Dror, T.K. Ho, B. Kégl, H. Larochelle, V. Lemaire, C.J. Lin, V. Ponce López, N. Macia, S. Mercer, F. Popescu, D. Silver, S. Treguer, and I. Tsamardinos. The software developers who contributed to the implementation of the Codalab platform and the sample code include E. Camichael, I. Chaabane, I. Judson, C. Poulain, P. Liang, A. Pesah, L. Romaszko, X. Baro Solé, E. Watson, F. Zhingri, M. Zyskowski. Some initial analyses of the challenge results, performed by I. Chaabane, J. Lloyd, N. Macia, and A. Thakur, were incorporated in this paper. Katharina Eggensperger, Syed Mohsin Ali and Matthias Feurer helped with the organization of the Beat AutoSKLearn challenge. Matthias Feurer also contributed to the simulations of running auto-sklearn on the 2015–2016 challenge datasets.
Bibliography
1. Alamdari, A.R.S.A., Guyon, I.: Quick start guide for CLOP. Tech. rep., Graz University of Technology and Clopinet (May 2006)
2. Andrieu, C., Freitas, N.D., Doucet, A.: Sequential MCMC for Bayesian model selection. In: IEEE Signal Processing Workshop on Higher-Order Statistics. pp. 130–134 (1999)
3. Assunção, F., Lourenço, N., Machado, P., Ribeiro, B.: DENSER: Deep evolutionary network structured representation. arXiv preprint arXiv:1801.01563 (2018)
4. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016)
5. Bardenet, R., Brendel, M., Kégl, B., Sebag, M.: Collaborative hyperparameter tuning. In: 30th International Conference on Machine Learning. vol. 28, pp. 199–207. JMLR Workshop and Conference Proceedings (May 2013)
6. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), 1798–1828 (2013)
7. Bennett, K.P., Kunapuli, G., Hu, J., Pang, J.S.: Bilevel optimization and machine learning. In: Computational Intelligence: Research Frontiers. Lecture Notes in Computer Science, vol. 5050, pp. 25–47. Springer (2008)
8. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 281–305 (2012)
9. Bergstra, J., Yamins, D., Cox, D.D.: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: 30th International Conference on Machine Learning. vol. 28, pp. 115–123 (2013)
10. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems. pp. 2546–2554 (2011)
11. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artificial Intelligence 97(1–2), 273–324 (December 1997)
12. Boullé, M.: Compression-based averaging of selective naive Bayes classifiers. Journal of Machine Learning Research 8, 1659–1685 (2007), http://dl.acm.org/citation.cfm?id=1314554
13. Boullé, M.: A parameter-free classification method for large scale learning. Journal of Machine Learning Research 10, 1367–1385 (2009), https://doi.org/10.1145/1577069.1755829
14. Brazdil, P., Carrier, C.G., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer Science & Business Media (2008)
15. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
16. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from libraries of models. In: 21st International Conference on Machine Learning. p. 18. ACM (2004)
17. Cawley, G.C., Talbot, N.L.C.: Preventing over-fitting during model selection via Bayesian regularisation of the hyperparameters. Journal of Machine Learning Research 8, 841–861 (April 2007)
18. Colson, B., Marcotte, P., Savard, G.: An overview of bilevel programming. Annals of Operations Research 153, 235–256 (2007)
19. Dempe, S.: Foundations of Bilevel Programming. Kluwer Academic Publishers (2002)
20. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1923 (1998)
21. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2nd edn. Wiley (2001)
22. Efron, B.: Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association 78(382), 316–331 (1983)
23. Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K.: Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS Workshop on Bayesian Optimization in Theory and Practice (2013)
24. Escalante, H.J., Montes, M., Sucar, L.E.: Particle swarm model selection. Journal of Machine Learning Research 10, 405–440 (2009)
25. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. pp. 2962–2970 (2015), https://github.com/automl/autosklearn
26. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Methods for improving Bayesian optimization for AutoML. In: ICML 2015 Workshop on Automatic Machine Learning (2015)
27. Feurer, M., Springenberg, J., Hutter, F.: Initializing Bayesian hyperparameter optimization via meta-learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 1128–1135 (2015)
28. Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., Hutter, F.: Practical automated machine learning for the AutoML challenge 2018. In: International Workshop on Automatic Machine Learning at ICML (2018), https://sites.google.com/site/automl2018icml/
29. Friedman, J.H.: Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5), 1189–1232 (2001)
30. Ghahramani, Z.: Unsupervised learning. In: Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, vol. 3176, pp. 72–112. Springer Berlin Heidelberg (2004)
31. Guyon, I.: Challenges in Machine Learning book series. Microtome (2011–2016), http://www.mtome.com/Publications/CiML/ciml.html
32. Guyon, I., Bennett, K., Cawley, G., Escalante, H.J., Escalera, S., Ho, T.K., Macià, N., Ray, B., Saeed, M., Statnikov, A., Viegas, E.: AutoML challenge 2015: Design and first results. In: Proc. of AutoML 2015@ICML (2015), https://drive.google.com/file/d/0BzRGLkqgrIqWkpzcGw4bFpBMUk/view
33. Guyon, I., Bennett, K., Cawley, G., Escalante, H.J., Escalera, S., Ho, T.K., Macià, N., Ray, B., Saeed, M., Statnikov, A., Viegas, E.: Design of the 2015 ChaLearn AutoML challenge. In: International Joint Conference on Neural Networks (2015), http://www.causality.inf.ethz.ch/AutoML/automl_ijcnn15.pdf
34. Guyon, I., Chaabane, I., Escalante, H.J., Escalera, S., Jajetic, D., Lloyd, J.R., Macía, N., Ray, B., Romaszko, L., Sebag, M., Statnikov, A., Treguer, S., Viegas, E.: A brief review of the ChaLearn AutoML challenge. In: Proc. of AutoML 2016@ICML (2016), https://docs.google.com/a/chalearn.org/viewer?a=v&pid=sites&srcid=Y2hhbGVhcm4ub3JnfGF1dG9tbHxneDoyYThjZjhhNzRjMzI3MTg4
35. Guyon, I., Alamdari, A.R.S.A., Dror, G., Buhmann, J.: Performance prediction challenge. In: International Joint Conference on Neural Networks. pp. 1649–1656 (2006)
36. Guyon, I., Bennett, K., Cawley, G., Escalante, H.J., Escalera, S., Ho, T.K., Ray, B., Saeed, M., Statnikov, A., Viegas, E.: AutoML challenge 2015: Design and first results (2015)
37. Guyon, I., Cawley, G., Dror, G.: Hands-On Pattern Recognition: Challenges in Machine Learning, Volume 1. Microtome Publishing, USA (2011)
38. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.): Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer (2006)
39. Hastie, T., Rosset, S., Tibshirani, R., Zhu, J.: The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, 1391–1415 (2004)
40. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edn. Springer (2001)
41. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Proceedings of the Conference on Learning and Intelligent OptimizatioN (LION 5) (2011)
42. Ioannidis, J.P.A.: Why most published research findings are false. PLoS Medicine 2(8), e124 (August 2005)
43. Jordan, M.I.: On statistics, computation and scalability. Bernoulli 19(4), 1378–1390 (September 2013)
44. Keerthi, S.S., Sindhwani, V., Chapelle, O.: An efficient method for gradient-based adaptation of hyperparameters in SVM models. In: Advances in Neural Information Processing Systems (2007)
45. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian hyperparameter optimization on large datasets. Electronic Journal of Statistics 11 (2017)
46. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1–2), 273–324 (December 1997)
47. Langford, J.: Clever methods of overfitting (2005), blog post at http://hunch.net/?p=22
48. Lloyd, J.: Freeze Thaw Ensemble Construction. https://github.com/jamesrobertlloyd/automlphase2 (2016)
49. Momma, M., Bennett, K.P.: A pattern search method for model selection of support vector regression. In: Proceedings of the SIAM International Conference on Data Mining. SIAM (2002)
50. Moore, G., Bergeron, C., Bennett, K.P.: Model selection for primal SVM. Machine Learning 85(1–2), 175–208 (October 2011)
51. Moore, G.M., Bergeron, C., Bennett, K.P.: Nonsmooth bilevel programming for hyperparameter selection. In: IEEE International Conference on Data Mining Workshops. pp. 374–381 (2009)
52. Niculescu-Mizil, A., Perlich, C., Swirszcz, G., Sindhwani, V., Liu, Y., Melville, P., Wang, D., Xiao, J., Hu, J., Singh, M., et al.: Winning the KDD Cup Orange Challenge with ensemble selection. In: Proceedings of the 2009 International Conference on KDD-Cup 2009. vol. 7, pp. 23–34. JMLR.org (2009)
53. Opper, M., Winther, O.: Gaussian processes and SVM: Mean field results and leave-one-out. pp. 43–65. MIT Press (2000)
54. Park, M.Y., Hastie, T.: L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(4), 659–677 (2007)
55. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
56. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018)
57. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Le, Q., Kurakin, A.: Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041 (2017)
58. Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.): Recommender Systems Handbook. Springer (2011)
59. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press (2001)
60. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems 25. pp. 2951–2959 (2012)
61. Statnikov, A., Wang, L., Aliferis, C.F.: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 9(1) (2008)
62. Sun, Q., Pfahringer, B., Mayo, M.: Full model selection in the space of data mining operators. In: Genetic and Evolutionary Computation Conference. pp. 1503–1504 (2012)
63. Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Advances in Neural Information Processing Systems 26. pp. 2004–2012 (2013)
64. Swersky, K., Snoek, J., Adams, R.P.: Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014)
65. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. CoRR abs/1208.3719 (2012)
66. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 847–855. ACM (2013)
67. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)
68. Vapnik, V., Chapelle, O.: Bounds on error expectation for support vector machines. Neural Computation 12(9), 2013–2036 (2000)
69. Weston, J., Elisseeff, A., BakIr, G., Sinz, F.: Spider (2007), http://mloss.org/software/view/29/
70. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.