
Introduction

Evolution strategies (ESs) are a class of metaheuristics for optimization by means of (computer) experiments. They belong to the broader class of evolutionary algorithms and, like other heuristics from this class, mimic adaptive processes in biological evolution. The search in ESs is characterized by the alternating application of variation and selection operators. Recombination and mutation operators are the variation operators and create new individuals. The selection operator selects individuals from a population based on their fitness values, which in evolution strategies are obtained by computing the objective function. The selected individuals form the next population, and the process is repeated. A distinguishing feature of evolution strategies as compared to most other evolutionary algorithms is their self-adaptive mutation operators, which adapt the shape of the mutation distribution to the local topology of the landscape and thereby help ESs achieve maximal progress rates.

Before discussing technical details of the evolution strategies, it will be worthwhile to give a brief outline of their history: The idea of mimicking evolution in order to optimize technical systems arose in Germany, where it led to the development of evolution strategies by Ingo Rechenberg [50] and Hans-Paul Schwefel [57], and in the USA, where it led to the development of genetic algorithms [21, 34] and evolutionary programming [20]. In the mid-1990s, these strategies were unified under the common umbrella of evolutionary algorithms (EAs) [5]. While all these heuristics share the idea of mimicking evolution in computational algorithms, researchers in genetic algorithms and evolution strategies emphasized different aspects of algorithm design and problem domains.

Since their invention in the 1960s by Rechenberg and Schwefel at the Technical University of Berlin, evolution strategies have been used to optimize real-world systems, typically in engineering design. The first applications of evolution strategies were the design of optimal shapes in engineering, such as nozzles and wing shapes, using physical experiments. Often the evolutionary design procedure discovered high-performing structures with surprising shapes that had never been considered by engineers before [7]. Starting from sequential stochastic hill climbing strategies, evolution strategies soon advanced to more sophisticated problem-solvers, and their main application domain became the treatment of black-box continuous optimization problems on the basis of computer models.

One important development introduced adaptive step sizes or mutation distributions. Although there were some precursors to the idea of step-size adaptation in stochastic search algorithms [56], the development of flexible and efficient adaptation schemes for mutation distributions became a major focus of ES research. This feature distinguished ESs from genetic algorithms, which commonly worked with constant mutation strengths.

Different variants of the adaptation of the mutation parameters were developed; the three mainstream variants control a single step size by means of the following approaches:

  • The so-called 1/5-th success rule: the rate of generating successful mutations is monitored, and the step size is controlled to achieve a success rate of 1/5, which is optimal on the sphere function.

  • The mutative self-adaptation: it most closely resembles natural evolution, where the step size also undergoes recombination, mutation, and selection [14].

  • The derandomized self-adaptation (Hansen, Ostermeier, and Gawelczyk [30]): it accumulates the standardized steps and compares the length of the cumulative vector to the one expected under random selection.

These efforts to find efficient ways to control the shape of the mutation distribution culminated in the covariance matrix adaptation evolution strategy (CMA-ES) [29]. Due to its invariance properties, it is today regarded by many researchers as the state-of-the-art evolution strategy for practically solving ill-conditioned optimization problems (cf. [24]).

A parallel development was the introduction of population-based (or multi-membered) evolution strategies. Here, the idea is to create an evolutionary algorithm that performs what Hans-Paul Schwefel called collective hill climbing [58] (a collection of search points, where each point performs a simple hill climbing search). To categorize different population models, the notation of (μ, λ)- and (μ + λ)-schemes was introduced, in which μ denotes the number of individuals in the parent population and λ the number of individuals in the offspring population. These multi-membered strategies can exploit the positive effects of recombination (crossover) and are more reliable in global optimization settings and on noisy problems than the early single-membered variants. Moreover, population-based algorithms can be executed in parallel and could later be extended more easily to advanced evolution strategies for solving multi-objective and multimodal optimization tasks.

Nowadays, evolution strategies are mainly used for simulation-based optimization, i.e., for computerized models that require parametric optimization. Evolution strategies are suitable for the optimization of non-smooth functions because they do not require derivatives. Although the algorithmic paradigms of ESs can in general be extended to arbitrary metric search spaces, mainstream variants of ESs address continuous optimization problems, as opposed to genetic algorithms [21], which are more typically used on binary search spaces.

Contemporary evolution strategies have been shown to be competitive with other derivative-free optimization algorithms, both in theoretical studies [13, 65] and in benchmark comparisons on a large corpus of empirical problems [59]. Furthermore, their practical utility is underpinned by a large number of successful applications in engineering and systems optimization [6].

This chapter gives a brief introduction to classical and contemporary evolution strategies, with a focus on mainstream variants. Firstly, classical (μ, λ)- and (μ + λ)-evolution strategies are described in section “Classical Evolution Strategies”. Then, in section “Derandomized Evolution Strategies”, derandomization techniques are discussed, which are the distinguishing feature of the CMA-ES. Section “Theoretical Results” addresses theoretical findings on the convergence and reliability of evolution strategies. An overview of new developments and nonstandard evolution strategies is provided in section “Nonstandard Evolution Strategies”; this section also covers adaptations of ESs that make them more suitable for multimodal optimization. Section “Benchmarks and Empirical Study” discusses empirical benchmarks used in this field and includes a comparative study of contemporary evolution strategies. Section “Conclusions” summarizes the main characteristics of ESs and highlights future research directions.

Classical Evolution Strategies

The typical application field of evolution strategies is (continuous) unconstrained optimization where the optimization problem is given by:

$$\displaystyle \begin{aligned} \min_{\vec{x} \in \mathbb{R}^d} f(\vec{x}) {} \end{aligned} $$
(1)

In the context of ESs, the function f can be a black-box function; usually, f is assumed to be nonlinear and to have a minimum. Maximization problems can be brought into the standard form of Eq. 1 by simply flipping the sign of f. Standard implementations also allow restricting the domain of decision variables to interval domains and introducing constraints. For the sake of brevity, the extensions of ESs for constraint handling will be largely omitted in the following discussion.

Evolution strategies can be viewed as stochastic processes on a population of individuals from the space of individuals \(\mathbb {I}\). An individual in evolution strategies typically comprises the following information:

  • a d-tuple of values for the decision variables \(x_1, \dots, x_d\), representing a candidate solution for the optimization problem (see Eq. 1),

  • a tuple of strategy parameters. The strategy parameters can, for instance, be the standard deviations used to generate the perturbations of variables in the mutation (step sizes) or the components of a covariance matrix used in the mutation. Strategy parameters can be adapted during evolution.

  • a fitness value. It is typically based on the objective function value, and it may be altered by a penalty for constraint violations.

In some variants of evolution strategies, the so-called (μ, κ, λ)-evolution strategies, the individual’s age is also maintained, that is, the number of generations that the individual has survived.

Another basic data structure of an evolution strategy is a population. A population is a multiset of individuals, i.e., of elements of \(\mathbb {I}\). One distinguishes between the parent populations \(P_t\), consisting of μ individuals, and the offspring populations, consisting of λ individuals.

The basic algorithm of a (μ, λ)-evolution strategy and a (μ + λ)-evolution strategy is outlined in Algorithm 1. The algorithm starts with initializing a population \(P_0\) of μ parent individuals, for instance, by uniform random sampling in the feasible intervals for the objective variables \(\vec {x}\). Then, the fitness values of \(P_0\) are determined, and the best solution found in \(P_0\) is identified and stored in the variables \(\vec {x}^{best}_0, f^{best}_0\).

Algorithm 1: Evolution Strategy

Then, the following generational loop is executed until a termination criterion is met. Common termination criteria are stagnation of the search process or exceeding a maximum allowed search budget.

The search process in the generational loop is governed by two (stochastic) variation operators, recombination and mutation, and a deterministic selection operator. The recombination operator, namely, \(\mathtt {Recombine}: \mathbb {I}^\mu \times \varOmega \rightarrow \mathbb {I}^\lambda \), generates from the μ individuals in \(P_t\) an offspring population of λ individuals, which are then mutated by the mutation operator, \(\mathtt {Mutate}: \mathbb {I}^\lambda \times \varOmega \rightarrow \mathbb {I}^\lambda \). The mutated individuals are evaluated, and, if necessary, the best found solution \(\vec {x}^{best}_t, f^{best}_t\) gets updated. Then the parent population of the next round is determined by selecting the μ best solutions from

  • in case of a (μ, λ)-ES, the λ offspring individuals,

  • in case of a (μ + λ)-ES, the μ parents of the current generation and the λ offspring individuals, and

  • in case of a (μ, κ, λ)-ES, the μ parents who have not exceeded an age of κ and the λ offspring individuals.

Finally, the generation counter is increased, and the loop is continued or terminated (if the termination condition is met). After completing the loop, the best attained solution constitutes the output.
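
Since Algorithm 1 is only referenced here, the following minimal Python sketch of its generational loop may clarify the control flow. The operator arguments (recombine, mutate) are placeholders for the instantiations discussed below, and all names are illustrative rather than taken from the original:

```python
import numpy as np

def evolution_strategy(f, mu, lam, d, bounds, recombine, mutate,
                       plus_selection=False, max_evals=10_000):
    """Skeleton of the generational loop of Algorithm 1 (illustrative)."""
    lo, hi = bounds
    # Initialize mu parents uniformly at random in the feasible box.
    parents = [np.random.uniform(lo, hi, d) for _ in range(mu)]
    fitness = [f(x) for x in parents]
    evals = mu
    best_x, best_f = min(zip(parents, fitness), key=lambda p: p[1])
    while evals + lam <= max_evals:
        # Variation: recombination creates each offspring from the parent
        # population; mutation then perturbs it.
        offspring = [mutate(recombine(parents)) for _ in range(lam)]
        off_fit = [f(x) for x in offspring]
        evals += lam
        # Keep track of the best solution found so far.
        for x, fx in zip(offspring, off_fit):
            if fx < best_f:
                best_x, best_f = np.copy(x), fx
        # Selection: comma selection takes the mu best from the offspring
        # only; plus selection takes them from parents and offspring.
        pool = list(zip(offspring, off_fit))
        if plus_selection:
            pool += list(zip(parents, fitness))
        pool.sort(key=lambda p: p[1])
        parents = [x for x, _ in pool[:mu]]
        fitness = [fx for _, fx in pool[:mu]]
    return best_x, best_f
```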

It appears to be a “chicken-and-egg” dilemma whether it makes more sense to start the evolution strategy with the generation of a parent population (with μ individuals), as suggested in Algorithm 1, or at some other stage of the evolution, i.e., with an offspring population. The chosen representation has the advantage that the process generating the subsequent parent populations \(P_0, P_1, P_2, \dots\) can be viewed as a memoryless stochastic process or, more precisely, a Markov process:

$$\displaystyle \begin{aligned} P_{t+1} = \mathtt{Select}\left(\mathtt{Mutate}\left(\mathtt{Recombine}\left(P_t\right)\right), P_t\right). \end{aligned} $$
(2)

This means that given \(P_t\) for some t ≥ 1, the information of \(P_{t-1}\) is irrelevant for determining \(P_{t+1}\); alternatively, in the terminology of stochastic processes, the state of \(P_{t+1}\) is conditionally independent of the state of \(P_{t-1}\) given \(P_t\). This so-called Markov property makes it easier to analyze the behavior of the evolution strategy. In addition, \(P_t\) can be viewed as a checkpoint of the algorithm, and if the process stops, e.g., because of a computer crash, it may resume by starting the loop with the last saved state of \(P_t\).

The main loop of the evolution strategy is inspired by the principles of evolutionary adaptation in nature that were discovered in parallel by the naturalists Alfred Russel Wallace (1823–1913) and Charles Darwin (1809–1882). In brief, a population of individuals adapts to its environment by (random) variation and selection. The reason for variability in the population was unknown to these researchers. Only much later, in the so-called modern synthesis, it was linked to the mutation and recombination of genes. The ES presented in Algorithm 1 does, however, by far not provide a complete model of evolution in nature. In fact, important driving forces of natural evolutionary processes, such as the development of temporally stable species and coevolution, cannot be modeled with this basic evolution strategy. On the other hand, by mimicking only the variation and selection process, one can already obtain a potent and robust optimization heuristic, and the theoretical analysis of evolution strategies can provide new insights into the dynamics of natural evolution.

There are many options to instantiate the operators of an evolution strategy, and in the literature, a certain terminology is used to refer to standard choices. Next, the most common instantiations of operators will be discussed, following the structure of Algorithm 1. The first step is the initialization of \(P_0\), where the starting population is set by the user, since this allows resuming the evolution from a checkpoint. However, it is also very common to view the initialization as an integral part of the evolution strategy. Initialization procedures vary; common choices are either constant initialization, i.e., generating μ copies of a starting (seed) point, or random initialization, i.e., initializing the decision variables randomly within their bounds. The initialization of strategy parameters can have a significant impact on the transient behavior of an evolution strategy. In case of step-size parameters, it is often recommended to set these to 5% of the search space size.

A more complex operator is the recombination operator. In the nomenclature, the number of individuals that participate in the creation of a single vector is called ρ. The notation (μ/ρ, λ)-ES and, respectively, (μ/ρ + λ)-ES makes this number explicit. The ρ individuals that participate in the recombination are drawn by independent uniformly random choices from the population.

Given ρ individuals, there are two common strategies to create an offspring vector – intermediate recombination and dominant recombination. The vector to be determined is commonly the vector of decision variables \(\vec {x}\), but it can also include the vector of strategy parameters:

  • Intermediate recombination determines the offspring by averaging the components of the parent individuals. It can be applied to the object parameters and to the strategy parameters. Given a ρ-tuple of parent vectors \((\vec {q}^{(1)},\dots ,\vec {q}^{(\rho )}) \in (\mathbb {R}^d)^\rho \), it computes the resulting vector \(\vec {r}\) by means of \(r_j = \frac {1}{\rho }\sum _{i=1}^{\rho } q_j^{(i)}\) for j = 1, …, d.

  • Discrete (or dominant) recombination sets the j-th position of the offspring vector \(\vec {r}\) randomly to the corresponding value of one of the parents. By drawing d uniform random numbers \(u_j\), j = 1, …, d, from the set {1, …, ρ}, the offspring individual is set to \(r_j = q_j^{(u_j)}\).

The terminology, intermediate and dominant, is lent from the theory of inheritance of biological traits by the botanist Gregor Mendel (1822–1884).
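
Both recombination variants can be written in a few lines; the sketch below is illustrative and assumes that the ρ participating parents have already been drawn uniformly at random from the population:

```python
import numpy as np

def intermediate_recombination(parents):
    """Average each component over the rho parents: r_j = mean_i q_j^(i)."""
    return np.mean(np.asarray(parents), axis=0)

def dominant_recombination(parents, rng=np.random.default_rng()):
    """For each position j, copy the j-th component of a uniformly
    chosen parent (discrete/dominant recombination)."""
    q = np.asarray(parents)           # shape (rho, d)
    rho, d = q.shape
    u = rng.integers(0, rho, size=d)  # one parent index per coordinate
    return q[u, np.arange(d)]
```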

The mutation operator is seen as the main driving force of evolutionary progress in ESs. Mutation adds a small random perturbation to each component of \(\vec {x}\). The scaling of this perturbation is based on the strategy variables. A common case is to use individual so-called step-size parameters \(\sigma _1 \in \mathbb {R}^+\), …, \(\sigma _d \in \mathbb {R}^+\). The mutation then proceeds as indicated in the following equation:

$$\displaystyle \begin{aligned} x_i^{\prime} \leftarrow x_i + \sigma_i \cdot \mathcal{N}(0,1), \quad i =1, \dots, d\end{aligned} $$
(3)

Here, \(\mathcal{N}(0,1)\) denotes the generation of a standard normally distributed random number. The resulting random perturbations have standard deviation \(\sigma_i\), which is why the σ-variables are also termed the standard deviations of the mutation.

A distinguishing feature of ESs is the self-adaptation of the mutation’s parameters. In case of d step sizes, mutative self-adaptation lets the step sizes themselves undergo an evolutionary process. In continuous optimization, the parameters of the multivariate Gaussian distribution can be adapted. Three levels of adaptation can be devised: Firstly, it is possible to control only a single standard deviation that is used for all decision variables (possibly with a constant scaling factor); this is called isotropic self-adaptation. Secondly, in the so-called individual step-size adaptation, a different standard deviation for the mutation is maintained and adapted for each decision variable. Finally, it is also possible to learn the full covariance matrix of the multivariate Gaussian distribution that is used in the mutation. The different levels of mutation distribution adaptation are indicated in Fig. 1. As a rule of thumb, the more mutation parameters are to be adapted, the longer it takes to reach an optimal convergence behavior for a given model.

Fig. 1 Three levels of step-size control: isotropic (a), individual step sizes (b), and covariance matrix adaptation (c)

There are different strategies for controlling or adapting the mutation parameters:

  • Step-size control based on the success rate: It has been shown for the (1 + 1)-ES on two important benchmark functions – the sum of squares and the corridor model – that, among all isotropic Gaussian distributions, the optimal standard deviation is obtained at a step size that yields a success probability of approximately 1/5. Because the success probability can be assessed during execution, this allows for an effective step-size control of the (1 + 1)-ES (see the sketch after this list).

  • Mutative step-size control: The idea is to make the parameters of the mutation distribution part of the individual and let it undergo an evolutionary process itself. Details of this strategy will be elaborated in this section.

  • Derandomized step-size control: Here a more efficient adaptation of the mutation distribution is derived based on cumulative information from previous successful mutation steps. Derandomized self-adaptation uses arithmetic procedures that can no longer be considered biomimetic. At the price of losing flexibility and simplicity, they gain efficiency, in particular for unconstrained continuous optimization, and allow practicable schemes for adapting a full covariance matrix of a mutation distribution. The history and details of derandomized evolution strategies are elaborated on in the next section.
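
To make the first of these schemes concrete, the following sketch implements a (1 + 1)-ES with a common multiplicative variant of the 1/5-th success rule; the adaptation factor a = 0.85 is one choice found in the literature, and all other names are illustrative:

```python
import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, a=0.85, max_evals=10_000):
    """(1+1)-ES with a multiplicative 1/5-th success rule (one common variant)."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_evals - 1):
        y = x + sigma * np.random.standard_normal(x.shape)  # isotropic mutation
        fy = f(y)
        if fy <= fx:              # success: accept offspring, enlarge step size
            x, fx = y, fy
            sigma /= a
        else:                     # failure: keep parent, shrink step size slowly
            sigma *= a ** 0.25
        # With success probability p, the expected change of log(sigma) is
        # (0.25 - 1.25*p) * log(a), which vanishes exactly at p = 1/5:
        # the step size is stationary at a success rate of one fifth.
    return x, fx
```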

The classical self-adaptive mutation in evolution strategies is nowadays called mutative step-size control. For adapting the standard deviations of the mutation distribution, it augments the individual by a step-size vector and mutates the standard deviations of the mutation before it mutates the decision variables using these standard deviations:

$$\displaystyle \begin{aligned} N_{\mathrm{global}} \quad\leftarrow \quad& \mathcal{N}(0,1) \end{aligned} $$
(4)
$$\displaystyle \begin{aligned} \sigma_i^{\prime} \leftarrow \quad& \sigma_i \cdot \exp(\tau_{\mathrm{local}} \mathcal{N}(0,1) + \tau_{\mathrm{global}} N_{\mathrm{global}}), \quad i =1, \dots, d{} \end{aligned} $$
(5)
$$\displaystyle \begin{aligned} x_i^{\prime} \quad\leftarrow \quad& x_i +\sigma_i^{\prime} \cdot \mathcal{N}(0,1), \quad i =1, \dots, d{} \end{aligned} $$
(6)

The parameters \(\tau_{\mathrm{local}}\) and \(\tau_{\mathrm{global}}\) are called the local and global learning rates. The simplest form of mutative step-size control uses only a single step size σ for all coordinates. In this case, Eqs. (5) and (6) simplify to

$$\displaystyle \begin{aligned} \sigma^{\prime} \quad\leftarrow \quad& \sigma \cdot \exp(\tau_{\mathrm{global}} N_{\mathrm{global}}) \end{aligned} $$
(7)
$$\displaystyle \begin{aligned} x_i^{\prime} \quad\leftarrow \quad & x_i +\sigma'\cdot \mathcal{N}(0,1), \quad i =1, \dots, d \end{aligned} $$
(8)

Mutative self-adaptation is only effective when there are more offspring individuals than parent individuals. In this case, well-adapted step sizes have a higher probability of being selected because they typically lead to better offspring individuals. Two commonly used default settings for parameters in the (μ, λ)-ES are:

  • Small population size: μ = 1, λ = 7, \(\tau_{\mathrm{global}} = 1.2\), single step size, ρ = 1.

  • Large population size: μ = 15, λ = 100, \(\tau_{\mathrm{local}} = 1.1\), \(\tau_{\mathrm{global}} = 1.2\), ρ = μ, and intermediate recombination of decision variables and step-size parameters.

A larger population size is preferable if the optimization takes place on a rugged fitness landscape and a larger number of evaluations can be afforded.
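
The mutation with individual step sizes of Eqs. (4)–(6) can be sketched as follows; the learning rates are passed in as parameters (e.g., the default settings listed above), and all names are illustrative:

```python
import numpy as np

def self_adaptive_mutation(x, sigmas, tau_local, tau_global,
                           rng=np.random.default_rng()):
    """Mutative self-adaptation (Eqs. 4-6): mutate the step sizes first,
    then perturb the decision variables with the mutated step sizes."""
    d = x.shape[0]
    n_global = rng.standard_normal()        # one draw shared by all coordinates
    new_sigmas = sigmas * np.exp(tau_local * rng.standard_normal(d)
                                 + tau_global * n_global)
    new_x = x + new_sigmas * rng.standard_normal(d)
    return new_x, new_sigmas
```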

Derandomized Evolution Strategies

Mutative step-size control tends to work well in the standard ES for the adaptation of a single global step size but shows poor performance when it comes to individual step sizes or arbitrary normal mutations. Schwefel claimed that the adaptation of the strategy parameters in those cases is impossible within small populations [58] and suggested larger populations as a solution to the problem. Later on, Rudolph questioned the effectiveness of ES learning upon discarding past information and claimed that ES learning would benefit from introducing memory into the individuals [52]. Indeed, due to the crucial role that the mutation operator plays within ESs, mutative step-size control was investigated intensively. In particular, the disruptive effects to which mutative step-size control is subject were studied at several levels [29, 47] and are reviewed here:

  • Indirect selection. By definition, the goal of the mutation operator is to apply a stochastic variation to the decision variables that will increase the individual’s selection probability. The selection of the strategy parameter setting is indirect, i.e., it is not the vector of a successful mutation that is utilized to adapt the step-size parameters, but rather the parameters of the distribution that led to this mutation vector.

  • Realization of parameter variation. Due to the sampling from a random distribution, the realization of the parameter variation does not necessarily reflect the nature of the strategy parameters. Thus, the de facto difference between good and bad strategy parameter settings is only reflected in the difference between their probabilities of being selected – which can be rather small. Essentially, this means that the selection process of the strategy parameters is strongly disturbed.

  • The strategy parameter change rate is defined as the difference between strategy parameters of two successive generations. Hansen and Ostermeier [29] argue that the change rate is an important factor, as it indicates the adaptation speed and thus has a direct influence on the performance of the algorithm. The principal claim is that this change rate basically vanishes in the standard ES. The change rate depends on the mutation strength to which the strategy parameters are subject. Although a maximal change rate is desirable, the change rate is limited by an upper bound, due to the finite selection information that can be transferred between generations. Change rates that exceed this upper bound would lead to stochastic behavior. Moreover, the mutation strength that obtains the optimal change rate is typically smaller than the one that obtains good diversity among the mutants – a desired outcome of the mutation operator, often referred to as the selection difference. Thus, the conflict between the objective of an optimal change rate and the objective of an optimal selection difference cannot be resolved at the mutation strength level [47]. A possible solution to this conflict would be to detach the change rate from the mutation strength.

The so-called derandomized mutative step-size control aims to treat these disruptive effects, regardless of the search space dimensionality, the population size, or any other characteristic parameters.

The concept of derandomized evolution strategies was originally introduced by scholars at the Technical University of Berlin in the beginning of the 1990s. It was followed by the release of a new generation of successful ES variants by Hansen, Ostermeier, and Gawelczyk [28, 30, 46, 48].

The first versions of derandomized ES algorithms introduced a controlled global step size in order to monitor the individual step sizes by decreasing the stochastic effects of the probabilistic sampling. Later versions removed the selection disturbance completely by omitting the adaptation of strategy parameters by means of probabilistic sampling. This was combined with individual information from the last generation (the successful mutations, i.e., those of selected offspring) and then adjusted to correlated mutations. Later on, the concept of adaptation by accumulated information was introduced, aiming to use past information wisely for the purpose of step-size adaptation. Rather than using the last generation’s information alone, it was successfully generalized to a weighted average of the previous generations.

Note that the different derandomized ES variants strictly follow a \(\left (1,\lambda \right )\) strategy, postponing the treatment of recombination or plus-strategies to later stages. Moreover, the different variants hold different numbers of strategy parameters to be adapted, which is an important factor in the complexity of the optimization routine and in its learning rate. The different algorithms hold a number of strategy parameters scaling either linearly (\(\mathcal {O}(d)\) parameters responsible for individual step sizes) or quadratically (\(\mathcal {O}(d^2)\) parameters responsible for arbitrary normal mutations) with the dimensionality d of the search space.

First Level of Derandomization

The so-called first level of derandomization targeted the following desired effects: (i) a degree of freedom with respect to the mutation strength of the strategy parameters, (ii) scalability of the ratio between the change rate and the mutation strength, and (iii) independence of the adaptation mechanism from the population size. The realization of the first level of derandomization can be reviewed through three particular derandomized ES variants:

DR1

The first derandomized attempt [46] coupled the successful mutations to the selection of decision parameters and learned the mutation step size as well as the scaling vector based upon the successful variation. The mutation step for the kth individual, k = 1, …, λ, reads:

$$\displaystyle \begin{aligned} \vec{x}^{(g+1)} = \vec{x}^{(g)} + \xi_k\delta^{(g)}\vec{\xi}^k_{\mathrm{scal}}\vec{\delta}^{(g)}_{\mathrm{scal}}\vec{z}_k ~~~~~~~~~~~ \vec{z}_k\in\left\{-1,+1\right\}^d \end{aligned} $$
(9)

Note that \(\vec {z}_k\) is a random vector with ± 1 entries, rather than a normally distributed random vector, while \(\vec {\xi }_{\mathrm {scal}}^k \sim \vec {\mathcal {N}}\left (0,1\right )^+\), i.e., distributed over the positive part of the normal distribution. The evaluation and selection are followed by the adaptation of the strategy parameters (the subscript sel refers to the selected individual):

$$\displaystyle \begin{aligned} \delta^{(g+1)}= \delta^{(g)} \cdot \left(\xi_{\mathrm{sel}}\right)^{\beta} \end{aligned} $$
(10)
$$\displaystyle \begin{aligned} \vec{\delta}^{(g+1)}_{\mathrm{scal}}=\vec{\delta}^{(g)}_{\mathrm{scal}} \cdot \left(\vec{\xi}^{\mathrm{sel}}_{\mathrm{scal}}+b\right)^{\beta_{\mathrm{scal}}} \end{aligned} $$
(11)

DR2

The second derandomized ES variant [48] aimed to accumulate information about the correlation or anticorrelation of past mutation vectors in order to adapt the global step size as well as the individual step sizes, by introducing a quasi-memory vector. This accumulated information allowed omitting the stochastic element in the adaptation of the strategy parameters – updating them only by means of successful variations rather than with random steps.

The mutation step for the kth individual, k = 1, …, λ, reads

$$\displaystyle \begin{aligned} \vec{x}^{(g+1)} = \vec{x}^{(g)} + \delta^{(g)}\vec{\delta}^{(g)}_{\mathrm{scal}}\vec{z}_k~~~~~~~~~~~ \vec{z}_k \sim \vec{\mathcal{N}}\left(0,1\right) \end{aligned} $$
(12)

Introducing a quasi-memory vector \(\vec {Z}\):

$$\displaystyle \begin{aligned} \vec{Z}^{(g)}=c\vec{z}_{\mathrm{sel}} + \left(1-c\right)\vec{Z}^{(g-1)} \end{aligned} $$
(13)

The adaptation of the strategy parameters according to the selected offspring:

$$\displaystyle \begin{aligned} \delta^{(g+1)}=\delta^{(g)}\cdot\left(\exp\left(\frac{\|\vec{Z}^{(g)}\|}{\sqrt{d}\sqrt{\frac{c}{2-c}}}-1+\frac{1}{5d}\right)\right)^{\beta} \end{aligned} $$
(14)
$$\displaystyle \begin{aligned} \vec{\delta}^{(g+1)}_{\mathrm{scal}}=\vec{\delta}^{(g)}_{\mathrm{scal}}\cdot\left(\frac{\left|\vec{Z}^{(g)}\right|}{\sqrt{\frac{c}{2-c}}} +b \right)^{\beta_{\mathrm{scal}}},~~~~~ \left|\vec{Z}^{(g)}\right|=\left(|Z^{(g)}_1|,|Z^{(g)}_2|,\ldots ,|Z^{(g)}_d| \right) \end{aligned} $$
(15)
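
The following sketch puts Eqs. (12)–(15) together for one generation of a (1, λ) strategy; the parameters β, β_scal, b, and c are passed in as arguments, since the originally recommended settings are not reproduced here, and all names are illustrative:

```python
import numpy as np

def dr2_generation(f, x, delta, delta_scal, Z, lam,
                   beta, beta_scal, b, c, rng=np.random.default_rng()):
    """One generation of the DR2 scheme (Eqs. 12-15) for a (1, lambda)-ES."""
    d = x.shape[0]
    # Sample lambda offspring (Eq. 12) and remember their mutation vectors.
    zs = [rng.standard_normal(d) for _ in range(lam)]
    offspring = [x + delta * delta_scal * z for z in zs]
    k = int(np.argmin([f(y) for y in offspring]))   # select the best offspring
    # Accumulate the selected mutation vector into the quasi-memory
    # vector Z (Eq. 13).
    Z = c * zs[k] + (1 - c) * Z
    norm = np.sqrt(c / (2 - c))                     # normalization constant
    # Adapt the global and the individual step sizes (Eqs. 14-15).
    delta *= np.exp(np.linalg.norm(Z) / (np.sqrt(d) * norm)
                    - 1 + 1 / (5 * d)) ** beta
    delta_scal *= (np.abs(Z) / norm + b) ** beta_scal
    return offspring[k], delta, delta_scal, Z
```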

DR3

This third variant [30], usually referred to as the generation set adaptation (GSA), considered the derandomization of arbitrary normal mutations for the first time, aiming to achieve invariance with respect to the scaling of variables and the rotation of the coordinate system. This naturally came at the cost of a quasi-memory matrix, \(\mathbf {B}\in \mathbb {R}^{r\times d}\), setting the dimension of the strategy parameter space to \(d^2 \leq r \cdot d \leq 2d^2\). The adaptation of the global step size is mutative with stochastic variations, just like in DR1.

The mutation step is formulated for the kth individual, k = 1, …, λ:

$$\displaystyle \begin{aligned} \vec{x}^{(g+1)} = \vec{x}^{(g)} + \delta^{(g)}\xi_k\vec{y}_k \end{aligned} $$
(16)
$$\displaystyle \begin{aligned} \vec{y}_k=c_m{\mathbf{B}}^{(g)}\cdot\vec{z}_k ~~~~~~~~~~~ \vec{z}_k \sim \vec{\mathcal{N}}\left(0,1\right) \end{aligned} $$
(17)

The update of the memory matrix is formulated as

$$\displaystyle \begin{aligned} \begin{array}{l} {\mathbf{B}}^{(g)}=\left(\vec{b}_1^{(g)},\ldots,\vec{b}_r^{(g)}\right)\\ {}\vec{b}^{(g+1)}_1=\left(1-c\right)\cdot\vec{b}^{(g)}_1+c\cdot\left(c_u\xi_{\mathrm{sel}}\vec{y}_{\mathrm{sel}}\right),~~~~ \vec{b}^{(g+1)}_{i+1}=\vec{b}^{(g)}_i \end{array} \end{aligned} $$
(18)

The step size is updated as follows:

$$\displaystyle \begin{aligned} \delta^{(g+1)}=\delta^{(g)}\left(\xi_{\mathrm{sel}}\right)^{\beta} \end{aligned} $$
(19)

Second Level of Derandomization: CMA-ES

Following a series of successful derandomized ES variants addressing the first level of derandomization, and a continued effort at the Technical University of Berlin, the so-called covariance matrix adaptation (CMA) evolution strategy was released in 1996 [28] as a completely derandomized evolution strategy – the fourth generation of derandomized ES variants. The so-called second level of derandomization targeted the following effects: (i) the probability of regenerating the same mutation step is increased, (ii) the change rate of the strategy parameters is subject to explicit control, and (iii) strategy parameters are stationary when subject to random selection. The second level of derandomization was implemented by means of the CMA-ES.

The CMA-ES combines the robust mechanism of ES with powerful statistical learning principles, and thus it is sometimes subject to informal criticism for not being a genuine biomimetic evolution strategy. In short, it aims at satisfying the maximum likelihood principle by applying principal component analysis (PCA) [35] to the successful mutations, and it uses cumulative global step-size adaptation.

In the notation used here, the vector \(\vec {m}\) represents the mean of the mutation distribution, but is also associated with the current solution-point, σ denotes the global step size, and the covariance matrix C determines the shape of the distribution ellipsoid:

$$\displaystyle \begin{aligned} \vec{x}^{\mathrm{NEW}} \sim \mathcal{N}(\vec{m},\sigma^2 \mathbf{C}) = \vec{m} + \sigma \cdot \mathcal{N}(\vec{0},\mathbf{C})=\vec{m} + \sigma \cdot \vec{z} \end{aligned}$$

Two independent principles define the adaptation of the covariance matrix, C, versus the adaptation of the global step size σ:

  • The mean \(\vec {m}\) and the covariance matrix C of the normal distribution are updated according to the maximum likelihood principle, such that good mutations are likely to appear again. \(\vec {m}\) is updated such that

    $$\displaystyle \begin{aligned} \mathcal{P}\Big(\vec{x}_{\mathrm{sel}}|\mathcal{N}\Big(\vec{m},\sigma^2\mathbf{C}\Big)\Big) \longrightarrow \max \end{aligned}$$

    and C is updated such that

    $$\displaystyle \begin{aligned} \mathcal{P}\Big(\frac{\vec{x}_{\mathrm{sel}}-\vec{m}_{\mathrm{old}}}{\sigma}\Big|\mathcal{N}\Big(\vec{0},\mathbf{C}\Big)\Big) \longrightarrow \max \end{aligned}$$

    considering the prior C. This is implemented through the so-called covariance matrix adaptation (CMA) mechanism.

  • σ is updated such that consecutive steps of \(\vec {m}\) are conjugate perpendicular to each other. This is implemented through the so-called cumulative step-size adaptation (CSA) mechanism.

Evolution Path

A straightforward way to update the covariance matrix would be to construct a d × d matrix analogous to the DR2 mechanism (see Eq. 13), using the outer product of the selected mutation vector \(\vec {z}_{\mathrm {sel}}\):

$$\displaystyle \begin{aligned} \mathbf{C} \longleftarrow (1-c_{\mathrm{cov}})\mathbf{C} + c_{\mathrm{cov}}\vec{z}_{\mathrm{sel}}\vec{z}^T_{\mathrm{sel}} \end{aligned}$$

However, to avoid discarding the sign information of \(\vec {z}_{\mathrm {sel}}\) (the outer product \(\vec{z}\vec{z}^T\) is invariant to the sign of \(\vec{z}\)), the so-called evolution path is defined to accumulate the past information using an exponentially weighted moving average,

$$\displaystyle \begin{aligned} \vec{p}_c \propto \sum_{i=0}^g (1-c_c)^{g-i} \vec{z}_{\mathrm{sel}}^{(i)}, \end{aligned}$$

yielding the following update step for the covariance matrix:

$$\displaystyle \begin{aligned} \mathbf{C} \longleftarrow (1-c_{\mathrm{cov}})\mathbf{C} + c_{\mathrm{cov}}\vec{p}_{c}\vec{p}^T_c \end{aligned}$$

The Path Length Control

The covariance matrix update is not likely to simultaneously increase the variance in all directions, and thus a global step-size control is much needed to operate in parallel. The basic idea of the so-called path length control is to measure the length of the evolution path, which accumulates the consecutive steps of \(\vec {m}\), and adapt the step size according to the following rationale: If the evolution path is longer than expected, the steps are likely parallel, and thus the step size should be increased; if it is shorter than expected, the steps are probably antiparallel, and the step size should be decreased accordingly. The reference magnitude is the expected length of a normally distributed random vector. This evaluation is explicitly carried out by the conjugate evolution path:

$$\displaystyle \begin{aligned} \vec{p}_{\sigma} \propto \sum_{i=0}^g (1-c_{\sigma})^{g-i} {\mathbf{C}}^{(i)~-\frac{1}{2}}~\vec{z}_{\mathrm{sel}}^{(i)}\end{aligned} $$

where the eigen-decomposition of C is required in order to align all directions within the rotated frame. Then, the update of the step size depends on the comparison between \(\|\vec {p}_{\sigma }\|\) and the expected length of a normally distributed random vector, \(E\left [\|\mathcal {N}\left (0,\mathbf {I}\right )\|\right ]\):

$$\displaystyle \begin{aligned} \sigma \longleftarrow \sigma \cdot \exp\left(\frac{\|\vec{p}_{\sigma}\|} {E\left[\|\mathcal{N}\left(0,\mathbf{I}\right)\|\right]}-1\right)\end{aligned} $$

The (μ_W, λ) Rank-μ CMA

The rank-μ covariance matrix adaptation [26] is an extension of the original update rule for larger population sizes. The idea is to use μ > 1 vectors in order to update the covariance matrix C in each generation, based on weighted intermediate recombination. Let \(\vec {x}_{i:\lambda }\) denote the ith ranked solution point, such that

$$\displaystyle \begin{aligned} f\left(\vec{x}_{1:\lambda} \right) \leq f\left(\vec{x}_{2:\lambda} \right) \leq \cdots \leq f\left(\vec{x}_{\lambda:\lambda}\right) \end{aligned}$$

The updated mean is now defined as follows:

$$\displaystyle \begin{aligned} \vec{m} \leftarrow \sum _{i=1}^{\mu} w_i \vec{x}_{i:\lambda} = \vec{m}+\sigma \sum _{i=1}^{\mu} w_i \vec{z}_{i:\lambda} \equiv \langle\vec{x}\rangle_W \end{aligned}$$

with a set of weights, \(w_1 \geq w_2 \geq \cdots \geq w_{\mu } > 0,~\sum _{i=1}^{\mu }w_i =1\). The covariance matrix update is now formalized by means of rank-μ update, combined with the rank-one update:

$$\displaystyle \begin{aligned} \mathbf{C} \longleftarrow (1-c_{\mathrm{cov}})\mathbf{C} + \frac{c_{\mathrm{cov}}}{\mu_{\mathrm{cov}}}\vec{p}_{c}\vec{p}^T_c + c_{\mathrm{cov}}\left(1-\frac{1}{\mu_{\mathrm{cov}}} \right)\sum_{i=1}^{\mu} w_i \vec{z}_{i:\lambda}\vec{z}^T_{i:\lambda} \end{aligned}$$

The (μ_W, λ)-CMA-ES heuristic is summarized in Algorithm 2, with initCMA() referring to the parametric initialization procedure. It should be noted that a CMA variant which resembles the DR2 and targets a vector of d individual step sizes (i.e., a diagonal covariance matrix) was released under the name sep-CMA-ES [51]. Furthermore, the CMA-ES heuristic was simplified in the form of the so-called CMSA strategy [15] and was further improved for certain cases of global optimization [1].

Algorithm 2: (μ_W, λ)-CMA-ES
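
To complement the equations above, the following is a condensed, self-contained sketch of the (μ_W, λ)-CMA-ES core loop. The learning rates follow commonly published default formulas, but the step-size damping is simplified (a fixed factor c_σ/2 instead of the usual damping parameter) and the stall heuristic for the evolution path is omitted, so this is an illustration rather than a reference implementation; see [23, 28] for the recommended settings:

```python
import numpy as np

def cma_es(f, m, sigma, max_evals=10_000):
    """Condensed (mu_W, lambda)-CMA-ES sketch: rank-one + rank-mu covariance
    adaptation (CMA) and cumulative step-size adaptation (CSA)."""
    m = np.asarray(m, dtype=float)
    d = len(m)
    lam = 4 + int(3 * np.log(d))                    # default population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                    # positive, decreasing weights
    mu_eff = 1.0 / np.sum(w ** 2)
    c_sigma = (mu_eff + 2) / (d + mu_eff + 5)       # CSA learning rate
    c_c = (4 + mu_eff / d) / (d + 4 + 2 * mu_eff / d)
    c_1 = 2 / ((d + 1.3) ** 2 + mu_eff)             # rank-one learning rate
    c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((d + 2) ** 2 + mu_eff))
    chi_d = np.sqrt(d) * (1 - 1 / (4 * d) + 1 / (21 * d ** 2))  # E||N(0,I)||
    C, p_sigma, p_c, evals = np.eye(d), np.zeros(d), np.zeros(d), 0
    while evals + lam <= max_evals:
        # Eigendecomposition of C gives C^(1/2) and C^(-1/2).
        eigvals, B = np.linalg.eigh(C)
        D = np.sqrt(np.maximum(eigvals, 1e-20))
        zs = np.random.standard_normal((lam, d))
        ys = zs @ np.diag(D) @ B.T                  # y ~ N(0, C)
        xs = m + sigma * ys
        evals += lam
        order = np.argsort([f(x) for x in xs])
        y_w = w @ ys[order[:mu]]                    # weighted recombination
        m = m + sigma * y_w
        # CSA: accumulate the conjugate evolution path and adapt sigma.
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff)
                   * (B @ np.diag(1 / D) @ B.T @ y_w))
        sigma *= np.exp((c_sigma / 2) * (np.linalg.norm(p_sigma) / chi_d - 1))
        # CMA: rank-one update via the evolution path plus rank-mu update.
        p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * y_w
        rank_mu = sum(wi * np.outer(y, y) for wi, y in zip(w, ys[order[:mu]]))
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * rank_mu
    return m

# Example: minimize a 10-dimensional sphere function.
# x_opt = cma_es(lambda x: float(np.dot(x, x)), m=np.ones(10), sigma=0.5)
```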

Theoretical Results

The theory of evolution strategies traditionally focuses on questions of convergence and convergence dynamics.

Under relatively mild conditions on the mutation and recombination operators used, complete global convergence in probability for t → ∞ can be proven [54]. Basically, for continuous objective functions, it is sufficient to ascertain that, for every 𝜖-ball around the global minimizer, the probability that the mutation operator samples a point in this region is positive, regardless of the starting point. This is ensured, for instance, by bounding the standard deviations of the mutation from below by a small positive value. The result can easily be generalized to ESs for discrete [53] or even mixed-integer optimization problems [43].

Of more practical relevance are results on the convergence dynamics, that is, the speed of convergence to the optimum, and on local progress rates. Different approaches for analysis have been used, based on dynamical systems theory [13], stochastic process theory [54], and techniques from the asymptotic analysis of randomized algorithms [33]. It is a well-established result that most self-adaptive ES variants, if parameterized correctly, achieve a linear convergence rate on convex quadratic problems, where the condition number of the matrix and the type of step-size adaptation determine the linear factor [14, 45]. The same holds for problems with fitness-proportional noise [2]. On the other hand, on sharp ridges and plateaus and at the boundary of constraints, classical step-size adaptation tends to fail [38]. Also, for problems with additive noise, the accuracy of the found results is limited by the standard deviation of the noise [39]. The theory of ESs also revealed insights regarding the manner in which different parameters are correlated with each other and produced guidelines concerning optimal parameter settings. Most prominently, the 1/5-th success rule for step-size control in the (1 + 1)-ES with isotropic mutation was developed based on theoretical studies on the sphere and the corridor model [50].

Results on the effects of genetic drift and recombination in population-based evolution strategies are also available: Beyer highlighted the so-called genetic repair effect that recombination has in ESs. When using recombination, convergent behavior and optimal convergence rates can be achieved with higher mutation step sizes. This increases the robustness in global optimization settings, as it becomes more likely to escape from local optima. The genetic repair effect is stronger when intermediate recombination is used, as compared to discrete recombination. Dynamical systems analysis and Markov chain analysis on simple search landscapes revealed that it is hardly possible to simultaneously explore different local optima by means of a single population [11]. Even if the recombination operator is disabled, when different attractor basins share exactly the same geometry, the population tends to quickly concentrate on a single attractor only [55]. These findings gave incentive to the development of niching [60] and restart methods [4] that counteract this effect and have proven to perform better on multimodal landscapes.

Nonstandard Evolution Strategies

The broad success of the family of evolution strategies provided the motivation to devise extended heuristics for treating problem instances beyond the canonical unconstrained, single-objective, unimodal optimization formulation. Indeed, ES extensions to mixed-integer search spaces [8, 42], uncertainty handling [27], multimodal domains [60], and multi-objective Pareto optimization [31] have been introduced in recent years. The goal of the current section is to provide an overview of these extensions.

ES for Nonstandard Search Spaces

A general framework for ESs on nonstandard search spaces, termed metric-based evolutionary algorithms, was developed by Droste and Wiesmann [17]. They specified guidelines for instantiating mutation and recombination operators and exemplified the design method for the optimization of ordinary binary decision diagrams. Using similar guidelines, ESs for integer programming [53], mixed-integer programming [43], and graph-based optimization [18] were developed. Common guidelines on designing ESs for new types of solution spaces are:

  1. Causal representation: Solutions should be represented in some metric search space such that relatively small changes with respect to the distance in the search space result on average in only relatively small changes of the objective function value.

  2. Unimodal mutation distributions: Small mutations should occur more likely than large mutations.

  3. Scalability of mutation: In order to implement self-adaptive mutation operators, it is essential that mutations can be scaled in terms of the average distance between parents and offspring.

  4. Accessibility of points by mutation: By applying one or a chain of many mutations, it should be possible to reach every point in the search space, regardless of the starting point.

  5. Unbiasedness of mutation: Mutations should not introduce a bias into the search. They should be symmetric, and probability distributions with maximal entropy should be preferred.

  6. Similarity to parents in recombination: The distance of an offspring to its parents should not exceed the distance of the parents to each other. Moreover, on average, the distance to all parents should be the same.

Following these design principles, one can expect generalized ESs to possess properties similar to those of standard ESs for continuous search spaces. However, a note of caution is in order regarding the functioning of step-size adaptation. Theoretical derivations of optimal adaptation schemes for ESs often use the fact that differentiable problems locally resemble quadratic or linear functions. This property is lost in discrete optimization, and therefore the generalization of such results requires some caution. On the other hand, it has been found that mutative self-adaptation of step sizes also works in nonstandard search spaces, such as for the adaptation of mutation probabilities for binary vectors or of parameters of geometric distributions in mixed-integer evolution strategies.
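
As an illustration of such a nonstandard mutation operator, the following hypothetical sketch applies mutative self-adaptation to the bit-flip probability of a binary vector; the logistic transformation and the clipping bounds are assumptions chosen to keep the probability feasible, in the spirit of the binary self-adaptation schemes mentioned above:

```python
import numpy as np

def mutate_binary(bits, p, tau=0.2, p_min=1e-3, p_max=0.5,
                  rng=np.random.default_rng()):
    """Self-adaptive bit-flip mutation (illustrative): mutate the flip
    probability p first, then flip each bit independently with the new p."""
    # Mutate p multiplicatively on the log-odds scale (analogue of Eq. 5).
    logit = np.log(p / (1 - p)) + tau * rng.standard_normal()
    new_p = np.clip(1 / (1 + np.exp(-logit)), p_min, p_max)
    # Small mutations (few flipped bits) are more likely than large ones,
    # and every point remains reachable by a chain of mutations.
    flips = rng.random(bits.shape[0]) < new_p
    return np.where(flips, 1 - bits, bits), new_p
```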

Niching and Multi-population ES

Given multimodal search landscapes with multiple basins of attraction that are of interest, targeting the simultaneous identification of several optima constitutes a challenge at both the theoretical and practical levels [44, 49, 60]. Within the domain of evolutionary computation, this challenge is typically treated by extending a given search heuristic to subpopulations of trial solutions that evolve in parallel toward different solutions of the problem. This idea stems from the evolutionary concept of organic speciation, and the so-called niching techniques extend EAs to speciation, forming multiple subpopulations. The computational challenge in niching may be formulated as achieving an effective interplay between partitioning the search space into niches occupied by stable subpopulations, by means of population diversity preservation, and exploiting the search in each niche by means of a highly efficient optimizer with local search capabilities [60]. A niching framework utilizing derandomized ESs was introduced in [60], proposing the CMA-ES as a niching optimizer for the first time. The underpinning of that framework was the selection of a peak individual per subpopulation in each generation, followed by sampling around it according to derandomized ES principles to produce the next dispersion of search points. The biological analogy of this machinery is an alpha male winning all imposed competitions, thereafter dominating its ecological niche and obtaining all sexual resources therein to generate its offspring.

A common utility for defining the landscape subdomain of each subpopulation is a so-called niche radius. A radius-based framework for niching, which employs derandomized ES heuristics, has been formulated and investigated [61] and has shown broad success in tackling both synthetic and real-world multimodal optimization problems. In practice, this framework holds multiple derandomized ES populations, which conduct heuristic search in their radius-defined subdomains and independently update their mutation distributions and step sizes. Since the partitioning is enforced in each generation according to the niche radius parameter, and since no a priori knowledge is available on the global structure of the search landscape and the spatial distribution of its basins of attraction, an adaptive niche radius approach was devised to remedy this so-called niche radius presumption [62]. The main idea of ES niching with self-adaptive niche shapes is to exploit learned landscape information, as reflected by the evolving mutation distribution, to define the niches more accurately. In particular, a Mahalanobis CMA-ES niching heuristic was formulated, which carries out the distance calculations among the individuals based upon the Mahalanobis distance metric, utilizing the evolving covariance matrices of the CMA mechanism. This heuristic achieved successful niching on landscapes with unevenly shaped optima, on which the fixed-radius approaches performed poorly [62].

On a related note, a multi-restart approach with increasing population size was developed for the CMA algorithm, namely, IPOP-CMA-ES [4]. This heuristic aims at attaining the global optimum, possibly visiting local optima along the way and restarting the algorithm with a larger population size and a modified initial step size. It is thus not defined as a niching technique.

Noise Handling and Robust Optimization with ES

Robust optimization is concerned with identifying solutions that also perform well when the input parameters and/or the objective function values are slightly perturbed on a systematic basis [10]. Most variants of ESs are inherently suited to deal with noisy environments, and it was shown that a larger population size and the use of recombination are beneficial in noisy settings [2, 12]. Various techniques have been shown to further improve performance on noisy objective functions. One technique is the so-called thresholding operator, which counts search points as improvements only when they introduce objective function improvements that exceed a certain threshold [9]. Moreover, it has been suggested to use the sample mean of multiple evaluations of the same individual (effective fitness) for evaluation. This increases the computation time, and therefore more efficient sampling schemes have been developed subsequently. Furthermore, a rank stability scheme was suggested to treat noise, exploiting the fact that ESs require a correct ranking rather than correct objective function values [27].
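
A minimal sketch of the two simplest of these noise-handling techniques, under the assumption that the noisy objective can simply be re-evaluated, might look as follows (names are illustrative):

```python
import numpy as np

def effective_fitness(f_noisy, x, n_samples=10):
    """Explicit averaging: estimate the fitness as the sample mean of
    repeated evaluations; the noise standard deviation shrinks by sqrt(n),
    at the cost of n times as many function evaluations."""
    return np.mean([f_noisy(x) for _ in range(n_samples)])

def threshold_accept(f_parent, f_offspring, threshold):
    """Thresholding: count an offspring as an improvement only if it is
    better than the parent by more than the given threshold (minimization)."""
    return f_offspring < f_parent - threshold
```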

Knowledge of second-order Hessian information at the optimum is desirable not only as a measure of system robustness to noise in the decision variables but also as a means for dimensionality reduction and for landscape characterization. Experimental optimization of quantum systems motivated the development of an automated method to efficiently retrieve the Hessian matrix about the global optimum from experimental measurements without derivative evaluations [63]. The study designed a heuristic to learn the Hessian matrix based upon the CMA-ES machinery, with necessary modifications, by exploiting an inherent relation between the covariance matrix and the inverse Hessian matrix. It then corroborated this newly proposed technique, entitled forced optimal covariance adaptive learning (FOCAL), on noisy simulation-based optimization as well as on laboratory experimental quantum systems. The formal relation between the covariance matrix and the Hessian matrix is generally unknown but has been a subject of active research. A recent study rigorously showed that the accumulation of selected individuals carries the potential to reveal valuable information about the search landscape [64], e.g., as already practically utilized by derandomized ES variants. This theoretical study proved that a statistically constructed covariance matrix over selected decision vectors in the proximity of the optimum shares the same eigenvectors with the Hessian matrix about the optimum. It also provided an analytic approximation of this covariance matrix for a non-elitist multi-child (1, λ) strategy, holding for a large population size λ.

For a comprehensive overview of contemporary noise handling and robust optimization in ESs, the reader is referred to the PhD dissertation of Kruisselbrink [39], which is complemented by an empirical study of the most common variants. Theoretical limits on the precision of multi-evaluation schemes in the presence of additive noise are also derived therein.

Multi-criterion and Constraint-Handling ES

In practical settings, the scenario of unconstrained optimization is not very common. Rather, problems with multiple constraint functions and conflicting objective functions need to be solved.

In optimization with constraints, it is mandatory to use alternative schemes for step-size adaptation, for reasons that are explained in detail by Kramer and Schwefel [38]. They also suggest alternative schemes that can better deal with constraints.

Adaptations of ESs are also required for optimization with multiple objective functions. In this case, it is common to search for a set of non-dominated solutions, the so-called Pareto set. A first proposal for using evolution strategies to approximate a Pareto front was made by Kursawe [40], long before the nowadays flourishing research field of evolutionary multi-criterion optimization was established. Today, three variants of the evolution strategy are used for optimization with multiple objectives:

  • Pareto archived evolution strategy [37]: This classical multi-criterion optimization strategy uses an archive to maintain non-dominated points. The archive is updated based on the non-dominance and density of points.

  • Predator-prey evolution strategy [22, 41]: In this biomimetic strategy, individuals are distributed on a grid, and a population of predator individuals performs a random walk on the grid and triggers local selection. The predators select their prey based on different objective functions or combination strategies.

  • Multi-objective CMA-ES [31]: This strategy seeks to improve the contributions of individuals to the hypervolume indicator, which measures the size of the Pareto-dominated space. Consequently, this strategy is well suited to locating precise and regular representations of Pareto fronts for objective functions with complex shapes and correlated input variables. Another self-adaptation method, using local tournaments on hypervolume contributions, was suggested in [36] but has so far received little attention.

Mirrored Sampling

The mirrored sampling technique is a derandomized mutation method. It was first introduced in [16] for the non-elitist (1, λ)-ES and then extended to the (μ, λ)-ES [3]. The idea of mirrored sampling is to generate part (normally half) of the offspring population in a derandomized way. More specifically, a single mutation vector z is utilized to generate two offspring (rather than one, as in the standard ES) – one by adding z to the parent x, giving x + z, and another by subtracting z from x, giving x − z. The two offspring are symmetric, or mirrored, with respect to the parental point. Mirrored sampling helps accelerate the convergence of evolution strategies, which is theoretically proven in [16].

When applied in the (μ, λ)-CMA-ES with cumulative step-size adaptation, mirrored sampling reduces the variance of the recombined mutation steps. Consequently, the step size is reduced more than desired, and premature convergence can occur. To solve this, the concept of pairwise selection was introduced [3], in which only the better offspring of each mirrored pair may contribute to the weighted recombination. This ensures that recombination never uses both elements of a mirrored pair at the same time.
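
The following sketch illustrates mirrored sampling with pairwise selection for a single parent; all names are illustrative:

```python
import numpy as np

def mirrored_offspring(f, x, sigma, n_pairs, rng=np.random.default_rng()):
    """Generate offspring in mirrored pairs x + sigma*z and x - sigma*z and,
    following pairwise selection, keep only the better point of each pair."""
    survivors = []
    for _ in range(n_pairs):
        z = rng.standard_normal(x.shape[0])
        plus, minus = x + sigma * z, x - sigma * z
        # Pairwise selection: at most one element of a mirrored pair may
        # later take part in the weighted recombination.
        survivors.append(min((plus, minus), key=lambda y: f(y)))
    return survivors
```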

Benchmarks and Empirical Study

Besides a detailed description of evolution strategies and their theoretical aspects, it is instructive to look at the empirical ability of ESs to solve black-box problems. In benchmarking ESs, two difficulties arise. On the one hand, it is hard to design a set of test functions that captures the problem characteristics encountered in real-world applications. On the other hand, summarizing ES performance over a set of test functions is not straightforward, because the performance of evolution strategies varies largely on problems with different characteristics (e.g., separability). The black-box optimization benchmark (BBOB) [25] was devised to tackle these difficulties. The noiseless BBOB suite encompasses 24 noise-free real-parameter single-objective functions that are either separable, ill-conditioned, or multimodal. All test functions are defined over \(\mathbb {R}^d\), while the global optima of all test functions lie in \([-5, 5]^d\) [19]. In addition, BBOB also introduces a proper measure to represent the performance of ESs for global optimization – the empirical cumulative distribution function (ECDF). ECDFs can be summarized over multiple test functions and represented graphically, increasing the accessibility of the benchmark results.
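
For illustration, an ECDF of running lengths can be computed from a list of per-run evaluation counts as sketched below. BBOB's actual postprocessing aggregates over multiple targets and instances; this sketch shows only the basic construction for a single target, with runs that never reach the target treated as unsuccessful (the curve then saturates below 1):

```python
import numpy as np

def ecdf_running_length(success_evals, budgets, n_runs=None):
    """Fraction of runs that reached the target within each budget.
    success_evals: evaluation counts of the successful runs only."""
    success_evals = np.sort(np.asarray(success_evals))
    n = n_runs if n_runs is not None else len(success_evals)
    return [np.searchsorted(success_evals, b, side='right') / n
            for b in budgets]

# Example: 8 of 10 runs succeeded; ECDF evaluated at a few budgets.
# ecdf_running_length([120, 300, 450, 500, 800, 900, 1500, 2000],
#                     budgets=[100, 500, 1000, 3000], n_runs=10)
```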

As opposed to earlier work, this empirical study considers evolution strategies with different step-size adaptation strategies, including the classical 1/5-th success rule and mutative self-adaptation. Moreover, the benchmark also covers a broad range of classical and contemporary strategies with derandomized self-adaptation. The evolution strategies tested are listed in the following:

  • (1 + 1)-ES: one-plus-one elitist evolution strategy with the 1/5-th success rule.

  • (15, 100)-MSC-ES: mutative self-adaptation of individual step sizes.

  • (1, 7)-MSC-ES: mutative self-adaptation with a single step size.

  • DR2-ES: the derandomized evolution strategy using the accumulated successful mutation vector for step-size adaptation.

  • (μ/μ_w, λ)-CMA-ES: covariance matrix adaptation evolution strategy with weighted intermediate recombination.

  • \((\mu /\mu _{w}, \lambda _{m})\)-CMA-ES: CMA-ES with mirrored sampling and pairwise selection.

  • IPOP-CMA-ES: a restart CMA-ES with increasing population size.

  • (1, λ)-DR2-Niching: the niching approach based on the second derandomized ES variant.

  • (1, λ)-CMA-Niching: CMA-ES niching with fixed niche radius.

  • (1 + λ)-CMA-Niching: the elitist version of the CMA-ES niching approach.

  • (1, λ)-Mahalanobis-CMA-Niching: niche shape adaptation using Mahalanobis distance.
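
As referenced in the first list item, a minimal sketch of the (1 + 1)-ES with a multiplicative variant of the 1∕5-th success rule is given below (the constants and the per-iteration update are one common parameterization, assumed here for illustration rather than taken from the benchmarked implementation):

```python
import numpy as np

def one_plus_one_es(f, x0, sigma0=1.0, budget=10_000, a=1.5, seed=0):
    """(1+1)-ES with a multiplicative 1/5-th success rule (minimization).

    On success the step size is multiplied by a, on failure by a**(-1/4);
    this keeps the empirical success rate near 1/5. The constant a=1.5 is
    an assumed, commonly used choice, not a prescribed value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx, sigma = f(x), sigma0
    for _ in range(budget):
        y = x + sigma * rng.standard_normal(len(x))  # isotropic Gaussian mutation
        fy = f(y)
        if fy <= fx:                 # success: accept offspring, enlarge step
            x, fx = y, fy
            sigma *= a
        else:                        # failure: keep parent, shrink step
            sigma *= a ** -0.25
    return x, fx

x_best, f_best = one_plus_one_es(lambda x: np.sum(x**2), x0=np.ones(5))
```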

Experimental Settings

The BBOB parameter settings of the experiment are the same for all tested ES variants. The initial global step size σ is set to 1. The maximum number of function evaluations is set to \(10^4 \times d\). The initial solution vector (initial parent) is a uniformly distributed random vector restricted to the hyper-box \([-4, 4]^d\). The algorithms are tested on problems with different numbers of input variables, d ∈ {2, 3, 5, 10, 20}.
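
These settings translate directly into a small experiment scaffold; the sketch below only fixes the quantities stated above (the runner run_es_variant is a hypothetical placeholder for any of the listed algorithms):

```python
import numpy as np

DIMENSIONS = [2, 3, 5, 10, 20]
SIGMA0 = 1.0                          # initial global step size

rng = np.random.default_rng()
for d in DIMENSIONS:
    budget = 10**4 * d                # maximum number of function evaluations
    x0 = rng.uniform(-4.0, 4.0, d)    # initial parent in the hyper-box [-4, 4]^d
    # run_es_variant(f, x0, SIGMA0, budget)  # hypothetical runner, one call per BBOB function
```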

The parameter settings of the CMA-ES variants follow the values suggested in [23]; the reader is referred there for details. For all niching ES variants, the setting λ = 10 is used throughout all tests. The calculation of the fixed niche radius can be found in [60].

Results

BBOB automatically records the history of the fitness values found by the tested algorithm. Time is measured in numbers of function evaluations. The benchmark provides a postprocessing procedure that statistically estimates, from the recorded data, the empirical cumulative distribution function (ECDF) of the running length (number of function evaluations) needed to reach the global optimum. The ECDF of the running length describes the distribution of the number of function evaluations a specific optimization algorithm requires to reach the global optimum in an experiment. An ECDF curve concentrated at smaller running lengths indicates better performance of the corresponding algorithm. Thus, ECDFs characterize the performance of optimization algorithms and are used to present the benchmark results. Instead of generating ECDFs for all 24 BBOB test functions, several representative functions are selected, which are listed as follows:

  1. f1: Sphere function.

  2. f2: Ellipsoidal function.

  3. f10: Rotated ellipsoidal function.

  4. f8: Rosenbrock function.

  5. f13: Sharp ridge function.

  6. f7: Step ellipsoidal function.

  7. f15 − f19: Multimodal functions with weak global structure.

  8. f20 − f24: Multimodal functions with adequate global structure.

By aggregating the selected functions into groups with distinct properties, it is possible to illustrate which ES variants perform better given a specific problem property (e.g., separable, isotropic, multimodal). Rather than conducting a thorough competition, we aim to provide insights into the algorithms’ strengths and weaknesses.

The results are presented in two parts. In the first part, all ESs except IPOP-CMA-ES and the niching ES variants are compared on the aforementioned test functions; the latter two groups are excluded from this comparison. The purpose is to compare the ESs that incline toward local search. The results are depicted in Figs. 2, 3, 4, and 5. In 5D, the comparisons on the first six functions show that the standard CMA-ES and its mirrored sampling variant outperform the others on most functions. In general, the mutative self-adaptation ESs perform worse than the derandomized ES variants. The reason is that the MSC-ES normally requires a much larger population size in order to adapt the covariance and thus consumes many more function evaluations. On the simple sphere function, the winner is the (1 + 1)-ES, as expected from theory. In addition, the DR2-ES is as good as CMA-ES there. On the separable ellipsoid function, DR2-ES even outperforms CMA-ES because it efficiently adapts uncorrelated mutations. On the non-separable functions (rotated ellipsoid, Rosenbrock, sharp ridge, and step ellipsoid), CMA-ES and its mirrored variant significantly outperform the other ES variants. This is because CMA-ES is capable of adapting arbitrary Gaussian mutations by means of the covariance matrix, while the other ES variants use either isotropic or axis-parallel mutation distributions. In 20D, the comparison shows roughly the same results as in 5D. Note that on the separable ellipsoid and step ellipsoid functions, the (15, 100)-MSC-ES can match the convergence speed of the CMA-ES variants.

Fig. 2

Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for all functions and subgroups in 5-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each single target. Legend: ∘:(1+1), \(\triangledown \):(1,7)-MSC, ⋆ :(15,100)-MSC, \(\Box \):DR2, △:CMA, ♢:Mirroring

Fig. 3

Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for two groups of multimodal functions in 5-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each single target. Legend: ∘:(1+1), \(\triangledown \):(1,7)-MSC, ⋆ :(15,100)-MSC, \(\Box \):DR2, △:CMA, ♢:Mirroring

Fig. 4

Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for all functions and subgroups in 20-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each single target. Legend: ∘:(1+1), \(\triangledown \):(1,7)-MSC, ⋆ :(15,100)-MSC, \(\Box \):DR2, △:CMA, ♢:Mirroring

Fig. 5

Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for two groups of multimodal functions in 20-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each single target. Legend: ∘:(1+1), \(\triangledown \):(1,7)-MSC, ⋆ :(15,100)-MSC, \(\Box \):DR2, △:CMA, ♢:Mirroring

In the second part, all niching ES variants, IPOP-CMA-ES, the restart (1 + 1)-ES, and the (15, 100)-MSC-ES are compared on the multimodal functions. These algorithms are grouped for this comparison because they are better equipped for global search. The results are depicted in Figs. 6 (5D) and 7 (20D). On the weakly structured multimodal functions (f15 − f19), IPOP-CMA-ES outperforms all niching ES variants, the (1 + 1)-ES with restart, and the (15, 100)-MSC-ES. (1 + λ)-CMA-Niching shows the best performance among all niching ES variants tested. Surprisingly, the (15, 100)-MSC-ES performs well even compared to the niching ES variants, which may be a consequence of its large population size. On the multimodal functions with adequate global structure (f20 − f24), (1 + λ)-CMA-Niching catches up with the performance of IPOP-CMA-ES both in 5D and 20D, whereas the (15, 100)-MSC-ES performs quite poorly in this case. In addition, although it is a simple strategy, the (1 + 1)-ES with restart shows good results compared to IPOP-CMA-ES and the niching ESs. Evidently, the niching ES variants spend many function evaluations on maintaining multiple local optima and thus exhibit altogether poorer performance in terms of global convergence speed when compared to, e.g., IPOP-CMA-ES, which targets the accurate approximation of a single optimum and devotes most of its resources to that goal.

Fig. 6

Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for two groups of multimodal functions in 5-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each single target. Legend: ∘:(15,100)-MSC, \(\triangledown \):(1+1)-restart, ⋆ :IPOP, \(\Box \):niching_cma, △:niching_cma_ma, ♢:niching_cmaplus, :niching_dr2

Fig. 7

Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in \(10^{[-8..2]}\) for two groups of multimodal functions in 20-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each single target. Legend: ∘:(15,100)-MSC, \(\triangledown \):(1+1)-restart, ⋆ :IPOP, \(\Box \):niching_cma, △:niching_cma_ma, ♢:niching_cmaplus, :niching_dr2

Conclusions

It has been shown in this chapter that evolution strategies are a versatile class of stochastic search heuristics for optimization. There exists a rich body of theoretical results on ESs, including global convergence conditions, results showing linear convergence rates on high-dimensional functions, and findings on the stability of subpopulations and the impact of recombination on global convergence reliability. Moreover, ESs are rank-based (order-invariant) and invariant to changes of the coordinate system. The self-adaptation of the mutation distribution is an important feature, too, as it frees the user from the burden of choosing the right mutation parameters and also makes highly precise approximation of optima possible.

This chapter highlighted mainstream variants of ESs for continuous optimization, including the (1 + 1)-ES with the 1∕5-th success rule, the (μ, λ)-ES with mutative step-size adaptation, and common ES variants with different levels of derandomized step-size adaptation, namely DR1, DR2, DR3, and CMA-ES. Moreover, common concepts of ESs for multimodal optimization were discussed.

All these strategies have been compared on different categories of functions. The empirical studies confirmed the superiority of covariance matrix adaptation techniques on ill-conditioned problems with correlated variables. However, when these properties do not govern the search difficulty, other evolution strategies can be highly competitive as well. Moreover, it was confirmed that multimodal optimization requires special adaptations of evolution strategies in order to achieve maximal performance.

Our literature review has shown that the algorithmic techniques developed for ESs are fruitful not only in the domain of continuous optimization but can also be applied to other problem classes. Here, the key is to define a metric representation of the search space and to follow a set of guidelines for the design of mutation and recombination operators, which were reviewed here in a rather informal manner.

Some prominent topics for future research will be the integration of multiple criteria and constraints, although first promising results are already available in this direction. Moreover, for nonstandard ESs, the theoretical analysis needs to be advanced, in particular the study of convergence dynamics when the available time is limited. Finally, looking back at the original biological inspiration of evolution strategies, one might conjecture that nature still has many “tricks” in store that, when well understood, could lead to further enhancements of ES-like search strategies. In this context, it will be interesting to follow recent trends in biological evolution theory [32], which show that a much broader set of mechanisms seems to govern organic evolution than those captured in the modern synthesis.

Cross-References