Optimal deterministic algorithm generation
Abstract
A formulation for the automated generation of algorithms via mathematical programming (optimization) is proposed. The formulation is based on the concept of optimizing within a parameterized family of algorithms, or equivalently a family of functions describing the algorithmic steps. The optimization variables are the parameters, within this family of algorithms, that encode algorithm design: the computational steps of which the selected algorithms consist. The objective function of the optimization problem encodes the merit function of the algorithm, e.g., the computational cost (possibly also including a cost component for memory requirements) of the algorithm execution. The constraints of the optimization problem ensure convergence of the algorithm, i.e., solution of the problem at hand. The formulation is described prototypically for algorithms used in solving nonlinear equations and in performing unconstrained optimization; the parametrized algorithm family considered is that of monomials in function and derivative evaluation (including negative powers). A prototype implementation in GAMS is provided along with illustrative results demonstrating cases for which well-known algorithms are shown to be optimal. The formulation is a mixed-integer nonlinear program. To overcome the multimodality arising from nonconvexity in the optimization problem, a combination of brute force and general-purpose deterministic global algorithms is employed to guarantee the optimality of the algorithm devised. We then discuss several directions towards which this methodology can be extended, their scope and limitations.
Keywords
Optimization · Nonlinear equations · Algorithms · Optimal control
1 Introduction
Computation has led to major advances in science and engineering, in large part due to ingenious numerical algorithms. The development of algorithms is thus of crucial importance and requires substantial resources and time. Typically multiple algorithms exist for a given task. Comparisons of algorithms that tackle a given family of problems are sometimes performed theoretically and sometimes numerically. It would be desirable to automatically generate algorithms and have a guarantee that these are the best for a given problem or for a class of problems. We propose the use of numerical optimization to design optimal algorithms, i.e., algorithms that perform better than any conceivable alternative. More precisely, each iteration of the algorithms is interpreted, in an input-output sense, as a function evaluation. Then, an optimization problem is formulated that finds the best (in a given metric) algorithm among a family of algorithms/functions.
There is a large literature on automatic generation of algorithms. In many cases genetic algorithms are used to first generate a set of algorithms built of elementary components of promising approaches. The set is evaluated on test problems and a new elitist algorithm is obtained by combination [4, 21, 22]. The work by [21, 22] covers basic methods and ideas in the field of genetic programming. Automatic algorithm generation for retrieval, classification and other data manipulation has been proposed by [23] based on statistical data analysis and other methods. For instance, in [33] algorithms for the well-known knapsack problem are automatically generated by genetic algorithms. Algorithms are first generated for small knapsack instances and then compared to algorithms computed on larger test sets. In [19] automatic generation of parallel sorting algorithms is presented by combination of known sorting algorithms to obtain a best performing algorithm for a certain input. The well-known art gallery problem (AGP) is discussed by [41], where an iterative algorithm is developed. Building on the work of [32], in [15] a novel approach for performance analysis is presented. In [20] Nesterov-like first-order algorithms for smooth unconstrained convex optimization are developed. The worst-case convergence bound of the presented algorithms is half that of Nesterov’s methods. Moreover, there is substantial work on automatic code and algorithm generation for certain problems, e.g., [2, 3, 13, 34]. Many depend on algebraic operations regarding matrices, e.g., QR decomposition, diagonal transformations or eigenvalue computation. Many of those operations are well documented in the thesis of [6]. Finally, the problem of tuning the parameters of existing algorithms through optimization has been considered [25]. Lessard et al. [24] use control theory to derive bounds on the convergence rate of important algorithms, most of which are considered herein as well.
Deterministic global optimization has been used by Ruuth and Spiteri [35, 36] to maximize the Courant–Friedrichs–Lewy coefficients of Runge–Kutta methods subject to constraints ensuring that they preserve strong stability.
Herein, the focus is on iterative algorithms and the basis for the automatic generation will be a formulation similar to the ones used in optimization with dynamic systems embedded. Consider an iterative algorithm, like the time-honored Newton–Raphson for finding roots of algebraic equations, e.g., in one dimension \( x_{n+1} = x_n - f(x_n)/f'(x_n) \equiv x_n + u_n(x_n). \) The update can thus be seen as a “control action” \(u_n\) that depends only on the current guess, i.e., “state-dependent feedback”; a similar interpretation can be given to most algorithms. But if algorithms are feedback laws (towards a prescribed objective), then we do have celebrated mathematical tools for computing optimal feedback, e.g., Hamilton–Jacobi–Bellman. It would thus make sense to use these tools to devise optimal algorithms. Recall that iterative algorithms have been considered before in automatic algorithm generation, e.g., [41].
The key goal of the manuscript is to formulate a systematic process for obtaining optimal algorithms. To the best of our knowledge, there exists no other proposal in the literature to use deterministic global optimization for finding an algorithm that is optimal w.r.t. an objective such as the computational cost for general classes of algorithms. Recall the references [35, 36] which also use deterministic global optimization; in that sense the work of Ruuth and Spiteri precedes our work in using mixed-integer nonlinear programming (MINLP) to devise optimal algorithms. However, therein the objective is to maximize the values of coefficients subject to stability of the algorithm, whereas we consider more general classes of algorithms and use estimates of the cost of the algorithm as the objective. More specifically, in our work, optimality is determined based on a desired measurable performance property which is encoded as the objective of the optimization problem. For instance this can be the computational expense of the algorithm, possibly accounting for the memory requirements with an appropriate weight. In the article we use an approximation of the cost, and throughout determine optimality and cheapest algorithms in the sense of minimizing this approximated cost. Upon convergence of our formulation, the resulting algorithm is optimal among a considered family of algorithms, which in turn is encoded using the variables (or degrees of freedom) of the optimization problem. The optimization formulation is completed by the constraints which ensure that only feasible algorithms will be selected among the family of algorithms considered, i.e., algorithms that converge to a solution of the problem at hand. The formulations can be cast as nonlinear programs (NLP) or MINLP. These formulations are quite expensive; in fact, as we discuss in the following, they are expected to be more expensive than the original problem; however, we argue that this is acceptable.
Our proof-of-concept illustrations involve devising optimal algorithms that perform local unconstrained minimization of scalar functions and those that solve systems of equations. We consider a class of algorithms that involves evaluations of the function and its first two derivatives. Algorithms are composed, for a modest number of iterations, of a monomial involving these quantities; we call these monomial-type algorithms. Optimality is first sought within this class of algorithms. The class is then extended to allow for methods such as Nesterov’s acceleration. The well-known steepest descent is shown to be optimal in finding a stationary point of a univariate fourth-order polynomial, whereas it is infeasible for the two-dimensional Rosenbrock objective function (it would take more than the allowed number of iterations to converge). Both statements hold for a given initial guess (starting value of the variables given to the algorithm). We also consider the effect of different initial guesses and show that Nesterov’s method is the cheapest in an ensemble of initial guesses. Finally, we focus on finding optimal algorithms for a given problem, e.g., minimizing a fixed function; however, we discuss how to handle multiple instances by considering families of problems.
The remainder of the manuscript is structured starting with the simplest possible setting, namely solution of a single nonlinear equation by a particular class of algorithms. Two immediate extensions are considered: (a) to systems of equations and (b) to local unconstrained optimization. Subsequently, numerical results are presented along with some discussion. Limitations are discussed followed by potential extensions to more general problems and algorithm classes. Finally, key conclusions are presented.
2 Definitions and assumptions
For the sake of simplicity, first, solution of a single nonlinear equation is considered as the prototypical task an algorithm has to tackle.
Definition 1
(Solution of equation) Let \(x \in X \subset {\mathbb {R}}\) and \(f:X \rightarrow {\mathbb {R}}\). A solution of the equation is a point \(x^* \in X\) with \(f(x^*)=0\). An approximate solution of the equation is a point \(x^* \in X\) with \(|f(x^*)|\le \varepsilon \), \( \varepsilon > 0\).
Note that no assumptions on existence or uniqueness are made, nor on convexity of the function. A particular interest is to find a steady-state solution of a dynamic system \(\dot{x}(t)=f\left( x(t)\right) \). We will first make use of relatively simple algorithms:
Definition 2
(Simple Algorithm) Algorithms operate on the variable space \(\mathbf{x} \in X \subset {\mathbb {R}}^{n_x}\) and start with an initial guess \(\mathbf{x}^{(0)} \in X\). A given iteration (it) of a simple algorithm amounts to calculating the current iterate as a function of the previous \(\mathbf{x}^{(it)}=g_{it}\left( \mathbf{x}^{(it-1)}\right) \) with \(g_{it}: X \rightarrow X\). Therefore, the algorithm has calculated \(\mathbf{x}^{(it)} = g_{it}\left( g_{it-1}\left( \ldots (g_{1}(\mathbf{x}^{(0)}))\right) \right) \) after iteration (it). A special case are algorithms that satisfy \(g_{it_1}(\mathbf{x})=g_{it_2}(\mathbf{x})\) for all \(it_1,it_2\), i.e., use the same function at each iteration.
Note that the definition includes the initial guess \(\mathbf{x}^{(0)}\). Herein, both a fixed initial guess and an ensemble of initial conditions are used. Unless otherwise noted, the same function will be used at each iteration \(g_{it} \equiv g\).
In other words, algorithmic iterations can be seen as functions and algorithms as composite functions. This motivates our premise, that finding an optimal algorithm constitutes an optimal control problem. In some sense, the formulation finds the optimal function among a set of algorithms considered. One could thus talk of a “hyper-algorithm” or “meta-algorithm” implying a (parametrized) family of algorithms, or possibly the span of a “basis set” of algorithms that are used to identify an optimal algorithm. In the following we will assume \(X={\mathbb {R}}^{n_x}\) and consider approximate feasible algorithms:
Definition 3
An algorithm is feasible if it solves a problem; otherwise it is termed infeasible. More precisely: it is termed feasible in the limit for a given problem if it solves this problem as the number of iterations approaches \(\infty \); it is termed finitely feasible if it solves the problem after a finite number of iterations; it is termed approximate feasible if after a finite number of iterations it solves a problem approximately. In all these cases the statements hold for some (potentially open) set of initial conditions. A feasible algorithm is optimal with respect to a metric if among all feasible algorithms it minimizes that metric.
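Definitions 2 and 3 can be illustrated with a short sketch. The Python fragment below is an illustration only and is not the GAMS prototype; the helper name `first_feasible_iteration`, the tolerance, the iteration limit and the starting point are hypothetical choices, and the Newton map for \(x^3=1\) merely serves as an example of a simple algorithm g that uses the same function at each iteration.

```python
from typing import Callable, Optional


def first_feasible_iteration(
    g: Callable[[float], float],
    f: Callable[[float], float],
    x0: float,
    eps: float,
    max_it: int,
) -> Optional[int]:
    """Return the first iteration at which |f(x)| <= eps, or None.

    An integer result means that the simple algorithm x <- g(x) is
    approximate feasible within max_it iterations for this f and this
    initial guess; None means it is not (within that limit).
    """
    x = x0
    if abs(f(x)) <= eps:
        return 0
    for it in range(1, max_it + 1):
        x = g(x)  # the same function g is applied at every iteration
        if abs(f(x)) <= eps:
            return it
    return None


def f(x: float) -> float:
    return x**3 - 1.0


def newton(x: float) -> float:
    # Newton map for f(x) = x^3 - 1 (illustrative choice of g)
    return x - f(x) / (3.0 * x**2)


def identity(x: float) -> float:
    # An algorithm that never moves: infeasible unless x0 already solves f
    return x


n_newton = first_feasible_iteration(newton, f, x0=2.0, eps=1e-8, max_it=10)
n_identity = first_feasible_iteration(identity, f, x0=2.0, eps=1e-8, max_it=10)
```

Under these assumptions, `n_newton` is an iteration count no larger than the limit, while `n_identity` is `None`, i.e., the identity map is infeasible for this initial guess.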
One could also distinguish between feasible path algorithms, i.e., those that satisfy \(\mathbf{x}^{(it)} \in X\), \(\forall it\) in contrast to algorithms that for intermediate iterations violate this condition.
Throughout the article, it is assumed that \(f:X \subset {\mathbb {R}} \rightarrow {\mathbb {R}}\) is sufficiently smooth, in the sense that if algorithms are considered that use derivatives of a given order, then f is continuously differentiable to at least that order. The derivative of order j is denoted by \(f^{(j)}\) with \(f^{(0)}\equiv f\). A key point in our approach is the selection of a class of algorithms: a (parametrizable) family of functions. It is possible, at least in principle, to directly consider the functions \(g_{it}\) and optimize in the corresponding space of functions. Alternatively, one can identify a specific family of algorithms, e.g., those based on gradients, which includes well-known algorithms such as Newton and explicit Euler:
Definition 4
Consider problems that involve a single variable \(x \in X \subset {\mathbb {R}}\) and \(f: X \rightarrow {\mathbb {R}}\). Monomial-type algorithms are those that consist of a monomial \(g_{it}(x)=x+ \alpha _{it} {\varPi }_{j=0}^{j_{max}} \left( f^{(j)}(x)\right) ^{\nu _j}\) of derivatives of at most order \(j_{max}\), allowing for positive and negative (integer) powers \(\nu _j\).

Newton: \(x^{(it)}=x^{(it-1)}-\frac{f^{(0)}(x^{(it-1)})}{f^{(1)}(x^{(it-1)})}\), i.e., we have \(\alpha =-1\), \(\nu _0=1\), \(\nu _1=-1\), \(\nu _{i>1}=0\).

Explicit Euler: \(x^{(it)}=x^{(it-1)}+f(x^{(it-1)}) {\varDelta } t\), i.e., we have \(\alpha ={\varDelta } t\), \(\nu _0=1\), \(\nu _{i>0}=0\).
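The monomial-type family of Definition 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used for the numerical results: the function name `monomial_step`, the test function \(f(x)=x^3-1\), the point \(x=2\) and the value \({\varDelta } t=0.1\) are hypothetical choices. Newton is taken as the standard update \(x-f/f^{(1)}\), i.e., \(\alpha =-1\), \(\nu =(1,-1,0)\), and explicit Euler as \(\alpha ={\varDelta } t\), \(\nu =(1,0,0)\).

```python
from typing import Callable, Sequence


def monomial_step(
    x: float,
    derivatives: Sequence[Callable[[float], float]],
    alpha: float,
    nu: Sequence[int],
) -> float:
    """One iteration g(x) = x + alpha * prod_j (f^(j)(x)) ** nu_j.

    derivatives[j] evaluates the derivative of order j (order 0 is f itself);
    nu[j] is the integer exponent of that derivative.
    """
    term = 1.0
    for d, n in zip(derivatives, nu):
        if n != 0:
            # Raises ZeroDivisionError if d(x) == 0 and n < 0,
            # i.e., the step is undefined at such a point.
            term *= d(x) ** n
    return x + alpha * term


# Illustrative test function f(x) = x^3 - 1 with its first two derivatives.
derivs = (
    lambda x: x**3 - 1.0,  # f^(0)
    lambda x: 3.0 * x**2,  # f^(1)
    lambda x: 6.0 * x,     # f^(2)
)

x_newton = monomial_step(2.0, derivs, alpha=-1.0, nu=(1, -1, 0))
x_euler = monomial_step(2.0, derivs, alpha=0.1, nu=(1, 0, 0))
```

Only `alpha` and `nu` change between the two calls, which is the sense in which a handful of parameters encodes the algorithm.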
3 Development
We first develop an optimization formulation that identifies optimal algorithms for a particular problem: the solution of a given nonlinear scalar equation, Definition 1. We will consider approximate feasible solutions and simple algorithms that use the same function at each iteration. If the optimization metric accounts for the computational cost of the algorithm, then it is expected that algorithms that are only feasible in the limit will not be optimal since their cost will be higher than that of finitely feasible algorithms.
Obviously to solve the optimization problem with local methods one needs an initial guess for the optimization variables, i.e., an initial algorithm. Some local optimization methods even require a “feasible” initial guess, i.e., an algorithm that solves (nonoptimally) the problem at hand. Note however, that we will consider deterministic global methods for the solution of the optimization problem, and as such, at least in theory, there is no need for an initial guess.
3.1 Infinite problem
The intermediate variables can be considered as part of the optimization variables along with the constraints or eliminated. Note the analogy to full discretization vs. late discretization vs. multiple shooting in dynamic optimization [5, 30] with known advantages and disadvantages.
3.2 Finite problem
It may not always be easy to estimate the cost \(\phi _j\left( f^{(j)}\left( x^{(it)}\right) ,\alpha _{it},\nu _j \right) \). For instance, consider Newton’s method augmented with a line-search method \( x^{(it)}=x^{(it-1)}+\alpha _{it} \frac{f(x^{(it-1)})}{f^{(1)}(x^{(it-1)})}\). For each major iteration, multiple minor iterations are required, with evaluations of f. If the cost of the minor iterations is negligible, the computational cost is simply the evaluation of f, its derivative and its inversion. If we know the number of minor iterations then we can calculate the required number of evaluations. If the number of iterations is not known, we can possibly obtain it from the step size, but this requires knowledge of the line-search strategy. In some sense this is an advantage of the proposed approach: the step size is not determined by a line-search method but rather by the optimization formulation. However, a challenge is that we need to use a good cost for the step size, i.e., incorporate in the objective function the computational expense associated with the algorithm selecting a favorable step size \(\alpha _{it}\). Otherwise, the optimizer can select an optimal \(\alpha _{it}\) for that instance which is not necessarily sensible when the cost is accounted for. To mimic the way conventional algorithms work, in the numerical results we allow discrete values of the step size \(\alpha _{it}=\pm \, 2^{{\bar{\alpha }}_{it}}\) and assign a bigger cost to bigger \({\bar{\alpha }}\) since this corresponds to more evaluations of the line search.
The algorithm is fully determined by selecting \(\nu _j\) and \({\bar{\alpha }}_{it}\). So in some sense the MINLP is a purely integer problem: the variables \(x^{(it)}\) are uniquely determined by the equations. To give a sense of the problem complexity, consider the setup used in the numerical case studies. Assume we allow derivatives of order \(j \in \{0,1,2\}\) and exponents \(\nu _j \in \{-2,-1,0,1,2\}\) and these are fixed for each iteration it. This gives \(5^3=125\) basic algorithms. Further we decide for each algorithm and iteration it if the step size \(\alpha _{it}\) is positive or negative, and allow \({\bar{\alpha }}_{it} \in \{0,1,\ldots ,10\}\), different for each iteration it of the algorithm. Thus, we have \(2\times 10^5\) combinations of step sizes for each algorithm and \(25 \times 10^6\) total number of combinations.
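The count of basic algorithms can be checked by direct enumeration. The following Python sketch is illustrative: it enumerates the exponent triples \((\nu _0,\nu _1,\nu _2)\) and multiplies by the number of step-size combinations per algorithm, which is taken as given from the text rather than derived here. The variable names are hypothetical.

```python
from itertools import product

exponents = (-2, -1, 0, 1, 2)  # admissible values of each nu_j
orders = (0, 1, 2)             # derivative orders j

# Every basic algorithm is one exponent triple (nu_0, nu_1, nu_2).
basic_algorithms = list(product(exponents, repeat=len(orders)))
n_basic = len(basic_algorithms)

# Number of step-size combinations per algorithm, as quoted in the text
# (assumed here, not re-derived).
step_combinations = 2 * 10**5

total_combinations = n_basic * step_combinations
```

The Newton exponents \((1,-1,0)\) appear as one of the enumerated triples, and the product reproduces the total of \(25 \times 10^6\) combinations.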
3.3 System of equations
A natural extension of solving a single equation is to solve a system of equations.
Definition 5
Let \(\mathbf{x} \in X \subset {\mathbb {R}}^{n_x}\) and \(\mathbf{f}:X \rightarrow {\mathbb {R}}^{n_x}\). A solution of the system of equations is a point \(\mathbf{x}^* \in X\) with \(\mathbf{f}(\mathbf{x}^*)=\mathbf{0}\). An approximate solution of the system of equations is a point \(\mathbf{x}^* \in X\) with \(\left\| \mathbf{f}\left( \mathbf{x}^*\right) \right\| \le \varepsilon \), \(\varepsilon >0\).
In addition to more cumbersome notation, the optimization problems become much more challenging to solve due to the increasing number of variables and the more expensive operations. Obviously the monomials need to be defined appropriately; for instance the inverse of the first derivative corresponds to the inverse of the Jacobian, assuming the Jacobian is invertible, i.e., the corresponding linear system has a unique solution. This in turn can be written as an implicit function, or a system of equations can be used in the optimization formulation. For the numerical results herein we use small dimensions (up to two) and analytic expressions for the inverse. In “Appendix A” we discuss an approach more amenable to higher dimensions.
3.4 Special two-step methods
3.5 Finding optimal local optimization algorithms
It is relatively easy to adapt the formulation from equation solving to local solution of optimization problems. The only substantial change consists in replacing the endpoint constraint with some desired termination criterion such as an approximate solution of the KKT conditions. For unconstrained problems the corresponding criterion is stationarity of the objective \(\frac{\partial f}{\partial x_{ix}}=0\), for \(ix=1,\ldots ,n_x\). Similarly to systems of equations, increased dimensionality (\(n_x>1\)) implies that vectors and matrices arise and operations need to be appropriately formulated.
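A termination criterion based on approximate stationarity can be sketched as follows. This Python fragment is an illustration only: the helper name `is_approx_stationary`, the use of the maximum norm and the tolerance are assumptions, and the standard two-dimensional Rosenbrock function \(100(x_2-x_1^2)^2+(1-x_1)^2\) is used as the example objective.

```python
from typing import Callable, Sequence, Tuple


def is_approx_stationary(
    grad: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    eps: float,
) -> bool:
    """Check |df/dx_ix| <= eps for every coordinate ix (maximum norm)."""
    return max(abs(g) for g in grad(x)) <= eps


def rosenbrock_grad(x: Sequence[float]) -> Tuple[float, float]:
    """Gradient of 100*(x2 - x1^2)^2 + (1 - x1)^2."""
    x1, x2 = x
    return (
        -400.0 * x1 * (x2 - x1**2) - 2.0 * (1.0 - x1),
        200.0 * (x2 - x1**2),
    )


at_minimum = is_approx_stationary(rosenbrock_grad, (1.0, 1.0), eps=1e-8)
at_start = is_approx_stationary(rosenbrock_grad, (0.7, 0.75), eps=1e-8)
```

In an optimization formulation, a condition of this kind would replace the residual of the equation as the endpoint constraint; here it is merely evaluated at two points.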
4 Numerical results
4.1 Method
We provide a prototype implementation in GAMS [11], inspired by [29]. We used GAMS 24.5 and tested both local and global methods, mostly KNITRO [12] and BARON [40]. All the common elements are placed in “optforalgomain.gms”. The problemspecific definitions are then given in corresponding GAMS files that are included so it is easy to switch between problems. To limit the number of variables we avoid introducing variables for f and its derivatives and rather define them using macros. We do however introduce auxiliary variables for the iterates of the algorithm as well as variables that capture the residuals, see below. Both could be avoided by the use of further macros. Since we utilize global solvers we impose finite bounds for all variables.
 (EQx3)

Solving \(x^3=1\), \(x \in [-2,2]\), \(x^{(0)}=0.1\)
 (EQexp)

Solving \(x \exp (x)=1\), \(x \in [-2,2]\), \(x^{(0)}=0.1\)
 (minx4)

\(\min _x x^4+x^3-x^2-1\), \(x \in [-2,2]\), \(x^{(0)}=0.1\)
 (RB)

Rosenbrock function in 2d \(\min _{\mathbf{x}} 100 (x_2-x_1^2)^2+(1-x_1)^2\), \(\mathbf{x} \in [-2,2]^2\), \(\mathbf{x}^{(0)}=(0.7,0.75)\)
 (eqn2d)

Simple 2d equation solving \(x_2-x_1^2=0\), \(5x_2-\exp (x_1)=0\). Here we excluded second derivatives.
 (convex)

\(\min _x \exp (x)+x^2\) (strongly convex with convexity parameter 2), \(x \in [-2,2]\), \(x^{(0)}=0.1\)
 (quad2d)

\(\min _\mathbf{x} (x_1-1)^2+2x_2^2-x_1x_2\), \(\mathbf{x} \in [-1,2]^2\), \(\mathbf{x}^{(0)}=(-1,-1)\)
 (exp2d)

\(\min _\mathbf{x} \exp (x_1)+x_1^2+\exp (x_2)+x_2^2+x_1 x_2\), \(\mathbf{x} \in [-1,1]^2\), \({\mathbf {x}}^{(0)}=(0.5,0.5)\). The optimum of f is at \({\mathbf {x}}^*\approx (-0.257627653,-0.257627653)\)
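The reported optimum of (exp2d) can be checked independently. The sketch below assumes the symmetric objective \(\exp (x_1)+x_1^2+\exp (x_2)+x_2^2+x_1 x_2\), which is the form consistent with a minimizer on the diagonal; on the diagonal \(x_1=x_2=t\) stationarity reduces to the scalar equation \(\exp (t)+3t=0\), which is solved here with a few Newton steps. The function names and the number of Newton steps are hypothetical choices.

```python
import math
from typing import Tuple


def exp2d_grad(x1: float, x2: float) -> Tuple[float, float]:
    """Gradient of exp(x1) + x1^2 + exp(x2) + x2^2 + x1*x2 (assumed form)."""
    return (
        math.exp(x1) + 2.0 * x1 + x2,
        math.exp(x2) + 2.0 * x2 + x1,
    )


def diagonal_stationary_point(t0: float = 0.0, n_newton: int = 20) -> float:
    """Solve exp(t) + 3 t = 0 by Newton's method, starting from t0."""
    t = t0
    for _ in range(n_newton):
        t -= (math.exp(t) + 3.0 * t) / (math.exp(t) + 3.0)
    return t


t_star = diagonal_stationary_point()
g1, g2 = exp2d_grad(t_star, t_star)
```

Since the assumed objective is strictly convex (its Hessian has diagonal entries \(\exp (x_i)+2\) and off-diagonal entries 1), the stationary point found on the diagonal is its unique minimizer.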
4.2 Singlestep methods
We tried several runs, with short times (on the order of a minute) and longer times (10 min). BARON does not utilize KNITRO as a local solver. In our experience, KNITRO performs very well in finding feasible points for hard integer problems. Therefore, we let KNITRO find a feasible point for each fixed algorithm, i.e., we fixed the variables describing the algorithm to reduce the problem’s dimension, to provide a good initial point for BARON. Note that when we let KNITRO search the whole algorithm space, it often fails to find any feasible algorithm, even when they exist. This again shows the hardness of the formulation. In some cases convergence of the optimization was improved when we enforced that residuals are decreasing with iteration \(res^{(it)}<res^{(it-1)}\). It is noteworthy that in early computational experiments the optimizer also managed to furnish unexpected solutions that resulted in refining the formulation, e.g., choosing a step size \(\alpha \) that resulted in direct convergence before discrete values of \(\alpha \) were enforced.
Due to the relatively difficult convergence of the optimization and the relatively small number of distinct algorithms (125), we solved each problem for each algorithm in three sequential steps: first KNITRO to find good initial guesses, then BARON without convergence criterion (i.e., without imposing \(y_{res}^{(it_{con})}=0\)) and then BARON with convergence criterion. We tried a 60 s and a 600 s run for each problem. We can see that in 600 s the optimizer was able to detect more algorithms as (in)feasible, see Table 1. Still, the formulation without a fixed algorithm could not be solved to optimality within 600 s. Convergence to the global optimum is not guaranteed for all cases. The main finding of this explicit enumeration is that most algorithms are infeasible for harder problems. For equation solving, in addition to Newton’s algorithm, some algorithms with second derivatives are feasible. In, e.g., problem (minx4) the derivatives could attain 0, leading to a numerical problem. The solvers were still able to handle this issue in most cases, and cases where algorithms resulted in an undefined operation were skipped.
The algorithms discovered as optimal for unconstrained minimization are instructive (at least to us!). In the problem (minx4) the steepest descent algorithm is optimal. It is interesting to note that the algorithms do not furnish a global minimum: in the specific formulation used, optimal algorithms are computationally cheap algorithms that give a stationary point. In contrast, for the minimization of the Rosenbrock function (RB), steepest descent is not feasible within ten iterations, which constitutes a well-known behavior. In the case of solving a single equation, several feasible algorithms were obtained.
Table 1 Summary of the numerical results for the presented problems (CC: convergence criterion)
Problem  With CC, 60 s: #sol  With CC, 60 s: #feas  With CC, 600 s: #sol  With CC, 600 s: #feas  w/o CC, 60 s: #sol  w/o CC, 600 s: #sol
(EQx3)  39  2  41  4  41  42
(EQexp)  31  25  61  54  28  59
(minx4)  13  13  15  15  15  18
(RB)  0  0  0  0  1  1
(eqn2d)  32  0  45  0  39  43
(convex)  52  52  52  52  55  56
(quad2d)  38  12  39  13  39  40
(exp2d)  53  41  53  39  54  55
We also used a genetic algorithm approach in order to solve problems (EQx3) and (EQexp) with the one-step formulation. We used the genetic algorithm implemented in MATLAB R2016a. The genetic algorithm could not find a feasible solution after \(10^9\) generation steps. Since the nonlinear equations in the formulation make the problem very hard for the GA, we relaxed these manually by reformulating them into two inequalities with an additional \(\epsilon =0.001\) factor to allow for a larger feasible set. Still, the GA was not able to find a feasible solution point. We also initialized the GA with one of the feasible algorithms but it did not help either. We do not claim that a more sophisticated genetic algorithm with a possibly differently tailored formulation would be unable to solve the kind of problems considered in this work, although this result gives us a first positive impression of our approach.
4.2.1 Unexpected algorithms discovered as optimal
It is relatively easy to see that algorithm (7) is a special case of (9), and to rationalize why it does work for convex functions (or for concave functions by changing the sign of \(\alpha \)). Recall that we optimized for each possible algorithm that uses up to second derivatives. Algorithm (7) was discovered for the equation (EQexp). In this special case, step size \(\frac{1}{f^{(2)}(x^{(it-1)})}\) was good enough to converge within the 10 allowed iterations and \(\alpha \) can be set to \(-1\). We could rethink the costs of a changing \(\alpha \) and the usage of the inverse of the second derivative in order to force \(1\ne \alpha \ne -1\).
It is interesting to consider the convergence of such algorithms for the case that the root of the function f exists only in the limit. Then, the algorithms searching for the roots can only reach \(\epsilon \)-convergence. For instance, consider \(f(x)=x\exp (x)\), which is convex over, e.g., \(X=(-\infty ,-2]\). It holds that \(\lim \nolimits _{x \rightarrow -\infty }f(x) = 0\) and for, e.g., the starting point \(x^{(0)}=-5\), algorithm (7) moves in the negative x-direction, i.e., the algorithm iterates to the left in each iteration. Let us now consider the algorithm given by (8). Here \(\frac{f^{(1)}(x^{(it-1)})}{(f^{(2)}(x^{(it-1)}))^2}\) operates as the step size \(s^{(it-1)}\). Algorithm (8) is more problematic, but in the special case considered here, it converges.
4.3 The importance of the step size
Recall that the optimization formulation allows choice of the step size aiming to mimic typical step size search. In some cases this leads to “spurious” values of the step size \(\alpha \). We allow step sizes \(\alpha _{it}=\pm \, 2^{{\bar{\alpha }}_{it}}\) with \({\bar{\alpha }}_{it}\in \{0,1,\ldots ,10\}\). The algorithms discovered by our formulation might, in some cases, be rationalized as algorithms for optimal line search. There are many rules for the choice of the step size when using line search. One of those rules can be adjusted to only use step sizes of size \(\alpha _{it}=\pm \, 2^{{\bar{\alpha }}_{it}}\). The rule then says to divide the step size by 2 as long as the current step is not feasible, i.e., the step obtained in the end does not have to be optimal but only feasible. Herein, the optimizer finds the optimal step size for the line search using this rule for the given function f and the given iterative algorithm. In other words, in the 1D case, a suitable step size allows each algorithm to be feasible. Good algorithms for the general case, e.g., Newton, will have a sensible step size selection, whereas other algorithms may have an apparently spurious one. In particular for \(x \exp (x)-1=0\) the simplest algorithm described by \(x^{(it-1)}+\alpha _{it}\) is the second cheapest one and the algorithm \(x^{(it-1)}+\alpha _{it} f\left( x^{(it-1)}\right) ^2\) is surprisingly the cheapest algorithm.
In contrast, for two or more dimensions, such spurious behavior is not expected or obtained. A simple optimization of the step size is not enough for an algorithm to converge to a root or a minimum of a given function, since the direction provided by the gradient is crucial. For the two-dimensional problems not many new algorithms are found. For instance, for the well-known and quite challenging Rosenbrock function, allowing at most 20 iterations, only Newton’s algorithm is found to be feasible. This is not very surprising given the quite restrictive setting considered: single-step methods, modest number of iterations allowed, and restricted choice of step sizes \(\alpha _{it}=\pm \, 2^{{\bar{\alpha }}_{it}}\). Note also that the global optimizer did not converge in all cases, so we cannot exclude the existence of further feasible algorithms; moreover, even though Newton’s algorithm needs fewer than 10 iterations, the solver did not find the algorithm to be feasible when only 10 iterations were allowed.
4.4 Two-step methods
We first considered the one-dimensional strongly convex problem \(\min _x \exp (x)+x^2\) over \(X=[-2,2]\) with starting point \(x^{(0)}=y^{(0)}=0.1\).
Additionally, we considered the minimization of the convex function (exp2d). The optimum of f is at \({\mathbf {x}}^*\approx (-0.257627653,-0.257627653)\). The results can be seen in Fig. 2. Note that this time Newton’s algorithm is not among the 5 cheapest algorithms for this problem. It is also important to remark that the 5 algorithms all follow the same path, as can be seen in the contour plot shown in Fig. 2. This is caused by the choice of the starting point and by the symmetry of the considered function. The examples show that it is possible for us to discover nontrivial algorithms if the formulation is adjusted appropriately, albeit at the high cost of solving the optimization formulations.
4.5 Initial conditions
After finding two-step algorithms for the functions mentioned in Sect. 4.4, we investigated the behavior of the 5 cheapest algorithms for several starting points.
Next, let us discuss the results for (quad2d) and the 5 cheapest algorithms shown in Fig. 1 for starting point \({\mathbf {x}}^{(0)}={\mathbf {y}}^{(0)}=(-1,-1)\). We chose 15 different starting points \({\mathbf {x}}^{(0)}={\mathbf {y}}^{(0)} \in \{(2,2), (2,1), (2,0), (2,-1), (1,2), (1,1), (1,0), (1,-1), (0,2), (0,1), (0,0), (0,-1), (-1,2), (-1,1), (-1,0), (-1,-1)\}\). Nesterov’s method and Newton converged for all starting points and Nesterov was the cheapest for every initial condition. The other algorithms did not converge for all starting points or even became infeasible, e.g., the algorithm \(y^{(it-1)}+\alpha _{it} f^2 \nabla f^2\) was infeasible for every starting point containing a 0 coordinate. Still, even though the unknown algorithms were infeasible for some starting points, the ranking w.r.t. the cost did not change for the cases where all algorithms converged.
Last, let us discuss the results for (exp2d) and the 5 cheapest algorithms shown in Fig. 2 for starting point \({\mathbf {x}}^{(0)}={\mathbf {y}}^{(0)}=(0.5,0.5)\). We chose 13 different starting points \({\mathbf {x}}^{(0)}={\mathbf {y}}^{(0)} \in \{(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1), (0.4,0.3), (0.9,-0.1), (-0.6,-0.8), (-0.3,-0.9)\}\). Here we could observe a behavior that was not quite expected. The algorithms only converged for 7 out of the 13 starting points \(\{(-1,-1), (0,0), (0,1), (1,-1), (1,0), (1,1), (0.4,0.3)\}\) and not for the rest. Some of these 7 starting points are placed on the diagonal path taken by the algorithms, seen in Fig. 2, and the other starting points seemed to simply provide useful derivative information. The algorithms either were infeasible for the other 6 starting points or the optimizer did not converge in the given time. Likely the infeasibility is due to using the same \(\alpha \) and \(\beta \) for both coordinates in the algorithms, which is commonly done. Since the chosen function has symmetrical properties, the algorithms only converge under some specific conditions. In that sense the proposed method gives insight into devising new algorithmic ideas, e.g., different \(\beta \) for different coordinates.
5 Limitations
The proposed formulation is prototypical and a number of challenges arise naturally. As mentioned above, an explicit cost may not be a good approximation for all problems. It is also clear that different objective functions, including, e.g., considerations of memory usage or the inclusion of error correction features, may dramatically affect the results.
We expect the proposed optimization formulation to be very hard to solve and the numerical results confirm this. In particular we expect it to be at least as hard as the original problems. Proving that statement is outside the scope of the manuscript, but two arguments are given, namely that the final constraint corresponds to the solution of the problem and that the subproblems of the optimization problem are the same as, or at least related to, the original problem.
Herein brute force and general-purpose solvers are used for the solution of the optimization problems. It is conceivable to develop specialized numerical algorithms that will perform much better than general-purpose ones due to the specific problem structure. In essence we have an integer problem with linear objective and a single nonlinear constraint. It seems promising to mask the intermediate variables from the optimizer. This would in essence follow the optimization with implicit functions embedded [30] and follow-up work. It is also conceivable to move to parallel computing, but suitable algorithms are not yet parallelized.
In our formulation, f is assumed to be a given function, so that we can apply current numerical methods for the optimization. It is, however, possible to consider multiple instances of the problem simultaneously, i.e., allow f to be among a class of functions. This can be done similarly to stochastic optimization [7]. A simple method is to sample the instances of interest (functions f) and optimize for some weighted/average performance of the algorithm. Alternatively, the instances can be parametrized by continuous parameters and the objective in the above formulations replaced with some metric of the parameter distribution, e.g., the expectation of the cost. It is also possible, and likely promising, to consider worst-case performance, e.g., in a min-max formulation [18]. It is also interesting to consider the relation of this work to the theorems developed in [45], in particular that optimality of an algorithm over one class of problems does not guarantee optimality over another class.
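The sampling idea can be sketched on a toy case. The following is a minimal illustration, not the formulation used in this article: the algorithm family is fixed-step gradient descent \(x \leftarrow x - \alpha f'(x)\), the sampled instances are the hypothetical functions \(f_a(x) = a x^2/2\), the cost is the iteration count, and the candidate step sizes are enumerated by brute force. All names, tolerances and values are assumptions made for illustration.

```python
import math

def iterations_to_converge(alpha, a, x0=1.0, tol=1e-6, max_it=200):
    """Iterations of x <- x - alpha*f'(x) on f(x) = 0.5*a*x**2 until |f'(x)| <= tol."""
    x = x0
    for it in range(max_it + 1):
        if abs(a * x) <= tol:
            return it
        x = x - alpha * a * x
    return math.inf  # no convergence within max_it: treated as infeasible

def best_step(instances, candidates, aggregate):
    """Brute-force choice of the step size minimizing an aggregate cost over instances."""
    costs = {
        alpha: aggregate([iterations_to_converge(alpha, a) for a in instances])
        for alpha in candidates
    }
    return min(costs, key=costs.get), costs

def mean(values):
    return sum(values) / len(values)

instances = [1.0, 2.0, 4.0]              # sampled instances (hypothetical)
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]   # parameter grid (hypothetical)

alpha_avg, avg_costs = best_step(instances, candidates, mean)    # average performance
alpha_worst, worst_costs = best_step(instances, candidates, max)  # worst-case (min-max)
```

In this toy setting the average and the worst-case criteria need not select the same step size, which illustrates why the choice of aggregate matters.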
6 Future work: possible extensions
In principle, the proposed formulation can be applied to any problem and algorithm. In this section we list illustrative extensions to other problems of interest, starting with relatively straightforward steps and progressing towards exploratory possibilities. In the discussion we try to distinguish between extensions that could be carried out relatively easily and more speculative ideas.
6.1 Optimal tuning of algorithms
Many algorithms have tuning parameters. Optimizing these for a given fixed algorithm, using formulations similar to those presented, is straightforward. Of course, alternative proposals exist, e.g., [25].
6.2 Matrices
Regarding the possibility of working with matrices, a formulation for finding algorithms from the family of Quasi–Newton methods could be devised. The general idea of the Quasi–Newton methods is to update an approximate Hessian matrix with the use of the gradient difference \(\nabla f(x^{(it)})-\nabla f(x^{(it-1)})\). Then an approximation of the Hessian matrix is computed by the use of, e.g., Broyden’s method or the Davidon–Fletcher–Powell formula, both of which can be expressed with one equation.
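To illustrate that such an update is a single equation in the step and the gradient difference, the following sketch implements the Davidon–Fletcher–Powell formula in its common inverse-Hessian form, \(H^{+} = H + \frac{s s^T}{y^T s} - \frac{H y y^T H}{y^T H y}\), with \(s = x^{(it)} - x^{(it-1)}\) and \(y = \nabla f(x^{(it)})-\nabla f(x^{(it-1)})\). The choice of the inverse form, the plain-list matrix representation and the quadratic test function are assumptions made for illustration; \(H\) is assumed symmetric.

```python
def dfp_update(H, s, y):
    """DFP update of a symmetric inverse-Hessian approximation H (list of rows).

    s: step x_new - x_old; y: gradient difference grad_new - grad_old.
    """
    n = len(s)
    Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
    ys = sum(y[i] * s[i] for i in range(n))    # y^T s
    yHy = sum(y[i] * Hy[i] for i in range(n))  # y^T H y
    return [
        [H[i][j] + s[i] * s[j] / ys - Hy[i] * Hy[j] / yHy for j in range(n)]
        for i in range(n)
    ]

# Hypothetical quadratic f(x) = 0.5 * x^T A x with A = diag(2, 10), so grad f(x) = A x.
def grad(x):
    return [2.0 * x[0], 10.0 * x[1]]

x_old, x_new = [1.0, 1.0], [0.5, 0.2]
s = [x_new[i] - x_old[i] for i in range(2)]
y = [grad(x_new)[i] - grad(x_old)[i] for i in range(2)]

H_new = dfp_update([[1.0, 0.0], [0.0, 1.0]], s, y)
# The updated matrix satisfies the secant condition H_new y = s.
H_new_y = [sum(H_new[i][j] * y[j] for j in range(2)) for i in range(2)]
```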
6.3 Rational polynomial algorithms
A rather obvious generalization is to consider not just a single monomial but rather rational polynomials involving the derivatives. More generally this could be extended to include non-integer and possibly even irrational powers. This would also allow algorithms involving Taylor-series expansion. No conceptual difficulty exists for an optimization formulation similar to the above but the number of variables increases.
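A possible encoding of such a rational update is sketched below, assuming a scalar problem and an update of the form \(x + N(d)/D(d)\), where \(d = (f, f^{(1)}, f^{(2)})\) and \(N\), \(D\) are sums of monomials in \(d\). The term representation as (coefficient, powers) pairs and the test function are hypothetical. Newton's method is recovered with a single monomial in numerator and denominator, whereas Halley's method requires two terms in the denominator.

```python
def eval_terms(terms, d):
    """Evaluate sum_k coef_k * prod_j d[j]**p_kj for terms given as (coef, powers)."""
    total = 0.0
    for coef, powers in terms:
        value = coef
        for d_j, p in zip(d, powers):
            value *= d_j ** p
        total += value
    return total

def rational_step(x, derivs, num, den):
    """One update x + num(d)/den(d), where d = [f(x), f'(x), f''(x)]."""
    d = [g(x) for g in derivs]
    return x + eval_terms(num, d) / eval_terms(den, d)

# Illustrative problem (an assumption): solve f(x) = x**2 - 2 = 0.
derivs = [lambda x: x * x - 2.0, lambda x: 2.0 * x, lambda x: 2.0]

# Newton: x - f / f'
newton = ([(-1.0, (1, 0, 0))], [(1.0, (0, 1, 0))])
# Halley: x - 2 f f' / (2 f'^2 - f f'')
halley = ([(-2.0, (1, 1, 0))], [(2.0, (0, 2, 0)), (-1.0, (1, 0, 1))])

def iterate(scheme, x0, n_it):
    x = x0
    for _ in range(n_it):
        x = rational_step(x, derivs, *scheme)
    return x

x_newton = iterate(newton, 1.5, 6)
x_halley = iterate(halley, 1.5, 6)
```

In an optimization formulation, the coefficients and powers of each term would become decision variables, which is the increase in the number of variables noted above.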
6.4 Integral operations
The formulation essentially also allows algorithms that perform integration, if \(f^{(j)}\) with \(j<0\) is allowed. Obviously for multivariate programs (\(n_x>1\)) the dimensionality needs to be appropriately considered.
6.5 Implicit methods
6.6 General multistep methods
6.7 Multiple families of algorithms
The article focuses on optimizing within a family of algorithms, described by a parametrized function. A natural question that arises is whether multiple families can be considered. One approach is to encode each family in its own optimization problem, solve multiple optimization problems and post-process the results. Another approach is to devise a function that describes all algorithms, i.e., a more general family of algorithms that encompasses all families. In any case the computational cost may substantially increase.
6.8 Global optimization
A challenge in global optimization is that there are no explicit optimality criteria that can be used in the formulation. The definition of global optimality \( f\left( \mathbf{x}^*\right) \le f\left( \mathbf{x}\right) \), \(\forall \mathbf{x} \in X \) can be added to the algorithm-optimization formulation, but this results in a (generalized) semi-infinite problem (SIP). There are methods to deal with SIPs but they are computationally very expensive [29]. Another challenge is that deterministic and stochastic global methods do not follow the simple update using just monomials as in Definition 4. As such, major conceptual difficulties are expected for such algorithms, including an estimation of the cost of calculating the relaxations. Moreover, to estimate the cost of such algorithms the results of ongoing analysis in the spirit of automatic differentiation will be required [10, 16], and the cost does not only depend on the values of the derivatives at a point. For instance, in \(\alpha \)BB [1, 27] one needs to estimate the eigenvalues of the Hessian for the domain; in McCormick relaxations [28, 42] the cost depends on how many times the composition theorem has to be applied. Recall also the discussion on cost for some local algorithms such as line-search methods.
6.9 Optimal algorithms with optimal steps
6.10 Continuous form
Discrete methods are often reformulated in a continuous form. For instance, Boggs proposed a continuous form of Newton’s method [8, 9] \( \dot{x}(t)=-\frac{f(x(t))}{f^{(1)}(x(t))}\). See also the recent embedding of discrete algorithms like Nesterov’s scheme in continuous implementations [39, 43]. It seems possible to consider the optimization of these continuous variants of the algorithms using similar formulations. Typically discretization methods are used to optimize with such dynamics embedded. Thus, this continuous formulation seems more challenging to solve than the discrete form above. If a particular structure of the problem can be recognized, it could however be interesting, for instance, to apply a relaxation similar to [37].
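The link between the continuous and the discrete form can be made concrete with a small sketch. Assuming the standard sign convention \(\dot{x} = -f/f^{(1)}\) for the continuous Newton flow, an explicit Euler discretization with step \(h\) gives \(x \leftarrow x - h\, f(x)/f^{(1)}(x)\); for \(h=1\) this is the usual Newton iteration, while small \(h\) follows the continuous trajectory, along which \(f\) decays approximately as \(e^{-t}\). The test function and step sizes below are illustrative assumptions.

```python
def newton_flow_euler(f, df, x0, h, n_steps):
    """Explicit Euler integration of dx/dt = -f(x)/f'(x) with step size h."""
    x = x0
    for _ in range(n_steps):
        x = x + h * (-f(x) / df(x))
    return x

# Hypothetical test problem: f(x) = x**2 - 2 with root sqrt(2).
f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x

x_discrete = newton_flow_euler(f, df, 1.5, h=1.0, n_steps=6)   # classical Newton steps
x_flow = newton_flow_euler(f, df, 1.5, h=0.1, n_steps=100)     # approximates x(t = 10)
```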
6.11 Dynamic simulation algorithms
The task here is to simulate a dynamical system, e.g., ordinary differential equations (ODE) \( \dot{\mathbf{x}}(t)=\mathbf{f}(\mathbf{x}(t)) \) along with some initial conditions \(\mathbf{x}(t=0)=\mathbf{x}^{init}\) for a given time interval \([0,t_f]\). We need to define what is meant by a feasible algorithm. A natural definition involves the difference of the computed time trajectory from the exact solution, which is not known, and thus does not yield an explicit condition. One should therefore check the degree to which the resulting approximate curve, possibly after interpolation, satisfies the differential equation over this time interval.
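One way such a residual-based feasibility check could look is sketched below. It is an assumption-laden illustration rather than a proposed definition: the candidate algorithm is explicit Euler, the interpolation is piecewise linear, the residual \(|\dot{x} - f(x)|\) is sampled only at interval midpoints, and the scalar test equation \(\dot{x} = -x\) is hypothetical. An algorithm would be deemed feasible if the maximal residual is below a chosen tolerance.

```python
def euler_trajectory(f, x_init, t_f, n_steps):
    """Explicit Euler approximation of dx/dt = f(x), x(0) = x_init, on [0, t_f]."""
    h = t_f / n_steps
    xs = [x_init]
    for _ in range(n_steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return h, xs

def max_residual(f, h, xs):
    """Largest |slope - f(x)| of the piecewise-linear interpolant at interval midpoints."""
    worst = 0.0
    for x_a, x_b in zip(xs, xs[1:]):
        slope = (x_b - x_a) / h      # derivative of the interpolant on this interval
        x_mid = 0.5 * (x_a + x_b)    # interpolant value at the midpoint
        worst = max(worst, abs(slope - f(x_mid)))
    return worst

rhs = lambda x: -x  # hypothetical test ODE

h_coarse, xs_coarse = euler_trajectory(rhs, 1.0, 1.0, 10)
h_fine, xs_fine = euler_trajectory(rhs, 1.0, 1.0, 100)
res_coarse = max_residual(rhs, h_coarse, xs_coarse)
res_fine = max_residual(rhs, h_fine, xs_fine)
```

Refining the step reduces the residual, at the price of more function evaluations, which is the cost-versus-feasibility trade-off an optimization over such algorithms would have to resolve.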
6.12 Dynamic optimization algorithms
Dynamic optimization combines the difficulties of optimization and dynamic simulation. Moreover, it results in a somewhat amusing cyclic problem: we require an algorithm for the solution of dynamic optimization problems to select a dynamic optimization algorithm. As mentioned above, this is not prohibitive, e.g., in the offline design of an algorithm to be used online.
6.13 Algorithms in the limit
Considering algorithms that provably converge to the correct solution in the limit makes the optimization problem more challenging. The infinite-dimensional formulation (1) is in principle applicable with the aforementioned challenges. If the finite (parametrized) formulation (5) were directly applied, an infinite number of variables would have to be solved for. In such cases one could test for asymptotic self-similarity of the algorithm behavior as a way of assessing its asymptotic result.
6.14 Quantum Algorithms
A potential breakthrough in computational engineering would be realized by quantum computers. These will require new algorithms, and several scientists are developing such algorithms. It would thus be of great interest to consider the extension of the proposed formulation to quantum algorithms and/or their real-world realizations. This may result in “regular” optimization problems or problems that need to be solved with quantum algorithms themselves.
7 Conclusions
An MINLP formulation is proposed that can devise optimal algorithms (among a relatively large family of algorithms) for several prototypical problems, including solution of nonlinear equations and nonlinear optimization. Simple numerical case studies demonstrate that well-known algorithms can be identified, and so can new ones. We argue that the formulation is conceptually extensible to many interesting classes of problems, including quantum algorithms. Substantial work is now needed to develop and implement these extensions so as to numerically devise optimal algorithms for interesting classes of challenging problems where such algorithms are simply not known. Also, the similarity to model-reference adaptive control (MRAC) and internal model control (IMC) can be further explored in the future. The optimization problems obtained can, however, be very challenging and no claim is made that optimal algorithm discovery will be computationally cheap. Nevertheless, in addition to the theoretical interest, there are certainly applications. Finding guaranteed optimal algorithms for a given problem implies understanding/classifying the difficulty of this problem. It will also certainly be worthwhile to automatically devise algorithms offline that will be used to solve problems online.
Notes
Acknowledgements
IGK and AM are indebted to the late C.A. Floudas for bringing them together. Fruitful discussions with G.A. Kevrekidis and the help of C.W. Gear with the continuous formulation are greatly appreciated. The anonymous reviewers provided helpful feedback that resulted in an improved manuscript.
References
1. Adjiman, C.S., Floudas, C.A.: Rigorous convex underestimators for general twice-differentiable problems. J. Glob. Optim. 9(1), 23–40 (1996)
2. Arya, S., Mount, D.M., Netanyahu, N., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In: Proceedings of the 5th ACM-SIAM Symposium Discrete Algorithms, pp. 573–582 (1994)
3. Bacher, R.: Automatic generation of optimization code based on symbolic nonlinear domain formulation. In: Proceedings of the 1996 International Symposium on Symbolic and Algebraic Computation, pp. 283–291. ACM (1996)
4. Bain, S., Thornton, J., Sattar, A.: Methods of automatic algorithm generation. In: PRICAI 2004: Trends in Artificial Intelligence, pp. 144–153. Springer (2004)
5. Biegler, L.T.: Nonlinear programming: concepts, algorithms, and applications to chemical processes. MPS-SIAM Series on Optimization. SIAM-Society for Industrial and Applied Mathematics (2010)
6. Bientinesi, P.: Mechanical derivation and systematic analysis of correct linear algebra algorithms. Ph.D. Thesis, Graduate School of The University of Texas at Austin (2006)
7. Birge, J.R., Louveaux, F.: Introduction to Stochastic Programming. Springer, Berlin (1997)
8. Boggs, P.T.: The solution of nonlinear systems of equations by A-stable integration techniques. SIAM J. Numer. Anal. 8, 767–785 (1971)
9. Boggs, P.T.: The convergence of the Ben-Israel iteration for nonlinear least squares problems. Math. Comput. 30, 512–522 (1976)
10. Bompadre, A., Mitsos, A.: Convergence rate of McCormick relaxations. J. Glob. Optim. 52(1), 1–28 (2012)
11. Brooke, A., Kendrick, D., Meeraus, A.: GAMS: A User’s Guide. The Scientific Press, Redwood City (1988)
12. Byrd, R.H., Nocedal, J., Waltz, R.A.: KNITRO: An Integrated Package for Nonlinear Optimization, vol. 83, pp. 35–59. Springer, Berlin (2006)
13. Coelho, C.P., Phillips, J.R., Silveira, L.M.: Robust rational function approximation algorithm for model generation. In: Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, pp. 207–212. ACM (1999)
14. Deuflhard, P.: Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms, vol. 35. Springer, Berlin Heidelberg (2004)
15. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1), 451–482 (2014)
16. Du, K.S., Kearfott, R.B.: The cluster problem in multivariate global optimization. J. Glob. Optim. 5(3), 253–265 (1994)
17. Economou, C.G.: An operator theory approach to nonlinear controller design. Ph.D. Thesis, California Institute of Technology Pasadena, California (1985)
18. Falk, J.E., Hoffman, K.: A nonconvex max-min problem. Naval Res. Logist. 24(3), 441–450 (1977)
19. Garber, B.A., Hoeflinger, D., Li, X., Garzaran, M.J., Padua, D.: Automatic generation of a parallel sorting algorithm. In: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp. 1–5. IEEE (2008)
20. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1), 81–107 (2016)
21. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection, vol. 1. MIT Press, Cambridge (1992)
22. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Subprograms. Cambridge, MA, USA (1994)
23. Kuhner, M., Burgoon, D., Keller, P., Rust, S., Schelhorn, J., Sinnott, L., Stark, G., Taylor, K., Whitney, P.: Automatic algorithm generation (2002). US Patent App. 10/097,198
24. Lessard, L., Recht, B., Packard, A.: Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. ArXiv e-prints (2014)
25. Li, Q., Tai, C., E, W.: Dynamics of stochastic gradient algorithms. arXiv:1511.06251 (2015)
26. Luenberger, D.G.: Introduction to Linear and Nonlinear Programming, vol. 28. Addison-Wesley, Reading (1973)
27. Maranas, C.D., Floudas, C.A.: A global optimization approach for Lennard-Jones microclusters. J. Chem. Phys. 97(10), 7667–7678 (1992)
28. McCormick, G.P.: Computability of global solutions to factorable nonconvex programs: part I. Convex underestimating problems. Math. Program. 10(1), 147–175 (1976)
29. Mitsos, A.: Global optimization of semi-infinite programs via restriction of the right hand side. Optimization 60(10–11), 1291–1308 (2011)
30. Mitsos, A., Chachuat, B., Barton, P.I.: McCormick-based relaxations of algorithms. SIAM J. Optim. 20(2), 573–601 (2009)
31. Mitsos, A., Lemonidis, P., Barton, P.I.: Global solution of bilevel programs with a nonconvex inner program. J. Glob. Optim. 42(4), 475–513 (2008)
32. Nemirovsky, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. J. Wiley, New York (1983)
33. Parada, L., Sepulveda, M., Herrera, C., Parada, V.: Automatic generation of algorithms for the binary knapsack problem. In: Evolutionary Computation (CEC), 2013 IEEE Congress on, pp. 3148–3152. IEEE (2013)
34. Ricart, G., Agrawala, A.K.: An optimal algorithm for mutual exclusion in computer networks. Commun. ACM 24(1), 9–17 (1981)
35. Ruuth, S.: Global optimization of explicit strong-stability-preserving Runge–Kutta methods. Math. Comput. 75, 183–207 (2006)
36. Ruuth, S., Spiteri, R.: High-order strong-stability-preserving Runge–Kutta methods with downwind-biased spatial discretizations. SIAM J. Numer. Anal. 42(3), 974–996 (2004)
37. Sager, S., Bock, H.G., Reinelt, G.: Direct methods with maximal lower bound for mixed-integer optimal control problems. Math. Program. 118(1), 109–149 (2009)
38. Stuber, M.D., Scott, J.K., Barton, P.I.: Convex and concave relaxations of implicit functions. Optim. Methods Softw. 30(3), 424–460 (2015)
39. Su, W., Boyd, S., Candes, E.J.: A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights. ArXiv e-prints (2015)
40. Tawarmalani, M., Sahinidis, N.V.: A polyhedral branch-and-cut approach to global optimization. Math. Program. 103(2), 225–249 (2005)
41. Tozoni, D.C., Rezende, P.J.D., Souza, C.C.D.: Algorithm 966: a practical iterative algorithm for the art gallery problem using integer linear programming (2016)
42. Tsoukalas, A., Mitsos, A.: Multivariate McCormick relaxations. J. Glob. Optim. 59(2–3), 633–662 (2014)
43. Wibisono, A., Wilson, A.C., Jordan, M.I.: A Variational Perspective on Accelerated Methods in Optimization. ArXiv e-prints (2016)
44. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. arXiv preprint arXiv:1603.04245 (2016)
45. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evolut. Comput. 1(1), 67–82 (1997)
46. Zhang, Y., Chen, X., Zhou, D., Jordan, M.I.: Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In: Advances in Neural Information Processing Systems, pp. 1260–1268 (2014)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.