The voice of optimization


We introduce the idea that, using optimal classification trees (OCTs) and optimal classification trees with hyperplanes (OCT-Hs), interpretable machine learning algorithms developed by Bertsimas and Dunn (Mach Learn 106(7):1039–1082, 2017), we can obtain insight into the strategy behind the optimal solution of continuous and mixed-integer convex optimization problems as a function of the key parameters that affect the problem. In this way, optimization is no longer a black box. Instead, we recast optimization as a multiclass classification problem in which the predictor reveals the logic behind the optimal solution. In other words, OCTs and OCT-Hs give optimization a voice. We show on several realistic examples that the accuracy of our method is in the 90–100% range, and that even when the predictions are not correct, the degree of suboptimality or infeasibility is very low. We compare the optimal strategy predictions of OCTs and OCT-Hs with those of feedforward neural networks (NNs) and conclude that the performance of OCT-Hs and NNs is comparable; OCTs are somewhat weaker but often competitive. Our approach therefore provides a novel, insightful understanding of the optimal strategies that solve a broad class of continuous and mixed-integer optimization problems.


  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

  2. Alvarez, A. M., Louveaux, Q., & Wehenkel, L. (2017). A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1), 185–195.

  3. Bengio, Y. (2009) Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127.

  4. Bengio, Y., Lodi, A., & Prouvost, A. (2018). Machine learning for combinatorial optimization: A methodological tour d’horizon. arXiv:1811.0612.

  5. Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039–1082.

  6. Bertsimas, D., & Dunn, J. (2019). Machine learning under a modern optimization lens. London: Dynamic Ideas Press.

  7. Bertsimas, D., & Stellato, B. (2019). Online mixed-integer optimization in milliseconds. arXiv:1907.02206.

  8. Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to linear optimization. New York: Athena Scientific.

  9. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., et al. (2016). End to end learning for self-driving cars. arXiv:1604.07316.

  10. Bonami, P., Lodi, A., & Zarpellon, G. (2018). Learning a classification of mixed-integer quadratic programming problems. In W. J. van Hoeve (Ed.), Integration of constraint programming, artificial intelligence, and operations research (pp. 595–604). Cham: Springer.

  11. Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  12. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. The Wadsworth statistics/probability series. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

  13. Calafiore, G. C. (2010). Random convex programs. SIAM Journal on Optimization, 20(6), 3427–3464.

  14. Clarke, E., Gupta, A., Kukula, J., & Strichman, O. (2002). SAT based abstraction-refinement using ILP and machine learning techniques. In E. Brinksma & K. G. Larsen (Eds.), Computer aided verification (pp. 265–279). Berlin: Springer.

  15. Copeland, J. (2012). Alan Turing: The codebreaker who saved ‘millions of lives’. BBC News. Online; posted 19-June-2012.

  16. Dai, H., Dai, B., & Song, L. (2016). Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd international conference on machine learning, ICML’16 (pp. 2702–2711).

  17. Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., & Song, L. (2017). Learning combinatorial optimization algorithms over graphs. arXiv:1704.01665.

  18. Diamond, S., & Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83), 1–5.

  19. Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264.

  20. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. London: MIT Press.

  21. Gurobi Optimization Inc. (2020). Gurobi optimizer reference manual.

  22. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. No. 2 in Springer series in statistics. New York: Springer.

  23. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.

  24. Hoffman, A. J. (1979). Binding constraints and Helly numbers. Annals of the New York Academy of Sciences, 319(1 Second Intern), 284–288.

  25. Hutter, F., Hoos, H. H., Leyton-Brown, K., & Stützle, T. (2009). ParamILS: An automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1), 267–306.

  26. Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In C. A. Coello (Ed.), Learning and intelligent optimization (pp. 507–523). Berlin: Springer.

  27. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.

  28. Khalil, E. B., Bodic, P. L., Song, L., Nemhauser, G., & Dilkina, B. (2016). Learning to branch in mixed integer programming. In Proceedings of the 30th AAAI conference on artificial intelligence, AAAI’16 (pp. 724–731). London: AAAI Press.

  29. Klaučo, M., Kalúz, M., & Kvasnica, M. (2019). Machine learning-based warm starting of active set methods in embedded model predictive control. Engineering Applications of Artificial Intelligence, 77, 1–8.

  30. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 1097–1105). Curran Associates Inc.

  31. Kruber, M., Lübbecke, M. E., & Parmentier, A. (2017). Learning when to use a decomposition. In D. Salvagnin & M. Lombardi (Eds.), Integration of AI and OR techniques in constraint programming (pp. 202–210). Cham: Springer.

  32. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

  33. López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Birattari, M., & Stützle, T. (2016). The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3, 43–58.

  34. Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91.

  35. McAllester, D. A., & Schapire, R. E. (2000). On the convergence rate of Good–Turing estimators. In Proceedings of the 13th annual conference on computational learning theory.

  36. McDiarmid, C. (1989). On the method of bounded differences. London mathematical society lecture note series (pp. 148–188). Cambridge: Cambridge University Press.

  37. Minton, S. (1996). Automatically configuring constraint satisfaction programs: A case study. Constraints, 1(1–2), 7–43.

  38. Misra, S., Roald, L., & Ng, Y. (2019). Learning for constrained optimization: Identifying optimal active constraint sets. arXiv:1802.09639v4.

  39. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch. In NIPS-W.

  40. Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., & Dill, D. L. (2019). Learning a SAT solver from single-bit supervision. In International conference on learning representations.

  41. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354–359.

  42. Smith, K. A. (1999). Neural networks for combinatorial optimization: A review of more than a decade of research. INFORMS Journal on Computing, 11(1), 15–34.

  43. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). New York: A Bradford Book.

  44. Takapoui, R., Moehle, N., Boyd, S., & Bemporad, A. (2017). A simple effective heuristic for embedded mixed-integer quadratic programming. International Journal of Control, 93, 1–11.

  45. Xu, L., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2008). SATzilla: Portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research, 32(1), 565–606.

  46. Zheng, Y. S., & Federgruen, A. (1991). Finding optimal (s, S) policies is about as simple as evaluating a single policy. Operations Research, 39(4), 654–665.

Author information



Corresponding author

Correspondence to Bartolomeo Stellato.

Editor: Hendrik Blockeel.


Appendix 1: Optimal classification trees

OCTs and OCT-Hs, developed by Bertsimas and Dunn (2017, 2019), are a recently proposed generalization of the classification and regression trees (CARTs) of Breiman et al. (1984); they construct decision trees that are near optimal, with significantly higher prediction accuracy than CARTs, while retaining their interpretability. Bertsimas and Dunn (2019) have shown that, given a NN (feedforward, convolutional, or recurrent), we can construct an OCT-H with the same in-sample accuracy; that is, OCT-Hs are at least as powerful as NNs. The construction can sometimes generate OCT-Hs of high depth. However, they also report computational results showing that OCT-Hs and NNs have comparable performance in practice, even for OCT-H depths below 10.


OCTs recursively partition the feature space \({\mathbf{R}}^p\) into hierarchical disjoint regions. A tree can be defined as a set of nodes \(t \in {\mathcal {T}}\) of two types, \({\mathcal {T}}= {\mathcal {T}}_B \cup {\mathcal {T}}_L\):

Branch nodes:

Nodes \(t \in {\mathcal {T}}_B\) at the tree branches describe a split of the form \({a_t^T\theta < b_t}\), where \(a_t \in {\mathbf{R}}^p\) and \(b_t \in {\mathbf{R}}\). They partition the space into two subsets: the points on the left branch satisfying the inequality, and the remaining points on the right branch. If splits involve a single variable, we call them parallel and refer to the tree as an optimal classification tree (OCT); this is achieved by enforcing all but one component of \(a_t\) to be 0. Otherwise, if the components of \(a_t\) can be freely nonzero, we call the splits hyperplanes and refer to the tree as an optimal classification tree with hyperplanes (OCT-H).

Leaf nodes:

Nodes \(t \in {\mathcal {T}}_L\) at the tree leaves make a class prediction for each point falling into that node.

An example OCT for the Iris dataset appears in Fig. 8 (Bertsimas and Dunn 2019). For each new data point it is straightforward to check which hyperplanes are satisfied and to make a prediction. This characteristic makes OCTs and OCT-Hs highly interpretable. Note that the level of interpretability of the resulting trees can also be tuned by changing the minimum sparsity of \(a_t\): the two extremes are maximum-sparsity OCTs and minimum-sparsity OCT-Hs, but we can specify anything in between.
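The branch-and-leaf traversal described above is simple enough to sketch directly. The `Node` class and `predict` helper below are hypothetical illustrations of how an OCT/OCT-H evaluates a point, not the OptimalTrees.jl API:

```python
import numpy as np

class Node:
    """A tree node: either a branch (split a^T x < b) or a leaf (class label)."""
    def __init__(self, a=None, b=None, left=None, right=None, label=None):
        self.a, self.b = a, b            # split coefficients (branch nodes only)
        self.left, self.right = left, right
        self.label = label               # predicted class (leaf nodes only)

def predict(node, x):
    """Follow the splits down the tree until a leaf is reached."""
    while node.label is None:
        node = node.left if node.a @ x < node.b else node.right
    return node.label

# A depth-1 OCT-H on R^2 with the single hyperplane split x1 + x2 < 1
tree = Node(a=np.array([1.0, 1.0]), b=1.0,
            left=Node(label="A"), right=Node(label="B"))
print(predict(tree, np.array([0.2, 0.3])))  # "A": 0.5 < 1
print(predict(tree, np.array([0.8, 0.9])))  # "B": 1.7 >= 1
```

A parallel (OCT) split is the special case where `a` has a single nonzero entry, so the comparison reduces to thresholding one feature.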

Fig. 8

Example OCT for the Iris dataset (Bertsimas and Dunn 2019)


With the latest computational advances in MIO, Bertsimas and Dunn (2019) were able to exactly formulate the tree training as a MIO problem and solve it in a reasonable amount of time for problem sizes arising in real-world applications.

The OCT cost function is a tradeoff between the misclassification error at each leaf and the tree complexity

$$\begin{aligned} {\mathcal {L}}_{OCT} = \sum _{t \in {\mathcal {T}}_L} L_t + \alpha \sum _{t \in {\mathcal {T}}_B} \Vert a_t\Vert _1, \end{aligned}$$

where \(L_t\) is the misclassification error at node \(t\), and the second term measures the complexity of the tree as the sum of the 1-norms of the hyperplane coefficients over all splits. The parameter \(\alpha > 0\) regulates the tradeoff. For more details about the cost function and the constraints of the problem, we refer the reader to Bertsimas and Dunn (2019, Sects. 2.2, 3.1).
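As a concrete numeric illustration of this tradeoff, with made-up misclassification counts and split coefficients:

```python
import numpy as np

alpha = 0.1                         # complexity tradeoff parameter
L = [2, 0, 1]                       # misclassification errors at three leaves
A = [np.array([1.0, 0.0]),          # parallel split: ||a||_1 = 1
     np.array([0.5, -0.5])]         # hyperplane split: ||a||_1 = 1
loss = sum(L) + alpha * sum(np.abs(a).sum() for a in A)
print(loss)  # 3 + 0.1 * 2.0 = 3.2
```

Larger \(\alpha\) pushes the optimizer toward sparser, shallower (more interpretable) trees at the cost of misclassification error.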

Bertsimas and Dunn apply a local search method (Bertsimas and Dunn 2019, Sect. 2.3) that solves OCT problems of realistic size in a fraction of the time an off-the-shelf optimization solver would take. The proposed algorithm iteratively improves and refines the current tree until a local minimum is reached. By repeating this search from different random initializations, the authors compute several local minima and then take the best one as the resulting tree. This heuristic shows remarkable performance, both in computation time and in the quality of the resulting tree, and is the algorithm included in the OptimalTrees.jl Julia package (Bertsimas and Dunn 2019).
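The restart scheme can be sketched generically. The following toy multi-start local search (on a smooth nonconvex function rather than on trees) illustrates the idea of repeating a greedy improvement procedure from random initializations and keeping the best local minimum; it is not the OptimalTrees.jl algorithm:

```python
import random

def local_search(loss, x0, step=0.1, iters=200):
    """Greedy local search: propose random perturbations, keep improvements."""
    x, fx = x0, loss(x0)
    for _ in range(iters):
        cand = [xi + random.uniform(-step, step) for xi in x]
        fc = loss(cand)
        if fc < fx:
            x, fx = cand, fc
    return x, fx

def multi_start(loss, dim, restarts=20):
    """Run local search from random initializations; return the best local minimum."""
    best_x, best_f = None, float("inf")
    for _ in range(restarts):
        x0 = [random.uniform(-2, 2) for _ in range(dim)]
        x, fx = local_search(loss, x0)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

random.seed(0)
# A toy nonconvex loss with two global minima at (+1, 0) and (-1, 0)
f = lambda x: (x[0] ** 2 - 1) ** 2 + x[1] ** 2
x_best, f_best = multi_start(f, dim=2)
```

Each restart may get stuck in a different local minimum; taking the best over many restarts gives a high-quality solution without a global optimality certificate.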

Appendix 2: Neural networks

NNs have recently become one of the most prominent machine learning techniques, revolutionizing fields such as speech recognition (Hinton et al. 2012) and computer vision (Krizhevsky et al. 2012). The wide range of applications of these techniques has recently extended to autonomous driving (Bojarski et al. 2016) and reinforcement learning (Silver et al. 2017). The popularity of neural networks is also due to widely used open-source libraries for learning on CPUs and GPUs, from both academia and industry, such as TensorFlow (Abadi et al. 2015), Caffe (Jia et al. 2014), and PyTorch (Paszke et al. 2017). We use feedforward neural networks, which offer a good tradeoff between simplicity and accuracy, without resorting to more complex architectures such as convolutional or recurrent neural networks (LeCun et al. 2015).


Given L layers, a neural network is a composition of functions of the form

$$\begin{aligned} \hat{s} = f_L(f_{L-1}(\dots f_1(\theta ))), \end{aligned}$$

where each function is of the form

$$\begin{aligned} y_{l} = f_l(y_{l-1}) = g(W_l y_{l-1} + b_l). \end{aligned}$$

The number of nodes in layer \(l\) is \(n_l\), corresponding to the dimension of the vector \(y_{l} \in {\mathbf{R}}^{n_l}\). Layer \(l=1\) is the input layer and \(l=L\) the output layer; consequently, \(y_{1} = \theta\) and \(y_{L} = \hat{s}\). Each layer applies an affine transformation with parameters \(W_{l} \in {\mathbf{R}}^{n_l \times n_{l-1}}\) and \(b_l \in {\mathbf{R}}^{n_l}\), followed by an activation function \(g: {\mathbf{R}}^{n_l} \rightarrow {\mathbf{R}}^{n_l}\) that models nonlinearities. We choose as activation function the rectified linear unit (ReLU), defined as

$$\begin{aligned} g(x) = \max (x, 0), \end{aligned}$$

for all layers \(l=1,\ldots ,L-1\). Note that the \(\max\) operator is applied element-wise. We chose the ReLU because, on the one hand, it induces sparsity in the model, since it is 0 for the negative components of \(x\), and, on the other hand, it does not suffer from the vanishing-gradient issues of standard sigmoid functions (Goodfellow et al. 2016). An example neural network is shown in Fig. 9.
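The layer composition above can be written directly in NumPy. The dimensions and random weights below are arbitrary placeholders for illustration:

```python
import numpy as np

def relu(x):
    """Element-wise max(x, 0)."""
    return np.maximum(x, 0)

def forward(theta, weights, biases):
    """Compute y_L = f_L(...f_1(theta)) with ReLU on all hidden layers."""
    y = theta
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ y + b                                # affine transformation
        y = z if l == len(weights) - 1 else relu(z)  # no ReLU on output layer
    return y

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]        # n_1 = 4 inputs, two hidden layers, M = 3 outputs
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
s_hat = forward(rng.standard_normal(4), Ws, bs)
print(s_hat.shape)  # (3,)
```

In practice the output layer would additionally apply the softmax activation described next.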

Fig. 9

Example feedforward neural network with functions \(f_i,\;i=1,\ldots ,L\) defined in (12)

For the output layer we would like the network to output not only the predicted class, but also a normalized ranking of the classes according to how likely each is to be the correct one. This can be achieved with a softmax activation function in the output layer, defined as

$$\begin{aligned} g(x)_j = \frac{e^{x_j}}{\sum _{k=1}^{M} e^{x_k}}, \end{aligned}$$

where \(j=1,\ldots ,M\) indexes the elements of \(g(x)\). Hence, \(0 \le g(x)_j \le 1\) and the predicted class is \({\mathrm{argmax}}(\hat{s})\).
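In code, the softmax is usually evaluated after subtracting the maximum entry, a standard numerical-stability trick (the shift cancels in the ratio):

```python
import numpy as np

def softmax(x):
    """Normalized exponentials; subtracting max(x) avoids overflow in exp."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)
# p is a probability vector: nonnegative, sums to 1 (up to floating point)
print(np.argmax(p))  # 0: the first class has the largest score
```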


Before training the network, we rewrite the labels using a one-hot encoding, i.e., \(s_i \in {\mathbf{R}}^M\), where M is the total number of classes and all elements of \(s_i\) are 0 except the one corresponding to its class, which is 1.

We define a smooth cost function amenable to algorithms such as gradient descent, i.e., the cross-entropy loss

$$\begin{aligned} {\mathcal {L}}_{{\mathrm{NN}}} = \sum _{i=1}^{N} -s_i^T\log (\hat{s}_i), \end{aligned}$$

where \(\log\) is applied element-wise. The cross-entropy loss \({\mathcal {L}}_{{\mathrm{NN}}}\) can also be interpreted as the distance between the predicted probability distribution of the labels and the true one.
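A minimal sketch of the one-hot encoding and the cross-entropy loss together, with made-up labels and predicted probabilities:

```python
import numpy as np

def one_hot(labels, M):
    """Encode integer class labels as rows of the M-dimensional identity."""
    return np.eye(M)[labels]

def cross_entropy(S, S_hat):
    """L_NN = sum_i -s_i^T log(s_hat_i) for one-hot rows s_i."""
    return -np.sum(S * np.log(S_hat))

S = one_hot(np.array([0, 2]), M=3)       # two samples, three classes
S_hat = np.array([[0.7, 0.2, 0.1],       # predicted class probabilities
                  [0.1, 0.1, 0.8]])
loss = cross_entropy(S, S_hat)
# loss = -log(0.7) - log(0.8): only the true-class probabilities contribute
```

Because the labels are one-hot, each sample contributes only the negative log-probability the network assigns to its true class.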

The actual training phase consists of applying stochastic gradient descent, with the derivatives of the cost function computed via the back-propagation rule. This method works very well in practice and provides good out-of-sample performance with short training times.

About this article

Cite this article

Bertsimas, D., Stellato, B. The voice of optimization. Mach Learn 110, 249–277 (2021).


Keywords

  • Parametric optimization
  • Interpretability
  • Sampling
  • Multiclass classification