The voice of optimization


We introduce the idea that, using optimal classification trees (OCTs) and optimal classification trees with hyperplanes (OCT-Hs), interpretable machine learning algorithms developed by Bertsimas and Dunn (Mach Learn 106(7):1039–1082, 2017), we can obtain insight into the strategy behind the optimal solution of continuous and mixed-integer convex optimization problems as a function of the key parameters that affect the problem. In this way, optimization is no longer a black box. Instead, we recast optimization as a multiclass classification problem in which the predictor reveals the logic behind the optimal solution. In other words, OCTs and OCT-Hs give optimization a voice. We show on several realistic examples that the accuracy of our method is in the 90–100% range, and that even when the predictions are not correct, the degree of suboptimality or infeasibility is very low. We compare the optimal strategy predictions of OCTs and OCT-Hs with those of feedforward neural networks (NNs) and conclude that the performance of OCT-Hs and NNs is comparable; OCTs are somewhat weaker but often competitive. Our approach therefore provides a novel, insightful understanding of the optimal strategies that solve a broad class of continuous and mixed-integer optimization problems.


  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

  2. Alvarez, A. M., Louveaux, Q., & Wehenkel, L. (2017). A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1), 185–195.

  3. Bengio, Y. (2009) Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127.

  4. Bengio, Y., Lodi, A., & Prouvost, A. (2018). Machine learning for combinatorial optimization: A methodological tour d’horizon. arXiv:1811.0612.

  5. Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039–1082.

  6. Bertsimas, D., & Dunn, J. (2019). Machine learning under a modern optimization lens. London: Dynamic Ideas Press.

  7. Bertsimas, D., & Stellato, B. (2019). Online mixed-integer optimization in milliseconds. arXiv:1907.02206.

  8. Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to linear optimization. New York: Athena Scientific.

  9. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., et al. (2016). End to end learning for self-driving cars. arXiv:1604.07316.

  10. Bonami, P., Lodi, A., & Zarpellon, G. (2018). Learning a classification of mixed-integer quadratic programming problems. In W. J. van Hoeve (Ed.), Integration of constraint programming, artificial intelligence, and operations research (pp. 595–604). Cham: Springer.

  11. Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  12. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. The Wadsworth statistics/probability series. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

  13. Calafiore, G. C. (2010). Random convex programs. SIAM Journal on Optimization, 20(6), 3427–3464.

  14. Clarke, E., Gupta, A., Kukula, J., & Strichman, O. (2002). SAT based abstraction-refinement using ILP and machine learning techniques. In E. Brinksma & K. G. Larsen (Eds.), Computer aided verification (pp. 265–279). Berlin: Springer.

  15. Copeland, J. (2012). Alan Turing: The codebreaker who saved ‘millions of lives’. BBC News. Online; posted 19-June-2012.

  16. Dai, H., Dai, B., & Song, L. (2016). Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd international conference on machine learning, ICML’16 (pp. 2702–2711).

  17. Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., & Song, L. (2017). Learning combinatorial optimization algorithms over graphs. arXiv:1704.01665.

  18. Diamond, S., & Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83), 1–5.

  19. Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264.

  20. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. London: MIT Press.

  21. Gurobi Optimization Inc. (2020). Gurobi optimizer reference manual.

  22. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. No. 2 in Springer series in statistics. New York: Springer.

  23. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.

  24. Hoffman, A. J. (1979). Binding constraints and Helly numbers. Annals of the New York Academy of Sciences, 319(1 Second Intern), 284–288.

  25. Hutter, F., Hoos, H. H., Leyton-Brown, K., & Stützle, T. (2009). ParamILS: An automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1), 267–306.

  26. Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In C. A. Coello (Ed.), Learning and intelligent optimization (pp. 507–523). Berlin: Springer.

  27. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.

  28. Khalil, E. B., Bodic, P. L., Song, L., Nemhauser, G., & Dilkina, B. (2016). Learning to branch in mixed integer programming. In Proceedings of the 30th AAAI conference on artificial intelligence, AAAI’16 (pp. 724–731). London: AAAI Press.

  29. Klaučo, M., Kalúz, M., & Kvasnica, M. (2019). Machine learning-based warm starting of active set methods in embedded model predictive control. Engineering Applications of Artificial Intelligence, 77, 1–8.

  30. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 1097–1105). Curran Associates Inc.

  31. Kruber, M., Lübbecke, M. E., & Parmentier, A. (2017). Learning when to use a decomposition. In D. Salvagnin & M. Lombardi (Eds.), Integration of AI and OR techniques in constraint programming (pp. 202–210). Cham: Springer.

  32. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

  33. López-Ibáñez, M., Dubois-Lacoste, J., Pérez Cáceres, L., Birattari, M., & Stützle, T. (2016). The irace package: Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3, 43–58.

  34. Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91.

  35. McAllester, D. A., & Schapire, R. E. (2000). On the convergence rate of Good–Turing estimators. In Proceedings of the 13th annual conference on computational learning theory.

  36. McDiarmid, C. (1989). On the method of bounded differences. London mathematical society lecture note series (pp. 148–188). Cambridge: Cambridge University Press.

  37. Minton, S. (1996). Automatically configuring constraint satisfaction programs: A case study. Constraints, 1(1–2), 7–43.

  38. Misra, S., Roald, L., & Ng, Y. (2019). Learning for constrained optimization: Identifying optimal active constraint sets. arXiv:1802.09639v4.

  39. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch. In NIPS-W.

  40. Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., & Dill, D. L. (2019). Learning a SAT solver from single-bit supervision. In International conference on learning representations.

  41. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354–359.

  42. Smith, K. A. (1999). Neural networks for combinatorial optimization: A review of more than a decade of research. INFORMS Journal on Computing, 11(1), 15–34.

  43. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). New York: A Bradford Book.

  44. Takapoui, R., Moehle, N., Boyd, S., & Bemporad, A. (2017). A simple effective heuristic for embedded mixed-integer quadratic programming. International Journal of Control, 93, 1–11.

  45. Xu, L., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2008). SATzilla: Portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research, 32(1), 565–606.

  46. Zheng, Y. S., & Federgruen, A. (1991). Finding optimal (s, S) policies is about as simple as evaluating a single policy. Operations Research, 39(4), 654–665.

Author information



Corresponding author

Correspondence to Bartolomeo Stellato.

Editor: Hendrik Blockeel.


Appendix 1: Optimal classification trees

OCTs and OCT-Hs, developed by Bertsimas and Dunn (2017, 2019), are a recently proposed generalization of the classification and regression trees (CARTs) of Breiman et al. (1984); they construct decision trees that are near optimal, with significantly higher prediction accuracy than CARTs, while retaining their interpretability. Bertsimas and Dunn (2019) have shown that, given a NN (feedforward, convolutional, or recurrent), we can construct an OCT-H with the same in-sample accuracy; that is, OCT-Hs are at least as powerful as NNs. The construction can sometimes generate OCT-Hs of high depth. However, they also report computational results showing that OCT-Hs and NNs have comparable performance in practice, even for OCT-H depths below 10.


OCTs recursively partition the feature space \({\mathbf{R}}^p\) into hierarchical disjoint regions. A tree can be defined as a set of nodes \(t \in {\mathcal {T}}\) of two types, \({\mathcal {T}}= {\mathcal {T}}_B \cup {\mathcal {T}}_L\):

Branch nodes:

Nodes \(t \in {\mathcal {T}}_B\) at the tree branches describe a split of the form \({a_t^T\theta < b_t}\), where \(a_t \in {\mathbf{R}}^p\) and \(b_t \in {\mathbf{R}}\). They partition the space into two subsets: the points on the left branch satisfying the inequality, and the remaining points on the right branch. If splits involve a single variable, we call them parallel and refer to the tree as an optimal classification tree (OCT); this is achieved by enforcing all but one component of \(a_t\) to be 0. Otherwise, if the components of \(a_t\) can be freely nonzero, we call the splits hyperplanes and refer to the tree as an optimal classification tree with hyperplanes (OCT-H).

Leaf nodes:

Nodes \(t \in {\mathcal {T}}_L\) at the tree leaves make a class prediction for each point falling into that node.

An example OCT for the Iris dataset appears in Fig. 8 (Bertsimas and Dunn 2019). For each new data point it is straightforward to check which hyperplanes are satisfied and to make a prediction. This characteristic makes OCTs and OCT-Hs highly interpretable. Note that the level of interpretability of the resulting trees can also be tuned by changing the minimum sparsity of \(a_t\): the two extremes are maximum-sparsity OCTs and minimum-sparsity OCT-Hs, but we can specify anything in between.
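The branch-and-leaf traversal described above is simple enough to sketch directly. The `Node` class and `predict` helper below are hypothetical illustrations of how an OCT/OCT-H evaluates a point, not the OptimalTrees.jl API:

```python
import numpy as np

class Node:
    """A tree node: either a branch (split a^T x < b) or a leaf (class label)."""
    def __init__(self, a=None, b=None, left=None, right=None, label=None):
        self.a, self.b = a, b            # split coefficients (branch nodes only)
        self.left, self.right = left, right
        self.label = label               # predicted class (leaf nodes only)

def predict(node, x):
    """Follow the splits down the tree until a leaf is reached."""
    while node.label is None:
        node = node.left if node.a @ x < node.b else node.right
    return node.label

# A depth-1 OCT-H on R^2 with the single hyperplane split x1 + x2 < 1
tree = Node(a=np.array([1.0, 1.0]), b=1.0,
            left=Node(label="A"), right=Node(label="B"))
print(predict(tree, np.array([0.2, 0.3])))  # "A": 0.5 < 1
print(predict(tree, np.array([0.8, 0.9])))  # "B": 1.7 >= 1
```

A parallel (OCT) split is the special case where `a` has a single nonzero entry, so the comparison reduces to thresholding one feature.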

Fig. 8

Example OCT for the Iris dataset (Bertsimas and Dunn 2019)


With the latest computational advances in MIO, Bertsimas and Dunn (2019) were able to exactly formulate the tree training as a MIO problem and solve it in a reasonable amount of time for problem sizes arising in real-world applications.

The OCT cost function is a tradeoff between the misclassification error at each leaf and the tree complexity

$$\begin{aligned} {\mathcal {L}}_{OCT} = \sum _{t \in {\mathcal {T}}_L} L_t + \alpha \sum _{t \in {\mathcal {T}}_B} \Vert a_t\Vert _1, \end{aligned}$$

where \(L_t\) is the misclassification error at node \(t\), and the second term measures the complexity of the tree as the sum of the 1-norms of the hyperplane coefficients over all splits. The parameter \(\alpha > 0\) regulates the tradeoff. For more details about the cost function and the constraints of the problem, we refer the reader to Bertsimas and Dunn (2019, Sects. 2.2, 3.1).
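As a concrete numeric illustration of this tradeoff, with made-up misclassification counts and split coefficients:

```python
import numpy as np

alpha = 0.1                         # complexity tradeoff parameter
L = [2, 0, 1]                       # misclassification errors at three leaves
A = [np.array([1.0, 0.0]),          # parallel split: ||a||_1 = 1
     np.array([0.5, -0.5])]         # hyperplane split: ||a||_1 = 1
loss = sum(L) + alpha * sum(np.abs(a).sum() for a in A)
print(loss)  # 3 + 0.1 * 2.0 = 3.2
```

Larger \(\alpha\) pushes the optimizer toward sparser, shallower (more interpretable) trees at the cost of misclassification error.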

Bertsimas and Dunn apply a local search method (Bertsimas and Dunn 2019, Sect. 2.3) that solves OCT problems of realistic size in a fraction of the time an off-the-shelf optimization solver would take. The proposed algorithm iteratively improves and refines the current tree until a local minimum is reached. By repeating this search from different random initializations, the authors compute several local minima and then take the best one as the resulting tree. This heuristic shows remarkable performance, both in computation time and in the quality of the resulting tree, and is the algorithm included in the OptimalTrees.jl Julia package (Bertsimas and Dunn 2019).
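The restart scheme can be sketched generically. The following toy multi-start local search (on a smooth nonconvex function rather than on trees) illustrates the idea of repeating a greedy improvement procedure from random initializations and keeping the best local minimum; it is not the OptimalTrees.jl algorithm:

```python
import random

def local_search(loss, x0, step=0.1, iters=200):
    """Greedy local search: propose random perturbations, keep improvements."""
    x, fx = x0, loss(x0)
    for _ in range(iters):
        cand = [xi + random.uniform(-step, step) for xi in x]
        fc = loss(cand)
        if fc < fx:
            x, fx = cand, fc
    return x, fx

def multi_start(loss, dim, restarts=20):
    """Run local search from random initializations; return the best local minimum."""
    best_x, best_f = None, float("inf")
    for _ in range(restarts):
        x0 = [random.uniform(-2, 2) for _ in range(dim)]
        x, fx = local_search(loss, x0)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

random.seed(0)
# A toy nonconvex loss with two global minima at (+1, 0) and (-1, 0)
f = lambda x: (x[0] ** 2 - 1) ** 2 + x[1] ** 2
x_best, f_best = multi_start(f, dim=2)
```

Each restart may get stuck in a different local minimum; taking the best over many restarts gives a high-quality solution without a global optimality certificate.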

Appendix 2: Neural networks

NNs have recently become one of the most prominent machine learning techniques, revolutionizing fields such as speech recognition (Hinton et al. 2012) and computer vision (Krizhevsky et al. 2012). The wide range of applications of these techniques has recently extended to autonomous driving (Bojarski et al. 2016) and reinforcement learning (Silver et al. 2017). The popularity of neural networks is also due to widely used open-source libraries for learning on CPUs and GPUs, from both academia and industry, such as TensorFlow (Abadi et al. 2015), Caffe (Jia et al. 2014), and PyTorch (Paszke et al. 2017). We use feedforward neural networks, which offer a good tradeoff between simplicity and accuracy, without resorting to more complex architectures such as convolutional or recurrent neural networks (LeCun et al. 2015).


Given L layers, a neural network is a composition of functions of the form

$$\begin{aligned} \hat{s} = f_L(f_{L-1}(\dots f_1(\theta ))), \end{aligned}$$

where each function is of the form

$$\begin{aligned} y_{l} = f_l(y_{l-1}) = g(W_l y_{l-1} + b_l). \end{aligned}$$

The number of nodes in layer \(l\) is \(n_l\), corresponding to the dimension of the vector \(y_{l} \in {\mathbf{R}}^{n_l}\). Layer \(l=1\) is the input layer and \(l=L\) the output layer; consequently, \(y_{1} = \theta\) and \(y_{L} = \hat{s}\). Each layer applies an affine transformation with parameters \(W_{l} \in {\mathbf{R}}^{n_l \times n_{l-1}}\) and \(b_l \in {\mathbf{R}}^{n_l}\), followed by an activation function \(g: {\mathbf{R}}^{n_l} \rightarrow {\mathbf{R}}^{n_l}\) that models nonlinearities. We choose as activation function the rectified linear unit (ReLU), defined as

$$\begin{aligned} g(x) = \max (x, 0), \end{aligned}$$

for all layers \(l=1,\ldots ,L-1\). Note that the \(\max\) operator is applied element-wise. We chose the ReLU because, on the one hand, it induces sparsity in the model, since it is 0 for the negative components of \(x\), and, on the other hand, it does not suffer from the vanishing-gradient issues of standard sigmoid functions (Goodfellow et al. 2016). An example neural network is shown in Fig. 9.
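The layer composition above can be written directly in NumPy. The dimensions and random weights below are arbitrary placeholders for illustration:

```python
import numpy as np

def relu(x):
    """Element-wise max(x, 0)."""
    return np.maximum(x, 0)

def forward(theta, weights, biases):
    """Compute y_L = f_L(...f_1(theta)) with ReLU on all hidden layers."""
    y = theta
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ y + b                                # affine transformation
        y = z if l == len(weights) - 1 else relu(z)  # no ReLU on output layer
    return y

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]        # n_1 = 4 inputs, two hidden layers, M = 3 outputs
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
s_hat = forward(rng.standard_normal(4), Ws, bs)
print(s_hat.shape)  # (3,)
```

In practice the output layer would additionally apply the softmax activation described next.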

Fig. 9

Example feedforward neural network with functions \(f_i,\;i=1,\ldots ,L\) defined in (12)

For the output layer we would like the network to output not only the predicted class, but also a normalized ranking of the classes according to how likely each is to be the correct one. This can be achieved with a softmax activation function in the output layer, defined as

$$\begin{aligned} g(x)_j = \frac{e^{x_j}}{\sum _{k=1}^{M} e^{x_k}}, \end{aligned}$$

where \(j=1,\ldots ,M\) indexes the elements of \(g(x)\). Hence, \(0 \le g(x)_j \le 1\) and the predicted class is \({\mathrm{argmax}}(\hat{s})\).
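In code, the softmax is usually evaluated after subtracting the maximum entry, a standard numerical-stability trick (the shift cancels in the ratio):

```python
import numpy as np

def softmax(x):
    """Normalized exponentials; subtracting max(x) avoids overflow in exp."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
p = softmax(x)
# p is a probability vector: nonnegative, sums to 1 (up to floating point)
print(np.argmax(p))  # 0: the first class has the largest score
```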


Before training the network, we rewrite the labels using a one-hot encoding, i.e., \(s_i \in {\mathbf{R}}^M\), where M is the total number of classes and all elements of \(s_i\) are 0 except the one corresponding to its class, which is 1.

We define a smooth cost function amenable to algorithms such as gradient descent, i.e., the cross-entropy loss

$$\begin{aligned} {\mathcal {L}}_{{\mathrm{NN}}} = \sum _{i=1}^{N} -s_i^T\log (\hat{s}_i), \end{aligned}$$

where \(\log\) is applied element-wise. The cross-entropy loss \({\mathcal {L}}_{{\mathrm{NN}}}\) can also be interpreted as the distance between the predicted probability distribution of the labels and the true one.
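A minimal sketch of the one-hot encoding and the cross-entropy loss together, with made-up labels and predicted probabilities:

```python
import numpy as np

def one_hot(labels, M):
    """Encode integer class labels as rows of the M-dimensional identity."""
    return np.eye(M)[labels]

def cross_entropy(S, S_hat):
    """L_NN = sum_i -s_i^T log(s_hat_i) for one-hot rows s_i."""
    return -np.sum(S * np.log(S_hat))

S = one_hot(np.array([0, 2]), M=3)       # two samples, three classes
S_hat = np.array([[0.7, 0.2, 0.1],       # predicted class probabilities
                  [0.1, 0.1, 0.8]])
loss = cross_entropy(S, S_hat)
# loss = -log(0.7) - log(0.8): only the true-class probabilities contribute
```

Because the labels are one-hot, each sample contributes only the negative log-probability the network assigns to its true class.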

The actual training phase consists of applying stochastic gradient descent, with the derivatives of the cost function computed via the back-propagation rule. This method works very well in practice and provides good out-of-sample performance with short training times.

About this article

Cite this article

Bertsimas, D., Stellato, B. The voice of optimization. Mach Learn 110, 249–277 (2021).


Keywords

  • Parametric optimization
  • Interpretability
  • Sampling
  • Multiclass classification