A randomized method for handling a difficult function in a convex optimization problem, motivated by probabilistic programming
Abstract
We propose a randomized gradient method for handling a convex function whose gradient computation is demanding. The method bears a resemblance to the stochastic approximation family. But in contrast to stochastic approximation, the present method builds a model problem. The approach is adapted to probability maximization and probabilistic constrained problems. We discuss simulation procedures for gradient estimation.
Keywords
Convex optimization · Stochastic optimization · Probabilistic problems
1 Introduction
A motivation for the above forms is given by the classic probability maximization and probabilistic constrained problems, where \(\phi ({\varvec{z}} ) = \log F( {\varvec{z}} ) \) with a logconcave distribution function \( F( {\varvec{z}} ) \). We briefly overview a couple of closely related probabilistic programming approaches. For a broader survey, see Fábián et al. (2018). Given a distribution and a number \( p\; ( 0< p < 1 ) \), a probabilistic constraint confines the search to the level set \( {{{\mathcal {L}}}}( F, p ) = \{\, {\varvec{z}} \mid F( {\varvec{z}} ) \ge p \,\} \) of the distribution function \( F( {\varvec{z}} ) \). Prékopa (1990) initiated a novel solution approach by introducing the concept of p-efficient points. The point \( {\varvec{z}} \) is p-efficient if \( F( {\varvec{z}} ) \ge p \) and there exists no \( {\varvec{z}}^{\prime } \) such that \( {\varvec{z}}^{\prime } \le {\varvec{z}},\, {\varvec{z}}^{\prime } \not = {\varvec{z}},\, F( {\varvec{z}}^{\prime } ) \ge p \). Prékopa et al. (1998) considered problems with random parameters having a discrete finite distribution. They began by enumerating p-efficient points and, based on them, built a convex relaxation of the problem.
Dentcheva et al. (2000) formulated the probabilistic constraint in a split form: \( T{\varvec{x}} = {\varvec{z}} \) with \( {\varvec{z}} \in {{{\mathcal {L}}}}( F, p ) \); and constructed a Lagrangian dual by relaxing the constraint \( T{\varvec{x}} = {\varvec{z}} \). The resulting dual functional is the sum of the respective optimal objective values of two simpler problems. The first auxiliary problem is a linear programming problem, and the second one is the minimization of a linear function over the level set \({{{\mathcal {L}}}}( F, p ) \). Based on this decomposition, the authors developed a method, called cone generation, that finds new p-efficient points in the course of the optimization process.
As minimization over the level set \( {{\mathcal {L}}}( F, p ) \) entails a substantial computational effort, the master part of the decomposition framework should succeed with as few p-efficient points as possible. Efficient solution methods were developed by Dentcheva et al. (2004) and Dentcheva and Martinez (2013); the latter applies regularization to the master problem. Approximate minimization over the level set \( {{\mathcal {L}}}( F, p ) \) is another enhancement. Dentcheva et al. (2004) constructed approximate p-efficient points by approximating the original distribution with a discrete one. More recently, van Ackooij et al. (2017) employed a special bundle-type method for the solution of the master problem, based on the on-demand accuracy approach of de Oliveira and Sagastizábal (2014). This means working with inexact data and regulating accuracy in the course of the optimization. Approximate p-efficient points with on-demand accuracy were generated employing the integer programming approach of Luedtke et al. (2010).
Our former paper Fábián et al. (2018) focussed on probability maximization, and proposed a polyhedral approximation of the epigraph of the probabilistic function. This approach is analogous to the use of p-efficient points (it was actually motivated by that concept). The dual function is constructed and decomposed in the manner of Dentcheva et al. (2000), but the nonlinear subproblem is easier. In Dentcheva et al. (2000), finding a new p-efficient point amounts to minimization over the level set \({{\mathcal {L}}}(F, p)\). In contrast, a new approximation point in Fábián et al. (2018) is found by unconstrained minimization, with considerably less computational effort. Moreover, a practical approximation scheme was developed in the latter paper: instead of exactly solving an unconstrained subproblem occurring during the process, a single line search suffices. The approach is easy to implement and tolerates noise in gradient computation.
In the present paper, we extend the inner approximation approach of Fábián et al. (2018) to a randomized method handling gradient estimates. The motivation is our experience reported in that former paper: when solving probability maximization problems, most computational efforts were spent on computing gradients. (Computing a single component of the gradient vector required an effort comparable to that of computing a distribution function value). We conclude that easily computable estimates for the gradients are well worth using, even if the iteration count increases due to estimation errors.
The paper is organized as follows. In Sect. 2 we work in an idealized setting, under the following assumptions:
Assumption 1
Assumption 2
Given \( {\varvec{z}} \in \mathrm{I}\mathrm{R}^n \), the function value \( \phi ( {\varvec{z}} ) \) and the gradient vector \( \nabla \phi ( {\varvec{z}} ) \) can be computed exactly.
We present a brief overview of the models and of the column generation approach proposed in Fábián et al. (2018) for the unconstrained problem (1). The epigraph of the convex function \( \phi ( {\varvec{z}} ) \) is approximated by a convex combination of finitely many points (obtained by evaluating the function at the known iterates). New points (columns in a model problem) are generated by unconstrained minimization of a probabilistic function. The column generation problem is solved with a gradient descent method. Due to Assumption 1, an approximate solution is sufficient, taking a limited number of descent steps.
In Sect. 3 we extend the method to gradient estimates, replacing Assumption 2 with
Assumption 3
Given \( {\varvec{z}}, {\varvec{u}} \in \mathrm{I}\mathrm{R}^n \), the function value \( \phi ( {\varvec{z}} ) \) can be computed exactly, and the norm \( \Vert \nabla \phi ( {\varvec{z}} ) - {\varvec{u}} \Vert \) can be estimated with a predefined relative accuracy. Moreover, realizations of an unbiased stochastic estimate \( {\varvec{G}} \) of the gradient vector \( \nabla \phi ( {\varvec{z}} ) \) can be constructed such that \( \hbox {E}( \Vert {\varvec{G}} - \nabla \phi ( {\varvec{z}} ) \Vert ^2 ) \) remains below a predefined tolerance. (Higher accuracy in norm estimation and a tighter tolerance on the variance entail a larger computational effort.)
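A standard way to meet the variance tolerance in an assumption of this kind is to average independent realizations of a raw unbiased estimate: the mean-squared error of the average of N i.i.d. estimates scales as 1/N. The following is only a minimal illustrative sketch with a toy two-dimensional gradient and synthetic Gaussian noise; the names, values, and noise model are hypothetical and not the paper's simulation procedure.

```python
import math
import random

def averaged_gradient(single_estimate, n_dim, tol, per_sample_var):
    """Average enough i.i.d. unbiased raw estimates so that the
    mean-squared error  per_sample_var / N  falls below tol."""
    n_samples = max(1, math.ceil(per_sample_var / tol))
    acc = [0.0] * n_dim
    for _ in range(n_samples):
        g = single_estimate()
        acc = [a + gi for a, gi in zip(acc, g)]
    return [a / n_samples for a in acc]

# Toy setting (hypothetical): the true gradient is (1, -2); each raw
# estimate adds independent unit-variance noise per coordinate, so the
# per-sample mean-squared error is 2.
rng = random.Random(0)
true_grad = (1.0, -2.0)
raw = lambda: [gi + rng.gauss(0.0, 1.0) for gi in true_grad]
g_bar = averaged_gradient(raw, 2, tol=0.01, per_sample_var=2.0)
```

Tightening `tol` increases the sample count proportionally, matching the remark that a tighter tolerance on the variance entails a larger computational effort.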
We develop a randomized version of the column generation method, and present reliability considerations based on Assumption 1.
In Sect. 4 we deal with the convex constrained problem (2), still in the idealized setting of Assumption 1. We consider a parametric version of an unconstrained problem of the form (1). We present an approximation scheme for the constrained problem that requires the approximate solution of a short sequence of unconstrained problems. Initial problems in this sequence are solved with a large stopping tolerance, and the accuracy is gradually increased. This approximation scheme is first developed in a deterministic form (based on the deterministic method of Sect. 2), and then extended to admit the randomized method of Sect. 3. Reliability considerations are presented for the randomized scheme.
The approach is adapted to probabilistic programming problems in Sect. 5. Here we consider \( \phi ( {\varvec{z}} ) = \log F( {\varvec{z}} ) \) with a nondegenerate n-dimensional standard normal distribution function \( F( {\varvec{z}} ) \). Assumption 1 obviously does not hold for every \( {\varvec{z}} \in \mathrm{I}\mathrm{R}^n \) with a probabilistic \( \phi ( {\varvec{z}} ) \). However, as illustrated in Fábián et al. (2018), Assumption 1 holds for the points of a bounded ball around the origin. (The ratio \( \frac{\alpha }{\omega } \) decreases as the radius of the ball increases.) Owing to the special features of the probabilistic function, the column generation process can be guaranteed to remain in a ball of sufficiently large radius. Such a procedure was sketched in Fábián et al. (2018). That construction provides a theoretical justification for limiting our investigations to a bounded ball, but it does not yield usable estimates for the values \( \alpha \) and \( \omega \). While the efficiency considerations of the previous sections carry over to probabilistic problems, reliability considerations cannot be based on Assumption 1. The quality of a model is measured by different means, based on special features of the probabilistic function.
Section 6 contains an overview of algorithms for the estimation of multivariate normal probability distribution function values and gradients. We discuss the numerical integration algorithm of Genz (1992), and the variance reduction Monte Carlo simulation algorithms of Deák (1980, 1986), Szántai (1976, 1985, 1988) and Ambartzumian et al. (1998), mentioning related works. These variance reduction Monte Carlo simulation algorithms were originally developed for use in primal-type methods for probabilistic constrained problems. An abundant stream of research in this direction was initiated by the models, methods and applications pioneered by Prékopa and his school. Based on these algorithms, a gradient estimate satisfying Assumption 3 can be constructed by a two-stage sampling procedure, as mentioned in Sect. 6.5.
Section 7 describes a computational experiment. The aim is to demonstrate the workability of the randomized column generation scheme of Sect. 3 in the case of probabilistic problems.
2 Column generation in an idealized setting
2.1 Polyhedral models
2.2 Linear programming formulations
Observation 4
 (a)
\(\phi _k( \overline{{\varvec{z}}} ) \; =\; \sum _{i=0}^k\,\phi _i {\overline{\lambda }}_i \; =\; {\overline{\vartheta }} + \overline{{\varvec{u}}}^T \overline{{\varvec{z}}}\),
 (b)
\({\overline{\vartheta }} = -\,\phi _k^{\star } ( \overline{{\varvec{u}}} )\),
 (c)
\(\phi _k( \overline{{\varvec{z}}} ) + \phi _k^{\star } ( \overline{{\varvec{u}}} ) = \overline{{\varvec{u}}}^T \overline{{\varvec{z}}} \) and hence \( \overline{{\varvec{u}}} \in \partial \phi _k( \overline{{\varvec{z}}} ) \).
 (a)
The first equality follows from the equivalence of (10) on the one hand, and (6)–(7) on the other hand. The second equality is a straight consequence of complementarity.
 (b)
 (c)
The equality is a consequence of (a) and (b). This is Fenchel’s equality between \( \overline{{\varvec{u}}} \) and \( \overline{{\varvec{z}}} \), with respect to the model function \( \phi _k(\cdot )\). On \( \overline{{\varvec{u}}} \) being a subgradient, see, e.g., Section 23 in Rockafellar (1970).
2.3 A column generation procedure
Theorem 5
This theorem can be found, e.g., in Chapter 8.6 of Luenberger and Ye (2008). Ruszczyński (2006), in Theorem 5.7 of Chapter 5.3.5, presents a slightly different form. The following corollary was obtained in Fábián et al. (2018):
Corollary 6
This can be shown by substituting \( f( {\varvec{z}} ) = -{\overline{\rho }}( {\varvec{z}} ),\; {\varvec{z}}^0 = \overline{{\varvec{z}}} \) in (16), and applying (14). The objective function \( -{\overline{\rho }}( {\varvec{z}} ) \) inherits Assumption 1 from \( \phi ( {\varvec{z}} ) \). Performing j steps with j such that \( ( 1 - \frac{\alpha }{\omega } )^j \le \beta \) yields an appropriate \( {\widehat{{\varvec{z}}}} = {\varvec{z}}^j \).
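The step count in Corollary 6 is the smallest j with \( (1 - \frac{\alpha}{\omega})^j \le \beta \), i.e. \( j = \lceil \log \beta / \log (1 - \frac{\alpha}{\omega}) \rceil \). A small sketch computing this count; the values of \( \alpha \), \( \omega \), \( \beta \) below are purely illustrative.

```python
import math

def descent_steps(alpha, omega, beta):
    """Smallest j with (1 - alpha/omega)**j <= beta,
    assuming 0 < alpha <= omega and 0 < beta < 1."""
    q = 1.0 - alpha / omega
    if q <= 0.0:          # alpha = omega: one step already suffices
        return 1
    return math.ceil(math.log(beta) / math.log(q))

# e.g. with alpha/omega = 0.1 and beta = 0.5 (illustrative values):
j = descent_steps(0.1, 1.0, 0.5)   # (0.9)**7 ~ 0.478 <= 0.5 < (0.9)**6
```

The count grows only logarithmically in \( 1/\beta \), which is why a loose tolerance \( \beta \) keeps the inner loop short.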
In view of the Markowitz rule mentioned above, the vector \( {\widehat{{\varvec{z}}}} \) in Corollary 6 is a fairly good improving vector in the column generation scheme.
Observation 7
\( \overline{{\mathcal {R}}} \) (and hence \( \overline{{\mathcal {B}}} \)) is an upper bound on the gap between the respective optima of the model problem (10) and the original convex problem (3).
Proof
Since \( ( \overline{{\varvec{u}}}, \overline{{\varvec{y}}} ) \) is a feasible solution of the dual problem (4), it follows that (19) is an upper bound on the gap between the respective optima of the dual model problem (9) and the dual problem (4). The observation follows from convex duality. \(\square \)
Remark 8
Prescribing a loose optimality tolerance on \(\overline{{\mathcal {B}}} \) results in an early termination of the column generation process. Common experience with LP problems is that computational effort is substantially reduced by loosening the stopping tolerance.
Remark 9
Looking at the column generation approach from a dual viewpoint, we can see a cutting-plane method. This relationship between the primal and dual approaches is well known; see, e.g., Frangioni (2002, 2018). Details for the present case were worked out in the research report Fábián and Szántai (2017), a former version of the present paper.
The dual viewpoint admits a visual justification of the convergence of the sequence of the optimal dual vectors \( \overline{{\varvec{u}}} \). (Moreover, the cutting-plane method can be regularized, but we do not consider regularization in this paper.)
3 Working with gradient estimates
First we extend Theorem 5. Let \( f : \mathrm{I}\mathrm{R}^n \rightarrow \mathrm{I}\mathrm{R}\) be such that Assumptions 1 and 3 hold. We wish to minimize \(f( {\varvec{z}} ) \) over \( \mathrm{I}\mathrm{R}^n \) using a stochastic descent method. Let \( {\varvec{z}}^{\circ } \in \mathrm{I}\mathrm{R}^n \) denote an iterate, and \({\varvec{g}}^{\circ } = \nabla f( {\varvec{z}}^{\circ } ) \) the corresponding gradient.
Theorem 10
Under the above assumptions, we perform a steepest descent method using gradient estimates: at the current iterate \( {\varvec{z}}^{\circ } \), a gradient estimate \( {\varvec{G}}^{\circ } \) is generated and a line search is performed in that direction. We assume that gradient estimates at the respective iterates are generated independently, and (20)–(21) hold for each of them.
Proof
Let \( {\varvec{G}}^0, \ldots , {\varvec{G}}^{j-1} \) denote the respective gradient estimates for the iterates \( {\varvec{z}}^0, \ldots , {\varvec{z}}^{j-1} \).
Finally, (22) follows from the iterative application of (27). \(\square \)
Coming back to problem (1), let Assumptions 1 and 3 hold for the objective function \( \phi ( {\varvec{z}} ) \). We show that the column generation scheme of Sect. 2.3 can be implemented as a randomized method using gradient estimates. Specifically, we need to approximately solve the column generation subproblem (15).
Corollary 11
Proof
We apply Theorem 10 to \( f( {\varvec{z}} ) = -{\overline{\rho }}( {\varvec{z}} ) \). This function inherits Assumptions 1 and 3 from \( \phi ( {\varvec{z}} ) \). Let \( \varrho = 1 - \frac{\alpha }{\omega ( {\sigma ^2} + 1 )} \) with some \( \sigma > 0 \). We assume that gradient estimates at the respective iterates are generated independently, and (20)–(21) hold for each of them.
Remark 12
Gradients of the function \( -{\overline{\rho }}( {\varvec{z}} ) \) have the form \( \nabla \phi ( {\varvec{z}} ) - \overline{{\varvec{u}}} \). The further the column generation procedure progresses, the smaller the norm \( \Vert \nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \Vert \) gets (see Observation 4(c)).
To satisfy the requirement (20) on variance, better and better estimates are needed. We control accuracy according to Assumption 3.
3.1 Bounding the optimality gap and reliability considerations for the randomized column generation scheme
Assume that our initial model included the columns \( {\varvec{z}}_0, \ldots , {\varvec{z}}_{\iota } \). In the course of the column generation scheme, we select further columns according to Corollary 11, with gradient estimates generated independently. Let the parameters \( \sigma \) and \( \beta \) be fixed for the whole scheme, e.g., set \( \beta = 0.5 \). On the other hand, we keep increasing the reliability of the individual steps during the process, i.e., let \( p = p_{\kappa }\;\; ( \kappa = \iota +1, \iota +2, \ldots ) \) decrease with \(\kappa \).
Example 13
Given the number \(\iota \) of the initial columns, let \( p_{\kappa } = ( \kappa - \iota + 9 )^{-2}\quad ( \kappa = \iota +1, \iota +2, \ldots ) \). Then we have \( \prod _{\kappa =\iota +1}^{\infty }\, ( 1 - p_{\kappa } )\, = 0.9 \). (This is easily proven. We learned it from Szász 1951, Volume II., Chapter X., Section 642).
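The value of the infinite product in Example 13 can be checked numerically: with \( p_\kappa = (\kappa - \iota + 9)^{-2} \), the product is \( \prod_{m=10}^{\infty} (1 - 1/m^2) \), which telescopes to \( \lim_{M\to\infty} \frac{9}{10}\cdot\frac{M+1}{M} = 0.9 \). A small sketch (the cutoff M is arbitrary):

```python
# Partial product of prod_{m=10..M} (1 - 1/m^2); the factors telescope,
# so the partial product equals 0.9 * (M+1)/M and tends to 0.9 from above.
prod = 1.0
for m in range(10, 100001):
    prod *= 1.0 - 1.0 / (m * m)
# prod is approximately 0.9 * 100001/100000
```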
To achieve reliability \( 1  p_{\kappa } \) set in Example 13, we need to make \( O( \log \kappa ) \) steps with the stochastic descent method when selecting the column \({\varvec{z}}_{\kappa } \).
We terminate the column generation process when \(\overline{{\mathcal {B}}} \) of (28) gets below the prescribed accuracy. With the setting of Example 13, the terminal bound is correct with a probability of at least 0.9, regardless of the number of new columns generated over the course of the procedure.
3.2 On stochastic gradient methods
The aim of this section is to place Theorem 10 and the column generation scheme into the broader context of stochastic gradient methods. The idea of stochastic approximation goes back to Robbins and Monro (1951). Important contributions include Ermoliev (1969), Gaivoronski (1978), Nemirovski and Yudin (1978, 1983), Nesterov (1983, 2009), Ermoliev (1983), Ruszczyński and Syski (1986), Uryasev (1988), Pflug (1988, 1996), Polyak (1990), Polyak and Juditsky (1992), Benveniste et al. (1993), Nemirovski et al. (2009), Lan (2012). The approach is attractive from a theoretical point of view, but early forms might perform poorly in practice. Recent forms combine theoretical depth with practical effectiveness.
Methods differ in the construction of gradient estimates and in the determination of step lengths. Establishing an appropriate stopping rule is also a critical issue. Many of the methods apply averaging (like the example method sketched below), and some employ the dual space also.
As a recent example of the stochastic gradient approach, we sketch the robust stochastic approximation method of Nemirovski and Yudin. It is assumed that, given \( {\varvec{x}} \in X \), realizations of a random vector \( {\varvec{G}} \) can be constructed such that \( \hbox {E}( {\varvec{G}} ) = \nabla f( {\varvec{x}} ) \), and \( \hbox {E}( \Vert {\varvec{G}} \Vert ^2 ) \le M^2 \) holds with a constant M independent of \( {\varvec{x}} \).
We proved Theorem 10 for unconstrained minimization (over \( \mathrm{I}\mathrm{R}^n \)). In our approach, the constraint \( A {\varvec{x}} \le {\varvec{b}} \) in the convex problem (1) was taken into account through a column generation scheme. Comparing the column generation scheme with the above stochastic gradient approach, a solution of the linear programming model problem (10) is analogous to the iterate averaging (34) and the projection in (31). The analogy is not complete. Having solved the linear programming model problem, we perform an approximate line search instead of a simple translation by \( -h_k G_k \). The larger effort of an individual step in the column generation scheme pays off when gradient estimation is taxing compared to function value computation.
Having pondered a reviewer comment concerning further combinations of column generation and stochastic gradient schemes, we see a high potential in this approach. Different combinations of the column generation and stochastic gradient schemes may be efficient for functions with different characteristics.
4 Handling a difficult constraint
We work out an approximation scheme for the solution of the convex constrained problem (2). This scheme consists of the solution of a sequence of problems of the form (1), with a tightening stopping tolerance.
We consider the linear constraint set \( A {\varvec{x}} \le {\varvec{b}} \) of problem (1). The last constraint of this set is \( {\varvec{a}}^r {\varvec{x}} \le b_r \), where \( {\varvec{a}}^r \) denotes the rth row of A, and \( b_r \) denotes the rth component of \( {\varvec{b}} \). Assume that this last constraint is a cost constraint, and let \( {\varvec{c}}^T = {\varvec{a}}^r \) denote the cost vector. We consider a parametric form of the cost constraint, namely, \( {\varvec{c}}^T {\varvec{x}} \le d \), where \( d \in \mathrm{I}\mathrm{R}\) is a parameter.
Let \( \chi ( d ) \) denote the optimal objective value of problem (35), as a function of the parameter d. This is obviously a monotone decreasing convex function. Let \( {{\mathcal {I}}} \subset \mathrm{I}\mathrm{R}\) denote the domain over which the function is finite. We have either \( {{\mathcal {I}}} = \mathrm{I}\mathrm{R}\) or \( {{\mathcal {I}}} = [\, {\underline{d}}, +\infty ) \) with some \( {\underline{d}} \in \mathrm{I}\mathrm{R}\). Using the notation of the unconstrained problem, we say that \( \chi ( d ) \) is the optimum of (1\(:\, b_r = d \)) for \( d \in {{\mathcal {I}}} \).
Coming to the constrained problem (2), we may assume \( \pi \in \chi ( {{\mathcal {I}}} ) \). Let \( d^{\star } \in {{\mathcal {I}}} \) be a solution of the equation \( \chi ( d ) = \pi \), and let \( l^{\star }( d ) \) denote a linear support function to \( \chi ( d ) \) at \( d^{\star } \). In this section we work under the following assumption.
Assumption 14
The support function \( l^{\star }( d ) \) has a significant negative slope, i.e., \( {l^{\star }}^{\prime } \ll 0 \).
From \( {l^{\star }}^{\prime } < 0 \), it follows that the optimal objective value of (2) is \( d^{\star } \). (This slope will be used in estimating the number of Newton steps required to reach a prescribed accuracy; see Corollary 19, below. That is why we need it to be significantly negative.)
Remark 15
Assumption 14 is reasonable if the right-hand-side value \( \pi \) has been set by an expert, on the basis of preliminary experimental information. (A near-zero slope \( {l^{\star }}^{\prime } \) means that a slight relaxation of the probabilistic constraint allows a significant cost reduction.)
We find a near-optimal \( {\widehat{d}} \in {{\mathcal {I}}} \) using an approximate version of Newton's method. The idea of regulating tolerances in such a procedure occurs in the discussion of the Constrained Newton Method in Lemaréchal et al. (1995). Based on the convergence proof of the Constrained Newton Method, a simple convergence proof of Newton's method was reconstructed in Fábián et al. (2015). We adapt the latter to the present case.
First, we describe a deterministic approximation scheme. A randomized version is worked out in Sect. 4.2.
4.1 A deterministic approximation scheme
Let Assumptions 1 and 2 hold. A sequence of unconstrained problems (1\(:\, b_r = d_{\ell } \))\(\;\; ( \ell = 1,2,\ldots ) \) is solved with increasing accuracy. Over the course of this procedure, we build a single model \( \phi _k( {\varvec{z}} ) \) of the nonlinear objective \( \phi ( {\varvec{z}} ) \), i.e., k is ever increasing. Columns added during the solution of (1\(:\, b_r = d_{\ell } \)) are retained in the model and reused in the course of the solution of (1\(:\, b_r = d_{\ell +1} \)).
Given the \( \ell \)th iterate \( d_{\ell } \in {{\mathcal {I}}} \), we need to estimate \( \chi ( d_{\ell } ) \) with a prescribed accuracy. This is done by performing a column generation scheme with the master problem (10\(:\, b_r = d_{\ell } \)). Let \( \overline{{\mathcal {B}}}_{\ell } \) denote an upper bound on the gap between the respective optima of the model problem (10\(:\, b_r = d_{\ell } \)) and the convex problem (1\(:\, b_r = d_{\ell } \)). Such a bound is constructed according to the expression (18).
Let \( d_0, d_1 \in {{\mathcal {I}}},\; d_0< d_1 < d^{\star } \) be the starting iterates. The sequence of the iterates will be strictly monotone increasing, and converging to \( d^{\star } \) from below.
4.1.1 Near-optimality condition for the constrained problem
4.1.2 Stopping condition for the unconstrained subproblem
Let \( \delta \;\; ( 0 < \delta \ll \frac{1}{2} ) \) denote a fixed tolerance. (We can set e.g., \( \delta = 0.25 \) for the whole process).
If (i) occurs then \( {\widehat{d}} := d_{\ell } \) satisfies the near-optimality condition (37), and the Newton-like procedure stops.
4.1.3 Finding successive iterates
Due to the convexity of \( \chi ( d ) \) and to Assumption 14, the linear function \( l_{\ell }( d ) \) obviously has a negative slope \( l_{\ell }^{\prime } \le {l^{\star }}^{\prime } \ll 0 \). Moreover \( l_{\ell }( d ) \le \chi ( d ) \) holds for \( d_{\ell } \le d \).
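Given the stated properties of \( l_{\ell}(d) \) (negative slope, and \( l_{\ell}(d) \le \chi(d) \) for \( d \ge d_{\ell} \)), a natural Newton-type update takes the next iterate as the root of \( l_{\ell}(d) = \pi \). The sketch below is hypothetical: the paper's exact update rule is defined by its equations, and the function and variable names here are illustrative only.

```python
def next_iterate(d_l, l_val, l_slope, pi):
    """Newton-type step: root of the support line
    l(d) = l_val + l_slope * (d - d_l) equated to the target level pi.
    Assumes l_slope < 0 and l_val > pi, so the step moves d to the right."""
    return d_l + (l_val - pi) / (-l_slope)

# Illustrative numbers: the line value 5 lies above the target pi = 3,
# the slope is -2, so d advances by (5 - 3)/2 = 1.
d_next = next_iterate(d_l=0.0, l_val=5.0, l_slope=-2.0, pi=3.0)
```

Since \( l_{\ell} \) underestimates \( \chi \) to the right of \( d_{\ell} \), such a step cannot overshoot the root of \( \chi(d) = \pi \) when the data are exact, which is consistent with the iterates converging to \( d^{\star} \) from below.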
Remark 16
In a Newton-like scheme, the selection of the starting iterates strongly affects efficiency. Let us first consider the selection of \( d_1 \). An expert familiar with the model may easily set a slightly overtight budget. In the absence of such an expert, we can resort to heuristics, evaluating \( \chi ( d ) \) at a set of test points.
Once \( d_1 \) has been set, we can consider \( d_0 \). A good choice for \( d_0 \) is one that results in a large \( d_2 \). We have to take into account the accuracy of our evaluation of \( \chi ( d_1 ) \) on the one hand, and the slope of \( \chi ( d ) \) on the other hand. A possible way of organizing the selection process is the following. First, we evaluate \( \chi ( d_1 ) \) by solving the problem (35\(: d = d_1 \)). In the course of the solution, we build a model of the objective function. This model can then be used to estimate the slope of \( \chi ( d ) \).
4.1.4 Convergence
Theorem 17
Proof
Example 18
Let \( \delta = 0.25 \); then \( \gamma = ( \frac{1}{2 ( 1 - \delta )} )^2\; < 0.5 \).
Corollary 19
Note that \( \vert {l^{\star }}^{\prime } \vert \gg 0 \) due to Assumption 14.
Given a problem, let us consider the efforts of its approximate solution as a function of the prescribed accuracy \(\epsilon \). According to (49), that is on the order of \(\; \log \frac{1}{\epsilon } \).
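With \( \delta = 0.25 \) as in Example 18, the contraction factor \( \gamma \) and the \( O(\log \frac{1}{\epsilon}) \) effort noted above can be illustrated numerically. The helper below is only a sketch: it counts pure contractions \( \gamma^N \le \epsilon \), while the paper's \( N(\epsilon) \) of Corollary 19 involves further problem-dependent constants.

```python
import math

delta = 0.25
gamma = (1.0 / (2.0 * (1.0 - delta))) ** 2   # = 4/9, comfortably below 0.5

def steps_for(eps):
    """Smallest N with gamma**N <= eps: grows like log(1/eps)."""
    return math.ceil(math.log(eps) / math.log(gamma))
```

For instance, halving the tolerance adds fewer than one additional contraction step on average, since \( \log 2 / \log(9/4) < 1 \).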
4.2 A randomized version of the approximation scheme
Concerning the function \( \chi (d) \), let Assumption 14 hold. Our aim, in principle, is the same as in the deterministic case: find \( {\widehat{d}} \in {{\mathcal {I}}} \) such that \( \pi + \epsilon \ge \chi ( {\widehat{d}} ) \ge \pi \) holds with a preset tolerance \( \epsilon \). In the present uncertain environment, however, we may have to content ourselves with \( {\widehat{d}} \) such that \( \pi + \epsilon \ge \chi ( {\widehat{d}} ) > \pi - \epsilon \) holds. This problem statement is justifiable if the function \( \chi (d) \) is not constant for \( d > d^{\star } \). Let Assumption 20, below, hold.
Assumption 20
For our stopping tolerance \( \epsilon \), there exists (an unknown) \(d^{\star }_{\epsilon } \in {{\mathcal {I}}} \) such that \(\chi (d^{\star }_{\epsilon } ) = \pi - \epsilon \).
Let \( q\; ( 0.5 \ll q < 1 ) \) denote a preset reliability. Using the randomized column generation scheme, a sequence of unconstrained problems (1\(:\, b_r = d_{\ell } \))\(\;\; ( \ell = 1,2,\ldots ) \) is solved, each with reliability q, and with an accuracy determined by the Newton-like approximation scheme. As in the deterministic case, we build a single model \( \phi _k( {\varvec{z}} ) \) of the nonlinear objective \( \phi ( {\varvec{z}} ) \), i.e., k is ever increasing. Let \( k_{\ell -1} \) denote the number of columns at the outset of the solution of problem (1\(:\, b_r = d_{\ell } \)).
Given the \( \ell \)th iterate \( d_{\ell } \in {{\mathcal {I}}} \), we estimate \( \chi ( d_{\ell } ) \) by performing a column generation scheme with the master problem (10\(:\, b_r = d_{\ell } \)). Applying the procedure of Sect. 3.1, we obtain an estimate \( \overline{{\mathcal {B}}}_{\ell } \) for the gap between the respective optima of the model problem (10\(:\, b_r = d_{\ell } \)) and the convex problem (1\(:\, b_r = d_{\ell } \)). Keeping to the setting of Example 13, we set the reliability parameter to \( q = 0.9 \), obtaining \( \hbox {P}(\, \overline{{\mathcal {B}}}_{\ell }\, \ge \, \hbox {`gap'} \,)\, \ge 0.9 \). (Note that the columns with indices up to \( k_{\ell -1} \) belong to the initial model, hence in terms of Sect. 3.1, we have \( \iota = k_{\ell -1} \).)
4.2.1 Stopping condition for the unconstrained subproblem
If condition (\(\beta \)) occurs, then we stop the Newton-like process.
If condition (\(\gamma \)) occurs, then we carry on to a new iterate \(d_{\ell +1} > d_{\ell } \), like we did in the deterministic scheme.
4.2.2 Convergence and reliability
Let the unconstrained subproblems each be solved with a reliability of \( q = 0.9 \), and let \( \delta , \gamma \) be set according to Example 18. Moreover, let us assume that the randomized Newton-like scheme did not stop in L steps. The aim of this section is to show that, provided L is large enough, an \(\epsilon \)-optimal solution of the constrained problem has been reached with a high probability.
–  In case \( d_{\ell } \le d^{\star } \): We call step \( \ell \) correct if \(\; d_{\ell +1} \le d^{\star } \;\) and \(\; 0.5 \cdot L_{\ell }( d_{\ell } ) / \vert L_{\ell }^{\prime } \vert \; \ge \; L_{\ell +1}( d_{\ell +1} ) / \vert L_{\ell +1}^{\prime } \vert \;\) also holds; otherwise we call step \( \ell \) incorrect.
–  In case \( d_{\ell } > d^{\star } \): We call step \( \ell \) correct if a backstep occurs (i.e., if \( d_{\ell +1} = d_{\ell -1} \)); otherwise we call it incorrect.
If the difference between the number of correct steps and the number of incorrect steps exceeds \( N( \epsilon ) \), then an \( \epsilon \)-optimal solution of the constrained problem has been reached, according to Corollary 19.
The difference between the number of the correct steps and the number of the incorrect steps is \(\; L  2 \sum _{\ell =1}^L Z_{\ell } \). In order to show that the difference likely exceeds \( N( \epsilon ) \), we need an upper bound on the probability that \( \sum _{\ell =1}^L Z_{\ell } \) is significantly larger than \( \hbox {E}( \sum _{\ell =1}^L Z_{\ell } ) \).
Generalized Chernoff–Hoeffding bounds were proposed by Panconesi and Srinivasan (1997). Intuitive proofs of such bounds, based on a simple combinatorial argument, were given by Impagliazzo and Kabanets (2010). (In this latter paper, concentration bounds are also explained in terms of successes of random experiments, just as in our present situation.) We are going to use a Chernoff-type bound, Theorem 1.1 in Impagliazzo and Kabanets (2010):
Theorem 21
It is easy to see that our objects satisfy the precondition (53) with \( p = 0.1 \). Indeed, it follows from the repeated application of (52). A formal proof may apply induction on n. For \( n = 1 \), we have \( \hbox {P}( Z_{1} = 1 ) \le 0.1 \). Now let us assume that (53) holds for \( 1 \le n < k \). The statement for \( n = k \) follows from (52), by setting \( {{\mathcal {I}}}_k = A \cap \{ 1, \ldots , k-1 \} \).
Proposition 22
Let the unconstrained problems each be solved with a reliability of \( q = 0.9 \); let \( \delta , \gamma \) be set according to Example 18; and let \( L = \max \{\, 22,\, 3 N( \epsilon )\, \} \) with \( N( \epsilon ) \) defined in Corollary 19.
Assume that the randomized Newton-like scheme did not stop in L steps. Then an \( \epsilon \)-optimal solution of the constrained problem has been reached with a probability of at least 0.9.
Remark 23
If case \( (\beta ) \) occurred in the stopping condition of the previous section, then further checks are needed to ensure reliability.
Remark 24
The stopping tolerance prescribed for the unconstrained subproblems is ever tightened in accordance with the progress of the Newton-like approximation scheme. However, the prescribed tolerance is never tighter than \( \delta \cdot \epsilon = 0.25 \epsilon \).
5 Adapting the approach to probabilistic problems
In this section we consider \( \phi ( {\varvec{z}} ) = \log F( {\varvec{z}} ) \) with a nondegenerate ndimensional standard normal distribution function \(F({\varvec{z}})\). Assumption 1 does not hold with such a function. However, as illustrated in Fábián et al. (2018), Assumption 1 holds over any bounded ball around the origin. (The ratio \(\frac{\alpha }{\omega } \) decreases as the radius of the ball increases.) Moreover, a construction was sketched in Fábián et al. (2018) that limits the column generation process to a ball of sufficiently large radius. That construction does not yield usable estimates for the values \( \alpha \) and \( \omega \).
When applied to probabilistic problems, we look on Corollaries 6 and 11 merely as a means of justification of the efficiency of the procedure. The gap between the respective optima of the model problem and the original probabilistic problem is measured by different means, to be described presently. In this setting we may perform just a single line search in each column generation problem.
In order to apply the procedures described in the previous sections, we need Assumption 3 to hold, with the relaxation that function values \( \phi ( {\varvec{z}} ) \) are computed with high accuracy (instead of exactly). In the present case of a probabilistic \( \phi ( {\varvec{z}} ) \), high-precision computation of \( \log F( {\varvec{z}} ) \) is impractical in points \({\varvec{z}} \) with a low \( F( {\varvec{z}} ) \). Hence we need a technical assumption that helps keep the process in a region where high-precision computation of \( \phi ( {\varvec{z}} ) \) is possible.
Assumption 25
A significantly high probability can be achieved. Specifically, a feasible point \( \check{{\varvec{z}}} \) is known such that \(F(\check{{\varvec{z}}} ) \ge 0.5 \).
By including \( \check{{\varvec{z}}} \) of Assumption 25 among the initial columns of the master problem, we always have \(F(\overline{{\varvec{z}}} ) \ge 0.5 \) with the current solution \(\overline{{\varvec{z}}} \) defined in (12). Hence \(\phi (\overline{{\varvec{z}}} ) \) can be computed with high accuracy.
We perform a single line search in each column generation subproblem, always starting from the current \( \overline{{\varvec{z}}} \). This means that a high-quality estimate can be generated for the gradient, which designates the direction of the line search. Once the direction of the search is determined, we only work with function values (there is no need for any further gradient information in the current column generation subproblem). The line search is performed with high accuracy over the region \({{\mathcal {L}}}( F,\, 0.5 ) = \{\, {\varvec{z}} \mid F( {\varvec{z}} ) \ge 0.5 \,\} \), which includes the optimal solution of the probability maximization problem (3).
We can carry on with the line search even if we have left the safe region \( {{\mathcal {L}}}( F,\, 0.5 ) \). Given a point \( \hat{{\varvec{z}}} \) along the search ray, let \( {\hat{p}} > 0 \) be such that \( {\hat{p}} \le F( \hat{{\varvec{z}}} ) \) holds almost surely. (Simulation procedures generally provide a confidence interval together with an estimate.) If the vector \( \hat{{\varvec{z}}} \) is to be included in the master problem (10) as a new column, then we set the corresponding cost coefficient as \( \phi = -\log {\hat{p}} \). Under such an arrangement, our model remains consistent, i.e., the model function \( \phi _k( {\varvec{z}} ) \) is almost surely an inner approximation of the probabilistic function \( \phi ( {\varvec{z}} ) \).
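The construction of the cost coefficient above can be illustrated with a one-sided confidence lower bound computed from a Bernoulli Monte Carlo sample. The following sketch is ours (the function name and the normal-approximation bound are assumptions, not the paper's procedure); it only shows how an almost-sure lower bound \( {\hat{p}} \) yields the cost coefficient \( -\log {\hat{p}} \):

```python
import math

def lower_conf_bound(hits, n, z=3.0):
    """One-sided normal-approximation lower confidence bound for the
    success probability of a Bernoulli sample (hits successes out of n).
    z = 3 corresponds to roughly 99.9% one-sided coverage."""
    p = hits / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return max(p - half, 1e-12)  # keep the bound positive so -log is finite

# cost coefficient of a new column z_hat, as in the text:
# phi_hat = -math.log(lower_conf_bound(hits, n))
```

Since \( {\hat{p}} \le F( \hat{{\varvec{z}}} ) \) with high probability, the resulting coefficient overestimates \( \phi ( \hat{{\varvec{z}}} ) \), which is exactly what keeps the model an inner approximation.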
5.1 A bounded formulation
Observation 26
The difference between the respective optima of problems (56) and (57) is insignificant.
Proof
Let \( {\varvec{z}} \) be a part of a feasible solution of (56), and let us consider the box \( ( {\varvec{z}} + {{\mathcal {N}}} ) \cap {{\mathcal {Z}}} \), where \( {{\mathcal {N}}} \) denotes the negative orthant. In case this box is empty, we have \( F( {\varvec{z}} ) \approx 0 \) due to the specification of \( {{\mathcal {Z}}} \). Taking into account Assumption 25, such \( {\varvec{z}} \) cannot be a part of an optimal solution of (56).
In case the box \( ( {\varvec{z}} + {{\mathcal {N}}} ) \cap {{\mathcal {Z}}} \) is not empty, let \( \varPi _{{\varvec{z}}} \) denote its 'most positive' vertex. We have \( \varPi _{{\varvec{z}}} \in {{\mathcal {Z}}},\; \varPi _{{\varvec{z}}} \le {\varvec{z}} \), and \( F( \varPi _{{\varvec{z}}} ) \approx F( {\varvec{z}} ) \). If \( F( {\varvec{z}} ) < 0.5 \), then, due to Assumption 25 again, \( {\varvec{z}} \) cannot be part of an optimal solution of (56).
In the remaining case of \( F( \varPi _{{\varvec{z}}} ) \approx F( {\varvec{z}} ) \ge 0.5 \), we have \( \phi ( \varPi _{{\varvec{z}}} ) \approx \phi ( {\varvec{z}} ) \). Moreover \( \varPi _{{\varvec{z}}} \) is a partial feasible solution of (57), due to \( \varPi _{{\varvec{z}}} \in {{\mathcal {Z}}},\; \varPi _{{\varvec{z}}} \le {\varvec{z}} \). \(\square \)
Remark 27
In a pure form of this bounded scheme, new columns would always be selected from \( {{\mathcal {Z}}} \). An obvious drawback of such a scheme is that Theorem 5 does not apply to the resulting bounded optimization problem. In Sect. 5.2 we develop a hybrid scheme, including a restriction to \({{\mathcal {Z}}}\) in the master problem, but selecting new columns by unconstrained maximization.
5.2 A hybrid form of the column generation scheme
We implemented this procedure. In our experiments reported in Sect. 7, the restriction \( {\varvec{z}}^{\prime } \in {{\mathcal {Z}}} \) was never active in any optimal solution of the master problem.
Let \( \overline{{\varvec{z}}} \in {{\mathcal {Z}}} \) denote the current primal iterate, obtained in the form (12) using an optimal solution of the model problem. Let \( \overline{{\varvec{g}}} = \nabla \phi ( \overline{{\varvec{z}}} ) \) be the corresponding gradient. Let moreover \( (\, {\overline{\vartheta }},\, \overline{{\varvec{u}}} \,) \) be part of an optimal dual solution of the current model problem. Finally, let \( \overline{{\mathcal {R}}} \) denote the gap between the respective optima of the model problem and the original probabilistic problem.
Observation 28
Proof
The following observation shows that we can use \( \overline{{\varvec{G}}} \) to estimate the maximum on the right-hand side of (60).
Observation 29
Proof
5.3 Regulating accuracy and reliability when solving an unconstrained probabilistic problem
We need Corollary 11 to ensure efficiency of a descent step in the course of column selection. Hence (20) should hold with an appropriate \( \sigma \) between the vectors \( {\varvec{g}}^{\circ } = \overline{{\varvec{g}}} - \overline{{\varvec{u}}} \) and \( {\varvec{G}}^{\circ } = \overline{{\varvec{G}}} - \overline{{\varvec{u}}} \). Specifically,
$$\begin{aligned} \hbox {E}\left( \left\| \overline{{\varvec{G}}} - \overline{{\varvec{g}}} \right\| ^2 \right) \; \le \; \sigma ^2 \left\| \overline{{\varvec{g}}} - \overline{{\varvec{u}}} \right\| ^2 \end{aligned}$$
(67)
should hold.
We need (61) to hold with appropriate parameters \( \varDelta \) and p to ensure that the bound \( \overline{{\mathcal {B}}} \) is tight and reliable. We slightly reformulate the definition of \( \overline{{\mathcal {B}}} \) in (66) as follows:
$$\begin{aligned} \Big ( \phi _k( \overline{{\varvec{z}}} ) - \phi ( \overline{{\varvec{z}}} ) \Big ) +\; \max _{{\varvec{z}} \in {{\mathcal {Z}}}}\; \Big ( ( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} ) + ( \overline{{\varvec{g}}} - \overline{{\varvec{G}}} ) \Big )^T ( {\varvec{z}} - \overline{{\varvec{z}}} )\; +\; \varDelta \cdot \mathrm {diag}( {{\mathcal {Z}}} ). \end{aligned}$$
(68)
In setting the parameters \( \sigma \) and \( \varDelta \), we aim to find a balance between the error of the polyhedral model function on the one hand, and the error of the gradient estimation on the other hand. According to Observation 4(c), \(\; \overline{{\varvec{u}}} \in \partial \phi _k( \overline{{\varvec{z}}} ) \) holds. Taking into account \(\; \overline{{\varvec{g}}} = \nabla \phi ( \overline{{\varvec{z}}} ) \), the vector \( \overline{{\varvec{u}}} - \overline{{\varvec{g}}} \) in (67) and (68) represents the gradient error of the polyhedral model function \( \phi _k( {\varvec{z}} ) \). Similarly, \( \phi _k( \overline{{\varvec{z}}} ) - \phi ( \overline{{\varvec{z}}} ) \) in (68) represents the error in function value. On the other hand, the vector \( \overline{{\varvec{G}}} - \overline{{\varvec{g}}} \) in (67) and (68) represents the error of the gradient estimate \( \overline{{\varvec{G}}} \).
A balance between these two types of error is found by a two-stage procedure. We begin by estimating the order of magnitude of \( \Vert \overline{{\varvec{u}}} - \overline{{\varvec{g}}} \Vert \), and then refine the estimate as needed.
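One plausible pilot-based realization of such a two-stage rule, matching inequality (67), is sketched below. The paper does not spell out its procedure, so the function name and the rule itself are our assumptions: from a pilot batch of gradient draws we estimate the per-draw variance and the model gradient error \( \Vert \overline{{\varvec{g}}} - \overline{{\varvec{u}}} \Vert \), then choose the sample size N so that \( \hbox {E}\Vert \overline{{\varvec{G}}} - \overline{{\varvec{g}}} \Vert ^2 \approx \mathrm{tr}\,\mathrm{Cov}/N \) falls below \( \sigma ^2 \Vert \overline{{\varvec{g}}} - \overline{{\varvec{u}}} \Vert ^2 \):

```python
import math
import numpy as np

def gradient_sample_size(pilot_grads, u_bar, sigma):
    """Pilot-based sample-size rule (a sketch, not the paper's procedure).
    pilot_grads: array of shape (m, dim), m independent gradient draws.
    Returns N such that tr(Cov)/N <= sigma^2 * ||g_hat - u_bar||^2,
    where g_hat is the pilot mean (stand-in for the true gradient g)."""
    pilot_grads = np.asarray(pilot_grads, dtype=float)
    g_hat = pilot_grads.mean(axis=0)
    per_draw_var = pilot_grads.var(axis=0, ddof=1).sum()  # estimate of tr Cov
    target = sigma ** 2 * float(np.sum((g_hat - np.asarray(u_bar)) ** 2))
    if target <= 0.0:
        raise ValueError("model gradient error is zero; tighten sigma handling")
    return max(len(pilot_grads), math.ceil(per_draw_var / target))
```

The second stage then draws the remaining samples; averaging all N draws gives a \( \overline{{\varvec{G}}} \) that satisfies (67) up to the pilot estimation error.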
6 Estimation of the multivariate normal probability distribution function values and gradients
It is known that any conditional distribution of the multivariate normal distribution is also normal. Therefore it follows from Formula (69) that multivariate normal distribution function values and their partial derivatives can be calculated by the same procedure. This is why, in this section, we list possible procedures for the estimation of multivariate probability distribution function values only.
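The reduction behind this remark is the standard identity that a partial derivative of a normal distribution function equals a univariate density value times a conditional (again normal) distribution function value. A minimal bivariate sketch, under the assumption of standard normal marginals with correlation rho (the function name is ours; Formula (69) itself is the general n-dimensional version):

```python
import math
from statistics import NormalDist

def dF_dz1(z1, z2, rho):
    """Partial derivative of the bivariate standard normal CDF F(z1, z2)
    with respect to z1: phi(z1) * P(X2 <= z2 | X1 = z1), where the
    conditional law is N(rho * z1, 1 - rho**2)."""
    nd = NormalDist()
    return nd.pdf(z1) * nd.cdf((z2 - rho * z1) / math.sqrt(1.0 - rho * rho))
```

For rho = 0 the CDF factorizes, F(z1, z2) = Phi(z1) Phi(z2), and the formula reduces to phi(z1) Phi(z2), which is easy to confirm by finite differences.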
6.1 Genz’s method
This method was published in Genz (1992). In that paper Genz dealt with the estimation of the multivariate normal probability content of a rectangle, which is a more general problem than the calculation of multivariate distribution function values.
The main idea is to transform the integration region to the unit cube \([0,1]^n\) by a sequence of elementary transformations. This comes at the expense of a slightly more complicated integrand.
The sequence begins with the Cholesky transformation, which transforms the components of the multivariate normally distributed random vector into independent random variables; however, the integration limits become more complicated. Then the integration variables are transformed further by the inverse of the one-dimensional standard normal distribution function. The effect of this transformation is that all integrands become identically one, but the integration limits become even more complicated. Finally, by a simple linear transformation, the integration region changes to the unit cube \([0,1]^n\) and the integrand functions become the differences of the earlier complicated integration limits.
We remark that the ith integrand function is always independent of the ith integration variable and can be pulled out of one integral, which allows explicit integration of the innermost integral. This way the numerical integration may be carried out on the unit cube \([0,1]^{n-1}\).
This sequence of transformations has also forced a priority ordering on the components of \({\mathbf {x}}\) which makes the problem amenable to the application of subregion adaptive algorithms. The method works best if the components are presorted so that the innermost integration has the most “weight”.
Genz describes three different methods for solving this transformed integral. The first method is based on a polynomial approximation of the integrand. For better performance, the unit cube is split into subregions which are subsequently partitioned further whenever the approximation is not accurate enough. The second method uses quasirandom integration points. Finally, the third method uses pseudorandom integration points, which results in error estimates that are statistical in nature.
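As a concrete illustration of the transformed integrand, the following sketch samples it with plain pseudorandom points, corresponding to the third variant above. The code is ours and restricted to distribution function values (lower limits at minus infinity) of a zero-mean normal vector; Genz's production codes add quasi-random points, subregion adaptivity, and variable reordering:

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()

def genz_mvn_cdf(b, cov, n_samples=20000, rng=None):
    """Estimate P(X <= b) for X ~ N(0, cov) via Genz's sequence of
    transformations: Cholesky factorization, inverse-normal substitution,
    and Monte Carlo integration over the unit cube [0,1]^(n-1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    b = np.asarray(b, dtype=float)
    n = len(b)
    C = np.linalg.cholesky(np.asarray(cov, dtype=float))
    total = 0.0
    for _ in range(n_samples):
        w = rng.random(n - 1)
        e = _nd.cdf(b[0] / C[0, 0])
        f = e                      # running product of transformed limits
        y = np.empty(n - 1)
        for i in range(1, n):
            # clip to the open unit interval to avoid inv_cdf(0) = -inf
            y[i - 1] = _nd.inv_cdf(min(max(w[i - 1] * e, 1e-12), 1.0 - 1e-12))
            t = (b[i] - C[i, :i] @ y[:i]) / C[i, i]
            e = _nd.cdf(t)
            f *= e
        total += f
    return total / n_samples
```

Note that for an independent (diagonal-covariance) vector the integrand is constant, so the estimator has zero variance; correlation is what makes the Monte Carlo error nontrivial.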
6.2 Deák’s method
This method was first published in Deák (1980) and later in Deák (1986). Its main thrust is to decompose the normal random vector into two parts, a direction and a distance from the origin. This decomposition can be used both in the generation of sample points and in the calculation of the probability content of a rectangle. It is well known that the direction is uniformly distributed on the ndimensional unit sphere, the distance from the origin has a chidistribution with n degrees of freedom and they are independent of each other.
A simple Monte Carlo method is to generate N sample points uniformly distributed on the n-dimensional unit sphere, determine the probability content of the intersection of the rectangle in question with the generated directions, and finally average these values. The probability content of the intersection can be determined simply by applying a code that calculates the distribution function of the chi-distribution. The advantage of this method is that it counts the probability content of the rectangle not in a 'point to point' way, but rather in a 'line section to line section' way.
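The simple directional estimator can be sketched as follows for the bivariate case, where the chi-distribution with 2 degrees of freedom has the closed-form CDF \(1 - e^{-r^2/2}\) (an assumption made purely to keep the sketch dependency-free; the function name is ours). Each direction contributes the probability content of the whole line through the origin, i.e., both the direction and its antipode are handled at once via a signed radius:

```python
import math
import random

def deak_mvn_cdf_2d(b, cov, n_dirs=4000, rng=None):
    """Deak-style directional estimate of P(X <= b), X ~ N(0, cov) in 2D.
    A direction u contributes P(R in [lo, hi]) where {r * C u} meets
    {z <= b}, C is the Cholesky factor, and the signed radius R has
    CDF (1 + sign(r) * (1 - exp(-r^2/2))) / 2 (chi with 2 d.o.f.)."""
    rng = random.Random(42) if rng is None else rng
    a, c01, c11 = cov[0][0], cov[0][1], cov[1][1]
    l11 = math.sqrt(a)
    l21 = c01 / l11
    l22 = math.sqrt(c11 - l21 * l21)      # 2x2 Cholesky factor by hand

    def signed_chi_cdf(r):
        if r == math.inf:
            return 1.0
        if r == -math.inf:
            return 0.0
        tail = 1.0 - math.exp(-0.5 * r * r)
        return 0.5 * (1.0 + math.copysign(tail, r))

    total = 0.0
    for _ in range(n_dirs):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        u0, u1 = math.cos(theta), math.sin(theta)
        c = (l11 * u0, l21 * u0 + l22 * u1)   # direction mapped by C
        lo, hi, feasible = -math.inf, math.inf, True
        for ci, bi in zip(c, b):
            if abs(ci) < 1e-12:
                feasible = feasible and (bi >= 0.0)
            elif ci > 0.0:
                hi = min(hi, bi / ci)
            else:
                lo = max(lo, bi / ci)
        if feasible and lo < hi:
            total += signed_chi_cdf(hi) - signed_chi_cdf(lo)
    return total / n_dirs
```

Each direction contributes an exact one-dimensional probability, which is the 'line section to line section' counting described above and the source of the variance reduction over crude Monte Carlo.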
In addition it is easy to apply some type of antithetic random variables technique to reduce the variance further. Deák devised an improvement over this scheme that is intended to distribute a large number of directions as uniformly as possible on the unit sphere.
The estimator can then be calculated jointly for the set of \(2^k {n \atopwithdelims ()k}\) directions, resulting in faster calculation and further variance reduction. The parameter k can in principle be chosen arbitrarily from the set \(\{1,2,\ldots ,n\}\), but the computational complexity increases very fast. Best results are obtained for \(k=2\) or \(k=3\).
It is easy to see that the variance of even the simplest Deák estimator is less than the variance of the crude Monte Carlo method, for a given sample size N.
We remark here that the recent paper Teng et al. (2015) on spherical Monte Carlo simulations for multivariate normal probabilities provides various related simulation schemes.
6.3 Szántai’s method
The procedure was first published in Hungarian, see Szántai (1976) and Szántai (1985). In English it was first published in Szántai (1988) and it is quoted in Sects. 6.5 and 6.6 of the book Prékopa (1995).
This procedure can be applied to any multivariate probability distribution function. The only condition is that we must be able to calculate the one- and two-dimensional marginal distribution function values. Accuracy can easily be controlled by changing the sample size. This way we can construct gradient estimates satisfying Assumption 3.
Two further random variables with expected value \({\overline{P}}\) can be defined by taking the differences between the true probability value and its second-order lower and upper Boole–Bonferroni bounds. The definitions of these bounds can be found in the book Prékopa (1995).
We can estimate the expected values of these three random variables in the same Monte Carlo simulation procedure, and so we get three different estimates for the probability value \({\overline{P}}\). If we also estimate the pairwise covariances of these estimates, it is easy to obtain a final, minimal-variance estimate. This technique is well known as regression in the simulation literature.
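The minimal-variance combination admits a closed form: for unbiased estimators of the same quantity with covariance matrix Cov, the optimal weights under the constraint that they sum to one are \( w = \mathrm{Cov}^{-1}\mathbf{1} / (\mathbf{1}^T \mathrm{Cov}^{-1}\mathbf{1}) \). A minimal sketch (function name ours):

```python
import numpy as np

def min_variance_combination(estimates, cov):
    """Combine unbiased estimators of the same quantity with the
    variance-minimizing weights subject to sum(w) = 1:
    w = Cov^{-1} 1 / (1^T Cov^{-1} 1)."""
    estimates = np.asarray(estimates, dtype=float)
    ones = np.ones(len(estimates))
    w = np.linalg.solve(np.asarray(cov, dtype=float), ones)
    w = w / (ones @ w)
    return float(w @ estimates), w
```

In practice the covariances are themselves estimated within the same simulation run, which is what the text refers to as the regression technique.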
Gassmann (1988) combined Szántai’s general algorithm and Deák’s algorithm into a hybrid algorithm. The efficiency of this algorithm was explored in Deák et al. (2002).
One can use higher than second-order Boole–Bonferroni bounds, too. This further reduces the variance of the final estimate. However, the necessary CPU time increases, which may reduce the overall efficiency of the resulting estimation. Many new bounds for the probability of the union of events have been developed in the last two decades. These bounds use not only the aggregated information of the first few binomial moments but also the individual product event probabilities that sum up to the binomial moments. The most important results of this type can be found in the papers by Hunter (1976), Worsley (1982), Tomescu (1986), Prékopa et al. (1995), Bukszár and Prékopa (2000), Bukszár and Szántai (1999), Boros and Veneziani (2002) and Mádi-Nagy and Prékopa (2004). Szántai (2000) showed that the efficiency of his variance reduction technique can be improved significantly if one uses some of the above listed bounds.
6.4 The method of Ambartzumian, Der Kiureghian, Ohanian and Sukiasian
Ambartzumian et al. (1998) proposed to use the Sequential Conditioned Importance Sampling (SCIS) algorithm for the estimation of the cumulative distribution function values of a multivariate normal distribution. This is a variance reduction algorithm which is especially effective in the case of estimating extremely small probability values. This algorithm is based on the Sequential Conditioned Sampling (SCS) technique which is the following.
It is known that the crude Monte Carlo method is ineffective for estimating very small multivariate normal distribution function values. However, Ambartzumian et al. (1998) proved that in such cases the SCS technique can easily be extended into SCIS by using an importance sampling density function (in practice a truncated univariate normal density function) at each step.
6.5 Application of the numerical integration and the variance reduction Monte Carlo simulation algorithms in our procedures for probability maximization
In the course of the procedures proposed in this paper, we often need to obtain a fixed-size confidence interval for our distribution function value estimates. This is pronounced in Assumption 3; we need it for determining gradient estimates fulfilling the inequality given in (20), and when constructing the fixed-size multidimensional confidence interval described in (61). All this can be done by applying the results of Stein (1945). This is a two-stage sampling procedure. In the first stage we take a sample of size \(n_{1}\), where \(n_{1}\) is an integer of at least 2 but otherwise arbitrary. Then in a second stage we take a sample of \(n_{2}\) elements, where \(n_{2}\) is computed on the basis of the first-stage sample. This way the total sample of size \(n_{1}+n_{2}\) yields a fixed-size interval of the required confidence level. For a summary, see Section 7.10 of the book Prékopa (1995).
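A Stein-type two-stage rule can be sketched as follows. This is our simplification: we use a normal quantile where Stein's original procedure uses a Student t quantile with \(n_{1}-1\) degrees of freedom (the two nearly coincide for moderate \(n_{1}\)), and the function name is ours:

```python
import math
from statistics import NormalDist

def stein_total_size(first_sample, half_width, conf=0.95):
    """Two-stage sample-size rule (sketch): from a first sample of size
    n1 >= 2, estimate the variance s^2 and compute the total sample size
    needed for a confidence interval of the prescribed half-width.
    Returns (total size n1 + n2, second-stage size n2)."""
    n1 = len(first_sample)
    if n1 < 2:
        raise ValueError("first-stage sample must have at least 2 elements")
    mean = sum(first_sample) / n1
    s2 = sum((x - mean) ** 2 for x in first_sample) / (n1 - 1)
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    n_total = max(n1, math.ceil(z * z * s2 / half_width ** 2))
    return n_total, n_total - n1
```

The key point is that the interval width is fixed in advance, while the total sample size is random; this is what lets the optimization procedure prescribe accuracy targets such as (20) and (61) directly.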
We believe that the above described two-stage sampling technique can be realized more easily with the variance reduction Monte Carlo simulation algorithms of Sects. 6.2, 6.3 and 6.4 than with the numerical integration algorithm of Sect. 6.1. Careful numerical testing is necessary to choose the most appropriate procedure, which may differ in the different phases of our optimization procedure.
7 A computational experiment
The aim of this experiment is to demonstrate the workability of the randomized column generation scheme of Sect. 3 in the case of probabilistic problems. Namely, we have \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \) with a nondegenerate n-dimensional standard normal distribution function \( F( {\varvec{z}} ) \).
7.1 Cash matching problem
As in the previous paper (Fábián et al. 2018), we tested our implementation on a cash matching problem with a fifteen-dimensional normal distribution. In this problem we are interested in investing a certain amount of cash on behalf of a pension fund that needs to make certain payments over the coming 15 years. This problem originates from Dentcheva et al. (2004) and Henrion (2004). The cash matching test problem had originally been formulated as cost minimization under a probabilistic constraint. We transformed the problem into probability maximization under a cost constraint.
7.2 Implementation
We used MATLAB with the IBM ILOG CPLEX (Version 12.6.3) optimization toolbox; the numerical computation of multivariate normal distribution values was performed with the QSIMVNV Matlab function implemented by Genz (1992).
Our solver is based on the implementation used in our former paper Fábián et al. (2018). In the present version we used the randomized procedure of Sect. 3. We implemented the bounding method of Sect. 5, with the hybrid form of Sect. 5.2.
The initial solution was set by the procedure described in Fábián et al. (2018). The time needed for setting the initial solution was negligible as compared to the time needed for a single iteration with the column generation scheme.
In the course of the randomized column generation scheme, we perform just a single line search in each column generation subproblem. This line search starts from the current \( \overline{{\varvec{z}}} \) vector. Gradients of the form \( \nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \) need to be estimated, as mentioned in Remark 12. This goes back to the estimation of the gradient \( \nabla F( \overline{{\varvec{z}}} ) \) of the distribution function. A component of \( \nabla F( \overline{{\varvec{z}}} ) \) is, in turn, obtained according to (69).
Accuracy in Genz's subroutine is controlled by setting the sample size. In the present simple implementation of the iterative scheme, we control accuracy in such a way that the norm of the error of the current gradient \( \nabla \phi ( \overline{{\varvec{z}}} ) - \overline{{\varvec{u}}} \) be less than one tenth of the norm of the previous gradient \( \Vert \nabla \phi ( \overline{{\varvec{z}}}_{-} ) - \overline{{\varvec{u}}}_{-} \Vert \), where the subscript \( {}_{-} \) refers to the previous iterate.
7.3 Results and observations
In accordance with the hybrid bounding form of Sect. 5.2, we did not restrict new columns \( {\varvec{z}}_i \) to the box \( {{\mathcal {Z}}} \). Still, the probability level was high in all iterates: \( F( {\varvec{z}}_i ) \ge 0.9 \) held for all columns added in the course of the column generation process. This allowed high-accuracy computation of all probabilistic function values. As mentioned in Sect. 5.2, the restriction \({\varvec{z}}^{\prime } \in {{\mathcal {Z}}} \) of (59) was never active in any optimal solution \( {\varvec{z}}^{\prime } = \overline{{\varvec{z}}} \) of the master problem.
Density function values occurring in the computation of partial derivatives (69) have always been significant. Of the 15 density function values occurring in a single gradient computation, two were always around the magnitude of \(10^{-2}\), another one around \(5\cdot 10^{-3}\), and the rest around \(10^{-3}\). In other problems, near-zero density function values may occur in (69) for many partial derivatives. For such components, the corresponding conditional distribution function need not be computed.
Our present, very simple implementation took about 2 min to perform 50 iterations on the cash-matching problem. Though this may seem long, we expect that technical improvements will substantially shorten solution times. (According to our experience, technical improvements may result in a speedup of one or two orders of magnitude.)
8 Conclusion and discussion
In this paper, we proposed a stochastic approximation procedure to minimize a function whose gradient estimation is taxing. In the course of the process, we build an inner approximating model of the objective function. To handle a difficult constraint function, we proposed a Newton-like scheme, employing a parametric form of the stochastic minimization procedure. The scheme enables the regulation of accuracy and reliability in a coordinated manner.
We adapted this approach to probabilistic problems. In comparison with the outer approximation approach widely used in probabilistic programming, we mention that the latter is difficult to implement due to noise in gradient computation. The outer approximation approach applies a direct cutting-plane method. Even a fairly accurate gradient may result in a cut cutting into the epigraph (especially in regions farther away from the current iterate). One either needs sophisticated tolerance handling to avoid cutting into the epigraph (see, e.g., Szántai 1988; Mayer 1998; Arnold et al. 2014), or else a sophisticated convex optimization method that can handle cuts cutting into the epigraph (see, e.g., de Oliveira et al. 2011; van Ackooij and Sagastizábal 2014). Yet another alternative is perpetual adjustment of existing cuts to information revealed in the course of the process; see Higle and Sen (1996).
Inner approximation of the level set \( {{\mathcal {L}}}( F, p ) = \{\, {\varvec{z}} \mid F( {\varvec{z}} ) \ge p \,\} \), an approach initiated by Prékopa (1990), results in a model that is easy to validate. The level set is approximated by means of p-efficient points. In the cone generation approach initiated by Dentcheva et al. (2000), new approximation points are found by minimization over \( {{\mathcal {L}}}( F, p ) \). As this entails a substantial computational effort, the master part of the decomposition framework should succeed with as few p-efficient points as possible. This calls for specialized solution methods like those of Dentcheva et al. (2004), Dentcheva and Martinez (2013), van Ackooij et al. (2017). An increasing level of complexity is noticeable.
In this paper we apply inner approximation of the epigraph of the probabilistic function \( \phi ( {\varvec{z}} ) = -\log F( {\varvec{z}} ) \). This approach endures noise in gradient computation without any special effort. Noisy gradient estimates may yield iterates that do not improve much on our current model. But we retain a true inner approximation of the function, provided function values are evaluated with appropriate accuracy. This inherent stability of the model enables the application of randomized methods of simple structure.
For probability maximization, we propose a stochastic approximation procedure with relatively easy generation of new test points. A probabilistic constraint function is handled in a Newton-like scheme, approximately solving a short sequence of probability maximization problems with increasing accuracy. As this scheme is built from randomized components, we provide a statistical analysis of its validity.
The proposed stochastic approximation procedure can be implemented using standard components. The master problem is conveniently solved by an off-the-shelf solver. New approximation points are found through a simple line search whose direction can be determined by standard implementations of classic Monte Carlo simulation procedures. The Newton-like scheme can be implemented through minor variations on a standard Newton method.
In the case of a probabilistic function derived from a multivariate standard normal distribution, computing a single nonzero component of a gradient vector involves an effort comparable to that of computing a function value. The variance reduction Monte Carlo simulation procedures described in Sect. 6 were successfully applied in outer approximation approaches to the solution of jointly probabilistic constrained stochastic programming problems; see Szántai (1988). We trust that they will perform as well in the inner approximation approach discussed in the present paper. An elaborate implementation and a systematic computational study will be needed to verify this. We mention that a means of alleviating the difficulty of gradient computation in the case of the multivariate normal distribution has recently been proposed by Hantoute et al. (2018).
Emerging applications of probabilistic programming afford room for different solution approaches; e.g., new models of electricity markets or traffic control, brought about by novel infocommunication technologies.
Acknowledgements
Open access funding provided by John von Neumann University (NJE). We are grateful to the Editors and the anonymous Reviewers for insightful and constructive comments that led to substantial improvements of the paper.
References
 Ambartzumian, R., Der Kiureghian, A., Ohanian, V., & Sukiasian, H. (1998). Multinormal probability by sequential conditioned importance sampling: Theory and applications. Probabilistic Engineering Mechanics, 13, 299–308.CrossRefGoogle Scholar
 Arnold, T., Henrion, R., Möller, A., & Vigerske, S. (2014). A mixedinteger stochastic nonlinear optimization problem with joint probabilistic constraints. Pacific Journal of Optimization, 10, 5–20.Google Scholar
 Benveniste, A., Métivier, M., & Priouret, P. (1993). Adaptive algorithms and stochastic approximations. New York: Springer.Google Scholar
 Birge, J., & Louveaux, F. (1997). Introduction to stochastic programming. New York: Springer.Google Scholar
 Boros, E., & Veneziani, P. (2002). Bounds of degree 3 for the probability of the union of events. Technical report, Rutgers Center for Operations Research, RUTCOR Research Report 32002.Google Scholar
 Bukszár, J., Prékopa, A. (2000). Probability bounds with cherrytrees. Technical report, Rutgers Center for Operations Research, RUTCOR Research Report 442000.Google Scholar
 Bukszár, J., & Szántai, T. (1999). Probability bounds given by hypercherrytrees. Alkalmazott Matematikai Lapok, 2, 69–85. (in Hungarian) .Google Scholar
 de Oliveira, W., & Sagastizábal, C. (2014). Level bundle methods for oracles with ondemand accuracy. Optimization Methods and Software, 29, 1180–1209.CrossRefGoogle Scholar
 de Oliveira, W., Sagastizábal, C., & Scheimberg, S. (2011). Inexact bundle methods for twostage stochastic programming. SIAM Journal on Optimization, 21, 517–544.CrossRefGoogle Scholar
 Deák, I. (1980). Three digit accurate multiple normal probabilities. Numerische Mathematik, 35, 369–380.CrossRefGoogle Scholar
 Deák, I. (1986). Computing probabilities of rectangles in case of multinormal distributions. Journal of Statistical Computation and Simulation, 26, 101–114.CrossRefGoogle Scholar
 Deák, I., Gassmann, H., & Szántai, T. (2002). Computing multivariate normal probabilities: A new look. Journal of Statistical Computation and Simulation, 11, 920–949.Google Scholar
 Dentcheva, D., Lai, B., & Ruszczyński, A. (2004). Dual methods for probabilistic optimization problems. Mathematical Methods of Operations Research, 60, 331–346.CrossRefGoogle Scholar
 Dentcheva, D., & Martinez, G. (2013). Regularization methods for optimization problems with probabilistic constraints. Mathematical Programming, 138, 223–251.CrossRefGoogle Scholar
 Dentcheva, D., Prékopa, A., & Ruszczyński, A. (2000). Concavity and efficient points of discrete distributions in probabilistic programming. Mathematical Programming, 89, 55–77.CrossRefGoogle Scholar
 Ermoliev, Y. (1969). On the stochastic quasigradient method and stochastic quasiFeyer sequences. Cybernetics, 5, 208–220.CrossRefGoogle Scholar
 Ermoliev, Y. (1983). Stochastic quasigradient methods and their application to system optimization. Stochastics, 9, 1–36.CrossRefGoogle Scholar
 Fábián, C., & Szántai, T. (2017). A randomized method for smooth convexminimization, motivated by probability maximization. Technical report, OptimizationOnline, March 2017.Google Scholar
 Fábián, C., Csizmás, E., Drenyovszki, R., van Ackooij, W., Vajnai, T., Kovács, L., et al. (2018). Probability maximization by inner approximation. Acta Polytechnica Hungarica, 15, 105–125.Google Scholar
 Fábián, C., Eretnek, K., & Papp, O. (2015). A regularized simplex method. Central European Journal of Operations Research, 23, 877–898.CrossRefGoogle Scholar
 Frangioni, A. (2018). Standard bundle methods: Untrusted models and duality. Technical reports, Department of Informatics, University of Pisa, Italy. http://eprints.adm.unipi.it/2378/1/StandardBundle.pdf. Accessed August 26, 2018
 Frangioni, A. (2002). Generalized bundle methods. SIAM Journal on Optimization, 13, 117–156.CrossRefGoogle Scholar
 Gaivoronski, A. (1978). Nonstationary stochastic programming problems. Kybernetika, 4, 89–92.Google Scholar
 Gassmann, H. (1988). Conditional probability and conditional expectation of a random vector. In Y. Ermoliev & R. B. Wets (Eds.), Numerical techniques for stochastic optimization (pp. 237–254). Berlin: Springer.CrossRefGoogle Scholar
 Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1, 141–150.Google Scholar
 Hantoute, A., Henrion, R., PérezAros, P. (2018). Subdifferential characterization of probability functions under Gaussian distribution. Mathematical Programming. https://doi.org/10.1007/s1010701812379
 Henrion, R. (2004). Introduction to chance constraint programming. Technical report, WeierstrassInstitut für Angewandte Analysis und Stochastik. www.wiasberlin.de/people/henrion/ccp.ps
 Higle, J., Sen, S. (1996). Stochastic decomposition: A statistical method for large scale stochastic linear programming. In: Nonconvex optimization and its applications vol. 8. Springer.Google Scholar
 Hunter, D. (1976). Bounds for the probability of a union. Journal of Applied Probbility, 13, 597–603.CrossRefGoogle Scholar
 Impagliazzo, R., & Kabanets, V. (2010). Constructive proofs of concentration bounds. In M. Serna, R. Shaltiel, K. Jansen, & J. Rolim (Eds.), Approximation, randomization, and combinatorial optimization. Algorithms and techniques, RANDOM 2010, APPROX 2010. Lecture Notes in Computer Science (Vol. 6302, pp. 617–631). Berlin: Springer.
 Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133, 365–397.
 Lemaréchal, C., Nemirovski, A., & Nesterov, Y. (1995). New variants of bundle methods. Mathematical Programming, 69, 111–147.
 Luedtke, J., Ahmed, S., & Nemhauser, G. (2010). An integer programming approach for linear programs with probabilistic constraints. Mathematical Programming, 122, 247–272.
 Luenberger, D., & Ye, Y. (2008). Linear and nonlinear programming. International series in operations research and management science. Springer.
 Mádi-Nagy, G., & Prékopa, A. (2004). On multivariate discrete moment problems and their applications to bounding expectations and probabilities. Mathematics of Operations Research, 29, 229–258.
 Mayer, J. (1998). Stochastic linear programming algorithms: A comparison based on a model management system. Philadelphia: Gordon and Breach Science Publishers.
 Nemirovski, A., & Yudin, D. (1978). On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Mathematics Doklady, 19.
 Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19, 1574–1609.
 Nemirovski, A., & Yudin, D. (1983). Problem complexity and method efficiency in optimization. Wiley-Interscience series in discrete mathematics (Vol. 15). New York: Wiley.
 Nesterov, Y. (1983). A method for unconstrained convex minimization with the rate of convergence of \(O(1/k^2)\). Doklady AN SSSR, 269, 543–547.
 Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120, 221–259.
 Nesterov, Y., & Vial, J. P. (2008). Confidence level solutions for stochastic programming. Automatica, 44, 1559–1568.
 Panconesi, A., & Srinivasan, A. (1997). Randomized distributed edge coloring via an extension of the Chernoff–Hoeffding bounds. SIAM Journal on Computing, 26, 350–368.
 Pflug, G. (1988). Stepsize rules, stopping times and their implementation in stochastic quasi-gradient algorithms. In Y. Ermoliev & R. Wets (Eds.), Numerical techniques for stochastic optimization (pp. 353–372). Berlin: Springer.
 Pflug, G. (1996). Optimization of stochastic models. The interface between simulation and optimization. Boston: Kluwer.
 Polyak, B. (1990). New stochastic approximation type procedures. Avtomatika i Telemekhanika, 7, 98–107.
 Polyak, B., & Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30, 838–855.
 Prékopa, A., Vizvári, B., & Regős, G. (1995). Lower and upper bounds on probabilities of Boolean functions of events. Technical report, Rutgers Center for Operations Research, RUTCOR Research Report 36-95.
 Prékopa, A. (1990). Dual method for a one-stage stochastic programming problem with random RHS obeying a discrete probability distribution. ZOR: Methods and Models of Operations Research, 34, 441–461.
 Prékopa, A. (1995). Stochastic programming. Dordrecht: Kluwer Academic Publishers.
 Prékopa, A., Vizvári, B., & Badics, T. (1998). Programming under probabilistic constraint with discrete random variable. In F. Giannesi, T. Rapcsák, & S. Komlósi (Eds.), New trends in mathematical programming (pp. 235–255). Dordrecht: Kluwer.
 Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
 Rockafellar, R. (1970). Convex analysis. Princeton: Princeton University Press.
 Ruszczyński, A., & Syski, W. (1986). A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems. In A. Prékopa & R. Wets (Eds.), Stochastic programming 84 Part II, Mathematical Programming Studies (Vol. 28, pp. 113–131). Berlin: Springer.
 Ruszczyński, A. (2006). Nonlinear optimization. Princeton: Princeton University Press.
 Stein, C. (1945). A two-sample test for a linear hypothesis whose power is independent of the variance. Annals of Mathematical Statistics, 16, 243–258.
 Szántai, T. (1985). Numerical evaluation of probabilities concerning multidimensional probability distributions. Thesis, Hungarian Academy of Sciences, Budapest.
 Szántai, T. (1976). A procedure for determination of the multivariate normal probability distribution function and its gradient values. Alkalmazott Matematikai Lapok, 2, 27–39 (in Hungarian).
 Szántai, T. (1988). A computer code for solution of probabilistic-constrained stochastic programming problems. In Y. Ermoliev & R. B. Wets (Eds.), Numerical techniques for stochastic optimization (pp. 229–235). Berlin: Springer.
 Szántai, T. (2000). Improved bounds and simulation procedures on the value of the multivariate normal probability distribution function. Annals of Operations Research, 100, 85–101.
 Szász, P. (1951). Elements of differential and integral calculus. Budapest: Közoktatásügyi Kiadóvállalat (in Hungarian).
 Teng, H. W., Kang, M. H., & Fuh, C. D. (2015). On spherical Monte Carlo simulations for multivariate normal probabilities. Advances in Applied Probability, 47, 817–836.
 Tomescu, I. (1986). Hypertrees and Bonferroni inequalities. Journal of Combinatorial Theory, Series B, 41, 209–217.
 Uryasev, S. (1988). Adaptive stochastic quasi-gradient methods. In Y. Ermoliev & R. Wets (Eds.), Numerical techniques for stochastic optimization (pp. 373–384). Berlin: Springer.
 van Ackooij, W., Berge, V., de Oliveira, W., & Sagastizábal, C. (2017). Probabilistic optimization via approximate p-efficient points and bundle methods. Computers & Operations Research, 77, 177–193.
 van Ackooij, W., & Sagastizábal, C. (2014). Constrained bundle methods for upper inexact oracles with application to joint chance constrained energy problems. SIAM Journal on Optimization, 24, 733–765.
 Worsley, K. (1982). An improved Bonferroni inequality and applications. Biometrika, 69, 297–302.