
1 Introduction

Intelligent agents are usually faced with the task of optimizing some utility function \(\mathbf {U}\) that is a priori unknown and can only be evaluated sample-wise. We do not restrict ourselves to the form of this function; in principle it could be a classification or regression loss, a reward function in a reinforcement learning environment, or any other utility function. The framework of information-theoretic bounded rationality [16, 17] and related information-theoretic models [3, 14, 20, 21, 23] provide a formal framework for agents that behave in a computationally restricted manner, where resource limitations are modeled as information-theoretic constraints. Such limitations also lead to the emergence of hierarchies and abstractions [5], which can be exploited to reduce computational and search effort. Recently, the main principles have been successfully applied to spiking and artificial neural networks, in particular to feedforward neural network learning problems, where the information-theoretic constraint was mainly employed as a form of regularization [7, 11, 12, 18]. In this work we introduce bounded rational decision-making with adaptive generative neural network priors. We investigate the interaction between anytime sample-based decision-making processes and the concurrent improvement of prior policies through learning, where the prior policies are parameterized as Variational Autoencoders [10], a recently proposed generative neural network model.

The paper is structured as follows. In Sect. 2 we discuss the basic concepts of information-theoretic bounded rationality, sample-based interpretations of bounded rationality in the context of Markov Chain Monte Carlo (MCMC), and the basic concepts of Variational Autoencoders. In Sect. 3 we present the proposed decision-making model, combining sample-based decision-making with concurrent learning of priors parameterized by Variational Autoencoders. In Sect. 4 we evaluate the model on toy examples. In Sect. 5 we discuss our results.

2 Methods

2.1 Bounded Rational Decision Making

The foundational concept in decision-making theory is Maximum Expected Utility [22], whereby an agent is modeled as choosing actions such that it maximizes its expected utility

$$\begin{aligned} \max _{p(a|w)} \sum _w \rho (w) \sum _{a}{p(a|w)\mathbf {U}(w, a)}, \end{aligned}$$
(1)

where a is an action from the action space A, w is a world state from the world state space W, and \(\mathbf {U}(w,a)\) is a utility function. We assume that the world states are distributed according to a known and fixed distribution \(\rho (w)\) and that the world states w are finite and discrete. In the case of a single world state or a world state distribution \(\rho (w)=\delta (w-w_0)\), the decision-making problem simplifies to a single function optimization problem \(a^* = {{\mathrm{arg\,max}}}_a \mathbf {U}(a)\). In many cases, solving such optimization problems may require an exhaustive search, where simple enumeration is extremely expensive.
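As a concrete illustration of this reduction, the following minimal sketch finds \(a^*\) by brute-force enumeration over a discretized one-dimensional action space; the toy utility function and the discretization are illustrative assumptions, and the cost of enumeration grows quickly with finer discretizations and higher-dimensional action spaces.

```python
# Minimal sketch: maximum expected utility with a single world state reduces to
# a brute-force argmax over actions. The toy utility and grid are illustrative.
import numpy as np

utility = lambda a: -(a - 0.42) ** 2            # toy utility with optimum at a = 0.42
actions = np.linspace(0.0, 1.0, 10001)          # fine grids make enumeration expensive
a_star = actions[np.argmax(utility(actions))]   # exhaustive search over all candidates
print(a_star)
```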

A bounded rational decision-maker tackles the above decision-making problem by settling for a good enough solution. Finding a bounded optimal policy requires maximizing the utility function while remaining within given constraints. The resulting policy is a conditional probability distribution p(a|w), which describes how to choose an action a given a particular world state w. The constraint of limited information-processing resources can be formalized by setting an upper bound on the \({{\mathrm{\text {D}_\text {KL}}}}\) (say B bits) that the decision-maker is maximally allowed to spend to transform its prior strategy into a posterior strategy through deliberation. This results in the following constrained optimization problem [5]:

$$\begin{aligned} \max _{p(a|w)} \sum _w \rho (w) \sum _{a}{p(a|w)\mathbf {U}(w, a)}, \text { s.t. } {{\mathrm{\text {D}_\text {KL}}}}(p(a|w)||p(a)) \le \text {B}. \end{aligned}$$
(2)

This constrained optimization problem can be formulated as an unconstrained problem [16]:

$$\begin{aligned} \max _{p(a|w)} \left( \sum _w \rho (w) \sum _{a}{p(a|w)\mathbf {U}(w, a) - \frac{1}{\beta }{{\mathrm{\text {D}_\text {KL}}}}(p(a|w)||p(a))} \right) , \end{aligned}$$
(3)

where the inverse temperature \(\beta \in \mathbb {R}^+\) is a Lagrange multiplier that governs the trade-off between expected utility gain and information cost. For \(\beta \rightarrow \infty \) the agent behaves perfectly rationally and for \(\beta \rightarrow 0\) the agent can only act according to the prior policy. The optimal prior policy in this case is given by the marginal \(p(a) = \sum _{w \in W}{\rho (w) p(a|w)}\) [5], in which case the Kullback-Leibler divergence becomes equal to the mutual information, i.e. \({{\mathrm{\text {D}_\text {KL}}}}(p(a|w)||p(a))=I(W;A)\). The solution to the optimization problem (3) can be found by iterating the following set of self-consistent equations [5]:

$$\begin{aligned} {\left\{ \begin{array}{rcl} p(a|w) &{}=&{} \frac{1}{Z(w)}p(a) \exp (\beta \mathbf {U}(w,a)) \\ p(a) &{}=&{} \sum _w \rho (w) p(a|w), \end{array}\right. } \end{aligned}$$

where \(Z(w) = \sum _a p(a) \exp (\beta \mathbf {U}(w,a)) \) is the normalization factor. Computing such a normalization factor is usually computationally expensive, as it involves summing over spaces with high cardinality. We avoid this by using a Monte Carlo approximation.
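In the discrete case, the self-consistent equations can be solved numerically by alternating the two updates until convergence, in the spirit of Blahut-Arimoto iterations. The following NumPy sketch illustrates this; the utility matrix, the world-state distribution and the value of \(\beta\) are arbitrary placeholders.

```python
# Sketch: fixed-point iteration of the self-consistent equations for p(a|w) and
# p(a) in the discrete case. Utilities, sizes and beta are illustrative.
import numpy as np

def bounded_rational_policy(U, rho, beta, n_iters=200):
    """U: (n_worlds, n_actions) utility matrix, rho: (n_worlds,) world-state
    distribution, beta: inverse temperature. Returns p(a|w) and p(a)."""
    n_w, n_a = U.shape
    p_a = np.full(n_a, 1.0 / n_a)                       # initial prior p(a)
    for _ in range(n_iters):
        # p(a|w) proportional to p(a) exp(beta * U(w, a)), normalized per world state
        logits = np.log(p_a)[None, :] + beta * U
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        p_a_given_w = np.exp(logits)
        p_a_given_w /= p_a_given_w.sum(axis=1, keepdims=True)
        # p(a) = sum_w rho(w) p(a|w)
        p_a = rho @ p_a_given_w
    return p_a_given_w, p_a

# Illustrative example with 3 world states and 5 actions.
rng = np.random.default_rng(0)
U = rng.random((3, 5))
rho = np.array([0.5, 0.3, 0.2])
p_a_given_w, p_a = bounded_rational_policy(U, rho, beta=5.0)
```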

2.2 MCMC as Sample-Based Bounded Rational Decision-Making

Monte Carlo methods are mostly used to solve two related kinds of problems. One is to generate samples x from a given distribution q(x) and the other is to estimate the expectation of a function. For example, if g(x) is a function for which we need to compute the expectation \(\varPhi = {{\mathrm{\mathbb {E}}}}_{q(x)}[g(x)]\) we can draw N samples \(\{x_i\}^N_{i=1}\) to obtain the estimate \(\hat{\varPhi } = \frac{1}{N} \sum _{i=1}^N{g(x_i)}\) [15]. Samples can be drawn by employing Markov Chains to simulate stochastic processes. A Markov Chain can be defined by an initial probability \(p^0(x)\) and a transition probability \(\mathbf T (x', x)\), which gives the probability of transitioning from state x to \(x'\). The probability of being in state \(x'\) at the (\(t+1)\)-th iteration is given by:

$$\begin{aligned} p^{t+1}(x') = \sum _x \mathbf T (x', x)\, p^t(x). \end{aligned}$$
(4)

Such a chain can be used to generate sample proposals from a desired target distribution q(x) if the following prerequisites are met [15]. Firstly, the chain must be ergodic, i.e. it must converge to q(x) independently of the initial distribution \(p^0(x)\). Secondly, the desired distribution must be an invariant distribution of the chain. A distribution q(x) is invariant under \(\mathbf T (x', x)\) if its probability vector is an eigenvector of the transition probability matrix with eigenvalue 1. A sufficient, but not necessary condition for this requirement is detailed balance, i.e. the probability flow from state x to \(x'\) equals the flow from \(x'\) to x: \(q(x)\mathbf T (x',x) = q(x')\mathbf T (x,x')\).
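The following small NumPy sketch illustrates these conditions for an arbitrary two-state target distribution: it constructs a Metropolis transition matrix and checks invariance and detailed balance numerically.

```python
# Sketch: invariance and detailed balance for a small discrete Metropolis chain.
# The target distribution q is an arbitrary two-state example.
import numpy as np

q = np.array([0.3, 0.7])                       # desired target distribution
T = np.zeros((2, 2))                           # T[x_new, x]: probability of x -> x_new
for x in range(2):
    x_new = 1 - x
    T[x_new, x] = 0.5 * min(1.0, q[x_new] / q[x])   # propose other state, then accept
    T[x, x] = 1.0 - T[x_new, x]                     # otherwise stay

# Invariance: q must be a fixed point of the chain, i.e. an eigenvector of T
# with eigenvalue 1.
assert np.allclose(T @ q, q)
# Detailed balance: q(x) T(x'|x) == q(x') T(x|x') for the two states.
assert np.allclose(q[0] * T[1, 0], q[1] * T[0, 1])
```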

An MCMC chain can be viewed as a bounded rational decision-making process for a single context w in the sense that it performs an anytime optimization of a utility function \(\mathbf {U}(a)\) with some precision \(\gamma \) and that it is initialized with a prior p(a). In this case the target distribution has to be chosen as \(q(a)\propto e^{\gamma \mathbf {U}(a)}\). A decision is made with the last sample when the chain is stopped. The resource then corresponds to the number of steps the chain has taken to evaluate the function \(\mathbf {U}(a)\). To find the transition probabilities \(\mathbf T (x',x)\) of the chain, we assume detailed balance and a Metropolis-Hastings scheme \(\mathbf T (x',x)=g(x'|x) A(x'|x)\) such that

$$\begin{aligned} \frac{\mathbf{T }(x',x)}{\mathbf{T }(x,x')}=\frac{g(x'|x) A(x'|x)}{g(x|x') A(x|x')}=e^{\gamma \left( \mathbf {U}(x')-\mathbf {U}(x)\right) } \end{aligned}$$
(5)

with a proposal distribution \(g(x'|x)\) and an acceptance probability \(A(x'|x)\). One common choice that satisfies Eq. (5) is

$$\begin{aligned} A(x'|x) = \min \left\{ 1, \frac{g(x'|x)}{g(x|x')}e^{\gamma \left( \mathbf {U}(x')- \mathbf {U}(x)\right) }\right\} , \end{aligned}$$
(6)

which can be further simplified when using a symmetric proposal distribution with \(g(x'|x)=g(x|x')\), resulting in \(A(x'|x) = \min \left\{ 1, e^{\gamma \left( \mathbf {U}(x')-\mathbf {U}(x)\right) }\right\} \).

Note that the decision of the chain will in general follow a non-equilibrium distribution, but that we can use the bounded rational optimum as a normative baseline to quantify how efficiently resources are used by analyzing how closely the bounded rational equilibrium is approximated.
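The sketch below illustrates such an anytime decision-maker on a one-dimensional action space: a Metropolis-Hastings chain with a symmetric Gaussian proposal is stopped after a fixed number of utility evaluations, the last sample is taken as the decision, and the empirical decision distribution is compared against the discretized equilibrium \(q(a)\propto e^{\gamma \mathbf {U}(a)}\). The utility function, the uniform prior and all parameter values are illustrative assumptions.

```python
# Sketch: an anytime MCMC decision-maker with a symmetric Gaussian proposal.
# Utility function, prior and parameters are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)

def utility(a):
    return -(a - 0.7) ** 2                      # toy utility with optimum at a = 0.7

def mh_decision(n_steps, gamma, proposal_std=0.1):
    """Run an MH chain for n_steps and return the last sample as the decision."""
    a = rng.uniform(0.0, 1.0)                   # initialize with a sample from a uniform prior
    for _ in range(n_steps):
        a_prop = np.clip(a + rng.normal(0.0, proposal_std), 0.0, 1.0)
        if rng.random() < min(1.0, np.exp(gamma * (utility(a_prop) - utility(a)))):
            a = a_prop                          # accept the proposal
    return a

# Empirical decision distribution after a fixed number of evaluation steps ...
decisions = np.array([mh_decision(n_steps=20, gamma=50.0) for _ in range(5000)])
hist, edges = np.histogram(decisions, bins=50, range=(0.0, 1.0))
p_emp = hist / hist.sum()

# ... compared against the discretized equilibrium q(a) ~ exp(gamma * U(a)).
centers = 0.5 * (edges[:-1] + edges[1:])
q_eq = np.exp(50.0 * utility(centers))
q_eq /= q_eq.sum()
mask = p_emp > 0
dkl = np.sum(p_emp[mask] * np.log(p_emp[mask] / q_eq[mask]))
print(f"D_KL(empirical || equilibrium) ~ {dkl:.3f}")
```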

2.3 Representing Prior Strategies with Variational Autoencoders

While an anytime optimization process such as MCMC can be regarded as a transformation from prior to posterior, the question remains how to choose the prior. While the prior may be assumed to be fixed, it would be far more efficient if the prior itself were subjected to an optimization process that minimizes the overall information-processing costs. Since in the case of multiple world states w the optimal prior is given by the marginal \(p(a)=\sum _w \rho (w)p(a|w)\), we can use the outputs a of the anytime decision-making process to train a generative model of the prior p(a). If the generative model were chosen from a parametric family such as a Gaussian distribution, training would consist in updating the mean and variance of the Gaussian. Choosing such a parametric family imposes restrictions on the shape of the prior, in particular in the continuous domain. Therefore, we investigate non-parametric generative models of the prior, in particular neural network models such as Variational Autoencoders (VAEs).

VAEs were introduced by [10] as generative models that use a similar architecture to deterministic autoencoder networks. Their functioning is best understood as variational Bayesian inference in a latent variable model \(p(x\vert z,\theta )\) with prior p(z), where x is observable data and z is a latent variable that explains the data but cannot be observed directly. The aim is to find a parameter \(\hat{\theta }_{ML}\) that maximizes the likelihood of the data \(p(x|\theta ) = \int p(x\vert z,\theta )p(z)dz\). Samples from \(p(x|\theta )\) can then be generated by first sampling z and then sampling x from \(p(x|z,\theta )\). As the maximum likelihood optimization may prove difficult due to the integral, we can express the likelihood in a different form by assuming a distribution \(q(z|x,\eta )\) such that

$$\begin{aligned} \log p(x|\theta ) &= \int q(z|x,\eta ) \log \frac{p(x|z,\theta )p(z)}{q(z|x,\eta )} \mathop {dz} + \underbrace{\int q(z|x,\eta ) \log \frac{q(z|x,\eta )}{p(z|x,\theta )}\mathop {dz}}_{={{\mathrm{\text {D}_\text {KL}}}}(q||p) \ge 0} \nonumber \\ &\ge \int q(z|x,\eta ) \log \frac{p(x|z,\theta )p(z)}{q(z|x,\eta )} \mathop {dz} =:\mathrm {F}(\theta ,\eta ). \end{aligned}$$
(7)

Assuming that the distribution \(q(z|x,\eta )\) is expressive enough to approximate the true posterior \(p(z|x,\theta )\) reasonably well, we can neglect the \({{\mathrm{\text {D}_\text {KL}}}}\) between the two distributions, and directly optimize the lower bound \(\mathrm {F}(\theta ,\eta )\) through gradient descent. In VAEs \(q(z|x,\eta )\) is called the encoder that translates from x to z and \(p(x|z,\theta )\) is called the decoder that translates from z to x. Both distributions and the prior p(z) are assumed to be Gaussian

$$\begin{aligned} p(x|z,\theta ) &= \mathcal {N}\left( x\vert \mu _\theta (z), \sigma ^2 \mathbb {I} \right) \\ q(z|x,\eta ) &= \mathcal {N}\left( z\vert \mu _\eta (x), \varSigma _\eta (x) \right) \\ p(z) &= \mathcal {N}(z|0,\mathbb {I}), \end{aligned}$$

where \(\mu _\theta (z)\), \(\mu _\eta (x)\) and \(\varSigma _\eta (x)\) are non-linear functions implemented by feed-forward neural networks and where it is ensured that \(\sigma ^2 \searrow 0\) and that \(\varSigma _\eta (x)\) is a covariance matrix.

Note that the optimization of the autoencoder itself can also be viewed as a bounded rational choice

$$\begin{aligned} \max _{\theta ,\eta }\Bigg ( \mathbb {E}_{q(z|x,\eta )}\left[ \log {p(x\vert z,\theta )}\right] - {{\mathrm{\text {D}_\text {KL}}}}\left( q(z\vert x,\eta )\vert \vert p(z)\right) \Bigg ), \end{aligned}$$
(8)

where the expected likelihood is maximized while the encoder distribution \(q(z\vert x,\eta )\) is kept close to the prior p(z).

Fig. 1.

For each incoming world state w our model samples a prior indexed by \(x_i \thicksim p(x\vert w)\). Each prior \(p(a\vert x)\) is represented by a VAE. To arrive at the posterior policy \(p(a \vert w,x)\), an anytime MCMC optimization is seeded with \(a_0 \thicksim p(a\vert x)\) to generate a sample from \(p(a \vert w,x)\). The prior selection policy is also implemented by an MCMC chain and selects agents that have achieved high utility on a particular w.

3 Modeling Bounded Rationality with Adaptive Neural Network Priors

In this section we combine MCMC anytime decision-processes with adaptive autoencoder priors. In the case of a single world state, the combination is straightforward: each decision selected by the MCMC process is fed as an observable input to an autoencoder, and the updated autoencoder is then used as an improved prior to initialize the next MCMC decision. In the case of multiple world states, there are two straightforward scenarios. In the first scenario there are as many priors as world states and each of them is updated independently; for each world state we obtain exactly the same solution as in the single world state case. In the second scenario there is only a single prior over actions for all world states. In this case the autoencoder is trained with the decisions of all MCMC chains, such that it should converge to the optimal rate-distortion prior. A third, more interesting scenario occurs when we allow multiple priors, but fewer priors than world states (compare Fig. 1). This is especially plausible when dealing with continuous world states, but also in the case of large discrete spaces.

3.1 Decision Making with Multiple Priors

Decision-making with multiple priors can be regarded as a multi-agent decision-making problem where several bounded rational decision-makers are combined into a single decision-making process [5]. In our case the most suitable arrangement of decision-makers is a two-step process where each world state is first assigned probabilistically to a prior, which is then used in the second step to initialize an MCMC chain (compare Fig. 1). The output of that chain is then used to train the autoencoder corresponding to the selected prior. As each prior may be responsible for multiple world states, each prior will learn an abstraction that is specialized for this subspace of world states. This two-stage decision process can be formalized as a bounded rational optimization problem

$$\begin{aligned} \max _{p(a|w,x), p(x|w)} \left( \mathbb {E}_{p(a\vert w,x)}[\mathbf {U}(w,a)] - \frac{1}{\beta _1}I(W;X) - \frac{1}{\beta _2}I(W;A|X) \right) , \end{aligned}$$
(9)

where p(x|w) selects the responsible prior p(a|x), indexed by x, for world state w. The resource parameter of the first selection stage is given by \(\beta _1\), and that of the second decision, made by the MCMC process, by \(\beta _2\). The solution of optimization (9) is given by the following set of equations:

$$\begin{aligned} {\left\{ \begin{array}{rcl} p(x|w) &{}=&{} \frac{1}{Z(w)}p(x) \exp (\beta _1 \varDelta {{\mathrm{\text {F}_{\text {par}}}}}(w,x)) \\ p(x) &{}=&{} \sum \nolimits _w \rho (w) p(x|w) \\ p(a|w,x) &{}=&{} \frac{1}{Z(w,x)} p(a|x) \exp (\beta _2 \mathbf {U}(w,a)) \\ p(a|x) &{}=&{} \sum \nolimits _w p(w|x)p(a|w,x) \\ \varDelta {{\mathrm{\text {F}_{\text {par}}}}}(w,x) &{}=&{} \mathbb {E}_{p(a|w,x)}[\mathbf {U}(w,a)] - \frac{1}{\beta _2}{{\mathrm{\text {D}_\text {KL}}}}(p(a|w,x)\vert \vert p(a|x)), \end{array}\right. } \end{aligned}$$
(10)

where Z(w) and Z(w, x) are normalization factors and \(\varDelta {{\mathrm{\text {F}_{\text {par}}}}}(w,x)\) is the free energy of the action selection stage. The marginal distribution p(a|x) encapsulates an action selection prior consisting of the posterior policies p(a|w, x) weighted by the responsibilities given by the Bayesian posterior p(w|x). Note that the Bayesian posterior is not determined by a given likelihood model, but is the result of the optimization process (9).
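For reference, Eq. (10) can be solved numerically in the discrete case by fixed-point iteration, analogous to the single-stage case; a solution of this kind serves as the normative baseline in Sect. 4. The following NumPy sketch is illustrative only, with arbitrary utilities, a discretized action space and placeholder values for \(\beta_1\) and \(\beta_2\).

```python
# Sketch: fixed-point iteration of the two-stage solution, Eq. (10), in the
# discrete case. Utilities, sizes and the two beta values are illustrative.
import numpy as np

def two_stage_solution(U, rho, beta1, beta2, n_priors, n_iters=300):
    """U: (n_worlds, n_actions) utilities, rho: (n_worlds,) world distribution.
    Returns p(x|w), p(a|x) and p(a|w,x)."""
    n_w, n_a = U.shape
    rng = np.random.default_rng(0)
    p_x_given_w = rng.dirichlet(np.ones(n_priors), size=n_w)    # p(x|w)
    p_a_given_x = np.full((n_priors, n_a), 1.0 / n_a)           # p(a|x)
    for _ in range(n_iters):
        # p(a|w,x) proportional to p(a|x) exp(beta2 * U(w,a))
        logits = np.log(p_a_given_x)[None, :, :] + beta2 * U[:, None, :]
        p_a_given_wx = np.exp(logits - logits.max(axis=2, keepdims=True))
        p_a_given_wx /= p_a_given_wx.sum(axis=2, keepdims=True)
        # Free energy of the action stage: E[U] - (1/beta2) D_KL(p(a|w,x) || p(a|x))
        eu = np.einsum('wxa,wa->wx', p_a_given_wx, U)
        dkl = np.einsum('wxa->wx', p_a_given_wx *
                        (np.log(p_a_given_wx) - np.log(p_a_given_x)[None, :, :]))
        delta_f = eu - dkl / beta2
        # p(x|w) proportional to p(x) exp(beta1 * delta_F(w,x)), with p(x) = sum_w rho(w) p(x|w)
        p_x = rho @ p_x_given_w
        logits_x = np.log(p_x)[None, :] + beta1 * delta_f
        p_x_given_w = np.exp(logits_x - logits_x.max(axis=1, keepdims=True))
        p_x_given_w /= p_x_given_w.sum(axis=1, keepdims=True)
        # p(a|x) = sum_w p(w|x) p(a|w,x), with Bayesian posterior p(w|x)
        p_w_given_x = (rho[:, None] * p_x_given_w) / p_x[None, :]
        p_a_given_x = np.einsum('wx,wxa->xa', p_w_given_x, p_a_given_wx)
    return p_x_given_w, p_a_given_x, p_a_given_wx

# Illustrative example: 6 world states, 100 discretized actions, 3 priors.
U = np.random.default_rng(2).random((6, 100))
rho = np.full(6, 1.0 / 6.0)
p_x_given_w, p_a_given_x, p_a_given_wx = two_stage_solution(U, rho, 2.0, 10.0, 3)
```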

3.2 Model Architecture

Equation (10) describes abstractly how a two-step decision process with bounded rational decision-makers should be optimally partitioned. In this section we propose a sample-based model of a bounded rational decision process that approximately corresponds to Eq. (10), such that the performance of the decision process can be compared against its normative baseline. To translate Eq. (10) into a stochastic process we proceed in three steps. First, we implement the priors p(a|x) as Variational Autoencoders. Second, we formulate an MCMC chain that is initialized with a sample from the prior and generates a decision \(a\sim p(a|x,w)\). Third, we design an MCMC chain that functions as a selector between the different priors.

Autoencoder Priors. Each prior p(a|x) in Eq. (10) is represented by a VAE that learns to generate action samples mimicking the samples produced by the MCMC chains (compare Fig. 2). The functions \(\mu _\theta (z)\), \(\mu _\eta (a)\) and \(\varSigma _\eta (a)\) are implemented as feed-forward neural networks with one hidden layer. The hidden units all use sigmoid activation functions; the output units of the \(\mu \)-functions are also sigmoids, while those of the \(\varSigma \)-function use ReLU. During training the weights \(\eta \) and \(\theta \) are adapted to optimize the expected log-likelihood of the action samples given by the decisions of the MCMC chains for all world states assigned to the prior p(a|x). Due to the Gaussian shape of the decoder distribution, optimizing the log-likelihood corresponds to minimizing the quadratic reconstruction error. After training, the network can generate action samples itself by feeding the decoder network with samples from \(\mathcal {N}(z|0,\mathbb {I})\).
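A minimal sketch of such a VAE prior is given below, written against TensorFlow's tf.keras with an explicit training step (the experiments in Sect. 4 use Keras [2]). The layer sizes, the one-dimensional action space and the training-loop details are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of a VAE prior p(a|x): encoder q(z|a, eta), decoder p(a|z, theta),
# trained on the decisions produced by the MCMC chains assigned to this prior.
# Written against TensorFlow 2.x / tf.keras; sizes and details are illustrative.
import numpy as np
import tensorflow as tf

ACTION_DIM, HIDDEN, LATENT_DIM = 1, 16, 2

def make_encoder():
    a_in = tf.keras.Input(shape=(ACTION_DIM,))
    h = tf.keras.layers.Dense(HIDDEN, activation="sigmoid")(a_in)
    mu = tf.keras.layers.Dense(LATENT_DIM)(h)                 # mu_eta(a)
    log_var = tf.keras.layers.Dense(LATENT_DIM)(h)            # diagonal Sigma_eta(a)
    return tf.keras.Model(a_in, [mu, log_var])

def make_decoder():
    z_in = tf.keras.Input(shape=(LATENT_DIM,))
    h = tf.keras.layers.Dense(HIDDEN, activation="sigmoid")(z_in)
    a_out = tf.keras.layers.Dense(ACTION_DIM, activation="sigmoid")(h)  # mu_theta(z)
    return tf.keras.Model(z_in, a_out)

encoder, decoder = make_encoder(), make_decoder()
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(a_batch):
    """One gradient step on the negative objective of Eq. (8)."""
    with tf.GradientTape() as tape:
        mu, log_var = encoder(a_batch)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps                  # reparameterization trick
        a_hat = decoder(z)
        recon = tf.reduce_sum(tf.square(a_batch - a_hat), axis=-1)  # quadratic loss
        kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
        loss = tf.reduce_mean(recon + kl)
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

def sample_prior(n):
    """Generate n action samples from the learned prior by decoding z ~ N(0, I)."""
    z = tf.random.normal((n, LATENT_DIM))
    return decoder(z).numpy()

# Example usage: fit the prior to decisions collected from the MCMC chains
# (placeholder data used here for illustration).
mcmc_decisions = np.random.rand(256, ACTION_DIM).astype("float32")
for _ in range(100):
    train_step(tf.constant(mcmc_decisions))
```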

MCMC Decision-Making. To implement the bounded rational decision-maker p(a|w, x), we obtain an action sample \(a\sim p(a|x)\) from the autoencoder prior and use it to initialize an MCMC chain that optimizes the target utility \(\mathbf {U}(w,a)\) for the given world state. We run the MCMC chain for \(N_{\max }\) steps. In each step we generate a proposal from a Gaussian distribution \(g(a'|a)=\mathcal {N}(a'\vert a,\sigma ^2)\) and accept with probability

$$\begin{aligned} A(a'|a) = \min \big \{1, \exp ({\gamma (\mathbf {U}(w, a') - \mathbf {U}(w,a))})\big \}. \end{aligned}$$
(11)

Over the course of the \(N_{\text {max}}\) time steps, the precision \(\gamma \) is adjusted following an annealing schedule conditioned on the maximum number of steps \(N_{\text {max}}\). We use an inverse Boltzmann annealing schedule, i.e. \(\gamma ^{(k)} = \gamma ^{0} + \alpha \log (1 + k)\), where \(\alpha \) is a tuning parameter. The rationale behind this is that the sampling process is assumed to be coarse-grained in the beginning and becomes finer during the search.
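The following sketch summarizes this decision step: it is seeded with a sample from the VAE prior (here passed in as a `sample_prior` callable, e.g. the helper from the previous sketch), follows the acceptance rule of Eq. (11), and anneals \(\gamma\) with the logarithmic schedule. All parameter values are illustrative.

```python
# Sketch: annealed MCMC decision step seeded from the VAE prior. The utility
# function and all parameter values are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(3)

def mcmc_decision(utility, w, sample_prior, n_max=50, gamma0=1.0,
                  alpha=10.0, proposal_std=0.05):
    """Seed with a_0 ~ p(a|x) from the VAE prior, run an annealed MH chain on
    U(w, a) for n_max steps, and return the last sample as the decision."""
    a = float(sample_prior(1)[0, 0])                          # a_0 ~ p(a|x)
    for k in range(n_max):
        gamma = gamma0 + alpha * np.log(1.0 + k)              # annealing schedule
        a_prop = np.clip(a + rng.normal(0.0, proposal_std), 0.0, 1.0)
        accept = min(1.0, np.exp(gamma * (utility(w, a_prop) - utility(w, a))))
        if rng.random() < accept:
            a = a_prop
    return a

# Example usage with a toy Gaussian utility and a placeholder prior sampler.
toy_utility = lambda w, a: np.exp(-((a - 0.3 * (w + 1)) ** 2) / 0.01)
decision = mcmc_decision(toy_utility, w=1, sample_prior=lambda n: rng.random((n, 1)))
```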

Fig. 2.

The encoder translates the observed action into a latent variable z, whereas the decoder translates the latent variable z into a proposed action a. During training the weights \(\eta \) and \(\theta \) are adapted to optimize the expected log-likelihood of the observed action samples. After training, the network can generate actions by feeding the decoder network with samples from \(\mathcal {N}(z|0,\mathbb {I})\).

Prior Selection. To implement the bounded rational prior selection \(p(x\vert w)\) through an MCMC process, we first sample an x from the prior p(x) and start an MCMC chain that (approximately) optimizes \(\varDelta {{\mathrm{\text {F}_{\text {par}}}}}(w,x)\) for a given world state w sampled from \(\rho (w)\). The prior p(x) is represented by a multinomial distribution and updated with the frequencies of the selected prior indices x. The number of steps in the prior selection MCMC chain was kept constant at \(N_{\mathrm {max}}^{\text {sel}}\), and similarly the precision \(\gamma ^{\text {sel}}\) was annealed over the course of these \(N_{\mathrm {max}}^{\text {sel}}\) time steps. The target \(\varDelta {{\mathrm{\text {F}_{\text {par}}}}}(w,x)\) comprises a trade-off between expected utility and information resources. However, it cannot be evaluated directly, as this would require computing \({{\mathrm{\text {D}_\text {KL}}}}(p(a|x,w)\Vert p(a|x))\). Here we use the number of steps in the downstream MCMC process as a resource measure. As the number of downstream steps was constant, the model selector's choice only depends on the average utility achieved by each decision-maker, which results in the acceptance rule

$$\begin{aligned} A(x'|x) = \min \left\{ 1, \exp \left( \gamma ^{\text {sel}}\left( {{\mathrm{\mathbb {E}}}}_{p(a|w,x')}[\mathbf {U}(w,a)] - {{\mathrm{\mathbb {E}}}}_{p(a|w,x)}[\mathbf {U}(w,a)]\right) \right) \right\} . \end{aligned}$$

As the priors are discrete choices, the proposal distribution \(g(x'\vert x)\) samples globally and uniformly with \(g(x'\vert x) = \frac{1}{\vert X \vert }\) for all \(x'\).
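A sketch of this selection stage is given below: a short Metropolis chain over the discrete prior index x with a uniform global proposal, where each candidate prior is scored by the average utility of a few downstream decisions. The helper `mcmc_decision` refers to the previous sketch; the scoring procedure and all parameter values are illustrative assumptions.

```python
# Sketch: MH prior selection over the discrete prior index x. Each candidate is
# scored by the average utility achieved by its downstream decision chain.
# Helper names (mcmc_decision, priors) and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)

def select_prior(utility, w, priors, n_sel=10, gamma_sel=5.0, n_eval=5):
    """priors: list of callables sample_prior(n), one per VAE prior p(a|x)."""
    def avg_utility(x):
        # Estimate E_{p(a|w,x)}[U(w,a)] with a few downstream decisions.
        decisions = [mcmc_decision(utility, w, priors[x]) for _ in range(n_eval)]
        return np.mean([utility(w, a) for a in decisions])

    x = rng.integers(len(priors))                         # x ~ p(x), here uniform
    score = avg_utility(x)
    for k in range(n_sel):
        gamma = gamma_sel * np.log(2.0 + k)               # annealed selector precision
        x_prop = rng.integers(len(priors))                # global uniform proposal
        score_prop = avg_utility(x_prop)
        if rng.random() < min(1.0, np.exp(gamma * (score_prop - score))):
            x, score = x_prop, score_prop                 # accept the proposed prior
    return x
```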

Fig. 3.

Top: The line is the rate distortion curve, which forms a theoretical efficiency frontier characterized by the trade-off between mutual information and expected utility. Crosses represent single-prior agents and dots multi-prior systems. The labels indicate how many of a total of 100 steps were assigned to the second MCMC chain. Bottom: Information processing and expected utility increase with the number of utility evaluations, as we expected.

4 Empirical Results

To demonstrate our approach we evaluate two scenarios. First, a simple agent equipped with a single prior policy \(p_\eta (a)\), as introduced in Sect. 2; in the case of a single agent there is no need for a prior selection stage. Second, a multi-prior decision-making system, whose results we compare to the single-prior agent. For the multi-prior agent, we split a fixed number of MCMC steps between the prior selection and the action selection. The task we designed consists of six world states, where each world state has a Gaussian utility function on the interval [0, 1] with a unique optimum. In both settings, we equipped the Variational Autoencoders with one hidden layer consisting of 16 units with ReLU activations. We implemented the experiments using Keras [2]. We show the results in Fig. 3.

Our results indicate that MCMC evaluation steps can serve as a surrogate for information-processing costs, so that the resulting behavior can be interpreted as bounded rational decision-making. In Fig. 3 we show the efficiency of several agents with different processing constraints. To compare our results to the theoretical baseline, we discretized the action space into 100 equidistant slices and solved the problem using the algorithm proposed in [5] to implement Eq. (10). Furthermore, our results indicate that the multi-prior system generally outperforms the single-prior system in terms of utility.

To illustrate the differences in efficiency between the single-prior agent and the multi-prior agents, we plot in Fig. 4 the utility gained through the second MCMC optimization. For multi-prior agents this gain is small, because the specialized priors provide initializations to the MCMC chains that are already close to the optimal action. In this particular case, \(\varDelta \mathbf {U}\) does not become zero because we allow only three priors to cover six world states, thus leading to abstraction, i.e. specialization on actions that fit the assigned world states well. In single-prior agents, the prior adapts to all world states, thus providing, on average, an initial action that is suboptimal for the requested world state.

Fig. 4.

Our results indicate that having multiple priors is more beneficial if more steps are available in total. Note that the stochasticity of our method decreases with the number of allowed steps, as shown by the uncertainty bands (transparent regions).

5 Discussion

In this study we implemented bounded rational decision-makers with adaptive priors, where the priors are represented by Variational Autoencoders. The bounded rational decision-making process was implemented by MCMC optimization to find the optimal posterior strategy, thus giving a computationally simple way of generating samples. As the number of steps in the optimization process was constrained, we could quantify the information-processing capabilities of the resulting decision-makers using relative Shannon entropy. Our analysis may have interesting implications, as it provides a normative framework for this kind of combined optimization of adaptive priors and decision-making processes. Prior to our work there have been several attempts to apply the framework of information-theoretic bounded rationality to machine learning tasks [7, 11, 12, 18]. The novelty of our approach is that we design adaptive priors for both the single-step case and the multi-agent case and that we demonstrate how to transform information-theoretic constraints into computational constraints in the form of MCMC steps.

Recently, the combination of Monte Carlo optimization and neural networks has gained increasing popularity. These approaches include both using MCMC processes to find optimal weights in ANNs [1, 4] and using ANNs as parameterized proposal distributions in MCMC processes [8, 13]. While our approach is more similar to the latter, the important difference is that in such adaptive MCMC approaches there is only a single MCMC chain with a single (adaptive) proposal to optimize a single task, whereas in our case there are multiple adaptive priors that initialize multiple chains with otherwise fixed proposals, which can be used to learn multiple tasks simultaneously. In that sense our work is more closely related to mixture-of-experts methods and divide-and-conquer paradigms [6, 9, 24], although we employ a selection policy rather than a blending policy, as we design our model specifically to encourage specialization. In mixture-of-experts models, the multiple decision-makers correspond to the multiple priors in our case, but the experts are typically not modeled as anytime optimization processes. Possibly the most popular combination of neural network learning with Monte Carlo methods is AlphaGo [19], which beat the leading Go champion by optimizing the strategies provided by value networks and policy networks with Monte Carlo Tree Search, leading to a major breakthrough in reinforcement learning. An important difference is that there the neural network is used to directly approximate the posterior and Monte Carlo search is used to improve performance by concentrating on the most promising moves during learning, whereas in our case ANNs are used to represent the prior. Moreover, in our work we assumed the utility function (i.e. the value network) to be given. For future work it would be interesting to incorporate learning of the utility function into our model in order to investigate more complex scenarios such as reinforcement learning.