A scalable preference model for autonomous decision-making
Abstract
Emerging domains such as smart electric grids require decisions to be made autonomously, based on the observed behaviors of large numbers of connected consumers. Existing approaches either lack the flexibility to capture nuanced, individualized preference profiles, or scale poorly with the size of the dataset. We propose a preference model that combines flexible Bayesian nonparametric priors, which provide state-of-the-art predictive power, with well-justified structural assumptions that allow a scalable implementation. The Gaussian process scalable preference model via Kronecker factorization (GaSPK) provides accurate choice predictions and principled uncertainty estimates as input to decision-making tasks. In consumer choice settings where alternatives are described by few key attributes, inference in our model is highly efficient and scalable to tens of thousands of choices.
Keywords
Autonomous agents, Autonomous decision-making, Bayesian inference, Discrete choice, Gaussian processes, Laplace inference, Preferences
1 Introduction
Data-driven modeling has become integral to informing a growing array of decisions, yet autonomous decision-making remains elusive under all but the most highly structured circumstances. Two prominent application domains, dynamic flight pricing and automated credit approvals, exemplify how automated business rule engines make most operative decisions quickly, cheaply, and reliably (Baker 2013). However, in less structured settings involving individual preferences (from planning the next vacation to trading in complex multi-echelon markets), autonomous decision-making through software agents remains an active area of research, e.g., Adomavicius et al. (2009).
A key challenge in autonomous decision-making in unstructured settings is the identification of what choices a given user deems best. Individuals may be unaware of the drivers underlying their own preferences (Lichtenstein and Slovic 2006), making data-driven preference models instrumental because they can elicit preferences by generalizing from observed choices (Bichler et al. 2010). While we will typically see global patterns in preferences (e.g., cheaper options preferred over more expensive options), we do not expect a single model to capture all behavior. Two individuals may make different choices when faced with the same options, and even a single individual may not always make consistent decisions.
To see the benefit of preference modeling consider, for example, smart grids (Kassakian and Schmalensee 2011), where data-driven learning is anticipated to play a key role in facilitating efficient electricity distribution and use. Particular challenges in this context are electric vehicles that are charged in varying locations, and the incorporation of intermittent and variable renewable electricity sources, such as solar and wind (Valogianni et al. 2014). Data-driven modeling of electricity consumption preferences under different incentives and contexts is essential for effectively incentivizing consumers to choose sustainable behaviors (Watson et al. 2010; Peters et al. 2013). A number of factors may influence an individual’s choice to use her electric vehicle, for example time of day, cost of electricity, or weather conditions. A preference model can learn that a user prefers not to use her electric vehicle in the morning if electricity is at a higher price point, over not using the vehicle if the cost of electricity is reduced. Such a model could be used to incentivize electric vehicle owners to make the car battery’s energy available to nearby consumers when renewable energy is scarce. Electricity cost and emission reduction, informed by such data-driven preference learning and autonomous decision-making, can be significant (Kahlen et al. 2014).
Prior work on preference learning has made significant advances in generating accurate predictions from noisy observations such as electricity meter readings that are inconsistent and heterogeneous (Kohavi et al. 2004; Evgeniou et al. 2005). Recently, nonparametric Bayesian models have proved particularly advantageous. A Bayesian framework, such as that used by Guo and Sanner (2010), explicitly models uncertainty and accommodates inconsistencies in human choices rather than imposing stringent rationality assumptions. By allowing inconsistencies in observed choices to be translated to uncertainty, Bayesian models can distinguish between cases where estimates are certain enough to justify an autonomous action, and cases where the model might benefit from actively acquiring additional evidence or transferring control to a human decision-maker (Saar-Tsechansky and Provost 2004; Bichler et al. 2010). Nonparametric Bayesian models are a particularly flexible class of Bayesian models that minimize assumptions made about the structure underlying the data, instead automatically adapting to the complexity of real-world observations (Guo and Sanner 2010; Bonilla et al. 2010; Houlsby et al. 2012).
Existing nonparametric Bayesian methods that achieve state-of-the-art predictive accuracies do not scale well to a large number of users. Their prohibitive computational costs cannot be addressed with additional processing power or offline processing, making such methods impractical for modeling a large number of users. Conversely, models that are more scalable, such as the restricted value of information algorithm (Guo and Sanner 2010), tend to be parametric models that offer inferior predictive performance compared with more complex models (Bonilla et al. 2010). If nonparametric Bayesian methods are to become widely adopted in practice, progress must be made to ensure that they both scale well and achieve high-quality predictions. In particular, important domains such as energy markets and healthcare require methods that are computationally efficient, and that scale gracefully with respect to the number of users and observations. Contemporary electric distribution systems, for example, produce large amounts of data from up to ten million consumer meters, each transmitting data every few minutes (Widergren et al. 2004). Such large amounts of data must be processed quickly and at high granularity (i.e., unaggregated), as automated responses often rely on fine-grained, local information. It is therefore important for preference models to provide consistently fast training times, as well as to incorporate and act on new data in a timely manner.
In this paper we develop and evaluate a novel, nonparametric, Bayesian approach that offers an advantageous augmentation to the existing preference modeling toolset. Our approach, Gaussian process Scalable Preference model via Kronecker factorization (GaSPK), leverages common features of consumer choice settings, particularly the small set of relevant product attributes, to yield state-of-the-art scalability. GaSPK formulates a personalized preference model based on a shared set of tradeoffs, designed in a way that facilitates the use of Kronecker covariance matrices. While covariance matrices with Kronecker structure have been used in a preference learning context (Bonilla et al. 2010), no prior work on preference learning has employed their favorable factorization and decomposition properties to produce scalable algorithms. As we will see in Sects. 3.2 and 5, this leads to improved theoretical and empirical scalability over related preference models.
We empirically evaluate GaSPK’s performance relative to that of key benchmarks on three real-world consumer choice datasets. For this study we collected an electricity tariff choice dataset on a commercial crowdsourcing platform for a U.S. retail electricity market. To confirm our findings we evaluated the methods on two benchmark choice datasets on political elections and car purchases. Our results establish that GaSPK is often likely to be the method of choice for modeling preferences of a large number of users. GaSPK produces state-of-the-art scalability, while often yielding favorable predictive accuracy as compared to the accuracy achieved by existing approaches.
GaSPK introduces a new benchmark to the preference modeling toolset that is particularly suitable for modeling a large number of users’ preferences when alternatives can be described by a small number of relevant attributes. Its principled handling of uncertainty is also instrumental for autonomous decision-making.
2 Gaussian process scalable preference model via Kronecker factorization (GaSPK)
We begin by outlining the GaSPK learning approach. As discussed above, GaSPK aims to augment the nonparametric Bayesian preference modeling toolset to allow for scalability and conceptual simplicity in consumer choice settings. Our discussion begins with a description of the fundamentals of GaSPK. We outline our contributions to facilitate scalability and conceptual simplicity in Sect. 3.
Summary of mathematical notation (symbols are in alphabetical order)
Symbol  Definition 

\(\circ \)  Hadamard (elementwise) matrix product 
\(\otimes \)  Kronecker matrix product 
\(C = \left\{ (u, x^{1}, x^{2}, y) \right\} \)  Choice situations: when presented with alternatives \(x^1\) and \(x^2\), user u chose \(y =+1\) (first alternative), or \(y=-1\) (second alternative) 
\(\gamma _u^c, {\varGamma }_u, {\varGamma }\)  \({\varGamma }_u=(\gamma _u^1,\dots ,\gamma _u^{n_c})\) is a probability vector indicating the extent of user u’s possession of each of the \(n_c\) characteristics; \({\varGamma } \in {\mathbb {R}}^{n_U \times n_c}\) collects all \({\varGamma }_u\) 
\(d_T, d_{X}\)  Dimensionality of elements in T, X 
\(f_u, f^c\)  Functions \(f:{\mathbb {R}}^{d_T}\rightarrow {\mathbb {R}}\) describing users’ latent evaluation of tradeoffs and characteristic evaluation of tradeoffs, respectively 
\(\theta = \{ l_d \}\)  Lengthscale hyperparameters 
I  Identity matrix 
K  Covariance matrix, \(K \in {\mathbb {R}}^{(n_c n_T) \times (n_c n_T)}\) 
\(L: L L^T = W\)  Lower Cholesky factor of W 
\(n_c\)  Fixed number of characteristics 
\(n_e\)  Number of Eigenvalues used in lowrank approximations 
\(n_T, n_U, n_X\)  Number of elements in T, U, and X 
\(N,{\varPhi }\)  Probability density function (PDF), and cumulative distribution function (CDF) of the standard normal distribution 
\(p(C \mid \{f_u\})\), \(\nabla p(C \mid \{f_u\})\)  Likelihood and its Jacobian \(\frac{\partial p(C \mid f)}{\partial f_i}\) 
\(t, t^{(d)}, T\)  Tradeoff t, its dth element, and set of all tradeoffs, respectively 
U  Set of all users 
\(W = -\nabla \nabla \log p(C \mid \{f_u\})\)  Negative Hessian of the log likelihood 
X  Set of all instances 
\(y \in \{ -1, +1 \}\)  Single choice: \(y =+1\) (first alternative), or \(y=-1\) (second alternative) 
Z  Model evidence, also known as marginal likelihood 
Rather than operating directly on order relations \(\succeq _u\), some preference models estimate latent functions from which the order relations can be inferred. For example, the standard discrete choice models proposed by Thurstone (1927) and Bradley and Terry (1952) estimate functions \(\widetilde{f}_u: X \rightarrow {\mathbb {R}}\) that capture the utility \(\widetilde{f}_u(x)\) that user u derives from each instance x. When presented with a previously unobserved choice between instances \(x^1\) and \(x^2\), these models will predict that \(x^1 \succeq _u x^2\) if and only if \(\widetilde{f}_u(x^1) \ge \widetilde{f}_u(x^2)\).
Two key disadvantages of such approaches are the absolute interpretation of utility independently of context, and the stringent rationality assumptions that follow from this treatment. When making decisions, individuals have been shown to focus on tradeoffs resulting from their choices rather than on absolute outcomes, and thus perceive alternatives within the context in which they are presented (Tversky and Simonson 1993). The assumption of utility models that individuals simply recall absolute, predetermined instance utilities \(\widetilde{f}_u(x)\), and the strict transitivity of \(\succeq _u\) implied by this assumption, are frequently violated in practice. Therefore, we represent our choices in terms of tradeoffs \(t=\tau (x^1,x^2)\), where \(\tau \) is some mapping from \({\mathbb {R}}^{d_X}\times {\mathbb {R}}^{d_X}\) to \({\mathbb {R}}^{d_T}\), where \(d_T\le d_X\). For example, we might choose \(\tau (x^1,x^2) = x^1 - x^2\). If the dimensionality \(d_X\) of X is high, we might choose \(\tau \) to be a dimensionality-reducing mapping, so that \(d_T<d_X\); such an approach is supported by findings that, when \(d_X\) is large, consumers tend to base their decisions on a small subset of the dimensions of X (Hensher 2006).
We model each user’s tradeoff evaluation as a weighted combination of shared characteristic functions, \(f_{u}(t) = \sum _{c} \gamma _u^c \cdot f^c(t)\) with \(\sum _c \gamma _u^c = 1\). The shared functions \(f^c\) can capture global patterns of behavior (e.g., frugality, environmental consciousness) that are exhibited in different quantities by different users (as determined by the weights \({\varGamma }\)). Sharing information across users in this hierarchical manner allows us to draw statistical strength across users, leading to better predictions even when we have few choice observations for a particular user.
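The following minimal sketch illustrates how the weights enter a single prediction. It uses the weighted combination \(f_u(t) = \sum_c \gamma_u^c \cdot f^c(t)\) and the probit likelihood \(\Phi(y \cdot f_u(t))\) that appear in Sect. 3.2.2. The two characteristics, their evaluations, and the weights are hypothetical numbers chosen for illustration only.

```python
import math

def std_normal_cdf(z):
    """CDF of the standard normal distribution (Phi in the notation table)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def user_evaluation(gamma_u, f_c_values):
    """f_u(t) = sum_c gamma_u^c * f^c(t) for a single tradeoff t."""
    if abs(sum(gamma_u) - 1.0) > 1e-9:
        raise ValueError("user characteristics must sum to one")
    return sum(g * f for g, f in zip(gamma_u, f_c_values))

def choice_probability(y, gamma_u, f_c_values):
    """Probit likelihood of choice y in {-1, +1} given the user's evaluation."""
    return std_normal_cdf(y * user_evaluation(gamma_u, f_c_values))

# Hypothetical values: two characteristics evaluated at one tradeoff t,
# and a user who is mostly described by the first characteristic.
f_c_at_t = [1.2, -0.4]
gamma_u = [0.75, 0.25]

f_u = user_evaluation(gamma_u, f_c_at_t)
p_first = choice_probability(+1, gamma_u, f_c_at_t)
p_second = choice_probability(-1, gamma_u, f_c_at_t)
```

Because the probit link is symmetric, the probabilities of the two alternatives sum to one, and a positive \(f_u(t)\) favors the first alternative.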
For now, we assume that \({\varGamma }\) is known and focus on the problem of efficiently obtaining probabilistic estimates of the \(f^c\). To do so in a Bayesian context, we start by placing some prior distribution \(p(f^c)\) over these functions. We desire our prior to be flexible and make minimal assumptions about the functional form of the \(f^c\). To this end, we select the nonparametric Gaussian process (GP) prior (Rasmussen and Williams 2006; MacKay 1998). A GP is a distribution over continuous functions \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) such that, for any finite set of locations in \({\mathbb {R}}^d\), the function evaluations have a joint multivariate Gaussian distribution. We write \(f(\cdot ) \sim \mathcal {GP}(m(\cdot ), k(\cdot , \cdot ))\), where m is a mean function (which we set to zero to reflect indifference in the absence of other information), and k is a covariance function that specifies how strongly evaluations of f at t and \(t'\) are correlated.
Evaluating k at all pairs of observed tradeoffs \((t_1, t_2)\) yields the covariance (kernel) matrix K necessary for posterior inference. Importantly, the cost of many key operations on K grows cubically in the number of unique tradeoffs, which presents naïve inference methods with significant scalability challenges. In Sect. 3.2 we show how the structure of our preference learning task can be exploited to substantially reduce this cost, yielding state-of-the-art scalability for our setting without significant loss in accuracy.
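As a rough illustration of the unstructured case, the sketch below builds a dense covariance matrix over a set of tradeoffs and draws one function from the zero-mean GP prior. The squared exponential kernel with one lengthscale per dimension is an assumption made for this example (the covariance function is only characterized here as factorizing across dimensions), and the tradeoffs are random placeholders.

```python
import numpy as np

def se_kernel(T, lengthscales):
    """Assumed kernel: product of 1-D squared exponentials, one lengthscale l_d per dimension."""
    T = np.asarray(T, dtype=float)
    diff = (T[:, None, :] - T[None, :, :]) / np.asarray(lengthscales, dtype=float)
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
n_T = 50
T = rng.uniform(-1.0, 1.0, size=(n_T, 2))      # placeholder tradeoffs, d_T = 2
K = se_kernel(T, lengthscales=[0.5, 1.0])       # dense n_T x n_T covariance matrix

# Drawing from the prior needs a Cholesky factor of K, an O(n_T^3) operation.
# The small jitter term is a numerical safeguard, not part of the model.
chol = np.linalg.cholesky(K + 1e-6 * np.eye(n_T))
f_sample = chol @ rng.standard_normal(n_T)      # one draw f ~ GP(0, k) at the tradeoffs
```

The Cholesky factorization is the step whose cubic cost motivates the structured approach of Sect. 3.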
3 Fast Bayesian inference in GaSPK
The modeling choices made in the basic GaSPK framework described in Sect. 2 are designed to give flexible and powerful modeling capacity, allowing us to obtain high-quality predictive performances. However, the goal of this work is to combine state-of-the-art performances with computational efficiency. As described in Sect. 2, a naïve implementation of GaSPK will not scale well as we see more data, since GP inference typically scales cubically with the number of datapoints. Further, the non-Gaussian likelihood means we are unable to evaluate the posterior analytically, and must make judicious approximate inference choices to ensure scalability.
In this section we address these issues of scalability. We first introduce modeling choices that facilitate scalable inference (Sect. 3.1), then develop a scalable approximate inference scheme in Sect. 3.2. Our inference algorithm alternates between using Laplace’s method to efficiently obtain the approximate posterior distribution \( p(f^c \mid C) \approx q(f^c \mid C)\) of characteristic tradeoff evaluations, and estimating the user characteristics \({\varGamma }\) and the hyperparameters \(\theta \).
3.1 Structured Gaussian processes
When we condition on our finite set of tradeoffs T, inferences about the \(f^c\) correspond to posterior inference in a multivariate Gaussian. Evaluating the covariance function k at all pairs of observed tradeoffs \((t_1, t_2)\) yields the covariance (kernel) matrix K necessary for this posterior inference. Importantly, the cost of many key operations on K grows cubically in the number of unique tradeoffs, which presents naïve inference methods with significant scalability challenges.
However, since our covariance structure factorizes across dimensions (Eq. 2), if we are able to arrange our inputs on a grid, we can formulate our model using Kronecker covariance matrices. Kronecker covariance matrices have favorable factorization and decomposition properties that, as we describe in this section, facilitate scalable inference. While Kronecker-structured covariances have appeared in other preference learning models (Bonilla et al. 2010; Birlutiu et al. 2013), we believe we are the first to exploit their computational advantages in this context.
In particular, as we will see later in this section, Kronecker covariance matrices are particularly appealing when our input space can be expressed as a low-dimensional grid with a fairly small number of possible values along each dimension. Our problem setting is well suited to the use of such a structure. Consumer and econometric research has established that consumers focus on relatively small subsets of attributes as well as few possible values thereof when choosing amongst alternatives, e.g., Caussade et al. (2005) and Hensher (2006). Motivated by this, we consider settings in which (1) the number of users, instances, and observed choices is large and naïve methods are therefore computationally infeasible; (2) tradeoffs can be represented by a small number of attributes; and (3) each attribute has a small number of values, or can be discretized. We show that when alternatives can be represented by a small number of attributes and values, it is possible to obtain matrices K which are large, but on which important operations can be performed efficiently. In the empirical evaluations that follow, we demonstrate that this approach yields computational advances but also, despite introducing approximations, produces predictive performance that is often superior to what can be achieved with current scalable approaches.
Concretely, we assume that tradeoffs can be arranged on a \(d_T\)-dimensional grid, and let \(T_d\) denote the set of unique values that occur on the dth attribute in T. In our electricity tariffs example, tradeoffs can be characterized by (1) price differences per kWh, and (2) differences in renewable sources, so that we may have the following unique tradeoff values: \({T_1 = \left\{ -0.10, -0.09, \ldots , 0.09, 0.10\right\} }\) and \({T_2 = \left\{ -1, 0, 1 \right\} }\). Not all possible combinations of tradeoffs are always observed \((|T| < |T_1| \cdot |T_2| = 63)\), and the covariance matrix \({\widetilde{K} = \left[ k(t, t')\right] _{t, t' \in T}}\) is therefore significantly smaller than \(63 \times 63\). A Gaussian process applied to such a structured input space is known as a structured GP (Saatci 2011).
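To make the grid structure concrete, the sketch below builds the two per-attribute value sets of the electricity tariff example and checks that the covariance over the full 63-point grid equals a Kronecker product of two small per-attribute matrices. The squared exponential kernel and the lengthscale values are assumptions for illustration; any covariance that factorizes across dimensions would behave the same way.

```python
import numpy as np

def se_kernel_1d(values, lengthscale):
    """Assumed 1-D squared exponential kernel on one attribute."""
    v = np.asarray(values, dtype=float)
    d = (v[:, None] - v[None, :]) / lengthscale
    return np.exp(-0.5 * d ** 2)

# Unique tradeoff values per attribute, as in the electricity tariff example.
T1 = np.round(np.arange(-10, 11) * 0.01, 2)    # price differences: -0.10, ..., 0.10
T2 = np.array([-1.0, 0.0, 1.0])                # differences in renewable sources

l1, l2 = 0.05, 1.0                             # hypothetical lengthscales
K1 = se_kernel_1d(T1, l1)                      # 21 x 21
K2 = se_kernel_1d(T2, l2)                      # 3 x 3
K = np.kron(K1, K2)                            # 63 x 63 covariance over the full grid

# Direct evaluation on the explicit grid, ordered with T1 as the outer index.
grid = np.array([(a, b) for a in T1 for b in T2])
diff = (grid[:, None, :] - grid[None, :, :]) / np.array([l1, l2])
K_direct = np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
```

Only the 21 x 21 and 3 x 3 factors ever need to be stored; the dense `K` is formed here purely to verify the equivalence.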

Matrix-vector products of the form Kb can be computed at a cost that is linear in the size of b, in contrast to the quadratic cost entailed by standard matrix-vector products. This follows from the fact that \(\left( K_i\otimes K_j\right) \text{ vec }(B) = \text{ vec }\left( K_iBK_j^T\right) \), where \(b=\text{ vec }(B)\); since the number of nonzero elements of B is the same as the length of b, this operation is linear in the length of b.
As we will see in Algorithm 2, such products are required to find the posterior mode of our GPs and in general dominate the overall computational budget; this speedup means that they are no longer the dominant computational cost.
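The identity above can be checked numerically with a few lines of NumPy. In this sketch, `vec` is taken to be row-major flattening (NumPy's default `reshape` order), which is the convention under which the identity holds exactly as written; the function name `kron_matvec` and the random test matrices are our own.

```python
import numpy as np

def kron_matvec(K_i, K_j, b):
    """Compute (K_i kron K_j) @ b without forming the Kronecker product."""
    n_i, n_j = K_i.shape[0], K_j.shape[0]
    B = b.reshape(n_i, n_j)                  # undo the row-major vec: b = vec(B)
    return (K_i @ B @ K_j.T).reshape(-1)     # vec(K_i B K_j^T)

rng = np.random.default_rng(0)
A_i = rng.standard_normal((21, 21))
A_j = rng.standard_normal((3, 3))
K_i = A_i @ A_i.T                            # arbitrary symmetric PSD factors
K_j = A_j @ A_j.T
b = rng.standard_normal(21 * 3)

fast = kron_matvec(K_i, K_j, b)              # uses only the small factors
dense = np.kron(K_i, K_j) @ b                # reference: explicit 63 x 63 matrix
```

The structured version never allocates the full matrix, so its memory footprint is that of the factors plus the vector.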
Eigendecompositions of the form \(K = Q {\varLambda } Q^T\) can be computed from the Eigendecompositions of the \(K_d\):$$\begin{aligned} Q = \bigotimes _{d=1}^{D}Q_d\quad {\varLambda } = \bigotimes _{d=1}^{D}{\varLambda }_d \end{aligned}$$at cubic cost in the size of the largest \(K_d\). This is a consequence of the mixed product property of Kronecker products, which states that \((A\otimes B)(C\otimes D) = (AC)\otimes (BD)\), and therefore$$\begin{aligned} (Q_i{\varLambda }_iQ_i^T)\otimes (Q_j{\varLambda }_jQ_j^T) = \left( (Q_i{\varLambda }_i)\otimes (Q_j{\varLambda }_j)\right) (Q_i^T\otimes Q_j^T) = (Q_i\otimes Q_j)({\varLambda }_i\otimes {\varLambda }_j)(Q_i\otimes Q_j)^T \end{aligned}$$In particular, this allows us to efficiently determine the Eigenvectors to the \(n_e\) largest Eigenvalues of K, allowing us to obtain computational speedups by replacing K with a low-rank approximation.
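The sketch below illustrates this property for two factors: the Eigendecomposition of the Kronecker product is assembled from the Eigendecompositions of the small factors, and the leading \(n_e\) Eigenpairs give a low-rank approximation. The kernel, grid values, lengthscales, and the choice \(n_e = 10\) are assumptions for illustration; the dense matrices are formed only to verify the result.

```python
import numpy as np

def se_kernel_1d(values, lengthscale):
    """Assumed 1-D squared exponential kernel on one attribute."""
    v = np.asarray(values, dtype=float)
    d = (v[:, None] - v[None, :]) / lengthscale
    return np.exp(-0.5 * d ** 2)

K_i = se_kernel_1d(np.linspace(-0.1, 0.1, 21), 0.05)
K_j = se_kernel_1d(np.array([-1.0, 0.0, 1.0]), 1.0)
K = np.kron(K_i, K_j)                         # dense matrix, for checking only

# Decompose the small factors (cubic in 21 and 3, not in 63).
lam_i, Q_i = np.linalg.eigh(K_i)
lam_j, Q_j = np.linalg.eigh(K_j)

# Eigenvectors and Eigenvalues of K are Kronecker products of the factors' ones.
Q = np.kron(Q_i, Q_j)
lam = np.kron(lam_i, lam_j)

# Low-rank approximation from the n_e largest Eigenvalues.
n_e = 10
top = np.argsort(lam)[::-1][:n_e]
K_low = (Q[:, top] * lam[top]) @ Q[:, top].T
```

By construction, the spectral-norm error of `K_low` equals the largest discarded Eigenvalue, which gives a direct handle on how large \(n_e\) needs to be.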
3.2 Approximate inference in GaSPK
The Kronecker structure described above has proved useful in a regression context, but requires careful algorithmic design to ensure its benefits are exploited in the current context. In Sect. 3.2.1, we develop a scalable inference algorithm using Laplace’s method to estimate the posterior distributions \(p(f^c \mid C,{\varGamma })\).
In a full Bayesian treatment of GaSPK, we would consider \({\varGamma }\) another latent quantity of interest, and infer its posterior distribution. Previous work has addressed similar challenges by either imposing a Gaussian or a Dirichlet process prior on \({\varGamma }\) (Houlsby et al. 2012; Abbasnejad et al. 2013). However, these approaches are computationally expensive, and it can be hard to interpret the resulting joint distribution over weights and characteristics. Instead, we treat \({\varGamma }\) as a parameter to be estimated; in Sect. 3.2.2 we show that we can either find the maximum likelihood value by optimization, or find a heuristic estimator that we show in Sect. 5 performs well in practice at a much lower computational cost.
In the E-step, we use Laplace’s method to approximate the conditional expectation \(E[f^c \mid C,{\varGamma }]\) with the posterior mode \(E[\hat{f^c} \mid C,{\varGamma }]\), as described in Sect. 3.2.1. We then obtain one of the two estimators for \({\varGamma }\) described in Sect. 3.2.2: an optimization-based estimator that corresponds to the exact M-step but is slow to compute, or a heuristic-based estimator that is significantly faster to compute. In practice, we suggest using the heuristic-based estimator; as we show in Sect. 5 this approach strikes a good balance between predictive performance and computational efficiency.
3.2.1 Learning the latent functions \(f^c\) conditioned on \({\varGamma }\)
In this paper we use Laplace’s method, because it is computationally fast and conceptually simple. Laplace’s method is a well-known approximation for posterior inference in regular GPs (Rasmussen and Williams 2006) and simpler preference learning scenarios (Chu and Ghahramani 2005). Laplace’s method aims to approximate the true posterior p with a single Gaussian q, centered on the true posterior mode \(\hat{f^c}\), and with a variance matching a second-order Taylor expansion of p at that point (see Fig. 3). Approximating the posterior with a single multivariate Gaussian allows us to conveniently reuse it as the prior in subsequent Bayesian updates, which is important for online and active learning from user interactions (Saar-Tsechansky and Provost 2004). While the approximation can become poor if the true posterior is strongly multimodal or skewed, prior work has shown this limitation has no significant impact in the preference learning context, e.g., Chu and Ghahramani (2005).
In principle, we could directly apply the Laplace mode and variance calculations used by Chu and Ghahramani (2005), which assume a full covariance matrix. However, doing so would negate the benefit of using a structured covariance function. Instead, we formulate our calculations to exploit properties of our covariance matrix, yielding an algorithm which, as we show later in this section, has better scaling properties than directly applying the algorithms of Chu and Ghahramani (2005).
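For orientation, the sketch below finds a Laplace posterior mode with the textbook dense Newton iteration for a GP with probit likelihood, in the spirit of Rasmussen and Williams (2006). It is not the structured procedure of Algorithm 2: it uses a full covariance matrix, a single latent function, and toy one-dimensional tradeoffs, all of which are assumptions made for illustration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def log_lik_grad(f, y):
    """Gradient of sum_i log Phi(y_i f_i) with respect to f."""
    return y * norm_pdf(f) / norm_cdf(y * f)

def laplace_mode(K, y, max_iter=50, tol=1e-10):
    """Dense Newton iteration for the posterior mode under a probit likelihood."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        ratio = norm_pdf(f) / norm_cdf(y * f)
        grad = y * ratio
        W = ratio ** 2 + y * f * ratio            # diagonal of the negative Hessian
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]
        b = W * f + grad
        a = b - sW * np.linalg.solve(B, sW * (K @ b))
        f_new = K @ a                              # Newton update, never inverts K
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

# Toy data: 1-D tradeoffs, choices follow the sign of t except one inconsistent choice.
t = np.linspace(-1.0, 1.0, 15)
y = np.where(t >= 0.0, 1.0, -1.0)
y[2] = 1.0
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.5) ** 2)
f_hat = laplace_mode(K, y)
```

At the mode, the stationarity condition \(\hat{f} = K \nabla \log p(C \mid \hat{f})\) holds, which provides a simple convergence check.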
Our development of Laplace inference in GaSPK proceeds in two steps. First, we describe an efficient procedure for finding the posterior mode \(\hat{f}\) (Algorithm 2). We then describe how the posterior variance and predictions for new tradeoffs \(t_*\) can be computed (Algorithm 3). Additional mathematical details are provided in “Appendix A”.
Using Eq. (6), we can efficiently compute the posterior mode by following the steps outlined in Algorithm 2. Note that all operations in the algorithm are simple matrix operations available in most programming environments. Furthermore, the operations in lines 6 through 8 are all matrix-vector operations which generate vectors as intermediate results. Rather than calculating the inverse in line 7 explicitly, we use conjugate gradients (Press et al. 2007) to solve the system \((I + L^TKL) x = L^T K b\) by repeatedly multiplying the parenthesized term with candidates for x, as in Cunningham et al. (2008).
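A minimal sketch of this matrix-free solve is shown below. It assumes a single characteristic, so that L reduces to a diagonal matrix, and a two-factor Kronecker covariance; the kernel, lengthscales, and the random values standing in for the diagonal of W and for b are placeholders. The conjugate gradient routine is a plain textbook implementation written out for self-containment.

```python
import numpy as np

def kron_matvec(K_1, K_2, v):
    """(K_1 kron K_2) @ v using only the small factors (row-major vec)."""
    V = v.reshape(K_1.shape[0], K_2.shape[0])
    return (K_1 @ V @ K_2.T).reshape(-1)

def conjugate_gradients(matvec, rhs, tol=1e-10, max_iter=500):
    """Solve A x = rhs for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

g1 = np.linspace(-0.1, 0.1, 21)
g2 = np.array([-1.0, 0.0, 1.0])
K_1 = np.exp(-0.5 * ((g1[:, None] - g1[None, :]) / 0.05) ** 2)
K_2 = np.exp(-0.5 * ((g2[:, None] - g2[None, :]) / 1.0) ** 2)

rng = np.random.default_rng(1)
n = 63
l = np.sqrt(rng.uniform(0.1, 1.0, n))         # L = diag(l), placeholder for chol(W)
b = rng.standard_normal(n)                    # placeholder right-hand side vector

def apply_system(v):
    """v -> (I + L^T K L) v without forming K."""
    return v + l * kron_matvec(K_1, K_2, l * v)

rhs = l * kron_matvec(K_1, K_2, b)            # L^T K b
x = conjugate_gradients(apply_system, rhs)
```

Because K is positive semidefinite, the system matrix has all Eigenvalues at least one, which keeps the iteration well behaved.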
Because K has Kronecker structure and L consists only of diagonal submatrices, multiplications with K and L have linear time and space complexity, hence the overall computational cost is dominated by the \(O(n_Tn_c^3)\) cost of the Cholesky decomposition. Without the Kronecker structure, these multiplications would be \(O(n_T^2n_c^2)\), and their cost would therefore dominate when \(n_T>n_c\).
Figure 4 illustrates the output of Algorithms 2 and 3 for the choices of a single user, using data from a popular preference benchmark dataset (Kamishima and Akaho 2009). Panel (a) shows the posterior mode \(\hat{f_u} = E[f_u]\), which is expectedly high in regions of the tradeoff space perceived as favorable, and low otherwise. The bold line indicates the zero boundary \(\hat{f_u} = 0\), and it is sufficient as a predictor of future choices when predictive certainty estimates are not required. Importantly, it can be computed using only Algorithm 2 and is therefore very fast.
The key distinguishing feature of our probabilistic approach is the variance estimates shown in Panel (b). As shown, the algorithm correctly identifies the region at the center of the panel where the decision boundary already follows a closely determined course to match earlier observations (pale yellow coloring, low variance). If additional observations were to be acquired for the purpose of improving predictions, they should be located in the upper or lower regions of the decision boundary instead, where less evidence is currently available (dark red coloring, high variance). Panel (c) shows the combination of both outputs to compute the predictive probabilities \(p(y=+1 \mid f)\). While the decision boundary at \(p(y = +1 \mid f) = 0.5\) is the same as the one in Panel (a), this panel also incorporates predictive variances by shrinking the predictive probabilities towards indifference \((p = 0.5)\) in high-variance regions [see Eq. (4)]. Consequently, the corridor in which GaSPK is indifferent (intermediate intensity orange coloring, intermediate probabilities) is narrower in areas with extensive evidence from the data, and wider towards the edges of the panel. This information is an important input to subsequent decision-making tasks which require information on whether existing evidence is conclusive enough to make an autonomous decision.
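The shrinkage effect can be illustrated with the standard identity for a probit likelihood averaged over a Gaussian latent value, \(\int \Phi(f)\, N(f; \mu, \sigma^2)\, df = \Phi\left(\mu / \sqrt{1 + \sigma^2}\right)\). We assume here that Eq. (4) takes this form; the means and variances below are made-up numbers.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predictive_probability(mean, variance):
    """P(y = +1) for a Gaussian latent evaluation with the given mean and variance.

    Assumed form: Phi(mean / sqrt(1 + variance)), the usual probit-Gaussian result.
    """
    return std_normal_cdf(mean / math.sqrt(1.0 + variance))

# Same posterior mean, increasing posterior variance (hypothetical values).
p_certain = predictive_probability(1.0, 0.0)
p_moderate = predictive_probability(1.0, 1.0)
p_uncertain = predictive_probability(1.0, 4.0)

# On the decision boundary the prediction is indifferent regardless of variance.
p_boundary = predictive_probability(0.0, 4.0)
```

With the mean held fixed, a larger variance moves the probability towards 0.5, while the location of the decision boundary itself is unaffected.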
3.2.2 Learning user characteristics
To complete our EM-type algorithm, we must estimate the user characteristics \({\varGamma } = \left[ \gamma ^c_u \right] _{u,c}\) from the data. Recall from Sect. 2 that \(\gamma ^c_u\) denotes the fraction of user u’s behavior explained by characteristic c, that is, \(f_{u}(t) = \sum _{c} \gamma _u^c \cdot f^c(t)\) with \(\sum _c \gamma _u^c = 1\). An exact M-step estimator for \({\varGamma }\), which returns \(\arg \max \prod _i \Phi \left( y_i\sum _c \gamma _{u_i}^c\cdot f^c(t_i)\right) \) s.t. \( \sum _c \gamma _u^c = 1, u=1,\dots , n_U\), can be obtained using an interior-point optimizer. This yields a (local) optimum for \({\varGamma }\), but is computationally expensive.
As an alternative, we propose a heuristic approximation to this M-step, described in Algorithm 4. We note that if \(\gamma _u^c>\gamma _u^{c'}\), then \(f_u\) is likely to be closer to \(f^c\) than to \(f^{c'}\). Therefore, approximating \(f_u\) with \(f^c\) is likely to give a higher likelihood than approximating \(f_u\) with \(f^{c'}\). The heuristic “M-step” in line 3 computes an approximation to the likelihood that characteristic c alone generated the observed choices. Each iteration of the surrounding loop calculates one column of the \({\varGamma }\) matrix, corresponding to one characteristic. The resulting user characteristics are then rescaled so that they add to one in line 5.
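One plausible reading of this heuristic is sketched below. It is an assumption, not a reproduction of Algorithm 4: for each characteristic we score every user by the probit likelihood of that user's choices under \(f^c\) alone, and then rescale each user's scores to sum to one. The function name, argument layout, and toy data are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def heuristic_gamma(users, y, F, n_users):
    """Heuristic user characteristics (assumed form).

    users: (n,) integer user index of each observed choice
    y:     (n,) choices in {-1, +1}
    F:     (n_c, n) evaluations f^c(t_i) of each characteristic at each choice
    """
    n_c = F.shape[0]
    log_lik = np.zeros((n_users, n_c))
    for c in range(n_c):                                   # one column of Gamma per characteristic
        ll = np.log(np.clip(norm_cdf(y * F[c]), 1e-300, 1.0))
        log_lik[:, c] = np.bincount(users, weights=ll, minlength=n_users)
    log_lik -= log_lik.max(axis=1, keepdims=True)          # stabilize before exponentiating
    gamma = np.exp(log_lik)
    return gamma / gamma.sum(axis=1, keepdims=True)        # rescale rows to sum to one

# Toy data: user 0 always chooses as characteristic 0 predicts, user 1 as characteristic 1.
users = np.array([0, 0, 0, 1, 1, 1])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
F = np.array([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
              [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0]])
Gamma = heuristic_gamma(users, y, F, n_users=2)
```

Working in log space before normalizing avoids underflow when a user has many observed choices.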
3.2.3 Learning hyperparameters
As in the case of \({\varGamma }\), a full Bayesian treatment of the hyperparameters, \(\theta = \{ l_d \}\), is prohibitively expensive. Prior work has often resorted to either gradient-based optimization of the marginal likelihood Z, e.g., Chu and Ghahramani (2005), or to heuristics, e.g., Stachniss et al. (2009), to learn the hyperparameters from the data. In the experiments that follow, we employ a heuristic and set the lengthscales to the median distance between tradeoffs t. This has been found in prior work to be a computationally fast heuristic yielding consistently good empirical results (Houlsby et al. 2012).
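A small sketch of the median heuristic follows. Since there is one lengthscale \(l_d\) per dimension, we assume here that the median is taken over pairwise absolute differences along each dimension separately; the text does not spell out this detail, and the example tradeoffs are made up.

```python
import numpy as np

def median_lengthscales(T):
    """Per-dimension median of pairwise absolute differences between tradeoffs.

    Assumption: each l_d is computed from dimension d alone.
    """
    T = np.asarray(T, dtype=float)
    rows, cols = np.triu_indices(T.shape[0], k=1)     # each unordered pair once
    return np.array([
        np.median(np.abs(T[rows, d] - T[cols, d]))
        for d in range(T.shape[1])
    ])

# Three hypothetical tradeoffs with d_T = 2.
T = [[0.0, 0.0],
     [1.0, 10.0],
     [3.0, 20.0]]
lengthscales = median_lengthscales(T)
```

For these three points, the pairwise differences are (1, 3, 2) in the first dimension and (10, 20, 10) in the second, so the heuristic picks the medians of those sets.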
4 Related work
Machine learning research has produced preference models based on a broad variety of learning frameworks (Fürnkranz and Hüllermeier 2011). Of particular interest to this research is work on probabilistic preference models that derive principled uncertainty estimates from noisy preference observations. Chu and Ghahramani (2005) were the first to model preference learning using Gaussian processes. However, that approach does not capture heterogeneity across users—an essential property for modeling large, heterogeneous sets of users. More recent work (Bonilla et al. 2010; Birlutiu et al. 2013; Houlsby et al. 2012) has alleviated this shortcoming; however, these contributions have focused on solutions for incorporating heterogeneous preferences, rather than ensuring scalability.
More specifically, the Hierarchical GP model (Birlutiu et al. 2013) is derived from a semiparametric Bayesian formulation that builds on the framework proposed by Bradley and Terry (1952). The authors model each user’s utility function using a GP, which they represent using a basis decomposition \(f_u(x) = w_u^T \phi (x)\). A hierarchical Gaussian prior on the base weights \(w_u\) induces correlations between users (hence our choice of name for the approach). An EM-type algorithm is then used for learning, which iteratively refines the parameters of the hierarchical prior. While the Hierarchical GP model offers state-of-the-art accuracy, inference is computationally expensive since we need to effectively learn a Gaussian process for each user.
The Collaborative GP method of Houlsby et al. (2012) also builds on Bradley and Terry (1952). Like GaSPK, it represents users’ utility functions using a weighted superposition of globally shared GPs. Unlike GaSPK, the weights are unnormalized; this adds a redundant degree of freedom which makes interpretability harder. Further, the weights are treated as random variables to infer rather than parameters to optimize, increasing the computational burden. Another key distinction between GaSPK and Collaborative GP is that the latter operates on pairs of alternative instances \((x^i, x^j)\) instead of the associated tradeoffs \(t = \tau (x^i, x^j)\), and it estimates instance utilities rather than tradeoff evaluations. This makes inference in the model significantly more demanding, and the authors employ a combination of Expectation Propagation and Variational Bayes to address this challenge. As shown in Sect. 5, this design choice yields comparable accuracies to those produced by the Hierarchical GP method at lower computational cost, likely due to the smaller number of GPs. However, in general the Collaborative GP will not scale as well as GaSPK: inference still scales cubically in the number of distinct tradeoffs, and approximating the full posterior over weights adds computational complexity.
In the limit of a single latent characteristic, Collaborative GP reduces to regular GP classification with a specific preference kernel (Rasmussen and Williams 2006; Houlsby et al. 2012). Inference in this model is fast and conceptually simple, and as such GP classification constitutes the strongest computational benchmark for GaSPK. As shown in Sect. 5, GaSPK is valuable because it achieves both GP classification’s computational performance and a substantial improvement in predictive accuracy.
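To make the notion of a preference kernel concrete, the following minimal sketch assumes the form described by Houlsby et al. (2012), in which a base kernel over individual alternatives induces a kernel over pairs of alternatives. The squared-exponential base kernel, the function names, and the length scale are illustrative assumptions and are not taken from the benchmark implementation.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential base kernel between two attribute vectors (illustrative choice)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-0.5 * np.dot(diff, diff) / length_scale**2))

def preference_kernel(pair_a, pair_b, base=rbf):
    """Kernel between two pairs of alternatives (x_i, x_j) and (x_k, x_l).

    Assumed form: k(x_i, x_k) + k(x_j, x_l) - k(x_i, x_l) - k(x_j, x_k).
    """
    xi, xj = pair_a
    xk, xl = pair_b
    return base(xi, xk) + base(xj, xl) - base(xi, xl) - base(xj, xk)

# Two hypothetical pairs of alternatives, each described by two attributes.
p = ([0.0, 1.0], [1.0, 0.0])
q = ([0.5, 0.5], [2.0, 1.0])
print(preference_kernel(p, q))
```

Under this form, swapping the two alternatives within a pair flips the sign of the kernel value, which mirrors the fact that reversing a comparison reverses the preference.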
Bonilla et al. (2010) presented an earlier GP approach that aimed to accommodate heterogeneity amongst users. However, their approach has been shown to be inferior in both predictive accuracy and computational cost relative to other state-of-the-art approaches (Houlsby et al. 2012). Note that Bonilla et al. (2010) make use of a single Kronecker product to multiply one item-covariance matrix with one user-covariance matrix. The purpose of this product is fundamentally different from the manner in which the Kronecker product is used in GaSPK, namely to deal with the growing item-covariance matrix. Furthermore, in Bonilla et al. (2010), capturing heterogeneity yields an even larger matrix, making the resulting method extremely slow, as noted and demonstrated in Houlsby et al. (2012). By contrast, GaSPK uses \((D-1)\) Kronecker products, where D refers to the dimensionality of the data, and the Kronecker product is used to factor the tradeoff-covariance matrix, thereby effectively addressing its growth.
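The computational benefit of a Kronecker-factored covariance matrix can be illustrated with a short sketch. It assumes a covariance of the form \(K = K_1 \otimes \cdots \otimes K_D\) over a full grid and shows that a matrix-vector product can be computed from the D small factors without ever forming K. The function name, the factor sizes, and the random factors are illustrative assumptions; this is the generic technique for structured GPs (cf. Saatci 2011), not the GaSPK implementation itself.

```python
from functools import reduce
import numpy as np

def kron_matvec(factors, b):
    """Compute (K_1 kron ... kron K_D) @ b without forming the full matrix.

    Each factor is applied along one grid axis; the transpose rotates that
    axis to the end so the next factor again acts on the leading axis.
    """
    x = np.asarray(b, dtype=float)
    n_total = x.size
    for K in factors:
        n = K.shape[0]
        X = x.reshape(n, n_total // n)  # leading axis = current dimension
        Z = K @ X                       # apply the small factor
        x = Z.T.reshape(-1)             # rotate axes for the next factor
    return x

rng = np.random.default_rng(0)
sizes = (3, 4, 2)                        # levels per dimension (illustrative)
factors = [rng.standard_normal((n, n)) for n in sizes]
b = rng.standard_normal(int(np.prod(sizes)))

dense = reduce(np.kron, factors)         # D - 1 = 2 Kronecker products
print(np.allclose(dense @ b, kron_matvec(factors, b)))
```

The dense matrix is built here only to check the result; in a grid of realistic size it would be the object one wants to avoid storing.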
Marketing and Econometrics research considered preference measurement methods such as conjoint analysis, Logit and Probit models, and other discrete choice prediction techniques (Greene 2012). Early preference measurement was limited to population-level estimates, but more recent techniques accommodate heterogeneity across consumer segments (Allenby and Rossi 1998; Evgeniou et al. 2007). The primary objective of these models is to inform human decision makers, and thus their outputs are interpretable coefficients. By contrast, GaSPK focuses on preference learning for use in autonomous decision-making settings, and has to consider scalability, incremental updates, and other practical issues that arise when moving from passive preference measurements to autonomous decision-making (Netzer et al. 2008). In Sect. 5, we illustrate these differences by benchmarking GaSPK against the Mixed Logit model, a well-established standard in Marketing and Econometrics. The Mixed Logit model estimates \(f_u^i = w_u x_u^i + \varepsilon _u^i\) where \(\varepsilon _u^i\) is extreme-value distributed, and the \(w_u\) are drawn from a hierarchical prior. Like the other benchmarks, Mixed Logit accommodates random variations in taste among users. This makes inference more difficult than in the standard Logit model, a challenge that is addressed by a computationally expensive sampling procedure. Moreover, as demonstrated in our empirical evaluations, Mixed Logit is less flexible in adapting to the data than the nonparametric models.
5 Empirical evaluation
In the empirical evaluations that follow we consider learning preferences in consumer choice settings and aim to evaluate whether GaSPK offers a valuable addition to the existing set of nonparametric Bayesian approaches that similarly provide principled uncertainty estimates. To do so, we compare GaSPK to three nonparametric Bayesian approaches shown to yield state-of-the-art performance either in scalability or predictive accuracy. We compare the heuristic estimator of Algorithm 4 with an exact M-step, and show that the heuristic method offers comparable accuracy with much lower computational cost. Further, we show that, compared with other methods, GaSPK (using this heuristic estimator) offers an impressive combination of predictive accuracy and computational efficiency. Our evaluations are performed on an electricity tariff preference dataset collected specifically for this work, as well as two benchmark datasets used earlier in the literature. In our implementation of GaSPK we used the GPstuff toolbox (Vanhatalo et al. 2013), and we make both our code and data publicly available at https://bitbucket.org/gtmanon/gtmanon.
5.1 Datasets and benchmark methods
The first benchmark with which we compare GaSPK is GP classification (Rasmussen and Williams 2006; Houlsby et al. 2012), a nonparametric Bayesian method that exhibits state-of-the-art scalability and which GaSPK aims to approximate. Inference was performed using expectation propagation. Because scalability often comes at the cost of predictive accuracy, this evaluation aims to establish whether GaSPK is able to match GP classification’s scalability for an increasing number of users while producing better predictive accuracy, thereby offering an advantageous augmentation of the nonparametric Bayesian toolset for a large number of users.
We also compare GP classification and GaSPK to nonparametric Bayesian approaches that are shown to yield state-of-the-art predictive performance but to be computationally intensive. These comparisons aim to assess the relative loss in accuracy incurred by GaSPK and GP classification to produce their respective scalabilities. To do so, we evaluate the performance of Hierarchical GP (Birlutiu et al. 2013) and Collaborative GP (Houlsby et al. 2012), both shown to yield state-of-the-art accuracy but to be computationally more expensive. As done in prior work (Houlsby et al. 2012), to allow evaluations with these computationally intensive methods, the data sets used in the empirical evaluations include a moderate number of users. As we will see, the differences in scalability are clearly apparent even for these data sets, and the performances differ by orders of magnitude.^{5}
We also contrast the nonparametric Bayesian approaches with the well-established, parametric Mixed Logit model (see above). These results will aim to establish the benefits from adopting a nonparametric Bayesian framework in our setting. Mixed Logit estimates \(f_u^i = w_u x_u^i + \varepsilon _u^i\) where \(\varepsilon _u^i\) is extreme-value distributed, and the \(w_u\) are drawn from a hierarchical prior. Like the other benchmarks, Mixed Logit accommodates variations in taste among users. This makes inference more difficult than in the standard Logit model, a challenge that is addressed by a computationally expensive sampling procedure. Moreover, in comparison to the nonparametric models, Mixed Logit is also less flexible in adapting to the data. In the evaluations reported below we used the implementation by Train (2003).
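The role of sampling in Mixed Logit can be illustrated with a small sketch. With extreme-value noise, the probability that a user with weights \(w_u\) prefers alternative i to alternative j is a logistic function of the utility difference; averaging this probability over draws of \(w_u\) gives a simulated choice probability. The Gaussian mixing distribution, the function name, and all parameter values below are illustrative assumptions; the text above only specifies that the \(w_u\) follow a hierarchical prior, and this is not the estimation code of Train (2003).

```python
import numpy as np

def mixed_logit_choice_prob(x_i, x_j, mu, cov, n_draws=2000, seed=0):
    """Simulated probability that alternative i is chosen over alternative j.

    Taste weights are drawn from an assumed Gaussian N(mu, cov); for each draw
    the logit probability of the utility difference is computed, and the
    draws are averaged.
    """
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(mu, cov, size=n_draws)   # (n_draws, n_attributes)
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    utility_gap = w @ diff                               # one gap per draw
    return float(np.mean(1.0 / (1.0 + np.exp(-utility_gap))))

# Hypothetical tariffs described by (price, share of renewable energy).
mu = np.array([-1.0, 0.5])      # on average: dislike price, like renewables
cov = np.diag([0.25, 0.25])     # heterogeneity in taste across users
print(mixed_logit_choice_prob([0.10, 1.0], [0.12, 0.2], mu, cov))
```

Every likelihood evaluation requires such an average over draws, which is one way to see why inference is more costly than in the standard Logit model.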
We evaluated the methods on three preference datasets collected from human decision-makers. Recall that a key motivation for this work is the need for computationally fast and scalable preference models for contemporary applications, such as automating decisions in dynamic energy markets. One application domain of significant global importance is the modeling of electricity tariff choices of smart grid consumers. In future smart grids, tariffs may be revised frequently and automatically to reflect changes in the cost and availability of renewable energy (such as solar or wind); consequently, tariff choice is expected to become a near-continuous process in which both retailers and customers will rely on automated, data-driven decision agents. The ability to predict and act upon tariff choices quickly and with adequate accuracy is therefore an important challenge. To evaluate our approach in this setting, we used data on real electricity tariffs from the Texas retail electricity market. This retail market is the most advanced in the United States, and it provides daily information on available tariffs (see http://www.powertochoose.org). Using the Amazon Mechanical Turk crowdsourcing platform, we acquired data on American participants’ choices between pairs of tariffs offered in Austin, Texas in February 2013. Tariff preferences were acquired on randomly drawn tariff pairs from a set of 261 tariffs, and all modeling techniques were evaluated on predicting consumers’ preference for the same pairs of tariffs.
“Appendix B” provides complete details on the Tariffs data set collection, as well as an example of tariff choice (Table 3). The Tariffs data set reflects important characteristics of many real world applications where the data correspond to many choice alternatives (tariffs), but relatively few observed choices per individual user (see Table 2). As commonly encountered in practice, choices of different users likely correspond to different alternatives and are thus sparsely distributed and difficult to generalize from. This property of the Tariffs dataset is common in real world applications, but is not reflected in other benchmark datasets.
Our evaluations on the Tariffs data set are complemented with two benchmark datasets. Specifically, we used the Cars dataset that contains stated preferences for automobile purchases (Abbasnejad et al. 2013), and the Elections dataset compiled by Houlsby et al. (2012), which captures voters’ revealed preferences over eight political parties in the United Kingdom. The full Elections dataset contains 20 tradeoff dimensions, resulting in a Kronecker covariance matrix that was too large to hold in memory. As described in Sect. 3.1, our tradeoff function \(\tau (x^1,x^2)\) need not involve all dimensions of X, and indeed prior research (Hensher 2006) indicates that, when the number of dimensions is large, users tend to base their choices on a smaller subset of dimensions. We therefore applied greedy forward feature selection to reduce the dimensionality of this dataset to a subset of important predictive features, such that the accuracies after feature selection were comparable to those reported by Houlsby et al. (2012) on the complete feature set using the most accurate benchmark method (Birlutiu et al. 2013).
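Table 2 Characteristics of the datasets used in this study
| Dataset | Instances | Users | Tradeoffs (stated preferences) | Orig. dim. | Sel. dim. | Grid size |
| --- | --- | --- | --- | --- | --- | --- |
| Tariffs | 261 | 61 | 610 | 9 | 9 | 12,288 |
| Cars | 10 | 53 | 2362 | 5 | 5 | 216 |
| Elections | 8 | 264 | 7392 | 20 | 8 | 30,375 |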
GaSPK was applied to versions of the datasets in which the continuous attributes were discretized to between 5 and 25 levels, with the objective of minimizing information loss while keeping the resulting grid size manageable.^{6} All other methods ran on the original, non-discretized datasets. We employed the Natural Breaks algorithm (Jenks and Caspall 1971) to identify bins for discretization. Natural Breaks is a univariate variation of the k-means algorithm, which selects bin boundaries such that within-bin variances are minimized while between-bin variances are maximized.
In the empirical evaluations below, we aim to evaluate whether GaSPK offers advantageous improvements over the existing, scalable GP classification. Simultaneously, the state-of-the-art predictive accuracies exhibited by Hierarchical GP and Collaborative GP allow us to assess the reduction in predictive accuracy that the computational benefits of GP classification and GaSPK entail.
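As a rough illustration of this kind of discretization, the sketch below bins a single continuous attribute with plain Lloyd-style k-means in one dimension. It is a stand-in for the idea of minimizing within-bin variance, not the exact Jenks and Caspall (1971) procedure and not the code used in the experiments; the function names, the quantile initialization, and the example values are assumptions.

```python
import numpy as np

def kmeans_1d_bins(values, n_levels, n_iter=100):
    """Find n_levels bin centres and the boundaries between them for one attribute."""
    x = np.sort(np.asarray(values, dtype=float))
    # Initialise centres at evenly spaced quantiles of the data.
    centres = np.quantile(x, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        # Assign each value to its nearest centre.
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        new_centres = centres.copy()
        for k in range(n_levels):
            members = x[labels == k]
            if members.size:          # keep the old centre if a bin is empty
                new_centres[k] = members.mean()
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    centres = np.sort(centres)
    # Bin boundaries are the midpoints between neighbouring centres.
    edges = (centres[:-1] + centres[1:]) / 2.0
    return centres, edges

def discretize(values, edges):
    """Map each continuous value to the index of its bin."""
    return np.digitize(values, edges)

prices = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.9, 10.0, 10.1]   # hypothetical attribute
centres, edges = kmeans_1d_bins(prices, n_levels=3)
print(discretize(prices, edges))
```

Applying such a routine to each attribute separately yields the per-dimension levels whose Cartesian product forms the grid.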
5.2 Model scalability and predictive accuracy
Figure 5 shows the training time incurred by each approach for increasing training set sizes. Training times correspond to running Algorithms 2 through 4. As expected, GP classification has the fastest training time, since there is only a single GP to be learned. While the cost of matrix inversion increases cubically with the number of distinct tradeoffs, the simplicity of the GP classification model means this cost remains small relative to other operations and we see little change in the training time as we increase the number of observations. By contrast, the more sophisticated comparison methods display a clear increase in computational cost as we increase the size of the training set.
We now aim to establish whether GaSPK can yield improved accuracy over GP classification, thereby offering an advantageous addition to the set of highly scalable methods, and whether the time-saving heuristic EM algorithm is effective in practice. Figure 6 presents the proportion of correctly predicted test choices as a function of the training set sizes shown in Fig. 5. We note that in all cases, the heuristic version of the GaSPK algorithm shows comparable performance to the optimization-based version, motivating its use. In all subsequent experiments, we consider only this heuristic-based algorithm.
As shown, for the Tariffs dataset, the two GaSPK variants exhibit the highest predictive accuracy relative to the accuracy offered by GP classification and that of the computationally slower approaches. It is useful to recall here that, similar to common real-world choice settings, the Tariffs dataset contains many alternatives but relatively few choice observations for each user (see Table 2). GaSPK’s focus on estimating the \(f^c\) is likely instrumental in this setting relative to other methods’ focus on determining user characteristics \({\varGamma }\). As compared to GP classification, GaSPK yields comparable scalability as well as improved accuracy on the Tariffs and Elections data sets. On the Cars data set, GP classification performs well, suggesting that the problem is fairly simple and does not benefit from the additional modeling flexibility afforded by the personalized models. Here, the optimization-based GaSPK performs as well as the best competing algorithm, while the heuristic GaSPK performs slightly less well than the other GP-based methods. We hypothesize that, in this simple setting, where the modeling flexibility of the personalized models does not seem to yield significant advantages, the approximation induced by the heuristic has a more notable effect. Importantly, as compared to the computationally intensive approaches’ predictive performances, when we use the heuristic EM Algorithm 4, GaSPK’s fast training and scalability are accompanied by predictive accuracies that are consistently good across domains, and which are not significantly worse than those of the most accurate and computationally intensive methods. In the Elections dataset, we note that GaSPK performs comparably with the competing methods even when the competing methods have access to the full set of 20 predictors, rather than the subset of 8 used by GaSPK.
Indeed, in support of the hypothesis that users tend to base their choices on a smaller subset of dimensions, the comparison methods only report a modest improvement when using 20 rather than 8 predictors.
By contrast, GP classification’s scalability comes at the cost of highly inconsistent predictive performance—GP classification yields particularly poor predictions on the Elections dataset, where it is unable to benefit from additional training data.
Key to our discussion is that additional training data allows nonparametric methods to capture more predictive structure in the data, as reflected by the inclining accuracy curves (see, in particular, Fig. 6b). In sharp contrast, the parametric Mixed Logit fails to benefit from additional data because its fixed set of parameters underfits larger training sets. A related effect can be observed in GP classification’s performance on the Elections dataset. On this revealed preference dataset, the Hierarchical and Collaborative GP methods benefit substantially from additional training data early on in the learning curve. As shown, once a representative training sample is available, these methods are able to exploit more observations to capture the heterogeneity in the data. GP classification, however, benefits less from additional training data—its single latent characteristic yields a significant speedup in computation, but it also undermines the flexibility to capture the rich heterogeneity inherent in the Elections data.
In summary, GaSPK strikes a new and advantageous balance relative to existing approaches by offering a combination of the scalability of GP classification with the modeling flexibility and expressiveness of more complicated and computationally costly nonparametric GP approaches. GaSPK effectively adapts to the complexity in the data while scaling gracefully as more data becomes available. GaSPK’s scalability along with its consistently good predictive performance suggests that GaSPK can often be the method of choice in large-scale applications involving a large number of users and observations.
5.3 Dimensionality characteristics
GaSPK aims to produce state-of-the-art scalability in the number of observations to accommodate real-world applications with a large number of observed choices. Our solution is inspired by prior findings that human choices are determined by a small number of dimensions. GaSPK has thus been designed to provide superior scalability for learning and inference when tradeoffs can be characterized by a small number of dimensions. Our experiments also demonstrate that dimensionality reduction in these settings incurs only a modest loss in predictive accuracy. The tradeoff inherent in GaSPK’s ability to offer state-of-the-art scalability in the number of observations and consistently good predictive performance is typical of structured GP methods: GaSPK is fast and highly scalable with respect to the number of observations for low-dimensional settings, while it is unsuitable in domains with high dimensions as this yields exponential growth in its grid size.
Figure 7 shows the resulting training times for several dimensionalities (panels) and levels of discretization (three GaSPK lines per plot). GaSPK’s computational costs are dominated by the fixed cost associated with a given grid size. In particular, because GaSPK’s grid grows exponentially in the number of dimensions, this fixed cost outgrows the variable cost of other methods as the data’s dimensionality increases (see Panel (c) for 9 dimensions and 7 levels). At the same time, as shown in Fig. 7, GaSPK’s training curves are relatively flat; thus, it scales better for large numbers of users and choices in the consumer choice settings for which it is designed.
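The exponential growth of the grid can be made concrete with a few lines of arithmetic. The sketch assumes that the grid is the full Cartesian product of the discretization levels in each dimension, so that its size is the product of the per-dimension level counts; the function name and the chosen dimensionalities are illustrative.

```python
from math import prod

def grid_size(levels_per_dimension):
    """Number of grid points when every combination of levels is represented."""
    return prod(levels_per_dimension)

# Seven levels per dimension, for increasing numbers of dimensions.
for n_dims in (3, 6, 9):
    print(n_dims, grid_size([7] * n_dims))
# 3 dimensions ->        343 points
# 6 dimensions ->    117,649 points
# 9 dimensions -> 40,353,607 points
```

Adding dimensions multiplies the grid size, whereas adding observations leaves it unchanged, which is consistent with the flat training curves and the large fixed cost discussed above.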
5.4 Sparse approximation quality
Figure 8 depicts the posterior variance for the first user from a popular preference benchmark dataset, and for varying numbers of eigenvectors. Note that the general shape of the posterior variance is similar in all three panels, which indicates that our sparse Laplace approach delivers reasonable results starting from small \(n_e\) values. Differences between panels are primarily limited to the step from \(n_e = 10\) [Panel (a)] to \(n_e = 100\) [Panel (b)]. In Panel (b), the low-variance area at the center of the panel is noticeably larger than in Panel (a). Surrounding areas similarly shift to lower variances. The subsequent step to \(n_e = 1000\) [Panel (c)] entails almost no further change in posterior variance. A quantitative analysis supports this interpretation: when the model was learned on a randomly selected training set of 80% of the data and evaluated on the remainder, the log predictive likelihood (two standard errors) was \(-0.4988\) (0.0067) for \(n_e=10\), and \(-0.4992\) (0.0065) for both \(n_e=100\) and \(n_e=1000\).
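For readers who wish to compute a comparable summary statistic, the following sketch evaluates the mean log predictive likelihood of held-out binary choices together with two standard errors. It assumes that the standard error is taken across test observations, which may differ from the procedure behind the numbers above; the function name and the example inputs are hypothetical.

```python
import numpy as np

def mean_log_predictive(p_pred, y):
    """Mean log predictive likelihood and two standard errors.

    p_pred: predicted probability that the first alternative is chosen.
    y:      observed choices, 1 if the first alternative was chosen, else 0.
    """
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1.0 - 1e-12)
    y = np.asarray(y)
    # Log probability assigned to the choice that was actually observed.
    log_lik = np.where(y == 1, np.log(p), np.log1p(-p))
    mean = float(log_lik.mean())
    two_se = float(2.0 * log_lik.std(ddof=1) / np.sqrt(log_lik.size))
    return mean, two_se

p_pred = [0.9, 0.2, 0.7, 0.6, 0.4]   # hypothetical test predictions
y = [1, 0, 1, 0, 0]                  # hypothetical observed choices
print(mean_log_predictive(p_pred, y))
```

Because each term is the logarithm of a probability, the statistic is never positive, and values closer to zero indicate better calibrated predictions.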
6 Discussion and conclusions
The GaSPK preference model we develop here aims to offer a novel and advantageous balance between computational scalability and predictive performance targeted at preference learning, such as in consumer choice settings, involving a large number of users and alternatives. GaSPK provides state-of-the-art scalability when human choices are informed by a small set of dimensions, allowing it to accommodate data on a large number of users and observations. These properties are particularly critical in important emerging applications, including modeling preferences in smart electric grids and in complex Business-to-Business marketplaces, where preference models must be learned in real time from a large number of users and observations. GaSPK provides principled probabilistic uncertainty estimates that are fundamental for automated, data-driven decisions.
Figure 9 summarizes the settings under which different approaches are beneficial, and when GaSPK constitutes a new benchmark and an advantageous tradeoff. We show that for settings with large numbers of users and choice observations where choices can be effectively characterized with few dimensions and levels, GaSPK offers good scalability as well as consistently good predictive performance. Thus, GaSPK offers a new benchmark that can often be the method of choice in these settings. In settings where both the dimensionality and the number of observations are high, GP classification provides predictions as fast as GaSPK does in lower-dimensional settings, but its predictive performance remains inconsistent due to its limited expressive power to capture complex patterns. When the number of users and observations is small, Hierarchical GP and Collaborative GP are feasible, and they offer state-of-the-art predictive performance.
The research we present here focuses on fast learning of probabilistic tradeoff evaluations \(f^c\) that characterize different segments of the user population. We solved the related problem of learning what combination of these evaluations describes each user through a simple, yet effective iterative scheme. We find that existing alternatives to this simple iterative scheme entail significantly higher computational costs, making them impractical for the settings we consider. It would be valuable for future work to explore alternatives that learn the number of characteristics \(n_c\) from the data at a reasonable cost.
GaSPK learns from pairwise choices of the form “User u prefers alternative a to alternative b,” which are objective and cognitively less demanding for humans to express, but which are also more difficult to learn from than learning from ratings. However, the natural separation between model and observations inherent in Bayesian modeling makes it possible to adapt GaSPK to learn from other data types, in addition to pairwise choices. In particular, Jensen and Nielsen (2014) provide likelihood models for ordinal ratings that are compatible with the framework underlying GaSPK, and that would allow GaSPK to learn from heterogeneous observations of pairwise choices and ratings simultaneously.
The contributions presented here towards efficient and scalable inference also generalize to other important classification problems such as those arising in credit scoring, quality assurance, and other impactful practical challenges. As such, GaSPK offers meaningful contributions to a broad range of domains, where its reliable and consistent computational and predictive performance make it suitable for supporting users’ autonomous decision-making.
Footnotes
 1.
One can alternatively formulate the tradeoff using percentage increases or any other relevant transformation. Such alternative transformations may increase the interpretability of the model’s outputs; however, in our experiments we found them to have a negligible effect on the performance of our approach. We conjecture that it is also possible to learn the \(\tau \) mapping from the data using, e.g., warped Gaussian processes (Snelson et al. 2004).
 2.
In some models, the Probit likelihood also includes a noise variance term, \(p(y \mid \cdot ) = \Phi \left( \frac{f_{u}(t)}{\sigma _n^2}\right) \). However, because our tradeoff evaluation interpretation of the \(f_{u}(t)\) is invariant under scaling, we set \(\sigma _n^2 = 1\) without loss of generality.
 3.
For two arbitrarily sized matrices A, B, the Kronecker product is defined as:
$$\begin{aligned} A \otimes B := \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}. \end{aligned}$$
 4.
ForwardSolve denotes the operation that solves the linear system \(A x = b\) for x. BackwardSolve similarly solves \(x A = b\).
 5.
Even for such small size data sets it was not possible to evaluate the method proposed by Bonilla et al. (2010)—as noted by the authors, this method is not suitable for modeling a large number of users. This is in agreement with the findings of Houlsby et al. (2012), who show that the method of Bonilla et al. (2010) is both slower and achieves lower predictive performance than the Collaborative GP. This limitation underscores the practical significance of scalable approaches with respect to the number of users.
 6.
On our hardware, we restricted overall grid sizes to \(10^4\) to \(10^5\) points.
Notes
Acknowledgements
Funding was provided by National Science Foundation (Grant No. IIS1447721).
References
 Abbasnejad, E., Sanner, S., Bonilla, E., & Poupart, P. (2013). Learning community-based preferences via Dirichlet process mixtures of Gaussian processes. In Proceedings of the 23rd international joint conference on artificial intelligence (pp. 1213–1219). AAAI Press.
 Adomavicius, G., Gupta, A., & Zhdanov, D. (2009). Designing intelligent software agents for auctions with limited information feedback. Information Systems Research, 20(4), 507.
 Allenby, G. M., & Rossi, P. E. (1998). Marketing models of consumer heterogeneity. Journal of Econometrics, 89(1), 57–78.
 Baker, E. W. (2013). Relational model bases: A technical approach to real-time business intelligence and decision making. Communications of the Association for Information Systems, 33(1), 23.
 Bichler, M., Gupta, A., & Ketter, W. (2010). Designing smart markets. Information Systems Research, 21(4), 688–699.
 Birlutiu, A., Groot, P., & Heskes, T. (2013). Efficiently learning the preferences of people. Machine Learning, 90, 1–28.
 Bishop, C. M. (2006). Pattern recognition and machine learning (Vol. 4). New York: Springer.
 Bonilla, E. V., Guo, S., & Sanner, S. (2010). Gaussian process preference elicitation. In J.D. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel & A. Culotta (Eds.), Advances in neural information processing systems (pp. 262–270). MIT Press.
 Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39, 324–345.
 Caussade, S., Ortúzar, J., Rizzi, L. I., & Hensher, D. A. (2005). Assessing the influence of design dimensions on stated choice experiment estimates. Transportation Research Part B: Methodological, 39(7), 621–640.
 Chu, W., & Ghahramani, Z. (2005). Preference learning with Gaussian processes. In Proceedings of the 22nd international conference on machine learning (pp. 137–144). ACM.
 Cunningham, J. P., Shenoy, K. V., & Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. In International conference on machine learning (pp. 192–199). ACM.
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 39, 1–38.
 Evgeniou, T., Boussios, C., & Zacharia, G. (2005). Generalized robust conjoint estimation. Marketing Science, 24(3), 415–429.
 Evgeniou, T., Pontil, M., & Toubia, O. (2007). A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. Marketing Science, 26(6), 805–818.
 Fürnkranz, J., & Hüllermeier, E. (2011). Preference learning. New York: Springer.
 Gilboa, I., & Schmeidler, D. (1995). Case-based decision theory. The Quarterly Journal of Economics, 110(3), 605–639.
 Greene, W. (2012). Econometric analysis (7th ed.). Upper Saddle River: Prentice Hall.
 Guo, S., & Sanner, S. (2010). Real-time multiattribute Bayesian preference elicitation with pairwise comparison queries. In International conference on artificial intelligence and statistics (pp. 289–296).
 Hensher, D. A. (2006). How do respondents process stated choice experiments? Attribute consideration under varying information load. Journal of Applied Econometrics, 21(6), 861–878.
 Houlsby, N., Huszar, F., Ghahramani, Z., & Hernández-Lobato, J. M. (2012). Collaborative Gaussian processes for preference learning. In F. Pereira, C. Burges, L. Bottou & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 25, pp. 2096–2104). MIT Press.
 Jenks, G. F., & Caspall, F. C. (1971). Error on choroplethic maps: Definition, measurement, reduction. Annals of the Association of American Geographers, 61(2), 217–244.
 Jensen, B. S., & Nielsen, J. B. (2014). Pairwise judgements and absolute ratings with Gaussian process priors. Technical report IMM6151, Technical University of Denmark.
 Kahlen, M., Ketter, W., & van Dalen, J. (2014). Balancing with electric vehicles: A profitable business model. In Proceedings of the 22nd European conference on information systems (pp. 1–16). Tel Aviv, Israel.
 Kamishima, T., & Akaho, S. (2009). Efficient clustering for orders. In D.A. Zighed, S. Tsumoto, Z.W. Ras, H. Hacid (Eds.), Mining complex data. Studies in Computational Intelligence (Vol. 165). Berlin, Heidelberg: Springer.
 Kassakian, J. G., & Schmalensee, R. (2011). The future of the electric grid: An interdisciplinary MIT study. Technical report, Massachusetts Institute of Technology. ISBN: 9780982800867.
 Kohavi, R., Mason, L., Parekh, R., & Zheng, Z. (2004). Lessons and challenges from mining retail e-commerce data. Machine Learning, 57(1–2), 83–113.
 Lichtenstein, S., & Slovic, P. (2006). The construction of preference. Cambridge: Cambridge University Press.
 MacKay, D. J. (1998). Introduction to Gaussian processes. NATO ASI Series F Computer and Systems Sciences, 168, 133–166.
 Netzer, O., Toubia, O., Bradlow, E. T., Dahan, E., Evgeniou, T., Feinberg, F. M., et al. (2008). Beyond conjoint analysis: Advances in preference measurement. Marketing Letters, 19(3), 337–354.
 Paolacci, G., Chandler, J., & Ipeirotis, P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411–419.
 Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical report, Technical University of Denmark.
 Peters, M., Ketter, W., Saar-Tsechansky, M., & Collins, J. E. (2013). A reinforcement learning approach to autonomous decision-making in smart electricity markets. Machine Learning, 92, 5–39.
 Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Numerical recipes: The art of scientific computing (3rd ed.). Cambridge: Cambridge University Press.
 Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6, 1939–1959.
 Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge: MIT Press.
 Saar-Tsechansky, M., & Provost, F. (2004). Active sampling for class probability estimation and ranking. Machine Learning, 54(2), 153–178.
 Saatci, Y. (2011). Scalable inference for structured Gaussian process models. Ph.D. thesis, University of Cambridge.
 Snelson, E., Rasmussen, C. E., & Ghahramani, Z. (2004). Warped Gaussian processes. Advances in Neural Information Processing Systems, 16, 337–344.
 Stachniss, C., Plagemann, C., & Lilienthal, A. J. (2009). Learning gas distribution models using sparse Gaussian process mixtures. Autonomous Robots, 26(2–3), 187–202.
 Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273.
 Train, K. (2003). Discrete choice methods with simulation. Cambridge: Cambridge University Press.
 Tversky, A., & Simonson, I. (1993). Context-dependent preferences. Management Science, 39(10), 1179–1189.
 Valogianni, K., Ketter, W., Collins, J., & Zhdanov, D. (2014). Effective management of electric vehicle storage using smart charging. In Proceedings of 28th AAAI conference on artificial intelligence.
 Vanhatalo, J., Pietiläinen, V., & Vehtari, A. (2010). Approximate inference for disease mapping with sparse Gaussian processes. Statistics in Medicine, 29(15), 1580–1607.
 Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., & Vehtari, A. (2013). GPstuff: Bayesian modeling with Gaussian processes. The Journal of Machine Learning Research, 14(1), 1175–1179.
 Watson, R. T., Boudreau, M. C., & Chen, A. J. (2010). Information systems and environmentally sustainable development: Energy Informatics and new directions for the IS community. Management Information Systems Quarterly, 34(1), 23.
 Widergren, S. E., Roop, J. M., Guttromson, R. T., & Huang, Z. (2004). Simulating the dynamic coupling of market and physical system operations. In IEEE power engineering society general meeting (pp. 748–753). IEEE.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.