Dyad ranking using Plackett–Luce models based on joint feature representations
Abstract
Label ranking is a specific type of preference learning problem, namely the problem of learning a model that maps instances to rankings over a finite set of predefined alternatives. Like in conventional classification, these alternatives are identified by their name or label while not being characterized in terms of any properties or features that could be potentially useful for learning. In this paper, we consider a generalization of the label ranking problem that we call dyad ranking. In dyad ranking, not only the instances but also the alternatives are represented in terms of attributes. For learning in the setting of dyad ranking, we propose an extension of an existing label ranking method based on the Plackett–Luce model, a statistical model for rank data. This model is combined with a suitable feature representation of dyads. Concretely, we propose a method based on a bilinear extension, where the representation is given in terms of a Kronecker product, as well as a method based on neural networks, which allows for learning a (highly nonlinear) joint feature representation. The usefulness of the additional information provided by the feature description of alternatives is shown in several experimental studies. Finally, we propose a method for the visualization of dyad rankings, which is based on the technique of multidimensional unfolding.
Keywords
Preference learning · Label ranking · Dyad ranking · Plackett–Luce model · Neural networks · Multidimensional unfolding
1 Introduction
Preference learning is an emerging subfield of machine learning, which deals with the induction of preference models from observed or revealed preference information (Fürnkranz and Hüllermeier 2010). Such models are typically used for prediction purposes, for example to predict context-dependent preferences of individuals on various choice alternatives. Depending on the representation of preferences, individuals, alternatives, and contexts, a large variety of preference models are conceivable, and many such models have already been studied in the literature.
A specific type of preference learning problem is the problem of label ranking, namely the problem of learning a model that maps instances to rankings (total orders) over a finite set of predefined alternatives (Vembu and Gärtner 2010). An instance, which defines the context of the preference relation, is typically characterized in terms of a set of attributes or features; for example, an instance could be a person described by properties such as sex, age, income, etc. As opposed to this, the alternatives to be ranked, e.g., the political parties of a country, are only identified by their name (label), while not being characterized in terms of any properties or features.
In this paper, we introduce dyad ranking as a generalization of the label ranking problem. In dyad ranking, not only the instances but also the alternatives are represented in terms of attributes. For learning in the setting of dyad ranking, we propose an extension of an existing label ranking method based on the Plackett–Luce (PL) model, a statistical model for rank data (Luce 1959; Plackett 1975). The extension essentially consists of expressing the parameters of this model as functions of a suitably defined joint feature representation of dyads.
To this end, we propose two approaches. The first one is based on a bilinear extension of an existing linear version of the PL model, where the representation is given in terms of a Kronecker product (Schäfer and Hüllermeier 2015). The second one is much more flexible and makes use of a neural network that allows for learning a highly nonlinear joint feature representation. The usefulness of the additional information provided by the feature description of alternatives is shown in several experimental studies.
Another contribution of the paper is a method for the visualization of dyad rankings, which is based on the technique of multidimensional unfolding. While this technique has been used in statistics for quite a while, it has not received much attention in machine learning so far (Murata et al. 2017).
The paper is organized as follows. In the next section, we introduce the problem of dyad ranking. Following a discussion of related problem settings in Sect. 3, we propose the joint-feature Plackett–Luce model in Sect. 4. In Sects. 5 and 6, we introduce two instantiations of this model, the first based on the Kronecker product for representing dyads, and the second making use of neural networks. In Sect. 7, we propose a method for the visualization of dyad rankings, which is based on the technique of multidimensional unfolding. All methods are evaluated experimentally in Sect. 8.
2 Dyad ranking
As will be explained in more detail later on (cf. Sect. 3), the learning problem addressed in this paper has connections to several existing problems in the realm of preference learning. In particular, it can be seen as a combination of dyadic prediction (Menon and Elkan 2010a, b, c) and label ranking (Vembu and Gärtner 2010), hence the term “dyad ranking”. Since our method for tackling this problem is an extension of a label ranking method, we will introduce dyad ranking here as an extension of label ranking.
2.1 Label ranking
Let \(\mathcal {Y} = \{ y_1, \ldots , y_K\}\) be a finite set of (choice) alternatives; adhering to the terminology commonly used in supervised machine learning, and accounting for the fact that label ranking can be seen as an extension of multiclass classification, the \(y_i\) are also called class labels or simply labels. We consider total order relations \(\succ \) on \(\mathcal {Y}\), that is, complete, transitive, and antisymmetric relations, where \(y_i \succ y_j\) indicates that \(y_i\) precedes \(y_j\) in the order. Since a ranking can be seen as a special type of preference relation, we shall also say that \(y_i \succ y_j\) indicates a preference for \(y_i\) over \(y_j\). We interpret this order relation in a wide sense, so that \(a \succ b\) can mean that the alternative a is more liked by a person than alternative b, but also for example that an algorithm a outperforms algorithm b.
Formally, a total order \(\succ \) can be identified with a permutation \(\pi \) of the set \([K]= \{1 , \ldots , K \}\), such that \(\pi (i)\) is the index of the label on position i in the permutation. We denote the class of permutations of [K] (the symmetric group of degree K) by \(\mathbb {S}_K\). By abuse of terminology, though justified in light of the above one-to-one correspondence, we refer to elements \(\pi \in \mathbb {S}_K\) as both permutations and rankings.
In the setting of label ranking, preferences on \(\mathcal {Y}\) are “contextualized” by instances \(\varvec{x} \in \mathcal {X}\), where \(\mathcal {X}\) is an underlying instance space. Thus, each instance \(\varvec{x}\) is associated with a ranking \(\succ _{\varvec{x}}\) of the label set \(\mathcal {Y}\) or, equivalently, a permutation \(\pi _{\varvec{x}} \in \mathbb {S}_K\). More specifically, since label rankings do not necessarily depend on instances in a deterministic way, each instance \(\varvec{x}\) is associated with a probability distribution \(\mathbf {P}( \cdot \, | \,\varvec{x})\) on \(\mathbb {S}_K\). Thus, for each \(\pi \in \mathbb {S}_K\), \(\mathbf {P}( \pi \, | \,\varvec{x})\) denotes the probability to observe the ranking \(\pi \) in the context specified by \(\varvec{x}\).
As an illustration, suppose \(\mathcal {X}\) is the set of people characterized by attributes such as sex, age, profession, and marital status, and labels are music genres: \(\mathcal {Y}=\{ \mathtt {Rock}, \mathtt {Pop}, \mathtt {Classic}, \mathtt {Jazz}\}\). Then, for \(\varvec{x} = (m,30,\text {teacher}, \text {married})\) and \(\pi =(2,1,3,4)\), \(\mathbf {P}(\pi \, | \,\varvec{x})\) denotes the probability that a 30-year-old married man, who is a teacher, prefers Pop music to Rock to Classic to Jazz.
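The permutation convention can be checked on this example. The following minimal Python sketch decodes \(\pi \) into an ordered list of labels; the helper name `ranking_from_permutation` is hypothetical, and indices are 1-based as in the text.

```python
labels = ["Rock", "Pop", "Classic", "Jazz"]  # y_1, ..., y_4


def ranking_from_permutation(pi, labels):
    """Return the labels from best to worst; pi[i-1] is the 1-based index of the label on position i."""
    return [labels[k - 1] for k in pi]


ranking = ranking_from_permutation((2, 1, 3, 4), labels)
print(" > ".join(ranking))  # Pop > Rock > Classic > Jazz
```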
2.2 Dyad ranking as an extension of label ranking
In the setting of label ranking as introduced above, instances are supposed to be characterized in terms of properties; typically, an instance is represented as an r-dimensional feature vector \(\varvec{x} =(x_1, \ldots , x_r)\). As opposed to this, the alternatives to be ranked, the labels \(y_i\), are only identified by their name, just like categories in classification.
Needless to say, a learner may benefit from knowledge about properties of the alternatives, too. In fact, if the preferences of an instance are somehow connected to such properties, then alternatives with similar properties should also be ranked similarly. In particular, by sharing information via features, it would in principle be possible to rank alternatives that have never been seen in the training process so far, i.e., to do some kind of “zero-shot learning” (Larochelle et al. 2008).
Returning to our above example of ranking music genres, suppose we know (or at least are quite sure) that \(\mathtt {Rock} \succ _{\varvec{x}} \mathtt {Classic} \succ _{\varvec{x}} \mathtt {Jazz}\) for a person \(\varvec{x}\). We would then expect that \(\mathtt {Pop}\) is more likely to be ranked close to the top than close to the bottom, simply because Pop music is more similar to Rock than to Classic or Jazz. In contrast to a label ranker, for which the music genres are just uninformative names, we are able to make a prediction of that kind thanks to our knowledge about the different types of music.
Note that (5) covers the important case of pairwise comparisons as a special case (\(M_i = 2\)). Pairwise comparisons are especially appealing in subjective preference judgments (David 1969), and hence can be motivated from this point of view. Moreover, within machine learning, the case of pairwise preferences has been studied quite extensively, because it allows for reducing the problem of ranking to simpler learning problems, such as binary classification (Dekel et al. 2004).
3 Related settings
As already mentioned earlier, the problem of dyad ranking is not only connected to label ranking, but also to several other types of ranking and preference learning problems that have been discussed in the literature.
The term “dyad ranking” derives from the framework of dyadic prediction as introduced by Menon and Elkan (2010b). This framework can be seen as a generalization of the setting of collaborative filtering (CF), in which row-objects (e.g., clients) are distinguished from column-objects (e.g., products). Moreover, with each combination of such objects, called a dyad by Menon and Elkan, a value (e.g., a rating) is associated. While in CF, row-objects and column-objects are only represented by their name (just like the alternatives in label ranking), they are allowed to have a feature representation (called side-information) in dyadic prediction. Menon and Elkan are trying to exploit this information to improve performance in matrix completion, i.e., predicting the values for those object combinations that have not been observed so far, in very much the same way as we are trying to make use of feature information in the context of label ranking.
CF and ranking are combined in collaborative ranking. Here, the aim is to provide personalized recommendations for users in the form of rankings on items (Weimer et al. 2007). In contrast to CF, where rankings can be obtained indirectly by sorting items according to predicted scores (which potentially leads to many ties), collaborative ranking tackles the ranking problem more directly.
Methods for learning-to-rank or object ranking (Cohen et al. 1999; Kamishima et al. 2011) have received a lot of attention in recent years, especially in the field of information retrieval (Liu 2011). In general, the goal is to learn a ranking function that accepts a subset \(O \subset \mathcal {O}\) of objects as input, where \(\mathcal {O}\) is a reference set of objects (e.g., the set of all books). As output, the function produces a ranking (total order) \(\succ \) of the objects O. The ranking function is commonly implemented by means of a scoring function \(U:\, \mathcal {O} \longrightarrow \mathbb {R}\), i.e., objects are first scored and then ranked according to their scores (Hüllermeier and Vanderlooy 2009). In order to induce a function of that kind, the learning algorithm is provided with training information, which typically comes in the form of exemplary pairwise preferences between objects. As opposed to label ranking, the alternatives to be ranked are described in terms of properties (feature vectors), while preferences are not contextualized. In principle, methods for object ranking could be applied in the context of dyad ranking, too, namely by equating the object space \(\mathcal {O}\) with the “dyad space” \(\mathcal {Z}\) in (3); in fact, dyads can be seen as a specific type of object, i.e., as objects with a specific structure. Especially close in terms of the underlying methodology is the so-called listwise approach in learning-to-rank (Cao et al. 2007). We elaborate on the relation to approaches of that kind in more detail in Sect. 6.
In most applications of CR so far, all nodes of the graph stem from the same domain, i.e., only a single type of object is considered (Pahikkala et al. 2010, 2013). The framework as such is more general, however, and the graph can also be bipartite. Then, v in (6) has a different type than \(v'\) and \(v''\), which means that a tuple \((v,v')\) can be seen as a dyad. From this point of view, CR is indeed very close to dyad ranking, also because the kernel function used in CR is based on joint feature representations (Pahikkala et al. 2013).
The authors of CR distinguish four different prediction problems for dyadic data, which are illustrated in Fig. 1 (Pahikkala et al. 2014). Problem A refers to the situation in which a prediction is sought for a dyad \((\varvec{x}, \varvec{y})\), such that both \(\varvec{x}\) and \(\varvec{y}\) have already been encountered in the training data, though not necessarily in combination (\((\varvec{x}_3, \varvec{y}_4)\) in the example). In contrast to this, problem D asks for a prediction on a dyad \((\varvec{x}, \varvec{y})\) such that neither \(\varvec{x}\) nor \(\varvec{y}\) is contained in the training data (zero-shot learning, \((\varvec{x}_6, \varvec{y}_8)\) in the example). Problems B and C are in between: either \(\varvec{x}\) or \(\varvec{y}\) has already been encountered, but not both (\((\varvec{x}_2, \varvec{y}_7)\), \((\varvec{x}_6, \varvec{y}_2)\) in the example). A prediction in dyad ranking, which involves several dyads, can be a mixture of these situations. Its flexibility is clearly a strength of the framework: it is applicable to problems of type A and C and, provided predictive features are available on \(\mathcal {Y}\), also to settings B and D.
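The four settings can be told apart mechanically from the training data. The sketch below is only an illustration: the function name `prediction_setting` is hypothetical, and the sets of already encountered instances and alternatives are assumptions chosen to agree with the example dyads named in the text, since the figure itself is not reproduced here.

```python
def prediction_setting(x, y, seen_x, seen_y):
    """Classify a query dyad by whether its members occurred in the training data."""
    if x in seen_x and y in seen_y:
        return "A"  # both known, though possibly never observed together
    if x in seen_x:
        return "B"  # known instance, new alternative
    if y in seen_y:
        return "C"  # new instance, known alternative
    return "D"      # both new (zero-shot)


seen_x = {f"x{i}" for i in range(1, 6)}  # assumed: x1..x5 occur in training
seen_y = {f"y{j}" for j in range(1, 7)}  # assumed: y1..y6 occur in training
queries = [("x3", "y4"), ("x2", "y7"), ("x6", "y2"), ("x6", "y8")]
settings = [prediction_setting(x, y, seen_x, seen_y) for x, y in queries]
```

Settings B and D are the ones in which the alternative is new, which is why they require predictive features on the alternatives.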
Finally, we note that dyad ranking deals with predictions (rankings of dyads) having a complex structure. Therefore, like for ranking and preference learning in general, there is also a natural connection to the field of structured output prediction (Bakir et al. 2007).
4 Joint-feature Plackett–Luce models
4.1 The basic model
An intuitively appealing explanation of the PL model can be given in terms of a vase model: If \(v_i\) corresponds to the relative frequency of the ith label in a vase filled with labeled balls, then \(\mathbf {P}(\pi \, | \,\varvec{v})\) is the probability to produce the ranking \(\pi \) by randomly drawing balls from the vase in a sequential way and putting the label drawn in the kth trial on position k (unless the label was already chosen before, in which case the trial is annulled). This explanation corresponds to the interpretation of the model as being a multistage model.
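The vase interpretation translates directly into code. The following sketch assumes the standard PL form, in which the label on position i is chosen with probability proportional to its skill among the labels not yet placed; the function names are hypothetical and label indices are 0-based. Annulling repeated draws is equivalent to renormalizing over the remaining labels, which is what the sampler does.

```python
from itertools import permutations

import numpy as np


def pl_probability(pi, v):
    """P(pi | v) for the standard PL model; pi lists 0-based label indices, best first."""
    skills = np.asarray(v, dtype=float)[list(pi)]
    prob = 1.0
    for i in range(len(skills)):
        # Stage i: the label on position i is chosen among those not yet placed.
        prob *= skills[i] / skills[i:].sum()
    return prob


def sample_ranking(v, rng):
    """Vase model: draw labels one by one, proportionally to v, without replacement."""
    v = np.asarray(v, dtype=float)
    remaining = list(range(len(v)))
    ranking = []
    while remaining:
        p = v[remaining] / v[remaining].sum()
        pick = int(rng.choice(len(remaining), p=p))
        ranking.append(remaining.pop(pick))
    return tuple(ranking)


v = [0.5, 0.3, 0.2]
total = sum(pl_probability(pi, v) for pi in permutations(range(3)))  # sums to 1
draw = sample_ranking(v, np.random.default_rng(0))
```

For these skills, the identity ranking (0, 1, 2) has probability 0.5 · (0.3 / 0.5) · 1 = 0.3.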
4.2 Plackett–Luce model with features
Given estimates of these parameters, prediction for new query instances \(\varvec{x}\) can be done in a straightforward way: \(\hat{\varvec{v}} = (\hat{v}_1, \ldots , \hat{v}_K)\) is computed based on (11), and a ranking \(\hat{\pi }\) is determined by sorting the labels \(y_k\) in decreasing order of their (predicted) skills \(\hat{v}_k\). This ranking \(\hat{\pi }\) is a reasonable prediction, as it corresponds to the mode of the distribution \(\mathbf {P}( \cdot \, | \,\hat{\varvec{v}})\).
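A minimal sketch of this prediction step follows. It assumes a log-linear parameterization of the skills, \(v_k = \exp (\langle \varvec{w}^{(k)}, \varvec{x} \rangle )\), with the weight vectors stored as columns of a matrix; this form, the function name `predict_ranking`, and the numbers are assumptions for illustration.

```python
import numpy as np


def predict_ranking(W, x):
    """Sort labels by decreasing skill v_k = exp(<w_k, x>); column k of W is w_k."""
    skills = np.exp(W.T @ x)
    return np.argsort(-skills), skills


# Hypothetical weights for r = 2 features and K = 3 labels.
W = np.array([[0.2, 1.0, -0.3],
              [-0.1, 0.5, 0.4]])
x = np.array([1.0, 2.0])
order, skills = predict_ranking(W, x)  # scores are 0.0, 2.0, 0.5
```

Here the second label receives the largest skill, followed by the third and the first.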
4.3 Plackett–Luce model with joint features
5 Bilinear Plackett–Luce model (BilinPL)
The bilinear model (14) appears to be the most obvious generalization of the linear model (11) in the context of dyad ranking. As such, it constitutes a natural starting point, although one may of course think of more complex, nonlinear extensions, for example using (pairwise) kernel functions (Basilico and Hofmann 2004). Likewise, one may also think of joint feature maps even simpler than the Kronecker product \(\varvec{x} \otimes \varvec{y}\), for example the concatenation \([\varvec{x}, \varvec{y}]\). This, however, would not allow for capturing interactions between the dyad members. Besides, it is not a proper generalization, as it does not cover (11) as a special case.
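The difference between the two feature maps can be made concrete. The sketch below assumes that the bilinear score has the form \(\varvec{x}^\top \mathbf {W} \varvec{y}\) and shows that it equals a linear score on \(\varvec{x} \otimes \varvec{y}\), whereas a score on the concatenation is purely additive; all variable names and the random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)       # instance features
y = rng.normal(size=2)       # features of one alternative
W = rng.normal(size=(3, 2))  # weight matrix of the bilinear model

score_bilinear = x @ W @ y
score_kron = W.ravel() @ np.kron(x, y)  # same score written as w^T (x ⊗ y)

# Concatenation gives an additive score w_x.x + w_y.y. For two alternatives of the
# same instance the x-part cancels, so x cannot influence their relative order.
w_x, w_y = rng.normal(size=3), rng.normal(size=2)
y2 = rng.normal(size=2)
gap = (w_x @ x + w_y @ y) - (w_x @ x + w_y @ y2)
```

The row-major flattening `W.ravel()` matches the ordering of `np.kron(x, y)`, whose entry at position `i * len(y) + j` is `x[i] * y[j]`.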
5.1 Identifiability of the bilinear PL model
The bilinear PL model introduced above defines a probability distribution on dyad rankings that is parameterized by the weight matrix \(\mathbf {W}\). An important question, also from a learning point of view, concerns the identifiability of this model. Recall that, for a parameterized class of models \(\mathcal {M}\), identifiability requires a bijective relationship between models \(M_\theta \in \mathcal {M}\) and parameters \(\theta \), that is, models are uniquely identified by their parameters. Or, stated differently, parameters \(\theta \ne \theta ^*\) induce different models \(M_{\theta } \ne M_{\theta ^*}\). Identifiability is a prerequisite for a meaningful interpretation of parameters and, perhaps even more importantly, guarantees unique solutions for optimization procedures such as maximum likelihood estimation.
Obviously, the original PL model (7) with constant skill parameters \(\varvec{v} = (v_1, \ldots , v_K)\) is not identifiable, since the model is invariant against multiplication of the parameter by a constant factor \(c > 0\): The models parameterized by \(\varvec{v}\) and \(\varvec{v}^* = (c v_1 , \ldots , c v_K)\) represent exactly the same probability distribution, i.e., \(\mathbf {P}(\pi \, | \,\varvec{v}) = \mathbf {P}(\pi \, | \,\varvec{v}^*)\) for all rankings \(\pi \). The PL model is, however, indeed identifiable up to this kind of multiplicative scaling. Thus, by fixing one of the weights to the value 1, the remaining \(K-1\) weights can be uniquely identified.
Now, what about the identifiability of our bilinear PL model, i.e., to what extent is such a model uniquely identified by the parameter \(\mathbf {W}\)? We can show the following result (a proof is given in Appendix A).
Proposition 1
Suppose the feature representation of labels does not include a constant feature, i.e., \(|\mathcal {Y}_i| > 1\) for each of the domains in (2), and that the feature representation of instances includes at most one such feature (accounting for a bias, i.e., an intercept of the bilinear model). Then, the bilinear PL model with skill values defined according to (14) is identifiable.
5.2 Comparison between the linear and bilinear PL model
It is not difficult to see that the linear model (11), subsequently referred to as LinPL, is indeed a special case of the bilinear model (14). In fact, the former is recovered from the latter by means of a (1-of-K) dummy encoding of the alternatives: The label \(y_k\) is encoded by a K-dimensional vector with a 1 in position k and 0 in all other positions. The columns of the matrix \(\mathbf {W}\) are then given by the weight vectors \(\varvec{w}^{(k)}\) in (11).
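This reduction is easy to verify numerically. The sketch assumes that the bilinear score is \(\varvec{x}^\top \mathbf {W} \varvec{y}\) and the linear score is \(\langle \varvec{w}^{(k)}, \varvec{x} \rangle \); the dimensions and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
r, K = 4, 3
W = rng.normal(size=(r, K))  # column k plays the role of w^(k)
x = rng.normal(size=r)
Y_onehot = np.eye(K)         # row k is the 1-of-K code of label y_k

bilinear_scores = np.array([x @ W @ Y_onehot[k] for k in range(K)])
linear_scores = W.T @ x      # <w^(k), x> for each label
```

Multiplying \(\mathbf {W}\) by a one-hot vector simply selects one column, so both score vectors coincide.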
Against the background of these considerations, one should expect dyad ranking to be advantageous over standard label ranking provided the assumptions underlying the bilinear model (14) are indeed valid, at least approximately. In that case, learning with (meta-)labels and disregarding properties of the alternatives would come with an unnecessary loss of information (that would need to be compensated by additional training data). In particular, using the standard label ranking approach is supposedly problematic in the case of many meta-labels and comparatively small amounts of training data.

The linear PL model, like standard label ranking in general, assumes all alternatives to be known beforehand and to be included in the training process. If generalization beyond alternatives encountered in the training process is needed, then BilinPL can be used while LinPL cannot.

If the assumption (14) of the bilinear model is correct, then BilinPL should learn faster than LinPL, as it needs to estimate fewer parameters. Yet, since LinPL can represent all dependencies that can be represented by BilinPL, the learning curve of the former should reach the one of the latter with growing sample size.
5.3 Learning the bilinear PL model
5.3.1 Learning via maximum likelihood estimation
5.3.2 Optimization via iterative majorization
The MM algorithm proposed by Hunter (2004) belongs to the standard approaches for maximum likelihood estimation of the basic PL model parameters \(\varvec{v}\). The acronym MM stands for majorization-minimization (minorization-maximization) in the case of minimization (maximization) problems. The prominent EM algorithm can be seen as a special instance of MM (Dempster et al. 1977). While the standard Newton algorithm could in principle be applied to minimize the convex PL objective, it is known that the MM approach is more robust and faster in direct comparison (Guiver and Snelson 2009). Although MM algorithms often need more iterations, they perform specifically well when operating far from the optimum point (Lange et al. 2000). The MM algorithm is furthermore superior to Newton’s method in the context of PL model training, because the latter requires special countermeasures to prevent erratic behavior (Hunter 2004). The overall advantage of MM over Newton in terms of speed is mainly due to the time-intensive inversion of the Hessian in any Newton step.
Recently proposed alternative optimization approaches are based on Bayesian inference. For example, Guiver and Snelson (2009) propose approximate inference based on expectation propagation and Caron and Doucet (2012) make use of Gibbs sampling. Maystre and Grossglauser (2015) propose an alternative approach for the basic PL model that is based on interpreting the maximum likelihood estimate as a stationary distribution of a Markov chain. This enables a fast and efficient spectral inference algorithm.
Recall that, in contrast to the basic PL model (7), the parameters \(v_i\) are not real numbers in the BilinPL model (14) but functions of a weight vector \(\varvec{w}\) and a joint-feature vector. To adapt the MM algorithm for obtaining \(\varvec{w}^*\), we take a closer look at MM and adopt the perspective of minimization by majorization of the NLL (18).
1. Initialize the first supporting point \(\varvec{u} = \varvec{u}_0\).
2. Find an update \(\varvec{w}\) such that \(g(\varvec{w}, \varvec{u}) \le g(\varvec{u}, \varvec{u})\).
3. If \(f(\varvec{u}) - f(\varvec{w}) < \epsilon \), then stop and return \(\varvec{w}\); otherwise set \(\varvec{u} = \varvec{w}\) and go back to step 2.
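The three steps can be sketched generically in Python. The example is a toy and not the majorizer used for the PL objective: it minimizes the scalar function \(f(w) = \log (1 + e^{w}) - 0.25\,w\) with the quadratic majorizer \(g(w, u) = f(u) + f'(u)(w - u) + \tfrac{L}{2}(w - u)^2\), which is valid because \(f'' \le L = 1/4\). The names `mm_minimize` and `argmin_majorizer` are hypothetical.

```python
import math


def mm_minimize(f, argmin_majorizer, u0, eps=1e-10, max_iter=1000):
    """Generic minimization by majorization with supporting point u."""
    u = u0                           # step 1: initial supporting point
    for _ in range(max_iter):
        w = argmin_majorizer(u)      # step 2: ensures g(w, u) <= g(u, u)
        if f(u) - f(w) < eps:        # step 3: stop once the decrease is negligible
            return w
        u = w
    return u


def f(w):
    return math.log1p(math.exp(w)) - 0.25 * w


def grad_f(w):
    return 1.0 / (1.0 + math.exp(-w)) - 0.25


L = 0.25  # upper bound on the second derivative of f
# The quadratic majorizer at u is minimized at u - f'(u) / L.
w_star = mm_minimize(f, lambda u: u - grad_f(u) / L, u0=2.0)
```

The exact minimizer satisfies \(\sigma (w) = 0.25\), i.e., \(w = \log (1/3)\), which the iteration approaches monotonically in terms of the objective value.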
5.3.3 Online learning using stochastic gradient descent
An interesting alternative approach, especially in the large-scale learning regime, is stochastic gradient descent (SGD) with selective sampling (Bottou 1998, 2010). The core idea of SGD is to update model parameters iteratively on the basis of a randomly picked example. The algorithm is explicated for the case of dyad ranking with the BilinPL model in Algorithm 1. It requires the specification of a data source, the total number of iterations, an initial learning rate, and a regularization parameter.
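The following is a minimal sketch of such an SGD loop, not a reproduction of Algorithm 1. It assumes skills of the form \(\exp (\varvec{w}^\top (\varvec{x} \otimes \varvec{y}))\), an L2 penalty weighted by the regularization parameter, and a decaying step size; the function names, the step-size schedule, and the default values are assumptions.

```python
import numpy as np


def joint_features(x, Y):
    """Kronecker-product features x ⊗ y for each row y of Y."""
    return np.array([np.kron(x, y) for y in Y])


def nll_and_grad(w, Phi):
    """PL negative log-likelihood of one ranking and its gradient with respect to w.

    Phi has one row of joint features per dyad, ordered from most to least preferred.
    """
    scores = Phi @ w
    nll, grad = 0.0, np.zeros_like(w)
    for m in range(len(scores) - 1):
        tail = scores[m:]
        shift = tail.max()                    # stabilizes the log-sum-exp
        weights = np.exp(tail - shift)
        nll += shift + np.log(weights.sum()) - scores[m]
        grad += (weights / weights.sum()) @ Phi[m:] - Phi[m]
    return nll, grad


def sgd_bilinpl(rankings, dim, n_iter=2000, lr0=0.05, reg=1e-3, seed=0):
    """Plain SGD: one randomly picked ranking per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for t in range(n_iter):
        Phi = rankings[rng.integers(len(rankings))]
        _, grad = nll_and_grad(w, Phi)
        lr = lr0 / (1.0 + lr0 * reg * t)      # decaying step size (assumed schedule)
        w -= lr * (grad + reg * w)
    return w
```

A training set would be a list of arrays produced by `joint_features`, each with its rows sorted according to the observed ranking; incomplete rankings simply contribute fewer rows.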
6 Plackett–Luce networks
Due to the bilinearity assumption, BilinPL comes with a relatively strong bias. This may or may not turn out as an advantage, depending on whether the assumption holds sufficiently well, but in any case requires proper feature engineering. The approach introduced in this section, called Plackett–Luce Network (PLNet), offers an alternative: It allows for learning a highly nonlinear joint-feature map expressed in terms of a neural network (Schäfer and Hüllermeier 2016).
6.1 Architecture
By the special choice of the linear activation function for the neuron at the output layer, the architecture is related to the joint-feature PL reference model (12) outlined in Sect. 4.3. The activation output of the neurons of the penultimate layer can be considered as a joint-feature vector \(\Phi (\varvec{x},\varvec{y})\), which is then linearly combined with a vector of weights to produce a utility score. Thus, the upper part of the network, i.e., the input layer down to the penultimate layer, can be considered as a joint-feature map \(\Phi (\varvec{x}, \varvec{y})\).
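A forward pass through such a network might look as follows. This is a sketch under assumptions: the dyad members are concatenated at the input, the hidden units use tanh, and the layer sizes and function names are illustrative; only the linear output neuron and the role of the penultimate layer follow from the description above.

```python
import numpy as np


def init_params(d_in, hidden, rng):
    """Random weights for the hidden layers plus the weight vector of the output neuron."""
    sizes = [d_in, *hidden]
    layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    w_out = rng.normal(scale=0.1, size=hidden[-1])
    return layers, w_out


def joint_feature_map(x, y, layers):
    """Phi(x, y): activations of the penultimate layer for the dyad (x, y)."""
    h = np.concatenate([x, y])
    for W, b in layers:
        h = np.tanh(h @ W + b)
    return h


def utility(x, y, layers, w_out):
    """Linear output neuron: utility score of the dyad."""
    return w_out @ joint_feature_map(x, y, layers)


def rank_dyads(dyads, layers, w_out):
    """Order dyads by decreasing utility, using the same network for every dyad."""
    u = np.array([utility(x, y, layers, w_out) for x, y in dyads])
    return np.argsort(-u), u
```

Because the same weights are applied to every dyad, the number of dyads in a ranking is not fixed by the architecture.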
6.2 Training
After an update has taken place, the procedure can be repeated on the remaining training examples. The algorithm stops when the error has been sufficiently diminished. We call this procedure, which is a hybrid between the online and the batch variant of backpropagation, staged backpropagation (SBP). In batch training, the weight updates are accumulated over n training examples before updating the network, whereas in online backpropagation, the network is updated after each example. In SBP, the network is updated in an online manner over the training examples, but in a batch-wise mode on the elements of each single example.
We suggest early stopping as a means for regularization to prevent overfitting.^{3} To this end, we track the NLL values of the training and validation data during the learning process (Prechelt 2012). A good point to stop the training and to prevent overfitting is when the validation NLL values start to increase again. Finally, predictions can be carried out straightforwardly using a trained PLNet. Given a set of dyads, we simply need to sort them in descending order of their utilities—as already explained, the ranking thus obtained corresponds to the mode of the PL distribution.
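The stopping rule can be sketched as follows, assuming the validation NLL is recorded once per epoch. The `patience` parameter, which tolerates a few non-improving epochs before stopping, is an addition for robustness and not part of the description above; the function name and the example curve are illustrative.

```python
def early_stopping_epoch(val_nll, patience=3):
    """Return the epoch with the lowest validation NLL seen before patience runs out."""
    best_epoch, best_nll = 0, float("inf")
    for epoch, nll in enumerate(val_nll):
        if nll < best_nll:
            best_epoch, best_nll = epoch, nll
        elif epoch - best_epoch >= patience:
            break  # validation NLL has not improved for `patience` epochs
    return best_epoch


val_curve = [2.1, 1.7, 1.4, 1.3, 1.35, 1.4, 1.5, 1.2]  # made-up validation NLL values
stop_at = early_stopping_epoch(val_curve, patience=3)   # epoch 3
```

In practice one would keep a copy of the network weights from the best epoch and restore them after stopping.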
6.3 Applicability
PLNet can be applied to dyad ranking problems as well as to label ranking problems. As for the latter, one possibility is to use the dummy encoding mentioned in Sect. 5.2, in which the vectors of the domain \(\mathcal {Y}\) are expressed as one-hot vectors. As in this case all values \(y_i\) in the input layer will be 0, except one having the value 1, this approach could also be implemented by omitting the input vector \(\varvec{y}\) altogether, and instead adding label-specific biases for the activation functions of the first hidden layer neurons.
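The equivalence between a one-hot input and label-specific biases can be checked on the pre-activations of the first hidden layer. The sketch uses random weights and hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, K, n_hidden = 3, 4, 5
W_x = rng.normal(size=(d_x, n_hidden))  # weights from x to the first hidden layer
W_y = rng.normal(size=(K, n_hidden))    # weights from the one-hot y to the first hidden layer
b = rng.normal(size=n_hidden)
x = rng.normal(size=d_x)

k = 2                                   # index of the label under consideration
y_onehot = np.eye(K)[k]
pre_act_onehot = np.concatenate([x, y_onehot]) @ np.vstack([W_x, W_y]) + b
pre_act_bias = x @ W_x + (b + W_y[k])   # label-specific bias replaces the y input
```

The one-hot vector selects row k of `W_y`, which therefore acts as an additive bias that depends only on the label.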
By relaxing the input type from a vector pair \(\left( \varvec{x},\varvec{y}\right) \) to a single input vector \(\varvec{z}\), PLNet would be applicable to the problem of object ranking, too (Cohen et al. 1999; Kamishima et al. 2011). It would consequently be possible to apply PLNet also to ordered joint query-document vectors \(\varvec{z}=\Psi (q,d)\) as commonly encountered in the learning-to-rank domain.
6.4 Comparison with existing NNbased approaches
In this section, we provide a brief overview of other (preference) learning methods based on neural networks that share some similarities with PLNet.
Comparison training refers to a framework introduced for learning from pairwise preferences with a neural network (Tesauro 1989). The network architecture consists of two subnetworks, which are connected to a single output node that indicates which of the two inputs is preferred over the other. The weights associated with the last hidden layer of one subnetwork are the mirrored version of the other subnetwork’s weights. Two principal problems are addressed with this setup, namely efficiency and consistency. In the evaluation phase, only a single subnetwork is used to evaluate n alternatives, which are then sorted according to their scores. The network essentially implements a (latent) utility function (thereby enforcing transitivity of the predicted preferences).
A similar approach called SortNet is proposed by Rigutini et al. (2011). They introduce a three-layered network architecture called CmpNN, which takes two input object vectors and outputs two real-valued numbers. The architecture establishes a preference function, for which the properties of reflexivity and symmetry (but not transitivity) are ensured by a weight-sharing technique. Huybrechts (2016) uses the CmpNN architecture (with 3 hidden layers) in the context of document retrieval.
The Bradley–Terry model was reparameterized by a single-layer neural network (consisting of a single sigmoid node) by Menke and Martinez (2008). It is used for predicting the outcome of two-team group competitions by rating individuals in the application of e-sports. The model offers several extensions, such as home-field advantage, player contributions, time effects, and uncertainty. This model differs from our approach in the sense of being tailored for pairwise group comparisons, albeit considering individual skills. The inclusion of feature vectors is of no concern in this model, and the extension to rankings is only suggested for future work.
The label ranking problem, introduced in Sect. 2.1, has been tackled with neural network approaches previously (Ribeiro et al. 2012; Kanda et al. 2012). A multilayer perceptron (MLP) has been utilized by Kanda et al. (2012) to produce recommendations on metaheuristics (labels) on different meta-features (instances) within the realm of meta-learning. This kind of neural network exhibits an output layer with as many nodes as there are labels. The error signal used to modify the network’s weights is formed by using the mean squared error on the target rank positions of the labels. In Ribeiro et al. (2012) more effort has been spent to incorporate label ranking loss information into the backpropagation procedure. To this end some variations of this procedure have been investigated. Both architectures are similar to each other and have two essential limitations: first, they depend on a fixed number of labels, and second, they cannot cope with incomplete ranking observations. In addition, they lack the ability to provide probabilistic information on their predictions. Zhang and Zhou (2006) train neural networks to minimize the ranking loss in multi-label classification. Thus, the network is adjusted such that, given an instance as an input, higher scores are produced as output for relevant (positive) labels and lower scores for irrelevant (negative) labels.

The learning approach in ListNet addresses only a special case of the PL distribution, namely the case of topk data with \(k=1\).

In ListNet, a linear neural network is used. This is in contrast to our approach, in which nonlinear relationships between inputs and outputs are learned. Linearity in the ListNet approach implies that much emphasis must be put on engineering joint feature input vectors.

In ListNet, the query-document features are associated with absolute scores (relevance degrees) as training information, i.e., quantitative data, whereas PLNet deals with rankings, i.e., data of qualitative nature.^{4}
7 Multidimensional unfolding of dyad ranking models
Multidimensional unfolding is an extension of multidimensional scaling (MDS) for two-way preferential choice data (Borg and Groenen 2005). As such, it is a useful visualization tool for analyzing and discovering preference structures. In this section, we examine how dyad ranking data can be visualized by combining the learned models with multidimensional unfolding.
7.1 The unfolding problem
7.2 Dyadic unfolding
With a trained dyad ranking model, it is possible to produce a matrix of skills for pairs of possibly new feature vectors from the domains \(\mathcal {X}\) and \(\mathcal {Y}\). Let \(\varvec{X}\) be an \(n_1 \times d_1\) matrix and \(\varvec{Y}\) an \(n_2 \times d_2\) matrix of feature vectors. Then, the skills \(v_{i,j}\) produced by the model on all pairwise vectors of \(\varvec{X}\) and \(\varvec{Y}\) can be grouped into a matrix \(\varvec{S}\) of dimensionality \(n_1 \times n_2\).
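For a bilinear model, the skill matrix can be computed in one matrix product. The sketch assumes skills of the form \(v_{i,j} = \exp (\varvec{x}_i^\top \mathbf {W} \varvec{y}_j)\); the function name `skill_matrix` and the random data are illustrative.

```python
import numpy as np


def skill_matrix(X, Y, W):
    """S[i, j] = exp(x_i^T W y_j) for all pairs of rows of X (n1 x d1) and Y (n2 x d2)."""
    return np.exp(X @ W @ Y.T)


rng = np.random.default_rng(3)
X = rng.normal(size=(5, 3))   # n1 = 5 instances with d1 = 3 features
Y = rng.normal(size=(4, 2))   # n2 = 4 alternatives with d2 = 2 features
W = rng.normal(size=(3, 2))   # hypothetical trained weight matrix
S = skill_matrix(X, Y, W)     # shape (5, 4)
```

For a neural model, the same matrix would be filled by evaluating the network on every pair of rows.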
For the transformation of the skills \(v_{i,j}\), there are several possibilities. Borg (1981) discusses functional relationships between the value scale v and distances in an unfolding model. The most obvious relationship is \(d_{i,j} = 1/v_{i,j}\), which we call transformation \(t_1(v_{i,j}) = 1/v_{i,j}\). Borg points out that the major drawback of this formula is that for almost similar items it requires v to become infinitely large. A better alternative, originally also motivated by Luce (1961) and Krantz (1967), is \(t_2(v_{i,j}) = d_{i,j} = \log (1/v_{i,j}) = -\log (v_{i,j})\).^{6} Another transformation (\(t_3\)) is the rank transformation, which creates rank numbers in descending order of the skill values.
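The three transformations can be sketched as follows. Two details are assumptions of this sketch rather than statements of the text: the affine rescaling of \(t_2\) uses the minimum and maximum over the whole matrix, and \(t_3\) assigns ranks separately within each row (i.e., per instance).

```python
import math


def t1(S):
    """Reciprocal transformation d = 1 / v."""
    return [[1.0 / v for v in row] for row in S]


def t2(S, rescale=True):
    """Logarithmic transformation d = -log(v), optionally mapped to [0, 1]."""
    D = [[-math.log(v) for v in row] for row in S]
    if not rescale:
        return D
    lo = min(min(row) for row in D)
    hi = max(max(row) for row in D)
    if hi == lo:
        return [[0.0 for _ in row] for row in D]
    return [[(d - lo) / (hi - lo) for d in row] for row in D]


def t3(S):
    """Rank transformation: rank 1 for the largest skill in each row."""
    out = []
    for row in S:
        order = sorted(range(len(row)), key=lambda j: -row[j])
        ranks = [0] * len(row)
        for r, j in enumerate(order, start=1):
            ranks[j] = r
        out.append(ranks)
    return out


S = [[1.0, math.e], [math.e ** 2, 1.0]]   # toy skill matrix
D1, D2, D3 = t1(S), t2(S), t3(S)
```

In all three cases a larger skill is mapped to a smaller dissimilarity, so preferred alternatives end up closer to the instance in the unfolding.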
7.2.1 Dyadic unfolding with SMACOF
As reported in Borg and Groenen (2005, Chapter 11.3), weights like those in the weighted stress function (39) provide a certain degree of flexibility. It is for example possible to mimic other stress functions, such as those used in Sammon’s mapping, elastic scaling, or S-Stress. Moreover, weights can be used to encode the reliability of a proximity, so that proximities with large weights have more impact on the resulting MDS solution than those that are less reliable.
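The role of the weights can be seen in a small sketch of a weighted raw stress, assuming the usual form \(\sum _{i,j} w_{i,j} (\delta _{i,j} - d_{i,j})^2\) over pairs of row and column points with Euclidean distances \(d_{i,j}\). The function name and the toy configuration are illustrative; normalization and within-set terms are omitted.

```python
import math


def weighted_stress(points_x, points_y, delta, weights):
    """Weighted raw stress between row points and column points.

    points_x, points_y -- coordinates of row and column objects
    delta              -- target dissimilarities, delta[i][j]
    weights            -- nonnegative weights, weights[i][j]
    """
    total = 0.0
    for i, p in enumerate(points_x):
        for j, q in enumerate(points_y):
            d = math.dist(p, q)  # Euclidean distance in the configuration
            total += weights[i][j] * (delta[i][j] - d) ** 2
    return total


points_x = [(0.0, 0.0)]
points_y = [(3.0, 4.0), (1.0, 0.0)]
delta = [[5.0, 3.0]]

full = weighted_stress(points_x, points_y, delta, [[1.0, 1.0]])
# A zero weight removes an unreliable (or missing) proximity from the fit.
masked = weighted_stress(points_x, points_y, delta, [[1.0, 0.0]])
```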
7.3 Related visualization approaches for preference ranking data
Besides multidimensional unfolding, other approaches for visualizing ranking data have been developed in the past (Alvo and Yu 2014). For instance, the permutation polytope, its combination with histograms and its projections belong to the classical approaches for visualizing ranking data (Marden 1995). The main disadvantage of the classical approaches is the difficulty of their interpretability with a growing number of ranking items.
There are different visualization techniques based on the vector model of unfolding introduced by Tucker (1960). The main idea is that subjects are represented as so-called preference vectors while items are identified as points. The closer the projection of an item is to a subject vector, the more preferred it is. Techniques that are based on the vector model include MDPref (Carroll and Chang 1970), CATPCA (Meulman et al. 2004), and VIPSCAL (Van Deun et al. 2005). A disadvantage is that the visualization gets confusing as more and more preference vectors enter the scene.
More recently, Kidwell et al. (2008) proposed a visualization technique using MDS in conjunction with Kendall distance on complete and partial rankings. The visualizations are capable of highlighting clusterings by utilizing heat map density plots. An advantage of the newly proposed dyadic unfolding visualization over this and all other techniques is its capability of visualizing probabilistic information.
8 Experiments on dyad ranking
The following experiments are intended to evaluate the performance of our dyad ranking methods BilinPL and PLNet, as well as the usefulness of the dyad ranking setting itself. Thus, we are interested in conditions under which learning can benefit from taking additional label descriptions into account. Our focus is on the specific case of contextual dyad ranking and its comparison to standard label ranking.
In addition to BilinPL and PLNet, we included LinPL (as implemented by Cheng et al. 2010a) as well as the following state-of-the-art label ranking methods as baselines: Ranking by Pairwise Comparison (RPC, Hüllermeier et al. 2008) and Constrained Classification (CC, Har-Peled et al. 2002a, b), both with logistic regression as base learner.^{7} For comparing with conditional ranking, we used QueryRankRLS as implemented in the software package RLScore (Pahikkala and Airola 2016). BilinPL, PLNet and other dyad ranking related methods are provided in the software package DyraLib.^{8} BilinPL and PLNet are realized in Matlab, and as a proof of concept, the latter is also implemented in Python based on the TensorFlow deep learning framework (Abadi et al. 2015).
8.1 Learning curves on synthetic data
The learning curves produced are shown in Fig. 4 for different numbers of labels. Overall, all ranking methods are able to learn and predict correctly if enough training data are available. In the limit, they all reach the performance of the “ground truth”: given complete knowledge about the true PL model, the optimal (Bayes) prediction is the mode of that distribution (note that the average performance of that predictor is still not perfect, since sampling from the distribution will not always yield the mode). As expected, BilinPL benefits from the additional label description compared to the other label ranking approaches over a wide range of different training set sizes and numbers of labels. In comparison with the other approaches, the learning curves of PLNet and RankRLS are less steep and require careful tuning of their regularization parameters.
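The remark on the ground-truth predictor can be checked with a small simulation, given here as a sketch with arbitrary toy skills: the mode of a PL distribution is obtained by sorting the items by decreasing skill, yet rankings sampled from the same distribution coincide with the mode only part of the time.

```python
import random


def pl_mode(skills):
    """Most probable ranking under a PL model: sort by decreasing skill."""
    return sorted(range(len(skills)), key=lambda i: -skills[i])


def pl_sample(skills, rng):
    """Draw one ranking by repeatedly choosing an item proportionally to skill."""
    items = list(range(len(skills)))
    ranking = []
    while items:
        weights = [skills[i] for i in items]
        chosen = rng.choices(items, weights=weights, k=1)[0]
        ranking.append(chosen)
        items.remove(chosen)
    return ranking


skills = [0.2, 3.0, 1.0]          # hypothetical true PL parameters
rng = random.Random(0)
mode = pl_mode(skills)

n_draws = 2000
hits = sum(pl_sample(skills, rng) == mode for _ in range(n_draws))
hit_rate = hits / n_draws         # analytically (3/4.2) * (1/1.2), about 0.595
```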
8.2 Label ranking on semisynthetic UCI data
Semisynthetic label ranking data sets and their properties
Classification  Regression  

Data set  # Inst.(N)  # Attr. (d)  # Labels (M)  Data set  # Inst. (N)  # Attr. (d)  # Labels (M) 
Authorship  841  70  4  Bodyfat  252  7  7 
Glass  214  9  6  Calhousing  20,640  4  4 
Iris  150  4  3  Cpusmall  8192  6  5 
Pendigits  10,992  16  10  Elevators  16,599  9  9 
Segment  2310  18  7  Fried  40,769  9  5 
Vehicle  846  18  4  Housing  506  6  6 
Vowel  528  10  11  Stock  950  5  5 
Wine  178  13  3  Wisconsin  194  16  16 
8.2.1 Experimental results
We compare the performance of BilinPL and PLNet to other state-of-the-art label ranking methods using 10-fold cross-validation. For PLNet, we use three layers with 10 neurons for the hidden layer. Labels were encoded for BilinPL and PLNet in terms of 1-of-K dummy vectors without utilizing further label descriptions. For BilinPL, an additional bias term (constant 1) is added to the representation of instances.
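The encoding described above can be sketched as follows. With 1-of-K label vectors, the Kronecker product of an instance (extended by the bias term) and a label is nonzero only in the positions belonging to that label, so each label effectively receives its own weight vector. The ordering of the product, \(\varvec{x} \otimes \varvec{y}\), and the helper names are assumptions of this sketch.

```python
def one_hot(k, K):
    """1-of-K dummy vector for label index k (0-based)."""
    return [1.0 if j == k else 0.0 for j in range(K)]


def with_bias(x):
    """Append the constant 1 used as a bias term for BilinPL."""
    return list(x) + [1.0]


def kron(x, y):
    """Kronecker product of two vectors as a flat list."""
    return [xi * yj for xi in x for yj in y]


x = with_bias([0.5, -1.0])    # hypothetical instance with two attributes
y = one_hot(1, 3)             # second of three labels
z = kron(x, y)                # joint feature vector of the dyad (x, y)
```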
Results on the UCI label ranking data sets (average Kendall \(\tau \,\pm \) standard deviation)
Data set  BilinPL  CC  PLNet  QRRLS  RPCLR 

Authorship  0.931 ± 0.013  0.916 ± 0.015  0.908 ± 0.025  0.432 ± 0.043  0.917 ± 0.020 
Bodyfat  0.268 ± 0.059  0.245 ± 0.052  0.251 ± 0.040  0.284 ± 0.057  0.285 ± 0.061 
Calhousing  0.220 ± 0.011  0.254 ± 0.009  0.272 ± 0.014  0.215 ± 0.011  0.243 ± 0.010 
Cpusmall  0.445 ± 0.016  0.468 ± 0.017  0.500 ± 0.019  0.376 ± 0.012  0.449 ± 0.016 
Elevators  0.730 ± 0.007  0.770 ± 0.009  0.788 ± 0.009  0.570 ± 0.007  0.749 ± 0.008 
Fried  0.999 ± 0.000  0.999 ± 0.000  0.951 ± 0.010  0.996 ± 0.001  1.000 ± 0.000 
Glass  0.835 ± 0.072  0.830 ± 0.079  0.846 ± 0.080  0.818 ± 0.075  0.889 ± 0.057 
Housing  0.655 ± 0.040  0.639 ± 0.044  0.703 ± 0.033  0.579 ± 0.038  0.672 ± 0.041 
Iris  0.813 ± 0.112  0.800 ± 0.109  0.960 ± 0.049  0.800 ± 0.064  0.911 ± 0.047 
Pendigits  0.892 ± 0.003  0.896 ± 0.002  0.905 ± 0.005  0.561 ± 0.003  0.932 ± 0.002 
Segment  0.903 ± 0.008  0.910 ± 0.008  0.939 ± 0.008  0.720 ± 0.011  0.929 ± 0.009 
Stock  0.704 ± 0.016  0.714 ± 0.016  0.882 ± 0.020  0.663 ± 0.016  0.774 ± 0.024 
Vehicle  0.855 ± 0.020  0.850 ± 0.025  0.872 ± 0.025  0.776 ± 0.031  0.855 ± 0.015 
Vowel  0.581 ± 0.026  0.577 ± 0.046  0.805 ± 0.016  0.574 ± 0.026  0.644 ± 0.021 
Wine  0.929 ± 0.052  0.914 ± 0.069  0.942 ± 0.034  0.923 ± 0.065  0.925 ± 0.054 
Wisconsin  0.629 ± 0.028  0.612 ± 0.030  0.514 ± 0.028  0.630 ± 0.031  0.632 ± 0.027 
Average rank  3.063  3.438  2.000  4.500  2.000 
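For reference, Kendall's \(\tau \) between a predicted and a true ranking can be computed as in the following sketch, which assumes complete rankings without ties and represents each ranking by the positions of the items.

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings given as position vectors.

    rank_a[i] and rank_b[i] are the positions of item i in the two rankings.
    Ties are assumed not to occur.
    """
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


same = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
reverse = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```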
8.2.2 Unfolding of label ranking predictions
Marginal distributions on a subset of labels for the highlighted test instance in Fig. 5
Label  Rank 1  Rank 2  Rank 3  Rank 4  Rank 5  Rank 6  Rank 7 

2  0.001 [6]  0.086 [4]  0.445 [1]  0.356 [2]  0.104 [3]  0.008 [5]  0.000 [7] 
3  0.001 [6]  0.059 [4]  0.310 [2]  0.438 [1]  0.174 [3]  0.018 [5]  0.000 [7] 
4  0.000 [6]  0.015 [5]  0.078 [4]  0.151 [3]  0.574 [1]  0.182 [2]  0.000 [7] 
7  0.000 [6]  0.004 [5]  0.019 [4]  0.038 [3]  0.148 [2]  0.791 [1]  0.000 [7] 
8  0.013 [4]  0.821 [1]  0.149 [2]  0.016 [3]  0.001 [5]  0.000 [6]  0.000 [7] 
10  0.985 [1]  0.015 [2]  0.000 [3]  0.000 [4]  0.000 [5]  0.000 [6]  0.000 [7] 
11  0.000 [7]  0.000 [6]  0.000 [5]  0.000 [4]  0.000 [3]  0.000 [2]  1.000 [1] 
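Marginal distributions of this kind follow directly from a PL model. The sketch below computes, for a small number of items, the probability that item i is placed on rank k by summing PL probabilities over all permutations. The brute-force enumeration and the toy skills are assumptions for illustration and are only feasible for few items.

```python
from itertools import permutations


def pl_probability(ranking, skills):
    """Probability of a complete ranking (best first) under a PL model."""
    remaining = sum(skills)
    prob = 1.0
    for item in ranking:
        prob *= skills[item] / remaining
        remaining -= skills[item]
    return prob


def rank_marginals(skills):
    """Matrix M with M[i][k] = P(item i is placed on rank k + 1)."""
    n = len(skills)
    marg = [[0.0] * n for _ in range(n)]
    for perm in permutations(range(n)):
        p = pl_probability(perm, skills)
        for pos, item in enumerate(perm):
            marg[item][pos] += p
    return marg


marginals = rank_marginals([3.0, 1.0, 0.5])   # hypothetical skills
```

Each row and each column of the resulting matrix sums to one, since every item occupies exactly one rank and every rank holds exactly one item.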
8.3 Multilabel ranking of musical emotions
In this experiment, PLNet is used to rank labels that are specified in a multilabel classification context. The Emotions dataset is about songs that were annotated by experts using multiple emotional labels based on the Tellegen-Watson-Clark model (Trohidis et al. 2008). In total, 593 songs from a variety of genres (Classical, Reggae, Rock, Pop, Hip-Hop, Techno, Jazz) were used, from which 72 rhythmic as well as timbre features were extracted and used as a feature representation.
To apply dyad ranking to this kind of multilabel data, a preference \((\varvec{x}, y) \succ (\varvec{x}, y')\) is constructed for each song \(\varvec{x}\) and each pair of emotions \(y, y'\) such that y is associated with (or relevant for) \(\varvec{x}\) but \(y'\) is not. Once trained on this data, a dyad ranker will be able to predict a ranking of emotions, contextualized by a song. Note, however, that such a ranking only provides a relative order, but no (absolute) separation of relevant and non-relevant labels. This problem has been addressed by a technique called calibrated label ranking (Brinker et al. 2006; Fürnkranz et al. 2008): An additional calibration label is introduced, which models exactly this separation. Correspondingly, preferences of all relevant labels over the calibration label and of the calibration label over all non-relevant labels are added to the training data.
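The construction of the training preferences for a single song can be sketched as follows. The function name and the name of the calibration label are hypothetical, and only a subset of the emotion labels is used for illustration; each returned pair (a, b) stands for the preference \((\varvec{x}, a) \succ (\varvec{x}, b)\).

```python
def calibrated_preferences(all_labels, relevant, calibration="calibration"):
    """Pairwise preferences for one instance, including a calibration label.

    Returns a list of pairs (a, b), each meaning "a is preferred to b".
    """
    relevant = set(relevant)
    rel = [l for l in all_labels if l in relevant]
    irr = [l for l in all_labels if l not in relevant]
    prefs = [(r, i) for r in rel for i in irr]   # relevant over non-relevant
    prefs += [(r, calibration) for r in rel]     # relevant over calibration
    prefs += [(calibration, i) for i in irr]     # calibration over non-relevant
    return prefs


emotions = ["amazed-surprised", "happy-pleased", "relaxing-calm",
            "quiet-still", "angry-aggressive"]
prefs = calibrated_preferences(emotions, {"relaxing-calm", "quiet-still"})
```

With two relevant and three non-relevant labels, this yields 2 · 3 + 2 + 3 = 11 preferences for the song.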
As can be seen, the unfolding nicely reflects the similarity between emotions. For example, quiet-still is located much closer to relaxing-calm than to happy-pleased. Likewise, angry-aggressive and amazed-surprised are close to each other but quite far from the other emotions. Overall, the unfolding suggests a spectrum ranging from quiet-still to amazed-surprised, and the songs are distributed along that spectrum. The absolute fit seems to be better for the left side of the spectrum, as can be seen from the difference between the songs and the emotions.
8.4 Algorithm configuration
In the following experiment, we apply our dyad ranking methods in the setting of metalearning for algorithm recommendation as described by Brazdil et al. (2008). In particular, given a problem instance, we aim at predicting rankings of configurations of genetic algorithms (GAs). The dyad ranking setting is here explored in full generality, because rankings of different lengths are considered as well as features of instances (problems) and labels (algorithms).
The GAs are applied to different instances of the traveling salesman problem (TSP). For training, the GA performance averages are taken to construct rankings, in which a single performance value corresponds to the length of the shortest route found by a GA. All GAs share the same selection criterion, which is “roulette-wheel”, the same mutation operator, which is “exchange mutation”, and “elitism” of 10 chromosomes (Mitchell 1998).
We tested the performance of three groups of GAs on a set of TSP instances. The groups are determined by their choice of the crossover operator, which are cycle (CX), order (OX), or partially mapped crossover (PMX) (Larrañaga et al. 1999). Problem instances are characterized by the number of cities and the performances of three landmarkers.

Crossover types: {CX, OX, PMX}

Crossover rates: {0.5, 0.6, 0.7, 0.8, 0.9}

Mutation rates: {0.08, 0.09, 0.1, 0.11, 0.12}
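How a training ranking might be derived from GA runs on one TSP instance is sketched below: configurations are described by crossover type, crossover rate, and mutation rate, and are ordered by their average tour length, shorter being better. The tour lengths are made-up numbers and the function name is hypothetical.

```python
from statistics import mean


def ranking_from_runs(runs):
    """Order configurations by average tour length (best first).

    runs -- dict mapping a configuration tuple
            (crossover type, crossover rate, mutation rate)
            to the tour lengths found in repeated GA runs
    """
    averages = {config: mean(lengths) for config, lengths in runs.items()}
    return sorted(averages, key=averages.get)


# Hypothetical results of three configurations on a single TSP instance.
runs = {
    ("CX", 0.5, 0.1): [120.0, 130.0],
    ("OX", 0.7, 0.08): [100.0, 110.0],
    ("PMX", 0.9, 0.12): [115.0, 117.0],
}
ranking = ranking_from_runs(runs)
```

Since only the evaluated configurations enter the ranking, observations of this kind are naturally incomplete rankings of varying length.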
8.4.1 Experimental results
Average performance in terms of Kendall’s tau and standard deviations of different metalearners and different conditions (average ranking lengths M and numbers of training instances N)
M  N  AR  BilinPL  CC  LinPL  PLNet  QRRLS  RPC 

5  30  .192 ± .063  .727 ± .014  .290 ± .063  .317 ± .049  .602 ± .057  .598 ± .041  .158 ± .052 
5  60  .358 ± .046  .766 ± .014  .428 ± .040  .452 ± .041  .651 ± .049  .633 ± .021  .311 ± .038 
5  90  .404 ± .030  .770 ± .014  .573 ± .042  .575 ± .037  .685 ± .063  .634 ± .028  .372 ± .035 
5  120  .430 ± .029  .777 ± .009  .610 ± .031  .619 ± .022  .727 ± .031  .644 ± .025  .387 ± .032 
10  30  .423 ± .054  .775 ± .007  .539 ± .054  .551 ± .049  .724 ± .035  .634 ± .018  .397 ± .043 
10  60  .487 ± .017  .781 ± .004  .690 ± .021  .696 ± .013  .744 ± .022  .652 ± .022  .493 ± .037 
10  90  .523 ± .014  .781 ± .007  .726 ± .015  .726 ± .012  .774 ± .010  .657 ± .024  .576 ± .018 
10  120  .522 ± .015  .783 ± .006  .750 ± .014  .748 ± .014  .783 ± .017  .661 ± .026  .620 ± .020 
20  30  .516 ± .037  .781 ± .005  .722 ± .019  .722 ± .015  .747 ± .037  .659 ± .016  .622 ± .018 
20  60  .549 ± .014  .784 ± .005  .763 ± .013  .758 ± .014  .787 ± .010  .659 ± .024  .714 ± .022 
20  90  .561 ± .014  .787 ± .006  .779 ± .010  .774 ± .013  .793 ± .013  .656 ± .033  .751 ± .021 
20  120  .571 ± .022  .787 ± .008  .786 ± .010  .782 ± .010  .794 ± .012  .659 ± .036  .772 ± .014 
30  30  .554 ± .028  .782 ± .005  .753 ± .013  .746 ± .018  .772 ± .019  .655 ± .018  .717 ± .019 
30  60  .567 ± .008  .785 ± .003  .782 ± .007  .775 ± .009  .791 ± .008  .661 ± .020  .767 ± .011 
30  90  .578 ± .008  .787 ± .004  .791 ± .005  .786 ± .005  .798 ± .003  .663 ± .015  .781 ± .006 
30  120  .580 ± .011  .786 ± .006  .794 ± .005  .789 ± .007  .799 ± .007  .666 ± .024  .787 ± .005 
8.4.2 Unfolding of algorithm configurations
We used the bilinear PL model to create an unfolding solution for the metalearning problem of recommending rankings over genetic algorithm configurations. To this end, we first trained the BilinPL model on 120 examples, each of which provides an incomplete ranking over only 5 of the 72 random configurations. The model was then used to predict the rankings on 126 new problem instances, and to complement the missing ranking information about the 120 training examples. Dyadic unfolding was then performed on the predicted skills (using transformation \(t_2\)) together with the pairwise within-set distances of points in X and Y. The calculation of the ideal point configuration under the setting of \(\alpha =0.1\), \(\beta =0.1\), and \(\gamma =0.8\) required 24 SMACOF iterations with a final stress value of 0.0975.
The resulting unfolding configuration shown in Fig. 8 nicely reveals the degree of suitability of GA configurations for particular TSP problems: GAs of type OX and PMX are more suitable for smaller problems, while GAs of the type CX are better suited for TSPs with a larger number of cities. Moreover, the types of GA are reflected well by the different clusters. Each cluster is again (sub)clustered according to mutation rates as indicated by different hues of colors. Isocontour circles are drawn exemplarily around a particular instance (here training problem 6 with 166 cities) to support the inspection of the GA ranking.
8.5 Similarity learning on tagged images
8.5.1 Learning image similarity using dyad ranking
A collection of images, in which each image is tagged with a class label, can be used to infer a notion of preference. For learning a measure of similarity, we make the reasonable assumption that a pair of images sharing the same label are mutually more similar to each other than a pair of images with different labels.
Given a finite set of class labels, there are multiple ways to construct dyad rankings based on this idea, as depicted in Fig. 9. In panels (a)–(c), there are dyad rankings which have in common that the first ranked dyad contains instances that are similar and the second ranked dyad contains instances that are dissimilar. These rankings are thus of the form \((\varvec{x}_1,\varvec{x}_2) \succ (\varvec{x}_3,\varvec{x}_4)\). Panel (b) is a special case of (a) in which one of the second ranked dyad instances belongs to the same class as those from the first ranked dyad. Panel (c) is again a special case of (b) in which one of the instances of the second ranked dyad coincides with one of the instances of the first ranked dyad. This can be translated into a contextualized preference statement “instance \(\varvec{x}_1\) is more similar to instance \(\varvec{x}_2\) than to \(\varvec{x}_3\)” (Cheng and Hüllermeier 2008).
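A dyad ranking of type (c) can be sampled from a tagged collection as in the following sketch. Images are referred to by their index, and the function name as well as the toy labels are assumptions made for illustration.

```python
import random


def sample_type_c(labels, rng):
    """Sample a dyad ranking (i, j) > (i, k) of type (c).

    labels -- list with the class label of every image
    Images i and j share a label, whereas image k carries a different one.
    """
    anchors = [
        i for i, li in enumerate(labels)
        if any(lj == li for j, lj in enumerate(labels) if j != i)
        and any(lk != li for lk in labels)
    ]
    if not anchors:
        raise ValueError("no image has both a similar and a dissimilar partner")
    i = rng.choice(anchors)
    same = [j for j, lj in enumerate(labels) if lj == labels[i] and j != i]
    diff = [k for k, lk in enumerate(labels) if lk != labels[i]]
    return (i, rng.choice(same)), (i, rng.choice(diff))


labels = ["cat", "cat", "dog", "dog", "bird"]   # hypothetical image tags
rng = random.Random(1)
samples = [sample_type_c(labels, rng) for _ in range(100)]
```

Rankings of types (a) and (b) could be generated analogously by drawing the second dyad independently of the first one.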
The implementation is also available in DyraLib and is based on Matlab with parts written in C++. This approach is also encouraged by existing metric learning approaches, which typically use training examples of type (c) in Fig. 9 (Bellet et al. 2013). To this end, of course, a suitable feature representation for the images is needed.
8.5.2 Construction of image features
The Caltech-256 data set has already been used for image similarity learning before (Chechik et al. 2009, 2010). The authors of these papers used a feature engineering approach, in which images are represented as sparse bag-of-visual-words vectors. Taking advantage of recent progress in deep learning, we utilize deep convolutional neural networks (CNN) for generating feature representations. More concretely, we used a pretrained CNN model called AlexNet, which has been created on the basis of the ImageNet data set (Krizhevsky et al. 2012; Jia et al. 2014). For each image, 4096-dimensional sparse feature vectors were obtained from the outputs of the 6th fully connected (6fc) layer of the convolutional neural network.
8.5.3 Unfolding of image similarity
SiDRa was run with a learning rate \(\eta = 0.1\) and a regularization parameter \(\lambda = 0.0001\). The dyad rankings were obtained during the learning process by sampling images involving all possible cases (a)–(c) as outlined in Fig. 9. The final BilinPL model comprised a weight matrix of about 16.8 million elements (\(=4096^2\)), and training took 30 K iterations. The model was then used to produce a matrix of skill values for images in the test set. The rank transform (\(t_3\)) of the skills was used for dyadic unfolding, which produced a configuration of points within 339 iterations.
The Pearson product-moment correlation coefficient on pairwise log scores \(v(\varvec{x},\varvec{x}')\), i.e., bilinear similarities, and pairwise column configuration point distances amounts to \(r = 0.791\). This means that the lower the distance between two images in the unfolding solution, the higher their dyadic preference in terms of similarity.
9 Conclusion
In this paper, we proposed dyad ranking as an extension of the label ranking problem, a specific problem in the realm of preference learning, in which preferences on a finite set of choice alternatives are represented in the form of a contextualized ranking. While the context is described in terms of a feature vector, the alternatives are merely identified by their label.
In practice, however, information about properties of the alternatives is often available, too, and such information could obviously be useful from a learning point of view. In dyad ranking, not only the context but also the alternatives are therefore characterized in terms of properties. The key notion of a dyad refers to a combination of a context and a choice alternative.
We proposed a method for dyad ranking that is an extension of an existing probabilistic label ranking method based on the Plackett–Luce model. This model is combined with a suitable feature representation of dyads. Concretely, we developed two instantiations of this approach. The first, BilinPL, is a method based on the representation of dyads in terms of a Kronecker product. The second, PLNet, is based on neural networks and allows for learning a (highly nonlinear) joint feature representation of dyads. The usefulness of the additional information provided by the feature description of alternatives was shown in several experiments and case studies on algorithm configuration and similarity learning.
Last but not least, we proposed a method for the visualization of dyad rankings, which is based on the technique of multidimensional unfolding. We consider this as an interesting contribution, especially because visualization has hardly been studied in the field of preference learning so far. Again, the usefulness of our approach was shown in several case studies.
There are several directions for future work. We are specifically interested in developing a “deep” version of PLNet. So far, PLNet is based on standard multilayer networks. However, in many applications, it could be advantageous to replace these networks by deep architectures. In our case study on image data, deep features were first extracted manually from a convolutional neural net, and then used for dyad ranking in a second step. A much more elegant approach, of course, would be to combine these steps in a single method. PLNet with a deep neural net is an obvious candidate for such a method.
Footnotes
 1.
This approach could be compared to the reduction of multilabel to multiclass classification via the label powerset transformation (Tsoumakas and Katakis 2007).
 2.
This is an upper bound, since in practice, not all feature combinations are necessarily realized.
 3.
Of course, other techniques could be used as well, including standard L2 regularisation. Since early stopping works well, a thorough comparison of different alternatives has not yet been done.
 4.
We argue that using query-document-associated scores as PL model parameters is anyway questionable, especially because such scores are normally taken from an ordinal scale.
 5.
Stress is an acronym for standardized residual sum-of-squares.
 6.
In our models, the \(v_{i,j}\) are ensured to be positive but not necessarily bounded. Therefore, we extend the mapping \(t_2\) by a subsequent affine linear transformation to the unit interval.
 7.
CC was used in its online variant as described in (Hüllermeier et al. 2008).
 8.
References
 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous systems. CoRR. arXiv:1603.04467.
 Alvo, M., & Yu, P. L. (2014). Statistical methods for ranking data. New York: Springer.
 Bakir, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., & Vishwanathan, S. V. N. (Eds.). (2007). Predicting structured data. Cambridge: MIT Press.
 Basilico, J., & Hofmann, T. (2004). Unifying collaborative and content-based filtering. In Proceedings ICML, 21st international conference on machine learning. ACM, New York, USA.
 Bellet, A., Habrard, A., & Sebban, M. (2013). A survey on metric learning for feature vectors and structured data (p. 57). arXiv:1306.6709.
 Borg, I. (1981). Anwendungsorientierte Multidimensionale Skalierung. New York: Springer.
 Borg, I., & Groenen, P. (2005). Modern multidimensional scaling: Theory and applications. New York: Springer.
 Borg, I., Groenen, P. J., & Mair, P. (2012). Applied multidimensional scaling. New York: Springer.
 Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010, 19th international conference on computational statistics (pp. 177–187). Springer, Paris, France.
 Bottou, L. (1998). Online algorithms and stochastic approximations. Online learning and neural networks. Cambridge: Cambridge University Press.
 Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
 Brazdil, P., Giraud-Carrier, C., Soares, C., & Vilalta, R. (2008). Metalearning: Applications to data mining (1st ed.). New York: Springer.
 Brinker, K., Fürnkranz, J., & Hüllermeier, E. (2006). A unified model for multilabel classification and ranking. In Proceedings of the ECAI2006: 17th European conference on artificial intelligence (pp. 489–493), Riva Del Garda, Italy.
 Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using gradient descent. In Proceedings ICML, 22nd international conference on machine learning (pp. 89–96). ACM, New York, USA.
 Busing, F. (2010). Advances in multidimensional unfolding. Ph.D. thesis, University of Leiden.
 Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings ICML, 24th international conference on machine learning (pp. 129–136). ACM, New York, USA.
 Caron, F., & Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley–Terry models. Journal of Computational and Graphical Statistics, 21(1), 174–196.
 Carroll, J. D., & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart–Young” decomposition. Psychometrika, 35(3), 283–319.
 Chechik, G., Sharma, V., Shalit, U., & Bengio, S. (2009). An online algorithm for large scale image similarity learning. Advances in Neural Information Processing Systems, 21, 1–9.
 Chechik, G., Sharma, V., Shalit, U., & Bengio, S. (2010). Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11, 1–29.
 Cheng, W., & Hüllermeier, E. (2008). Learning similarity functions from qualitative feedback. In Proceedings of the ECCBR 2008, 9th European conference on case-based reasoning (pp. 120–134). Springer, Trier, Germany, no. 5239 in LNAI.
 Cheng, W., Henzgen, S., & Hüllermeier, E. (2013). Labelwise versus pairwise decomposition in label ranking. In Proceedings of Lernen Wissen Adaptivität 2013 (LWA 2013) (pp. 140–147). Otto Friedrich Universität Bamberg, Germany.
 Cheng, W., Hühn, J., & Hüllermeier, E. (2009). Decision tree and instance-based learning for label ranking. In Proceedings ICML, 26th international conference on machine learning (pp. 161–168). Omnipress, Montreal, Canada.
 Cheng, W., Hüllermeier, E., Waegeman, W., & Welker, V. (2012). Label ranking with partial abstention based on thresholded probabilistic models. In Proceedings NIPS 2012, 26th annual conference on neural information processing systems, Lake Tahoe, Nevada, US.
 Cheng, W., Rademaker, M., De Baets, B., & Hüllermeier, E. (2010b). Predicting partial orders: Ranking with abstention. In Proceedings ECML/PKDD 2010, European conference on machine learning and principles and practice of knowledge discovery in databases, Barcelona, Spain.
 Cheng, W., Dembczyński, K., & Hüllermeier, E. (2010a). Label ranking methods based on the Plackett–Luce model. In J. Fürnkranz & T. Joachims (Eds.), Proceedings ICML, 27th international conference on machine learning (pp. 215–222). Haifa: Omnipress.
 Cohen, W., Schapire, R., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10(1), 243–270.
 Coombs, C. H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 57(3), 145–158.
 David, H. A. (1969). The method of paired comparisons. London: Griffin.
 De Leeuw, J., & Mair, P. (2009). Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software, 31(1), 1–30.
 De Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In J. R. Barra, F. Brodeau, G. Romier, & B. Van Cutsem (Eds.), Recent developments in statistics (pp. 133–146). North Holland.
 De Leeuw, J., & Heiser, W. J. (1977). Convergence of correction matrix algorithms for multidimensional scaling. In J. C. Lingoes (Ed.), Geometric representations of relational data (pp. 735–752). Ann Arbor, MI: Mathesis Press.
 Dekel, O., Singer, Y., & Manning, C. D. (2004). Log-linear models for label ranking. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (Vol. 16, pp. 497–504). Cambridge: MIT Press.
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (methodological), 39(1), 1–38.
 Fürnkranz, J., & Hüllermeier, E. (2010). Preference learning: An introduction. In J. Fürnkranz & E. Hüllermeier (Eds.), Preference learning (pp. 1–17). New York: Springer.
 Fürnkranz, J., Hüllermeier, E., Mencía, E., & Brinker, K. (2008). Multilabel classification via calibrated label ranking. Machine Learning, 73(2), 133–153.
 Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Caltech Mimeo, 11, 20.
 Groenen, P. J., & Heiser, W. J. (1996). The tunneling method for global optimization in multidimensional scaling. Psychometrika, 61(3), 529–550.
 Groenen, P., & van de Velden, M. (2016). Multidimensional scaling by majorization: A review. Journal of Statistical Software, 73(1), 1–26.
 Guiver, J., & Snelson, E. (2009). Bayesian inference for Plackett–Luce ranking models. In Proceedings ICML, 26th international conference on machine learning (pp. 377–384). ACM, ICML ’09.
 Har-Peled, S., Roth, D., & Zimak, D. (2002a). Constraint classification: A new approach to multiclass classification. In Proceedings ALT, 13th international conference on algorithmic learning theory (pp. 365–379). Springer.
 Har-Peled, S., Roth, D., & Zimak, D. (2002b). Constraint classification for multiclass classification and ranking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems (Vol. 15, pp. 809–816). Cambridge: MIT Press.
 Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16), 1897–1916.
 Hüllermeier, E., & Vanderlooy, S. (2009). Why fuzzy decision trees are good rankers. IEEE Transactions on Fuzzy Systems, 17(6), 1233–1244.
 Hunter, D. R. (2004). MM algorithms for generalized Bradley–Terry models. Annals of Statistics, 32(1), 384–406.MathSciNetCrossRefzbMATHGoogle Scholar
 Huybrechts, G. (2016). Learning to rank with deep neural networks. Master’s thesis, Ecole polytechnique de Louvain (EPL).Google Scholar
 Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093.
 Kamishima, T., Kazawa, H., & Akaho, S. (2011). A survey and empirical comparison of object ranking methods. In Preference learning (pp .181–201). Springer.Google Scholar
 Kanda, J., Soares, C., Hruschka, E. R., & de Carvalho, A.C.P.L.F. (2012). A metalearning approach to select metaheuristics for the traveling salesman problem using MLPbased label ranking. In Proceedings ICONIP, 19th international conference on neural information processing (pp. 488–495). Springer, Doha, Qatar.Google Scholar
 Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
 Kidwell, P., Lebanon, G., & Cleveland, W. (2008). Visualizing incomplete and partially ranked data. IEEE Transactions on Visualization and Computer Graphics, 14(6), 1356–1363.
 Krantz, D. H. (1967). Rational distance functions for multidimensional scaling. Journal of Mathematical Psychology, 4(2), 226–245.
 Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 25, pp. 1097–1105). Red Hook: Curran Associates, Inc.
 Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1–27.
 Lange, K. (2016). MM optimization algorithms. Philadelphia: Society for Industrial and Applied Mathematics (SIAM).
 Lange, K., Hunter, D., & Yang, I. (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9, 1–20.
 Larochelle, H., Erhan, D., & Bengio, Y. (2008). Zero-data learning of new tasks. In Proceedings of the AAAI’08, 23rd national conference on artificial intelligence (pp. 646–651).
 Larrañaga, P., Kuijpers, C. M. H., Murga, R. H., Inza, I., & Dizdarevic, S. (1999). Genetic algorithms for the traveling salesman problem: A review of representations and operators. Artificial Intelligence Review, 13, 129–170. https://doi.org/10.1023/A:1006529012972.
 Lichman, M. (2013). UCI Machine Learning Repository. School of Information and Computer Sciences, University of California, Irvine. http://archive.ics.uci.edu/ml.
 Liu, T. (2011). Learning to rank for information retrieval. New York: Springer.
 Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1–3), 503–528.
 Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.
 Luce, R. D. (1961). A choice theory analysis of similarity judgments. Psychometrika, 26(2), 151–163.
 Luenberger, D. G. (1973). Introduction to linear and nonlinear programming. Reading, MA: Addison-Wesley.
 Luo, T., Wang, D., Liu, R., & Pan, Y. (2015). Stochastic top-k ListNet. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 676–684). Association for Computational Linguistics, Lisbon, Portugal.
 Mallows, C. L. (1957). Non-null ranking models. Biometrika, 44(1/2), 114–130.
 Marden, J. I. (1995). Analyzing and modeling rank data. London: Chapman & Hall.
 Maystre, L., & Grossglauser, M. (2015). Fast and accurate inference of Plackett–Luce models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 28, pp. 172–180). Red Hook: Curran Associates, Inc.
 Menke, J. E., & Martinez, T. R. (2008). A Bradley–Terry artificial neural network model for individual ratings in group competitions. Neural Computing and Applications, 17(2), 175–186.
 Menon, A. K., & Elkan, C. (2010a). Dyadic prediction using a latent feature log-linear model. arXiv:1006.2156.
 Menon, A. K., & Elkan, C. (2010b). A log-linear model with latent features for dyadic prediction. In Proceedings of the 2010 IEEE international conference on data mining (pp. 364–373). IEEE Computer Society, ICDM ’10.
 Menon, A. K., & Elkan, C. (2010c). Predicting labels for dyadic data. Data Mining and Knowledge Discovery, 21(2), 327–343.
 Meulman, J. J., Van der Kooij, A. J., & Heiser, W. J. (2004). Principal components analysis with nonlinear optimal scaling transformations for ordinal and nominal data. In The Sage handbook of quantitative methodology for the social sciences (pp. 49–72). London: Sage.
 Mitchell, M. (1998). An introduction to genetic algorithms. Cambridge, MA: MIT Press.
 Murata, N., Kitazono, J., & Ozawa, S. (2017). Multidimensional unfolding based on stochastic neighbor relationship. In Proceedings of the 9th international conference on machine learning and computing (pp. 248–252).
 Pahikkala, T., Stock, M., Airola, A., Aittokallio, T., De Baets, B., & Waegeman, W. (2014). A two-step learning approach for solving full and almost full cold start problems in dyadic prediction. In T. Calders, F. Esposito, E. Hüllermeier, & R. Meo (Eds.), Lecture notes in computer science (Vol. 8725, pp. 517–532). Springer.
 Pahikkala, T., Waegeman, W., Airola, A., Salakoski, T., & De Baets, B. (2010). Conditional ranking on relational data. In Proceedings ECML/PKDD European conference on machine learning and knowledge discovery in databases (pp. 499–514). Springer.
 Pahikkala, T., & Airola, A. (2016). RLScore: Regularized least-squares learners. Journal of Machine Learning Research, 17(221), 1–5.
 Pahikkala, T., Airola, A., Stock, M., De Baets, B., & Waegeman, W. (2013). Efficient regularized least-squares algorithms for conditional ranking on relational data. Machine Learning, 93, 321–356.
 Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2), 193–202.
 Prechelt, L. (2012). Early stopping: But when? In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 53–67). Springer.
 Ribeiro, G., Duivesteijn, W., Soares, C., & Knobbe, A. J. (2012). Multilayer perceptron for label ranking. In Proceedings ICANN, 22nd international conference on artificial neural networks (pp. 25–32). Springer, Lausanne, Switzerland.
 Rigutini, L., Papini, T., Maggini, M., & Scarselli, F. (2011). SortNet: Learning to rank by a neural preference function. IEEE Transactions on Neural Networks, 22(9), 1368–1380.
 Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
 Schäfer, D., & Hüllermeier, E. (2015). Dyad ranking using a bilinear Plackett–Luce model. In Proceedings ECML/PKDD 2015, European conference on machine learning and knowledge discovery in databases (pp. 227–242). Springer, Porto, Portugal.
 Schäfer, D., & Hüllermeier, E. (2016). Plackett–Luce networks for dyad ranking. In Workshop LWDA, “Lernen, Wissen, Daten, Analysen”. Potsdam, Germany.
 Soufiani, H., Parkes, D., & Xia, L. (2014). Computing parametric ranking models via rank-breaking. In Proceedings of ICML, 31st international conference on machine learning, Beijing, China.
 Tesauro, G. (1989). Connectionist learning of expert preferences by comparison training. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems (NIPS 1988) (Vol. 1, pp. 99–106). Los Altos: Morgan Kaufmann.
 Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In Proceedings of ISMIR 2008, international conference on music information retrieval (pp. 325–330), Philadelphia, PA, USA.
 Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
 Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.
 Tucker, L. R. (1960). Intra-individual and inter-individual multidimensionality. In H. Gulliksen & S. Messick (Eds.), Psychological scaling: Theory and applications (pp. 155–167). New York: Wiley.
 Van Deun, K., Groenen, P., & Delbeke, L. (2005). VIPSCAL: A combined vector ideal point model for preference data. Econometric Institute Report No. EI 2005-03, Erasmus University Rotterdam.
 Vembu, S., & Gärtner, T. (2010). Label ranking algorithms: A survey. In J. Fürnkranz & E. Hüllermeier (Eds.), Preference Learning (pp. 45–64). New York: Springer.
 Weimer, M., Karatzoglou, A., Le, Q. V., & Smola, A. J. (2007). COFI RANK: Maximum margin matrix factorization for collaborative ranking. In J. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems (Vol. 20, pp. 1593–1600). Cambridge: MIT Press.
 Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA.
 Zhang, M. L., & Zhou, Z. H. (2006). Multi-label neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1338–1351.
 Zhou, Y., Liu, Y., Yang, J., He, X., & Liu, L. (2014). A taxonomy of label ranking algorithms. Journal of Computers, 9(3), 557–565.