Ensembles for multi-target regression with random output selections
Abstract
We address the task of multi-target regression, where we generate global models that simultaneously predict multiple continuous variables. We use ensembles of generalized decision trees, called predictive clustering trees (PCTs), in particular bagging and random forests (RF) of PCTs and extremely randomized PCTs (extra-PCTs). We add another dimension of randomization to these ensemble methods by learning individual base models that consider random subsets of target variables, while leaving the input space randomizations (in RFs of PCTs and extra-PCTs) intact. Moreover, we propose a new ensemble prediction aggregation function, where the final ensemble prediction for a given target is influenced only by those base models that considered it during learning. An extensive experimental evaluation on a range of benchmark datasets has been conducted, where the extended ensemble methods were compared to the original ensemble methods, individual multi-target regression trees, and ensembles of single-target regression trees in terms of predictive performance, running times and model sizes. The results show that the proposed ensemble extension can yield better predictive performance, reduce learning time, or both, without a considerable change in model size. The newly proposed aggregation function gives the best results when used with extremely randomized PCTs. We also include a comparison with three competing methods, namely random linear target combinations and two variants of random projections.
Keywords
Predictive clustering trees · Multi-target regression · Output space decomposition · Structured outputs · Ensemble methods

1 Introduction
Supervised learning is a highly active and researched area of machine learning. Its goal is to produce a model that can take a previously unseen example and predict the value of a variable of interest, typically called a target variable. If the target variable is of a discrete type, the task at hand is classification. If the target variable is of a numeric type, the task is called regression. Such single-target (ST) prediction scenarios are very common.
A number of challenges from various domains require a more complex representation of the data. In those cases, we need to move away from generating models that make predictions for one target variable to models that make predictions for multiple targets simultaneously, i.e., address the task of multi-target (MT) prediction. In general, MT prediction falls under the scope of structured output prediction (SOP). SOP, as the name suggests, is concerned with predicting values of structured data types, which are composed of values of primitive data types, e.g., Booleans, real numbers, or discrete values (Panov et al. 2016). Examples of structured data types are tuples, sequences, sets, tree-shaped hierarchies, directed acyclic graphs, etc. (Džeroski 2007). Examples of SOP tasks are multi-target regression (MTR), (hierarchical) multi-label classification (MLC) and time series prediction. Solving SOP tasks has great potential and importance in many domains, and has been listed as one of the most challenging problems in machine learning by Yang and Wu (2006) and Kriegel et al. (2007).
This work considers the task of MTR: predicting multiple continuous variables. Many real-life scenarios exist where one is interested in predicting multiple numerical values, e.g., in ecology (Demšar et al. 2006; Stojanova et al. 2010) and the life sciences (Jančič et al. 2016). MT prediction methods differ mostly in the way they exploit the target space structure while learning a predictive model. The most natural and simple starting point is to make a model for each component of the structure separately. The models predicting the individual components are then combined to make predictions for the whole structure. Such methods are called local, because they learn a local model for only one component at a time, while ignoring the other components (i.e., the global context). Hence, local methods cannot exploit information hidden in the combination of multiple components and the relationships between them. In contrast to local methods, global methods take into account all the structural components and their relations and then make predictions for all of them simultaneously. In general, this makes global models more interpretable. Global models (as well as the process of learning them) are also more computationally efficient than local ones. This becomes especially evident when the predicted structure consists of many components. Learning global models can therefore yield better predictive performance while consuming fewer resources.
In this paper, we propose a new ensemble extension method for the task of MTR, called Random Output Selections (ROS). The method uses predictive clustering trees (PCTs) as base models in the ensembles. PCTs, a generalization of decision trees, are global models for SOP able to solve MTR and MLC tasks, among others (Blockeel et al. 1998). The proposed method can be coupled with any ensemble learning method that employs global decision trees as base learners. The method learns each base model on a random subset of all target variables (each model has its own subset of variables). In this work, we apply ROS to three different ensemble learning methods: bagging (Bag) (Breiman 1996; Kocev et al. 2013), random forests (RF) (Breiman 2001; Kocev et al. 2013) and extremely randomized trees (ET) (Geurts et al. 2006; Kocev and Ceci 2015). Analogously, we refer to the extended methods as Bag-ROS, RF-ROS and ET-ROS.
The main focus of this study is to determine whether the proposed method can improve the predictive performance and shorten the learning times of the considered ensemble methods. An extensive empirical evaluation over a variety of benchmark datasets is performed in order to determine the effects of using ROS on predictive performance. In addition, we perform an analysis with respect to time and space complexity. The main contributions of this work are the following:

A novel global ensemble extension approach for addressing the MTR task. It randomly selects subsets of targets for learning individual base models. This can yield better predictive performance and shorter learning times.

A novel aggregation function for the prediction of base models in the extended ensembles. By default, each tree in the ensemble predicts all targets. Here, only targets that were considered during the learning of an individual tree contribute to the final predictions.

An extensive empirical evaluation of the three different ensemble methods on 17 benchmark datasets, which provide performance assessment for the original and extended ensemble methods, as well as individual multitarget regression trees and ensembles of singletarget regression trees (all over a range of ensemble sizes). The study also includes parameter setting recommendations for the proposed method. Moreover, we compare the performance to other competing methods that also transform the output space.

Theoretical computational complexity analysis of the proposed ensemble extension method, linked to the empirical evidence of the aforementioned evaluation.
2 Background and related work
2.1 Task definition

An input space X, with tuples of dimension d, containing values of primitive data types, i.e., \(\forall x_{i} \in X, x_{i} = (x_{i_{1}},x_{i_{2}}, \ldots ,x_{i_{d}})\),

An output (target) space Y, with tuples of dimension t, containing real values, i.e., \(\forall y_{i} \in Y, y_{i} = (y_{i_{1}},y_{i_{2}}, \ldots ,y_{i_{t}})\), where \(y_{i_{k}} \in \mathbb {R}\) and \(1 \le k \le t\)

A set of examples S, where each example is a pair of tuples from the input and the output space, i.e., \(S = \{(x_{i}, y_{i}) \mid x_{i} \in X, y_{i} \in Y, 1 \le i \le N \}\) and N is the number of examples in S (\(N = |S|\)),

A quality criterion c, which rewards models with high predictive accuracy and low complexity. The goal is then to find a function \(f: X \rightarrow Y\) that maximizes c.
In this work, the function f is represented by an ensemble (set) of PCTs. It will be learned by the approaches of bagging (Bag), random forests (RF) and extra-PCTs (ET), as well as their ROS counterparts Bag-ROS, RF-ROS and ET-ROS.
2.2 Related work
Our work relates to three main areas: solving the task of MTR, ensemble learning and output space decomposition. Multi-target regression, also referred to as multivariate or multi-response regression, is a machine learning task where the goal is to predict multiple real-valued variables simultaneously. Borchani et al. (2015) divide multi-target regression methods into two categories: problem transformation methods and algorithm adaptation methods.
Problem transformation methods include the work of Spyromitros-Xioufis et al. (2016) on single-target approaches, multi-target regressor stacking and regressor chains, the work of Zhang et al. (2012) on multi-output support vector regression, and the work of Tsoumakas et al. (2014) on random linear target combinations. These methods transform the output space in such a way that it is possible to apply existing methods to solve the task at hand. The transformation process usually converts a multi-target problem into several single-target ones, thus approaching the MTR problem locally. However, some transformation methods include multiple targets in the learning process, making them neither fully local nor fully global, e.g., that of Tsoumakas et al. (2014).
Algorithm adaptation methods, on the other hand, have the capability to handle multi-target tasks naturally, i.e., no transformation of the data is needed. Such methods are global. They can exploit the potential relatedness of the targets to learn models with better predictive performance faster than problem transformation methods. Algorithm adaptation methods include statistical methods, such as those by Abraham et al. (2013), Breiman and Friedman (1997) and Izenman (1975), multi-output support vector regression work by Xu et al. (2013), Han et al. (2012) and Deger et al. (2012), kernel methods by Alvarez et al. (2012) and Micchelli and Pontil (2004), multi-target regression trees (Kocev and Ceci 2015; Levatić et al. 2014; Appice and Malerba 2014; Kocev et al. 2013; Stojanova et al. 2012; Ikonomovska et al. 2011; Appice and Džeroski 2007) and rule-based methods for MTR by Aho et al. (2012). Due to the plethora of existing methods, we will not discuss all of them here but rather briefly describe the ones most closely related to our work.
Predictive clustering trees (PCTs) were introduced by Blockeel et al. (1998). A PCT is a generalization of a standard decision tree, which can be instantiated to support different tasks of SOP, one of which is the task of MTR. PCTs are decision-tree-based models that belong to the group of algorithm adaptation methods, because they handle the task without transforming the instance space. PCTs are global models, since they give predictions for all targets simultaneously. PCTs are instantiated by two required parameters: the variance and prototype functions. Technically, PCTs perform divisive hierarchical clustering by using the provided variance function. The variance function is used to calculate heuristic scores that guide the learning process until a stopping criterion is met. This eliminates the need for arbitrarily selecting the number of clusters beforehand, as required by traditional clustering methods. When instances are clustered, the prototype function is used to calculate the predictions in all leaf nodes (terminal clusters of the hierarchy). A detailed explanation of PCTs can be found in Sects. 3.1 and 3.2.
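As a minimal sketch of the variance function just described (hypothetical names, not the actual CLUS code), the heuristic score of a cluster can be computed as the variance of each target column, averaged over the (possibly selected) targets; in practice, per-target normalization by the variance on the whole training set would additionally be applied so that targets on different scales contribute equally:

```python
import numpy as np

def mtr_variance(Y, target_subset=None):
    """Variance of a cluster of examples over the (selected) targets.

    Y is an (n_examples, n_targets) matrix of target values; when
    target_subset is given, only those target columns are considered,
    as would be the case when learning with ROS. Illustrative sketch.
    """
    Y = np.asarray(Y, dtype=float)
    if target_subset is not None:
        Y = Y[:, list(target_subset)]
    # per-target variance, averaged across the selected targets
    return Y.var(axis=0).mean()
```

For example, `mtr_variance([[1, 2], [3, 4]])` averages the two column variances (both 1.0) and returns 1.0.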
Ensembles of PCTs, specifically bagging and random forests, were introduced by Kocev et al. (2007, 2013). Kocev and Ceci (2015) later extended extremely randomized trees, initially introduced by Geurts et al. (2006), to structured outputs. They called them extra-PCTs, because they based the implementation on PCTs. Extremely randomized trees select one random split point for each of k predictive attributes at each split. The best-performing split point is selected and the process continues recursively. Extremely randomized trees and their multi-output variant (extra-PCTs) are very unstable models, so it only makes sense to use them in an ensemble setting.
Several multi-target prediction approaches that transform the output space do exist, but they mainly focus on the task of MLC. Joly et al. (2014) reduce the dimensionality of the output space by making random projections of it. The projections are made in such a way that they preserve the original distances in the projected space. Their approach uses the Johnson-Lindenstrauss lemma: if the output space projection matrix satisfies the lemma, the variance computations in the projected space will be \(\epsilon\)-approximations of the variance in the original output space. They employ Gaussian, Rademacher, Hadamard and Achlioptas projections to compress the output space. Only the variance calculations are made in the projected space, while the predictions are made directly in the original output space (i.e., no decoding is needed). They use multi-output regression trees to calculate the variances in the projected space and then apply thresholding to obtain predictions for labels (i.e., the MLC setting). Our approach is simpler and more straightforward, as it does not transform the output space but only takes a subset of it. They do not report any results for the MTR setting. Joly (2017) proposes a gradient boosting method for MTR that uses random projections of the output space to automatically adapt to the output correlation structure.
Tsoumakas and Vlahavas (2007) propose the ensemble method RAkEL (Random k-labelsets) for the task of MLC, which is also transformation-based. RAkEL is an ensemble-like wrapper method for solving multi-label classification tasks with existing algorithms for multi-class classification. It constructs the ensemble by providing a small random subset of k labels (organized as a label powerset) to each base model, learned by a multi-class classifier. This results in an additional step in the prediction phase, because the predictions need to be decoded. In addition, RAkEL's computational complexity is high, because the generated output spaces are label powersets and the underlying classification algorithm is a parameter, which can considerably worsen the training times (e.g., if SVMs are used instead of ordinary decision trees). This approach has been extended by Szymański et al. (2016), who propose not to use the original random partitioning of subsets as performed by RAkEL, but rather a data-driven approach: community detection methods from social network analysis are used to partition the label space, which can find better subspaces than random search.
Madjarov et al. (2016) also use a data-driven approach to solve the task of MLC. They use label hierarchies obtained from hierarchical clustering of flat label sets, using annotations that appear in the training data. Finally, the work of Tsoumakas et al. (2014) considers the MTR task. They use random linear target combinations to enrich the output space by constructing many new target variables. They use a predefined number of original target variables in each random combination and then transform the original output space matrix by multiplying it with the coefficient matrix consisting of the new combinations.
3 Ensembles for multi-target regression with random output selections
The following section introduces the ROS ensemble extension method. We consider the proposed approach as a global method that belongs to the algorithm adaptation group of methods. Although ROS uses subsets of the output space during learning, the learned ensemble provides predictions for all target variables simultaneously. We first describe the predictive clustering paradigm and then explain the process of learning a single predictive clustering tree (PCT). Next, we present the proposed method for learning ROS tree ensembles. Finally, we provide a computational complexity analysis of the proposed approach.
3.1 Predictive clustering
The predictive clustering (PC) framework was introduced by Blockeel (1998). It can be seen as a generalization of supervised and unsupervised learning. The two learning approaches are traditionally considered as two separate machine learning tasks. However, there are supervised methods (e.g., decision trees and rules) that partition the instance space into subsets, which makes it possible to interpret them as clustering approaches. Unsupervised learning groups/clusters examples that are similar according to some distance measure. In supervised learning, the primary goal is to make predictions. The PC framework combines these two approaches.
The PC framework is implemented in the context of decision trees. From the PC point of view, each decision tree is a hierarchy of clusters. The root node of the tree holds all the examples. When traversing the tree (from the root to a leaf), each intermediate node contains fewer examples than its parent node. The connections between nodes represent the available paths that each example can take. The decision about which path a new example should take is made at the time of traversal and is based on the example's values of the predictive variables. The bottommost nodes of the decision tree are called leaf nodes and hold the examples most similar to each other. The examples in a leaf are used for calculating the prediction of the leaf.
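The traversal described above can be sketched in a few lines (a hypothetical dict-based tree representation, not the CLUS one): each internal node applies its test to the example and forwards it down the matching branch until a leaf's prototype is reached.

```python
def traverse(node, example):
    """Route an example from the root to a leaf and return the leaf's
    prediction (its prototype). Internal nodes store a boolean test."""
    while node.get("test") is not None:           # internal node
        node = node["yes"] if node["test"](example) else node["no"]
    return node["prototype"]                      # leaf node

# A tiny two-leaf tree splitting on the first predictive attribute.
tree = {
    "test": lambda x: x[0] > 0.5,
    "yes": {"test": None, "prototype": [1.0, 2.0]},
    "no": {"test": None, "prototype": [-1.0, 0.0]},
}
```

Here the prototype is a vector, reflecting that a single PCT predicts all targets at once.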
A decision tree within the PC framework is called a predictive clustering tree (PCT). A PCT is predictive as it is able to make predictions. A PCT is a clustering, i.e., a hierarchy of clusters, represented by the tree’s structure. Each node in the tree represents a cluster, which can be explained/described by the conditions/tests that appear in the tree. Each node holds a test and if we combine all the tests from the root node to the selected node, we get the description of the cluster at the selected node. Several different predictive clustering methods (Blockeel and Struyf 2002; Struyf and Džeroski 2006; Kocev 2011; Ženko 2007; Vens et al. 2008; Slavkov et al. 2010) are already implemented in the CLUS software package and are available at http://sourceforge.net/projects/clus/.
3.2 Learning a single PCT
The induction of a PCT is similar to the induction of a standard decision tree and follows the TDIDT (top-down induction of decision trees) algorithm. Algorithm 1 shows the pseudo code for PCT induction. Considering the MTR task in the context of ROS tree ensembles, the PCT induction algorithm takes three inputs:
The typical PCT considers all predictive and target attributes and is induced by selecting \(\delta_c(D) = \delta_{|D|}(D) = D\) and \(R_t=T\), where D and T are the sets of predictive and target variables, respectively. The \(\delta_c\) and \(R_t\) parameters are needed when inducing PCTs in the scope of ensemble learning with ROS, which we describe in detail in Sect. 3.3.
Regular PCTs make predictions based on the examples in the leaf nodes. Specifically, the Prototype function (line 9 of Algorithm 1) calculates the arithmetic mean of every target variable over all the examples in a given node. If needed, the prototype calculation function can easily be adapted to better address a specific task. The BestTest function only calculates the heuristic value on \(S_R\) (the reduced subset of S). This, however, does not restrict the Prototype function, which can make predictions for all output variables, even if some of them did not contribute to the calculation of the heuristic value \(h^*\). We discuss this further in Sect. 3.3.3.
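The default prototype computation for MTR reduces to a column-wise mean over the examples in a leaf; a minimal sketch (hypothetical helper name, assuming numeric targets):

```python
def prototype(leaf_targets):
    """Arithmetic mean of every target variable over the examples in a
    leaf; the resulting vector is the leaf's multi-target prediction."""
    n = len(leaf_targets)           # number of examples in the leaf
    t = len(leaf_targets[0])        # number of target variables
    return [sum(row[i] for row in leaf_targets) / n for i in range(t)]
```

For instance, a leaf holding target tuples (1, 3) and (3, 5) predicts (2, 4).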
In addition to regular PCTs, we also consider extra-PCTs (Kocev and Ceci 2015). These PCTs are induced in exactly the same way as described in Algorithm 1. However, an extra-PCT finds the split points in a different manner (see Algorithms 3 and 4): the split point is randomly selected for each considered predictive attribute. The evaluation of splits with random split points is performed using the same procedure as for regular PCTs.
3.3 Ensembles with ROS
An ensemble is a set of models, called base predictive models. Ensemble models are not considered interpretable, but they generally achieve better predictive performance than individual models, which is usually the reason for using them. The downside of using ensembles is their computational complexity: The cost of learning and using an ensemble model is the sum of the corresponding costs for all of its base models. Predictions for new examples are made by querying base models and combining their predictions.
3.3.1 Generating output space partitions
The proposed ensemble approach introduces randomization in the output space. Whereas regular PCTs simultaneously consider the whole target space in the heuristic used for tree construction, ROS considers a different random subset of it for each base model in the ensemble. Each base model is consequently learned by considering only those targets that are included in the randomly generated partition provided to it (see the call of function \(\varPi \) in Algorithm 2, line 2).
In the first step, we create an empty list that will contain all the subspaces, i.e., \(G = [G_1, G_2, \dots, G_b]\). The first generated subspace is T and includes all target variables, i.e., the corresponding PCT considers all target attributes. This is needed to ensure that all targets are considered at least once. We generate the remaining \(b-1\) subspaces with the \(\theta\) function, which has a parameter v. An example of a \(\theta\) function could return a random selection of 25% (\(v=\frac{1}{4}\)) of the items in the set provided as input. If one defines \(\theta(X) = X\), then all ensemble constituents will always consider all targets, which is what a regular ensemble of PCTs does. This function is a parameter of our overall ensemble learning algorithm and we investigate its influence in Sect. 5.
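The subspace generation just described can be sketched as follows (hypothetical names; the actual procedure is given as Algorithm 5): the first subspace contains all targets, and each of the remaining b − 1 subspaces holds a random fraction v of them.

```python
import random

def gen_subspaces(targets, b, v, rng=None):
    """Build a list G of b target subspaces. G[0] keeps all targets, so
    every target is considered at least once; the remaining b - 1 entries
    each sample a fraction v of the targets, without replacement."""
    rng = rng or random.Random(0)
    k = max(1, round(v * len(targets)))   # at least one target per model
    groups = [list(targets)]              # first subspace: all targets
    for _ in range(b - 1):
        groups.append(rng.sample(targets, k))
    return groups
```

With eight targets, `b = 5` and `v = 0.25`, this yields one full subspace and four random two-target subspaces.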
3.3.2 Building the ensembles
Table 1: Ensemble building parameters

Ensemble method | \(\delta_c(D)\) | \(\gamma(X)\) | BestTest
Bag of PCTs | \(D\) | bootstrap(X) | Algorithm 2
RF of PCTs | \(\theta\big(D,\frac{1}{\sqrt{|D|}}\big)\) | bootstrap(X) | Algorithm 2
Extra-PCTs | \(D\) | \(X\) | Algorithm 3
Bagging (Breiman 1996) is short for bootstrap aggregating. It is an ensemble method that uses bootstrap replication of the training data to introduce randomization into the learning dataset. Such perturbations of the learning set have proven useful for unstable base models, such as decision trees, but can in general be used with any model type. A bootstrap replicate \(S^*\) of a dataset S is again a dataset that has been randomly sampled from S. Sampling with replacement is repeated until both datasets are of equal size (i.e., \(|S| = |S^*|\)).
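A bootstrap replicate can be sketched in a few lines (assuming the examples are stored in a list); on average, about 63.2% of the distinct original examples appear in a replicate.

```python
import random

def bootstrap(examples, rng=None):
    """Sample len(examples) items from examples with replacement,
    producing a bootstrap replicate of the same size as the original."""
    rng = rng or random.Random(0)
    return [rng.choice(examples) for _ in range(len(examples))]
```

Each base model of a bagged ensemble is then learned on its own replicate.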
Random forests (Breiman 2001) work in a similar fashion to bagging. This ensemble method also starts with bootstrap replicates, which introduce randomization in the instance space. However, it additionally introduces randomization in the predictive attribute space by randomizing the algorithm for the base predictive models. The \(\delta_c\) parameter of the PCT induction algorithm (see Algorithm 1) is instantiated as shown in Table 1. This causes the random forest ensemble method to only consider a subset of randomly selected predictive attributes from the set D of all predictive attributes while searching for the best split for a node. This process of random selection of predictive attributes is repeated afresh at each node, yielding different subsets of predictive attributes. The function \(\delta_c(D)\) can be defined to return any number of items from the set D between 1 and \(|D|\), but the setting recommended by Breiman (2001) is \(\sqrt{|D|}\), which is what we use.
Extremely randomized trees (Geurts et al. 2006) are very unstable decision trees; it therefore only makes sense to use them in the ensemble setting. The method has two distinctive properties with respect to the other two methods: (i) the dataset is not perturbed by bootstrapping and (ii) the BestSplit method used by extra-PCTs is shown in Algorithm 3. Extra trees select k predictive attributes at random and, for each of them, randomly select a split point.^{1} Each split is then evaluated and the one with the best heuristic value \(h^*\) is selected. Algorithm 4 shows how the random split points are determined. The recommended number (Geurts et al. 2006) of predictive attributes to be considered at every split is \(|D|\), which is reflected in our ensemble initialization for extra trees (see Table 1).
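The split selection of extra trees can be sketched as follows (illustrative names; the heuristic is left abstract): pick k attributes at random, draw one uniformly random split point per attribute, evaluate each candidate, and keep the best.

```python
import random

def extra_split(X, k, heuristic, rng=None):
    """Return the (attribute, threshold) pair with the best heuristic
    value among k randomly chosen attributes, each paired with one
    random split point drawn between its observed min and max."""
    rng = rng or random.Random(0)
    attrs = rng.sample(range(len(X[0])), k)
    candidates = []
    for a in attrs:
        col = [row[a] for row in X]
        threshold = rng.uniform(min(col), max(col))  # one random split point
        candidates.append((a, threshold))
    # keep the candidate split with the best (highest) heuristic value
    return max(candidates, key=lambda c: heuristic(*c))
```

The single random split point per attribute is what makes the method both fast and highly randomized.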
3.3.3 Making predictions
An ensemble makes predictions by combining the predictions of its base models. Each base predictive model gives its predictions to the aggregation function, which takes all the votes and decides on the final prediction of the ensemble. In general, the aggregation function is a parameter and there are many ways to combine the votes of the base predictive models: averaging the predictions, majority vote, introducing weights for individual models, introducing preferences based on domain knowledge, and so on. In this paper, we propose two different aggregation functions used in conjunction with the proposed method: (i) total averaging and (ii) subspace averaging.
Total averaging takes all the predictions of the base models and averages them. Each base model gives predictions for all t targets. We calculate the average as the arithmetic mean: the final prediction for the ith target attribute (\(\widehat{y}_i\)) is computed as \(\widehat{y}_i=\frac{1}{b}\sum _{j=1}^{b}y_{i}^{j}\), where \(y_{i}^{j}\) represents the prediction of the jth base model for the ith target attribute.
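The two aggregation functions can be sketched side by side (hypothetical helper): total averaging uses every model's vote for every target, whereas subspace averaging restricts the vote, per target, to the models whose ROS subset contained that target during learning.

```python
def aggregate(predictions, subsets, t, mode="total"):
    """predictions[j] is model j's length-t prediction vector and
    subsets[j] is the set of target indices that model considered
    while learning. Returns the ensemble's length-t prediction."""
    final = []
    for i in range(t):
        if mode == "total":
            votes = [p[i] for p in predictions]
        else:  # "subspace": only models that learned on target i vote
            votes = [p[i] for p, s in zip(predictions, subsets) if i in s]
        final.append(sum(votes) / len(votes))
    return final
```

Because the first generated subspace always contains all targets, every target is guaranteed at least one vote under subspace averaging.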
3.4 Computational complexity analysis
From the work of Kocev et al. (2013) and the assumption that the decision tree is balanced and bushy (Witten and Frank 2005), it follows that the computational complexity of learning a single multi-target PCT is \(\mathcal{O}(dN\log^2 N) + \mathcal{O}(dtN\log N) + \mathcal{O}(N\log N)\), where N is the number of instances, d is the number of predictive attributes and t is the number of target attributes in the dataset. Similarly, from the work of Kocev and Ceci (2015) and under the same assumption, it follows that the computational complexity of learning a single multi-target extra-PCT is \(\mathcal{O}(ktN\log N) + \mathcal{O}(N\log N)\), where k is the number of randomly sampled predictive attributes at each split. In general, learning an ensemble of b base models has the complexity of learning all of its constituents. In our case, that amounts to \(b(\mathcal{O}(dN\log^2 N) + \mathcal{O}(dtN\log N) + \mathcal{O}(N\log N))\) for bagging and random forests of PCTs and \(b(\mathcal{O}(ktN\log N) + \mathcal{O}(N\log N))\) for ensembles of extra-PCTs.
The computational complexity also depends on the use of bootstrapping and on the number of predictive and/or target attributes considered for each base model. The computational cost of bootstrapping is \(\mathcal{O}(N)\), and the expected number of distinct instances considered in that case equals \(N'=0.632 \cdot N\) (Breiman 1996). Bootstrapping is not used for learning extra-PCTs.
Taking into account the fact that random forests also sample the input space (through the sampling function \(\delta _c(D)\)), the number of predictive variables actually considered by the base models is \(d'=c\) (see the definition of \(\delta _c\) in Sect. 3.2). The sampling of predictive variables happens at every node split, so the complexity of data subsampling is \(\mathcal {O}(d'logN')\).
Ensembles usually contain many base models, which results in longer prediction times. Therefore, we also address the complexity of making predictions. Under the previously mentioned assumption that decision trees are balanced and bushy, the average depth of a decision tree equals the average length of the path that an instance has to traverse in order to reach a prediction. The complexity of making a prediction with a single-target decision tree is therefore \(\mathcal{O}(\log N)\). In a global MTR scenario, all target variables are predicted simultaneously with the same complexity as that of making a prediction with a single-target tree. When we switch to the ensemble setting, the complexity increases linearly with the number of base models in the ensemble: \(b \cdot \mathcal{O}(\log N)\). If we approach the problem of MTR locally, each target is predicted with its own ensemble, which additionally increases the complexity in proportion to the number of target variables: \(b \cdot t \cdot \mathcal{O}(\log N)\).
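As a back-of-the-envelope check of the last claim, with illustrative values (not taken from the paper) of N = 10,000 examples, b = 100 trees and t = 16 targets, the local approach visits t times as many tree nodes per prediction as the global one:

```python
import math

N, b, t = 10_000, 100, 16          # illustrative values only
depth = math.log2(N)               # average depth of a balanced binary tree
global_cost = b * depth            # global MTR: one ensemble, all targets
local_cost = b * t * depth         # local MTR: one ensemble per target
print(local_cost / global_cost)    # the local approach is t times costlier
```

The ratio is independent of N and b, which is why the gap grows with the number of targets but not with dataset or ensemble size.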
4 Experimental design
To evaluate the performance of the ROS ensembles for MTR, we performed extensive experiments on benchmark datasets. This section presents: (i) the experimental questions addressed, (ii) the evaluation measures used, (iii) the benchmark datasets and (iv) the experimental setup (including the parameter instantiations for the methods used in the experiments).
4.1 Experimental questions
In our experiments, we construct PCT ensembles for MTR by using the described ensemble extension method ROS. In order to better understand the effects of ROS, we investigate the resulting ensemble models across three dimensions.
First, we are interested in the convergence of their predictive performance as we increase the number of PCTs in the ensemble. We want to establish the number of base models needed in an ensemble to reach the point of performance saturation. We consider an ensemble saturated when adding more base models to it would not bring a statistically significant improvement in terms of predictive power.
Next, we are interested in whether the proposed extension can improve the predictive performance over that of the original ensembles. Learning on subsets of targets could exploit additional structural relations that may be overlooked by the original ensemble approaches.
Finally, as we have theoretically derived in Sect. 3.4, we expect that the dimensionality reduction of the output space will yield improvements in terms of computational efficiency. Specifically, we are interested in the running times of the ROS ensemble approaches and the sizes of the resulting models.
Specifically, we pose the following experimental questions:

1. How many base models do we need in ROS ensembles in order to reach the point of performance saturation?
2. What is the best value for the portion of the target space to be used within such ensembles? Is this portion equal for all evaluated ensemble methods?
3. Does it make sense to change the default aggregation function of the ensemble, which uses the predictions for all targets? Can this improve predictive performance?
4. Considering predictive performance, how do ROS ensemble methods compare to the original ensemble methods?
5. Is ROS helpful in terms of time efficiency?
6. Do ROS models use less memory than the models trained with the original ensemble methods?
7. How do ROS models compare to other output transformation methods?
4.2 Evaluation measures
In order to understand the effects that ROS has on the learning process, we first need to evaluate the models induced by the ROS ensemble approaches. In machine learning, this is most commonly achieved through empirical evaluation, which assesses the performance of a given model in terms of evaluation measures. Below, we describe the measures we use for assessing predictive power, time and space complexity.
4.3 Data description
Table 2: Properties of the considered MTR datasets with multiple continuous targets: number of examples (N), number of predictive attributes (discrete/continuous, d/c), and number of target attributes (t)

No. | Name of dataset | N | d/c | t
1 | Forestry-Kras (Džeroski et al. 2006) | 60,607 | 0/160 | 11
2 | Vegetation clustering (Gjorgjioski et al. 2008) | 29,679 | 0/65 | 11
3 | Vegetation condition (Kocev et al. 2009) | 16,967 | 1/39 | 7
4 |  | 106 | 0/16 | 14
5 | ATP 1D (Spyromitros-Xioufis et al. 2016) | 337 | 0/441 | 6
6 | ATP 7D (Spyromitros-Xioufis et al. 2016) | 296 | 0/441 | 6
7 | RF1 (Spyromitros-Xioufis et al. 2016) | 9125 | 0/64 | 8
8 | RF2 (Spyromitros-Xioufis et al. 2016) | 9125 | 0/576 | 8
9 |  | 639 | 0/401 | 12
10 | SCM 1D (Spyromitros-Xioufis et al. 2016) | 9893 | 0/280 | 16
11 | SCM 20D (Spyromitros-Xioufis et al. 2016) | 8966 | 0/61 | 16
12 | OES 10 (Spyromitros-Xioufis et al. 2016) | 403 | 0/298 | 16
13 | OES 97 (Spyromitros-Xioufis et al. 2016) | 334 | 0/263 | 16
14 | Soil resilience (Debeljak et al. 2009) | 26 | 1/7 | 8
15 | Prespa diatoms lake top 10 (Kocev et al. 2010) | 218 | 0/16 | 10
16 | PPMI (Marek et al. 2011) | 713 | 0/148 | 35
17 | ADNI (Gamberger et al. 2016) | 659 | 0/232 | 14
4.4 Experimental setup
We designed the experimental setup according to the experimental questions posed in Sect. 4.1. First, we describe all parameter settings of the ROS ensemble methods. We then outline the procedures for statistical analysis of the results.
We consider three types of ensembles: bagging and random forests of PCTs and extraPCTs. In order for Algorithm 7 to simulate these three ensemble methods, we set its parameters to the values given in Table 1. Following the recommendations from Bauer and Kohavi (1999), the trees in the ensembles are unpruned. Our experimental study considers different ensemble sizes, i.e., different numbers of base models (PCTs) in the ensemble, in order to investigate the saturation of ensembles and to select the saturation point.
First, we construct ensembles without ROS (Bag, RF, ET) that use the full output space for learning the base predictive models. This means that the list G contains b sets, each of which contains all the target attributes from the set T, i.e., \(G = \left\{ T,T,\ldots ,T\right\} \), where \(|G| = b\).
The second part of our experiments is concerned with the proposed extension—Random Output Selections. We start with the parametrization of the GenSubspaces function (Algorithm 5), which takes as input b, T and the sampling function \(\theta (X,v)\) (see Algorithm 6). We consider four values for v in the allowed range (0.0, 1.0), namely \(\frac{1}{\sqrt{|T|}}, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}\). Additionally, we use two ensemble prediction aggregation functions: total averaging and subspace averaging. Table 3 summarizes the parameter values considered in our experiments.
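To make this parametrization concrete, the following sketch emulates the behavior of GenSubspaces and the sampling function described above; the authoritative definitions are Algorithms 5 and 6, so treat this as an illustrative reading in which \(\theta\) is assumed to draw, uniformly and without replacement, a subset containing \(\max(1, \mathrm{round}(v \cdot |X|))\) targets:

```python
import random

def theta(X, v, rng=random):
    """Assumed reading of the sampling function theta(X, v): draw, uniformly
    and without replacement, max(1, round(v * |X|)) elements of X."""
    k = max(1, round(v * len(X)))
    return set(rng.sample(sorted(X), k))

def gen_subspaces(b, T, v, rng=random):
    """Build the list G of b random target subspaces, one per base model.
    With v = 1 every subspace equals T, i.e., the original (non-ROS)
    ensembles with G = {T, T, ..., T}."""
    return [theta(T, v, rng) for _ in range(b)]
```

For example, `gen_subspaces(100, set(range(16)), 0.5)` yields 100 subsets, each covering 8 of 16 targets, matching the \(v=\frac{1}{2}\) setting on a dataset with 16 targets.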
Parameter values used to build ensembles with ROS
Location of use  Parameter  Used values 

\(BuildEnsemble(S,\gamma ,\delta _c,G)\)  \(\gamma , \delta _c\)  See Table 1 
\(GenSubspaces(b,T,\theta )\)  b  10, 25, 50, 75, 100, 150, 250 
\(\theta (X,v)\)  v  \(\frac{1}{4},\frac{1}{2}, \frac{3}{4}, \frac{1}{\sqrt{|T|}}\) 
Making predictions  Averaging function  Total, subspace 
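The two aggregation functions differ only in which base models vote for a given target: with total averaging every base model contributes to every target's prediction, whereas with subspace averaging the prediction for a target is averaged only over the models whose output subspace contained that target during learning. A minimal sketch (assuming each base model emits a prediction for every target; the fallback to total averaging for a target no model considered is our own safeguard):

```python
import numpy as np

def total_averaging(preds):
    """preds: list of per-model prediction vectors (one value per target).
    Every base model contributes to every target's prediction."""
    return np.mean(preds, axis=0)

def subspace_averaging(preds, subspaces, targets):
    """For each target, average only the models whose output subspace
    contained that target during learning."""
    out = np.empty(len(targets))
    for j, t in enumerate(targets):
        votes = [p[j] for p, s in zip(preds, subspaces) if t in s]
        # Safeguard (our assumption): fall back to total averaging if no
        # base model considered target t during learning.
        out[j] = np.mean(votes) if votes else np.mean([p[j] for p in preds])
    return out
```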
We estimate the predictive performance of the considered methods by using 10fold crossvalidation. All methods use the same folds. For statistical evaluation of the obtained results, we follow the recommendations from Demšar (2006). The Friedman test (Friedman 1940), with the correction by Iman and Davenport (1980), is used to determine statistical significance. In order to detect statistically significant differences, we calculate critical distances (CD) by applying the Nemenyi (1963) or Bonferroni–Dunn (Dunn 1961) posthoc statistical tests. Both posthoc tests compute a critical distance between the average ranks of the considered algorithms. The difference is that the Nemenyi posthoc test compares the relative performance of all considered methods (all vs. all), whereas the Bonferroni–Dunn posthoc test compares the performance of a single method to the other methods (one vs. all). The results of these tests are presented with average rank diagrams (Demšar 2006), where methods connected with a line have results that are not statistically significantly different. All statistical tests were conducted at the significance level \(\alpha = 0.05\), and for two variants of the results: per dataset (using the aRRMSE value for each dataset) and per target (using the RRMSE values for all targets of all datasets). We use the Bonferroni–Dunn posthoc test (CD shown as a dotted blue line) to present the results in Sect. 5.4 and the Nemenyi posthoc test (CD shown as a solid red line) otherwise.
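The critical distance in the average rank diagrams follows the standard formula from Demšar (2006), \(CD = q_\alpha \sqrt{k(k+1)/(6N)}\), where k is the number of compared methods and N the number of datasets (or targets). A sketch with the \(\alpha = 0.05\) critical values of the studentized range statistic for the Nemenyi test (helper names are ours):

```python
import numpy as np

# q_alpha values (alpha = 0.05) for the Nemenyi test, based on the
# studentized range statistic (Demsar 2006).
Q05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def _row_ranks(row):
    """Ranks within one dataset (1 = best, i.e., lowest error), ties averaged."""
    order = np.argsort(row, kind="stable")
    ranks = np.empty(len(row))
    i = 0
    while i < len(row):
        j = i
        while j + 1 < len(row) and row[order[j + 1]] == row[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    return ranks

def average_ranks(errors):
    """errors: (N x k) matrix of e.g. aRRMSE values, one row per dataset."""
    return np.array([_row_ranks(r) for r in errors]).mean(axis=0)

def nemenyi_cd(k, n):
    """Two methods differ significantly (alpha = 0.05) if their average
    ranks differ by more than this critical distance."""
    return Q05[k] * np.sqrt(k * (k + 1) / (6.0 * n))
```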
The experiments were executed on a heterogeneous computing infrastructure, i.e., the SLING grid, which can affect timesensitive evaluations. To avoid incomparable measurements of running times, we ran the timesensitive experiments separately, on a single computational node.
5 Results and discussion
Here we present the results of our comprehensive experimental study. Given the large number of datasets (17) and several ensemble methods, we present the results in terms of predictive performance (aRRMSE, overfitting score—OS), time complexity (learning and prediction time) and space complexity (model size). In the presentation of time complexity results, we focus on two datasets, Forestry Kras and OES 10, which have relatively large output spaces (11 and 16 targets, respectively). The selected datasets also differ in the number of examples: Forestry Kras has many, whereas OES 10 has few. For reference, all other results are available in “Appendix”.
The presentation and discussion of the results follows the experimental questions from Sect. 4.1. First, we examine the convergence of the original and ROS ensembles. Next, we focus on selecting the output space size, experimenting with the four sizes listed in Table 3. This parameter is crucial because it introduces an additional point of randomization into all three considered ensemble methods. In that sense, ROS can also be seen as a localization process: the constructed base models are tailored to a specific output subspace. We recommend values for this parameter for each ensemble learning method. Furthermore, we show the effects of changing the aggregation function in our ensembles. Finally, we use the recommended ROS parameters to provide an overall evaluation, comparing the extended ensembles to the original ones in terms of predictive power, running times and model sizes.
5.1 Ensemble convergence
We next investigate the saturation of the ROS ensembles. A subset of the results, for ensembles with 50, 100, 150 and 250 models, is reported in Fig. 2 and illustrates the saturation of ROS ensembles for all three considered ensemble methods. Lines on the plots represent different output space sizes; values in brackets indicate the value of the v parameter. The left and right sides of the plots depict voting with total averaging and subspace averaging, respectively. The y axis shows aRRMSE values averaged over all considered datasets. The results show that BagROS and ETROS ensembles saturate between 50 and 100 base models, while RFROS ensembles saturate a bit later, between 75 and 100 base models. Figure 2 also suggests that ROS ensembles saturate at a larger number of base models when subspace averaging is used to aggregate the predictions of the base models. Performance in terms of aRRMSE and overfitting scores of all discussed ensembles with 100 trees (multitarget and singletarget variants) and single multitarget regression trees is presented in “Appendix B”.
5.2 ROS parameter selection
This section describes the selection of the best performing output subspace size and aggregation function. The considered ensemble methods introduce different randomizations in their learning processes, so we cannot assume that ROS has the same influence on all three types of ensemble methods. Figure 2 also suggests that the choice of the aggregation function has a direct effect on performance. We therefore analyze the effect of output subspace size and aggregation function for each ensemble type separately.
We selected candidate values for the ROS parameters based on the curves given in Fig. 2. With both aggregation functions, the candidate subspace sizes are \(v=\frac{1}{2}\) for BagROS and \(v=\frac{3}{4}\) for RFROS and ETROS. We selected these values because they exhibit the lowest aRRMSE averaged over all datasets used in this study. Where the averaged saturation curves in Fig. 2 intertwine and make this decision difficult, we selected the parameter values based on the averaged performance of ensembles with 100 trees. Next, we performed a simple analysis comparing the wins of the two considered aggregation functions at the candidate output space size. For BagROS, total averaging had the most wins, whereas subspace averaging was dominant for RFROS and ETROS. Our final recommendation is therefore to use total averaging with \(v=\frac{1}{2}\) for BagROS and subspace averaging with \(v=\frac{3}{4}\) for RFROS and ETROS.
5.3 Predictive performance and computational efficiency
Figure 3 depicts two average rank diagrams: one per dataset and one per target. The per dataset diagram is based on one aRRMSE value per dataset. Both analyses show that the ensembles statistically significantly outperform individual multitarget PCTs. The per dataset analysis shows no statistically significant differences in predictive performance among the other methods. We can, however, note that BagROS and ETROS outperform their original counterparts and ensembles of singletarget PCTs. RFROS performs on par with the original bagging and random forest ensembles, but worse than the other ROS ensembles (BagROS and ETROS). The best performing method overall is ETROS.
The per target analysis detects two statistically significant differences in performance. First, with the exception of ETST, ETROS outperforms all other methods with statistical significance. Second, BagROS outperforms RFROS, which performs worst of all ensemble methods. The original ensembles (Bag, RF and ET) show no statistically significant differences in performance among themselves. All in all, ROS ensembles generally perform better than their original counterparts, with the exception of random forests.
Performance of ensembles and single trees on two datasets (Forestry Kras and OES 10) measured in terms of aRRMSE, overfitting score, average learning times, average perinstance prediction time and model complexity (total number of nodes)
Dataset  Method  aRRMSE  OS  \(\overline{LT}\) (s)  \(\overline{PT}\) (\(\mu s\))  Complexity 

Forestry Kras  Bag  0.55  0.476  394  272  \(3.22 \cdot 10^6\) 
BagROS  0.548  0.471  267  274  \(3.20 \cdot 10^6\)  
BagST  0.551  0.494  34,500  3960  \(34.61\cdot 10^6\)  
RF  0.545  0.44  34.15  259  \(3.16 \cdot 10^6\)  
RFROS  0.546  0.44  36.87  273  \(3.15 \cdot 10^6\)  
RFST  0.546  0.457  2250  2300  \(17.13 \cdot 10^6\)  
ET  0.557  0.575  450  264  \(3.67 \cdot 10^6\)  
ETROS  0.557  0.579  281  274  \(3.67 \cdot 10^6\)  
ETST  0.56  0.611  73,780  3450  \(39.96 \cdot 10^6\)  
Multitarget PCT  0.61  0.169  76.68  2.39  \(2.59 \cdot 10^3\)  
OES 10  Bag  0.531  1.114  19  157  \(3.15 \cdot 10^4\) 
BagROS  0.527  1.052  17  159  \(3.14 \cdot 10^4\)  
BagST  0.487  1.556  412  3960  \(21.5 \cdot 10^4\)  
RF  0.517  1.093  1.59  189  \(3.15 \cdot 10^4\)  
RFROS  0.518  1.132  1.49  208  \(3.16 \cdot 10^4\)  
RFST  0.492  1.525  69  4760  \(43.51 \cdot 10^4\)  
ET  0.514  0.986  18.27  180  \(3.48 \cdot 10^4\)  
ETROS  0.496  1.156  18.57  201  \(3.50 \cdot 10^4\)  
ETST  0.467  2.654  480  4410  \(51.29 \cdot 10^4\)  
Multitarget PCT  0.616  0.704  0.80  2.81  \(0.45 \cdot 10^3\) 
For the Forestry Kras dataset, the proposed ROS methods do not have a notable effect on the predictive performance (aRRMSE) of the three ensemble methods. Similar findings are observed for the overfitting score (OS): the ROS ensembles overfit the training data to the same extent as their original counterparts. Next, multitarget PCTs and ensembles of singletarget PCTs have the worst predictive performance, although the difference between ensembles of singletarget PCTs and the other ensembles is minimal. However, notable differences exist in the time needed for learning the model and making predictions: the multitarget ensembles have significantly lower learning and prediction times than the singletarget ensembles. The ROS extension speeds up ensemble learning (for bagging and extra trees), but the learning times stay within the same order of magnitude as those of the original methods. Not surprisingly, single multitarget PCTs have the shortest learning times, at the cost of the lowest predictive performance. Similar findings are observed for model complexity (measured as the total number of nodes in all of the trees in an ensemble or in a multitarget PCT). Average prediction times per instance do not differ across the different approaches. This is expected, since all base models are trees and no additional computational overhead is needed to calculate the predictions. Ensembles of singletarget PCTs always have an order of magnitude higher learning and prediction times, as well as model complexity, because a separate ensemble is learned for predicting each target.
For the OES 10 dataset, improvements in predictive performance are present. The proposed ROS ensembles outperform their original counterparts. Furthermore, the original ensembles were outperformed by the ensembles of singletarget PCTs. The predictive performance gain with ETROS w.r.t. ET is substantial. This is an interesting observation and suggests that ROS could lift predictive performance on smaller datasets with larger output spaces, especially for heavily randomized methods such as extra trees. One possible explanation is that the sampling of input variables in ET, coupled with the small number of examples in the dataset and absence of bootstrapping, introduces a relatively high level of noise in the learning process. The ROS ensemble then actually reduces the effect of this noise at the level of individual base models by specializing them for a smaller output space. This can also explain the small gains for bagging and random forests with the ROS extension on this dataset, because the bootstrapping actually negatively impacts the overall ensemble performance. By inspecting the overfitting score, we note that ROS ensembles consistently exhibit a decreased score w.r.t. ensembles of singletarget PCTs and perform comparably w.r.t. ensembles of multitarget PCTs. Learning and prediction times, as well as model complexity, follow similar patterns as for the Forestry Kras dataset.
5.4 Comparison with other output space transformation methods
In order to put ROS in the broader context of MTR methods with output space transformations, we compare the predictive performance of ROS ensembles and ensembles built with the competing methods proposed in Joly et al. (2014) and Tsoumakas et al. (2014). We have selected these specific methods because they all specialize individual models in the ensemble to a subset of target variables.
Joly et al. (2014) propose ensembles of multioutput regression trees, where each individual tree is built by using a projected output space. Gaussian, Rademacher, Hadamard and Achlioptas projections are used. The goal is to truncate the output space in order to reduce the number of calculations needed to find the best split, which is the main computational burden when building a decision tree. While learning the ensemble, each tree is given a different output space projection. They use two different ensemble methods: random forests and extra trees. We dub their method Random projections and denote its two variants RPRF and RPET. Note that Random projections cannot handle nominal attributes and missing values. Hence, the nominal attributes have been converted to numeric scales and missing values have been imputed with the arithmetic mean of the corresponding feature.
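As an illustration of the idea (not a reimplementation of their method), projecting an \(m \times q\) output matrix onto r random directions with a Rademacher matrix, one of the projection types they consider, can be sketched as follows; the \(1/\sqrt{r}\) scaling is the usual choice for approximately preserving distances:

```python
import numpy as np

def rademacher_projection(Y, r, rng=None):
    """Project the (m x q) output matrix Y onto r random directions using a
    Rademacher matrix (entries +1 or -1 with equal probability).  Each tree
    in the ensemble would receive Y @ P for a freshly drawn P."""
    rng = rng or np.random.default_rng(0)
    P = rng.choice([-1.0, 1.0], size=(Y.shape[1], r)) / np.sqrt(r)
    return Y @ P, P
```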
Tsoumakas et al. (2014) propose an ensemble method called Random Linear Target Combinations for MTR (RLC). They construct new target variables via random linear combinations of existing ones. The data must be normalized in order for the linear combinations to make sense, i.e., to prevent targets on larger scales from dominating those on smaller scales and thus deteriorating the learning process. The output space is transformed in such a way that each linear combination consists of k original output features. Each combination is then considered for learning one ensemble member. The transformation of the output space matrix \(\varvec{Y}\) (\(m \times q\)) is achieved via a coefficient matrix \(\varvec{C}\) of size \(q \times r\) filled with random values uniformly chosen from [0, 1]. Columns of the matrix \(\varvec{C}\) represent coefficients of the linear combinations of the target variables. By multiplying the two matrices, we get the transformed output space \(\varvec{Y'}=\varvec{Y}\varvec{C}\) (\(m \times r\)) that is then used for training. A userselected regression algorithm can then be applied to the transformed data.
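The transformation described above can be sketched directly (assuming the targets have been normalized beforehand; the names are ours):

```python
import numpy as np

def rlc_transform(Y, r, k, rng=None):
    """Random Linear Target Combinations: build a (q x r) coefficient
    matrix C in which every column combines k randomly chosen original
    targets with coefficients drawn uniformly from [0, 1], then return
    the transformed output space Y' = Y C."""
    rng = rng or np.random.default_rng(0)
    q = Y.shape[1]
    C = np.zeros((q, r))
    for col in range(r):
        chosen = rng.choice(q, size=k, replace=False)  # k targets per combination
        C[chosen, col] = rng.uniform(0.0, 1.0, size=k)
    return Y @ C, C
```

One regressor is then trained per column of \(\varvec{Y'}\); predictions for the original targets are recovered from the r combined predictions, e.g., by least squares against \(\varvec{C}\).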
5.5 Summary of the results
 1.
How many base models do we need in ROS ensembles in order to reach the point of performance saturation?
The saturation point of the original PCT ensembles is between 50 and 75 base models. BagROS and ETROS ensembles saturate between 50 and 100 base models, whereas RFROS ensembles saturate somewhat later, at 75 to 100 base models. In the comparative analysis of performance, we consider ensembles with 100 base models (in order to make the comparison fair for all considered methods).
 2.
What is the best value for the portion of target space to be used within such ensembles? Is this portion equal for all evaluated ensemble methods?
The most appropriate size of the portion of target space to be used varies with the ensemble method. The results suggest to use \(v=\frac{1}{2}\) for BagROS and \(v = \frac{3}{4}\) for RFROS and ETROS.
 3.
Does it make sense to change the default aggregation function of the ensemble that uses the prediction for all targets? Can this improve predictive performance?
Changing the aggregation function changes the behaviour of the ROS ensembles. For BagROS, it can even decrease the predictive performance, so we recommend using the standard aggregation function, i.e., total averaging. For RFROS and ETROS we recommend making predictions with subspace averaging.
 4.
Considering predictive performance, how do ROS ensemble methods compare to the original ensemble methods?
Using ROS can improve the predictive performance of PCT ensembles. This is especially notable when using ETROS with small datasets with larger output spaces.
 5.
Is ROS helpful in terms of time efficiency?
The observed learning times for ROS methods can be substantially lower than the ones of their original counterparts. This especially holds for large datasets. Prediction times, however, do not change.
 6.
Do ROS models use less memory than the models trained with the original ensemble methods?
Ensemble models obtained with ROS have sizes comparable to those of the models produced by the original ensemble methods.
 7.
How do ROS models compare to other output transformation methods?
ETROS ensembles generally perform better than ensembles of other competing output transformation methods.
6 Conclusions
This work has addressed the task of learning predictive models that predict the values of multiple continuous variables for a given input tuple, referred to as multitarget regression (MTR). MTR is thus a task of structured output prediction. There are two general approaches to solving tasks of this nature. The first, local approach learns a separate model for every component of the predicted structure, whereas the second, global approach learns one model capable of predicting all components of the structure simultaneously.
We have proposed novel ensemble methods for MTR. An ensemble is a set of predictive models whose predictions are combined to yield the model output. The proposed methods build on well known methods for learning ensembles that have been extended to structured outputs. The base models we consider are predictive clustering trees (PCTs) for MTR. The proposed methods are based on the ensemble extension ROS—Random Output Selections. For each ensemble constituent (PCT), the extension randomly selects the targets that are considered while learning that particular base model. We performed an extensive experimental evaluation of three ensemble methods extended with ROS, i.e., bagging, random forests and extraPCTs, on 17 benchmark datasets of varying sizes in terms of the number of examples, the number of predictive attributes and the number of target attributes.
The results show that the proposed extension has a favorable effect, yielding lower error rates and shorter running times. ROS coupled with bagging and extra trees can outperform the original ensemble methods. Random forests do not benefit from ROS in terms of predictive power, but do benefit in terms of shorter learning times. ETROS (extra trees with ROS) statistically significantly outperforms all original ensemble methods and their ROS variants (when analyzing predictive performance on a per target basis). We also conducted experiments with three competing methods, showing that the proposed method yields the best performance.
We have also provided a computational complexity analysis for the proposed ensemble extension. Our experiments confirm the results of the theoretical analysis. Ensembles with ROS can yield better predictive performance, as well as reduce learning times, whereas the sizes of the induced models do not change notably.
We plan future work along several possible directions. To begin with, the aggregation function has an effect on the ensemble predictive performance, as the present work demonstrates. We plan to design a new aggregation function by combining total averaging and subspace averaging, hoping to achieve better performance and better understanding of the effect of subspace averaging. Additionally, we can use outofbag errors to derive aggregation weights, such that ensemble constituents with higher error rates would make a smaller contribution to the final prediction. Furthermore, we could perform biasvariance decomposition of the error of all the investigated methods and investigate the sources of errors.
Following an alternative direction, the process of generating target subspaces could also be adapted. The current approach generates target subspaces at random, which is not necessarily the best approach. The relations between target variables could be exploited in order to generate a smaller set of more sensible subspaces.
The final direction we intend to follow is the adaptation of the proposed approaches to other structured output prediction tasks, such as multitarget classification, (hierarchical) multilabel classification, and timeseries prediction. For all of these tasks, global random forests are already being used to obtain feature rankings in the context of predicting structured outputs. ROS could improve such rankings by considering subsets of the set of target attributes in the process of producing them.
Footnotes
 1.
This is in contrast to the normal procedure of considering multiple/all possible split points for each attribute, before selecting the best one.
Acknowledgements
We acknowledge the financial support of the Slovenian Research Agency via the grants P20103 and a young researcher grant to MB, as well as the European Commission, through the grants MAESTRA (Learning from Massive, Incompletely annotated, and Structured Data) and HBP (The Human Brain Project), SGA1 and SGA2. SD also acknowledges support by Slovenian Research Agency (via grants J47362, L27509, and N20056), the European Commission (project LANDMARK) and ARVALIS (project BIODIV). The computational experiments presented here were executed on a computing infrastructure from the Slovenian Grid (SLING) initiative.
References
 Abraham, Z., Tan, P. N., Winkler, J., Zhong, S., Liszewska, M., et al. (2013). Position preserving multioutput prediction. In Joint European conference on machine learning and knowledge discovery in databases (pp. 320–335). Springer.
 Aho, T., Ženko, B., Džeroski, S., & Elomaa, T. (2012). Multitarget regression with rule ensembles. Journal of Machine Learning Research, 13, 2367–2407.
 Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. (2012). Kernels for vectorvalued functions: A review. Foundations and Trends® in Machine Learning, 4(3), 195–266.
 Appice, A., & Džeroski, S. (2007). Stepwise induction of multitarget model trees. In Machine Learning: ECML 2007, LNCS (Vol. 4701, pp. 502–509). Springer.
 Appice, A., & Malerba, D. (2014). Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering. Data Mining and Knowledge Discovery, 28(5–6), 1266–1313.
 Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1), 105–139.
 Blockeel, H. (1998). Topdown induction of first order logical decision trees. Ph.D. thesis, Katholieke Universiteit Leuven, Leuven, Belgium.
 Blockeel, H., Džeroski, S., & Grbović, J. (1999). Simultaneous prediction of multiple chemical parameters of river water quality with TILDE. In Proceedings of the 3rd European conference on PKDD—LNAI (Vol. 1704, pp. 32–40). Springer.
 Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Topdown induction of clustering trees. In Proceedings of the 15th international conference on machine learning (pp. 55–63). Morgan Kaufmann.
 Blockeel, H., & Struyf, J. (2002). Efficient algorithms for decision tree crossvalidation. Journal of Machine Learning Research, 3, 621–650.
 Borchani, H., Varando, G., Bielza, C., & Larrañaga, P. (2015). A survey on multioutput regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5), 216–233.
 Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
 Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
 Breiman, L., & Friedman, J. (1997). Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1), 3–54.
 Debeljak, M., Kocev, D., Towers, W., Jones, M., Griffiths, B., & Hallett, P. (2009). Potential of multiobjective models for riskbased mapping of the resilience characteristics of soils: Demonstration at a national level. Soil Use and Management, 25(1), 66–77.
 Deger, F., Mansouri, A., Pedersen, M., Hardeberg, J. Y., & Voisin, Y. (2012). Multi- and single-output support vector regression for spectral reflectance recovery. In 2012 eighth international conference on signal image technology and internet based systems (SITIS) (pp. 805–810). IEEE.
 Demšar, D., Džeroski, S., Larsen, T., Struyf, J., Axelsen, J., BrunsPedersen, M., et al. (2006). Using multiobjective classification to model communities of soil. Ecological Modelling, 191(1), 131–143.
 Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
 Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52–64.
 Džeroski, S., Demšar, D., & Grbović, J. (2000). Predicting chemical parameters of river water quality from bioindicator data. Applied Intelligence, 13(1), 7–17.
 Džeroski, S., Kobler, A., Gjorgjioski, V., & Panov, P. (2006). Using decision trees to predict forest stand height and canopy cover from LANSAT and LIDAR data. In Managing environmental knowledge: EnviroInfo 2006: Proceedings of the 20th international conference on informatics for environmental protection (pp. 125–133). Aachen: Shaker Verlag.
 Džeroski, S. (2007). Towards a general framework for data mining (pp. 259–300). Berlin: Springer. https://doi.org/10.1007/9783540755494_16.
 Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11, 86–92.
 Gamberger, D., Ženko, B., Mitelpunkt, A., Shachar, N., & Lavrač, N. (2016). Clusters of male and female Alzheimer's disease patients in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Brain Informatics, 3(3), 169–179.
 Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
 Gjorgjioski, V., Džeroski, S., & White, M. (2008). Clustering analysis of vegetation data. Technical report 10065, Jožef Stefan Institute.
 Han, Z., Liu, Y., Zhao, J., & Wang, W. (2012). Real time prediction for converter gas tank levels based on multioutput least square support vector regressor. Control Engineering Practice, 20(12), 1400–1409.
 Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Incremental multitarget model trees for data streams. In Proceedings of the 2011 ACM symposium on applied computing (pp. 988–993). ACM.
 Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics: Theory and Methods, 9(6), 571–595.
 Izenman, A. J. (1975). Reducedrank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2), 248–264.
 Jančič, S., Frisvad, J. C., Kocev, D., Gostinčar, C., Džeroski, S., & GundeCimerman, N. (2016). Production of secondary metabolites in extreme environments: Food and airborne Wallemia spp. produce toxic metabolites at hypersaline conditions. PLoS ONE, 11(12), e0169116.
 Joly, A. (2017). Exploiting random projections and sparsity with random forests and gradient boosting methods—Application to multilabel and multioutput learning, random forest model compression and leveraging input sparsity. arXiv preprint arXiv:1704.08067.
 Joly, A., Geurts, P., & Wehenkel, L. (2014). Random forests with random projections of the output space for high dimensional multilabel classification. In Joint European conference on machine learning and knowledge discovery in databases (pp. 607–622). Springer.
 Kaggle. (2008). Kaggle competition: Online product sales. https://www.kaggle.com/c/onlinesales/data. Accessed July 19, 2017.
 Kocev, D. (2011). Ensembles for predicting structured outputs. Ph.D. thesis, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia.
 Kocev, D., & Ceci, M. (2015). Ensembles of extremely randomized trees for multitarget regression. In Discovery science: 18th international conference (DS 2015), LNCS (Vol. 9356, pp. 86–100).
 Kocev, D., Džeroski, S., White, M., Newell, G., & Griffioen, P. (2009). Using single and multitarget regression trees and ensembles to model a compound index of vegetation condition. Ecological Modelling, 220(8), 1159–1168.
 Kocev, D., Naumoski, A., Mitreski, K., Krstić, S., & Džeroski, S. (2010). Learning habitat models for the diatom community in Lake Prespa. Ecological Modelling, 221(2), 330–337.
 Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multiobjective decision trees. In ECML '07: Proceedings of the 18th European conference on machine learning—LNCS (Vol. 4701, pp. 624–631). Springer.
 Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3), 817–833.
 Kriegel, H. P., Borgwardt, K., Kröger, P., Pryakhin, A., Schubert, M., & Zimek, A. (2007). Future trends in data mining. Data Mining and Knowledge Discovery, 15, 87–97.
 Levatić, J., Ceci, M., Kocev, D., & Džeroski, S. (2014). Semisupervised learning for multitarget regression. In International workshop on new frontiers in mining complex patterns (pp. 3–18). Springer.
 Madjarov, G., Gjorgjevikj, D., Dimitrovski, I., & Džeroski, S. (2016). The use of dataderived label hierarchies in multilabel classification. Journal of Intelligent Information Systems, 47(1), 57–90.
 Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., et al. (2011). The Parkinson Progression Marker Initiative (PPMI). Progress in Neurobiology, 95(4), 629–635.
 Micchelli, C. A., & Pontil, M. (2004). Kernels for multitask learning. In Advances in neural information processing systems 17—Proceedings of the 2004 conference (pp. 921–928).
 Nemenyi, P. B. (1963). Distributionfree multiple comparisons. Ph.D. thesis, Princeton University, Princeton, NJ, USA.
 Panov, P., Soldatova, L. N., & Džeroski, S. (2016). Generic ontology of datatypes. Information Sciences, 329, 900–920.
 Slavkov, I., Gjorgjioski, V., Struyf, J., & Džeroski, S. (2010). Finding explained groups of timecourse gene expression profiles with predictive clustering trees. Molecular BioSystems, 6(4), 729–740.
 SpyromitrosXioufis, E., Tsoumakas, G., Groves, W., & Vlahavas, I. (2016). Multitarget regression via input space expansion: Treating targets as inputs. Machine Learning, 104(1), 55–98.
 Stojanova, D., Ceci, M., Appice, A., & Džeroski, S. (2012). Network regression with predictive clustering trees. In Data mining and knowledge discovery (pp. 1–36).
 Stojanova, D., Panov, P., Gjorgjioski, V., Kobler, A., & Džeroski, S. (2010). Estimating vegetation height and canopy cover from remotely sensed data with machine learning. Ecological Informatics, 5(4), 256–266.CrossRefGoogle Scholar
 Struyf, J., & Džeroski, S. (2006). Constraint based induction of multiobjective regression trees. In Proceedings of the 4th international workshop on knowledge discovery in inductive databases KDID—LNCS (Vol. 3933, pp. 222–233). Springer.Google Scholar
 Szymański, P., Kajdanowicz, T., & Kersting, K. (2016). How is a datadriven approach better than random choice in label space division for multilabel classification? Entropy, 18(8), 282.CrossRefGoogle Scholar
 Tsoumakas, G., SpyromitrosXioufis, E., Vrekou, A., & Vlahavas, I. (2014). Multitarget regression via random linear target combinations. In Machine learning and knowledge discovery in databases: ECMLPKDD 2014, LNCS (Vol. 8726, pp. 225–240).Google Scholar
 Tsoumakas, G., & Vlahavas, I. (2007). Random klabelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European conference on machine learning (pp. 406–417).Google Scholar
 Vens, C., Struyf, J., Schietgat, L., Džeroski, S., & Blockeel, H. (2008). Decision trees for hierarchical multilabel classification. Machine Learning, 73(2), 185–214.CrossRefGoogle Scholar
 Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Los Altos: Morgan Kaufmann.zbMATHGoogle Scholar
 Xu, S., An, X., Qiao, X., Zhu, L., & Li, L. (2013). Multioutput leastsquares support vector regression machines. Pattern Recognition Letters, 34(9), 1078–1084.CrossRefGoogle Scholar
 Yang, Q., & Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(4), 597–604.CrossRefGoogle Scholar
 Ženko, B. (2007). Learning predictive clustering rules. Ph.D. thesis, Faculty of Computer Science, University of Ljubljana, Ljubljana, Slovenia.Google Scholar
 Zhang, W., Liu, X., Ding, Y., & Shi, D. (2012). Multioutput LSSVR machine in extended feature space. In 2012 IEEE international conference on computational intelligence for measurement systems and applications (CIMSA) (pp. 130–134). IEEE.Google Scholar