Data complexity metafeatures for regression problems
Abstract
In metalearning, classification problems can be described by a variety of features, including complexity measures. These measures capture the complexity of the frontier that separates the classes. For regression problems, on the other hand, such measures are lacking. This paper presents and analyses measures devoted to estimating the complexity of the function that should be fitted to the data in regression problems. As case studies, they are employed as metafeatures in three metalearning setups: (i) the first one predicts the regression function type of some synthetic datasets; (ii) the second one is designed to tune the parameter values of support vector regressors; and (iii) the third one aims to predict the performance of various regressors for a given dataset. The results show the suitability of the new measures to describe the regression datasets and their utility in the metalearning tasks considered. In cases (ii) and (iii) the achieved results are also similar to or better than those obtained by the use of classical metafeatures in metalearning.
Keywords
Metalearning, Metafeatures, Complexity measures
1 Introduction
The design of a learning system is a challenging task due to a variety of important issues that have to be addressed, such as algorithm selection, optimization of hyperparameters and choice of data processing techniques, among others. Such issues have been dealt with using different approaches, ranging from optimization metaheuristics to supervised learning techniques (Thornton et al. 2013; Pappa et al. 2014). Many of these approaches have been referred to in the literature as MetaLearning (MTL) (Brazdil et al. 2008). In this paper, we focus on an MTL setup which relates a set of descriptive metafeatures of problems to the performance of one or more algorithms on their solution (or alternatively, hyperparameter configurations, data processing techniques, among others). In this setup, knowledge is extracted from metadatasets built from experiments performed with different algorithms on previous problems. These algorithms can then be selected or ranked for new problems based on their metafeatures. Additionally, MTL approaches have been adopted in the literature as components of more complex hybrid methods in the design of learning systems (Wistuba et al. 2016; Leite et al. 2012; de Miranda et al. 2014; Gomes et al. 2012).
An important aspect in MTL is the definition of the metafeatures used to describe the problem at hand. There is a large selection of metafeatures adopted in the MTL literature to describe learning problems. They can be roughly classified as: (i) simple, statistical and information-theoretic measures, which are extracted from the learning datasets; (ii) landmarking measures, which are descriptors extracted from simple learning models when applied to the dataset; and (iii) model-based features, such as the size of an induced decision tree (Soares 2008). Alternatively, in the case of classification problems, some previous works have demonstrated the value of employing data complexity measures as metafeatures (Cavalcanti et al. 2012; Leyva et al. 2015; Garcia et al. 2016; Morán-Fernández et al. 2017). They capture the complexity of the decision boundary that discriminates the classes (Ho and Basu 2002). Examples include measuring the degree of linearity of the problem and the volume of feature overlap across different classes.
In Cavalcanti et al. (2012), for instance, the authors state that the behavior of nearest neighbor classifiers is highly affected by data complexity. They experimentally show that a combination of complexity measures can be used to predict the expected accuracy of this classification technique. In Leyva et al. (2015), the authors showed the effectiveness of the data complexity measures in recommending algorithms for instance selection. In Garcia et al. (2015), in turn, the authors discussed how label noise affects the complexity of classification problems and showed that many of the data complexity measures are able to characterize the presence of noise in a classification dataset. More recently, Morán-Fernández et al. (2017) discussed how the values of the complexity measures can be employed to predict the expected classification performance of different classifiers commonly employed in microarray data analysis.
Metafeatures have also been proposed for the characterization of regression problems. Most of them are similar to the metafeatures proposed for classification problems (Kuba et al. 2002). However, until recently no work had proposed measures to estimate the complexity of a regression problem. In Maciel et al. (2016), we introduced complexity measures for regression problems. These novel measures cover concepts similar to those from the original work of Ho and Basu (2002), which were transferred to the regression scenario. They quantify aspects such as the linearity of the data, the presence of informative features and the smoothness of the target attribute. In Maciel et al. (2016), the proposed measures were also evaluated in a simple case study regarding their ability to distinguish easy from complex problems using simulated data. Several of the measures were able to separate the simpler problems from problems of medium to high complexity.
In this paper we extend our previous work in various aspects. Firstly, we expand the experimental evaluation with synthetic datasets. For this, we use distinct functions to generate datasets of varying complexities, which include polynomials of increasing degrees and trigonometric functions. The ranges of values of the complexity measures for the different functions are then analyzed and compared. We also use the complexity measures as metafeatures for inducing classifiers able to predict the regression function type of the designed datasets. The results show that many of the measures, as well as their combination, are able to distinguish the different types of problems according to their complexity degree. Nonetheless, as noise is introduced into the datasets, this discrimination power decreases.
We also show the practical use of the measures in real datasets by employing them as metafeatures in two MTL tasks. The first task is to recommend the parameter values for the Support Vector Regression (SVR) (Basak et al. 2007) technique in new regression problems. The second MTL setup is designed to predict the expected Normalized Mean Squared Error (NMSE) of particular regressors for new regression problems. In both cases, our complexity measures achieved results superior or comparable to those of other metafeatures commonly employed in MTL. The achieved results demonstrate the suitability of our complexity measures as generic dataset descriptors in MTL.
Finally, we include a detailed description and an efficient implementation of the complexity measures. This includes illustrative examples of how the measures operate, their algorithms and their asymptotic computational complexities. We also propose computationally efficient alternatives for some of the measures. R code for all proposed complexity measures is made available at https://github.com/aclorena/ComplexityRegression.git.
This paper is structured as follows: Sect. 2 defines MTL and discusses the use of complexity measures as metafeatures. Section 3 presents our complexity measures for regression problems. Section 4 evaluates the ability of these measures in distinguishing the complexity of some synthetic regression problems. Section 5 presents the experimental results achieved in the recommendation of SVR parameter values. Section 6 presents the MTL study to predict the NMSE of various regressors. Section 7 concludes this paper.
2 Metalearning and data complexity
MetaLearning (MTL) aims to relate features of learning problems to the performance of learning algorithms. MTL receives as input a set of metaexamples built from a set of learning problems and a pool of candidate algorithms. Usually each metaexample stores the features describing a problem (the metafeatures) and a target attribute indicating the best candidate algorithm once evaluated for that problem. A metamodel (or metalearner) is then learned to predict the best algorithm for new problems based on their metafeatures. There are different variations of this basic metalearning task, such as ranking the candidate algorithms, predicting the performance measure of a single algorithm (a metaregression problem) and recommending values for the hyperparameters of an algorithm based on the problem features (see Brazdil et al. 2008 for a deeper view of metalearning).
A common issue in previous work on metalearning is the need to define a (good) set of metafeatures. These features have to be informative and discriminative enough to identify different aspects of the learning problems that can bias algorithm performance. Since MTL is problem dependent, researchers have tried to adopt sets of metafeatures that are at the same time suitable and informative for each application. In many cases, there is no clear justification for the metafeatures adopted in each work, which makes it difficult to define which features are actually relevant and useful to be adopted in new applications. A possible reason for this is the lack of systematic studies of the specific aspects that can be measured in a learning problem and of the corresponding metafeatures that are most informative.
One of these aspects is the difficulty of a learning problem, which can be expressed in terms of complexity measures of a dataset. Ho and Basu (2002) have proposed a set of measures of classification complexity which have been used as metafeatures in various recent works (Cavalcanti et al. 2012; Smith et al. 2014; Leyva et al. 2015; Garcia et al. 2016; Morán-Fernández et al. 2017). These measures can be roughly divided into three main categories: (i) measures of overlapping of feature values; (ii) measures of the separability of the classes; and (iii) geometry, topology and density measures.
The feature overlapping group contains measures that quantify whether the dataset contains at least one feature that is able to fully separate the examples into their classes. If one such feature is present, the problem can be considered simple according to this perspective. In Orriols-Puig et al. (2010) the collective effect of more features in separating the data is also taken into account.
The measures from group (ii) quantify how the classes are separated. One approach is to verify whether the classes are linearly separable. Linearly separable problems are considered simpler than problems where nonlinear boundaries are required. Other measures try to estimate the length of the decision boundary and the overlapping of the classes.
Group (iii) contains measures that capture the inner structure of the classes. The idea is that if examples of the same class are densely distributed, the problem can be considered simpler than when the examples from the same class are sparsely distributed or occupy many manifolds.
Most MTL approaches are in principle independent of the base-learning task of interest (e.g., classification or regression). However, most of the experimental studies were based on classification problems. As a consequence, there is a lack of metafeatures proposed specifically for describing regression problems.
A few previous articles that employed metalearning in regression problems adopted metafeatures that resemble complexity measures of the target attribute. However, none of these works explicitly addressed regression complexity. Examples of such metafeatures are classical statistics for continuous attributes (e.g., coefficient of variation, skewness, kurtosis, etc.), which measure symmetry and dispersion of the target attribute in a dataset (Kuba et al. 2002; Soares et al. 2004; Amasyali and Erson 2009; Gomes et al. 2012; Loterman and Mues 2012). In Soares and Brazdil (2006), the authors proposed the use of Kernel Target Alignment (KTA) (Cristianini et al. 2002) as a specific metafeature to recommend hyperparameters for support vector regressors. KTA measures the degree of agreement between a kernel and the target attribute in a problem. This metafeature was also adopted in Gomes et al. (2012). Metafeatures based on the landmarking approach were adopted in Kuba et al. (2002), Soares et al. (2004) and Amasyali and Erson (2009), in which simple regression algorithms such as linear regression and decision stumps were adopted as landmarkers. As described next, our work contributes the delineation and formalization of complexity measures as metafeatures for regression problems. The presented set of complexity measures considers different perspectives of data complexity that were not dealt with in any previous work concerning regression problems.
3 Complexity of regression problems
The complexity of a regression problem can be attributed to various factors, some of them similar to those of classification problems. For instance, a regression problem can be complex if it is ill-defined and/or described by features that are not informative enough. Data sparsity can also make a problem that is originally complex deceptively simple, or vice versa.
The distribution of the target attribute can also indicate whether the regression problem is simple. If the outputs of neighboring points are distributed smoothly, simpler functions can usually be fitted to them than if they are spread across the space.
Finally, the complexity of the objective function which relates the inputs to the target attribute is intrinsic to the problem. Our complexity measures try to quantify the aforementioned factors.
The previous interpretations are focused on a concept of intrinsic or absolute complexity, but the complexity of a problem can also be regarded as relative. For instance, whilst approximating nonlinear functions can be considered simple for properly tuned multilayer Neural Networks, the same does not hold for a naïve Adaline linear regressor. By relating the values of the proposed complexity measures to the performance achieved by some specific learning technique, as done in MTL, we are also able to assess the relative complexity of a problem.

The proposed measures are grouped into the following categories:
 Feature correlation measures: capture the relationship of the feature values with the outputs;
 Linearity measures: estimate whether a linear function can fit the data, that is, whether they are linearly distributed;
 Smoothness measures: estimate the smoothness of the function that must be fitted to the data;
 Geometry, topology and density measures: capture the spatial distribution and the structure of the data.
3.1 Feature correlation measures
This category contains measures that consider the correlation of the values of the features to the outputs. If at least one feature is highly correlated to the output, this indicates that simpler functions can be fitted to the data. Most of the measures from this category are univariate and analyze each feature individually, disregarding their potential correlation or joint effect. This is justified by the fact that we are concerned with simple measures that give an indication of the problem complexity. Often a combination of several measures will be required for a proper problem characterization.
3.1.1 Maximum feature correlation to the output (\(C_1\))
Higher values of \(C_1\) indicate simpler problems, where there is at least one feature strongly correlated with the output. For Dataset1, the \(C_1\) value is 1, since feature \(x_1\) has maximum correlation to the target attribute. For Dataset2, the \(C_1\) value is 0.03, indicating no relationship of any of the predictive features to the target attribute.
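As an illustration, the following minimal Python sketch computes a maximum and an average feature correlation to the output. It assumes that the correlation in question is the absolute Spearman coefficient between each feature and the target, which is our reading of the measure names and of the reference to Spearman correlation in the description of \(C_3\); the function names are hypothetical and this is not the released R implementation.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0 or vb == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))

def feature_correlations(X, y):
    """Absolute Spearman correlation of every feature column of X with y."""
    d = len(X[0])
    return [abs(spearman([row[j] for row in X], y)) for j in range(d)]

def c1(X, y):
    """Assumed C1: maximum absolute correlation over the features."""
    return max(feature_correlations(X, y))

def c2(X, y):
    """Assumed C2: average absolute correlation over the features."""
    cors = feature_correlations(X, y)
    return sum(cors) / len(cors)
```

Under these assumptions, a dataset in which one feature is a monotone function of the output yields a maximum correlation of 1, in line with the value reported for Dataset1.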
3.1.2 Average feature correlation to the output (\(C_2\))
3.1.3 Individual feature efficiency (\(C_3\))
A naïve implementation of \(C_3\) would have a cubic complexity in the number of samples. We therefore propose an efficient implementation of \(C_3\) which takes advantage of the fact that the Spearman correlation works on differences of rankings. Instead of re-estimating the rankings after the removal of each observation, we simply update the differences of the rankings. The proposed algorithm has a worst-case time complexity of \(O(d\cdot n^2)\) and is shown in the Appendix.
Lower values of \(C_3\) indicate simpler problems. If there is at least one feature with correlation higher than 0.9 to the output, all examples are preserved in the dataset and the \(C_3\) value is the minimum (null). This is the case of Dataset1, where feature \(x_1\) has a maximum correlation of 1 to the output. In the case of Dataset2, on the other hand, 601 of the 1000 initial examples have to be removed so that feature \(x_2\) shows a correlation higher than 0.9 to the target attribute, and 604 examples must be removed in the case of feature \(x_3\). Therefore, the \(C_3\) value for Dataset2 is 0.601, indicating a higher complexity (the more examples that need to be removed to achieve a high correlation, the more complex the problem).
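The naïve variant can be sketched in Python as follows. This is only an illustration under stated assumptions: the rule for choosing which example to remove is not given in this section, so the sketch greedily removes the example with the largest difference between its feature rank and its output rank, re-ranks after every removal, and breaks ties by position. All function names are hypothetical, and the efficient algorithm from the Appendix is not reproduced here.

```python
import math

def ranks(values):
    """Rank positions (1-based); ties are broken by original order for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def correlation(a, b):
    """Pearson correlation; applied to ranks it gives the Spearman coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0 or vb == 0:
        return 0.0
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(va * vb)

def removed_fraction(x, y, threshold=0.9):
    """Fraction of examples removed until feature x reaches the threshold."""
    x, y = list(x), list(y)
    n = len(x)
    # Orient the feature so that its association with y is non-negative.
    if correlation(ranks(x), ranks(y)) < 0:
        x = [-v for v in x]
    while len(x) > 1:
        rx, ry = ranks(x), ranks(y)  # naive step: re-rank after every removal
        if correlation(rx, ry) > threshold:
            break
        # Assumed removal rule: drop the example with the largest rank difference.
        worst = max(range(len(x)), key=lambda i: abs(rx[i] - ry[i]))
        del x[worst], y[worst]
    return (n - len(x)) / n

def c3(X, y, threshold=0.9):
    """Minimum removed fraction over all feature columns of X."""
    d = len(X[0])
    return min(removed_fraction([row[j] for row in X], y, threshold)
               for j in range(d))
```

Taking the minimum over the features mirrors the Dataset2 example, where the smaller of the two removal counts (601 of 1000) determines the reported value of 0.601.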
3.1.4 Collective feature efficiency (\(C_4\))
3.2 Linearity measures
These measures capture whether a linear function provides a good fit to the problem. If this is the case, the problem can be considered simpler than one in which a nonlinear function is required. Both of the measures from this section employ the residuals of a multiple linear regressor. Concerning computational complexity, both measures here show an asymptotic complexity of \(\text{ O }(n \cdot d^2)\), which comes from the linear regression estimation.
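The following Python sketch illustrates how both linearity measures can be derived from the residuals of a multiple linear regressor. It is a simplified illustration: the least-squares fit is obtained from the normal equations with a small Gauss-Jordan solver, the names are hypothetical, and we assume that \(L_2\) is the mean of the squared residuals (which equals their variance when an intercept is fitted). The features are assumed not to be collinear.

```python
def fit_linear(X, y):
    """Ordinary least squares; returns [intercept, w_1, ..., w_d]."""
    A = [[1.0] + list(row) for row in X]  # design matrix with intercept column
    p = len(A[0])
    # Augmented normal-equation system (A^T A | A^T y): O(n * d^2) to build.
    M = [[sum(a[i] * a[j] for a in A) for j in range(p)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]

def residuals(X, y):
    """Differences between observed outputs and the fitted linear model."""
    w = fit_linear(X, y)
    return [t - (w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))
            for row, t in zip(X, y)]

def l1(X, y):
    """Mean absolute residual of the linear fit."""
    res = residuals(X, y)
    return sum(abs(e) for e in res) / len(res)

def l2(X, y):
    """Mean squared residual of the linear fit (assumed form of L2)."""
    res = residuals(X, y)
    return sum(e * e for e in res) / len(res)
```

For a perfectly linear target both values are zero up to rounding, whereas a quadratic target fitted by a line leaves non-zero residuals.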
3.2.1 Mean absolute error (\(L_1\))
3.2.2 Residuals variance (\(L_2\))
3.3 Smoothness measures
In regression problems, the smoother the function to be fitted to the data, the simpler the problem. Larger variations in the inputs and/or outputs, on the other hand, usually indicate the existence of more intricate relationships between them. The measures from this category quantify this aspect and also capture the shape of the curve to be fitted to the data.
3.3.1 Output distribution (\(S_1\))
To build the graph from the data it is necessary to compute the distance matrix between all pairs of elements, which requires \(O(n^2\cdot d)\) operations. Afterwards, using Prim's algorithm to compute the Minimum Spanning Tree (MST) requires \(O(n^2)\) operations. Therefore, the asymptotic complexity of \(S_1\) is \(O(n^2\cdot d)\).
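The cost analysis above can be made concrete with a short Python sketch of Prim's algorithm on the complete Euclidean graph of the inputs. The final aggregation is an assumption on our part: we take the measure to be the average absolute difference between the outputs of examples joined by an MST edge. Function names are hypothetical, and at least two examples are required.

```python
import math

def mst_edges(X):
    """Prim's algorithm over the complete Euclidean graph of the rows of X.

    Distances are computed on the fly, so the total cost is O(n^2 * d).
    """
    n = len(X)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet part of the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax the connection costs of the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(X[u], X[v])
                if dist < best[v]:
                    best[v], parent[v] = dist, u
    return edges

def s1(X, y):
    """Assumed S1: mean absolute output difference along the MST edges."""
    edges = mst_edges(X)
    return sum(abs(y[i] - y[j]) for i, j in edges) / len(edges)
```

With normalized data, neighboring inputs that have similar outputs produce small values, which corresponds to a smoother and hence simpler problem.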
3.3.2 Input distribution (\(S_2\))
3.3.3 Error of a nearest neighbor regressor (\(S_3\))
3.4 Geometry, topology and density measures
These measures capture the distribution and density of the examples in the input/output space.
3.4.1 Nonlinearity of a linear regressor (\(L_3\))
\(L_3\) is based on the Nonlinearity of a Linear Classifier (\(L_3\)) measure from Ho and Basu (2002), with adaptations to the regression scenario. For a given dataset, it first selects pairs of examples with similar outputs and creates a new test point by randomly interpolating them. Here both input and output features are interpolated, while in Ho and Basu (2002) random examples of the same class have their input features interpolated.
3.4.2 Nonlinearity of nearest neighbor regressor (\(S_4\))
3.4.3 Average number of examples per dimension (\(T_2\))
3.5 Summary of the measures
Table 1 Summary of the measures
Category  Acronym  Min.  Max.  Asymptotic cost  Complexity
Feature correlation  \(C_1\)  0  1  \(O(d\cdot n\cdot \log n)\)  \(\downarrow \)
Feature correlation  \(C_2\)  0  1  \(O(d\cdot n\cdot \log n)\)  \(\downarrow \)
Feature correlation  \(C_3\)  0  1  \(O(d\cdot n^2)\)  \(\uparrow \)
Feature correlation  \(C_4\)  0  1  \(O(d \cdot (d+ n\cdot \log n))\)  \(\uparrow \)
Linearity  \(L_1\)  0  -  \(O(n\cdot d^2)\)  \(\uparrow \)
Linearity  \(L_2\)  0  -  \(O(n\cdot d^2)\)  \(\uparrow \)
Smoothness  \(S_1\)  0  -  \(O(d\cdot n^2)\)  \(\uparrow \)
Smoothness  \(S_2\)  0  -  \(O(n\cdot (d+\log n))\)  \(\uparrow \)
Smoothness  \(S_3\)  0  -  \(O(d\cdot n^2)\)  \(\uparrow \)
Geometry, topology and density  \(L_3\)  0  -  \(O(n\cdot (d^2+\log n))\)  \(\uparrow \)
Geometry, topology and density  \(S_4\)  0  -  \(O(n\cdot d\cdot \log n)\)  \(\uparrow \)
Geometry, topology and density  \(T_2\)  \(\approx 0\)  -  \(O(n+d)\)  \(\downarrow \)
In the Complexity column, \(\uparrow \) marks measures whose higher values indicate more complex problems and \(\downarrow \) marks measures whose higher values indicate simpler problems. A "-" in the Max. column denotes an unbounded measure.
As can be observed in Table 1, the majority of the measures are unbounded. Data should be normalized to the [0, 1] interval before the application of these measures (except for \(T_2\)) so that upper values close to 1 can be obtained. Normalization is required for all measures that compute some distance between examples, such as \(S_1\) to \(S_4\). Regarding the worst-case asymptotic computational cost, all measures can be computed in polynomial time in the number of features and examples.
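A minimal Python sketch of this normalization step is given below, assuming a standard min-max rescaling of each feature and of the output. The handling of constant columns (mapped to 0) and the function names are our own choices for illustration.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly to the [0, 1] interval."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # assumption: constant columns map to 0
    return [(v - lo) / (hi - lo) for v in column]

def normalize_dataset(X, y):
    """Apply min-max scaling to every feature column of X and to y."""
    d = len(X[0])
    cols = [min_max_scale([row[j] for row in X]) for j in range(d)]
    Xn = [list(row) for row in zip(*cols)]
    return Xn, min_max_scale(y)
```

After this step, all pairwise distances between examples are computed on features that share the same scale, so no single feature dominates the distance-based measures.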
4 Experiments with synthetic datasets
For evaluating the ability of the previous measures in quantifying the complexity of regression problems, we first performed experiments on a set of synthetic datasets. The idea is to verify how the measures behave with regression functions of increasing complexity.
4.1 Datasets
 Polynomials with varying degrees, as given by Eq. 14. The \(\beta _{ij}\) values are randomly chosen in the [0, 1] interval, but remain the same for all n data points that are generated for a given dataset. The polynomial degree (p) values were varied as 1, 3 and 5. When \(p = 1\), we have a perfect linear relationship between the input variables and the output values. The higher the degree, the higher the complexity of the regression problem, as illustrated in Fig. 8.
$$\begin{aligned} f(\mathbf {x})=&\beta _0 + \beta _{11}(x^1)+ \cdots + \beta _{1p}(x^1)^p+ \\&\beta _{21} (x^2) + \cdots +\beta _{2p}(x^2)^p + \\&\qquad \qquad \quad \vdots \\&\beta _{d1}(x^d) + \cdots + \beta _{dp} (x^d)^p \end{aligned} \tag{14}$$
 Sine waves with varying frequencies, as expressed by Eq. 15. In this equation, a is the amplitude of the sinusoid, which we set as 1, b is the frequency and \(\psi \) is the phase of the curve, which is set randomly in the \([0,\pi ]\) interval. The b values are varied as 1 and 3. The sine wave has more cycles for higher b values, making the regression function more complex. Examples of the generated datasets are shown in Fig. 8.
$$\begin{aligned} f(\mathbf {x})=&a\sin (2 \pi b x^1 + \psi _1)+ \\&a\sin (2 \pi b x^2 + \psi _2) + \\&\vdots \\&a\sin (2 \pi b x^d + \psi _d) \end{aligned} \tag{15}$$
We also generated noisy versions of all synthetic datasets by introducing residuals or errors to the outputs: a residual \(\varepsilon _i\) is added to the output of each example. The residuals are assumed to follow the \(N(0,\sigma )\) distribution. The higher the \(\sigma \), the larger the residuals and the higher the underlying complexity of the dataset. Three \(\sigma \) values were tested: 0 (no noise), 0.5 and 1.
For each d value, function type and \(\sigma \) variation, 50 datasets were produced. We have, therefore, a total of 3000 synthetic datasets. For all datasets, the outputs are normalized in the [0, 1] interval after generation.
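The generation procedure can be sketched in Python as follows. Several details are assumptions made only for illustration: inputs are drawn uniformly from [0, 1], \(\beta _0\) is drawn from the same interval as the other coefficients, \(\sigma \) is treated as a standard deviation, and the function names and the number of examples n are hypothetical.

```python
import math
import random

def polynomial(x, beta0, beta):
    """Eq. 14: beta[j][k] multiplies the (k+1)-th power of feature j."""
    return beta0 + sum(b * xj ** (k + 1)
                       for xj, bj in zip(x, beta)
                       for k, b in enumerate(bj))

def sine(x, b, psi, a=1.0):
    """Eq. 15: sum of one sinusoid per input feature."""
    return sum(a * math.sin(2 * math.pi * b * xj + p) for xj, p in zip(x, psi))

def make_dataset(kind, d, n, sigma, rng, p=1, b=1):
    """Generate one synthetic dataset with outputs normalized to [0, 1]."""
    # Assumption: inputs are sampled uniformly in [0, 1].
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    if kind == "pol":
        beta0 = rng.random()  # assumption: beta_0 also drawn from [0, 1]
        beta = [[rng.random() for _ in range(p)] for _ in range(d)]
        y = [polynomial(x, beta0, beta) for x in X]
    else:
        psi = [rng.uniform(0, math.pi) for _ in range(d)]
        y = [sine(x, b, psi) for x in X]
    # Additive Gaussian noise on the outputs (sigma = 0 leaves them unchanged).
    y = [v + rng.gauss(0, sigma) for v in y]
    lo, hi = min(y), max(y)
    span = (hi - lo) or 1.0  # guard against a constant output
    return X, [(v - lo) / span for v in y]

# Five function types (pol1, pol3, pol5, sin1, sin3), four dimensions,
# three noise levels and 50 repetitions give the 3000 datasets.
N_DATASETS = 5 * 4 * 3 * 50
```

For example, `make_dataset("pol", d=2, n=100, sigma=0.5, rng=random.Random(0), p=3)` would produce one noisy pol3 dataset with two input features.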
4.2 Individual values of the measures
According to the results shown in Fig. 9, when no noise is introduced to the datasets (boxplots shown in red), almost all measures show variations in their values according to the complexity of the functions. In general, the correlation measures were more effective in separating the types of functions. Taking \(C_3\), for instance, we can see that its values increase for more complex functions, which is in accordance with the expected behavior of this measure (easier problems will show lower \(C_3\) values).
In some cases the values of the measures for the polynomials of higher degrees are similar to those of the simpler sinusoid or indicate a superior complexity of these functions (e.g., \(C_1\), \(C_2\), \(S_1\), \(S_3\)). Since they are functions of different categories, they may be comparably simpler or more complex depending on the aspect to be measured.
In many cases the median values of the measures are similar for the polynomials of degrees 3 and 5 (e.g., \(C_4\), \(L_1, L_2, L_3, S_4\)). There are also cases where the values for these functions overlap with those of the sin1 sinusoid. Nonetheless, in all cases the simpler problems (linear) can be easily distinguished from the most complex function (sinusoid with frequency of 3). This is less evident for the \(S_2\) measure, for which the boxplots overlap more.
Table 2 Statistical test results: pairwise comparison of functions according to the values of each complexity measure (Color table online)
Rows: the function pairs pol1–pol3, pol1–pol5, pol1–sin1, pol1–sin3, pol3–pol5, pol3–sin1, pol3–sin3, pol5–sin1, pol5–sin3 and sin1–sin3, each at \(\sigma \) = 0.0, 0.5 and 1.0. Columns: \(C_1\), \(C_2\), \(C_3\), \(C_4\), \(L_1\), \(L_2\), \(L_3\), \(S_1\), \(S_2\), \(S_3\), \(S_4\). The color-coded cell entries are available in the online version only.
It is also possible to notice in Table 2 that while many of the measures are able to separate the pairs of functions in the noiseless scenario (\(\sigma =0\)), all measures are impaired when noise is added to the outputs. The effect of noise was more harmful in the case of measures \(C_4, L_1, L_2, L_3, S_1, S_3\) and \(S_4\). In these cases, the values of the measures overlap for all types of functions. For instance, \(S_1\) builds a Minimum Spanning Tree from input data, whose structure can change considerably in the face of small data changes, such as those introduced by noise. Measures relying on the residuals of a regression function, such as \(C_4, L_1, L_2\) and \(L_3\), are naturally affected by the noise variation procedure adopted in the generation of the datasets. On the other hand, the correlation measures \(C_1\), \(C_2\) and \(C_3\) can be regarded as quite robust in differentiating most of the pairs of functions, regardless of the noise level.
It is also possible to notice in Fig. 9 that some measures show a high variance, as indicated by their longer boxplots. This happens especially for some specific combinations of measures and function types, such as the correlation measures with function pol1 and the linearity and smoothness measures with function sin3. This variability is due to the impact of the data dimension on the results of these measures, since each boxplot includes the results for datasets with distinct dimensions. Appendix Fig. 14a shows the boxplots of the \(C_1\) values for the pol1 function, separated by the dimensions of the datasets (1, 2, 5 and 10). The same type of boxplot is presented in Appendix Fig. 14b for the sin3 function and the \(S_3\) values. Data dimension especially affects the correlation-based measures. For instance, for a one-dimensional dataset with no output noise the only feature will show maximum correlation to the output. When more dimensions are added, the influence of each feature is reduced (Appendix Fig. 14).
Another noticeable result is the fact that all measures relying on a linear regressor (\(C_4, L_1, L_2\) and \(L_3\)) indicate that sin1 and sin3 are more complex without noise than with noise. This also happens for \(S_4\) and function sin3. Whilst it is clear that these regression problems are nonlinear, this is a case where these individual measures fail to assess the complexity introduced by noise.
Despite the previous observation, overall the results obtained for the synthetic datasets evidence the ability of the proposed measures to reveal the complexity of the regression problems generated. Nevertheless, we expect that a combination of the measures will better characterize the complexity of a problem, since they look at different perspectives of data complexity. This aspect is analyzed next.
4.3 Combination of the measures: prediction of function type
To evaluate the combined effect of the proposed measures, we performed a simple MTL experiment for distinguishing the synthetic datasets according to their complexity level. The objective is to predict the regression function that should be fitted to the data. For each noise level (\(\sigma = 0, 0.5\) and 1), a metadataset is built. Each synthetic dataset corresponds to an input example and is described by the values of the complexity measures. These examples are labeled according to their underlying regression function, namely pol1, pol3, pol5, sin1 or sin3. Therefore, we have three metadatasets with 12 input features (\(T_2\) was included in these experiments), 1000 examples and five classes (with 200 examples per class/function type). Since the classes are balanced, the majority error rate for each metadataset is 0.8.
Table 3 Error rates of classifiers: regression function prediction
Classifier  \(\sigma = 0\) Aver.  \(\sigma = 0\) Std.  \(\sigma = 0.5\) Aver.  \(\sigma = 0.5\) Std.  \(\sigma = 1\) Aver.  \(\sigma = 1\) Std.
RF  0.06  0.03  0.38  0.04  0.57  0.04
MLP  0.03  0.02  0.31  0.04  0.55  0.04
SVM  0.13  0.03  0.40  0.05  0.62  0.03
3NN  0.06  0.02  0.45  0.04  0.65  0.02
According to the results shown in Table 3, all classification techniques were successful in separating the noiseless datasets according to their function type. MLP achieved the lowest average error rate, while SVM obtained the worst results. More errors are observed for functions pol3 and pol5, which are usually confused with each other. As noise is added to the base datasets, the recognition performance of all techniques diminishes, although the error rates remain below the majority error rate of 0.8. This simple experiment evidences the joint ability of the measures to describe the synthetic datasets generated.
4.4 Correlation and PCA analysis
We also performed a principal component analysis (PCA) using the values of the measures calculated for all synthetic datasets. The first four components explain 95% of the data variance (Appendix Table 10). Figure 11 shows a scatter plot of the first two components and the variable loadings for datasets with \(\sigma = 0\). The first component, responsible for 46% of the data variance, separates the sinusoidal datasets (high PC1 values) from the lower-degree polynomials (low PC1 values). All features except \(T_2\) and \(S_2\) contribute to this component, which captures the complexity of the datasets. Two features which capture the sparsity of the datasets, \(T_2\) and \(S_2\), dominate PC2. Indeed, this PC separates data generated by distinct functions into four groups, reflecting the four numbers of input variables used in the data simulation. Of note, the pol1 datasets of each dimension are all located in the same position, as they all have similar complexity values. For datasets with \(\sigma = 0.5\), we observe a similar but more diffuse distribution of the datasets (Appendix Fig. 15). The distribution of the datasets becomes very diffuse and the groups of functions overlap more in the case of \(\sigma = 1.0\) (Appendix Fig. 16).
5 Experiments with real datasets: parameter selection
Some regression techniques require the user to define the values of one or more parameters. This is the case of the Support Vector Regression (SVR) technique. While MTL has proved to be a suitable and useful strategy for recommending the parameter values of SVR, its results are affected by the quality of the metafeatures employed to describe the base datasets (Soares et al. 2004).
In this case study, we investigate the use of the proposed measures to recommend the best configurations of the following parameters of SVR: the \(\gamma \) parameter of the RBF kernel, the regularization parameter C and the \(\epsilon \) parameter. The choice of the RBF kernel is due to its flexibility compared to other kernels (Keerthi and Lin 2003). The \(\gamma \) parameter has an important influence on learning performance since it controls the linearity of the induced SVR (Soares et al. 2004). The parameter C is also important for learning performance since it controls the complexity (flatness) of the regression function derived by the SVR (Basak et al. 2007). The \(\epsilon \) parameter denotes a tradeoff between model complexity and the amount of error allowed per training instance (Basak et al. 2007). Thus, if \(\epsilon \) is greater than the range of the target values, a good result cannot be expected. On the other hand, if \(\epsilon \) is zero, the model can overfit.
The three previous parameters affect the complexity of the regression model induced in SVR, and therefore its generalization performance for a given problem. The best configuration of parameters will depend on the complexity of the learning problem at hand, i.e., the complexity of the model has to match the complexity of the problem. Therefore, we expect that the complexity measures will be useful to identify how difficult is a problem and thus to recommend a suitable configuration of parameters for the SVR.
In this case study, a metabase was generated from the application of SVRs to 39 different regression problems collected from public data repositories. An instancebased metalearner was employed to recommend the best parameter configuration for new problems based on their metafeatures.
5.1 Metabase
Regression problems adopted for metaexample generation and their respective number of examples (in parenthesis)
Auto price (159)  Ele1 (495)  MBA grade (61) 
Auto MPG6 (392)  Ele2 (1052)  Mortgage (1049) 
Auto MPG8 (392)  Elusage (55)  Pollution (60) 
Baseball (337)  Fertility diagnosis (100)  Pwlinear (200) 
Baskball (96)  Forest fires (517)  Pyrim (75) 
Bodyfat (252)  Friedman (1200)  Sensory (576) 
Bolt (40)  Fruitfly (125)  Servo (167) 
Cloud (108)  Gascons (27)  Stock (950) 
Concrete (1030)  Housing (506)  Strike (625) 
CPU (209)  Laser (993)  Treasury (1049) 
Dee (365)  Longley (16)  Triazines (187) 
Detroit (13)  Lowbwt (189)  Vineyard (52) 
Diabetes numeric (43)  Machine CPU (209)  Yacht hydrodynamics (308) 
Each metaexample is related to a single regression problem and stores: (i) the values of its metafeatures (as described in Sect. 3); and (ii) the best configuration of parameters evaluated for the problem. The best configuration stored in a metaexample is determined by the exhaustive evaluation of 3,192 configurations of parameters: the parameter \(\gamma \) assumed 19 different values (from \(2^{-15}\) to \(2^{3}\), increasing in powers of two), the parameter C assumed 21 different values (from \(2^{-5}\) to \(2^{15}\), increasing in powers of two) and the parameter \(\epsilon \) assumed 8 values (from \(2^{-8}\) to \(2^{-1}\), increasing in powers of two). For each combination, a 10-fold cross-validation experiment was performed to evaluate the SVR performance. The configuration of parameters with the lowest Normalized MSE (NMSE) was then stored in the metaexample. In these experiments, we used the Scikit-learn package (Pedregosa et al. 2011) to implement the SVRs and to perform the cross-validation procedure.
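The size of the search grid can be checked with a short sketch. It assumes the lower exponents are negative (\(2^{-15}\), \(2^{-5}\), \(2^{-8}\) and \(2^{-1}\)), which is the reading that reproduces the stated counts of 19, 21 and 8 values; the variable names are illustrative only.

```python
from itertools import product

# Assumed exponent ranges: negative lower bounds are the reading that
# matches the stated counts of 19, 21 and 8 values per parameter.
gammas = [2.0 ** e for e in range(-15, 4)]    # 2^-15 ... 2^3
costs = [2.0 ** e for e in range(-5, 16)]     # 2^-5 ... 2^15
epsilons = [2.0 ** e for e in range(-8, 0)]   # 2^-8 ... 2^-1

# Every (gamma, C, epsilon) combination evaluated by the exhaustive search.
grid = list(product(gammas, costs, epsilons))
print(len(gammas), len(costs), len(epsilons), len(grid))  # 19 21 8 3192
```

Each tuple in `grid` would then be scored by cross-validation, and the one with the lowest NMSE kept for the metaexample.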
5.2 Metalearner
The metalearner suggests for the new problem the configuration of parameters stored in the retrieved metaexample. The main assumption here is that the best configuration of parameters in a problem can also achieve good results in a similar problem.
5.3 Results
The proposed metafeatures were evaluated on the recommendation of SVR parameters by following a leave-one-out methodology, described as follows. At each step of leave-one-out, one metaexample is left out to be adopted as the test problem, and the remaining 38 metaexamples are kept in the metabase to be selected by the metalearner. Once the most similar metaexample is identified in the metabase, its solution (configuration of parameters) is evaluated on the input problem and the NMSE obtained is stored.
The recommendation results achieved by using the proposed metafeatures were compared with four other approaches. Initially, we compared the results of the proposal with those achieved by the SVR using default parameters. The default parameter values are: C = 1.0, \(\gamma \) = \(\frac{1}{d}\), \(\epsilon \) = 0.1, where d is the number of input attributes in the regression problem. We also investigated how close the results achieved by our proposal came to the solutions found by a grid search (exhaustive approach). We point out that all approaches used are implementations from the Scikit-learn package (Pedregosa et al. 2011). Finally, we compared our results with the combination of two sets of metafeatures originally proposed by Soares et al. (2004) and Soares and Brazdil (2006). The combination of these metafeatures was also validated in Gomes et al. (2012) for SVR parameter selection.
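The leave-one-out procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the dictionary layout of a metaexample and the `evaluate` callback are all assumptions, and the toy values are made up.

```python
import math

def nearest_metaexample(query, metabase):
    """Return the stored metaexample whose metafeatures are closest to query.
    Euclidean distance is an assumption; the text does not fix the metric here."""
    return min(metabase, key=lambda m: math.dist(query, m["features"]))

def leave_one_out(metabase, evaluate):
    """For each problem, take the configuration of its nearest neighbour among
    the remaining metaexamples and record the error returned by evaluate."""
    errors = {}
    for i, test in enumerate(metabase):
        rest = metabase[:i] + metabase[i + 1:]
        neighbour = nearest_metaexample(test["features"], rest)
        errors[test["name"]] = evaluate(test["name"], neighbour["config"])
    return errors

# Toy metabase with made-up metafeatures and (gamma, C, epsilon) configurations.
toy = [
    {"name": "p1", "features": (0.10, 0.20), "config": (0.5, 1.0, 0.1)},
    {"name": "p2", "features": (0.15, 0.25), "config": (0.25, 4.0, 0.05)},
    {"name": "p3", "features": (0.90, 0.80), "config": (2.0, 32.0, 0.01)},
]

recommended = {}

def record(name, config):
    """Hypothetical evaluator: stores the recommendation and returns a
    placeholder where the cross-validated NMSE would be computed."""
    recommended[name] = config
    return 0.0

errors = leave_one_out(toy, record)
print(recommended)
```

In a real run, `evaluate` would train an SVR with the recommended configuration on the held-out problem and return its NMSE.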
NMSE values reached by the approaches
Dataset  Default  Soares  Proposal  Grid 

Auto price  0.118  0.012  0.011  0.008 
AutoMPG6  0.056  0.005  0.005  0.005 
AutoMPG8  0.056  0.005  0.005  0.005 
Baseball  0.134  0.044  0.015  0.013 
Baskball  0.047  0.017  0.018  0.017 
Bodyfat  0.039  0.001  0.001  0.001 
Bolt  0.141  0.021  0.016  0.008 
Cloud  0.116  0.005  0.005  0.004 
Concrete  0.050  0.007  0.009  0.005 
Cpu  0.196  0.002  0.001  0.000 
Dee  0.050  0.010  0.009  0.008 
Detroit  0.098  0.026  0.024  0.016 
Diabetes numeric  0.044  0.030  0.028  0.026 
Ele1  0.101  0.007  0.007  0.006 
Ele2  0.134  0.000  0.000  0.000 
Elusage  0.090  0.021  0.015  0.014 
Fertility diagnosis  0.252  0.720  0.107  0.100 
Forestfires  0.230  0.003  0.003  0.003 
Friedman  0.035  0.002  0.002  0.002 
Fruitfly  0.087  0.040  0.040  0.038 
Gascons  0.105  0.003  0.002  0.001 
Housing  0.055  0.010  0.040  0.010 
Laser  0.102  0.001  0.002  0.000 
Longley  0.125  0.003  0.002  0.001 
Lowbwt  0.031  0.015  0.015  0.014 
Machine cpu  0.190  0.011  0.004  0.003 
Mbagrade  0.060  0.048  0.045  0.044 
Mortgage  0.081  0.000  0.000  0.000 
Pollution  0.042  0.019  0.016  0.013 
PwLinear  0.041  0.006  0.005  0.005 
Pyrim  0.074  0.013  0.013  0.011 
Sensory  0.028  0.067  0.031  0.026 
Servo  0.151  0.008  0.008  0.008 
Stock  0.062  0.008  0.012  0.005 
Strike  0.209  0.006  0.006  0.006 
Treasury  0.096  0.000  0.000  0.000 
Triazines  0.081  0.025  0.021  0.021 
Vineyard  0.072  0.016  0.016  0.008 
Yacht hydrodynamics  0.168  0.006  0.005  0.001 
Ranking  3.92  2.56  2.21  1.33 
According to the results shown in Table 5, the NMSE values reached by the proposal were in general better than those achieved by the SVR using default parameters as defined in Pedregosa et al. (2011). More precisely, the proposal won in 38 out of the 39 regression problems and lost in only 1 dataset (Sensory). The grid search exhaustively considers all parameter combinations and thus finds the best solution among those present in the set of possible solutions. Our intention in comparing the recommendation results achieved by the proposal with the optimal results found by the grid search is to verify whether the proposal can recommend solutions close to the optimum. According to the Wilcoxon statistical test at the 95% confidence level, the results of the two approaches are similar in 30 out of 39 problems. Finally, the proposal achieved better NMSE values than the metafeatures employed in Soares et al. (2004) in most of the problems (18 wins, 16 draws and 5 losses). To perform a fair comparison between both approaches, we also analyzed the results with the Wilcoxon statistical test at the 95% confidence level. Under this test, the proposal accounted for 5 wins, 2 losses and 32 draws against its counterpart. The metafeatures from Soares et al. (2004) include both commonly used metafeatures for the characterization of regression problems and metafeatures specifically designed for SVR, which are based on the kernel matrix. Thus, good results are expected from this baseline. Our experiments show that our proposed generic metafeatures are competitive with this strong baseline and can be useful to characterize regression problems in the SVR domain.
A Friedman statistical test detects that the results of the techniques compared cannot be regarded as identical, at the 95% confidence level. The Nemenyi post-test points to no differences between the results of the Proposal and of the Soares metafeatures. This reinforces the good performance of our metafeatures, which achieved results comparable to those of the strong baseline of Soares et al. (2004) and Soares and Brazdil (2006).
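The raw win, draw and loss counts quoted above can be reproduced directly from the NMSE table. The sketch below copies the Default, Soares and Proposal columns as printed (three decimals) and compares them value by value; it does not perform the Wilcoxon test, and the function name `tally` is illustrative.

```python
# (dataset, default, soares, proposal) NMSE values copied from the results table.
rows = [
    ("Auto price", 0.118, 0.012, 0.011), ("AutoMPG6", 0.056, 0.005, 0.005),
    ("AutoMPG8", 0.056, 0.005, 0.005), ("Baseball", 0.134, 0.044, 0.015),
    ("Baskball", 0.047, 0.017, 0.018), ("Bodyfat", 0.039, 0.001, 0.001),
    ("Bolt", 0.141, 0.021, 0.016), ("Cloud", 0.116, 0.005, 0.005),
    ("Concrete", 0.050, 0.007, 0.009), ("Cpu", 0.196, 0.002, 0.001),
    ("Dee", 0.050, 0.010, 0.009), ("Detroit", 0.098, 0.026, 0.024),
    ("Diabetes numeric", 0.044, 0.030, 0.028), ("Ele1", 0.101, 0.007, 0.007),
    ("Ele2", 0.134, 0.000, 0.000), ("Elusage", 0.090, 0.021, 0.015),
    ("Fertility diagnosis", 0.252, 0.720, 0.107), ("Forestfires", 0.230, 0.003, 0.003),
    ("Friedman", 0.035, 0.002, 0.002), ("Fruitfly", 0.087, 0.040, 0.040),
    ("Gascons", 0.105, 0.003, 0.002), ("Housing", 0.055, 0.010, 0.040),
    ("Laser", 0.102, 0.001, 0.002), ("Longley", 0.125, 0.003, 0.002),
    ("Lowbwt", 0.031, 0.015, 0.015), ("Machine cpu", 0.190, 0.011, 0.004),
    ("Mbagrade", 0.060, 0.048, 0.045), ("Mortgage", 0.081, 0.000, 0.000),
    ("Pollution", 0.042, 0.019, 0.016), ("PwLinear", 0.041, 0.006, 0.005),
    ("Pyrim", 0.074, 0.013, 0.013), ("Sensory", 0.028, 0.067, 0.031),
    ("Servo", 0.151, 0.008, 0.008), ("Stock", 0.062, 0.008, 0.012),
    ("Strike", 0.209, 0.006, 0.006), ("Treasury", 0.096, 0.000, 0.000),
    ("Triazines", 0.081, 0.025, 0.021), ("Vineyard", 0.072, 0.016, 0.016),
    ("Yacht hydrodynamics", 0.168, 0.006, 0.005),
]

def tally(rows, baseline_idx, proposal_idx=3):
    """Count (wins, draws, losses) of the proposal against one baseline column."""
    wins = sum(r[proposal_idx] < r[baseline_idx] for r in rows)
    losses = sum(r[proposal_idx] > r[baseline_idx] for r in rows)
    return wins, len(rows) - wins - losses, losses

vs_default = tally(rows, 1)
vs_soares = tally(rows, 2)
print(vs_default)  # (38, 0, 1)
print(vs_soares)   # (18, 16, 5)
```

These are comparisons of the rounded table entries only; the 5 wins, 2 losses and 32 draws reported under the Wilcoxon test require the per-fold results, which are not listed here.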
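The average ranks in the last row of the NMSE table, and the Friedman statistic built on them, can be computed as in the sketch below. It uses the textbook chi-square form of the statistic without a tie correction, which is an assumption about the variant used; only three rows of the table (Auto price, Baseball and Sensory, with the columns Default, Soares, Proposal, Grid) are included to keep the example short.

```python
def average_ranks(row):
    """Rank values in ascending order (1 = lowest error); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(table):
    """Return (mean ranks per method, Friedman chi-square without tie correction)."""
    n, k = len(table), len(table[0])
    ranked = [average_ranks(r) for r in table]
    mean_ranks = [sum(r[j] for r in ranked) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(m * m for m in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2

# Columns: Default, Soares, Proposal, Grid (three rows taken from the table).
sample = [
    (0.118, 0.012, 0.011, 0.008),  # Auto price
    (0.134, 0.044, 0.015, 0.013),  # Baseball
    (0.028, 0.067, 0.031, 0.026),  # Sensory
]
mean_ranks, chi2 = friedman_statistic(sample)
print(mean_ranks, chi2)
```

With all 39 rows, `mean_ranks` corresponds to the Ranking row of the table, up to how ties in the rounded values are handled.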
6 Experiments with real datasets: metaregression
The data complexity measures proposed in our work try to characterize the difficulty of regression problems on a variety of situations. We expect that the proposed measures are actually related to the empirical performance obtained by regression algorithms once evaluated on a diverse set of regression problems.
In order to verify this statement, we performed a case study in which the proposed measures are adopted as metafeatures to predict the expected performance of some regression algorithms. The case study is defined as a metaregression task, in which the independent attributes are the complexity measures and the target attribute is the error obtained by a regression algorithm. This is a natural case study, since the performance of algorithms and the estimated difficulty of problems should be directly related. The usefulness of a set of data complexity measures depends on the ability of assessing the difficulty of different problems for different algorithms.
In this case study, the complexity measures were adopted to predict the performance of fourteen different regression algorithms implemented in the Weka environment (i.e., fourteen different instances of the metaregression task were considered in this case study). The adopted algorithms correspond to different families of regression techniques, with different biases and assumptions, ranging from simple linear regression to more sophisticated kernel methods (see Table 6). By adopting a diverse set of regression algorithms, we aim to verify how robust the proposed measures are to characterize the intrinsic difficulty of regression problems.
Algorithms adopted in the case study of metaregression
Acronym  Algorithm 

A1  Linear regression 
A2  Least med. square 
A3  IBk (k=3) 
A4  IBk (k=5) 
A5  M5P 
A6  REPTree 
A7  MLP network 
A8  RBF network 
A9  GP (kernel Poly, p=1) 
A10  GP (kernel Poly, p=2) 
A11  GP (kernel RBF) 
A12  SVM (kernel Poly, p=1) 
A13  SVM (kernel Poly, p=2) 
A14  SVM (kernel RBF) 
6.1 Metabase
In this case study, a metabase is built for each of the fourteen baselearners considered (in our case, regression algorithms). Given a baselearner, a metabase was built from the 39 regression problems collected in our previous experiments. Each metaexample was generated from a regression problem and stores: (1) the metaattributes describing the regression problem; and (2) the NMSE produced by the baselearner for that problem in a 10-fold cross-validation (CV) experiment.
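For reference, a minimal sketch of the error stored in each metaexample is given below. It assumes one common definition of NMSE, the squared error divided by the squared deviation of the targets from their mean; the normalization actually used is not restated in this section.

```python
def nmse(y_true, y_pred):
    """Normalized MSE under the assumed definition:
    sum((y - y_hat)^2) / sum((y - mean(y))^2)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return sse / sst

targets = [1.0, 2.0, 3.0, 4.0]
print(nmse(targets, targets))     # a perfect predictor gives 0.0
print(nmse(targets, [2.5] * 4))   # always predicting the mean gives 1.0
```

Under this definition, values below 1 indicate a model that improves on always predicting the mean of the target.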
In the CV experiments, we observed a varied learning performance depending on the regression problem considered. In fact, the best algorithms for each problem varied a lot and statistical differences between the baselearners were commonly observed. Despite this variability, the correlation of the NMSE values obtained by the baselearners across the datasets was commonly high (the average correlation was 0.8). Each regression problem has an intrinsic difficulty that homogeneously affected the performance level of all regression algorithms. We could say that part of the learning performance obtained by a baselearner on a problem is explained by its specific skills (which is related to the relative complexity concept briefly discussed in Sect. 3) and the remaining part is explained by the difficulty of the problem itself.
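The average correlation mentioned above can be obtained by averaging the Pearson correlation over all pairs of baselearners. The sketch below assumes Pearson correlation (the coefficient used is not named in this paragraph) and runs on made-up values rather than the actual NMSE matrix.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def average_pairwise_correlation(columns):
    """Mean Pearson correlation over all unordered pairs of columns."""
    pairs = list(combinations(columns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Made-up error values of three hypothetical learners on three problems.
learner_a = [1, 2, 3]
learner_b = [2, 4, 6]
learner_c = [3, 2, 1]
avg = average_pairwise_correlation([learner_a, learner_b, learner_c])
print(avg)
```

For the study itself, each column would hold the 39 NMSE values of one of the fourteen baselearners.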
In order to assess problem difficulty in this case study, we evaluated two different sets of metaattributes: the proposed set of measures and the metafeatures proposed in Soares et al. (2004) and Soares and Brazdil (2006), adopted here as a baseline of comparison. This is the same set of metafeatures named Soares in the previous section. Therefore, considering the set of 14 baselearners and 2 metafeature sets, a total of 28 different metabases were built in our experiments.
6.2 Metalearner
As the metalearning task in this case study is to predict the error rate of a given baselearner, the metalearner is also a regression algorithm. In this case study, we adopted the SVR algorithm (with RBF kernel), since it has been shown to be a very competitive and robust algorithm for regression.
6.3 Relevance of complexity measures
Correlation among the proposed complexity measures and the NMSE values obtained by the baselearners on the regression problems adopted
  \(C_1\)  \(C_2\)  \(C_3\)  \(C_4\)  \(L_1\)  \(L_2\)  \(S_1\)  \(S_2\)  \(S_3\)  \(L_3\)  \(S_4\)  \(T_2\) 

PC  \(-\)0.86  \(-\)0.75  0.66  0.48  0.59  0.57  0.67  0.44  0.61  0.56  0.54  \(-\)0.29\(^*\) 
A1  \(-\)0.66  \(-\)0.69  0.52  0.35  0.37  0.42  0.47  0.52  0.46  0.43  0.40  \(-\)0.37 
A2  \(-\)0.71  \(-\)0.70  0.55  0.17\(^*\)  0.30\(^*\)  0.38  0.47  0.39  0.47  0.39  0.45  \(-\)0.22\(^*\) 
A3  \(-\)0.69  \(-\)0.70  0.53  0.23\(^*\)  0.35  0.40  0.53  0.41  0.48  0.41  0.46  \(-\)0.28\(^*\) 
A4  \(-\)0.68  \(-\)0.64  0.46  0.18\(^*\)  0.42  0.58  0.51  0.53  0.62  0.60  0.44  \(-\)0.31\(^*\) 
A5  \(-\)0.56  \(-\)0.72  0.43  0.39  0.37  0.38  0.39  0.39  0.38  0.39  0.34  \(-\)0.23\(^*\) 
A6  \(-\)0.70  \(-\)0.71  0.55  0.28\(^*\)  0.40  0.47  0.55  0.47  0.54  0.47  0.49  \(-\)0.31 
A7  \(-\)0.56  \(-\)0.54  0.38  0.29\(^*\)  0.32  0.40  0.39  0.30\(^*\)  0.42  0.41  0.38  \(-\)0.19\(^*\) 
A8  \(-\)0.59  \(-\)0.53  0.36  0.28\(^*\)  0.33  0.45  0.40  0.32  0.48  0.48  0.42  \(-\)0.19\(^*\) 
A9  \(-\)0.64  \(-\)0.75  0.52  0.26\(^*\)  0.41  0.42  0.44  0.52  0.43  0.44  0.39  \(-\)0.25\(^*\) 
A10  \(-\)0.66  \(-\)0.69  0.52  0.22\(^*\)  0.38  0.40  0.42  0.48  0.42  0.41  0.38  \(-\)0.21\(^*\) 
A11  \(-\)0.55  \(-\)0.64  0.39  0.51  0.42  0.41  0.49  0.34  0.41  0.41  0.39  \(-\)0.34 
A12  \(-\)0.69  \(-\)0.62  0.49  0.21\(^*\)  0.34  0.44  0.53  0.42  0.54  0.45  0.50  \(-\)0.30\(^*\) 
A13  \(-\)0.63  \(-\)0.57  0.50  0.15\(^*\)  0.30\(^*\)  0.35  0.56  0.39  0.48  0.35  0.46  \(-\)0.38 
A14  \(-\)0.56  \(-\)0.68  0.41  0.46  0.38  0.37  0.47  0.46  0.40  0.37  0.32  \(-\)0.40 
Before presenting the predictive results obtained by the metalearner, we investigated the relevance of the complexity measures and their relationship with the performance obtained by the baselearners. In this section, we aim to provide insights about the usefulness of the proposed complexity measures.
Initially, we analyze the correlation between the complexity measures and the performance of the algorithms. In order to summarize the algorithm performance in a lower dimension, we applied PCA to the 39 (datasets) \(\times \) 14 (baselearners) matrix of NMSE values. Next, we plotted each proposed measure against the first principal component derived from the NMSE matrix (see Fig. 12). Table 7 (first line) presents the correlation values associated with the plots in Fig. 12.
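How a performance matrix is collapsed into a single score per dataset can be sketched as below. This is an illustration under stated assumptions: columns are centred but not rescaled (the preprocessing is not specified here), and power iteration is swapped in for a full eigendecomposition. The all-ones start vector suits positively correlated columns, as reported for the NMSE values, but would fail if it were orthogonal to the leading direction.

```python
def first_principal_component(rows, iterations=500):
    """Return (direction, scores) of the leading principal component,
    found by power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # start vector; assumes it is not orthogonal to PC1
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project each centred row onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores

# Made-up matrix: 3 problems (rows) by 2 learners (columns).
toy = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
direction, scores = first_principal_component(toy)
print(direction, scores)
```

The `scores` list plays the role of the first principal component against which each complexity measure is plotted and correlated.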
As expected (by definition), we observed a positive correlation between the regression error and the measures \(C_3\), \(C_4\), \(L_1\), \(L_2\), \(S_1\), \(S_2\), \(S_3\), \(L_3\) and \(S_4\). In turn, negative correlation was observed for \(C_1\), \(C_2\) and \(T_2\). The correlation values suggest the relative importance of each attribute. Most correlations were statistically significant (except for \(T_2\)).
Figure 12 indicates that some measures (like \(C_1\), \(C_2\), \(C_3\) and \(S_1\)) present a more regular linear relationship with the learning performance. Other measures (like \(L_2\), \(S_3\), \(L_3\) and \(S_4\)) seem to present an exponential behavior, in the sense that small changes in their values correspond to a large increase in the regression errors. Such measures may be less useful for identifying some of the complex problems (i.e., problems may be considered simpler than they really are).
Table 7 also details the correlation between the proposed measures and the performance of the baselearners. Feature correlation measures \(C_1\), \(C_2\) and \(C_3\) were notably related to regression performance (i.e., high feature correlation to the target attribute results in low regression error). Other complexity measures present intermediate correlations (mostly statistically significant).
Centroid of regression problems obtained by k Means
Cluster 1 high complex  Cluster 2 low complex  

\(\text{ C }_1\)  0.383  0.830 
\(\text{ C }_2\)  0.207  0.620 
\(\text{ C }_3\)  0.593  0.108 
\(\text{ C }_4\)  0.512  0.170 
\(\text{ L }_1\)  0.521  0.252 
\(\text{ L }_2\)  0.256  0.068 
\(\text{ S }_1\)  0.589  0.261 
\(\text{ S }_2\)  0.404  0.194 
\(\text{ S }_3\)  0.275  0.065 
\(\text{ L }_3\)  0.236  0.057 
\(\text{ S }_4\)  0.236  0.038 
\(\text{ T }_2\)  0.536  0.641 
6.4 Metaregression results
Average NMSE values (and standard deviations) obtained by the SVR metaregression for each metalearning task considering different baselearners and metafeature sets
Base learner  Baseline (Soares)  Proposal  

A1  0.83 (0.18)  0.68 (0.13)  \(\bullet \)  
A2  0.85 (0.15)  0.79 (0.07)  \(\bullet \)  
A3  0.78 (0.19)  0.64 (0.15)  \(\bullet \)  
A4  0.81 (0.08)  0.72 (0.10)  \(\bullet \)  
A5  0.86 (0.11)  0.74 (0.10)  \(\bullet \)  
A6  0.89 (0.08)  0.79 (0.10)  \(\bullet \)  
A7  0.80 (0.16)  0.69 (0.09)  \(\bullet \)  
A8  0.94 (0.08)  0.77 (0.11)  \(\bullet \)  
A9  0.91 (0.08)  0.83 (0.09)  \(\bullet \)  
A10  0.88 (0.10)  0.74 (0.11)  \(\bullet \)  
A11  0.82 (0.15)  0.65 (0.13)  \(\bullet \)  
A12  0.82 (0.10)  \(\bullet \)  0.69 (0.09)  \(\bullet \) 
A13  0.87 (0.08)  \(\bullet \)  0.76 (0.10)  \(\bullet \) 
A14  0.86 (0.10)  0.78 (0.08)  \(\bullet \) 
7 Conclusion
This paper presented and formalized a set of metafeatures that can be employed to characterize the complexity of regression problems. They estimate how complex the function that must be fitted to the data should be in order to solve the regression problem. The utility of these measures in the description of regression problems was assessed in three MTL setups: (i) predicting the regression function type of some datasets; (ii) determining the parameter values of SVR regressors; and (iii) predicting the expected NMSE of various regressors. While the first analysis employs synthetic datasets containing known regression functions, the last two use real regression datasets from public repositories.
Experiments with synthetic datasets revealed that many of the individual measures were able to separate simple problems from more complex variants. However, each measure provides a single perspective on the problem complexity. Since several aspects can influence the complexity of a regression problem, a combination of various measures should be used to describe a problem. This is supported by the successful use of the complexity measures as metafeatures in an MTL system designed to predict the regression function generating the data. In the other two setups, the complexity measures are also jointly employed as metafeatures to characterize a set of 39 regression problems from public repositories. For these datasets, good predictive results were obtained in two distinct metalearning setups: the first one designed to tune the parameter values of the Support Vector Regression technique and the second one used to predict the expected NMSE of various regression techniques.
The PCA further supports the importance of, and the association between, the proposed measures. The PCA on simulated data revealed that most measures are important to characterize data complexity. There, the first component contained similar contributions from all measures except \(T_2\) and \(S_2\), which were in turn prevalent in the second component. The PCA and the correlation analysis also indicate that some groups of variables are redundant, e.g., \(C_1\)–\(C_2\) and \(S_1\)–\(S_3\). Similar importances were found in the association of the complexity measures with the first principal component of the NMSE of all evaluated regression methods for real data. Here, the correlation features (such as \(C_1\), \(C_2\) and \(C_3\)) and the data smoothness features (such as \(S_1\)) were the most effective in discriminating data complexity.
It is worth noting that the calculated values of the measures give an apparent estimate of the regression problem complexity. This happens because their values are estimated from the dataset available for learning, which contains a sample of the problem instances.
7.1 Future work
Future work shall investigate other measures from the related literature that can be used to characterize the difficulty of regression problems. Examples include the proportion of features with outliers and the coefficient of variation of the target, used in Soares et al. (2004) and Gomes et al. (2012). If there are many features with outliers, the task of fitting a function to the data will be impaired. More complex functions may also be required if the outputs vary too much. Another candidate measure is the coefficient of determination from the Statistics literature, which can be used to measure how well the data conform to a linear model. Therefore, there is already a range of measures in the literature which, although they have not been analyzed with the aim of characterizing the complexity of regression problems, can be directly used or adapted for this purpose.
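Two of the candidate measures named above are simple to compute, as the following sketch suggests. It assumes the population standard deviation for the coefficient of variation and, for the coefficient of determination, a least-squares line on a single input feature; both choices are illustrative simplifications rather than the definitions used in the cited works.

```python
from statistics import mean, pstdev

def coefficient_of_variation(y):
    """Standard deviation of the target divided by its mean
    (population standard deviation assumed)."""
    return pstdev(y) / mean(y)

def r_squared_simple(x, y):
    """Coefficient of determination of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1.0, 2.0, 3.0, 4.0]
y_linear = [2.0, 4.0, 6.0, 8.0]   # exactly linear in x
y_curved = [1.0, 4.0, 9.0, 16.0]  # quadratic in x
print(coefficient_of_variation(y_linear))
print(r_squared_simple(x, y_linear), r_squared_simple(x, y_curved))
```

A value of the coefficient of determination close to 1 indicates data that conform well to a linear model, while lower values suggest that a more complex function is needed.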
Some of the proposed measures can also be reformulated to become more robust. An example is the \(C_3\) measure. We considered a threshold of 0.9 as a high correlation value, but other values could be employed instead to decide whether the correlation is significant. The same reasoning applies to the \(C_4\) measure and the threshold employed for the residuals values.
The values of some measures are also very dependent on the number of features in a dataset. This issue should be investigated further in future work. For correlation measures, for example, the higher the number of features, the lower their individual correlation values tend to be. This happens because it is a combination of the features that correlates with the output. Since correlation does not imply causality and can be observed at random (Armstrong 2012), some caution must also be taken in their interpretation. Nonetheless, in most ML datasets we do expect that most of the features are related to the domain and that the causality component is present.
The formulated measures can only be applied to datasets containing quantitative/numerical features. For nominal features, the measures must be adapted or changed, although it is also possible to map the symbolic values into numerical values and apply the standard measures to the resulting values.
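One possible mapping from symbolic to numerical values is one-hot encoding, sketched below. The choice of encoding is an assumption made for illustration; the text does not prescribe a particular mapping, and an ordinal encoding would be another option for ordered categories.

```python
def one_hot(values):
    """Map each nominal value to a 0/1 indicator vector, one position per category.
    Categories are sorted so that the column order is deterministic."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical nominal feature with two categories.
encoded, categories = one_hot(["red", "blue", "red"])
print(categories)  # ['blue', 'red']
print(encoded)     # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Each indicator column can then be treated as a numerical feature by the standard measures, at the cost of increasing the number of features.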
Finally, the proposed measures can be evaluated in different case studies for selecting regression algorithms or hyperparameter values using conventional metalearners or combined with search strategies, as described in Wistuba et al. (2016) and Leite et al. (2012). Such case studies will reveal in which situations the proposed measures are more or less useful and will provide insights for new proposals.
Notes
Acknowledgements
To the research agencies FAPESP (2012/22608-8), CNPq (482222/2013-1, 308858/2014-0 and 305611/2015-1), CAPES, DAAD and IZKF Aachen for the financial support.
References
Amasyali, M., & Erson, O. (2009). A study of meta learning for regression. Tech. rep. ECE Technical Reports 386, Purdue University.
Armstrong, J. S. (2012). Illusions in regression analysis. International Journal of Forecasting, 28(3), 689–694.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences.
Basak, D., Pal, S., & Patranabis, D. C. (2007). Support vector regression. Neural Information Processing - Letters and Reviews, 11(10), 203–224.
Brazdil, P., Giraud-Carrier, C., Soares, C., & Vilalta, R. (2008). Metalearning: Applications to data mining. New York: Springer Science and Business Media.
Cavalcanti, G., Ren, T., & Vale, A. (2012). Data complexity measures and nearest neighbor classifiers: A practical analysis for metalearning. In: IEEE 24th international conference on tools with artificial intelligence (ICTAI), 2012 (Vol. 1, pp. 1065–1069). IEEE.
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2002). On kernel-target alignment. Advances in Neural Information Processing Systems, 14, 367–373.
de Miranda, P., Prudêncio, R. B. C., Carvalho, A., & Soares, C. (2014). A hybrid metalearning architecture for multiobjective optimization of SVM parameters. Neurocomputing, 143, 27–43.
Garcia, L. P., de Carvalho, A. C., & Lorena, A. C. (2015). Effect of label noise in the complexity of classification problems. Neurocomputing, 160, 108–119.
Garcia, L. P., de Carvalho, A. C., & Lorena, A. C. (2016). Noise detection in the metalearning level. Neurocomputing, 176, 14–25.
Gomes, T. A. F., Prudêncio, R. B. C., Soares, C., Rossi, A. L. D., & Carvalho, A. (2012). Combining metalearning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1), 3–13.
Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289–300.
Keerthi, S. S., & Lin, C. J. (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7), 1667–1689.
Kuba, P., Brazdil, P., Soares, C., & Woznica, A. (2002). Exploiting sampling and metalearning for parameter setting for support vector machines. In: VIII Iberoamerican conference on artificial intelligence, proceedings of workshop learning and data mining associated with Iberamia 2002, University of Sevilla, Sevilla (Spain) (pp. 209–216).
Leite, R., Brazdil, P., & Vanschoren, J. (2012). Selecting classification algorithms with active testing. In: Proceedings of the 8th international conference on machine learning and data mining in pattern recognition (pp. 117–131).
Leyva, E., Gonzalez, A., & Perez, R. (2015). A set of complexity measures designed for applying metalearning to instance selection. IEEE Transactions on Knowledge and Data Engineering, 27(2), 354–367.
Loterman, G., & Mues, C. (2012). Selecting accurate and comprehensible regression algorithms through meta learning. In: IEEE 12th international conference on data mining workshops (pp. 953–960).
Maciel, A. I., Costa, I. G., & Lorena, A. C. (2016). Measuring the complexity of regression problems. In: IEEE proceedings of the 2016 international conference on neural networks (in press).
Morán-Fernández, L., Bolón-Canedo, V., & Alonso-Betanzos, A. (2017). Can classification performance be predicted by complexity measures? A study using microarray data. Knowledge and Information Systems, 51(3), 1067–1090.
Orriols-Puig, A., Macià, N., & Ho, T. K. (2010). Documentation for the data complexity library in C++. Tech. rep., La Salle, Universitat Ramon Llull.
Pappa, G. L., Ochoa, G., Hyde, M. R., Freitas, A. A., Woodward, J., & Swan, J. (2014). Contrasting metalearning and hyperheuristic research: The role of evolutionary algorithms. Genetic Programming and Evolvable Machines, 15(1), 3–35.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825–2830.
Smith, M. R., White, A., Giraud-Carrier, C., & Martinez, T. (2014). An easy to use repository for comparing and improving machine learning algorithm usage. Preprint. arXiv:1405.7292.
Soares, C. (2008). Development of metalearning systems for algorithm recommendation. In: P. Brazdil, C. Giraud-Carrier, C. Soares & R. Vilalta (Eds.), Metalearning: Applications to data mining (pp. 33–62). Springer.
Soares, C., & Brazdil, P. B. (2006). Selecting parameters of SVM using metalearning and kernel matrix-based metafeatures. In: Proceedings of the 2006 ACM symposium on applied computing, ACM, SAC ’06 (pp. 564–568).
Soares, C., Brazdil, P. B., & Kuba, P. (2004). A metalearning method to select the kernel width in support vector regression. Machine Learning, 54(3), 195–209.
Thornton, C., Hutter, F., Hoos, H., & Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 847–855).
Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2016). Two-stage transfer surrogate model for automatic hyperparameter optimization. In: European conference on machine learning and knowledge discovery in databases (pp. 199–214).