1 Introduction

Dimensionality reduction plays a crucial role in applying Machine Learning (ML) techniques to real-world datasets (Sorzano et al. 2014). Indeed, in a large variety of scenarios, data are high-dimensional with a large number of correlated features. For instance, financial datasets are characterized by time series representing the trend of stocks in the financial market, and climatological datasets include several highly-correlated features that, for example, represent temperature values at different points on the Earth. On the other hand, only a small subset of features is usually significant for learning a specific task, and it should be identified to train a well-performing ML algorithm. In particular, considering many redundant features increases the model complexity, which in turn increases its variance and the risk of overfitting (Hastie et al. 2009). Furthermore, when the number of features is high and comparable with the number of samples, the available data become sparse, leading to poor performance (curse of dimensionality (Bishop and Nasrabadi 2006)). For this reason, dimensionality reduction and feature selection techniques are usually applied. Feature selection (Chandrashekar and Sahin 2014) focuses on choosing a subset of features that are important for learning the target according to a specific criterion (e.g., the features most correlated with the target, or the ones that produce the highest validation score), discarding the others. On the other hand, dimensionality reduction methods (Sorzano et al. 2014) retain all the features by projecting them into a (much) lower-dimensional space, producing new features that are linear or non-linear combinations of the original ones. Compared to feature selection, this latter approach has the advantage of reducing the dimensionality without discarding any feature, exploiting the contribution of all of them in the projections. Moreover, recalling that the variance of an average of random variables is smaller than or equal to the largest of the individual variances, features computed as averages (as in the method proposed here) have variance no larger than the most variable of the original ones. However, the reduced features might be less interpretable, since they are linear combinations of the original ones with different coefficients.

In this paper, we propose a novel dimensionality reduction method that exploits the information of each feature, without discarding any of them, while preserving the interpretability of the resulting feature set. To this end, we aggregate features through their average, and we propose a criterion that aggregates two features when it is beneficial in terms of the bias-variance tradeoff. Specifically, we focus on linear regression, assuming a linear relationship between the features and the target. In this context, the main idea of this work is to identify a group of aggregable features and substitute them with their average. Intuitively, in linear settings, two features should be aggregated if their correlation is large enough. We identify a theoretical threshold on the minimum correlation for which it is profitable to unify the two features. This threshold is the minimum correlation value between two features for which, comparing the two linear regression models before and after the aggregation, the variance decrease is larger than the increase of bias.

We choose the average as the aggregation function to preserve interpretability (the resulting reduced feature is just the average of k original features). Another advantage is that the variance of the average is smaller than the variance of the original features if they are not perfectly correlated. Indeed, assuming that we unify k standardized features, the variance of their average becomes \(var(\bar{X})=\frac{1}{k}+\frac{k-1}{k}\rho\), with \(\rho\) being the average correlation between distinct features (Jacod and Protter 2004). The main restriction of choosing the average as the aggregation function is that we will only consider continuous features, since the mean is not well-defined for categorical features. Moreover, it would be improper to compute the mean of heterogeneous features: interpretability is preserved only if the aggregation is meaningful.

Another issue may arise when features have different units of measurement or scales; for this reason, we will consider standardized variables.
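As a quick numerical sanity check of the variance formula above, the following sketch (an illustration written for this presentation, not part of the original experiments) draws k standardized, equicorrelated Gaussian features and compares the empirical variance of their average with \(\frac{1}{k}+\frac{k-1}{k}\rho\).

```python
import numpy as np

# Illustrative check (not from the paper): variance of the average of k
# standardized features with common pairwise correlation rho.
rng = np.random.default_rng(0)
k, rho, n = 5, 0.8, 200_000

# Equicorrelation covariance matrix: 1 on the diagonal, rho elsewhere.
cov = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)
X = rng.multivariate_normal(mean=np.zeros(k), cov=cov, size=n)

x_bar = X.mean(axis=1)  # aggregated feature (the average)
print(f"empirical var(x_bar):          {x_bar.var():.4f}")
print(f"theoretical 1/k + (k-1)/k*rho: {1/k + (k-1)/k*rho:.4f}")
```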

Remark 1

(About the linearity assumption and non-linear cases.) The theoretical analysis that lays the foundations of the proposed algorithm is limited to linear assumptions and considers linear regression as the ML method. However, the proposed algorithm remains relevant beyond this setting. Indeed, the theoretical analysis proves that the proposed algorithm is theoretically sound under the linearity assumption. The algorithm is then designed relying on this linear theoretical result, but it can be applied to any regression problem with continuous features, where the proposed threshold becomes a heuristic quantity. While the theoretical guarantees no longer hold, this claim is supported by the empirical validation of the method on real-world datasets, which have no guarantee of linearity but show a promising applicability of the proposed method outside linear contexts. Additionally, as commonly done in linear regression, it is possible to consider non-linear transformations of the original features as inputs of the LinCFA algorithm to relax the linearity assumption in specific contexts.

Remark 2

(About interpretability.) Complex (linear and non-linear) transformations of the original features are usually performed by dimensionality reduction methods. As an example, PCA performs a linear combination of potentially all the original features, each with a different weight. This kind of aggregation is already described in Kovalerchuk et al. (2021) as not completely interpretable, since they classify this kind of transformation as quasi-explainable. In this context, the LinCFA algorithm is interpretable, since it only relies on taking the mean of several features, which is a transformation that a domain expert can understand without any additional explanation by ML experts. Lahav et al. (2018) define interpretability as “the extent to which a ML model can be made understandable to relevant human users, with the goal of increasing users’ trust in, and willingness to utilize, the model in practice”, and Kovalerchuk et al. (2021) define it in these terms: “the model is explainable if it is presented only in the domain terms (e.g., medicine) without terms that have no meaning in the domain”. The mean of variables known to the domain expert can therefore be considered interpretable in these terms. Additionally, the interpretability of the proposed method is particularly clear when the features share the same unit of measure. In this case, the reduced features are simply the average of a set of measurements of the same quantity at different locations, with different sensors, or at different time frames.

Fig. 1

Figure 1a shows the location of temperature measurements for each of the ten sub-basins of the Po River. Each color identifies the set of locations belonging to the same sub-basin. Figure 1b shows each of the clusters identified by the LinCFA algorithm. Each color represents a different set of locations where the algorithm performs an aggregation with the mean (Color figure online)

Additionally, depending on the application considered, the reduction performed with the mean can be particularly meaningful for domain experts. An example with meteorological measurements highlights the main applicative motivation behind the proposed approach. Indeed, a standard preprocessing approach often adopted in ML-based works for Earth science applications consists of computing the mean of a set of neighbouring measurements of the same physical quantity (e.g., temperature measurements at different locations). This method leads to the extraction of features that are average values of quantities over a region. As an example, Fig. 1a shows temperature gridded data related to ten different sub-basins of the Po River. In particular, each colored point in the figure represents one location where temperature measurements are available. Therefore, each point can be seen as the location of a feature that represents the temperature at those specific coordinates. To reduce the dimensionality, one may average the measurements of all the points within a sub-basin, following the geographical location of the data (in Fig. 1a each color identifies the set of locations belonging to a specific sub-basin). However, this approach has no guarantee on the ML performance and does not take into account the relationships within the data. The LinCFA algorithm, on the other hand, focuses on the relationships between pairs of points (i.e., features) and their relationship with the target to decide whether to aggregate them. Figure 1b shows the aggregations performed by the LinCFA algorithm: the dots with the same color correspond to the locations of the temperature features that the algorithm aggregates with their mean. Therefore, from the figure it is possible to conclude that, in this case, the data-driven approach aggregates the points differently from the geographical boundaries of the sub-basins. This preserves interpretability, since it aggregates measurements at different locations in the same way a domain expert does, with the advantage of being a data-driven, theoretically motivated approach.

Outline: The paper is structured as follows. In Sect. 2, we formally define the problem and provide a brief overview of the main dimensionality reduction methods. Section 3 introduces the methodology that will be followed throughout the paper. In Sect. 4, the main theoretical result is presented for the bivariate setting, which is then generalized to D dimensions in Sect. 5. Finally, in Sect. 6, the proposed algorithm, Linear Correlated Features Aggregation (LinCFA), is applied to synthetic and real-world datasets to experimentally validate the results, before the conclusions of Sect. 7. The paper is accompanied by supplementary material. Specifically, Appendix A contains the proofs and technical results of the bivariate case that are not reported in the main paper, Appendix B presents an additional finite-samples bivariate analysis, Appendix C reformulates the bivariate results so that they involve only theoretical or only empirical quantities, Appendix D contains the proofs and technical results of the three-dimensional setting, and Appendix E presents the experiments in more detail.

2 Preliminaries

In this section, we introduce the notation and assumptions employed in the paper (Sect. 2.1) and we survey the main related works (Sect. 2.2).

2.1 Notation and assumptions

Let (X, Y) be random variables with joint probability distribution \(P_{X,Y}\), where \(X\in \mathbb {R}^D\) is the D-dimensional vector of features and \(Y\in \mathbb {R}\) is the scalar target of a supervised learning regression problem. Given N data sampled from the distribution \(P_{X,Y}\), we denote the corresponding feature matrix as \({\textbf {X}}\in \mathbb {R}^{N\times D}\) and the target vector as \({\textbf {Y}}\in \mathbb {R}^{N}\). Each element of the random vector X is denoted with \(x_i\) and is called a feature of the ML problem. We denote as y the scalar target random variable and with \(\sigma ^2_y\) and \(\hat{\sigma }^2_y\) its variance and sample variance. For each pair of random variables (a, b), we denote with \(\sigma ^2_{a}\), cov(a, b) and \(\rho _{a,b}\) respectively the variance of the random variable a and its covariance and correlation with the random variable b. Their estimators are \(\hat{\sigma }^2_{a}\), \(\hat{cov}(a,b)\) and \(\hat{\rho }_{a,b}\). Finally, the expected value and the variance operators applied on a function f(a) of a random variable a w.r.t. its distribution are denoted with \(\mathbb {E}_a[f(a)]\) and \(var_a(f(a))\).

A dimensionality reduction method can be seen as a function \(\varvec{\phi }: \mathbb {R}^{N \times D} \rightarrow \mathbb {R}^{N \times d}\), mapping the original feature matrix \({\textbf {X}}\) with dimensionality D into a reduced dataset \({\textbf {U}} = \varvec{\phi }({\textbf {X}}) \in \mathbb {R}^{N\times d}\) with \(d<D\). The goal of this projection is to reduce the (possibly huge) dimensionality of the original dataset while keeping as much information as possible in the reduced dataset. This is usually done by preserving a distance (e.g., Euclidean, geodesic) or the probability that a point has the same neighbours after the projection (Zaki and Meira 2014).

In this paper, we assume a linear dependence between the features X and the target Y, i.e., \(Y = w^{T} X + \epsilon\), where \(\epsilon\) is a zero-mean noise, independent of X, and \(w \in \mathbb {R}^{D}\) is the weight vector. Without loss of generality, the expected value of each feature is assumed to be zero, i.e., \(\mathbb {E}[x_i]=\mathbb {E}[Y]=0\ \forall i\in \{1,\dots ,D\}\). Finally, we consider linear regression as the ML method: the i-th estimated coefficient is denoted with \(\hat{w}_i\), the estimated noise with \(\hat{\epsilon }\) and the predicted (scalar) target with \(\hat{y}\).

2.2 Existing methods

This section briefly surveys dimensionality reduction algorithms available in the literature, presenting unsupervised and supervised approaches. More extensive reviews can be found in (Sorzano et al. 2014; Cunningham and Ghahramani 2015; Espadoto et al. 2021; Chao et al. 2019). The algorithm presented in this paper can be considered a linear supervised dimensionality reduction approach, therefore the focus will be on this topic. However, feature selection also provides a set of reduced features, as discussed in Sect. 1 (the interested reader may refer to literature reviews such as (Li et al. 2017)). Therefore, the RReliefF algorithm (Robnik-Sikonja et al. 1997; Kononenko et al. 1997) will also be considered in the empirical evaluation as a supervised feature selection approach.

2.2.1 Unsupervised dimensionality reduction

Classical dimensionality reduction methods can be considered unsupervised learning techniques which, in general, do not take the target into account, but focus on projecting the dataset \({\textbf {X}}\) while minimizing a given loss.

The most popular unsupervised linear dimensionality reduction technique is Principal Components Analysis (PCA) (Pearson 1901; Hotelling 1933), a linear method that embeds the data into a linear subspace of dimension d while preserving as much as possible of the variance of the original dataset. One of the main difficulties of applying PCA to real problems is that it performs linear combinations of possibly all the D features, usually with different coefficients, losing the interpretability of each principal component and suffering from the curse of dimensionality. To overcome this issue, there exist some variants like svPCA (Ulfarsson and Solo 2011), which forces most of the weights of the projection to be zero. This contrasts with the approach proposed in this paper, which aims to preserve interpretability while exploiting the information yielded by each feature.

There exist several variants that overcome different issues of PCA (e.g., out-of-sample generalization, linearity, sensitivity to outliers) and other methods that approach the problem from a different perspective (e.g., the generative approach of Factor Analysis, the independence-based approach of Independent Component Analysis, matrix factorization with SVD); an extensive overview can be found in (Sorzano et al. 2014). A broader overview of linear dimensionality reduction techniques can be found in (Cunningham and Ghahramani 2015). Specifically, SVD (Golub and Reinsch 1970) leads to the same result as PCA from an algebraic perspective through matrix decomposition. Factor Analysis (Thurstone 1931) assumes that the features are generated from a smaller set of latent variables, called factors, and tries to identify them by looking at the covariance matrix. Both PCA and Factor Analysis can reduce, through rotations, the number of features that are combined in each reduced component to improve interpretability, but the coefficients can still be different and hard to interpret. Finally, Independent Component Analysis (Hyvärinen 1999) is an information-theoretic approach that looks for independent components (not only uncorrelated, as in PCA) that are not constrained to be orthogonal. This method is more focused on separating different signals mixed among the features than on reducing their dimensionality, which can be done as a subsequent feature selection step, simplified by the fact that the new components are independent.

In contrast with the linear nature of PCA, many non-linear approaches exist (see (Van Der Maaten et al. 2009; Espadoto et al. 2021) for a broader discussion), following the idea that the data can be projected onto non-linear manifolds. Some of them optimize a convex objective function (usually solvable through a generalized eigenproblem) trying to preserve either the global similarity of data (e.g., Isomap (Tenenbaum et al. 2000), Kernel PCA (Shawe-Taylor and Cristianini 2004), Kernel Entropy Component Analysis (Jenssen 2009), MVU (Weinberger et al. 2004), Diffusion Maps (Lafon and Lee 2006)) or the local similarity of data (LLE (Roweis and Saul 2000), Laplacian Eigenmaps (Belkin and Niyogi 2001), LTSA (Zhang and Zha 2004), LPP (He and Niyogi 2003)). Other methods optimize a non-convex objective function, with the purpose of rescaling the Euclidean distance (Sammon Mapping (Sammon 1969)), introducing more complex structures like neural networks (Multilayer Autoencoders (Hinton and Salakhutdinov 2006)), or aligning mixtures of models (LLC (Teh and Roweis 2002)).

In this paper we assume linearity; therefore, in the experimental section we will compare the proposed method with classical PCA and its supervised version, since PCA is one of the most widely applied linear unsupervised dimensionality reduction techniques in ML applications. Non-linear techniques for dimensionality reduction (Kernel PCA, Isomap, LLE, LPP) will also be considered to further test the behavior of the LinCFA algorithm on real data, where linearity is not guaranteed, together with the RReliefF algorithm as a non-linear supervised feature selection approach.
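For reference, most of the unsupervised baselines listed above are available in scikit-learn; the snippet below is only an illustrative instantiation (the number of components and the kernel are placeholders, not the settings of Sect. 6), while LPP, Supervised PCA and RReliefF require separate implementations.

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

d = 10  # illustrative number of reduced components
baselines = {
    "PCA": PCA(n_components=d),
    "KernelPCA (poly)": KernelPCA(n_components=d, kernel="poly"),
    "Isomap": Isomap(n_components=d),
    "LLE": LocallyLinearEmbedding(n_components=d),
}
# Each method exposes fit_transform(X), returning an N x d reduced dataset
# that can then be fed to a downstream regressor.
```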

2.2.2 Supervised dimensionality reduction

Supervised dimensionality reduction is a less-known but powerful approach when the main goal is to perform classification or regression rather than learn a data projection into a lower-dimensional space. The methods in this subfield are usually based on classical unsupervised dimensionality reduction, with the regression or classification loss added in the optimization phase. In this way, the reduced dataset \({\textbf {U}}\) is the specific projection that maximizes the performance on the considered supervised problem. This is usually done in classification settings by minimizing the distance within the same class and maximizing the distance between different classes, in the same fashion as Linear Discriminant Analysis (Fisher 1936). The other possible approach is to directly integrate the loss function for classification or regression. Following the taxonomy presented in (Chao et al. 2019), these supervised approaches can be divided into PCA-based, NMF-based (mostly linear), and manifold-based (mostly non-linear).

A well-known PCA-based algorithm is Supervised PCA. The most straightforward approach of this kind was proposed in (Bair et al. 2006): a heuristic that applies classical PCA only to the subset of features most related to the target. A more advanced approach can be found in (Barshan et al. 2011), where the original dataset is orthogonally projected onto a space where the features are uncorrelated, while simultaneously maximizing the dependency between the reduced dataset and the target by exploiting the Hilbert-Schmidt independence criterion. The goal of Supervised PCA is similar to that of the algorithm proposed in this paper. The main difference is that we are not looking for an orthogonal projection, but we aggregate features by computing their means (thus, two projected features can be correlated) to preserve interpretability. Many variants of Supervised PCA exist, e.g., to obtain non-linear projections or to handle missing values (Yu et al. 2006). Since it is defined in the same (linear) context and has the same final purpose (minimizing the mean squared regression error), Supervised PCA will be compared with the approach proposed in this paper in the experimental section. NMF-based algorithms (Jing et al. 2012; Lu et al. 2016) have better interpretability than PCA-based ones, but they rely on the non-negativity of the features, which is not a general property of linear problems. Manifold-based methods (Ribeiro et al. 2008; Zhang et al. 2018; Zhang 2009; Raducanu and Dornaika 2012), on the other hand, perform non-linear projections with higher computational costs. Therefore, neither family of techniques will be considered in this linear context.

3 Proposed methodology

In this section, we introduce the proposed dimensionality reduction algorithm, named Linear Correlated Features Aggregation (LinCFA), from a general perspective. The approach is based on the following simple idea. Starting from the features \(x_i\) of the D-dimensional vector X, we build the aggregated features \(u_k\) of the d-dimensional vector U. The dimensionality reduction function \(\varvec{\phi }\) is fully determined by a partition \(\varvec{\mathcal {P}}=\{\mathcal {P}_1,\dots ,\mathcal {P}_d\}\) of the set of features \(\{x_1,\dots ,x_D\}\). In particular, each feature \(x_i\) is assigned to a set \(\mathcal {P}_k\in \varvec{\mathcal {P}}\) and each feature \(u_k\) is computed as the average of the features in the k-th set of \(\varvec{\mathcal {P}}\):

$$\begin{aligned} u_k = \frac{1}{|\mathcal {P}_k|} \sum _{i \in \mathcal {P}_k} x_i. \end{aligned}$$
(1)
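As an illustration of Eq. (1), the following sketch (with a hypothetical partition chosen only for the example) computes the reduced dataset U from a partition of the feature indices.

```python
import numpy as np

def aggregate_by_partition(X, partition):
    """Map the N x D matrix X to the N x d matrix U of Eq. (1): each reduced
    feature is the average of the original features in one set of the partition.
    `partition` is a list of lists of column indices."""
    return np.column_stack([X[:, idx].mean(axis=1) for idx in partition])

# Hypothetical example: D = 5 features reduced to d = 2 averages.
X = np.random.default_rng(0).normal(size=(100, 5))
U = aggregate_by_partition(X, partition=[[0, 1, 2], [3, 4]])
print(U.shape)  # (100, 2)
```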

In the following sections, we will focus on finding theoretical guarantees to determine how to build the partition \(\varvec{\mathcal {P}}\). Intuitively, two features will belong to the same element of the partition \(\varvec{\mathcal {P}}\) if their correlation is larger than a threshold. This threshold is formalized as the minimum correlation for which the Mean Squared Error (MSE) of the regression with a single aggregated feature (i.e., the average) is not worse than the MSE with the two separate features.Footnote 1 In particular, it is possible to decompose the MSE as follows (bias-variance decomposition (Hastie et al. 2009)):

$$\begin{aligned}&\underbrace{\mathbb {E}_{x,y,\mathcal {T}}[(h_\mathcal {T}(x)-y)^2]}_{\text {MSE}} = \underbrace{\mathbb {E}_{x,\mathcal {T}}[(h_\mathcal {T}(x)-\bar{h}(x))^2]}_{\text {variance}} \nonumber \\&\quad +\underbrace{\mathbb {E}_{x}[(\bar{h}(x)-\bar{y}(x))^2]}_{\text {bias}} +\underbrace{\mathbb {E}_{x,y}[(\bar{y}(x)-y)^2]}_{\text {noise}}, \end{aligned}$$
(2)

where x, y are the features and the target of a test sample, \(\mathcal {T}\) is the training set, \(h_\mathcal {T}(\cdot )\) is the ML model trained on dataset \(\mathcal {T}\), \(\bar{h}(\cdot )\) is its expected value w.r.t. the training set \(\mathcal {T}\) and \(\bar{y}(x)\) is the expected value of the test output target y given the input features x. Decreasing the model complexity leads to a decrease in variance and an increase in bias. Therefore, in the analysis, we will compare these two variations and identify a threshold as the minimum value of correlation for which, after the aggregation, the decrease of variance is greater than or equal to the increase of bias, so that the MSE after the aggregation is not larger than the original one.

4 Two-dimensional analysis

This section introduces the theoretical analysis, performed in the bivariate setting, that identifies the minimum value of the correlation between two features for which it is convenient to aggregate them with their mean. In particular, Sect. 4.1 introduces the assumptions under which the analysis is performed. Section 4.2 computes the decrease in variance obtained by the aggregation. Then, Sect. 4.3 evaluates the increase in bias due to the aggregation. Finally, Sect. 4.4 combines the two results, identifying the minimum correlation for which it is profitable to aggregate the two features. In addition, Appendix A contains the proofs and technical results that are not reported in the main paper, Appendix B includes an additional finite-sample analysis, and Appendix C computes confidence intervals that allow stating the results with only theoretical or only empirical quantities.

4.1 Setting

In the two-dimensional case (\(D=2\)), we consider the relationship between the two features \(x_1\), \(x_2\) and the target y to be linear and affected by Gaussian noise: \(y=w_1x_1+w_2x_2+\epsilon\), with \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). As usually done in linear regression (Johnson and Wichern 2007), we assume the training dataset \({\textbf {X}}\) to be known. Moreover, recalling the zero-mean assumption (\({{\,\mathrm{\mathbb {E}}\,}}[x_1]={{\,\mathrm{\mathbb {E}}\,}}[x_2]=0\)), it follows \(\mathbb {E}[y]=w_1\mathbb {E}[x_1]+w_2\mathbb {E}[x_2]=0\) and \(\sigma ^2_y = \sigma ^2\).

We compare the performance (in terms of bias and variance) of the two-dimensional linear regression \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\) with the one-dimensional linear regression, which takes as input the average between the two features \(\hat{y}=\hat{w}\frac{x_1+x_2}{2}=\hat{w}\bar{x}\). As a result of this analysis, we will define conditions under which aggregating features \(x_1\) and \(x_2\) in the feature \(\bar{x}\) is convenient.
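The comparison can be sketched numerically as follows; the parameter values are illustrative (similar coefficients, highly correlated standardized features, small n), a regime where the aggregation is expected to pay off, and the test targets are kept noiseless so that only bias and variance are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, rho = 50, 1.0, 0.9
w = np.array([0.45, 0.55])                 # similar coefficients (illustrative)

cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=n)        # fixed design
x_bar = X.mean(axis=1, keepdims=True)                        # aggregated feature

X_test = rng.multivariate_normal(np.zeros(2), cov, size=100_000)
y_test = X_test @ w                                          # noiseless test targets

mse_2d, mse_1d = [], []
for _ in range(2_000):                     # resample the noise, X kept fixed
    y = X @ w + rng.normal(scale=sigma, size=n)
    w_hat_2d = np.linalg.lstsq(X, y, rcond=None)[0]
    w_hat_1d = np.linalg.lstsq(x_bar, y, rcond=None)[0][0]
    mse_2d.append(np.mean((X_test @ w_hat_2d - y_test) ** 2))
    mse_1d.append(np.mean((X_test.mean(axis=1) * w_hat_1d - y_test) ** 2))

print(f"avg test MSE, two separate features: {np.mean(mse_2d):.4f}")
print(f"avg test MSE, aggregated feature   : {np.mean(mse_1d):.4f}")
```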

4.2 Variance analysis

In this subsection, we compare the variance of the two models with both an asymptotic and a finite-samples analysis. Since the two-dimensional model estimates two coefficients, it is expected to have a larger variance; conversely, aggregating the two features reduces the variance of the model.

4.2.1 Variance of the estimators

A quantity necessary to compute the variance of the models compared throughout this subsection is the covariance matrix, w.r.t. the training set, of the vector \(\hat{w}\) of estimated regression coefficients. Given the training features \({\textbf {X}}\), a known result for a general linear problem with n samples and D features (Johnson and Wichern 2007) (see Appendix A for the computations) is:

$$\begin{aligned} var_{\mathcal {T}}(\hat{w}|{\textbf {X}}) = ({\textbf {X}}^T{\textbf {X}})^{-1}\sigma ^2. \end{aligned}$$
(3)
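A quick Monte Carlo check of Eq. (3) with a fixed design (written for illustration; the dimensions and noise level are arbitrary): resampling only the noise and refitting, the sample covariance of the OLS estimates should approach \(({\textbf {X}}^T{\textbf {X}})^{-1}\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 200, 2, 0.5
X = rng.normal(size=(n, D))                 # fixed (known) design matrix
w_true = np.array([0.4, 0.6])

theoretical_cov = np.linalg.inv(X.T @ X) * sigma**2          # Eq. (3)

w_hats = np.empty((20_000, D))
for t in range(20_000):                     # resample the noise, X kept fixed
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hats[t] = np.linalg.lstsq(X, y, rcond=None)[0]
empirical_cov = np.cov(w_hats, rowvar=False)

print(np.round(theoretical_cov, 5))
print(np.round(empirical_cov, 5))
```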

The following lemma shows the variance of the weights for the two specific models that we are comparing.

Lemma 1

Let the real model be linear with respect to the features \(x_1\) and \(x_2\) (\(y=w_1x_1+w_2x_2+\epsilon\)). In the one-dimensional case \(\hat{y}=\hat{w}\bar{x}\), we have:

$$\begin{aligned} var_{\mathcal {T}}(\hat{w}|{\textbf {X}}) = \frac{\sigma ^2}{(n-1)\hat{\sigma }^2_{\bar{x}}}. \end{aligned}$$
(4)

In the two-dimensional case \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\), we have:

$$\begin{aligned} var_{\mathcal {T}}(\hat{w}|{\textbf {X}})&= \frac{\sigma ^2}{(n-1)(\hat{\sigma }^2_{x_1}\hat{\sigma }^2_{x_2} - \hat{cov}(x_1,x_2)^2)} \nonumber \\&\quad \times \begin{bmatrix} \hat{\sigma }^2_{x_2} & -\hat{cov}(x_1,x_2) \\ -\hat{cov}(x_1,x_2) & \hat{\sigma }^2_{x_1} \end{bmatrix}. \end{aligned}$$
(5)

Proof

The proof of the two results follows from Equation (3), see Appendix A for the computations. \(\square\)

4.2.2 Variance of the model

Recalling the general definition of variance of the model from Equation (2), in the specific case of linear regression it becomes:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(h_{\mathcal {T}}(x)-\bar{h}(x))^2] = {{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(\hat{w}^T x-{{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {T}}[\hat{w}^Tx])^2]. \end{aligned}$$
(6)

The following result shows the variance of the two specific models (univariate and bivariate) considered in this section.

Theorem 1

Let the real model be linear with respect to the two features \(x_1\) and \(x_2\) (\(y=w_1x_1+w_2x_2+\epsilon\)). Then, in the one-dimensional case \(\hat{y}=\hat{w}\frac{x_1+x_2}{2}=\hat{w}\bar{x}\), we have:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(h_{\mathcal {T}}(x)-\bar{h}(x))^2|{\textbf {X}}] = \sigma _{x_1+x_2}^2\frac{\sigma ^2}{(n-1)\hat{\sigma }^2_{x_1+x_2}}. \end{aligned}$$
(7)

In the two-dimensional case \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\), we have:

$$\begin{aligned}&{{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(h_{\mathcal {T}}(x)-\bar{h}(x))^2|{\textbf {X}}] \nonumber \\&\quad = \frac{\sigma ^2(\sigma ^2_{x_1}\hat{\sigma }^2_{x_2}+\sigma ^2_{x_2}\hat{\sigma }^2_{x_1}-2cov(x_1,x_2)\hat{cov}(x_1,x_2))}{(n-1)(\hat{\sigma }^2_{x_1}\hat{\sigma }^2_{x_2} - \hat{cov}(x_1,x_2)^2)}. \end{aligned}$$
(8)

Proof

The proof combines the results of Lemma 1 with the definition of variance for a linear model given in Equation (6). The detailed proof can be found in Appendix A. \(\square\)

4.2.3 Comparisons

In this subsection, we compute the difference between the variance of the linear regression with the two features \(x_1\) and \(x_2\) and the variance of the linear regression with the single feature \(\bar{x}=\frac{x_1+x_2}{2}\). We will prove that, as expected, this difference is non-negative, and it represents the reduction of variance obtained when substituting a two-dimensional random vector with the average of its components.

First, the asymptotic analysis is performed, obtaining a result that can be applied with good approximation when a large number of samples n is available. Then, the analysis is repeated in the finite-samples setting, with an additional assumption on the variance and sample variance of the features \(x_1\) and \(x_2\) that simplifies the computations.Footnote 2

Case I: asymptotic analysis. The estimators that we are considering are consistent, i.e., they converge in probability to the true values of the parameters (e.g., \(\text {plim}_{n\rightarrow \infty }\hat{\sigma }^2_{x_1}=\sigma ^2_{x_1}\)). Therefore, the following result can be proved.

Theorem 2

Let \(\Delta _{var}^{n\rightarrow \infty }\) be the difference between the variance of the two-dimensional and that of the one-dimensional linear model when the number of samples n tends to infinity. It is equal to:

$$\begin{aligned} \Delta _{var}^{n\rightarrow \infty }=\frac{\sigma ^2}{n-1} \ge 0, \end{aligned}$$
(9)

which is a non-negative quantity that tends to zero as the number of samples tends to infinity.

Proof

The result follows from the difference between Eqs. 8 and 7, exploiting the consistency of the estimators. \(\square\)

Case II: finite-samples analysis with equal variance and sample variance. For the finite-samples analysis, we add the following assumption to simplify the computations:

$$\begin{aligned} {\left\{ \begin{array}{ll} \sigma _{x_1}=\sigma _{x_2}=:\sigma _x \\ \hat{\sigma }_{x_1}=\hat{\sigma }_{x_2}=:\hat{\sigma }_x. \end{array}\right. } \end{aligned}$$
(10)

Theorem 3

If the conditions of Equation (10) hold, let \(\Delta _{var}\) be the difference between the variance of the two-dimensional and that of the one-dimensional linear model. It is always non-negative and equal to:

$$\begin{aligned} \Delta _{var}=\frac{\sigma ^2}{n-1}\cdot \frac{\sigma ^2_x(1-\rho _{x_1,x_2})}{\hat{\sigma }^2_x(1-\hat{\rho }_{x_1,x_2})}. \end{aligned}$$
(11)

Proof

The proof starts again from the variances of the two models found in Theorem 1 and it performs algebraic computations exploiting the assumption stated in Equation (10). All the steps can be found in Appendix A. \(\square\)

Remark 3

When the number of samples n tends to infinity, the result of Equation (11) reduces to the asymptotic case, as in Equation (9).

Remark 4

The quantities found in Theorems 2 and 3 are always non-negative, meaning that the variance of the two-dimensional model is always greater than or equal to that of the corresponding one-dimensional version, as expected.

4.3 Bias analysis

In this subsection, we compare the (squared) bias of the two models under examination with both an asymptotic and a finite-samples analysis, as done in the previous subsection for the variance. Since the two-dimensional model corresponds to a larger hypothesis space, it is expected to have a lower bias than the one-dimensional one.

The procedure to derive the difference between the biases is similar to the one followed for the variance. The first step is to compute the expected value w.r.t. the training set \(\mathcal {T}\) of the vector \(\hat{w}\) of the regression coefficient estimates, given the training features \({\textbf {X}}\). This is used to compute the bias of the models. In particular, in Equation (2), we defined the (squared) bias as follows:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_x[(\bar{h}(x)-\bar{y}(x))^2]= {{\,\mathrm{\mathbb {E}}\,}}_x[({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {T}}[h_{\mathcal {T}}(x)]-{{\,\mathrm{\mathbb {E}}\,}}_{y|x}[y])^2]. \end{aligned}$$
(12)

Starting from this definition, the bias of the one-dimensional case \(\hat{y}=\hat{w}\bar{x}\) is computed. Moreover, for the two-dimensional case \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\), the model is clearly unbiased. Detailed computations can be found in Appendix A.

After the derivation of the bias of the models, the same asymptotic and finite-samples analysis performed on the variance is repeated in this section for the (squared) bias. Since the two-dimensional model is unbiased, we can conclude that the increase of the bias component of the loss, when the two features are substituted by their mean, is equal to the bias of the one-dimensional model.

Case I: asymptotic analysis. When the number of samples n of the training dataset \(\mathcal {T}\) approaches infinity, recalling that the considered estimators converge in probability to the true values of the parameters, the following result holds.

Theorem 4

Let \(\Delta _{bias}^{n\rightarrow \infty }\) be the difference between the bias of the one-dimensional and that of the two-dimensional model when the number of samples n tends to infinity. It is equal to:

$$\begin{aligned} \Delta _{bias}^{n\rightarrow \infty }&=\frac{\sigma ^2_{x_1}\sigma ^2_{x_2}(1-\rho _{x_1,x_2}^2)(w_1-w_2)^2}{\sigma ^2_{x_1+x_2}}\end{aligned}$$
(13)
$$\begin{aligned}&=\frac{(1-\rho _{x_1,x_2})(w_1-w_2)^2}{2}, \end{aligned}$$
(14)

where the second equality holds if \(\sigma _{x_1}=\sigma _{x_2}=1\).

Proof

The proof starts from the bias of the two models computed in Appendix A and exploits the fact that, in the limit \(n \rightarrow \infty\), it is possible to substitute every sample estimator with the true value of the corresponding parameter, because the estimators are consistent. Details can be found in Appendix A. \(\square\)

Case II: finite-samples analysis with equal variance and sample variance. In the finite-samples case, we provide the same analysis performed for the variance, i.e., under the assumptions of Equation (10).

Theorem 5

If the conditions of Equation (10) hold, let \(\Delta _{bias}\) be the difference between the (squared) bias of the one-dimensional and that of the two-dimensional linear model. Then it is equal to:

$$\begin{aligned} \Delta _{bias}=\frac{\sigma ^2_x(1-\rho _{x_1,x_2})(w_1-w_2)^2}{2}. \end{aligned}$$
(15)

Proof

The proof starts from the bias of the two models and performs algebraic computations exploiting the assumptions of Equation (10). All the steps can be found in Appendix A. \(\square\)

Remark 5

When the number of samples n tends to infinity, the result in Equation (15) reduces to the asymptotic case as in Theorem 4.

Remark 6

Some observations are in order:

  • As expected, the quantities found in Theorems 4 and 5 are always non-negative, since the hypothesis space of the univariate model is a subset of that of the bivariate model.

  • We observe that \(\Delta _{bias}=0\) if \(\rho _{x_1,x_2}=1\). Indeed, when the two variables are perfectly (positively) correlated, their coefficients in the linear regression are equal; therefore, there is no loss of information in their aggregation.

  • Finally, when the two regression coefficients are equal (\(w_1=w_2\)), there is no increase of bias due to the aggregation, since it is enough to learn a single coefficient \(\bar{w}\) to achieve the same performance as the bivariate model.

4.4 Correlation threshold

This subsection concludes the analysis with two features by comparing the reduction of variance with the increase of bias when aggregating the two features \(x_1\) and \(x_2\) with their average \(\bar{x}=\frac{x_1+x_2}{2}\). In conclusion, the result shows when it is convenient to aggregate the two features with their mean, in terms of mean squared error.

Considering the asymptotic case, the following theorem compares bias and variance of the models.

Theorem 6

When the number of samples n tends to infinity and the relationship between the features and the target is linear with Gaussian noise, the decrease of variance is greater than the increase of (squared) bias when the two features \(x_1\) and \(x_2\) are aggregated with their average if and only if:

$$\begin{aligned} \rho ^2_{x_1,x_2} \ge 1-\frac{\sigma ^2\sigma ^2_{x_1+x_2}}{(n-1)\sigma ^2_{x_1}\sigma ^2_{x_2}(w_1-w_2)^2}, \end{aligned}$$
(16)

which, for \(\sigma _{x_1}=\sigma _{x_2}=1\), becomes:

$$\begin{aligned} \rho _{x_1,x_2} \ge 1-\frac{2\sigma ^2}{(n-1)(w_1-w_2)^2}. \end{aligned}$$
(17)

Proof

Computing the difference between Eqs. (9) and (13) the result follows. \(\square\)

In the finite-samples setting, with the additional assumptions of Eq. (10), the following theorem shows the result of the comparison between bias and variance of the two models.

Theorem 7

Let the variance and sample variance of the features \(x_1\) and \(x_2\) be equal (Eq. (10)) and the relationship between the features and the target be linear with Gaussian noise. The decrease of variance is greater than the increase of (squared) bias when the two features \(x_1\) and \(x_2\) are aggregated with their average if and only if:

$$\begin{aligned} \hat{\rho }_{x_1,x_2} \ge 1-\frac{2\sigma ^2}{(n-1)\hat{\sigma }^2_x(w_1-w_2)^2}, \end{aligned}$$
(18)

which, for \(\hat{\sigma }_x=1\), becomes:

$$\begin{aligned} \hat{\rho }_{x_1,x_2} \ge 1-\frac{2\sigma ^2}{(n-1)(w_1-w_2)^2}. \end{aligned}$$
(19)

Proof

Computing the difference between Equation (11) and (15) the result follows. \(\square\)

Remark 7

The results of Theorems 6 and 7 comply with the intuition that, in a linear setting with two features, they should be aggregated if their correlation is large enough.

Remark 8

Theorems 6 and 7 with unitary (sample) variances produce the same threshold in both the finite-sample and the asymptotic settings.

In conclusion, the thresholds found in Theorems 6 and 7 show that it is profitable in terms of MSE to aggregate two variables in a bivariate linear setting with Gaussian noise if:

  • the variance of the noise \(\sigma ^2\) is large, which means that the process is noisy and the variance should be reduced;

  • the number of samples n is small: in this case there is little knowledge about the actual model, and it is better to learn one parameter rather than two;

  • the difference between the two coefficients \(w_1-w_2\) is small, which implies that they are similar, and learning a single coefficient introduces little loss of information.
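Putting the pieces together, the bivariate test can be sketched as follows, assuming standardized features and replacing the unknown quantities \(\sigma^2\), \(w_1\), \(w_2\) of Eq. (19) with plug-in estimates from a preliminary bivariate regression (a heuristic simplification; Appendix C discusses how to state the condition with only theoretical or only empirical quantities).

```python
import numpy as np

def aggregate_pair(x1, x2, y):
    """Empirical version of the bivariate test of Eq. (19): return True if the
    sample correlation of the two (standardized) features exceeds the threshold
    computed with plug-in estimates of the noise variance and of the coefficients."""
    n = len(y)
    X = np.column_stack([x1, x2])
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ w_hat) ** 2) / (n - 2)      # noise variance estimate
    diff = w_hat[0] - w_hat[1]
    # If the estimated coefficients coincide, aggregation is always convenient.
    threshold = -np.inf if diff == 0 else 1.0 - 2.0 * sigma2_hat / ((n - 1) * diff**2)
    rho_hat = np.corrcoef(x1, x2)[0, 1]
    return rho_hat >= threshold, rho_hat, threshold
```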

5 Generalization: three-dimensional and D-dimensional analysis

In the previous section, we focused on aggregating two features in a bivariate setting. In this section, we extend that approach to three features. Starting from the related results, we will then straightforwardly extend them to a general problem with D features, heuristically considering the \(D-2\) remaining features as a unique third contribution. Given the complexity of the computations, we focus on the asymptotic analysis only. After the analysis, we conclude this section with the main algorithm proposed in this paper: Linear Correlated Features Aggregation (LinCFA).

5.1 Three-dimensional case

In the three-dimensional case (\(D=3\)), we consider the relationship between the three features and the target to be linear with Gaussian noise: \(y=w_1x_1+w_2x_2+w_3x_3+\epsilon\), \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). In accordance with the previous analysis, we assume the training dataset \({\textbf {X}}=[{\textbf {x}}_{{\textbf {1}}}\ {\textbf {x}}_{{\textbf {2}}}\ {\textbf {x}}_{{\textbf {3}}}]\) to be known; recalling the zero-mean assumption (\({{\,\mathrm{\mathbb {E}}\,}}[x_1]={{\,\mathrm{\mathbb {E}}\,}}[x_2]={{\,\mathrm{\mathbb {E}}\,}}[x_3]=0\)), it follows that \(\mathbb {E}[y]=w_1\mathbb {E}[x_1]+w_2\mathbb {E}[x_2]+w_3\mathbb {E}[x_3]=0\) and \(\sigma ^2_y = \sigma ^2\).

In this setting, and for the general D-dimensional setting of the next subsection, which will be a direct application of this one, we compare the performance of the bivariate linear regression \(\hat{y}=\hat{w}_ix_i+\hat{w}_jx_j\) of each pair of features \(x_i,x_j\) with the univariate linear regression that considers their average \(\hat{y}=\hat{w}\frac{x_i+x_j}{2}=\hat{w}\bar{x}\), to decide whether it is convenient to aggregate them in terms of MSE. Indeed, extending the dimension from \(D=2\) to a general dimension D and comparing all the possible models where groups of variables are aggregated is combinatorial in the number of features and would be impractical. Comparing the full D-dimensional regression model with the \((D-1)\)-dimensional model where two variables are aggregated is also impractical: when the number of features is huge, in addition to a polynomial computational cost, both models suffer from issues like the curse of dimensionality and the risk of overfitting.

To simplify the exposition, for the theoretical analysis, we will consider \(x_i=x_1,x_j=x_2\). Moreover, in the following subsection we will directly report the asymptotic correlation threshold that guarantees the asymptotic decrease of variance to be greater than the increase of bias due to the aggregation of two features. The specific analysis of variance and bias, together with the related proofs, can be found in Appendix D.

5.1.1 Correlation threshold

The result of the following theorem extends the result of Theorem 6 for the three-dimensional setting.

Theorem 8

In the asymptotic setting, let the relationship between the features and the target be linear with Gaussian noise. Assuming unitary variances of the features \(\sigma _{x_1}=\sigma _{x_2}=\sigma _{x_3}=1\), the decrease of variance is greater than the increase of (squared) bias due to the aggregation of the features \(x_1\) and \(x_2\) with their average if and only if:

$$\begin{aligned}&1-(a-b)-\sqrt{a(a-2b)} \le \rho _{x_1,x_2}\le 1-(a-b)+\sqrt{a(a-2b)},\nonumber \\&\qquad \text {with } {\left\{ \begin{array}{ll} a=\frac{\sigma ^2}{(n-1)(w_1-w_2)^2}\\ b=\frac{(\rho _{x_1,x_3}-\rho _{x_2,x_3})w_3}{(w_1-w_2)}. \end{array}\right. } \end{aligned}$$
(20)

Proof

The result follows after algebraic computations on the difference \(\Delta _{var}^{n\rightarrow \infty } - \Delta _{Bias}^{n\rightarrow \infty } \ge 0,\) where the expression of the asymptotic difference of variances and biases can be respectively found in Remark 18 and Theorem 18 of Appendix D. \(\square\)

Remark 9

Equation (20) holds also in the case of generic variance \(\sigma ^2_{x_3}\) of the feature \(x_3\), with the only difference that b becomes:

$$\begin{aligned} b=\frac{\sigma _{x_3}(\rho _{x_1,x_3}-\rho _{x_2,x_3})w_3}{(w_1-w_2)}. \end{aligned}$$
(21)

Remark 10

The result obtained in this section with three features is more difficult to interpret than the bivariate one. However, if the two features \(x_1\) and \(x_2\) are uncorrelated with the third feature \(x_3\), or they have the same correlation with it (\(\rho _{x_1,x_3}=\rho _{x_2,x_3}\)), then Equation (20) reduces to the condition found in the bivariate asymptotic analysis (Equation (17)).

Remark 11

Since the analysis is asymptotic, the theoretical quantities in Equation (20) can be substituted with their consistent estimators when the number of samples n is large.
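For completeness, a small helper computing the interval of Eq. (20), with b generalized as in Remark 9; following Remark 11, every argument can be a consistent estimate of the corresponding population quantity. The function name and interface are chosen only for this illustration.

```python
import numpy as np

def aggregation_interval_3d(n, sigma2, w1, w2, w3, rho_13, rho_23, sigma_x3=1.0):
    """Interval of correlations rho_{x1,x2} for which aggregating x1 and x2 is
    convenient in the three-dimensional asymptotic analysis (Eq. (20), with b as
    in Remark 9). Returns None when a*(a - 2b) < 0, i.e., the interval is empty."""
    a = sigma2 / ((n - 1) * (w1 - w2) ** 2)
    b = sigma_x3 * (rho_13 - rho_23) * w3 / (w1 - w2)
    disc = a * (a - 2.0 * b)
    if disc < 0:
        return None
    half_width = np.sqrt(disc)
    return 1.0 - (a - b) - half_width, 1.0 - (a - b) + half_width
```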

5.2 D-dimensional case

This last subsection of the analysis shows the generalization from three to D dimensions. In particular, we assume the relationship between the D features \(x_1,...,x_D\) and the target to be linear with Gaussian noise, \(y=w_1x_1+...+w_Dx_D+\epsilon\), with \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). As done throughout the paper, we assume the training dataset \({\textbf {X}}=[{\textbf {x}}_{{\textbf {1}}}\ ...\ {\textbf {x}}_{{\textbf {D}}}]\) to be known and, from the zero-mean assumption, \(\mathbb {E}[y]=0\) and \(\sigma ^2_y = \sigma ^2\).

As discussed for the three-dimensional case, we compare the performance (in terms of bias and variance) of the two-dimensional linear regression \(\hat{y}=\hat{w}_ix_i+\hat{w}_jx_j\) with the one-dimensional linear regression \(\hat{y}=\hat{w}\frac{x_i+x_j}{2}=\hat{w}\bar{x}\) and in the computations we consider \(x_i=x_1,x_j=x_2\) without loss of generality.

Considering the linear combination of the remaining features as a unique variable \(x=w_3x_3+...+w_Dx_D\), we directly extend the three-dimensional analysis of the previous subsection to this general case, considering the model \(y=w_1x_1+w_2x_2+wx+\epsilon\) with \(w=1\). In this way, the D-dimensional linear problem is straightforwardly reformulated as a three-dimensional one. However, this analysis can be seen as a heuristic result, in the sense that we do not fully characterize the relationship between the two features under analysis and each of the remaining ones: the focus is only on the relationship between the two features \(x_1,x_2\) and the linear combination \(x=w_3x_3+...+w_Dx_D\) of the remaining ones.

Recalling that in this case the third feature x has general variance \(\sigma ^2_{x}\), the following lemma holds.

Lemma 2

Let \(y=w_1x_1+...+w_Dx_D+\epsilon =w_1x_1+w_2x_2+wx+\epsilon\) with \(\sigma ^2_{x_1}=\sigma ^2_{x_2}=1\) and \(\sigma ^2_x=\sigma ^2_{w_3x_3+...+w_Dx_D}\). Then, performing linear regression in the asymptotic setting, the decrease of variance is greater than the increase of bias when aggregating the two features \(x_1\) and \(x_2\) with their average if and only if the condition on the correlation of Equation (20) holds (with the parameter b expressed like in Equation (21) as \(b=\frac{\sigma _{x}(\rho _{x_1,x}-\rho _{x_2,x})w}{(w_1-w_2)}\)).

Proof

The lemma follows by applying the three-dimensional analysis with general variance of the third feature \(\sigma ^2_x\) (Theorem 8 and Remark 9). \(\square\)

5.3 D-dimensional algorithm

For the general D-dimensional case, as explained in the previous subsection, the three-dimensional result is extended by considering as third feature the linear combination of the \(D-2\) features not currently considered for the aggregation. A drawback of applying the obtained result in practice is that it requires knowledge of all the coefficients \(w_1,...,w_D\), which is unrealistic, or their approximation through estimates obtained by performing linear regression on the complete D-dimensional dataset. In this case, the computational cost is \(\mathcal {O}(n\cdot D^2 + D^3)\)—which becomes \(\mathcal {O}(n\cdot D^2 + D^{2.37})\) if using the Coppersmith-Winograd algorithm (Coppersmith and Winograd 1990)—and it is impractical with a huge number of features. Therefore, since the three-dimensional asymptotic condition reduces to the bivariate one when the two features have the same correlation with the third (Remark 10), it is reasonable, if they are highly correlated, to assume this to hold and to apply the asymptotic bivariate result of Equation (17) to decide whether the two features should be aggregated. In this way, we iteratively try all combinations of two features, with complexity \(\mathcal {O}(n+D^2)\) in the worst case, in order to choose the groups of features that it is convenient to aggregate with their mean.

Algorithm 1

LinCFA: Linear Correlated Features Aggregation

Algorithm 1 reports the pseudo-code of the proposed algorithm, Linear Correlated Features Aggregation (LinCFA). The proposed dimensionality reduction algorithm creates a partition of the feature indices \(\{1,\dots ,D\}\) into d subsets by iteratively comparing couples of features and adding them to the same subset if their correlation (\(\text {correlation}(x_i,x_j)\)) is greater than the threshold (\(\text {threshold}(x_i,x_j,y)\)) obtained from Eq. (17). Then, it aggregates the features in each set k of the partition (\(\varvec{\mathcal {P}}\)) with their average, producing each output \(\bar{x}_k\).
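The following is a minimal, unoptimized sketch of this procedure (one possible reading of Algorithm 1, written for illustration: it compares each feature with a representative of each existing cluster, uses the empirical threshold of Eq. (19) with plug-in estimates, and assumes standardized continuous features; the official implementation in the linked repository may differ in details).

```python
import numpy as np

def lincfa(X, y, seed=0):
    """Sketch of LinCFA: build a partition of the feature indices by adding each
    feature to an existing cluster when its correlation with the cluster
    representative exceeds the pairwise threshold of Eq. (19), then average
    each cluster. Returns the reduced dataset U and the partition."""
    n, D = X.shape
    clusters = []                                     # list of lists of column indices

    def exceeds_threshold(xi, xj):
        Z = np.column_stack([xi, xj])
        w_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
        sigma2_hat = np.sum((y - Z @ w_hat) ** 2) / (n - 2)
        diff = w_hat[0] - w_hat[1]
        thr = -np.inf if diff == 0 else 1.0 - 2.0 * sigma2_hat / ((n - 1) * diff**2)
        return np.corrcoef(xi, xj)[0, 1] >= thr

    order = np.random.default_rng(seed).permutation(D)   # random shuffle (see Remark 13)
    for i in order:
        for cluster in clusters:
            if exceeds_threshold(X[:, i], X[:, cluster[0]]):
                cluster.append(i)
                break
        else:                                            # no cluster accepted the feature
            clusters.append([i])

    U = np.column_stack([X[:, c].mean(axis=1) for c in clusters])
    return U, clusters
```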

Remark 12

(About theoretical results and the empirical algorithm) The proposed algorithm aggregates sets of features and not only pairs of them, as considered in the theoretical analysis. The motivation behind this choice is to perform a single average of a set of features. A possible variation, which aggregates pairs of features as derived in the theory, is to directly aggregate a pair of features with their mean once they satisfy the theoretical aggregation condition (e.g., at the first iteration we aggregate \(x_1,x_2\) producing \(\bar{x}=\frac{x_1+x_2}{2}\)). Then, considering their mean from there on as a single feature, it would be aggregated with another feature if the condition is satisfied (e.g., at the second iteration we aggregate \(\bar{x},x_3\) producing \(\hat{x}=\frac{\bar{x}+x_3}{2}\)), until no more aggregations are possible. This procedure adheres more closely to the theoretical results; however, it is less interpretable in the sense discussed in Remark 2, since each reduced feature is an iterative mean of means.

Remark 13

(About the ordering of features) The output of the LinCFA algorithm may depend on the ordering of the features. Therefore, in the pseudo-code of Algorithm 1, a random shuffle of the original features is required as input, so that systematic biases due to the ordering are avoided. A greedy approach that removes the dependency of the LinCFA algorithm on the ordering of the features would be to introduce an internal ordering among pairs of features. Considering correlation, for example, one may select the pair of the two most correlated features and test whether they exceed the threshold. If so, they could be added to a cluster and substituted with their mean. Iteratively proceeding in this way, until all features have been assigned to a set of the partition, produces an algorithm that is independent of the initial ordering of the features and aggregates only features that exceed the threshold. However, this increases the memory and computational complexity, since all the correlations between each pair of features have to be computed and stored.

As a further step, among the possible partitions that can be identified depending on the ordering, there is at least one optimal partition of features, which minimizes the mean squared error. Intuitively, with infinite samples, the MSE is minimized by considering each feature independently. This is confirmed by the asymptotic variance analysis, where the dependence on n shows that, with infinite samples, there is no decrease of variance with the aggregation. However, with finite samples, the identification of the optimal partition would be combinatorial, since all the possible partitions should be tested. The proposed algorithm adds one feature at a time to a cluster, therefore it has no guarantees of optimality. This is in line with classical machine learning approaches such as forward feature selection, which iteratively selects a promising feature, although a combination of two other features may be more informative.

6 Numerical validation

In this section, the theoretical results obtained in Sects. 4 and 5 are exploited to perform dimensionality reduction on synthetic datasets of two, three and D dimensions. Furthermore, the proposed dimensionality reduction approach LinCFA is applied to real datasets and compared with state-of-the-art benchmark methods. The regression performance is evaluated in terms of Mean Squared Error (MSE), R-squared (\(R^2\)) and Relative RMSE (RRMSE). Code and datasets can be found at the following link: https://github.com/PaoloBonettiPolimi/PaperLinCFA.

6.1 Two-dimensional application

Table 1 Experiment on synthetic bivariate data for two combinations of weights and three different values of variance of the noise

In the bivariate setting, according to Eqs. (17) and (19), it is convenient to aggregate the two features when the number of samples n is small, when the absolute difference between the coefficients of the linear model \(w_1,w_2\) is small, or when the variance of the noise \(\sigma ^2\) is large. The synthetic experiments (fully described in Appendix E) confirm the theoretical result on data. In particular, they are performed with a fixed number of samples \(n=500\) and a fixed correlation between the features \(\rho _{x_1,x_2}\approx 0.9\), comparing two combinations of weights (with small and large difference) and three different variances of the noise (small, normal, large).

Table 1 shows the results of the experiments (more detailed results can be found in Tables 6 and 7 of Appendix E). In line with the theory, when the weights of the linear model are considerably distant, the threshold is far from 1 and the two features are aggregated only with a huge variance of the noise, while for a reasonably small noise variance they are kept separate. On the other hand, when the weights of the linear model are similar, the threshold of Eq. (17) is small and the conclusion is to aggregate the two features even with a small variance of the noise. The confidence intervals on the \(R^2\) and on the MSE confirm that, when the correlation is above the threshold, the performance of the linear model with the two features aggregated into their average is statistically not worse than that of the bivariate model where they are kept separate. Finally, it is important to notice that knowing the true regression coefficients leads to the same decision (aggregate or not) in all the 500 repetitions of the experiment (row # aggregations (theo)). On the contrary, estimating the coefficients from data leads to the same action in most repetitions, but not all (row # aggregations (emp)), since the limited amount of data introduces noise into the estimates.

6.2 Three-dimensional application

Table 2 Synthetic experiment in the three dimensional setting comparing the full model with three variables with the bivariate model where \(x_1,x_2\) are aggregated with their mean

Equation (20) expresses the interval of correlations for which it is convenient to aggregate the two features \(x_1\) and \(x_2\) in the three-dimensional setting. As in the bivariate case, it is related to the number of samples, the difference between the weights, and the variance of the noise. In addition, it also depends on the difference between the correlations of each of the two features with the third one \(x_3\) and on the weight \(w_3\).

The experiment performed in this setting is based on synthetic data, generated with the following realistic setting: the weights \(w_1=0.4,\ w_2=0.6\) are closer to each other than to \(w_3=0.2\). Moreover, the two features are significantly correlated: \(\rho _{x_1,x_2}\approx 0.88\) (more details can be found in Appendix E).

In this setting, as shown in Table 2, it is convenient to aggregate the two features \(x_1,x_2\) with their average both in terms of MSE and \(R^2\), since the aggregation does not worsen the performance. In particular, the aggregation is already convenient with a small standard deviation of the noise (\(\sigma =0.5\)).

6.3 D-dimensional application

Table 3 Synthetic experiment in the D dimensional setting. The experiment has been repeated twice: considering the theoretical threshold with the exact coefficients (theo) and with coefficients estimated from data (emp)

This subsection introduces the D-dimensional synthetic experiment performed 500 times with \(n=500\) samples and \(D=100\) features, reduced with the proposed algorithm LinCFA (more details can be found in Appendix E).

The test results shown in Table 3 underline that knowing the real values of the coefficients of the linear model would lead to a reduced dataset of \(d=4\) features and a significant increase in performance (\(R^2\) aggregate (theo), MSE aggregate (theo)), while using the empirical coefficients the dimension is reduced to \(d=15\), still with a significant increase in performance both in terms of MSE and \(R^2\) (\(R^2\) aggregate (emp), MSE aggregate (emp)). This is a satisfactory result, and it is confirmed by the real-world application described below.

Fig. 2 a Number of reduced features for different numbers of samples. b, c Regression performance in terms of \(R^2\) and MSE for different numbers of samples. Blue lines refer to linear regression with all the original features, while red and green lines refer to linear regression on the features reduced by the proposed algorithm using theoretical and empirical quantities, respectively (Color figure online)

To better understand the performance of the algorithm, in Fig. 2 we consider the number of selected features and the regression scores. From Fig. 2a it is clear that, with a small number of samples, both considering theoretical and empirical quantities, the number of reduced features d becomes smaller to prevent overfitting. Moreover, considering the empirical quantities, which are the only ones available in practice, leads to a larger number of reduced features (but still significantly smaller than the original dimension D). Figure 2b, c show the performance of the linear regression on the reduced features compared with the full dataset. When the number of samples is significantly larger than the number of features, the performance of the reduced datasets is only slightly better but, when the number of samples is of the same order of magnitude as the number of features, the reduced datasets (both considering empirical and theoretical quantities) significantly outperform the regression over the full dataset. Moreover, the regression performed with the reduced datasets is much more robust, since its score remains stable across different numbers of samples.

Real-world experiments. The main practical contribution of this paper, the LinCFA algorithm, has also been tested on real datasets. In particular, its results are discussed in comparison with the chosen baselines.

Table 4 Experiments on four real datasets. The total number of samples n has been divided into train (66% of data) and test (33% of data) sets
Table 5 Experiments on four additional real datasets. The total number of samples n is divided into train (66% of data) and test (33% of data) sets

Specifically, the LinCFA algorithm has been compared with classical (unsupervised) PCA, Supervised PCA, LLE, LPP, Isomap, and Kernel PCA. Additionally, RReliefF has been considered as a feature selection baseline. The number of components selected for PCA is set to explain \(95\%\) of the variance, while for Supervised PCA, LLE, LPP, Isomap, Kernel PCA, and RReliefF the best result (evaluated from \(d=1\) to \(d=50\) components) has been considered. All the other hyperparameters of the methods have been set to their default values. Linear, polynomial, and sigmoidal kernels have been considered for Kernel and Supervised PCA. The mean squared error (MSE), the relative root mean squared error (RRMSE), and the coefficient of determination (\(R^2\)) of the linear regression, applied to the set of components produced by each algorithm under analysis, have been used as performance measures on the test set. Confidence intervals have been produced by bootstrapping the training and validation sets with five different seeds. Additionally, the performance of linear, Ridge (Hoerl and Kennard 1970), and Lasso (Tibshirani 1996) regression on the full dataset has been considered. To further test the LinCFA algorithm against non-linear regression approaches, support vector regression (Drucker et al. 1996), XGBoost (Chen and Guestrin 2016), and a neural network (Lawrence 1993) have also been applied alongside linear regression, and the related results are available in Appendix E.4. Moreover, Appendix E.4 also reports the results of each baseline when constrained to the same number of reduced features selected by the LinCFA algorithm.
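
As an example of the evaluation protocol for one baseline, the following sketch applies PCA retaining 95% of the variance, fits a linear regression on a 66%/33% train-test split, and computes MSE, RRMSE, and \(R^2\). It uses scikit-learn as an assumed implementation, and the RRMSE normalization shown is one common definition that may differ from the one used in the paper; repeating the procedure with five different seeds provides the variability estimate behind the confidence intervals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_pca_baseline(X, y, seed=0):
    # 66% train / 33% test split, as in Tables 4 and 5
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=seed)
    # Components explaining 95% of the variance (float n_components in scikit-learn)
    pca = PCA(n_components=0.95).fit(X_tr)
    Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)
    y_hat = LinearRegression().fit(Z_tr, y_tr).predict(Z_te)
    mse = mean_squared_error(y_te, y_hat)
    # One common RRMSE definition: RMSE relative to predicting the training mean
    baseline_mse = mean_squared_error(y_te, np.full(len(y_te), np.mean(y_tr)))
    rrmse = np.sqrt(mse / baseline_mse)
    return mse, rrmse, r2_score(y_te, y_hat)

# Repetition over five seeds to obtain variability estimates
# scores = np.array([evaluate_pca_baseline(X, y, seed=s) for s in range(5)])
```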

Eight datasets with different characteristics have been considered.

The first dataset focuses on the prediction of life expectancy from \(D=18\) continuous factors and 1649 samples, and is available on Kaggle. In this case, a reduction of the number of features may be unnecessary, as confirmed by the experimental results, where the full dataset achieves performance similar to the majority of the benchmark methods and to the LinCFA algorithm. This experiment shows that the algorithm does not excessively reduce the dimensionality when it is not necessary.

The second dataset is a financial dataset composed of \(D=75\) continuous features, 1299 samples, and a scalar output, available on Kaggle. The model predicts the cash ratio from other metrics, from which many fundamental indicators can be derived. Given the considerable number of features w.r.t. the number of samples, linear regression on the full dataset provides negative results, while linear regression on the reduced datasets achieves significantly higher scores, and the LinCFA algorithm shows one of the best performances among the methods considered.

Then, the algorithm is tested on two climatological datasets composed of \(D=136\) (with 1038 samples) and \(D=1991\) (with 981 samples) continuous climatological features and a scalar target, which represents the state of vegetation of a basin of the Po river. These datasets have been assembled by the authors by merging different sources for the vegetation index, temperature, and precipitation over different basins (see Didan 2015; Cornes et al. 2018; Zellner and Castelli 2022), and they are available in the repository of this work. On the first climate dataset, considering linear regression, the feature reduction performed by the baselines and by the LinCFA algorithm again leads to a significant improvement w.r.t. the full dataset, which reaches a satisfactory performance only with XGBoost. The second climate dataset significantly benefits from the dimensionality reduction in all cases. In particular, the LinCFA algorithm leads to the highest score in combination with linear regression.

Additionally, we further tested the LinCFA algorithm on four datasets from the UCI repository. In particular, we considered a simple classical dataset with 13 features (and 506 samples), the Boston Housing dataset (Harrison and Rubinfeld 1978), which confirms that, as discussed for the life expectancy dataset, the proposed algorithm does not lose information when the full regression is already able to handle the entire set of features. Then, a more complex dataset related to superconductivity (Hamidieh 2018), with 81 features, provides an example with many samples (21263), showing that the LinCFA algorithm can also be applied in this case.

Additionally, we considered the Cifar-10 dataset (Krizhevsky et al. 2009), transformed into a regression problem by considering each pixel of each of the three color layers as a feature and removing one pixel, which is used as the target. This provides a significant case with 6000 features and 3071 samples, where the LinCFA algorithm achieves the best overall score compared with the full dataset and the considered linear and non-linear methods.

Finally, a Gene Expression dataset (Fiorini 2016), composed of 801 samples and 19133 features, has been considered, where the expression of one gene is the target variable and the expressions of the other available genes are used as features. Similarly to the climate and Cifar-10 datasets, which have many highly correlated features that need to be reduced to gain both interpretability and performance, this dataset allows us to test the LinCFA algorithm on data with a large number of features and a relatively small number of samples. The results show once again that the LinCFA algorithm obtains high scores w.r.t. the other dimensionality reduction methods and the regression on the full dataset.

Tables 4 and 5 show the MSE, RRMSE and \(R^2\) values obtained with linear regression applied to the full dataset, to the dataset reduced by LinCFA, and to the dataset reduced by the best-performing baseline. Additionally, the results related to Ridge and Lasso regression are reported. The extensive results for each dataset with all the baselines and regression methods considered can be found in Appendix E, Tables 9, 10, 11, 12, 13, 14, 15 and 16. Additionally, in the appendix, Tables 17 and 18 report the results obtained by repeating the experiments while imposing on each dimensionality reduction method the same number of reduced features as identified by LinCFA. Finally, an empirical example of computational time is reported in Table 19.

From the results, as already mentioned in the description of the datasets, it is possible to notice that, when the number of features is low, the results of the full regression and of the regression on the dataset reduced by the baselines or by LinCFA are similar. On the other hand, when the algorithms are applied to high-dimensional data, the proposed algorithm always obtains similar or better performance than the other methods. Therefore, the LinCFA algorithm is able to reduce the dimensionality of the input features while improving (or not worsening) the performance of the linear model and preserving the interpretability of the reduced features.

7 Conclusion and future work

This paper presents a dimensionality reduction algorithm for linear settings with the theoretical guarantee of producing a reduced dataset that does not perform worse than the full dataset in terms of MSE, since the decrease of variance is larger than the increase of bias due to the aggregation of features. The main strength of the proposed approach is that it aggregates features through their mean, which reduces the dimension while preserving the interpretability of each feature, something that is not common in traditional dimensionality reduction approaches such as PCA. Moreover, the complexity of the proposed algorithm is lower than that of performing a linear regression on the full original dataset. The main weakness of the proposed method is that all the computations assume the features to be continuous and the relationship between the target and the features to be linear, which is a strong assumption in real-world applications. However, the empirical results show an increase in performance and a significant reduction of dimensionality when applying the proposed algorithm to real-world datasets. Indeed, as detailed in Remark 1, the algorithm is designed relying on the linear theoretical result, but it can be applied to any regression problem with continuous features, where the proposed threshold becomes a heuristic quantity. In this case, the empirical validation of the method on real-world datasets, which have no guarantee of linearity, shows a promising applicability of the proposed method outside linear contexts.

In future work, it may be interesting to relax the linearity assumption in the theoretical analysis, considering the target as a general function of the input features and applying a general machine learning method to the data. Another possible way to enrich the results obtained in this paper is to consider structured data, where prior knowledge of the relationships among features can help identify the most suitable features for aggregation (e.g., in climatological data, features recorded by two adjacent sensors are more likely to be aggregated).