1 Introduction

Dimensionality reduction plays a crucial role in applying Machine Learning (ML) techniques to real-world datasets (Sorzano et al. 2014). Indeed, in a large variety of scenarios, data are high-dimensional with a large number of correlated features. For instance, financial datasets are characterized by time series representing the trend of stocks in the financial market, and climatological datasets include several highly-correlated features that, for example, represent temperature values at different points on the Earth. On the other hand, only a small subset of features is usually significant for learning a specific task, and it should be identified to train a well-performing ML algorithm. In particular, considering many redundant features increases the model complexity, which in turn increases its variance and the risk of overfitting (Hastie et al. 2009). Furthermore, when the number of features is high and comparable with the number of samples, the available data become sparse, leading to poor performance (curse of dimensionality (Bishop and Nasrabadi 2006)). For this reason, dimensionality reduction and feature selection techniques are usually applied. Feature selection (Chandrashekar and Sahin 2014) focuses on choosing a subset of features that are important for learning the target according to a specific criterion (e.g., the features most correlated with the target, or the ones that produce the highest validation score), discarding the others. On the other hand, dimensionality reduction methods (Sorzano et al. 2014) retain all the features by projecting them into a (much) lower-dimensional space, producing new features that are linear or non-linear combinations of the original ones. Compared to feature selection, this latter approach has the advantage of reducing the dimensionality without discarding any feature, exploiting the contribution of all of them in the projections. Moreover, recalling that the variance of an average of random variables is smaller than or equal to the largest of the individual variances, features computed as averages (as in the method proposed here) have variance no larger than the most variable of the original ones. However, the reduced features might be less interpretable, since they are linear combinations of the original ones with different coefficients.

In this paper, we propose a novel dimensionality reduction method that exploits the information of each feature, without discarding any of them, while preserving the interpretability of the resulting feature set. To this end, we aggregate features through their average, and we propose a criterion that aggregates two features when it is beneficial in terms of the bias-variance tradeoff. Specifically, we focus on linear regression, assuming a linear relationship between the features and the target. In this context, the main idea of this work is to identify a group of aggregable features and substitute them with their average. Intuitively, in linear settings, two features should be aggregated if their correlation is large enough. We identify a theoretical threshold on the minimum correlation for which it is profitable to unify the two features. This threshold is the minimum correlation value between two features for which, comparing the two linear regression models before and after the aggregation, the variance decrease is larger than the increase of bias.

We choose the average as the aggregation function to preserve interpretability (the resulting reduced feature is just the average of k original features). Another advantage is that the variance of the average is smaller than the variance of the original features if they are not perfectly correlated. Indeed, assuming that we unify k standardized features, the variance of their average becomes \(var(\bar{X})=\frac{1}{k}+\frac{k-1}{k}\rho\), with \(\rho\) being the average correlation between distinct features (Jacod and Protter 2004). The main restriction of choosing the average as the aggregation function is that we will only consider continuous features, since the mean is not well-defined for categorical features. Moreover, it would be improper to compute the mean of heterogeneous features: interpretability is preserved only if the aggregation is meaningful.

Another issue may arise when features have different units of measurement or scales; for this reason, we will consider standardized variables.
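As a quick numerical sanity check of the variance formula above, the following sketch (an illustration written for this presentation, not part of the original experiments) draws k standardized, equicorrelated Gaussian features and compares the empirical variance of their average with \(\frac{1}{k}+\frac{k-1}{k}\rho\).

```python
import numpy as np

# Illustrative check (not from the paper): variance of the average of k
# standardized features with common pairwise correlation rho.
rng = np.random.default_rng(0)
k, rho, n = 5, 0.8, 200_000

# Equicorrelation covariance matrix: 1 on the diagonal, rho elsewhere.
cov = np.full((k, k), rho) + (1.0 - rho) * np.eye(k)
X = rng.multivariate_normal(mean=np.zeros(k), cov=cov, size=n)

x_bar = X.mean(axis=1)  # aggregated feature (the average)
print(f"empirical var(x_bar):          {x_bar.var():.4f}")
print(f"theoretical 1/k + (k-1)/k*rho: {1/k + (k-1)/k*rho:.4f}")
```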

Remark 1

(About the linearity assumption and non-linear cases.) The theoretical analysis that lays the foundations of the proposed algorithm is limited to linear assumptions and considers linear regression as the ML method. However, the proposed algorithm remains relevant beyond this setting. Indeed, the theoretical analysis proves that the proposed algorithm is theoretically sound under the linearity assumption. The algorithm is then designed relying on this linear theoretical result, but it can be applied to any regression problem with continuous features, where the proposed threshold becomes a heuristic quantity. While the theoretical guarantees no longer hold, this claim is supported by the empirical validation of the method on real-world datasets, which have no guarantee of linearity but show a promising applicability of the proposed method outside linear contexts. Additionally, as commonly done in linear regression, it is possible to consider non-linear transformations of the original features as inputs of the LinCFA algorithm to relax the linearity assumption in specific contexts.

Remark 2

(About interpretability.) Complex (linear and non-linear) transformations of the original features are usually performed by dimensionality reduction methods. As an example, PCA performs a linear combination of potentially all the original features, each with a different weight. This kind of aggregation is already described in Kovalerchuk et al. (2021) as not completely interpretable, since they classify this kind of transformation as quasi-explainable. In this context, the LinCFA algorithm is interpretable, since it only relies on taking the mean of several features, which is a transformation that a domain expert can understand without any additional explanation by ML experts. Lahav et al. (2018) define interpretability as “the extent to which a ML model can be made understandable to relevant human users, with the goal of increasing users’ trust in, and willingness to utilize, the model in practice”, and Kovalerchuk et al. (2021) define it in these terms: “the model is explainable if it is presented only in the domain terms (e.g., medicine) without terms that have no meaning in the domain”. The mean of variables known to the domain expert can therefore be considered interpretable in these terms. Additionally, the interpretability of the proposed method is particularly clear when the features share the same unit of measure. In this case, the reduced features are simply the average of a set of measurements of the same quantity at different locations, with different sensors, or at different time frames.

Fig. 1

Figure 1a shows the location of temperature measurements for each of the ten sub-basins of the Po River. Each color identifies the set of locations belonging to the same sub-basin. Figure 1b shows each of the clusters identified by the LinCFA algorithm. Each color represents a different set of locations where the algorithm performs an aggregation with the mean (Color figure online)

Additionally, depending on the application considered, the reduction performed with the mean can be particularly meaningful for domain experts. An example with meteorological measurements highlights the main applicative motivation behind the proposed approach. Indeed, a standard preprocessing approach often adopted in ML-based works for Earth science applications consists of computing the mean of a set of neighbouring measurements of the same physical quantity (e.g., temperature measurements at different locations). This method leads to the extraction of features that are average values of quantities over a region. As an example, Fig. 1a shows temperature gridded data related to ten different sub-basins of the Po River. In particular, each colored point in the figure represents one location where temperature measurements are available. Therefore, each point can be seen as the location of a feature that represents the temperature at those specific coordinates. To reduce the dimensionality, one may average the measurements of all the points within a sub-basin, following the geographical location of the data (in Fig. 1a each color identifies the set of locations belonging to a specific sub-basin). However, this approach has no guarantee on the ML performance and does not take into account the relationships within the data. The LinCFA algorithm, on the other hand, focuses on the relationships between pairs of points (i.e., features) and their relationship with the target to decide whether to aggregate them. Figure 1b shows the aggregations performed by the LinCFA algorithm: the dots with the same color correspond to the locations of the temperature features that the algorithm aggregates with their mean. Therefore, from the figure it is possible to conclude that, in this case, the data-driven approach aggregates the points differently from the geographical boundaries of the sub-basins. This preserves interpretability, since it aggregates measurements at different locations in the same way a domain expert does, with the advantage of being a data-driven, theoretically motivated approach.

Outline: The paper is structured as follows. In Sect. 2, we formally define the problem and provide a brief overview of the main dimensionality reduction methods. Section 3 introduces the methodology that will be followed throughout the paper. In Sect. 4, the main theoretical result is presented for the bivariate setting, which is then generalized to D dimensions in Sect. 5. Finally, in Sect. 6, the proposed algorithm, Linear Correlated Features Aggregation (LinCFA), is applied to synthetic and real-world datasets to experimentally validate the results, before the conclusions of Sect. 7. The paper is accompanied by supplementary material. Specifically, Appendix A contains the proofs and technical results of the bivariate case that are not reported in the main paper, Appendix B presents an additional finite-samples bivariate analysis, Appendix C reformulates the bivariate results so that they involve only theoretical or only empirical quantities, Appendix D contains the proofs and technical results of the three-dimensional setting, and Appendix E presents the experiments in more detail.

2 Preliminaries

In this section, we introduce the notation and assumptions employed in the paper (Sect. 2.1) and we survey the main related works (Sect. 2.2).

2.1 Notation and assumptions

Let (X, Y) be random variables with joint probability distribution \(P_{X,Y}\), where \(X\in \mathbb {R}^D\) is the D-dimensional vector of features and \(Y\in \mathbb {R}\) is the scalar target of a supervised learning regression problem. Given N data sampled from the distribution \(P_{X,Y}\), we denote the corresponding feature matrix as \({\textbf {X}}\in \mathbb {R}^{N\times D}\) and the target vector as \({\textbf {Y}}\in \mathbb {R}^{N}\). Each element of the random vector X is denoted with \(x_i\) and is called a feature of the ML problem. We denote as y the scalar target random variable and with \(\sigma ^2_y\) and \(\hat{\sigma }^2_y\) its variance and sample variance. For each pair of random variables (a, b), we denote with \(\sigma ^2_{a}\), cov(a, b) and \(\rho _{a,b}\) respectively the variance of the random variable a and its covariance and correlation with the random variable b. Their estimators are \(\hat{\sigma }^2_{a}\), \(\hat{cov}(a,b)\) and \(\hat{\rho }_{a,b}\). Finally, the expected value and the variance operators applied on a function f(a) of a random variable a w.r.t. its distribution are denoted with \(\mathbb {E}_a[f(a)]\) and \(var_a(f(a))\).

A dimensionality reduction method can be seen as a function \(\varvec{\phi }: \mathbb {R}^{N \times D} \rightarrow \mathbb {R}^{N \times d}\), mapping the original feature matrix \({\textbf {X}}\) with dimensionality D into a reduced dataset \({\textbf {U}} = \varvec{\phi }({\textbf {X}}) \in \mathbb {R}^{N\times d}\) with \(d<D\). The goal of this projection is to reduce the (possibly huge) dimensionality of the original dataset while keeping as much information as possible in the reduced dataset. This is usually done by preserving a distance (e.g., Euclidean, geodesic) or the probability that a point has the same neighbours after the projection (Zaki and Meira 2014).

In this paper, we assume a linear dependence between the features X and the target Y, i.e., \(Y = w^{T} X + \epsilon\), where \(\epsilon\) is a zero-mean noise, independent of X, and \(w \in \mathbb {R}^{D}\) is the weight vector. Without loss of generality, the expected value of each feature is assumed to be zero, i.e., \(\mathbb {E}[x_i]=\mathbb {E}[Y]=0\ \forall i\in \{1,\dots ,D\}\). Finally, we consider linear regression as the ML method: the i-th estimated coefficient is denoted with \(\hat{w}_i\), the estimated noise with \(\hat{\epsilon }\) and the predicted (scalar) target with \(\hat{y}\).

2.2 Existing methods

This section briefly surveys dimensionality reduction algorithms available in the literature, presenting unsupervised and supervised approaches. More extensive reviews can be found in (Sorzano et al. 2014; Cunningham and Ghahramani 2015; Espadoto et al. 2021; Chao et al. 2019). The algorithm presented in this paper can be considered a linear supervised dimensionality reduction approach, therefore the focus will be on this topic. However, feature selection also provides a set of reduced features, as discussed in Sect. 1 (the interested reader may refer to literature reviews such as (Li et al. 2017)). Therefore, the RReliefF algorithm (Robnik-Sikonja et al. 1997; Kononenko et al. 1997) will also be considered in the empirical evaluation as a supervised feature selection approach.

2.2.1 Unsupervised dimensionality reduction

Classical dimensionality reduction methods can be considered unsupervised learning techniques which, in general, do not take the target into account, but focus on projecting the dataset \({\textbf {X}}\) while minimizing a given loss.

The most popular unsupervised linear dimensionality reduction technique is Principal Components Analysis (PCA) (Pearson 1901; Hotelling 1933), a linear method that embeds the data into a linear subspace of dimension d while preserving as much as possible of the variance of the original dataset. One of the main difficulties of applying PCA to real problems is that it performs linear combinations of possibly all the D features, usually with different coefficients, losing the interpretability of each principal component and suffering from the curse of dimensionality. To overcome this issue, there exist some variants like svPCA (Ulfarsson and Solo 2011), which forces most of the weights of the projection to be zero. This contrasts with the approach proposed in this paper, which aims to preserve interpretability while exploiting the information yielded by each feature.

There exist several variants that overcome different issues of PCA (e.g., out-of-sample generalization, linearity, sensitivity to outliers) and other methods that approach the problem from a different perspective (e.g., the generative approach of Factor Analysis, the independence-based approach of Independent Component Analysis, matrix factorization with SVD); an extensive overview can be found in (Sorzano et al. 2014). A broader overview of linear dimensionality reduction techniques can be found in (Cunningham and Ghahramani 2015). Specifically, SVD (Golub and Reinsch 1970) leads to the same result as PCA from an algebraic perspective through matrix decomposition. Factor Analysis (Thurstone 1931) assumes that the features are generated from a smaller set of latent variables, called factors, and tries to identify them by looking at the covariance matrix. Both PCA and Factor Analysis can reduce, through rotations, the number of features that are combined in each reduced component to improve interpretability, but the coefficients can still be different and hard to interpret. Finally, Independent Component Analysis (Hyvärinen 1999) is an information-theoretic approach that looks for independent components (not only uncorrelated, as in PCA) that are not constrained to be orthogonal. This method is more focused on separating different signals mixed among the features than on reducing their dimensionality, which can be done as a subsequent feature selection step, simplified by the fact that the new components are independent.

In contrast with the linear nature of PCA, many non-linear approaches exist (see (Van Der Maaten et al. 2009; Espadoto et al. 2021) for a broader discussion), following the idea that the data can be projected onto non-linear manifolds. Some of them optimize a convex objective function (usually solvable through a generalized eigenproblem) trying to preserve either the global similarity of data (e.g., Isomap (Tenenbaum et al. 2000), Kernel PCA (Shawe-Taylor and Cristianini 2004), Kernel Entropy Component Analysis (Jenssen 2009), MVU (Weinberger et al. 2004), Diffusion Maps (Lafon and Lee 2006)) or the local similarity of data (LLE (Roweis and Saul 2000), Laplacian Eigenmaps (Belkin and Niyogi 2001), LTSA (Zhang and Zha 2004), LPP (He and Niyogi 2003)). Other methods optimize a non-convex objective function, with the purpose of rescaling the Euclidean distance (Sammon Mapping (Sammon 1969)), introducing more complex structures like neural networks (Multilayer Autoencoders (Hinton and Salakhutdinov 2006)), or aligning mixtures of models (LLC (Teh and Roweis 2002)).

In this paper we assume linearity; therefore, in the experimental section we will compare the proposed method with classical PCA and its supervised version, since PCA is one of the most widely applied linear unsupervised dimensionality reduction techniques in ML applications. Non-linear techniques for dimensionality reduction (Kernel PCA, Isomap, LLE, LPP) will also be considered to further test the behavior of the LinCFA algorithm on real data, where linearity is not guaranteed, together with the RReliefF algorithm as a non-linear supervised feature selection approach.
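For reference, most of the unsupervised baselines listed above are available in scikit-learn; the snippet below is only an illustrative instantiation (the number of components and the kernel are placeholders, not the settings of Sect. 6), while LPP, Supervised PCA and RReliefF require separate implementations.

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

d = 10  # illustrative number of reduced components
baselines = {
    "PCA": PCA(n_components=d),
    "KernelPCA (poly)": KernelPCA(n_components=d, kernel="poly"),
    "Isomap": Isomap(n_components=d),
    "LLE": LocallyLinearEmbedding(n_components=d),
}
# Each method exposes fit_transform(X), returning an N x d reduced dataset
# that can then be fed to a downstream regressor.
```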

2.2.2 Supervised dimensionality reduction

Supervised dimensionality reduction is a less-known but powerful approach when the main goal is to perform classification or regression rather than learn a data projection into a lower-dimensional space. The methods in this subfield are usually based on classical unsupervised dimensionality reduction, with the regression or classification loss added in the optimization phase. In this way, the reduced dataset \({\textbf {U}}\) is the specific projection that maximizes the performance on the considered supervised problem. This is usually done in classification settings by minimizing the distance within the same class and maximizing the distance between different classes, in the same fashion as Linear Discriminant Analysis (Fisher 1936). The other possible approach is to directly integrate the loss function for classification or regression. Following the taxonomy presented in (Chao et al. 2019), these supervised approaches can be divided into PCA-based, NMF-based (mostly linear), and manifold-based (mostly non-linear).

A well-known PCA-based algorithm is Supervised PCA. The most straightforward approach of this kind was proposed in (Bair et al. 2006): a heuristic that applies classical PCA only to the subset of features most related to the target. A more advanced approach can be found in (Barshan et al. 2011), where the original dataset is orthogonally projected onto a space where the features are uncorrelated, while simultaneously maximizing the dependency between the reduced dataset and the target by exploiting the Hilbert-Schmidt independence criterion. The goal of Supervised PCA is similar to that of the algorithm proposed in this paper. The main difference is that we are not looking for an orthogonal projection, but we aggregate features by computing their means (thus, two projected features can be correlated) to preserve interpretability. Many variants of Supervised PCA exist, e.g., to obtain non-linear projections or to handle missing values (Yu et al. 2006). Since it is defined in the same (linear) context and has the same final purpose (minimizing the mean squared regression error), Supervised PCA will be compared with the approach proposed in this paper in the experimental section. NMF-based algorithms (Jing et al. 2012; Lu et al. 2016) have better interpretability than PCA-based ones, but they rely on the non-negativity of the features, which is not a general property of linear problems. Manifold-based methods (Ribeiro et al. 2008; Zhang et al. 2018; Zhang 2009; Raducanu and Dornaika 2012), on the other hand, perform non-linear projections with higher computational costs. Therefore, neither family of techniques will be considered in this linear context.

3 Proposed methodology

In this section, we introduce the proposed dimensionality reduction algorithm, named Linear Correlated Features Aggregation (LinCFA), from a general perspective. The approach is based on the following simple idea. Starting from the features \(x_i\) of the D-dimensional vector X, we build the aggregated features \(u_k\) of the d-dimensional vector U. The dimensionality reduction function \(\varvec{\phi }\) is fully determined by a partition \(\varvec{\mathcal {P}}=\{\mathcal {P}_1,\dots ,\mathcal {P}_d\}\) of the set of features \(\{x_1,\dots ,x_D\}\). In particular, each feature \(x_i\) is assigned to a set \(\mathcal {P}_k\in \varvec{\mathcal {P}}\) and each feature \(u_k\) is computed as the average of the features in the k-th set of \(\varvec{\mathcal {P}}\):

$$\begin{aligned} u_k = \frac{1}{|\mathcal {P}_k|} \sum _{i \in \mathcal {P}_k} x_i. \end{aligned}$$
(1)
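As an illustration of Eq. (1), the following sketch (with a hypothetical partition chosen only for the example) computes the reduced dataset U from a partition of the feature indices.

```python
import numpy as np

def aggregate_by_partition(X, partition):
    """Map the N x D matrix X to the N x d matrix U of Eq. (1): each reduced
    feature is the average of the original features in one set of the partition.
    `partition` is a list of lists of column indices."""
    return np.column_stack([X[:, idx].mean(axis=1) for idx in partition])

# Hypothetical example: D = 5 features reduced to d = 2 averages.
X = np.random.default_rng(0).normal(size=(100, 5))
U = aggregate_by_partition(X, partition=[[0, 1, 2], [3, 4]])
print(U.shape)  # (100, 2)
```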

In the following sections, we will focus on finding theoretical guarantees to determine how to build the partition \(\varvec{\mathcal {P}}\). Intuitively, two features will belong to the same element of the partition \(\varvec{\mathcal {P}}\) if their correlation is larger than a threshold. This threshold is formalized as the minimum correlation for which the Mean Squared Error (MSE) of the regression with a single aggregated feature (i.e., the average) is not worse than the MSE with the two separate features.Footnote 1 In particular, it is possible to decompose the MSE as follows (bias-variance decomposition (Hastie et al. 2009)):

$$\begin{aligned}&\underbrace{\mathbb {E}_{x,y,\mathcal {T}}[(h_\mathcal {T}(x)-y)^2]}_{\text {MSE}} = \underbrace{\mathbb {E}_{x,\mathcal {T}}[(h_\mathcal {T}(x)-\bar{h}(x))^2]}_{\text {variance}} \nonumber \\&\quad +\underbrace{\mathbb {E}_{x}[(\bar{h}(x)-\bar{y}(x))^2]}_{\text {bias}} +\underbrace{\mathbb {E}_{x,y}[(\bar{y}(x)-y)^2]}_{\text {noise}}, \end{aligned}$$
(2)

where x, y are the features and the target of a test sample, \(\mathcal {T}\) is the training set, \(h_\mathcal {T}(\cdot )\) is the ML model trained on dataset \(\mathcal {T}\), \(\bar{h}(\cdot )\) is its expected value w.r.t. the training set \(\mathcal {T}\) and \(\bar{y}(x)\) is the expected value of the test output target y given the input features x. Decreasing the model complexity leads to a decrease in variance and an increase in bias. Therefore, in the analysis, we will compare these two variations and identify a threshold as the minimum value of correlation for which, after the aggregation, the decrease of variance is greater than or equal to the increase of bias, so that the MSE after the aggregation is not larger than the original one.

4 Two-dimensional analysis

This section introduces the theoretical analysis, performed in the bivariate setting, that identifies the minimum value of the correlation between two features for which it is convenient to aggregate them with their mean. In particular, Sect. 4.1 introduces the assumptions under which the analysis is performed. Section 4.2 computes the decrease in variance obtained by the aggregation. Then, Sect. 4.3 evaluates the increase in bias due to the aggregation. Finally, Sect. 4.4 combines the two results, identifying the minimum correlation for which it is profitable to aggregate the two features. In addition, Appendix A contains the proofs and technical results that are not reported in the main paper, Appendix B includes an additional finite-sample analysis, and Appendix C computes confidence intervals that allow stating the results with only theoretical or only empirical quantities.

4.1 Setting

In the two-dimensional case (\(D=2\)), we consider the relationship between the two features \(x_1\), \(x_2\) and the target y to be linear and affected by Gaussian noise: \(y=w_1x_1+w_2x_2+\epsilon\), with \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). As usually done in linear regression (Johnson and Wichern 2007), we assume the training dataset \({\textbf {X}}\) to be known. Moreover, recalling the zero-mean assumption (\({{\,\mathrm{\mathbb {E}}\,}}[x_1]={{\,\mathrm{\mathbb {E}}\,}}[x_2]=0\)), it follows \(\mathbb {E}[y]=w_1\mathbb {E}[x_1]+w_2\mathbb {E}[x_2]=0\) and \(\sigma ^2_y = \sigma ^2\).

We compare the performance (in terms of bias and variance) of the two-dimensional linear regression \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\) with the one-dimensional linear regression, which takes as input the average between the two features \(\hat{y}=\hat{w}\frac{x_1+x_2}{2}=\hat{w}\bar{x}\). As a result of this analysis, we will define conditions under which aggregating features \(x_1\) and \(x_2\) in the feature \(\bar{x}\) is convenient.
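The comparison can be sketched numerically as follows; the parameter values are illustrative (similar coefficients, highly correlated standardized features, small n), a regime where the aggregation is expected to pay off, and the test targets are kept noiseless so that only bias and variance are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, rho = 50, 1.0, 0.9
w = np.array([0.45, 0.55])                 # similar coefficients (illustrative)

cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=n)        # fixed design
x_bar = X.mean(axis=1, keepdims=True)                        # aggregated feature

X_test = rng.multivariate_normal(np.zeros(2), cov, size=100_000)
y_test = X_test @ w                                          # noiseless test targets

mse_2d, mse_1d = [], []
for _ in range(2_000):                     # resample the noise, X kept fixed
    y = X @ w + rng.normal(scale=sigma, size=n)
    w_hat_2d = np.linalg.lstsq(X, y, rcond=None)[0]
    w_hat_1d = np.linalg.lstsq(x_bar, y, rcond=None)[0][0]
    mse_2d.append(np.mean((X_test @ w_hat_2d - y_test) ** 2))
    mse_1d.append(np.mean((X_test.mean(axis=1) * w_hat_1d - y_test) ** 2))

print(f"avg test MSE, two separate features: {np.mean(mse_2d):.4f}")
print(f"avg test MSE, aggregated feature   : {np.mean(mse_1d):.4f}")
```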

4.2 Variance analysis

In this subsection, we compare the variance of the two models with both an asymptotic and a finite-samples analysis. Since the two-dimensional model estimates two coefficients, it is expected to have a larger variance; conversely, aggregating the two features reduces the variance of the model.

4.2.1 Variance of the estimators

A quantity necessary to compute the variance of the models compared throughout this subsection is the covariance matrix, w.r.t. the training set, of the vector \(\hat{w}\) of estimated regression coefficients. Given the training features \({\textbf {X}}\), a known result for a general linear problem with n samples and D features (Johnson and Wichern 2007) (see Appendix A for the computations) is:

$$\begin{aligned} var_{\mathcal {T}}(\hat{w}|{\textbf {X}}) = ({\textbf {X}}^T{\textbf {X}})^{-1}\sigma ^2. \end{aligned}$$
(3)
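A quick Monte Carlo check of Eq. (3) with a fixed design (written for illustration; the dimensions and noise level are arbitrary): resampling only the noise and refitting, the sample covariance of the OLS estimates should approach \(({\textbf {X}}^T{\textbf {X}})^{-1}\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 200, 2, 0.5
X = rng.normal(size=(n, D))                 # fixed (known) design matrix
w_true = np.array([0.4, 0.6])

theoretical_cov = np.linalg.inv(X.T @ X) * sigma**2          # Eq. (3)

w_hats = np.empty((20_000, D))
for t in range(20_000):                     # resample the noise, X kept fixed
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    w_hats[t] = np.linalg.lstsq(X, y, rcond=None)[0]
empirical_cov = np.cov(w_hats, rowvar=False)

print(np.round(theoretical_cov, 5))
print(np.round(empirical_cov, 5))
```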

The following lemma shows the variance of the weights for the two specific models that we are comparing.

Lemma 1

Let the real model be linear with respect to the features \(x_1\) and \(x_2\) (\(y=w_1x_1+w_2x_2+\epsilon\)). In the one-dimensional case \(\hat{y}=\hat{w}\bar{x}\), we have:

$$\begin{aligned} var_{\mathcal {T}}(\hat{w}|{\textbf {X}}) = \frac{\sigma ^2}{(n-1)\hat{\sigma }^2_{\bar{x}}}. \end{aligned}$$
(4)

In the two-dimensional case \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\), we have:

$$\begin{aligned} var_{\mathcal {T}}(\hat{w}|{\textbf {X}})&= \frac{\sigma ^2}{(n-1)(\hat{\sigma }^2_{x_1}\hat{\sigma }^2_{x_2} - \hat{cov}(x_1,x_2)^2)} \nonumber \\&\quad \times \begin{bmatrix} \hat{\sigma }^2_{x_2} & -\hat{cov}(x_1,x_2) \\ -\hat{cov}(x_1,x_2) & \hat{\sigma }^2_{x_1} \end{bmatrix}. \end{aligned}$$
(5)

Proof

The proof of the two results follows from Equation (3), see Appendix A for the computations. \(\square\)

4.2.2 Variance of the model

Recalling the general definition of variance of the model from Equation (2), in the specific case of linear regression it becomes:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(h_{\mathcal {T}}(x)-\bar{h}(x))^2] = {{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(\hat{w}^T x-{{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {T}}[\hat{w}^Tx])^2]. \end{aligned}$$
(6)

The following result shows the variance of the two specific models (univariate and bivariate) considered in this section.

Theorem 1

Let the real model be linear with respect to the two features \(x_1\) and \(x_2\) (\(y=w_1x_1+w_2x_2+\epsilon\)). Then, in the one-dimensional case \(\hat{y}=\hat{w}\frac{x_1+x_2}{2}=\hat{w}\bar{x}\), we have:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(h_{\mathcal {T}}(x)-\bar{h}(x))^2|{\textbf {X}}] = \sigma _{x_1+x_2}^2\frac{\sigma ^2}{(n-1)\hat{\sigma }^2_{x_1+x_2}}. \end{aligned}$$
(7)

In the two-dimensional case \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\), we have:

$$\begin{aligned}&{{\,\mathrm{\mathbb {E}}\,}}_{x,{\mathcal {T}}}[(h_{\mathcal {T}}(x)-\bar{h}(x))^2|{\textbf {X}}] \nonumber \\&\quad = \frac{\sigma ^2(\sigma ^2_{x_1}\hat{\sigma }^2_{x_2}+\sigma ^2_{x_2}\hat{\sigma }^2_{x_1}-2cov(x_1,x_2)\hat{cov}(x_1,x_2))}{(n-1)(\hat{\sigma }^2_{x_1}\hat{\sigma }^2_{x_2} - \hat{cov}(x_1,x_2)^2)}. \end{aligned}$$
(8)

Proof

The proof combines the results of Lemma 1 with the definition of variance for a linear model given in Equation (6). The detailed proof can be found in Appendix A. \(\square\)

4.2.3 Comparisons

In this subsection, we compute the difference between the variance of the linear regression with the two features \(x_1\) and \(x_2\) and the variance of the linear regression with the single feature \(\bar{x}=\frac{x_1+x_2}{2}\). We will prove that, as expected, this difference is non-negative, and it represents the reduction of variance obtained when substituting a two-dimensional random vector with the average of its components.

First, the asymptotic analysis is performed, obtaining a result that can be applied with good approximation when a large number of samples n is available. Then, the analysis is repeated in the finite-samples setting, with an additional assumption on the variance and sample variance of the features \(x_1\) and \(x_2\) that simplifies the computations.Footnote 2

Case I: asymptotic analysis. The estimators that we are considering are consistent, i.e., they converge in probability to the true values of the parameters (e.g., \(\text {plim}_{n\rightarrow \infty }\hat{\sigma }^2_{x_1}=\sigma ^2_{x_1}\)). Therefore, the following result can be proved.

Theorem 2

Let \(\Delta _{var}^{n\rightarrow \infty }\) be the difference between the variance of the two-dimensional and that of the one-dimensional linear model when the number of samples n tends to infinity. It is equal to:

$$\begin{aligned} \Delta _{var}^{n\rightarrow \infty }=\frac{\sigma ^2}{n-1} \ge 0, \end{aligned}$$
(9)

which is a non-negative quantity that tends to zero as the number of samples tends to infinity.

Proof

The result follows from the difference between Eqs. 8 and 7, exploiting the consistency of the estimators. \(\square\)

Case II: finite-samples analysis with equal variance and sample variance. For the finite-samples analysis, we add the following assumption to simplify the computations:

$$\begin{aligned} {\left\{ \begin{array}{ll} \sigma _{x_1}=\sigma _{x_2}=:\sigma _x \\ \hat{\sigma }_{x_1}=\hat{\sigma }_{x_2}=:\hat{\sigma }_x. \end{array}\right. } \end{aligned}$$
(10)

Theorem 3

If the conditions of Equation (10) hold, let \(\Delta _{var}\) be the difference between the variance of the two-dimensional and that of the one-dimensional linear model. It is always non-negative and equal to:

$$\begin{aligned} \Delta _{var}=\frac{\sigma ^2}{n-1}\cdot \frac{\sigma ^2_x(1-\rho _{x_1,x_2})}{\hat{\sigma }^2_x(1-\hat{\rho }_{x_1,x_2})}. \end{aligned}$$
(11)

Proof

The proof starts again from the variances of the two models found in Theorem 1 and it performs algebraic computations exploiting the assumption stated in Equation (10). All the steps can be found in Appendix A. \(\square\)

Remark 3

When the number of samples n tends to infinity, the result of Equation (11) reduces to the asymptotic case, as in Equation (9).

Remark 4

The quantities found in Theorems 2 and 3 are always non-negative, meaning that the variance of the two-dimensional model is always greater than or equal to that of the corresponding one-dimensional version, as expected.

4.3 Bias analysis

In this subsection, we compare the (squared) bias of the two models under examination with both an asymptotic and a finite-samples analysis, as done in the previous subsection for the variance. Since the two-dimensional model corresponds to a larger hypothesis space, it is expected to have a lower bias than the one-dimensional one.

The procedure to derive the difference between the biases is similar to the one followed for the variance. The first step is to compute the expected value w.r.t. the training set \(\mathcal {T}\) of the vector \(\hat{w}\) of the regression coefficient estimates, given the training features \({\textbf {X}}\). This is used to compute the bias of the models. In particular, in Equation (2), we defined the (squared) bias as follows:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_x[(\bar{h}(x)-\bar{y}(x))^2]= {{\,\mathrm{\mathbb {E}}\,}}_x[({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {T}}[h_{\mathcal {T}}(x)]-{{\,\mathrm{\mathbb {E}}\,}}_{y|x}[y])^2]. \end{aligned}$$
(12)

Starting from this definition, the bias of the one-dimensional case \(\hat{y}=\hat{w}\bar{x}\) is computed. Moreover, for the two-dimensional case \(\hat{y}=\hat{w}_1x_1+\hat{w}_2x_2\), the model is clearly unbiased. Detailed computations can be found in Appendix A.

After the derivation of the bias of the models, the same asymptotic and finite-samples analysis performed on the variance is repeated in this section for the (squared) bias. Since the two-dimensional model is unbiased, we can conclude that the increase of the bias component of the loss, when the two features are substituted by their mean, is equal to the bias of the one-dimensional model.

Case I: asymptotic analysis. When the number of samples n of the training dataset \(\mathcal {T}\) approaches infinity, recalling that the considered estimators converge in probability to the true values of the parameters, the following result holds.

Theorem 4

Let \(\Delta _{bias}^{n\rightarrow \infty }\) be the difference between the bias of the one-dimensional and that of the two-dimensional model when the number of samples n tends to infinity. It is equal to:

$$\begin{aligned} \Delta _{bias}^{n\rightarrow \infty }&=\frac{\sigma ^2_{x_1}\sigma ^2_{x_2}(1-\rho _{x_1,x_2}^2)(w_1-w_2)^2}{\sigma ^2_{x_1+x_2}}\end{aligned}$$
(13)
$$\begin{aligned}&=\frac{(1-\rho _{x_1,x_2})(w_1-w_2)^2}{2}, \end{aligned}$$
(14)

where the second equality holds if \(\sigma _{x_1}=\sigma _{x_2}=1\).

Proof

The proof starts from the bias of the two models computed in Appendix A and exploits the fact that, in the limit \(n \rightarrow \infty\), it is possible to substitute every sample estimator with the true value of the corresponding parameter, because the estimators are consistent. Details can be found in Appendix A. \(\square\)

Case II: finite-samples analysis with equal variance and sample variance. In the finite-samples case, we provide the same analysis performed for the variance, i.e., under the assumptions of Equation (10).

Theorem 5

If the conditions of Equation (10) hold, let \(\Delta _{bias}\) be the difference between the (squared) bias of the one-dimensional and that of the two-dimensional linear model. Then it is equal to:

$$\begin{aligned} \Delta _{bias}=\frac{\sigma ^2_x(1-\rho _{x_1,x_2})(w_1-w_2)^2}{2}. \end{aligned}$$
(15)

Proof

The proof starts from the bias of the two models and performs algebraic computations exploiting the assumptions of Equation (10). All the steps can be found in Appendix A. \(\square\)

Remark 5

When the number of samples n tends to infinity, the result in Equation (15) reduces to the asymptotic case as in Theorem 4.

Remark 6

Some observations are in order:

  • As expected, the quantities found in Theorems 4 and 5 are always non-negative, since the hypothesis space of the univariate model is a subset of that of the bivariate model.

  • We observe that \(\Delta _{bias}=0\) if \(\rho _{x_1,x_2}=1\). Indeed, when the two variables are perfectly (positively) correlated, their coefficients in the linear regression are equal; therefore, there is no loss of information in their aggregation.

  • Finally, when the two regression coefficients are equal (\(w_1=w_2\)), there is no increase of bias due to the aggregation, since it is enough to learn a single coefficient \(\bar{w}\) to achieve the same performance as the bivariate model.

4.4 Correlation threshold

This subsection concludes the analysis with two features by comparing the reduction of variance with the increase of bias when aggregating the two features \(x_1\) and \(x_2\) with their average \(\bar{x}=\frac{x_1+x_2}{2}\). In conclusion, the result shows when it is convenient to aggregate the two features with their mean, in terms of mean squared error.

Considering the asymptotic case, the following theorem compares bias and variance of the models.

Theorem 6

When the number of samples n tends to infinity and the relationship between the features and the target is linear with Gaussian noise, the decrease of variance is greater than the increase of (squared) bias when the two features \(x_1\) and \(x_2\) are aggregated with their average if and only if:

$$\begin{aligned} \rho ^2_{x_1,x_2} \ge 1-\frac{\sigma ^2\sigma ^2_{x_1+x_2}}{(n-1)\sigma ^2_{x_1}\sigma ^2_{x_2}(w_1-w_2)^2}, \end{aligned}$$
(16)

which, for \(\sigma _{x_1}=\sigma _{x_2}=1\), becomes:

$$\begin{aligned} \rho _{x_1,x_2} \ge 1-\frac{2\sigma ^2}{(n-1)(w_1-w_2)^2}. \end{aligned}$$
(17)

Proof

Computing the difference between Eqs. (9) and (13) the result follows. \(\square\)

In the finite-samples setting, with the additional assumptions of Eq. (10), the following theorem shows the result of the comparison between bias and variance of the two models.

Theorem 7

Let the variance and sample variance of the features \(x_1\) and \(x_2\) be equal (Eq. (10)) and the relationship between the features and the target be linear with Gaussian noise. The decrease of variance is greater than the increase of (squared) bias when the two features \(x_1\) and \(x_2\) are aggregated with their average if and only if:

$$\begin{aligned} \hat{\rho }_{x_1,x_2} \ge 1-\frac{2\sigma ^2}{(n-1)\hat{\sigma }^2_x(w_1-w_2)^2}, \end{aligned}$$
(18)

which, for \(\hat{\sigma }_x=1\), becomes:

$$\begin{aligned} \hat{\rho }_{x_1,x_2} \ge 1-\frac{2\sigma ^2}{(n-1)(w_1-w_2)^2}. \end{aligned}$$
(19)

Proof

Computing the difference between Equation (11) and (15) the result follows. \(\square\)

Remark 7

The results of Theorems 6 and 7 comply with the intuition that, in a linear setting with two features, they should be aggregated if their correlation is large enough.

Remark 8

Theorems 6 and 7 with unitary (sample) variances produce the same threshold in both the finite-sample and the asymptotic settings.

In conclusion, the thresholds found in Theorems 6 and 7 show that it is profitable in terms of MSE to aggregate two variables in a bivariate linear setting with Gaussian noise if:

  • the variance of the noise \(\sigma ^2\) is large, which means that the process is noisy and the variance should be reduced;

  • the number of samples n is small: in this case there is little knowledge about the actual model, and it is better to learn one parameter rather than two;

  • the difference between the two coefficients \(w_1-w_2\) is small, which implies that they are similar, and learning a single coefficient introduces little loss of information.
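Putting the pieces together, the bivariate test can be sketched as follows, assuming standardized features and replacing the unknown quantities \(\sigma^2\), \(w_1\), \(w_2\) of Eq. (19) with plug-in estimates from a preliminary bivariate regression (a heuristic simplification; Appendix C discusses how to state the condition with only theoretical or only empirical quantities).

```python
import numpy as np

def aggregate_pair(x1, x2, y):
    """Empirical version of the bivariate test of Eq. (19): return True if the
    sample correlation of the two (standardized) features exceeds the threshold
    computed with plug-in estimates of the noise variance and of the coefficients."""
    n = len(y)
    X = np.column_stack([x1, x2])
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ w_hat) ** 2) / (n - 2)      # noise variance estimate
    diff = w_hat[0] - w_hat[1]
    # If the estimated coefficients coincide, aggregation is always convenient.
    threshold = -np.inf if diff == 0 else 1.0 - 2.0 * sigma2_hat / ((n - 1) * diff**2)
    rho_hat = np.corrcoef(x1, x2)[0, 1]
    return rho_hat >= threshold, rho_hat, threshold
```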

5 Generalization: three-dimensional and D-dimensional analysis

In the previous section, we focused on aggregating two features in a bivariate setting. In this section, we extend that approach to three features. Starting from the related results, we will then straightforwardly extend them to a general problem with D features, heuristically considering the \(D-2\) remaining features as a unique third contribution. Given the complexity of the computations, we focus on the asymptotic analysis only. After the analysis, we conclude this section with the main algorithm proposed in this paper: Linear Correlated Features Aggregation (LinCFA).

5.1 Three-dimensional case

In the three-dimensional case (\(D=3\)), we consider the relationship between the three features and the target to be linear with Gaussian noise: \(y=w_1x_1+w_2x_2+w_3x_3+\epsilon\), \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). In accordance with the previous analysis, we assume the training dataset \({\textbf {X}}=[{\textbf {x}}_{{\textbf {1}}}\ {\textbf {x}}_{{\textbf {2}}}\ {\textbf {x}}_{{\textbf {3}}}]\) to be known; recalling the zero-mean assumption (\({{\,\mathrm{\mathbb {E}}\,}}[x_1]={{\,\mathrm{\mathbb {E}}\,}}[x_2]={{\,\mathrm{\mathbb {E}}\,}}[x_3]=0\)), it follows that \(\mathbb {E}[y]=w_1\mathbb {E}[x_1]+w_2\mathbb {E}[x_2]+w_3\mathbb {E}[x_3]=0\) and \(\sigma ^2_y = \sigma ^2\).

In this setting, and for the general D-dimensional setting of the next subsection, which will be a direct application of this one, we compare the performance of the bivariate linear regression \(\hat{y}=\hat{w}_ix_i+\hat{w}_jx_j\) of each pair of features \(x_i,x_j\) with the univariate linear regression that considers their average \(\hat{y}=\hat{w}\frac{x_i+x_j}{2}=\hat{w}\bar{x}\), to decide whether it is convenient to aggregate them in terms of MSE. Indeed, extending the dimension from \(D=2\) to a general dimension D and comparing all the possible models where groups of variables are aggregated is combinatorial in the number of features and would be impractical. Comparing the full D-dimensional regression model with the \((D-1)\)-dimensional model where two variables are aggregated is also impractical: when the number of features is huge, in addition to a polynomial computational cost, both models suffer from issues like the curse of dimensionality and the risk of overfitting.

To simplify the exposition, for the theoretical analysis, we will consider \(x_i=x_1,x_j=x_2\). Moreover, in the following subsection we will directly report the asymptotic correlation threshold that guarantees the asymptotic decrease of variance to be greater than the increase of bias due to the aggregation of two features. The specific analysis of variance and bias, together with the related proofs, can be found in Appendix D.

5.1.1 Correlation threshold

The result of the following theorem extends the result of Theorem 6 for the three-dimensional setting.

Theorem 8

In the asymptotic setting, let the relationship between the features and the target be linear with Gaussian noise. Assuming unitary variances of the features \(\sigma _{x_1}=\sigma _{x_2}=\sigma _{x_3}=1\), the decrease of variance is greater than the increase of (squared) bias due to the aggregation of the features \(x_1\) and \(x_2\) with their average if and only if:

$$\begin{aligned}&1-(a-b)-\sqrt{a(a-2b)} \le \rho _{x_1,x_2}\le 1-(a-b)+\sqrt{a(a-2b)},\nonumber \\&\qquad \text {with } {\left\{ \begin{array}{ll} a=\frac{\sigma ^2}{(n-1)(w_1-w_2)^2}\\ b=\frac{(\rho _{x_1,x_3}-\rho _{x_2,x_3})w_3}{(w_1-w_2)}. \end{array}\right. } \end{aligned}$$
(20)

Proof

The result follows after algebraic computations on the difference \(\Delta _{var}^{n\rightarrow \infty } - \Delta _{Bias}^{n\rightarrow \infty } \ge 0,\) where the expression of the asymptotic difference of variances and biases can be respectively found in Remark 18 and Theorem 18 of Appendix D. \(\square\)

Remark 9

Equation (20) holds also in the case of generic variance \(\sigma ^2_{x_3}\) of the feature \(x_3\), with the only difference that b becomes:

$$\begin{aligned} b=\frac{\sigma _{x_3}(\rho _{x_1,x_3}-\rho _{x_2,x_3})w_3}{(w_1-w_2)}. \end{aligned}$$
(21)

Remark 10

The result obtained in this section with three features is more difficult to interpret than the bivariate one. However, if the two features \(x_1\) and \(x_2\) are uncorrelated with the third feature \(x_3\), or they have the same correlation with it (\(\rho _{x_1,x_3}=\rho _{x_2,x_3}\)), then Equation (20) reduces to the condition found in the bivariate asymptotic analysis (Equation (17)).

Remark 11

Since the analysis is asymptotic, the theoretical quantities in Equation (20) can be substituted with their consistent estimators when the number of samples n is large.
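For completeness, a small helper computing the interval of Eq. (20), with b generalized as in Remark 9; following Remark 11, every argument can be a consistent estimate of the corresponding population quantity. The function name and interface are chosen only for this illustration.

```python
import numpy as np

def aggregation_interval_3d(n, sigma2, w1, w2, w3, rho_13, rho_23, sigma_x3=1.0):
    """Interval of correlations rho_{x1,x2} for which aggregating x1 and x2 is
    convenient in the three-dimensional asymptotic analysis (Eq. (20), with b as
    in Remark 9). Returns None when a*(a - 2b) < 0, i.e., the interval is empty."""
    a = sigma2 / ((n - 1) * (w1 - w2) ** 2)
    b = sigma_x3 * (rho_13 - rho_23) * w3 / (w1 - w2)
    disc = a * (a - 2.0 * b)
    if disc < 0:
        return None
    half_width = np.sqrt(disc)
    return 1.0 - (a - b) - half_width, 1.0 - (a - b) + half_width
```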

5.2 D-dimensional case

This last subsection of the analysis shows the generalization from three to D dimensions. In particular, we assume the relationship between the D features \(x_1,...,x_D\) and the target to be linear with Gaussian noise, \(y=w_1x_1+...+w_Dx_D+\epsilon\), with \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). As done throughout the paper, we assume the training dataset \({\textbf {X}}=[{\textbf {x}}_{{\textbf {1}}}\ ...\ {\textbf {x}}_{{\textbf {D}}}]\) to be known and, from the zero-mean assumption, \(\mathbb {E}[y]=0\) and \(\sigma ^2_y = \sigma ^2\).

As discussed for the three-dimensional case, we compare the performance (in terms of bias and variance) of the two-dimensional linear regression \(\hat{y}=\hat{w}_ix_i+\hat{w}_jx_j\) with the one-dimensional linear regression \(\hat{y}=\hat{w}\frac{x_i+x_j}{2}=\hat{w}\bar{x}\) and in the computations we consider \(x_i=x_1,x_j=x_2\) without loss of generality.

Considering the linear combination of the remaining features as a unique variable \(x=w_3x_3+...+w_Dx_D\), we directly extend the three-dimensional analysis of the previous subsection to this general case, considering the model \(y=w_1x_1+w_2x_2+wx+\epsilon\) with \(w=1\). In this way, the D-dimensional linear problem is straightforwardly reformulated as a three-dimensional one. However, this analysis can be seen as a heuristic result, in the sense that we do not fully characterize the relationship between the two features under analysis and each of the remaining ones: the focus is only on the relationship between the two features \(x_1,x_2\) and the linear combination \(x=w_3x_3+...+w_Dx_D\) of the remaining ones.

Recalling that in this case the third feature x has general variance \(\sigma ^2_{x}\), the following lemma holds.

Lemma 2

Let \(y=w_1x_1+...+w_Dx_D+\epsilon =w_1x_1+w_2x_2+wx+\epsilon\) with \(\sigma ^2_{x_1}=\sigma ^2_{x_2}=1\) and \(\sigma ^2_x=\sigma ^2_{w_3x_3+...+w_Dx_D}\). Then, performing linear regression in the asymptotic setting, the decrease of variance is greater than the increase of bias when aggregating the two features \(x_1\) and \(x_2\) with their average if and only if the condition on the correlation of Equation (20) holds (with the parameter b expressed like in Equation (21) as \(b=\frac{\sigma _{x}(\rho _{x_1,x}-\rho _{x_2,x})w}{(w_1-w_2)}\)).

Proof

The lemma follows by applying the three-dimensional analysis with general variance of the third feature \(\sigma ^2_x\) (Theorem 8 and Remark 9). \(\square\)

5.3 D-dimensional algorithm

For the general D-dimensional case, as explained in the previous subsection, the three-dimensional result is extended by considering as third feature the linear combination of the \(D-2\) features not currently considered for the aggregation. A drawback of applying the obtained result in practice is that it requires knowledge of all the coefficients \(w_1,...,w_D\), which is unrealistic, or their approximation through estimates obtained by performing linear regression on the complete D-dimensional dataset. In this case, the computational cost is \(\mathcal {O}(n\cdot D^2 + D^3)\)—which becomes \(\mathcal {O}(n\cdot D^2 + D^{2.37})\) if using the Coppersmith-Winograd algorithm (Coppersmith and Winograd 1990)—and it is impractical with a huge number of features. Therefore, since the three-dimensional asymptotic condition reduces to the bivariate one when the two features have the same correlation with the third (Remark 10), it is reasonable, if they are highly correlated, to assume this to hold and to apply the asymptotic bivariate result of Equation (17) to decide whether the two features should be aggregated. In this way, we iteratively try all combinations of two features, with complexity \(\mathcal {O}(n+D^2)\) in the worst case, in order to choose the groups of features that it is convenient to aggregate with their mean.

Algorithm 1

LinCFA: Linear Correlated Features Aggregation

Algorithm 1 reports the pseudo-code of the proposed algorithm, Linear Correlated Features Aggregation (LinCFA). The proposed dimensionality reduction algorithm creates a partition of the feature indices \(\{1,\dots ,D\}\) into d subsets by iteratively comparing couples of features and adding them to the same subset if their correlation (\(\text {correlation}(x_i,x_j)\)) is greater than the threshold (\(\text {threshold}(x_i,x_j,y)\)) obtained from Eq. (17). Then, it aggregates the features in each set k of the partition (\(\varvec{\mathcal {P}}\)) with their average, producing each output \(\bar{x}_k\).
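The following is a minimal, unoptimized sketch of this procedure (one possible reading of Algorithm 1, written for illustration: it compares each feature with a representative of each existing cluster, uses the empirical threshold of Eq. (19) with plug-in estimates, and assumes standardized continuous features; the official implementation in the linked repository may differ in details).

```python
import numpy as np

def lincfa(X, y, seed=0):
    """Sketch of LinCFA: build a partition of the feature indices by adding each
    feature to an existing cluster when its correlation with the cluster
    representative exceeds the pairwise threshold of Eq. (19), then average
    each cluster. Returns the reduced dataset U and the partition."""
    n, D = X.shape
    clusters = []                                     # list of lists of column indices

    def exceeds_threshold(xi, xj):
        Z = np.column_stack([xi, xj])
        w_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
        sigma2_hat = np.sum((y - Z @ w_hat) ** 2) / (n - 2)
        diff = w_hat[0] - w_hat[1]
        thr = -np.inf if diff == 0 else 1.0 - 2.0 * sigma2_hat / ((n - 1) * diff**2)
        return np.corrcoef(xi, xj)[0, 1] >= thr

    order = np.random.default_rng(seed).permutation(D)   # random shuffle (see Remark 13)
    for i in order:
        for cluster in clusters:
            if exceeds_threshold(X[:, i], X[:, cluster[0]]):
                cluster.append(i)
                break
        else:                                            # no cluster accepted the feature
            clusters.append([i])

    U = np.column_stack([X[:, c].mean(axis=1) for c in clusters])
    return U, clusters
```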

Remark 12

(About theoretical results and the empirical algorithm) The proposed algorithm aggregates sets of features and not only pairs of them, as considered in the theoretical analysis. The motivation behind this choice is to perform a single average of a set of features. A possible variation, which aggregates pairs of features as derived in the theory, is to directly aggregate a pair of features with their mean once they satisfy the theoretical aggregation condition (e.g., at the first iteration we aggregate \(x_1,x_2\) producing \(\bar{x}=\frac{x_1+x_2}{2}\)). Then, considering their mean from there on as a single feature, it would be aggregated with another feature if the condition is satisfied (e.g., at the second iteration we aggregate \(\bar{x},x_3\) producing \(\hat{x}=\frac{\bar{x}+x_3}{2}\)), until no more aggregations are possible. This procedure adheres more closely to the theoretical results; however, it is less interpretable in the sense discussed in Remark 2, since each reduced feature is an iterative mean of means.

Remark 13

(About the ordering of features) The output of the LinCFA algorithm may depend on the ordering of the features. Therefore, in the pseudo-code of Algorithm 1, a random shuffle of the original features is required as input, so that systematic biases due to the ordering are avoided. A greedy approach that removes the dependency of the LinCFA algorithm on the ordering of the features would be to introduce an internal ordering among pairs of features. Considering correlation, for example, one may select the pair of the two most correlated features and test whether they exceed the threshold. If so, they could be added to a cluster and substituted with their mean. Iteratively proceeding in this way, until all features have been assigned to a set of the partition, produces an algorithm that is independent of the initial ordering of the features and aggregates only features that exceed the threshold. However, this increases the memory and computational complexity, since all the correlations between each pair of features have to be computed and stored.

As a further step, among the possible partitions that can be identified depending on the ordering, there is at least one optimal partition of features, which minimizes the mean squared error. Intuitively, with infinite samples, the MSE is minimized by considering each feature independently. This is confirmed by the asymptotic variance analysis, where the dependence on n shows that, with infinite samples, there is no decrease of variance with the aggregation. However, with finite samples, the identification of the optimal partition would be combinatorial, since all the possible partitions should be tested. The proposed algorithm adds one feature at a time to a cluster, therefore it has no guarantees of optimality. This is in line with classical machine learning approaches such as forward feature selection, which iteratively selects a promising feature, although a combination of two other features may be more informative.

6 Numerical validation

In this section, the theoretical results obtained in Sects. 4 and 5 are exploited to perform dimensionality reduction on synthetic datasets of two, three and D dimensions. Furthermore, the proposed dimensionality reduction approach LinCFA is applied to real datasets and compared with state-of-the-art benchmark methods. The regression performance is evaluated in terms of Mean Squared Error (MSE), R-squared (\(R^2\)) and Relative RMSE (RRMSE). Code and datasets can be found at the following link: https://github.com/PaoloBonettiPolimi/PaperLinCFA.

6.1 Two-dimensional application

Table 1 Experiment on synthetic bivariate data for two combinations of weights and three different values of variance of the noise

In the bivariate setting, according to Eqs. (17) and (19), it is convenient to aggregate the two features when the number of samples n is small, when the absolute difference between the coefficients of the linear model \(w_1,w_2\) is small, or when the variance of the noise \(\sigma ^2\) is large. The synthetic experiments (fully described in Appendix E) confirm the theoretical result on data. In particular, they are performed with a fixed number of samples \(n=500\) and a fixed correlation between the features \(\rho _{x_1,x_2}\approx 0.9\), comparing two combinations of weights (with small and large difference) and three different variances of the noise (small, normal, large).

Table 1 shows the results of the experiments (more detailed results can be found in Tables 6 and 7 of Appendix E). In line with the theory, when the weights of the linear model are considerably distant, the threshold is far from 1 and the two features are aggregated only with a huge variance of the noise, while for a reasonably small noise variance they are kept separate. On the other hand, when the weights of the linear model are similar, the threshold of Eq. (17) is small and the conclusion is to aggregate the two features even with a small variance of the noise. The confidence intervals on the \(R^2\) and on the MSE confirm that, when the correlation is above the threshold, the performance of the linear model with the two features aggregated into their average is statistically not worse than that of the bivariate model where they are kept separate. Finally, it is important to notice that knowing the true regression coefficients leads to the same decision (aggregate or not) in all the 500 repetitions of the experiment (row # aggregations (theo)). On the contrary, estimating the coefficients from data leads to the same action in most repetitions, but not all (row # aggregations (emp)), since the limited amount of data introduces noise into the estimates.

6.2 Three-dimensional application

Table 2 Synthetic experiment in the three dimensional setting comparing the full model with three variables with the bivariate model where \(x_1,x_2\) are aggregated with their mean

Equation (20) expresses the interval of correlations for which it is convenient to aggregate the two features \(x_1\) and \(x_2\) in the three-dimensional setting. As in the bivariate case, it is related to the number of samples, the difference between the weights, and the variance of the noise. In addition, it also depends on the difference between the correlations of each of the two features with the third one \(x_3\) and on the weight \(w_3\).

The experiment performed in this setting is based on synthetic data, generated with the following realistic setting: the weights \(w_1=0.4,\ w_2=0.6\) are closer to each other than to \(w_3=0.2\). Moreover, the two features are significantly correlated: \(\rho _{x_1,x_2}\approx 0.88\) (more details can be found in Appendix E).

In this setting, as shown in Table 2, it is convenient to aggregate the two features \(x_1,x_2\) with their average both in terms of MSE and \(R^2\), since the aggregation does not worsen the performance. In particular, the aggregation is already convenient with a small standard deviation of the noise (\(\sigma =0.5\)).

6.3 D-dimensional application

Table 3 Synthetic experiment in the D dimensional setting. The experiment has been repeated twice: considering the theoretical threshold with the exact coefficients (theo) and with coefficients estimated from data (emp)

This subsection introduces the D-dimensional synthetic experiment performed 500 times with \(n=500\) samples and \(D=100\) features, reduced with the proposed algorithm LinCFA (more details can be found in Appendix E).

The test results shown in Table 3 underline that knowing the real values of the coefficients of the linear model would lead to a reduced dataset of \(d=4\) features and a significant increase in performance (\(R^2\) aggregate (theo), MSE aggregate (theo)), while using the empirical coefficients the dimension is reduced to \(d=15\), still with a significant increase in performance both in terms of MSE and \(R^2\) (\(R^2\) aggregate (emp), MSE aggregate (emp)). This is a satisfactory result, and it is confirmed by the real-world application described below.

Fig. 2 a Number of reduced features for different numbers of samples. b, c Regression performance in terms of \(R^2\) and MSE for different numbers of samples. Blue lines refer to linear regression with all the original features, while red and green lines refer to linear regression on the features reduced by the proposed algorithm using theoretical and empirical quantities, respectively (Color figure online)

To better understand the performance of the algorithm, in Fig. 2 we consider the number of selected features and the regression scores. From Fig. 2a it is clear that, with a small number of samples, both considering theoretical and empirical quantities, the number of reduced features d becomes smaller to prevent overfitting. Moreover, considering the empirical quantities, which are the only ones available in practice, leads to a larger number of reduced features (but still significantly smaller than the original dimension D). Figure 2b, c show the performance of the linear regression on the reduced features compared with the full dataset. When the number of samples is significantly larger than the number of features, the performance of the reduced datasets is only slightly better but, when the number of samples is of the same order of magnitude as the number of features, the reduced datasets (both considering empirical and theoretical quantities) significantly outperform the regression over the full dataset. Moreover, the regression performed with the reduced datasets is much more robust, since its score remains stable across different numbers of samples.

Real-world experiments. The main practical contribution of this paper, the LinCFA algorithm, has also been tested on real datasets. In particular, its results are discussed in comparison with the chosen baselines.

Table 4 Experiments on four real datasets. The total number of samples n has been divided into train (66% of data) and test (33% of data) sets
Table 5 Experiments on four additional real datasets. The total number of samples n is divided into train (66% of data) and test (33% of data) sets

Specifically, the LinCFA algorithm has been compared with classical (unsupervised) PCA, Supervised PCA, LLE, LPP, Isomap, and Kernel PCA. Additionally, RReliefF has been considered as a feature selection baseline. The number of components selected for PCA is set to explain \(95\%\) of the variance, while for Supervised PCA, LLE, LPP, Isomap, Kernel PCA, and RReliefF the best result (evaluated from \(d=1\) to \(d=50\) components) has been considered. All the other hyperparameters of the methods have been set to their default values. Linear, polynomial, and sigmoidal kernels have been considered for Kernel and Supervised PCA. The mean squared error (MSE), the relative root mean squared error (RRMSE), and the coefficient of determination (\(R^2\)) of the linear regression, applied to the set of components produced by each algorithm under analysis, have been used as performance measures on the test set. Confidence intervals have been produced by bootstrapping the training and validation sets with five different seeds. Additionally, the performance of linear, Ridge (Hoerl and Kennard 1970), and Lasso (Tibshirani 1996) regression on the full dataset has been considered. To further test the LinCFA algorithm against non-linear regression approaches, support vector regression (Drucker et al. 1996), XGBoost (Chen and Guestrin 2016), and a neural network (Lawrence 1993) have also been applied alongside linear regression, and the related results are available in Appendix E.4. Moreover, Appendix E.4 also reports the results of each baseline when constrained to the same number of reduced features selected by the LinCFA algorithm.
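
As an example of the evaluation protocol for one baseline, the following sketch applies PCA retaining 95% of the variance, fits a linear regression on a 66%/33% train-test split, and computes MSE, RRMSE, and \(R^2\). It uses scikit-learn as an assumed implementation, and the RRMSE normalization shown is one common definition that may differ from the one used in the paper; repeating the procedure with five different seeds provides the variability estimate behind the confidence intervals.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_pca_baseline(X, y, seed=0):
    # 66% train / 33% test split, as in Tables 4 and 5
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=seed)
    # Components explaining 95% of the variance (float n_components in scikit-learn)
    pca = PCA(n_components=0.95).fit(X_tr)
    Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)
    y_hat = LinearRegression().fit(Z_tr, y_tr).predict(Z_te)
    mse = mean_squared_error(y_te, y_hat)
    # One common RRMSE definition: RMSE relative to predicting the training mean
    baseline_mse = mean_squared_error(y_te, np.full(len(y_te), np.mean(y_tr)))
    rrmse = np.sqrt(mse / baseline_mse)
    return mse, rrmse, r2_score(y_te, y_hat)

# Repetition over five seeds to obtain variability estimates
# scores = np.array([evaluate_pca_baseline(X, y, seed=s) for s in range(5)])
```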

Eight datasets with different characteristics have been considered.

The first dataset focuses on the prediction of life expectancy from \(D=18\) continuous factors and 1649 samples, and is available on Kaggle. In this case, a reduction of the number of features may be unnecessary, as confirmed by the experimental results, where the full dataset achieves performance similar to the majority of the benchmark methods and to the LinCFA algorithm. This experiment shows that the algorithm does not excessively reduce the dimensionality when it is not necessary.

The second dataset is a financial dataset composed of \(D=75\) continuous features, 1299 samples, and a scalar output, available on Kaggle. The model predicts the cash ratio from other metrics, from which many fundamental indicators can be derived. Given the considerable number of features w.r.t. the number of samples, linear regression on the full dataset provides negative results, while linear regression on the reduced datasets achieves significantly higher scores, and the LinCFA algorithm shows one of the best performances among the methods considered.

Then, the algorithm is tested on two climatological datasets composed of \(D=136\) (with 1038 samples) and \(D=1991\) (with 981 samples) continuous climatological features and a scalar target, which represents the state of vegetation of a basin of the Po river. These datasets have been assembled by the authors by merging different sources for the vegetation index, temperature, and precipitation over different basins (see Didan 2015; Cornes et al. 2018; Zellner and Castelli 2022), and they are available in the repository of this work. On the first climate dataset, considering linear regression, the feature reduction performed by the baselines and by the LinCFA algorithm again leads to a significant improvement w.r.t. the full dataset, which reaches a satisfactory performance only with XGBoost. The second climate dataset significantly benefits from the dimensionality reduction in all cases. In particular, the LinCFA algorithm leads to the highest score in combination with linear regression.

Additionally, we further tested the LinCFA algorithm on four datasets from the UCI repository. In particular, we considered a simple classical dataset with 13 features (and 506 samples), the Boston Housing dataset (Harrison and Rubinfeld 1978), which confirms that, as discussed for the life expectancy dataset, the proposed algorithm does not lose information when the full regression is already able to handle the entire set of features. Then, a more complex dataset related to superconductivity (Hamidieh 2018), with 81 features, provides an example with many samples (21263), showing that the LinCFA algorithm can also be applied in this case.

Additionally, we considered the Cifar-10 dataset (Krizhevsky et al. 2009), transformed into a regression problem by considering each pixel of each of the three color layers as a feature and removing one pixel, which is used as the target. This provides a significant case with 6000 features and 3071 samples, where the LinCFA algorithm achieves the best overall score compared with the full dataset and the considered linear and non-linear methods.

Finally, a Gene Expression dataset (Fiorini 2016), composed of 801 samples and 19133 features, has been considered, where the expression of one gene is the target variable and the expressions of the other available genes are used as features. Similarly to the climate and Cifar-10 datasets, which have many highly correlated features that need to be reduced to gain both interpretability and performance, this dataset allows us to test the LinCFA algorithm on data with a large number of features and a relatively small number of samples. The results show once again that the LinCFA algorithm obtains high scores w.r.t. the other dimensionality reduction methods and the regression on the full dataset.

Tables 4 and 5 show the MSE, RRMSE and \(R^2\) values obtained with linear regression applied to the full dataset, to the dataset reduced by LinCFA, and to the dataset reduced by the best-performing baseline. Additionally, the results related to Ridge and Lasso regression are reported. The extensive results for each dataset with all the baselines and regression methods considered can be found in Appendix E, Tables 9, 10, 11, 12, 13, 14, 15 and 16. Additionally, in the appendix, Tables 17 and 18 report the results obtained by repeating the experiments while imposing on each dimensionality reduction method the same number of reduced features as identified by LinCFA. Finally, an empirical example of computational time is reported in Table 19.

From the results, as already mentioned in the description of the datasets, it is possible to notice that, when the number of features is low, the results of the full regression and of the regression on the dataset reduced by the baselines or by LinCFA are similar. On the other hand, when the algorithms are applied to high-dimensional data, the proposed algorithm always obtains similar or better performance than the other methods. Therefore, the LinCFA algorithm is able to reduce the dimensionality of the input features while improving (or not worsening) the performance of the linear model and preserving the interpretability of the reduced features.

7 Conclusion and future work

This paper presents a dimensionality reduction algorithm for linear settings with the theoretical guarantee of producing a reduced dataset that does not perform worse than the full dataset in terms of MSE, since the decrease of variance is larger than the increase of bias due to the aggregation of features. The main strength of the proposed approach is that it aggregates features through their mean, which reduces the dimension while preserving the interpretability of each feature, something that is not common in traditional dimensionality reduction approaches such as PCA. Moreover, the complexity of the proposed algorithm is lower than that of performing a linear regression on the full original dataset. The main weakness of the proposed method is that all the computations assume the features to be continuous and the relationship between the target and the features to be linear, which is a strong assumption in real-world applications. However, the empirical results show an increase in performance and a significant reduction of dimensionality when applying the proposed algorithm to real-world datasets. Indeed, as detailed in Remark 1, the algorithm is designed relying on the linear theoretical result, but it can be applied to any regression problem with continuous features, where the proposed threshold becomes a heuristic quantity. In this case, the empirical validation of the method on real-world datasets, which have no guarantee of linearity, shows a promising applicability of the proposed method outside linear contexts.

In future work, it may be interesting to relax the linearity assumption in the theoretical analysis, considering the target as a general function of the input features and applying a general machine learning method to the data. Another possible way to enrich the results obtained in this paper is to consider structured data, where prior knowledge of the relationships among features can help identify the most suitable features for aggregation (e.g., in climatological data, features recorded by two adjacent sensors are more likely to be aggregated).