# The spectral condition number plot for regularization parameter evaluation


## Abstract

Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employs regularization to cope with the subsequent singularity of the sample covariance matrix. These estimators depend on a penalty parameter and choosing its value can be hard, in terms of being computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter assessment. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.

## Keywords

Eigenvalues · High-dimensional covariance (precision) estimation · \(\ell _2\)-Penalization · Matrix condition number

## 1 Introduction

The covariance matrix \(\varvec{\Sigma }\) of a *p*-dimensional random vector \(Y_i^{\mathrm{T}} \equiv [Y_{i1}, \ldots , Y_{ip}] \in \mathbb {R}^{p}\) is of central importance in many statistical analysis procedures. Estimation of \(\varvec{\Sigma }\) or its inverse \(\varvec{\Sigma }^{-1} \equiv \varvec{\Omega }\) (generally known as the precision matrix) is central to, for example, multivariate regression, time-series analysis, canonical correlation analysis, discriminant analysis, and Gaussian graphical modeling. Let \(\mathbf{y }_{i}^{\mathrm{T}}\) be a realization of \(Y_i^{\mathrm{T}}\). It is well-known (Stein 1975) that the sample covariance matrix \(\hat{\varvec{\Sigma }} = n^{-1}\sum _{i=1}^{n}(\mathbf{y }_{i} - \bar{\mathbf{y }})(\mathbf{y }_{i} - \bar{\mathbf{y }})^{\mathrm{T}}\) is a poor estimate of \(\varvec{\Sigma }\) when *p* approaches the sample size *n* or when \(p > n\). When *p* approaches *n*, \(\hat{\varvec{\Sigma }}\) will tend to become ill-conditioned. When \(p > n\), \(\hat{\varvec{\Sigma }}\) is singular, leaving \(\hat{\varvec{\Omega }}\) undefined. However, many contemporary applications in fields such as molecular biology, neuroimaging, and finance encounter situations of interest where \(p > n\).
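
The rank deficiency that renders \(\hat{\varvec{\Omega }}\) undefined is easily verified numerically; a minimal base-R sketch with simulated data (dimensions are illustrative):

```r
## With p > n the (ML) sample covariance matrix has rank at most n - 1
## and is therefore singular.
set.seed(1)
n <- 10; p <- 25
Y <- matrix(rnorm(n * p), nrow = n, ncol = p)
S <- cov(Y) * (n - 1) / n  # ML sample covariance, as in the text
qr(S)$rank                 # at most n - 1 < p, so solve(S) would fail
```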

The estimation of \(\varvec{\Sigma }\) can be improved by shrinking the eigenvalues of \(\hat{\varvec{\Sigma }}\) towards a central value (e.g., Stein 1975) or by convexly combining \(\hat{\varvec{\Sigma }}\) with some well-conditioned target matrix (e.g., Schäfer and Strimmer 2005). These solutions define a class of estimators that can be caught under the umbrella term ‘ridge-type covariance estimation’. Such estimators depend on a penalty parameter determining the rate of shrinkage and choosing its value is of prime importance. Available procedures for choosing the penalty have some (situation dependent) disadvantages in the sense that (a) they can be computationally expensive, (b) they can be restricted to special cases within the class of ridge-type estimators, or (c) they are not guaranteed to result in a meaningful penalty-value. There is thus some demand for a generic and computationally friendly procedure on the basis of which one can (i) heuristically select an acceptable (minimal) value for the penalty parameter and (ii) assess if more formal procedures have indeed proposed an acceptable penalty-value. Here such a tool is provided on the basis of the matrix condition number.

The remainder is organized as follows. Section 2 reviews the class of ridge-type estimators of the covariance matrix. In addition, penalty parameter selection is reviewed and an exposé of the matrix condition number is given. The spectral condition number is central to the introduction of the spectral condition number plot in Sect. 3. This graphical display is posited as an exploratory tool that may function as a fast and simple heuristic in evaluating (a range of) penalty values or in determining a (minimum) value of the penalty parameter when employing ridge estimators of the covariance or precision matrix. We emphasize that it is a generic tool, of use for all ridge-type estimators of either the covariance or precision matrix in situations in which sparsity is not a direct asset. Section 4 illustrates usage of the spectral condition number plot with data from the field of oncogenomics. This illustration exemplifies that this tool may also be of use in situations in which sparsity is indeed ultimately desired, such as in graphical modeling. The Supplementary Material contains an additional illustration regarding high-dimensional factor analysis in which sparsity is not sought after. Section 5 discusses the software that implements the proposed graphical display. Sections 6 and 7 conclude with discussions.

## 2 Eigenstructure regularization and the condition number

### 2.1 Ridge-type shrinkage estimation

Regularization of the covariance matrix goes back to Stein (1975, 1986), who proposed shrinking the sample eigenvalues towards some central value. This work spurred a large body of literature (see, e.g., Haff 1980, 1991; Yang and Berger 1994; Won et al. 2013; Chi and Lange 2014). Of late, two encompassing forms of what is referred to as ‘ridge-type covariance estimation’ have emerged. The first, estimator (1), convexly combines the sample covariance matrix with a positive definite (p.d.) target matrix \(\mathbf {T}\): \(\hat{\varvec{\Sigma }}(\lambda _{\mathrm{I}}) = (1 - \lambda _{\mathrm{I}})\hat{\varvec{\Sigma }} + \lambda _{\mathrm{I}}\mathbf {T}\) with \(\lambda _{\mathrm{I}} \in (0, 1]\); an ad-hoc variant, estimator (2), adds a scalar multiple of the identity: \(\hat{\varvec{\Sigma }}(\lambda _{\mathrm{II}}) = \hat{\varvec{\Sigma }} + \lambda _{\mathrm{II}}\mathbf {I}_{p}\) with \(\lambda _{\mathrm{II}} > 0\). The second, estimator (3), is the alternative ridge estimator \(\hat{\varvec{\Sigma }}^{a}(\lambda _{a})\), \(\lambda _{a} > 0\), of van Wieringen and Peeters (2016), which derives from a proper \(\ell _2\)-penalized likelihood problem with a general target matrix \(\mathbf {T}\).

Clearly, one may obtain ridge estimators of the precision matrix by considering the inverses of (1), (2), and (3). For comparisons of these estimators see Lin and Perlman (1985), Daniels and Kass (2001) and van Wieringen and Peeters (2016). For expositions of other penalized covariance and precision estimators we confine ourselves to referring to Pourahmadi (2013).

### 2.2 Penalty parameter selection

The choice of penalty-value is crucial to the aforementioned estimators. Let \(\lambda \) denote a generic penalty. When \(\lambda \) is chosen too small, an ill-conditioned estimate may ensue when \(p > n\) (see Sect. 2.3). When \(\lambda \) is chosen too large, relevant data signal may be lost. Many options for choosing \(\lambda \) are available. The ridge estimators, in contrast to \(\ell _1\)-regularized estimators of the covariance or precision matrix (e.g., Friedman et al. 2008; Bien and Tibshirani 2011), do not generally produce sparse estimates. This implies that model-selection-consistent methods (such as usage of the BIC) are not appropriate. Rather, for \(\ell _2\)-type estimators, it is more appropriate to seek loss efficiency.

A generic strategy for determining an optimal value for \(\lambda \) that can be used for any ridge-type estimator of Sect. 2.1 is *k*-fold cross-validation (of the likelihood function). Asymptotically, such an approach can be explained in terms of minimizing Kullback-Leibler divergence. Unfortunately, this strategy is computationally prohibitive for large *p* and/or large *n*. Lian (2011) and Vujačić et al. (2015) propose approximate solutions to the leave-one-out cross-validation score. While these approximations imply gains in computational efficiency, they are not guaranteed to propose a reasonable optimal value for \(\lambda \).

Ledoit and Wolf (2004b) propose a strategy to determine analytically an optimal value for \(\lambda \) under a modified Frobenius loss for the estimator (1) under certain choices of \(\mathbf {T}\) (cf. Fisher and Sun 2011). This optimal value requires information on the unknown population matrix \(\varvec{\Sigma }\). The optimal penalty-value thus needs to be approximated with some estimate of \(\varvec{\Sigma }\). Ledoit and Wolf (2004b) utilize an *n*-consistent estimator while Schäfer and Strimmer (2005) use the unbiased estimate \(\tfrac{n}{n-1}\hat{\varvec{\Sigma }}\). In practice, this may result in overshrinkage (Daniels and Kass 2001) or even negative penalty-values (Schäfer and Strimmer 2005).

The spectral condition number plot proposed below is of use in at least the following situations:

- (i) when one wants to speedily determine a (minimal) value for \(\lambda \) for which \(\hat{\varvec{\Sigma }}(\lambda )\) is well-conditioned;
- (ii) when one wants to speedily determine whether an optimal \(\lambda \) proposed by some other (formal) procedure indeed leads to a well-conditioned estimate \(\hat{\varvec{\Sigma }}(\lambda )\);
- (iii) when one wants to speedily determine a reasonable minimal value for \(\lambda \) for usage in a search-grid (for an optimal such value) by other, optimization-based, procedures.

### 2.3 Spectral condition number

The estimators from Sect. 2.1 are p.d. when their penalty-values are strictly positive. However, they are not necessarily well-conditioned for any strictly positive penalty-value when \(p \gtrapprox n\), especially when the penalty takes a value in the lower range. We seek to quantify the condition of the estimators w.r.t. a given penalty-value. To this end we utilize a condition number (Von Neumann and Goldstine 1947; Turing 1948).

The condition number of a matrix \(\mathbf {A}\) with respect to the spectral norm is termed the *spectral condition number* for its relation to the spectral decomposition. When \(\mathbf {A}\) is a symmetric p.d. matrix, it is well-known that (Von Neumann and Goldstine 1947; Gentle 2007) \(\mathcal {C}_{2}(\mathbf {A}) = \Vert \mathbf {A}\Vert _{2}\Vert \mathbf {A}^{-1}\Vert _{2} = d(\mathbf {A})_{1}/d(\mathbf {A})_{p}\), i.e., the ratio of its largest to its smallest eigenvalue.

### Example 1

For clarification consider the following toy example. Assume we have a sample covariance matrix \(\hat{\varvec{\Sigma }}\) whose largest eigenvalue \(d(\hat{\varvec{\Sigma }})_1 = 3\). Additionally assume that \(\hat{\varvec{\Sigma }}\) is estimated in a situation where \(p > n\) so that \(d(\hat{\varvec{\Sigma }})_p = 0\) and, hence, \(d(\hat{\varvec{\Sigma }})_1/d(\hat{\varvec{\Sigma }})_p = 3/0 = \infty \) (under the IEEE computing Standard for Floating-Point Arithmetic; IEEE Computer Society 2008). Say we are interested in regularization using the estimator (3) using a scalar target matrix with \(\varphi = 2\). Even using a very small penalty of \(\lambda _{a} = {1\times 10^{-1}}\) it is then quickly verified that \(d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_1/d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_p = 300{,}003 < d(\hat{\varvec{\Sigma }})_1/d(\hat{\varvec{\Sigma }})_p\). Under a large penalty such as \(\lambda _{a} = 10{,}000\) we find that \(d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_1/d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_p = 1.00015\). Indeed, van Wieringen and Peeters (2016) have shown that, in this rotation equivariant setting, \(d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_j \rightarrow 1/\varphi \) as \(\lambda _{a} \rightarrow \infty \) for all *j*. Hence, \(d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_1/d[\hat{\varvec{\Sigma }}^{a}(\lambda _{a})]_p \rightarrow 1\) as the penalty value grows very large. Section 1 of the Supplementary Material visualizes this behavior (in the given setting) for various scenarios of interest: \(\varphi < 1/d(\hat{\varvec{\Sigma }})_1\), \(1/d(\hat{\varvec{\Sigma }})_1< \varphi < 1\), \(\varphi = 1\), \(1< \varphi < d(\hat{\varvec{\Sigma }})_1\), and \(\varphi > d(\hat{\varvec{\Sigma }})_1\). \(\square \)
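
The qualitative message of the example, namely that a strictly positive penalty yields a finite condition number that tends to its limit as the penalty grows, can be mimicked in base R with the ad-hoc estimator (2), under the assumption (used only in this sketch) that (2) adds \(\lambda \) to each sample eigenvalue; the numbers therefore differ from those of estimator (3) above:

```r
## Toy eigenvalues from Example 1: d_1 = 3 and d_p = 0.
d1 <- 3; dp <- 0
d1 / dp                                  # Inf under IEEE 754 arithmetic
cond2 <- function(lambda) (d1 + lambda) / (dp + lambda)
cond2(1e-1)                              # 31: finite for any lambda > 0
cond2(1e4)                               # 1.0003: near 1 under heavy shrinkage
```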

## 3 The spectral condition number plot

### 3.1 The basic plot

As one may appreciate from the exposition in Sect. 2.3, when \(\hat{\varvec{\Sigma }}(\lambda )\) moves away from near-singularity, small increments in the penalty \(\lambda \) can cause dramatic changes in \(\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda )]\). One can expect that at some point along the domain of \(\lambda \), its value will be large enough for \(\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda )]\) to stabilize (in some relative sense). We will cast these expectations regarding the behavior in a (loose) definition:

### Heuristic Definition

Let \(\hat{\varvec{\Sigma }}(\lambda )\) denote a generic ridge-type estimator of the covariance matrix under generic fixed penalty \(\lambda \). In addition, let \(\Delta \lambda \) indicate a real perturbation in \(\lambda \) as opposed to the theoretical perturbation \(\delta \lambda \). We will term the estimate \(\hat{\varvec{\Sigma }}(\lambda )\) *well-conditioned* when small increments \(\Delta \lambda \) in \(\lambda \) translate to (relatively) small changes in \(\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda + \Delta \lambda )]\) vis-à-vis \(\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda )]\).

A plot of (7), the spectral condition number \(\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda )]\), against the (chosen domain of the) penalty parameter will be termed the *spectral condition number plot*.

Figure 1 indeed plots (7) against the natural logarithm of \(\lambda \). As should be clear from Sect. 2.3, the spectral condition number displays (mostly) decreasing concave upward behavior in the feasible domain of the penalty parameter with a vertical asymptote at \(\ln (0)\) and a horizontal asymptote at \(\mathcal {C}_{2}(\mathbf {T})\) (which amounts to 1 in case of a strictly positive scalar target). The logarithm is used on the *x*-axis as, especially for estimators (2) and (3), it is more natural to consider orders of magnitude for \(\lambda \). In addition, usage of the logarithm ‘decompresses’ the lower domain of \(\lambda \), which enhances the visualization of the point of relative stabilization, as it is in the lower domain of the penalty parameter where ill-conditioning usually ensues when \(p > n\). Figure 1 uses simulated data (see Section 4 of the Supplementary Material) with \(p = 100\) and \(n = 25\). The estimator used is the ad-hoc ridge-type estimator given by (2). One can observe relative stabilization of the spectral condition number—in the sense of the Heuristic Definition—at approximately \(\exp (-\,3)\approx .05\). This value can be taken as a reasonable (minimal) value for the penalty parameter. The spectral condition number plot can be a simple visual tool of interest in the situations sketched in Sect. 2.2, as will be illustrated in Sect. 4.
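
A bare-bones version of such a display can be produced in a few lines of base R; a sketch under the same simulated design (\(p = 100\), \(n = 25\)) and the assumption that estimator (2) takes the form \(\hat{\varvec{\Sigma }} + \lambda \mathbf {I}_{p}\), so that its condition number has the closed form \([d(\hat{\varvec{\Sigma }})_1 + \lambda ]/[d(\hat{\varvec{\Sigma }})_p + \lambda ]\):

```r
## Home-made spectral condition number plot over a log-equidistant grid.
set.seed(1)
n <- 25; p <- 100
Y <- matrix(rnorm(n * p), n, p)
S <- cov(Y) * (n - 1) / n
d <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
lambdas <- exp(seq(log(1e-5), log(20), length.out = 200))
CN <- (d[1] + lambdas) / (d[p] + lambdas)  # condition number of S + lambda*I
plot(log(lambdas), CN, type = "l", xlab = expression(ln(lambda)),
     ylab = "spectral condition number")
```

The curve is strictly decreasing in \(\lambda \) and flattens towards 1, so the point of relative stabilization can be read off as the elbow.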

### 3.2 Interpretational aids

The basic spectral condition number plot can be amended with interpretational aids. The software (see Sect. 5) can add two such aids to form a panel of plots. These aids support the heuristic choice for a penalty-value and provide additional information on the basic plot.

The first aid is the visualization of \(\lfloor \log _{10}\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda )]\rfloor \) against the domain of \(\lambda \). As stated in Sect. 2.3, \(\lfloor \log _{10}\mathcal {C}_{2}[\hat{\varvec{\Sigma }}(\lambda )]\rfloor \) provides an estimate of the digits of accuracy one can expect to lose (on top of the digit loss due to inherent numerical imprecision) in operations based on \(\hat{\varvec{\Sigma }}(\lambda )\). Note that this estimate is dependent on the norm. This aid can, for example, support choosing a (minimal) penalty-value on the basis of the error propagation (in terms of approximate loss in digits of accuracy) one finds acceptable. Figure 2 gives an example.
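
This first aid amounts to a one-line computation; a base-R sketch using, for concreteness, the two condition numbers reported in Sect. 4.2:

```r
## Approximate loss in digits of accuracy: the floor of log10 of the
## spectral condition number.
digit_loss <- function(cn) floor(log10(cn))
digit_loss(247.66)  # 2: the estimate at the heuristic minimal penalty
digit_loss(8.76)    # 0: the estimate at the LOOCV-optimal penalty
```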

The second aid gives an alternative view of the information in the basic plot, which may be welcome as the *y*-axis of the basic plot represents a very wide range.

### 3.3 Choosing \(\lambda _{\mathrm{min}}\) and \(\lambda _{\mathrm{max}}\) for plotting

As stated, the software requires the specification of \(\lambda _{\mathrm{min}}\) and \(\lambda _{\mathrm{max}}\). Practical choices for these arguments depend on the type of ridge-estimator one wants to use. Estimator (1) has a natural upper-bound to its domain for the penalty-value: 1. It is then natural to set \(\lambda _{\mathrm{max}} = 1\) for this estimator. One needs to choose \(\lambda _{\mathrm{min}}\) strictly positive, but small enough to be indicative of ill-conditioning when present. A practical choice that abides by these wishes is \(1\times 10^{-5}\). Hence, when using estimator (1) in producing the condition number plot we recommend setting \(\lambda _{\mathrm{I}} \in [1\times 10^{-5}, 1]\).

Estimators (2) and (3) do not have a natural upper-bound to the domain for their penalty-values. Hence, here one also needs to choose \(\lambda _{\mathrm{max}}\) in such a way that the domain of well-conditionedness is represented. We can do this by realizing that estimators (1) and (3) behave, when they have the same p.d. target matrix, similarly at the boundaries of the penalty domain when mapping \(\lambda _{\mathrm{I}}\) and \(\lambda _{a}\) to the same scale (van Wieringen and Peeters 2016). This mapping may be obtained as \(\lambda _{\mathrm{I}} = 1 - 1/(\lambda _{a} + 1)\). When \(\lambda _{a} = 1\times 10^{-5}\) then \(\lambda _{\mathrm{I}} \approx 1\times 10^{-5}\). When \(\lambda _{a} = 20\) then \(\lambda _{\mathrm{I}} \approx .95\), implying almost full shrinkage towards the target in the latter. Hence, for plotting one may choose \(\lambda _{a} \in [1\times 10^{-5}, 20]\). As (2) behaves similarly to (3) at the boundaries of the penalty parameter domain when the latter has the null matrix as the target, it is also a good practical choice to set \(\lambda _{\mathrm{II}} \in [1\times 10^{-5}, 20]\). Note that most illustrations abide by these recommendations.
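
The endpoint behavior of the mapping \(\lambda _{\mathrm{I}} = 1 - 1/(\lambda _{a} + 1)\) is easily checked; a base-R sketch:

```r
## Mapping the penalty scale of estimator (3) to that of estimator (1).
map_to_I <- function(lambda_a) 1 - 1 / (lambda_a + 1)
map_to_I(1e-5)  # approximately 1e-5: essentially no shrinkage
map_to_I(20)    # approximately .95: almost full shrinkage to the target
```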

Software implementing the spectral condition number plot is discussed in Sect. 5. The following section illustrates, using oncogenomics data, the various uses of the spectral condition number plot with regard to covariance or precision matrix regularization. Section 2 of the Supplementary Material contains a second data example to further illustrate usage of the condition number plot. Section 4 of the Supplementary Material contains all R code with which these illustrations can be reproduced (including querying the data).

## 4 Illustration

### 4.1 Context and data

Various histological variants of kidney cancer are designated with the amalgamation ‘renal cell carcinoma’ (RCC). Chromophobe RCC (ChRCC) is a rather rare and predominantly sporadic histological variant, accounting for 4-6% of RCC cases (Stec et al. 2009). ChRCC originates in the distal convoluted tubule (Shuch et al. 2015), a portion of the nephron (the basic structural unit of the kidney) that serves to maintain electrolyte balance (Subramanya and Ellison 2014). Often, histological variants of cancer have a distinct pathogenesis contingent upon the deregulation of certain molecular pathways. A pathway can be thought of as a collection of molecular features that work interdependently to regulate some biochemical function. Recent evidence suggests that (reactivation of) the Hedgehog (Hh) signaling pathway may support cancer development and progression in clear cell RCC (CCRCC) (Dormoy et al. 2009; D’Amato et al. 2014), the most common subtype of RCC. The Hh-signaling pathway is crucial in the sense that it “orchestrates tissue patterning” in embryonic development, making it “critical to normal kidney development, as it regulates the proliferation and differentiation of mesenchymal cells in the metanephric kidney” (D’Amato et al. 2014). Later in life Hh-signaling is largely silenced and constitutive reactivation may elicit and support tumor growth and vascularization (Dormoy et al. 2009; D’Amato et al. 2014). Our goal here is to explore if Hh-signaling might also be reactivated in ChRCC. The exploration will make use of network modeling (see Sect. 4.3) in which the network is taken as a representation of a biochemical pathway. This exercise hinges upon a well-conditioned precision matrix.

We attained data on RCC from The Cancer Genome Atlas Research Network (2013) as queried through the Cancer Genomics Data Server (Cerami et al. 2012; Gao et al. 2013) using the cgdsr R-package (Jacobsen 2015). All ChRCC samples were retrieved for which messenger ribonucleic acid (mRNA) data is available, giving a total of \(n = 15\) samples. The data stem from the IlluminaHiSeq_RNASeqV2 RNA sequencing platform and consist of normalized relative gene expressions. That is, individual gene expressions are given as mRNA z-scores relative to a reference population that consists of all tumors that are diploid for the gene in question. All features were retained that map to the Hh-signaling pathway according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto 2000), giving a total of \(p = 56\) gene features. Regularization of the desired precision matrix is needed as \(p > n\). Even though features from genomic data are often measured on or mapped to (approximately) the same scale, regularization on the standardized scale is often appropriate as the variability of the features may differ substantially when \(p > n\); a point also made by Warton (2008). Note that we may use the correlation matrix \(\mathbf {R} = (\hat{\varvec{\Sigma }} \circ \mathbf {I}_{p})^{-\frac{1}{2}}\hat{\varvec{\Sigma }}(\hat{\varvec{\Sigma }} \circ \mathbf {I}_{p})^{-\frac{1}{2}}\) instead of \(\hat{\varvec{\Sigma }}\) in Eqs. (1) to (3) without loss of generality.
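
The scaling in the last expression, with \(\circ \) the Hadamard product so that \(\hat{\varvec{\Sigma }} \circ \mathbf {I}_{p}\) retains only the diagonal of \(\hat{\varvec{\Sigma }}\), is the familiar covariance-to-correlation conversion; a base-R check on simulated data:

```r
## (S o I_p)^(-1/2) S (S o I_p)^(-1/2) coincides with cov2cor(S).
set.seed(1)
Y <- matrix(rnorm(15 * 5), 15, 5)
S <- cov(Y)
Dinv <- diag(1 / sqrt(diag(S)))  # (S o I_p)^(-1/2)
R <- Dinv %*% S %*% Dinv
max(abs(R - cov2cor(S)))         # numerically zero
```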

### 4.2 Penalty parameter evaluation and selection

The precision estimator of choice is the inverse of (3). The target matrix is chosen as \(\mathbf {T} = \varphi \mathbf {I}_{p}\), with \(\varphi \) set to the reciprocal of the average eigenvalue of \(\mathbf {R}\), which equals 1 since the eigenvalues of a correlation matrix average to \(\mathrm{tr}(\mathbf {R})/p = 1\). First, the approximate leave-one-out cross-validation (aLOOCV) procedure (Lian 2011; Vujačić et al. 2015) is used (on the negative log-likelihood) in finding an optimal value for \(\lambda _a\) under the given target and data settings. This procedure searches for the optimal value \(\lambda _{a}^{*}\) in the domain \(\lambda _{a} \in [1\times 10^{-5},20]\). A relatively fine-grained grid of 10,000 log-equidistant steps along this domain points to \(1\times 10^{-5}\) as being the optimal value for the penalty (in the chosen domain). This value seems low given the *p* / *n* ratio of the data. This calls for usage-type (ii) of the condition number plot (Sect. 2.2), where one uses it to determine if an optimal penalty as proposed by some procedure indeed leads to a well-conditioned estimate. The condition number is plotted over the same penalty-domain considered by the aLOOCV procedure. The left-hand panel of Fig. 3 depicts this condition number plot. The dashed (green) vertical line represents the penalty-value that was chosen as optimal by the aLOOCV procedure. Clearly, the precision estimate at \(\lambda _{a} = 1\times 10^{-5}\) is not well-conditioned in the sense of the Heuristic Definition. This exemplifies that the (essentially large-sample) approximation to the LOOCV score may not be suitable for non-sparse situations and/or for situations in which the *p* / *n* ratio grows more extreme (the negative log-likelihood term then tends to completely dominate the bias term). At this point one could use the condition number plot in accordance with usage-type (i), in which one seeks a reasonable minimal penalty-value.
This reasonable minimal value (in accordance with the Heuristic Definition) can be found at approximately \(\exp (-\,6)\), at which \(\mathcal {C}_{2}[\hat{\varvec{\Omega }}^{a}(\exp (-\,6))] = \mathcal {C}_{2}[\hat{\varvec{\Sigma }}^{a}(\exp (-\,6))] \approx 247.66\).

One could worry that the precision estimate retains too much noise under the heuristic minimal penalty-value. To this end, a proper LOOCV procedure is implemented that makes use of the root-finding Brent algorithm (Brent 1971). The expectation is that the proper data-driven LOOCV procedure will find an optimal penalty-value in the domain of \(\lambda _{a}\) for which the estimate is well-conditioned. The penalty-space of search is thus constrained to the region of well-conditionedness for additional speed, exemplifying usage-type (iii) of the condition number plot. Hence, the LOOCV procedure is told to search for the optimal value \(\lambda _{a}^{*}\) in the domain \(\lambda _{a} \in [\exp (-\,6),20]\). The optimal penalty-value is indeed found to the right of the heuristic minimal value at 5.2. At this value, indicated by the solid (red) vertical line in Fig. 3, \(\mathcal {C}_{2}[\hat{\varvec{\Omega }}^{a}(5.2)] = \mathcal {C}_{2}[\hat{\varvec{\Sigma }}^{a}(5.2)] \approx 8.76\). The precision estimate at this penalty-value is used in further analysis.

### 4.3 Further analysis

Note that the coupling of a ridge estimate of the precision matrix with post-hoc edge selection differs from the dominant graphical lasso approach to graphical modeling (Friedman et al. 2008) which induces automatic sparsity. It is well-known that \(\ell _1\)-based estimation (and thus support recovery) is consistent only under the assumption that the true graphical model is (very) sparse. When the number of truly non-null elements exceeds the sample size the \(\ell _1\)-penalty is unable to retrieve the sparsity pattern (van Wieringen and Peeters 2016). This is undesirable as there is accumulating evidence that many networks, such as biochemical pathways involved in disease aetiology, are dense (Boyle et al. 2017). In such a situation the coupling of a non-sparsity-inducing penalty with a post-hoc selection step such as the local false discovery rate can outperform the (graphical) lasso (van Wieringen and Peeters 2016; Bilgrau et al. 2015). These considerations underlie our chosen approach.

## 5 Software

The R-package rags2ridges (Peeters et al. 2019) implements the ridge estimators of Sect. 2.1 and the condition number plot through the function CNplot. This function outputs visualizations such as Figs. 1 and 2. The condition number is efficiently determined by the calculation of (solely) the largest and smallest eigenvalues using an implicitly restarted Lanczos method (IRLM; Calvetti et al. 1994) available through the RSpectra package (Qiu and Mei 2019). For most practical purposes this calculation is fast enough, especially in rotation equivariant situations for which only a single IRLM run is required to obtain the complete solution path of the condition number. Additional computational speed is achieved by the implementation of core operations in C++ via Rcpp and RcppArmadillo (Eddelbuettel and François 2011; Eddelbuettel 2013). For example, producing the basic condition number plot for the estimator (3) on the basis of data with dimensions \(p = 1000\) and \(n = 200\), using a scalar target matrix and 1000 steps along the penalty-domain, will take approximately .35 s (see Section 3 of the Supplementary Material for a benchmark exercise). The additional computational cost of the interpretational aids is linear: producing the panel of plots (including interpretational aids) for the situation just given takes approximately .38 s. When \(\lambda \) is very small and \(p \gg n\) the calculation of the condition number may suffer from rounding problems (much like the imaginary linear system \(\varvec{\Sigma }\mathbf{x } = \mathbf{b }\)), but remains adequate in its indication of ill-conditioning.

## 6 Discussion

The condition number plot is a heuristic tool and heuristics should be handled with care. Below, some cases are presented that serve as notes of caution. They exemplify that the proposed heuristic accompanying the condition number plot should not be applied (as any statistical technique) without proper inspection of the data.

### 6.1 Artificial ill-conditioning

A first concern is that, when the variables are measured on different scales, artificial ill-conditioning may ensue (see, e.g., Gentle 2007). If one worries whether the condition number is an adequate indication of error propagation when using variables on their original scale, one can ensure that the columns (or rows) of the input matrix are on the same scale. This is easily achieved by scaling the input covariance matrix to be the correlation matrix. Another issue is that it is not guaranteed that the condition number plot will give an unequivocal point of relative stabilization for every data problem (which hinges in part on the chosen domain of the penalty parameter). Such situations can be dealt with by extending the domain of the penalty parameter or by determining the value of \(\lambda \) that corresponds to the loss of \(\lfloor \log _{10}\mathcal {C}[\hat{\varvec{\Sigma }}(\lambda )]\rfloor \) digits (in the imaginary linear system \(\varvec{\Sigma }\mathbf{x } = \mathbf{b }\)) one finds acceptable.

### 6.2 Naturally high condition numbers

Some covariance matrices may have high condition numbers as their ‘natural state’. Consider the following covariance matrix: \(\varvec{\Sigma }_{ \text{ equi }} = (1- \varrho ) \mathbf {I}_{p} + \varrho \mathbf {J}_{p}\), with \(-1/(p - 1)< \varrho < 1\) and where \(\mathbf {J}_{p}\) denotes the \((p \times p)\)-dimensional all-ones matrix. The variates are thus equicorrelated with unit variance. The eigenvalues of this covariance matrix are \(p \varrho + (1-\varrho )\) and (with multiplicity \(p-1\)) \(1-\varrho \). Consequently, its condition number equals \(1 + p \varrho / (1-\varrho )\). The condition number of \(\varvec{\Sigma }_{ \text{ equi }}\) thus becomes high when the number of variates grows large and/or the (marginal) correlation \(\varrho \) approaches one (or \(-1/(p - 1)\)). The large ratio between the largest and smallest eigenvalues of \(\varvec{\Sigma }_{ \text{ equi }}\) in such situations mimics a high-dimensional setting in which any non-zero eigenvalue of the sample covariance estimate is infinitely larger than the smallest (zero) eigenvalues. However, irrespective of the number of samples, any sample covariance estimate of an \(\varvec{\Sigma }_{ \text{ equi }}\) with large *p* and \(\varrho \) close to unity (or \(-1/(p - 1)\)) exhibits such a large ratio. Would one estimate the \(\varvec{\Sigma }_{ \text{ equi }}\) in penalized fashion (even for reasonable sample sizes) and choose the penalty parameter from the condition number plot as recommended, then one would select a penalty parameter that yields a ‘well-conditioned’ estimate. Effectively, this amounts to limiting the difference between the penalized eigenvalues, which need not give a condition number representative of \(\varvec{\Sigma }_{ \text{ equi }}\). 
Thus, the recommendation to select the penalty parameter from the well-conditioned domain of the condition number plot may in some (perhaps exotic) cases lead to a choice that crushes too much relevant signal (shrinking the largest eigenvalue too much). For high-dimensional settings this may be unavoidable, but for larger sample sizes this is undesirable.
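
The stated spectrum of \(\varvec{\Sigma }_{ \text{ equi }}\), and the resulting condition number, can be verified numerically; a base-R sketch with illustrative \(p\) and \(\varrho \):

```r
## Equicorrelation matrix: eigenvalues p*rho + (1 - rho) (once) and
## 1 - rho (multiplicity p - 1); condition number 1 + p*rho/(1 - rho).
p <- 50; rho <- 0.9
Sigma_equi <- (1 - rho) * diag(p) + rho * matrix(1, p, p)
d <- eigen(Sigma_equi, symmetric = TRUE, only.values = TRUE)$values
d[1]         # 45.1 = p*rho + (1 - rho)
d[p]         # 0.1 = 1 - rho
d[1] / d[p]  # 451 = 1 + p*rho/(1 - rho)
```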

### 6.3 Contamination

Real data is often contaminated with outliers. To illustrate the potential effect of outliers on the usage of the condition number plot, consider data \(\mathbf{y }_{i}\) drawn from a contaminated distribution, typically modeled by a mixture distribution: \(\mathbf{y }_{i} \sim (1- \phi ) \mathcal {N}_p(\mathbf {0}, \varvec{\Sigma }) + \phi \mathcal {N}_p(\mathbf {0} , c \mathbf {I}_{p})\) for \(i = 1, \dots , n\), some constant \(c > 0\), and mixing proportion \(\phi \in [0,1]\). The expectation of the sample covariance matrix is then \(\mathbb {E}(\mathbf{y }_{i}\mathbf{y }_{i}^{\mathrm{T}}) = (1- \phi ) \varvec{\Sigma } + c \phi \mathbf {I}_{p}\). Its eigenvalues are \(d[ (1- \phi ) \varvec{\Sigma } + c \phi \mathbf {I}_{p}]_{j} = (1- \phi ) d(\varvec{\Sigma })_{j} + c \phi \), for \(j = 1, \dots , p\). In high-dimensional settings with few samples the presence of any outlier corresponds to mixing proportions clearly deviating from zero. In combination with any substantial *c* the contribution of the outlier(s) to the eigenvalues may be such that the contaminated sample covariance matrix is represented as better conditioned (vis-à-vis its uncontaminated counterpart). It is the outlier(s) that will determine the domain of well-conditionedness in such a situation. Then, when choosing the penalty parameter in accordance with the Heuristic Definition, undershrinkage may ensue. In situations in which the results are influenced by outliers one has several options at one's disposal. One could simply trim the data as a preprocessing step before obtaining \(\hat{\varvec{\Sigma }}\). Another option would be to use techniques for identifying (and subsequently removing) multivariate outliers, such as those based on the robust Mahalanobis distance (Mahalanobis 1936). One may also opt to use a robust estimator for \(\hat{\varvec{\Sigma }}\), such as the well-known Minimum Covariance Determinant estimator (Rousseeuw 1984), in producing the condition number plot.
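
The flattering effect of contamination on the condition number follows directly from the eigenvalue expression above; the values of \(\phi \), *c*, and the spectrum in this base-R sketch are illustrative:

```r
## Eigenvalues of the expected contaminated covariance:
## (1 - phi)*d_j + c*phi, applied to an ill-conditioned spectrum.
p <- 20; phi <- 0.2; cc <- 5      # 'cc' avoids masking base::c
d <- c(10, rep(1, p - 2), 1e-4)   # decreasing; condition number 1e5
d_cont <- (1 - phi) * d + cc * phi
d[1] / d[p]                       # 1e+05: ill-conditioned
d_cont[1] / d_cont[p]             # about 9: looks well-conditioned
```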

## 7 Conclusion

We have proposed a simple visual display that may be of aid in determining the value of the penalty parameter in ridge-type estimation of the covariance or precision matrix when the number of variables is large relative to the sample size. The visualization we propose plots the spectral condition number against the domain of the penalty parameter. As the value of the penalty parameter increases, the covariance (or precision) matrix will move away from (near) singularity. In some lower-end of the domain this will mean that small increments in the value of the penalty parameter will lead to large decreases of the condition number. At some point, the condition number can be expected to stabilize, in the sense that small increments in the value of the penalty have (relatively) little effect on the condition number. The point of relative stabilization may be deemed to indicate a reasonable (minimal) value for the penalty parameter. Hence, in analogy to usage of the scree plot in factor analysis (Cattell 1966), initial interest will lie with the assessment of the ‘elbow’ of the plot.

Usage of the condition number plot was exemplified in situations concerned with the direct estimation of covariance or precision matrices. The plot may also be of interest in situations in which (scaled versions of) these matrices feed into further computational procedures. For example, it may support the ridge approach to the regression problem \(\mathbf{x } = \mathbf {Y}\varvec{\beta } + \varvec{\epsilon }\). We would then assess the conditioning of \(\mathbf {Y}^{\mathrm{T}}\mathbf {Y} + \lambda \mathbf {I}_p\) for use in the ridge solution for the regression coefficients: \(\hat{\varvec{\beta }} = (\mathbf {Y}^{\mathrm{T}}\mathbf {Y} + \lambda \mathbf {I}_p)^{-1}\mathbf {Y}^{\mathrm{T}}\mathbf{x }\).
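As a minimal illustration of this use (a Python sketch; dimensions and data are arbitrary assumptions), one can assess the conditioning of \(\mathbf {Y}^{\mathrm{T}}\mathbf {Y} + \lambda \mathbf {I}_p\) before computing the ridge solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 100                          # high-dimensional: p > n
Y = rng.standard_normal((n, p))         # design matrix (paper's notation)
x = rng.standard_normal(n)              # response vector (paper's notation)

lam = 1.0
A = Y.T @ Y + lam * np.eye(p)           # regularized Gram matrix

# Y^T Y is positive semi-definite, so the eigenvalues of A are bounded
# below by lam: invertibility is guaranteed for any lam > 0
d = np.linalg.eigvalsh(A)
print("condition number:", d[-1] / d[0])

# Ridge solution via a linear solve (cheaper and more stable than
# forming the explicit inverse)
beta_hat = np.linalg.solve(A, Y.T @ x)
```

The solve is preferred over explicit inversion precisely because the conditioning of \(A\), which the plot assesses, governs the numerical accuracy of the result.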

We explicitly state that we view the proposed condition number plot as an *heuristic tool*. We emphasize ‘tool’, as it gives easy and fast access to penalty-value assessment and determination without proposing an optimal (in some sense) value. Also, in the tradition of exploratory data analysis (Tukey 1977), usage of the condition number plot requires good judgment. Like any heuristic method, it is not without flaws.

Notwithstanding these concerns, the condition number plot provides a fast and generic (i.e., non-target- and non-ridge-type-specific) procedure for regularization parameter determination that is of use when analytic solutions are not available and when other procedures fail. In addition, the condition number plot may aid more formal procedures, both in assessing whether a well-conditioned estimate is indeed obtained and in proposing a reasonable minimal value for the regularization parameter for use in a search grid.

## Notes

### Acknowledgements

The authors would like to thank two anonymous reviewers whose constructive comments have led to an improvement in presentation.

## References

- Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D (1999) LAPACK users’ guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia
- Bien J, Tibshirani R (2011) Sparse estimation of a covariance matrix. Biometrika 98:807–820
- Bilgrau AE, Peeters CFW, Eriksen PS, Boegsted M, van Wieringen WN (2015) Targeted fused ridge estimation of inverse covariance matrices from multiple high-dimensional data classes. Technical report. arXiv:1509.07982 [stat.ME]
- Boyle EA, Li YI, Pritchard JK (2017) An expanded view of complex traits: from polygenic to omnigenic. Cell 169:1177–1186
- Brent RP (1971) An algorithm with guaranteed convergence for finding a zero of a function. Comput J 14:422–425
- Calvetti D, Reichel L, Sorensen DC (1994) An implicitly restarted Lanczos method for large symmetric eigenvalue problems. Electron Trans Numer Anal 2:1–21
- Cattell RB (1966) The scree test for the number of factors. Multivar Behav Res 1:245–276
- Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, Antipin Y, Reva B, Goldberg AP, Sander C, Schultz N (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2:401–404
- Cheney W, Kincaid D (2008) Numerical mathematics and computing, 6th edn. Thomson Brooks/Cole, Belmont
- Chi EC, Lange K (2014) Stable estimation of a covariance matrix guided by nuclear norm penalties. Comput Stat Data Anal 80:117–128
- D’Amato C, Rosa R, Marciano R, D’Amato V, Formisano L, Nappi L, Raimondo L, Di Mauro C, Servetto A, Fulciniti F, Cipolletta A, Bianco C, Ciardiello F, Veneziani BM, De Placido S, Bianco R (2014) Inhibition of Hedgehog signalling by NVP-LDE225 (Erismodegib) interferes with growth and invasion of human renal cell carcinoma cells. Br J Cancer 111:1168–1179
- Daniels MJ, Kass RE (2001) Shrinkage estimators for covariance matrices. Biometrics 57:1173–1184
- Demmel JW (1987) On condition numbers and the distance to the nearest ill-posed problem. Numer Math 51:251–289
- Devlin SJ, Gnanadesikan R, Kettenring JR (1975) Robust estimation and outlier detection with correlation coefficients. Biometrika 62:531–545
- Dormoy V, Danilin S, Lindner V, Thomas L, Rothhut S, Coquard C, Helwig JJ, Jacqmin D, Lang H, Massfelder T (2009) The sonic hedgehog signaling pathway is reactivated in human renal cell carcinoma and plays orchestral role in tumor growth. Mol Cancer 8:123
- Eddelbuettel D (2013) Seamless R and C++ integration with Rcpp. Springer, New York
- Eddelbuettel D, François R (2011) Rcpp: seamless R and C++ integration. J Stat Softw 40(8):1–18
- Fisher TJ, Sun X (2011) Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal 55:1909–1918
- Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–441
- Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N (2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6:pl1
- Gentle JE (2007) Matrix algebra: theory, computations, and applications in statistics. Springer, New York
- Ha MJ, Sun W (2014) Partial correlation matrix estimation using ridge penalty followed by thresholding and re-estimation. Biometrics 70:765–773
- Haff LR (1980) Empirical Bayes estimation of the multivariate normal covariance matrix. Ann Stat 8:586–597
- Haff LR (1991) The variational form of certain Bayes estimators. Ann Stat 19:1163–1190
- Higham DJ (1995) Condition numbers and their condition numbers. Linear Algebra Appl 214:193–213
- Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
- IEEE Computer Society (2008) IEEE standard for floating-point arithmetic. IEEE Std 754–2008, pp 1–70
- Jacobsen A (2015) cgdsr: R-based API for accessing the MSKCC Cancer Genomics Data Server (CGDS). R package version 1.2.5. http://CRAN.R-project.org/package=cgdsr. Accessed 13 Apr 2019
- Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucl Acids Res 28(1):27–30
- Ledoit O, Wolf M (2003) Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J Empir Finance 10:603–621
- Ledoit O, Wolf M (2004a) Honey, I shrunk the sample covariance matrix. J Portf Manag 30:110–119
- Ledoit O, Wolf M (2004b) A well-conditioned estimator for large-dimensional covariance matrices. J Multivar Anal 88:365–411
- LeVeque RJ (2007) Finite difference methods for ordinary and partial differential equations: steady state and time dependent problems. Society for Industrial and Applied Mathematics (SIAM), Philadelphia
- Lian H (2011) Shrinkage tuning parameter selection in precision matrices estimation. J Stat Plan Inference 141:2839–2848
- Lin S, Perlman M (1985) A Monte Carlo comparison of four estimators of a covariance matrix. In: Krishnaiah PR (ed) Multivariate analysis VI. North-Holland, Amsterdam, pp 411–429
- Mahalanobis PC (1936) On the generalised distance in statistics. Proc Natl Inst Sci India 2:49–55
- Peeters CFW, Bilgrau AE, van Wieringen WN (2019) rags2ridges: Ridge estimation of precision matrices from high-dimensional data. R package version 2.2.1. http://cran.r-project.org/package=rags2ridges. Accessed 13 Apr 2019
- Pourahmadi M (2013) High-dimensional covariance estimation. Wiley, Hoboken
- Qiu Y, Mei J (2019) RSpectra: solvers for large-scale eigenvalue and SVD problems. R package version 0.14-0. https://CRAN.R-project.org/package=RSpectra. Accessed 13 Apr 2019
- R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/. ISBN 3-900051-07-0. Accessed 13 Apr 2019
- Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79:871–880
- Schäfer J, Strimmer K (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4: Article 32
- Shuch B, Amin A, Armstrong AJ, Eble JN, Ficarra V, Lopez-Beltran A, Martignoni G, Rini BI, Kutikov A (2015) Understanding pathologic variants of renal cell carcinoma: distilling therapeutic opportunities from biologic complexity. Eur Urol 67:85–97
- Stec R, Grala B, Ma̧czewski M, Bodnar L, Szczylik C (2009) Chromophobe renal cell cancer: review of the literature and potential methods of treating metastatic disease. J Exp Clin Cancer Res 28:134
- Stein C (1975) Estimation of a covariance matrix. Rietz Lecture, 39th Annual Meeting IMS, Atlanta, Georgia
- Stein C (1986) Lectures on the theory of estimation of many parameters. J Math Sci 34:1373–1403
- Subramanya AR, Ellison DH (2014) Distal convoluted tubule. Clin J Am Soc Nephrol 9:2147–2163
- The Cancer Genome Atlas Research Network (2013) Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499:43–49
- Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Boston
- Turing AM (1948) Rounding-off errors in matrix processes. Q J Mech Appl Math 1:287–308
- van Wieringen WN, Peeters CFW (2016) Ridge estimation of inverse covariance matrices from high-dimensional data. Comput Stat Data Anal 103:284–303
- Von Neumann J, Goldstine HH (1947) Numerical inverting of matrices of high order. Bull Am Math Soc 53:1021–1099
- Vujačić I, Abbruzzo A, Wit EC (2015) A computationally fast alternative to cross-validation in penalized Gaussian graphical models. J Stat Comput Simul 85:3628–3640
- Warton DI (2008) Penalized normal likelihood and ridge regularization of correlation and covariance matrices. J Am Stat Assoc 103:340–349
- Whittaker J (1990) Graphical models in applied multivariate statistics. Wiley, Chichester
- Won JH, Lim J, Kim SJ, Rajaratnam B (2013) Condition-number-regularized covariance estimation. J R Stat Soc Ser B 75:427–450
- Yang R, Berger JO (1994) Estimation of a covariance matrix using the reference prior. Ann Stat 22:1195–1211
- Yuan KH, Chan W (2008) Structural equation modeling with near singular covariance matrices. Comput Stat Data Anal 52:4842–4858

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.