The randomized information coefficient: assessing dependencies in noisy data
When differentiating between strong and weak relationships using information-theoretic measures, the variance plays an important role: the higher the variance, the lower the chance of correctly ranking the relationships. We propose the randomized information coefficient (RIC), a mutual-information-based measure with low variance, to quantify the dependency between two sets of numerical variables. We first formally establish the importance of achieving low variance when comparing relationships using mutual information estimated with grids. Second, we experimentally demonstrate the effectiveness of RIC for (i) detecting noisy dependencies and (ii) ranking dependencies in the applications of genetic network inference and feature selection for regression. Across these tasks, RIC is very competitive against 16 other state-of-the-art measures. Other prominent features of RIC include its simplicity and efficiency, making it a promising new method for dependency assessment.
Keywords: Dependency measures · Noisy relationships · Normalized mutual information · Randomized ensembles
Simone Romano’s work was supported by a Melbourne International Research Scholarship (MIRS). James Bailey’s work was supported by an Australian Research Council Future Fellowship. Experiments were carried out on the Amazon cloud, supported by an AWS in Education Grant Award.
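The abstract describes RIC as a low-variance dependency measure built from mutual information estimated with grids and randomized ensembles. As a rough illustration only (not the authors' exact estimator: the grid-generation scheme, number of bins, and normalization here are simplified assumptions), an RIC-style score can be sketched by averaging normalized mutual information over an ensemble of random discretizations of the two variables:

```python
import numpy as np

def random_discretize(v, rng, max_bins=10):
    """Assign each value to a bin using randomly placed cut points."""
    n_bins = rng.integers(2, max_bins + 1)
    cuts = np.sort(rng.uniform(v.min(), v.max(), size=n_bins - 1))
    return np.digitize(v, cuts)

def normalized_mi(a, b):
    """Normalized mutual information computed from the joint contingency table."""
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1.0
    p = joint / n
    px = p.sum(axis=1, keepdims=True)   # marginal of a
    py = p.sum(axis=0, keepdims=True)   # marginal of b
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum()
    hx = -(px[px > 0] * np.log(px[px > 0])).sum()
    hy = -(py[py > 0] * np.log(py[py > 0])).sum()
    # geometric-mean normalization keeps the score in [0, 1]
    return mi / max(np.sqrt(hx * hy), 1e-12)

def ric_sketch(x, y, n_grids=50, seed=0):
    """Average normalized MI over an ensemble of random grids (illustrative)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([
        normalized_mi(random_discretize(x, rng), random_discretize(y, rng))
        for _ in range(n_grids)
    ]))
```

Averaging over many random grids is what drives the variance down: each individual grid gives a noisy normalized-MI estimate, but the ensemble mean is far more stable, which is exactly the property the abstract argues matters when ranking relationships.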