Improving network inference algorithms using resampling methods
Relatively small changes to gene expression data can dramatically affect the co-expression networks inferred from those data, which, in turn, can significantly alter the subsequent biological interpretation. This error propagation is an underappreciated problem that, while hinted at in the literature, has not yet been thoroughly explored. Resampling methods (e.g. bootstrap aggregation, the random subspace method) are hypothesized to alleviate variability in network inference methods by minimizing outlier effects and distilling persistent associations in the data. However, the efficacy of this approach assumes that the generalization from statistical theory holds true in biological network inference applications.
We evaluated the effect of bootstrap aggregation on networks inferred by commonly applied network inference methods in terms of stability (resilience to perturbations in the underlying expression data), a metric for accuracy, and the functional enrichment of edge interactions.
Bootstrap aggregation results in improved stability and, depending on the size of the input dataset, a marginal improvement in accuracy, as assessed by each method’s ability to link genes in the same functional pathway.
Keywords: Gene regulatory network inference; Random subspace method; Resampling; Bootstrapping; Aggregation

Abbreviations
(B)C3NET: (Bootstrapped) Conservative Causal Core Network
(B)CLR: (Bootstrapped) Context Likelihood of Relatedness
ARACNE: Algorithm for the Reconstruction of Accurate Cellular Networks
AUPR: Area under the precision-recall curve
DREAM: Dialogue for Reverse Engineering Assessments and Methods
FEO: Functional edge overlap
GENIE3: GEne Network Inference with Ensemble of trees
MAE: Mean absolute error
NaN: Not a number
RLowPC: Relevance Low order Partial Correlation
Little is known about gene regulatory networks, even in one of the simplest bacteria, Escherichia coli. E. coli has 4,377 genes and thus almost 10 million possible binary edge interactions. If more complicated interaction motifs (e.g. feed-forward, cascade, fan-in, fan-out) are considered, this number grows to 14 billion for 3-gene interactions and 15 trillion for 4-gene interactions. In addition, few interactions in E. coli have been confirmed through experimentation, and even fewer in more complex organisms, such as S. cerevisiae. This results in an inability to assess the performance of inferred networks in all but the simplest organisms, and even then with a significantly limited set of known interactions with which to make a comparison. Synthetic networks have been used to circumvent the absence of fully annotated interaction networks, but performance on them does not necessarily generalize to real-world applications.
Other methods have been developed for statistical aggregation in network inference [12, 13, 14, 15, 16]. Filosi et al. 2014 introduced several indicators of stability based on resampling by cross-validation (specifically, leave-one-out and k-fold). These stability indicators offer quantitative metrics for assessing the resilience of a given method to the presence or absence of conditions, but the resampled networks were not aggregated in an attempt to improve predictions. In de Matos Simoes et al. 2012, bootstrap aggregation was applied to the C3NET (Conservative Causal Core network) algorithm, yielding bootstrapped-C3NET, or BC3NET, which demonstrated improvement in F1 score over C3NET. Their approach additionally provided a statistical procedure for determining an optimal confidence threshold parameter, a nontrivial selection, using the networks generated during bootstrapping. Guo et al. 2017 employed partial correlations (i.e. isolation of a single gene pair at a time), extracting only the most highly correlated relationships as edges in their RLowPC (Relevance Low order Partial Correlation) method. Friedman et al. 1999 applied bootstrapping successfully, but by resampling genes rather than conditions, and only on small, synthetic datasets. GENIE3 (GEne Network Inference with Ensemble of trees) and ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) implicitly perform subspace resampling, but neither evaluates the associated effects explicitly. Our work expands on these previous efforts by focusing on how the initial set of expression conditions used to infer the network, together with the parameters of the resampling approach, affects both the stability and the accuracy of the resulting networks when applied to real-world datasets.
We evaluate feature bootstrap aggregating as a potential remedy to the observed variability in network inference applications, in terms of both stability (i.e. resistance to variability in inferred relationships) and accuracy, as scored against known interaction standards. We then address the validity of the assumption that stability and accuracy are correlated, an assumption that arises in network inference problems in which direct evaluation is not possible. Finally, we show that bootstrapping improves both stability and accuracy when inferring networks, contingent on the underlying inference method and the number of input conditions.
Sensitivity to number of conditions
To examine the effects of removing data from the dataset used for network inference, we obtained a large set of transcriptomics data representing > 800 conditions (differences in growth conditions, genetic differences, time points, etc.) for E. coli used by the DREAM5 competition. We first applied a standard mutual-information-based method for network inference, the context likelihood of relatedness (CLR) algorithm, to the entire dataset to generate a network. Sets of individual conditions were then removed randomly in successively larger amounts, and constituent networks were inferred from each subset of conditions, 10 times for each step. The mean absolute error (MAE), or the average of the absolute differences between two sets of observations, of each constituent network was calculated relative to the parent network. Figure 1 shows the effect of such perturbations. As we previously observed with a much smaller starting set of data in Synechococcus, as more conditions are removed, the resulting constituent networks generated by CLR increasingly differ from the network generated with all conditions (the parent network). Though approaches have been used that purport to address this variability issue, the efficacy of these approaches has not been rigorously evaluated [4, 5, 13, 15, 16].
To replicate the bootstrapping approach commonly used by us and others, we subsample a specific fraction of the total dataset a number of times and then aggregate the results by averaging edge weights. Using the CLR method as the underlying network inference method, we refer to this approach as BCLR.
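This subsample-and-average scheme can be sketched as follows. Here `infer_network` is a hypothetical stand-in for CLR (or any method mapping a genes-by-conditions expression matrix to an edge-weight matrix), and the parameter names are ours rather than from a published implementation:

```python
import numpy as np

def bclr(expression, infer_network, frac=0.05, n_boot=200, rng=None):
    """Aggregate networks inferred from random subsets of conditions.

    expression: genes x conditions matrix.
    infer_network: callable mapping an expression matrix to an
        edge-weight matrix (a stand-in for CLR here).
    frac: subsampling fraction of conditions per iteration.
    """
    rng = np.random.default_rng(rng)
    n_cond = expression.shape[1]
    k = max(2, int(frac * n_cond))  # conditions drawn per iteration
    consensus = None
    for _ in range(n_boot):
        cols = rng.choice(n_cond, size=k, replace=False)  # subsample conditions
        net = infer_network(expression[:, cols])          # constituent network
        consensus = net if consensus is None else consensus + net
    return consensus / n_boot  # average edge weights -> consensus network
```

For example, passing a toy `infer_network` such as an absolute-correlation function yields a consensus matrix of averaged edge confidences over all bootstrap draws.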
Network inference methods require data from a number of perturbations to be able to draw accurate inferences of real relationships. Based on the premise of bootstrapping, and results reported elsewhere [4, 12, 13], we hypothesized that bootstrapping would improve the stability of the inferred network relative to the parent network inferred with all the conditions. We therefore assessed the stability of BCLR with 5% subsampling fraction over a range of different initial dataset sizes.
In terms of accuracy, Fig. 3 demonstrates the effectiveness of BCLR across the majority of condition set sizes. With larger initial dataset sizes BCLR outperforms CLR in its ability to infer correct relationships among transcription factor-gene interactions. For smaller condition sets (i.e. > 50% removed), CLR appears to be more robust to the loss of conditions. However, 40% of the full set is 322 conditions, meaning BCLR with 5% subsampling fraction only “sees” 16 conditions during each iteration.
Choice of inference method
Due to the additional computational overhead introduced by bootstrap iterations, applications must consider whether the marginal gains to stability and accuracy are justified. This is largely based on the size of the dataset, the inference algorithm used, and the availability of high-performance computing resources, and will vary on a case-by-case basis.
Accuracy as evaluated by AUPR increased marginally over non-bootstrapped inference methods for certain dataset sizes and subsampling fractions. We caution that these results hold for the limited datasets evaluated in this work, and with incomplete (silver) standards; generalization to other real-world datasets is assumed, but not guaranteed. Further, when performing network inference on unlabeled datasets, accuracy cannot be assessed; we can only employ the inference method that performs best against known interactions. We can, however, assess stability without labeled data, but its use as a proxy for accuracy may not be justified, as discussed in the next section.
Stability as a proxy for accuracy
Stability, that is, resilience to changes in the underlying data, is not necessarily correlated with accuracy, despite a general inverse relationship between AUPR and MAE (see Additional file 3: Figure S2 and Additional file 4: Figure S3). BCLR at 5% subsampling showed the greatest AUPR improvement when no conditions were removed, which corresponds to the only point at which BCLR was less stable than CLR. Similarly, with smaller datasets (i.e. > 60% of conditions removed), BCLR was more stable than CLR, but CLR resulted in higher-accuracy networks. That said, when between 10 and 60% of conditions were removed, BCLR exceeded CLR in both metrics (Fig. 3). But without prior knowledge of true interactions, and therefore no standard by which to calculate AUPR, a novel study cannot optimize for accuracy.
By pursuing stability as a proxy and making use of all conditions, CLR would be selected over BCLR, despite the improved accuracy of the latter. Thus, our recommendation would be to optimize for the most stable bootstrapped network. That is, (i) construct a parent network at each subsampling fraction, (ii) for each subsampling fraction, iteratively remove conditions and run BCLR, comparing each result to the respective parent network, (iii) select the subsampling fraction that results in the most stable condition-removal curve, and finally (iv) use the resulting subsampling fraction with all available data.
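Assuming BCLR is available as a callable, the four-step selection procedure above might look like the following sketch; `bclr`, `fractions`, and `removal_grid` are hypothetical names of ours, and the loop simply scores each subsampling fraction by the mean MAE of its condition-removal curve:

```python
import numpy as np

def select_fraction(expression, bclr, fractions, removal_grid, rng=None):
    """Pick the subsampling fraction whose condition-removal curve is
    most stable (lowest mean MAE against its own parent network).

    bclr(expression, frac) must return an edge-weight matrix; it is a
    stand-in for the BCLR procedure described in the text.
    """
    rng = np.random.default_rng(rng)
    n_cond = expression.shape[1]
    best_frac, best_score = None, np.inf
    for frac in fractions:
        parent = bclr(expression, frac)            # (i) parent at this fraction
        maes = []
        for remove in removal_grid:                # (ii) remove conditions, re-run
            keep = rng.choice(n_cond, size=int((1 - remove) * n_cond),
                              replace=False)
            child = bclr(expression[:, keep], frac)
            maes.append(np.mean(np.abs(child - parent)))
        score = np.mean(maes)                      # (iii) stability of the curve
        if score < best_score:
            best_frac, best_score = frac, score
    return best_frac                               # (iv) use with all data
```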
Functional edge overlap
Although AUPR is a measure of accuracy that has been used in a number of studies, including this one, we wanted to include a more general measurement of network accuracy in addition to AUPR. Rather than focusing only on regulator-target pairs, we took advantage of the abundant functional annotation information available for E. coli and determined how often each network inference method could link genes in the same functional category. This approach revealed several strategies for inferring accurate networks: use as much data as possible; prefer smaller networks, which are more accurate than larger ones; and note that BCLR has a slight advantage over CLR with small networks but not with large networks.
This work explored the effects of feature bootstrap aggregation on the network inference algorithm in terms of stability and accuracy. With respect to stability, BCLR improved over CLR at recreating its parent network across condition set sizes for small subsampling fractions (e.g. 5%). Larger subsampling fractions had little effect on resulting networks, providing marginal improvements in stability. In terms of accuracy, BCLR demonstrated a noticeable improvement over CLR at low subsampling fractions and for cases in which the initial pool of conditions was sufficiently large. This is also supported by FEO analysis. In concert, these findings reveal several important considerations when using CLR (or BCLR) to infer gene regulatory networks in novel studies where true interactions are unknown.
To evaluate the effects of data structure and variability on network inference, we chose to utilize a large transcriptional compendium previously used for the DREAM5 network inference competition. This included gene expression data from 805 perturbations of Escherichia coli. For this dataset, the underlying network is not completely known, but we considered the set of known transcription factor-target relationships used by the DREAM5 competition to be the ‘silver standard’ for evaluation. This silver standard provides a reasonable way to compare the performance of different network inference approaches. In this study we focus on the effect of gene expression measurements on network inference and so refer to each individual expression set arising from an individual perturbation as a ‘condition’.
Networks were inferred using the context likelihood of relatedness (CLR) algorithm or the Pearson correlation coefficient. For CLR, default parameters were used, except for the number of bins and the order of the fitted splines, which were set to 10 and 3, respectively. For each possible interaction, CLR produces a z-score that corresponds to the significance of the interaction given its contextual neighborhood. These values were used as generated in deciding the confidence of a given edge.
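The contextual scoring idea behind CLR can be illustrated with a short sketch. This operates on a precomputed symmetric similarity matrix and omits CLR's spline-based mutual-information estimation, so it illustrates the z-scoring step rather than reproducing the reference implementation:

```python
import numpy as np

def clr_scores(sim):
    """Contextual scoring on a symmetric similarity matrix.

    sim[i, j] is a pairwise similarity (CLR uses spline-estimated
    mutual information; any symmetric score works for this sketch).
    Each edge is z-scored against the score distributions of both of
    its endpoints, and the two z-scores are combined.
    """
    mean = sim.mean(axis=1, keepdims=True)
    std = sim.std(axis=1, keepdims=True)
    z = np.maximum(0.0, (sim - mean) / std)  # z-score within each gene's context
    return np.sqrt(z**2 + z.T**2)            # combine the two endpoint z-scores
```

The combined score is symmetric and non-negative, so an edge stands out only if it is unusually strong relative to the background of at least one of its endpoints.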
Network generation methods (e.g. with different bootstrapping parameters) were evaluated against respective parent networks and the silver standard to assess stability and accuracy, respectively. We define a parent network as the network inferred by a particular algorithm (CLR, Pearson, or their bootstrapped variants) using the entire set of conditions. We define stability as the similarity, evaluated as mean absolute error (MAE), between a network inferred from a subset of conditions and the network inferred from the complete set of data (the parent network). Accuracy is defined as how well a network captures the set of known, experimentally defined transcription factor-target relationships.
The provided standards contain edge information for 152,280 of approximately 10 million possible interactions, 2,066 of which are positive. To account for this sparsity when comparing to generated networks, the standard was cast as a dense matrix wherein absent edges were encoded as “not a number” (NaN), effectively masking unknown edges in performance evaluations, a practice generally accepted in the literature. It is worth noting that the standard is a directed graph (i.e. an edge from node A to node B does not necessarily guarantee an edge from node B to node A), whereas the networks generated by Pearson correlation and CLR are undirected graphs (i.e. edges between two nodes exist in both directions).
Stability was evaluated by MAE against the respective parent network. Lower values indicate greater similarity and, by extension and our definition, stability. Accuracy was evaluated by the area under the precision-recall curve (AUPR). AUPR is parameterized by threshold selection (here, a z-score cutoff) over a given test network benchmarked against a standard. This enables performance assessment without the need for explicit threshold selection, a nontrivial decision with implications beyond the scope of this work. There are other acceptable metrics to assess network similarity (root mean squared error, Jaccard index, etc.), but AUPR was selected for its wide use in biological network inference applications, because thresholds need not be selected, and for its better behavior, compared with alternatives such as the area under the receiver operating characteristic curve, when dealing with imbalanced datasets.
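A minimal sketch of the two evaluation metrics, with the NaN masking described above, follows. The AUPR here is a simple step-wise integration of the precision-recall curve and may differ in small details from the implementation used for the reported results:

```python
import numpy as np

def masked_mae(net_a, net_b):
    """MAE over edges, ignoring NaN-masked (unknown) entries."""
    return float(np.nanmean(np.abs(net_a - net_b)))

def aupr(scores, standard):
    """Area under the precision-recall curve against a standard whose
    unknown edges are NaN (masked out), computed by sweeping the
    confidence threshold over all known edges."""
    known = ~np.isnan(standard)
    s, y = scores[known], standard[known].astype(bool)
    order = np.argsort(-s)                 # descending confidence
    y = y[order]
    tp = np.cumsum(y)                      # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    # step-wise integration of the precision-recall curve
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```

A network that ranks every true edge ahead of every known negative scores an AUPR of 1.0, regardless of how many edges are masked as unknown.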
We also determined accuracy by examining how many edges in the network connected two genes in the same functional category. Functional information for E. coli genes was obtained from EcoCyc. We use a metric termed the functional edge overlap (FEO) ratio: the number of edges linking annotated genes in the same functional category divided by the number of all edges linking annotated genes. If an edge links to a gene that has no functional annotation, that edge is ignored in the FEO analysis. FEO ratios can range from 0 (no edges link annotated genes in the same category) to 1.0 (all edges link annotated genes in the same category).
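The FEO ratio is straightforward to compute. The sketch below assumes each gene may carry multiple categories (stored as sets), which is our representational choice rather than something specified in the text:

```python
def feo_ratio(edges, annotation):
    """Functional edge overlap: among edges whose endpoints are both
    annotated, the fraction linking genes in a shared category.

    edges: iterable of (gene_a, gene_b) pairs.
    annotation: dict mapping gene -> set of functional categories.
    """
    same = total = 0
    for a, b in edges:
        if a not in annotation or b not in annotation:
            continue  # edges involving unannotated genes are ignored
        total += 1
        if annotation[a] & annotation[b]:  # at least one shared category
            same += 1
    return same / total if total else float("nan")
```

For example, an edge to an unannotated gene drops out of both numerator and denominator, matching the definition above.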
Feature bootstrap aggregation
Because varying the number of conditions in biological network inference can have a substantial impact on the inferred network (Fig. 1), we explored bootstrap aggregation over conditions with the goal of generating a stable network, that is, a network less sensitive to the presence or absence of particular conditions. Additionally, we hypothesized that bootstrap aggregation would result in improved network inference performance (i.e. accuracy).
We implemented condition bootstrap aggregation by randomly selecting a fraction of the full condition set without replacement, then running CLR on this subset to generate a constituent network. This process was repeated n times until convergence (n = 200; Additional file 2: Figure S1). Convergence was demonstrated by evaluating the addition of each constituent network after each iteration in terms of AUPR with the silver standard, as well as MAE with the consensus network from the previous iteration. Resulting constituent networks were aggregated by averaging their edge weights, yielding a consensus network. The bootstrap-aggregated version of CLR is henceforth referred to as BCLR. Note also that we implemented a bootstrap-aggregated variant of Pearson’s correlation method for comparison purposes, but this algorithm was not the primary focus of this work.
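The iterate-until-convergence loop can be sketched with a running (incremental) mean and the MAE between consecutive consensus networks as the diagnostic. `sample_network` is a hypothetical stand-in for one CLR run on a random condition subset, and the tolerance value is illustrative rather than the threshold used in this work:

```python
import numpy as np

def aggregate_until_stable(sample_network, tol=1e-4, max_iter=200):
    """Running average of constituent networks, tracking the MAE between
    consecutive consensus networks as a convergence diagnostic.

    sample_network: callable returning one constituent network per call
    (standing in for one CLR run on a random condition subset).
    """
    consensus, prev = None, None
    for i in range(1, max_iter + 1):
        net = sample_network()
        if consensus is None:
            consensus = net.astype(float)
        else:
            prev = consensus.copy()
            consensus = consensus + (net - consensus) / i  # incremental mean
        if prev is not None and np.mean(np.abs(consensus - prev)) < tol:
            break  # consensus no longer changing appreciably
    return consensus, i
```

The incremental-mean update is numerically equivalent to averaging all constituent networks at the end, but lets the MAE diagnostic be checked after every iteration.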
Assuming CLR-specific parameters do not change, bootstrap aggregation introduces two additional parameters: the number of bootstrap iterations and the subsampling fraction. The number of iterations is chosen to ensure convergence (Additional file 2: Figure S1), so ultimately only the subsampling fraction introduces additional complexity. A sweep of subsampling fractions was performed to assess their effect on performance in terms of accuracy only, as each network is itself the associated parent, precluding a stability assessment (Fig. 2). This was achieved by running BCLR with subsampling fractions ranging from 0.01 to 1.0.
Limiting input conditions
Because we are assessing these methods on a very large dataset of transcriptional data from E. coli (> 800 conditions), we were also interested in the effects of limiting the number of conditions considered to levels that would be more reasonable to expect for organisms that have not been studied as extensively. We therefore compared the performance of the methods (CLR and BCLR) on input data limited to randomly selected subsets of the initial data, from 10 to 90% of the initial conditions. Because there is variability associated with which specific conditions are removed, this process was repeated 10 times for each fraction removed, from which a mean and confidence interval were constructed for the relevant plots. For information on how the confidence intervals were constructed, see Additional file 1: Supplemental Results.
Differences in mean AUPR and/or MAE were tested for significance using a two-sided Welch's t-test with the null hypothesis that the sample means are equal. Unequal population variances were assumed when conducting the test, as sample variances were observed to differ between the results for each algorithm (CLR, Pearson) and its bootstrapped variant (BCLR, BPearson). P-values are reported directly.
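For reference, the unequal-variance (Welch's) t statistic and its Welch-Satterthwaite degrees of freedom can be computed as follows; obtaining the two-sided p-value then requires the t-distribution CDF (e.g. from scipy.stats), which we omit to keep the sketch dependency-free:

```python
import numpy as np

def welch_t(x, y):
    """Welch's t statistic and degrees of freedom for two samples with
    possibly unequal variances, as used to compare mean AUPR/MAE
    between an algorithm and its bootstrapped variant."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df  # the two-sided p-value follows from the t distribution
```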
PNNL is operated for DOE by Battelle Memorial Institute under contract DE-AC05-76RL01830.
This research was supported by the Department of Energy Office of Biological and Environmental Research (BER) and is a contribution of the Scientific Focus Area “Phenotypic response of the soil microbiome to environmental perturbations”. The funding body did not influence the design of the study; the collection, analysis, and interpretation of data; or the writing of this manuscript.
Availability of data and materials
Gene expression data used in this study are available from the DREAM5 website (https://www.synapse.org/#!Synapse:syn2787209/wiki/70351). Code used for network generation and evaluation is available from the corresponding author upon reasonable request.
SMC and CCO developed the software. SMC performed the majority of the analysis, generated figures, and prepared the manuscript. RSM provided transcriptomic data and carried out FEO analysis. RR provided guidance for the approaches used. JEM conceived of and provided guidance and direction for the project. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1. Gama-Castro S, Salgado H, Santos-Zavaleta A, Ledezma-Tejeida D, Muñiz-Rascado L, García-Sotelo JS, Alquicira-Hernández K, Martínez-Flores I, Pannier L, Castro-Mondragón JA. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res. 2015;44(D1):D133–43.
- 11. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.
- 14. Friedman N, Goldszmidt M, Wyner A. Data analysis with Bayesian networks: a bootstrap approach. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.; 1999. p. 196–205.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.