Network analysis of gene essentiality in functional genomics experiments
Many genomic techniques have been developed to study gene essentiality genome-wide, such as CRISPR and shRNA screens. Our analyses of public CRISPR screens suggest that protein interaction networks, when integrated with gene expression or histone marks, are highly predictive of gene essentiality. Meanwhile, the quality of CRISPR and shRNA screen results can be significantly enhanced through network neighbor information. We also found network neighbor information to be very informative for prioritizing ChIP-seq target genes and survival indicator genes from tumor profiling. Thus, our study provides a general method for gene essentiality analysis in functional genomic experiments (http://nest.dfci.harvard.edu).
Keywords: CRISPR screen, Network analysis, Gene essentiality
Essential genes are those genes critical for cell viability under certain contexts. Recent years have seen the rapid development of functional genomics techniques for studying gene essentiality genome-wide. For example, large-scale shRNA screens have been used to search for essential genes in diverse cell lines. If a specific transcription factor drives cell viability under a certain condition, the ChIP-seq technique can be used to profile its regulatory targets to further find essential genes. Many computational methods have also been developed to predict context-specific gene essentiality through integration of gene expression, molecular alterations, and biological pathways.
Recently, the CRISPR (clustered regularly interspaced short palindromic repeats) screen emerged as an exciting new approach to profile gene essentiality at genome scale [4, 5, 6, 7, 8, 9, 10, 11]. In the CRISPR system, single-guide RNAs (sgRNAs) direct Cas9 nucleases to induce double-strand breaks (DSBs) at targeted genomic regions [12, 13]. When the error-prone non-homologous end-joining mechanism repairs the DSBs, insertions and deletions occur with high frequency, which can produce a non-functional protein. Catalytically inactive Cas9 fused with a transcriptional activator or repressor has also been used to modulate gene expression at targeted loci [8, 9, 14, 15, 16, 17]. Combined with lentiviral delivery, CRISPR systems enable genome-scale functional screening in a cost-effective manner [4, 5, 6, 7, 8, 9, 10, 11]. In CRISPR screens, sgRNAs targeting candidate genes are synthesized, and viral integration enables readout through next-generation sequencing. The relative abundances of each integrated sgRNA are compared between conditions, and the importance of each target gene is inferred from the effect of its sgRNAs on cell growth.
The progress of CRISPR screen technology has enabled systematic and reliable determination of gene essentiality under diverse conditions. The high-quality gene essentiality profiles from CRISPR could enable a better comparison among essentiality prediction methods and better identification of distinct features of the essential genes. Such features not only facilitate a better understanding of the CRISPR screen data, but can also help prioritize the leads from CRISPR screens. From the analysis of yeast protein interactions, it is well known that highly connected proteins in a network (degree hubs) are more likely to be essential for viability [19, 20, 21]. Thus, we hypothesize that the gene essentiality outcome in CRISPR screens might depend on the gene connectivity in biological networks. Protein interaction networks have been integrated to improve the quality of RNAi screen results, which are very noisy due to off-target effects and low knockdown efficiency [22, 23, 24, 25]. These previous studies of RNAi screens indicate that the quality of CRISPR screen results may also be improved by integration with protein interaction networks.
In this study, we took a network perspective and developed a method called NEST (Network Essentiality Scoring Tool) to systematically analyze recent genome-wide CRISPR screen data. We found that gene essentiality determined by CRISPR screens largely depends on the expression level of interacting genes in the biological network. Moreover, the quality of CRISPR and shRNA screen data can be further improved by NEST after considering the screen outcome of the gene neighborhood. Besides CRISPR and shRNA screens, NEST is also generally applicable to many other types of genomics data analysis, such as ChIP-seq target gene prioritization and survival gene identification from tumor profiling data.
Results and discussion
NEST predicts gene essentiality in CRISPR screen
We first collected recently published CRISPR loss-of-function screen data [4, 5, 8], and selected three cell lines (K562, HL60, and A375) with publicly available gene expression data [26, 27, 28]. Significant CRISPR screen gene hits were called with the MAGeCK software. In CRISPR screens for growth phenotype, most significant genes are negatively selected, which means these genes are essential in the corresponding experimental condition (Additional file 1: Figure S1). To identify distinct features of gene essentiality in CRISPR screens, we developed a network-based method called NEST (Network Essentiality Scoring Tool), and found the following metric to give reliable performance.
For gene essentiality prediction in the K562 CRISPRi screen, NEST achieved a false positive rate of 0.2 at a true positive rate of 0.8, with an area under the ROC curve (AUC) of 0.89. The AUC of the NEST score is consistently better than that of network degree, gene expression, and shRNA screen data from the Achilles project (DeLong P value <1e-10 for all comparisons). Similar performance differences were also observed in the CRISPR screens in HL60 and A375 (Additional file 1: Figure S3a). To visualize the CRISPR prediction performance in an intuitive way, we plotted the rank percentile of NEST scores for essential genes and non-essential genes in the CRISPR screen (Fig. 1c and Additional file 1: Figure S3B). The NEST ranks are significantly higher for essential genes than for non-essential genes (Wilcoxon rank-sum P value <1e-10 for cell lines). Besides the STRING network, we also used other large-scale networks for CRISPR outcome prediction. However, we did not find any performance improvement from using either another network or a network merged from several data sources (Additional file 1: Figure S4).
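The AUC reported above can be read as the probability that a randomly chosen essential gene receives a higher score than a randomly chosen non-essential gene. The following Python sketch computes the AUC directly from that pairwise definition. The function name `roc_auc` and the input layout are illustrative assumptions, not part of the NEST software, and the quadratic loop is meant for clarity rather than for genome-scale speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from the pairwise definition.

    scores : per-gene prediction scores (higher = predicted more essential)
    labels : truthy for gold-standard essential genes, falsy for controls
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a score that carries no information about essentiality, and 1.0 to a perfect separation of the two gene sets.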
The results above suggest that if a gene’s network neighbors are over-expressed in some conditions, the gene itself becomes more essential. We also found that genes with high NEST scores are tightly clustered in the protein interaction network. The STRING network genes were grouped into 2,271 dense complexes using SPICi. Genes with high NEST scores tend to fall into fewer STRING clusters than they do when gene names are shuffled (Additional file 1: Figure S5). Thus, a high NEST score may indicate that the gene is a member of an active protein complex.
To test the prediction specificity, we applied NEST to gene expression profiles of 56 cell lines profiled by the Roadmap project. Measured by rank-sum test Z-scores, the K562 CRISPRi screen data achieved a higher association with NEST scores computed in K562 than with those computed in any other cell line (Fig. 1d). Similarly, HL60 and A375 CRISPR screen data also achieved higher associations with NEST scores in the same cell line (Additional file 1: Figure S3C). Housekeeping genes, such as ribosome members, are often selected as essential genes in CRISPR screens [5, 8]. Thus, we further tested whether the high prediction power of NEST scores was purely derived from the same set of housekeeping genes. The prediction performance of NEST remains high after removal of previously annotated housekeeping genes (Additional file 1: Figure S6). Notably, the majority of essential genes selected in CRISPR screens do not overlap between K562, HL60, and A375 (Additional file 1: Table S1). Thus, our NEST score is a feature of CRISPR-selected gene essentiality that is orthogonal to the universal housekeeping genes shared across conditions.
Confounding factors for NEST prediction performance
Many methods have previously been developed for gene essentiality prediction. Since CRISPR screens measure gene essentiality, any of these methods could be predictive of CRISPR outcome. In a recent DREAM challenge, contenders were asked to develop algorithms to predict cell-specific gene essentiality. Among cell lines included in the DREAM challenge, A375 has CRISPR screen data available. We compared the CRISPR outcome prediction performance between our method and the top three methods from the DREAM challenge, and found NEST to consistently outperform all DREAM winners (Fig. 1f). Besides the methods in DREAM, we also compared the performance of NEST with other methods that use gene expression and networks to predict gene essentiality [34, 35]. NEST significantly outperformed all other methods (Additional file 1: Figure S8 and Additional file 1: Methods).
Besides gene expression, we also used H3K27ac histone mark data to compute NEST scores and tested the gene essentiality prediction performance. Previously, we developed a method to calculate the regulatory potential (RP) scores of a histone modification on each gene promoter from the ChIP-seq profile [36, 37]. Based on our previous method, gene-level RP scores in K562 cells were computed using the Roadmap H3K27ac ChIP-seq profile. For each gene, NEST computed a neighbor H3K27ac score as the sum of the H3K27ac RP scores of its neighbor genes connected in the protein interaction network, weighted by the interaction confidence (Fig. 1a). H3K27ac NEST scores could also reliably predict gene essentiality in the K562 CRISPRi screen (Additional file 1: Figure S3), suggesting the applicability of NEST analysis to histone modification data.
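The neighbor score described above is a confidence-weighted sum over network neighbors. A minimal Python sketch follows, assuming an undirected edge list with confidence weights and a dictionary of per-gene values (normalized expression or RP scores). The function name `nest_scores`, the input layout, and the choice to let neighbors without a value contribute zero are illustrative assumptions and are not taken from the NEST source code.

```python
def nest_scores(edges, values):
    """Confidence-weighted sum of neighbor values for every gene.

    edges  : iterable of (gene_a, gene_b, confidence) undirected interactions
    values : dict mapping gene -> value (e.g. normalized expression)
    A neighbor missing from `values` contributes 0 (assumption of this sketch).
    """
    scores = {}
    for a, b, confidence in edges:
        if a == b:
            continue  # ignore self-interactions
        # Each edge adds the partner's value, scaled by the edge confidence.
        scores[a] = scores.get(a, 0.0) + confidence * values.get(b, 0.0)
        scores[b] = scores.get(b, 0.0) + confidence * values.get(a, 0.0)
    return scores
```

For example, with edges `[("A", "B", 0.9), ("A", "C", 0.5)]` and values `{"B": 2.0, "C": -1.0}`, gene A receives 0.9 × 2.0 + 0.5 × (−1.0) = 1.3. The gene's own value does not enter its score, only the values of its neighbors.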
NEST enhances the quality of CRISPR screen results
For each gene, NEST calculated a neighbor CRISPR score by adding up the CRISPR fold change scores among neighbor genes connected in the STRING network, weighted by the interaction confidence. This NEST score is significantly more predictive of the gold standard outcome than the original CRISPR scores for both protective and sensitizing genes (Fig. 2a, b, DeLong test P value = 0.010 for protective genes, P value = 9.92e-14 for sensitizing genes). Moreover, when we added different levels of Gaussian white noise to the CRISPR screen scores, the prediction performance of the NEST score degraded more slowly than that of the original CRISPR scores (Fig. 2c, d). As a control, when we calculated the NEST scores from a randomized network, the prediction power became significantly worse (Fig. 2e, P value <0.001 for both protective and sensitizing genes). Thus, through the connectivity of the protein interaction network, NEST can enhance the quality of CRISPR screen results.
As an example of a gene with a high NEST score, COG6 is a member of the Golgi complex and its NEST score is significantly larger than expected (permutation test P value <0.001). COG6 is connected with many other members of the Golgi complex (Fig. 2f), and most of them have positive CRISPR screen fold change scores. Since they are connected with each other in the network, they mutually boost each other’s NEST scores. Our result is consistent with the knowledge that cholera toxin needs to enter host cells and travel through the trans-Golgi network to take effect.
The above results suggest that if a gene’s network neighbors are under CRISPR screen selection, the gene itself is more likely to be under CRISPR screen selection in the same direction. Besides the CRISPR screen, we applied NEST to the Achilles shRNA screen data. NEST can also significantly enhance the quality of shRNA screen results (Additional file 1: Figure S9). Thus, in general, the quality of functional genomic screen results can be improved by considering gene network neighbor information.
Methods have previously been developed to improve the quality of RNAi screen results through integration with protein interaction networks. For CRISPR enhancement, we compared NEST against NePhe, a leading method for RNAi screen network analysis. Using the K562 toxin screen as the gold standard, we found that NePhe and NEST show similar performance as measured by ROC curves (Additional file 1: Figure S10AB). However, while NePhe used 14 GB of memory and 6.2 h of running time, NEST used only 8.3 MB of memory and 10.8 s (Additional file 1: Figure S10C). Thus, NEST maintains the reliable screen enhancement performance of the previous method with much better computational efficiency.
NEST prioritizes ChIP-seq essential targets
Besides functional genomic CRISPR/shRNA screens, many other genomic experimental techniques can be used to search for essential genes. For example, if a specific transcription factor (TF) drives cell viability under a certain condition, the ChIP-seq technique can be used to profile its regulatory targets to further find essential genes. The previous analyses demonstrate that NEST can identify the essential genes in a CRISPR screen. We further explored whether NEST can help prioritize key target genes in a ChIP-seq experiment. ChIP-seq often finds tens of thousands of in vivo binding sites for a TF. Since target genes can be regulated by TF binding through long-range DNA looping, thousands of genes near the TF binding sites can often be putative targets, and it is hard to prioritize the functional target genes directly from a ChIP-seq experiment. We therefore investigated using network neighbor information to prioritize the functional TF target genes.
NEST predicts cancer patient survival
To systematically evaluate the survival prediction performance, we hypothesized that a good gene-wise survival predictor should show significantly higher death risk for oncogenes than for tumor suppressors. We tested this hypothesis on all the annotated oncogenes and tumor suppressors using the TCGA GBM data (Fig. 4c). While gene expression showed no significant difference in survival risk Z-scores, NEST gave significantly higher survival risk for oncogenes than for tumor suppressors (Fig. 4c). This observation was corroborated in another independent GBM cohort (Additional file 1: Figure S11), suggesting that the NEST score is a much better indicator of GBM survival than gene expression alone. To examine the survival prediction performance of NEST in other cancer types, we used the Wilcoxon rank-sum test to measure the difference of survival risk Z-scores between oncogenes and tumor suppressors. A positive rank-sum Z-score indicates that oncogenes have higher survival risk than tumor suppressors, and a negative Z-score indicates the opposite. For cancers with a low death rate, the Cox-PH survival regression may not yield an accurate risk estimate for each gene. In contrast, cancer types with a high death rate, such as GBM and ovarian cancer (OV), seemed to give positive rank-sum Z-scores that separate oncogenes from tumor suppressors (Fig. 4d). These results suggest that if a gene’s neighbors are over-expressed in tumors, the gene itself is more likely to be an oncogene with associated survival risk.
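The rank-sum Z-score used above to compare oncogenes with tumor suppressors can be sketched in Python as follows. This is an illustration under stated assumptions: it uses the normal approximation of the Wilcoxon rank-sum statistic with average ranks for ties and no tie correction of the variance, and the function name `rank_sum_z` is hypothetical. The exact variant used in the study is not specified in the text.

```python
import math


def rank_sum_z(group_a, group_b):
    """Normal-approximation Z-score of the Wilcoxon rank-sum test.

    Positive when values in group_a (e.g. oncogene risk Z-scores) tend to be
    larger than values in group_b (e.g. tumor suppressor risk Z-scores).
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups must be non-empty")

    # Pool the values, remembering which group each one came from.
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])

    # Sum the 1-based ranks of group_a, giving tied values their average rank.
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        rank_sum_a += average_rank * sum(1 for k in range(i, j + 1) if pooled[k][1] == 0)
        i = j + 1

    u = rank_sum_a - n_a * (n_a + 1) / 2           # Mann-Whitney U for group_a
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return (u - mean_u) / sd_u
```

Swapping the two groups flips the sign of the Z-score, which matches the interpretation of positive and negative values given above.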
To identify distinct features of gene essentiality in CRISPR screens, we developed a network-based method called NEST (Network Essentiality Scoring Tool). We found that essential genes selected in CRISPR screens showed characteristically higher expression levels of neighbor genes connected in the protein interaction network. Our analysis of the cholera toxin screen in K562 cells also suggests that the quality of CRISPR screen results can be enhanced through the neighbor CRISPR selection score. For a ChIP-seq experiment, NEST can also reliably identify the key TF target genes. Last but not least, the NEST score predicts patient survival from TCGA tumor profiles better than gene expression alone. Historically, protein interaction networks were widely used to infer discrete labels such as gene functions, phenotypes [55, 56, 57], or gene categories. Our study differs from these previous works in that continuous expression or screen fold-change values are integrated with the protein networks. Despite these differences, all of these studies indicate that network information can greatly help biological inference.
NEST significantly outperformed previous methods on gene essentiality prediction and functional screen result enhancement, including all winning methods in the DREAM challenge (Fig. 1f). According to the rules of the DREAM challenge, all DREAM methods can use gene expression as well as any other features they could utilize. However, NEST outperformed all top DREAM methods. One possible reason is that the gene essentiality gold standard of DREAM is the Achilles shRNA screen data, which is poorly correlated with the CRISPR screen (Fig. 1b and Additional file 1: Figure S3A). Because we used CRISPR data as the gold standard, those top DREAM methods, optimized to fit the Achilles shRNA screen, may not have satisfactory performance.
Several limitations of our study should be noted. The gene activity computed by NEST is based on network interaction partners, which could have either an activating or a repressive effect. Meanwhile, for compensating interactions such as synthetic lethality, the activation of interaction partners indicates gene loss of function. For example, PLK1, an interaction partner of TP53 in the STRING network, was consistently upregulated in cancer cells with inactivated TP53 compared with those with wild-type TP53. We currently sum all neighbor values without distinguishing between activating, repressive, or synthetic lethal relations. Thus, further categorization of network interaction types will be critical for better gene prioritization. Another limitation of our study is that current protein interaction network data cover only a subset of well-studied genes. Because of the dependence on interaction knowledge, our method may not reliably infer the activity of under-studied genes. As a third limitation, we only tested NEST on gene loss-of-function CRISPR screens. For CRISPRa gain-of-function screens [8, 9], it remains to be seen whether network-based analysis can bring any predictive power and result enhancement.
In summary, we derived a network-based method, NEST, to interpret and enhance the outcome of genome-wide CRISPR screens, and NEST showed significantly better performance than previous related methods. We recommend that researchers use NEST to calculate neighbor CRISPR values from their CRISPR screen results. Moreover, the candidate essential genes in a cell condition might be prioritized before running a large-scale screen to reduce the total number of genes under the screen, which might improve the results and practicality of in vivo CRISPR screens. Besides CRISPR analysis, our method can also identify key targets from ChIP-seq experiments and find genes associated with clinical outcome from tumor profiling data. Thus, we foresee NEST being generally applicable to many problems related to gene essentiality prioritization.
Materials and methods
The web application and source code of NEST are freely available under the GNU General Public License v3 at http://nest.dfci.harvard.edu. The source code and testing data of NEST are additionally deposited at https://github.com/foreverdream2/NEST/releases.
For CRISPR screen data, we searched for studies published up to 1 June 2015 with publicly available data and genome-scale sgRNA coverage in human cell lines. Three studies fulfilled our criteria. In K562 cells, growth phenotype and toxin selection phenotype were screened with CRISPRi technology. In HL60 and A375 cell lines, growth phenotype was screened at genome scale with CRISPR technology [4, 5]. Significant gene hits were called from these datasets by MAGeCK 0.5 with default parameters and an FDR threshold of 0.05. For gene essentiality prediction in each cell line, only negatively selected gene hits were considered as the gold standard, because most significant gene hits are negatively selected in the collected CRISPR experiments (Additional file 1: Figure S1). For the gold standard control set, we extracted the same number of genes ranked at the bottom by MAGeCK.
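The construction of the gold standard and its size-matched control set can be sketched as below. The sketch assumes the MAGeCK negative-selection results have already been parsed into a list of (gene, FDR) pairs ordered from the most to the least negatively selected gene. The function name `gold_standard`, this input layout, and the use of an inclusive cutoff (FDR ≤ 0.05) are assumptions for illustration; the sketch does not read MAGeCK output files.

```python
def gold_standard(ranked_genes, fdr_cutoff=0.05):
    """Split a ranked gene list into essential hits and matched controls.

    ranked_genes : list of (gene, fdr), best negatively selected gene first
    Returns (positives, controls), where controls are the same number of
    genes taken from the bottom of the ranking, excluding the hits.
    """
    positives = [gene for gene, fdr in ranked_genes if fdr <= fdr_cutoff]
    hit_set = set(positives)
    non_hits = [gene for gene, _ in ranked_genes if gene not in hit_set]
    # Take as many bottom-ranked genes as there are hits.
    controls = non_hits[-len(positives):] if positives else []
    return positives, controls
```

Matching the control set size to the number of hits keeps the two classes balanced when ROC curves are computed.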
For K562, the gene expression profile was downloaded from the Roadmap project. For HL60, the exon-array gene expression profile was downloaded from the ENCODE project and converted to gene-level values by JETTA. For A375, the gene expression profile was downloaded from the CCLE project. For each gene, we normalized the expression value by subtracting the mean across all samples in each cohort. Compared with the absolute expression level, the normalized expression value gives NEST a better CRISPR prediction performance (Additional file 1: Figure S12). The TCGA tumor gene expression data were downloaded from the TCGA Data Portal on 27 July 2014. Only cohorts that were not embargoed were used. For each cancer cohort, the expression values of all normal control samples were averaged as background, and the difference in gene expression between each tumor sample and the normal background was analyzed. For NOTCH signaling pathway analysis, the NOTCH off condition is defined as gamma secretase inhibitor (GSI) treatment for 3 days, and the NOTCH on condition is defined as GSI washout for 4 h. The differential expression value between on/off conditions was analyzed. The NOTCH1 ChIP-seq data were generated in our previous work, and the dynamic binding peaks between NOTCH on/off conditions were used.
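The per-gene centering of cell line expression described above (subtracting each gene's mean across all samples in a cohort) can be illustrated with a short Python sketch. The function name `center_by_gene` and the dictionary layout are assumptions made for this example.

```python
def center_by_gene(expression):
    """Subtract each gene's cohort mean from its expression values.

    expression : dict mapping gene -> list of values, one per sample
    Returns a dict with the same keys and mean-centered value lists.
    Genes with no values are skipped.
    """
    centered = {}
    for gene, sample_values in expression.items():
        if not sample_values:
            continue
        gene_mean = sum(sample_values) / len(sample_values)
        centered[gene] = [v - gene_mean for v in sample_values]
    return centered
```

After centering, a positive value means the gene is expressed above its cohort average in that sample, which is the quantity summed over neighbors in the expression-based NEST score.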
For H3K27ac ChIP-seq profiles, we downloaded data from the Roadmap project. Among all cell lines with CRISPR data collected, K562 is the only one with an H3K27ac profile available. Previously, we developed the BETA method to calculate the regulatory potential (RP) on gene promoters from the ChIP-seq profile of a transcription factor or histone modification [36, 37]. We used the implementation in the RABIT package with default parameters to calculate the H3K27ac RP scores. For each gene, the RP scores were normalized by subtracting the mean across all cell lines profiled.
Network randomization and permutation test
We used the stub rewiring method, which preserves gene degree, to randomize the unweighted STRING network. The edges from each gene are first detached from its partners, and then randomly connected with each other. Since we do not allow self-interactions or duplicated edges, the connection process may fail to finish. In this case, we restart the rewiring process until 98% of the edges are reconnected.
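One possible reading of the stub rewiring procedure is sketched below in Python. Each edge is cut into two stubs, the stubs are shuffled and paired, and pairings that would form a self-interaction or a duplicated edge are discarded. If fewer than 98% of the edges survive, the whole attempt is restarted. The function name `stub_rewire`, the `max_restarts` safeguard, and the decision to drop invalid pairings rather than re-pair them are assumptions of this sketch, so degrees are preserved only up to the small fraction of dropped edges.

```python
import random


def stub_rewire(edges, min_fraction=0.98, max_restarts=1000, seed=None):
    """Randomize an unweighted, undirected network by stub rewiring.

    edges : list of (gene_a, gene_b) pairs with comparable gene identifiers
    Returns a sorted list of rewired edges without self-loops or duplicates.
    """
    rng = random.Random(seed)
    # One stub per edge endpoint, so a gene appears once per unit of degree.
    stubs = [gene for a, b in edges for gene in (a, b)]

    for _ in range(max_restarts):
        rng.shuffle(stubs)
        rewired = set()
        for i in range(0, len(stubs), 2):
            a, b = stubs[i], stubs[i + 1]
            if a == b:
                continue  # self-interaction: discard this pairing
            # Store edges in a canonical order so duplicates collapse in the set.
            rewired.add((a, b) if a <= b else (b, a))
        if len(rewired) >= min_fraction * len(edges):
            return sorted(rewired)

    raise RuntimeError("could not reconnect enough edges; try more restarts")
```

On very small or very dense networks the 98% target may be hard to reach, in which case the function raises an error instead of looping indefinitely.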
Based on random networks, we derived a permutation test to assess whether the NEST score of each gene is significantly larger (or smaller) than expected. For each random network, we calculated the NEST values as random NEST. For each gene, we computed the Z-score as (real NEST – average random NEST)/(Stderr of random NEST). If the Z-score is positive, we computed the P value as the fraction of random NEST scores that are larger than or equal to the real NEST score. If the Z-score is negative, we computed the P value as the fraction of random NEST scores that are smaller than or equal to the real NEST score.
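The permutation test above can be sketched for a single gene as follows. The sketch assumes that the "Stderr of random NEST" is the sample standard deviation of the random NEST scores, and it treats a Z-score of exactly zero like a positive one. Both choices, and the function name `permutation_test`, are assumptions made for illustration.

```python
import statistics


def permutation_test(real_score, random_scores):
    """Z-score and empirical P value of one gene's NEST score.

    real_score    : NEST score from the real network
    random_scores : NEST scores of the same gene from randomized networks
    """
    mean_random = statistics.mean(random_scores)
    sd_random = statistics.stdev(random_scores)  # requires >= 2 random scores
    if sd_random == 0:
        raise ValueError("random scores have zero spread")

    z = (real_score - mean_random) / sd_random
    if z >= 0:
        # Upper tail: how often a random network scores at least as high.
        extreme = sum(1 for r in random_scores if r >= real_score)
    else:
        # Lower tail: how often a random network scores at least as low.
        extreme = sum(1 for r in random_scores if r <= real_score)
    return z, extreme / len(random_scores)
```

With N random networks the smallest non-zero P value is 1/N, so a reported P value <0.001 implies at least 1,000 randomizations.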
We used the Cox-PH model to analyze the effect of gene expression or NEST scores on survival. For GBM, several factors affect survival, and we included them as covariates in the survival regression, including age, gender, G-CIMP status, and treatment status. The final survival effect was thus corrected for the effects of these confounding factors. For TCGA pan-cancer analysis, we only included cancer types with more than 50 patients and a death rate above 5%. In the Cox-PH regression, we only included age, gender, and stage (if available) to enable uniform comparison among different cancer types.
The authors would like to thank Michael Love, Eric Severson, Han Xu, Yiwei Chen, and Jing Mi for helpful discussions. The project was supported by the U01 CA180980 grant from NIH and the Claudia Adams Barr Award in Innovative Basic Cancer Research from the Dana-Farber Cancer Institute.
3. Broad-DREAM Gene Essentiality Prediction Challenge. Available at: https://www.synapse.org/#!Synapse:syn2384331/wiki.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.