IIIDB: a database for isoform-isoform interactions and isoform network modules
Protein-protein interactions (PPIs) are key to understanding diverse cellular processes and disease mechanisms. However, current PPI databases only provide low-resolution knowledge of PPIs, in the sense that "proteins" of currently known PPIs generally refer to "genes." It is known that alternative splicing often impacts PPI by either directly affecting protein interacting domains, or by indirectly impacting other domains, which, in turn, impacts the PPI binding. Thus, proteins translated from different isoforms of the same gene can have different interaction partners.
Due to the limitations of current experimental capacities, little data is available for PPIs at the resolution of isoforms, although such high-resolution data is crucial for mapping pathways and understanding protein functions. In fact, alternative splicing can often change the internal structure of a pathway by rearranging specific PPIs. To fill this gap, we systematically predicted genome-wide isoform-isoform interactions (IIIs) using RNA-seq datasets, domain-domain interactions, and PPIs. Furthermore, we constructed an III database (IIIDB) as a resource for studying PPIs at isoform resolution. To discover functional modules in the III network, we performed III network clustering and obtained 1025 isoform modules. To evaluate module functionality, we performed GO/pathway enrichment analysis for each isoform module.
The IIIDB provides predictions of human protein-protein interactions at the high resolution of transcript isoforms that can facilitate detailed understanding of protein functions and biological pathways. The web interface allows users to search for IIIs or III network modules. The IIIDB is freely available at http://syslab.nchu.edu.tw/IIIDB.
Keywords: Human PPIs; Logit Score; Gold Standard Positive; IntAct Database; Gold Standard Negative
Protein-protein interactions (PPIs) perform and regulate fundamental cellular processes. As a consequence, identifying the interacting partners of a protein is essential to understanding its functions. In recent years, remarkable progress has been made in annotating the functional interactions among proteins in the cell. However, in both experimentally derived and computationally predicted protein-protein interactions, a "protein" generally refers to "all isoforms of the respective gene." Yet it is known that alternative splicing often impacts PPIs, either directly by affecting protein interacting domains or indirectly by affecting other domains, which in turn affect PPI binding. That is, alternative splicing can modulate PPIs by altering protein structures and domain compositions, leading to the gain or loss of specific molecular interactions that could be key links in pathways (reviewed in reference ). It is very likely that different isoforms of the same protein interact with different proteins and thus play different functional roles. For example, the BCL2L1 gene is alternatively spliced into two isoforms, Bcl-xL (long form) and Bcl-xS (short form): Bcl-xL inhibits apoptosis, whereas Bcl-xS promotes apoptosis. Vogler et al. reported that the interaction of Bcl-xL and BAK1 in platelets ensures platelet survival [5]. Therefore, comprehensively identifying protein-protein interactions at the isoform level is important to systematically dissect the cellular roles of proteins, to elucidate the exact composition of protein complexes, and to gain insights into metabolic pathways and a wide range of direct and indirect regulatory interactions.
Thus far, a series of studies have systematically predicted PPIs [6, 7, 8, 9, 10] and established PPI databases, e.g., OPHID, POINT, STRING and PIPs. With the exception of the IntAct database, which contains 116 human PPIs with isoform specification, none of these PPI databases currently has isoform-level PPI data. This is a large knowledge gap yet to be filled. The rapid accumulation of RNA-seq data provides unprecedented opportunities to study the structures and topological dynamics of PPI networks at isoform resolution. RNA-seq data provides two unique sources of information for Isoform-Isoform Interaction (III) reconstruction: the absence or presence of specific isoforms under specific conditions, and the co-expression of two isoforms, which may contribute to their interaction propensity. In this study, we seize this opportunity to comprehensively predict the possible interactions between splicing isoforms by integrating a series of RNA-seq datasets with domain-domain interaction data and PPI databases. The resulting III network presents a high-resolution map of PPIs, which could be invaluable in studying biological processes and understanding cellular functions.
The IIIDB allows users to easily search IIIs and isoform modules, and provides the evidence that led to each III prediction. To visualize the interactions involving the isoforms of the input gene, we integrated CytoscapeWeb to generate an interactive web-based III network (Figure 1B). Interestingly, different isoforms of the same gene can be involved in different isoform modules, which may open a new door to studying the differential functionality of the isoforms of a gene. The IIIDB also provides GO/pathway enrichment analysis results for each isoform module, which helps biologists interpret network modules at the isoform level.
Table 1. 19 RNA-seq datasets from SRA
Illumina bodyMap2 transcriptome
Widespread splicing changes in human brain development and aging
Gene expression profile in postmortem hippocampus using RNAseq for addicted human samples
Integrative genome-wide analysis reveals cooperative regulation of alternative splicing by hnRNP proteins
Comparative transcriptomic analysis of prostate cancer and matched normal tissue using RNA-seq
Complete transcriptomic landscape of prostate cancer in the Chinese population using RNA-seq
A Comparison of Single Molecule and Amplification Based Sequencing of Cancer Transcriptomes: RNA-Seq Comparison
GSE20301: Dynamic transcriptomes during neural differentiation of human embryonic stem cells
The effect of estrogen and progesterone and their antagonists in Ishikawa cell line compared to MCF7 and T47D cells
Alternative Isoform Regulation in Human Tissue Transcriptomes
GSE30017: Widespread regulated alternative splicing of single codons accelerates proteome evolution
GSE34914: Deep Sequence Analysis of non-small cell lung cancer: Integrated analysis of gene expression, alternative splicing, and single nucleotide variations in lung adenocarcinomas with and without oncogenic KRAS mutations
Transciptome profiling of ovarian cancer cell lines
RNA-Seq Quantification of the Complete Transcriptome of Genes Expressed in the Small Airway Epithelium of Nonsmokers and Smokers
GSE29155: RNA-Seq anlalysis of prostate cancer cell lines using Next Generation Sequencing
GSE38006: Next-generation sequencing reveals HIV-1-mediated suppression of T cell activation and RNA processing and the regulation of non-coding RNA expression in a CD4+ T cell line
Gene expression profiles between normal and breast tumor genomes
RNA and chromatin structure
GSE35296: The human pancreatic islet transcriptome: impact of pro-inflammatory cytokines
High-confidence and low-confidence prediction of IIIs
Isoform module discovery
To discover functional modules in the III network, we applied the MODES network clustering method to the low-confidence III network with the following parameters: minimum module size 3, maximum module size 30, and density cutoff 0.7. An important feature of MODES is that it can discover overlapping dense isoform modules, which allows one isoform to belong to multiple modules. We obtained 1025 modules with an average size of 5.08 isoforms. To provide functional annotation and evaluate the functional enrichment of the modules, we performed enrichment analyses with the GO and KEGG pathway databases. These databases provide protein-level annotations, which serve only as approximate functional annotations for isoforms.
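As a rough illustration of the module constraints listed above, the following sketch checks whether a candidate set of isoforms satisfies the size bounds and the density cutoff. It assumes the usual graph density definition (observed edges divided by possible edges) and does not reproduce the MODES search itself; the function names and the toy edge list are hypothetical.

```python
from itertools import combinations


def module_density(nodes, edges):
    """Fraction of all possible node pairs in `nodes` that are connected."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    present = sum(
        1 for a, b in combinations(sorted(nodes), 2)
        if frozenset((a, b)) in edge_set
    )
    return present / (n * (n - 1) / 2)


def passes_module_filter(nodes, edges, min_size=3, max_size=30, min_density=0.7):
    """Apply the size bounds and density cutoff quoted in the text."""
    n = len(set(nodes))
    return min_size <= n <= max_size and module_density(nodes, edges) >= min_density


# Hypothetical toy network: a triangle plus one pendant isoform.
toy_edges = [("iso_a", "iso_b"), ("iso_a", "iso_c"),
             ("iso_b", "iso_c"), ("iso_c", "iso_d")]
triangle = ["iso_a", "iso_b", "iso_c"]
all_four = ["iso_a", "iso_b", "iso_c", "iso_d"]
```

In the toy network, the triangle is fully connected and passes the filter, whereas adding the pendant isoform lowers the density to 4/6, below the 0.7 cutoff.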
Table: The enrichment rate of isoform modules based on GO and pathway enrichments. Columns: % modules enriched with GO; % modules enriched with pathway.
Integrating diverse data sources for III prediction
Currently, most PPI databases do not provide information at the level of isoforms, which presents a challenge for constructing a gold standard positive set (GSP) for III prediction. Fortunately, in the June 2013 version of the IntAct database (http://www.ebi.ac.uk/intact/), we identified 116 human PPIs with isoform specification (out of a total of 43,508 distinct human PPIs). For example, IntAct records an III between P29590-5 and P03243-1, which correspond to the 5th isoform of the protein P29590 and the 1st isoform of P03243. In addition, to obtain more IIIs for the GSP, we applied the following rule: given a PPI between proteins P1 and P2, if both P1 and P2 have only a single isoform, we also include the interaction in the GSP. This resulted in 11,356 IIIs in the GSP set.
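The two GSP rules described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that an isoform-specific identifier carries a hyphenated suffix (as in P29590-5), and the protein-to-isoform mapping and the IDs A, B and C are made up for the example.

```python
def build_gsp(ppis, isoforms_of):
    """Collect isoform pairs for the gold standard positive set.

    ppis: iterable of (id_a, id_b) interaction records.
    isoforms_of: dict mapping a protein ID to the list of its isoform IDs
                 (hypothetical input; not a real database export format).
    """
    gsp = set()
    for a, b in ppis:
        a_specific, b_specific = "-" in a, "-" in b
        if a_specific and b_specific:
            # Rule 1: the record already names both isoforms.
            pair = (a, b)
        elif not a_specific and not b_specific:
            # Rule 2: both proteins have exactly one isoform.
            iso_a = isoforms_of.get(a, [])
            iso_b = isoforms_of.get(b, [])
            if len(iso_a) != 1 or len(iso_b) != 1:
                continue
            pair = (iso_a[0], iso_b[0])
        else:
            # Mixed records are not covered by the rules in the text; skip them.
            continue
        gsp.add(tuple(sorted(pair)))
    return gsp


example_ppis = [("P29590-5", "P03243-1"), ("A", "B"), ("A", "C")]
example_isoforms = {"A": ["A-1"], "B": ["B-1"], "C": ["C-1", "C-2"]}
example_gsp = build_gsp(example_ppis, example_isoforms)
```

Here the A-C interaction is dropped because C has two isoforms, so the interacting isoform cannot be determined from the protein-level record.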
The GSP covered 5,503 RefSeq IDs, and we used these RefSeq IDs to construct the gold standard negative set (GSN). The GSN was defined as isoform pairs in which one isoform was assigned to the plasma membrane cellular component and the other to the nuclear cellular component. Isoform-specific subcellular localization was obtained from sequence-based predictions using CELLO (subCELlular LOcalization predictor). To obtain accurate isoform-specific annotation, we only used the localization predictions that were consistent with UniProt GO annotations. This resulted in 36 RefSeq IDs for the plasma membrane cellular component and 739 RefSeq IDs for the nuclear cellular component. In addition, isoforms assigned to both the plasma membrane and the nuclear cellular component were excluded from the GSN.
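The GSN definition above amounts to pairing every membrane-assigned isoform with every nucleus-assigned isoform after removing isoforms that carry both assignments. A minimal sketch, with hypothetical identifiers and function name:

```python
from itertools import product


def build_gsn(membrane_ids, nuclear_ids):
    """Pair plasma-membrane isoforms with nuclear isoforms.

    Isoforms assigned to both compartments are excluded, as in the text.
    """
    membrane = set(membrane_ids)
    nuclear = set(nuclear_ids)
    both = membrane & nuclear
    return set(product(membrane - both, nuclear - both))


# Hypothetical IDs: "x" is assigned to both compartments and is dropped.
example_gsn = build_gsn({"m1", "m2", "x"}, {"n1", "x"})
```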
To calculate precision and recall, we used timestamps to divide the GSP into training and test sets: an interaction with a timestamp after 1 January 2012 was assigned to the test GSP set (10,408 IIIs); otherwise, it was assigned to the training GSP set (948 IIIs). Once the GSP was fixed, we used the RefSeq IDs covered by the GSP to build the GSN. Figure 3 shows the precision-recall curve for the logistic regression model.
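The split and the evaluation metrics can be sketched as below. This is an illustration under stated assumptions: the record layout, the function names, and the treatment of 1 January 2012 itself as training data are choices made for the example, and precision and recall are computed at a single score threshold rather than along the full curve.

```python
from datetime import date


def split_by_timestamp(records, cutoff=date(2012, 1, 1)):
    """records: iterable of (pair, timestamp); later than cutoff goes to test."""
    train, test = [], []
    for pair, stamp in records:
        (test if stamp > cutoff else train).append(pair)
    return train, test


def precision_recall(scores, labels, threshold):
    """scores: predicted probabilities; labels: 1 for GSP pairs, 0 for GSN pairs."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


example_records = [("pair_old", date(2011, 5, 1)), ("pair_new", date(2012, 6, 1))]
example_train, example_test = split_by_timestamp(example_records)
example_pr = precision_recall([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0], threshold=0.5)
```

Sweeping the threshold over the range of predicted scores and plotting the resulting pairs gives a precision-recall curve of the kind referred to above.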
To demonstrate the biological importance of the IIIDB, we searched the literature for isoform-associated reports. Although isoform-specific protein function studies are very rare, we found isoform-specific biological evidence for BCL2L1 that supports our III predictions for BCL2L1. In addition, we found diverse biological functions for the Ras association domain family in our isoform modules.
BCL2L1 isoform interaction partners.
Interaction partner (high confidence)
BCL2L1 isoform 1 (Bcl-xL): BAK1, BAX, NLRP1
BCL2L1 isoform 2 (Bcl-xS)
Isoform coexpression network construction
To construct the isoform coexpression networks, we selected 19 RNA-seq datasets with at least 10 experiments from the Sequence Read Archive (SRA) database (Table 1). We ran eXpress with the Bowtie2 aligner to obtain isoform expression values. NCBI Reference Sequence (RefSeq) mRNAs with protein sequences were used as the transcriptome annotation, which included 31,454 RefSeq IDs (Jan 2013 version).
Logistic regression model
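Under the standard logistic regression form (an assumption, since only the variable definitions are given here), the model with an intercept, the DDI score, and the 19 dataset-specific coexpression features can be written as:

```latex
\log\frac{y_{ij}}{1 - y_{ij}}
  = \alpha_0 + \alpha_1\,\mathrm{DDI}_{ij} + \sum_{k=1}^{19} \alpha_{k+1}\,E_{k,ij}
```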
where α0, α1, ..., α20 are regression coefficients and y_ij is the probability of an interaction between isoform i and isoform j. DDI, E1, E2, ..., E19 are described as follows: (a) Domain-domain interaction (DDI) score: for all combinations of human isoform pairs, if an isoform pair has a domain-domain interaction in the DOMINE database [32, 33], we assign the DDI score according to the confidence level in DOMINE: high-confidence prediction: 3, medium-confidence prediction: 2, and low-confidence prediction: 1. If the isoform pair has several DDI scores, we take the highest score. (b) 19 RNA-seq datasets (E1, E2, ..., E19): the absolute values of the Pearson correlations for all isoform pairs, derived from each RNA-seq dataset. Since the RNA-seq datasets may differ in quality and data type, we used the logistic regression model to integrate the 19 datasets, so that the fitted coefficients reflect the quality of each dataset.
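The features and the resulting probability described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, a DDI score of 0 for pairs without any DOMINE entry is an assumption the text does not state, and the coefficients would come from fitting on the GSP and GSN rather than being set by hand.

```python
import math

# Scores assigned per confidence level, as given in the text.
DDI_LEVEL_SCORE = {"high": 3, "medium": 2, "low": 1}


def ddi_score(confidence_levels):
    """Highest score among the domain pairs of an isoform pair.

    Returns 0 when there is no domain-domain interaction (assumed default).
    """
    return max((DDI_LEVEL_SCORE[c] for c in confidence_levels), default=0)


def abs_pearson(x, y):
    """Absolute Pearson correlation of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return abs(cov / math.sqrt(vx * vy))


def interaction_probability(coefs, ddi, correlations):
    """coefs: [a0, a1, ..., a20]; correlations: the 19 values E1..E19."""
    z = coefs[0] + coefs[1] * ddi + sum(
        a * e for a, e in zip(coefs[2:], correlations)
    )
    return 1.0 / (1.0 + math.exp(-z))
```

Taking the absolute value means that strongly anti-correlated isoform pairs contribute the same evidence as strongly correlated ones.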
Funding: the National Institutes of Health (NIH) [NHLBI MAPGEN U01HL108634 and NIGMS R01GM105431] and the National Science Foundation [0747475 to X.J.Z.]; the Taiwan Ministry of Science and Technology grant [MOST 103-2320-B-005-004] and the Ministry of Education, Taiwan, R.O.C. under the ATU plan (to C.C.L.).
This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S2
- 5. Vogler M, Hamali HA, Sun XM, Bampton ET, Dinsdale D, Snowden RT, Dyer MJ, Goodall AH, Cohen GM: BCL2/BCL-XL inhibition induces apoptosis, disrupts cellular calcium homeostasis, and prevents platelet activation. Blood. 2011, 117 (26): 7145-7154. 10.1182/blood-2011-03-344812.
- 7. McDowall MD, Scott MS, Barton GJ: PIPs: human protein-protein interaction prediction database. Nucleic Acids Res. 2009, 37 (Database): D651-656.
- 10. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, et al: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39 (Database): D561-568.
- 13. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, et al: The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010, 38 (Database): D525-531.
- 15. Pruitt KD, Tatusova T, Brown GR, Maglott DR: NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40 (Database): D130-135.
- 23. Park J, Kang SI, Lee SY, Zhang XF, Kim MS, Beers LF, Lim DS, Avruch J, Kim HS, Lee SB: Tumor suppressor ras association domain family 5 (RASSF5/NORE1) mediates death receptor ligand-induced apoptosis. J Biol Chem. 2010, 285 (45): 35029-35038. 10.1074/jbc.M110.165506.
- 27. Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40 (Database): D54-56.
- 32. Raghavachari B, Tasneem A, Przytycka TM, Jothi R: DOMINE: a database of protein domain interactions. Nucleic Acids Res. 2008, 36 (Database): D656-661.
- 33. Yellaboina S, Tasneem A, Zaykin DV, Raghavachari B, Jothi R: DOMINE: a comprehensive collection of known and predicted domain-domain interactions. Nucleic Acids Res. 2011, 39 (Database): D730-735.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.