Network-based analysis reveals distinct association patterns in a semantic MEDLINE-based drug-disease-gene network
A large number of associations among different biological entities (e.g., diseases, drugs, and genes) are scattered across millions of biomedical articles. Systematic analysis of such heterogeneous data can reveal novel associations among different biological entities in the context of personalized medicine and translational research. Recently, network-based computational approaches have gained popularity for investigating such heterogeneous data, proposing novel therapeutic targets, and deciphering disease mechanisms. However, little effort has been devoted to investigating associations among drugs, diseases, and genes in an integrative manner.
We propose a novel network-based computational framework to identify statistically over-expressed subnetwork patterns, called network motifs, in an integrated disease-drug-gene network extracted from Semantic MEDLINE. The framework consists of two steps. The first step is to construct an association network by extracting pair-wise associations between diseases, drugs and genes in Semantic MEDLINE using a domain pattern driven strategy. A Resource Description Framework (RDF)-linked data approach is used to re-organize the data to increase the flexibility of data integration, the interoperability within domain ontologies, and the efficiency of data storage. Unique associations among drugs, diseases, and genes are extracted for downstream network-based analysis. The second step is to apply a network-based approach to mine the local network structure of this heterogeneous network. Significant network motifs are then identified as the backbone of the network. A simplified network based on those significant motifs is then constructed to facilitate discovery. We implemented our computational framework and identified five network motifs, each of which corresponds to specific biological meanings. Three case studies demonstrate that novel associations are derived from the network topology analysis of reconstructed networks of significant network motifs, further validated by expert knowledge and functional enrichment analyses.
We have developed a novel network-based computational approach to investigate the heterogeneous drug-gene-disease network extracted from Semantic MEDLINE. We demonstrate the power of this approach by prioritizing candidate disease genes, inferring potential disease relationships, and proposing novel drug targets within the context of the entire knowledge base. The results indicate that such an approach will facilitate the formulation of novel research hypotheses, which is critical for translational medicine research and personalized medicine.
Keywords: Resource Description Framework; Heterogeneous Network; Association Network; Network Motif; Unified Medical Language System
A large number of associations among biomedical entities are scattered across the biomedical literature. Systematic analysis of such heterogeneous data gives biomedical scientists unprecedented opportunities to infer novel associations among different biological entities in the context of personalized medicine and translational research studies. MEDLINE (http://www.nlm.nih.gov/bsd/pmresources.html), for instance, currently contains more than 22 million citations of biomedical literature. Semantic MEDLINE is a knowledge base consisting of associations automatically extracted from MEDLINE by integrating document retrieval, advanced natural language processing (NLP), and automatic summarization and visualization. However, it is computationally challenging to query Semantic MEDLINE directly, because associations among different biomedical entities are very complex yet sparse. It is also very difficult to investigate those associations at a large scale. Advanced informatics approaches have the potential to fill the gaps between the knowledge needs of translational researchers and existing knowledge discovery services.
In Semantic MEDLINE, biomedical entities and associations are semantically annotated using concepts in the Unified Medical Language System (UMLS). The semantic information defined in the UMLS can be further leveraged to extract associations among concepts in specific domains and to identify domain patterns for specific studies through advanced computational methods such as network-based analysis.
In the last decade, network-based computational approaches have gained popularity and become a new paradigm for investigating associations among drugs, diseases, and genes. Applications of these approaches include disease gene prioritization [3, 4, 5], identification of disease relationships [6, 7] and drug repositioning [8, 9]. However, the majority of these approaches focus on relationships between only two kinds of entities (e.g., associations between genes and diseases). For instance, Hu and Agarwal created a human disease-drug network based on genomic expression profiles collected from the Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/). In total, 170,027 interactions between diseases and drugs were considered significant, including 645 disease-disease, 5,008 disease-drug, and 164,374 drug-drug associations. These expression-based associations among diseases and drugs could serve as a backend knowledge base to facilitate discovery. Bauer-Mehren et al. developed a comprehensive disease-gene association network by integrating associations from several sources that cover different biomedical aspects of diseases. The results indicate a highly shared genetic origin of human diseases. Functional modules were also detected in several Mendelian disorders as well as in common diseases. To systematically analyze drug-disease-gene relationships, Daminelli et al. proposed a network-based approach that predicts novel drug-gene and drug-disease associations by completing incomplete bi-cliques in the network. This approach holds great potential for drug repositioning and the discovery of novel associations. However, the analysis was limited to certain associations among drugs, genes, and diseases (e.g., drug-disease and drug-gene associations). A network-based investigation of all pair-wise associations among these entities is necessary to understand the complexity of existing associations and to infer novel associations within the context of the whole knowledge base.
Network-based computational approaches enable us to analyze heterogeneous networks such as drug-disease-gene networks by decomposing them into small subnetworks, called network motifs (NMs). NMs are statistically significant recurring structural patterns that are found more often in real networks than would be expected in random networks with the same network topologies. They are the smallest basic functional and evolutionarily conserved units in biological networks. Our hypothesis is that the NMs of a network are the significant sub-patterns that represent the backbone of the network, which serves as the focused portion out of thousands of nodes (e.g., drugs, diseases, and genes) [14, 15]. Overlapping NMs can also aggregate into larger modules that perform specific functions.
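To make the idea of decomposing a heterogeneous network into three-node subnetworks concrete, the following is a minimal Python sketch, not the procedure used in this study. It assumes an undirected network given as a list of edges plus a node-to-type mapping; the function name and the grouping of patterns by sorted node types and shape (path or triangle) are illustrative simplifications.

```python
from collections import Counter
from itertools import combinations


def count_three_node_patterns(edges, node_type):
    """Count connected three-node subnetworks by node types and shape.

    edges: iterable of (u, v) pairs, treated as undirected.
    node_type: dict mapping each node to a label such as "drug",
               "disease", or "gene" (labels are illustrative).
    """
    # Build an undirected adjacency map.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    seen = set()
    counts = Counter()
    for center, nbrs in adj.items():
        # Every connected triple has at least one node linked to the other two.
        for a, b in combinations(sorted(nbrs), 2):
            triple = frozenset((center, a, b))
            if triple in seen:
                continue
            seen.add(triple)
            # Two edges through the center; a third edge closes a triangle.
            shape = "triangle" if b in adj[a] else "path"
            types = tuple(sorted(node_type[n] for n in triple))
            counts[(types, shape)] += 1
    return counts
```

Counting patterns in the real network is only the first half of motif detection; the counts must then be compared with those obtained from randomized networks before a pattern can be called a motif.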
In this paper, we propose a network-based computational framework to analyze the complex network formed by a large number of associations. We focus on a heterogeneous drug-disease-gene network derived from Semantic MEDLINE and investigate the underlying associations using network-based systems biology approaches. Three case studies demonstrate that our approach has the potential to facilitate the formulation of novel research hypotheses, which is critical for translational medicine research. In the following, we first present the Materials and methods. We then describe the results and case studies in detail.
Materials and methods
Data sources and preprocessing
Extraction of association data from Semantic MEDLINE
Semantic MEDLINE currently contains more than 56 million associations extracted from MEDLINE citations and consists of eight tables, including concepts, concept semantic types, concept translations, predication, predication arguments, and sentences. Data from different tables need to be joined in order to obtain information for a particular association between two entities. The database contains an all-embracing joined table that provides information about associations (source concept, predicate, and object concept), and their source PubMed IDs (PMIDs).
We optimize and reorganize the relevant data in Semantic MEDLINE into the Resource Description Framework (RDF) format. Based on the UMLS semantic types and groups, we extract unique associations among drugs, diseases, and genes, and represent them as six views in relational database tables. We then use the Web RDF transformation tool D2R server to convert the six views into RDF triples through a D2RQ mapping file (http://d2rq.org/d2r-server). This mapping file specifies the mappings between the six relational database table schemas and the output RDF graphs. A detailed description of this approach is given in our previous work. These six tables serve as the preliminary association data resource and include all unique associations from Semantic MEDLINE.
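As a rough illustration of what turning relational association rows into RDF triples involves, the sketch below serializes (subject, predicate, object) rows as N-Triples lines in plain Python. It is not the D2RQ mapping used in this work: the `BASE` namespace, the URI layout, and the function name are hypothetical, and the concept identifiers in any input are whatever the caller supplies.

```python
# Hypothetical namespace; the actual URIs produced by the D2RQ mapping
# are not given in the text.
BASE = "http://example.org/semmed/"


def to_ntriples(rows):
    """Serialize unique (subject_id, predicate, object_id) rows as N-Triples.

    Duplicate rows collapse to a single triple, mirroring the idea of
    keeping only unique associations.
    """
    lines = set()
    for subj, pred, obj in rows:
        lines.add(
            f"<{BASE}concept/{subj}> "
            f"<{BASE}predicate/{pred.lower()}> "
            f"<{BASE}concept/{obj}> ."
        )
    # Sort so the output is deterministic.
    return sorted(lines)
```

Each output line is one subject-predicate-object statement, which is what makes it straightforward to merge such data with other linked-data graphs later.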
Data preprocessing using FDA-approved drugs in DrugBank
Since the extraction accuracy of associations in Semantic MEDLINE is about 77% (precision is 76-96% and recall is 55-70%), a filtering strategy is applied to extract high-confidence association data using the FDA-approved drug list from DrugBank, a database containing drug information together with the corresponding drug target and treatment indication information. As of July 31, 2012, the database contained 1,578 FDA-approved drug entries, including 131 FDA-approved biotech drugs and 1,447 FDA-approved small molecule drugs. We extract associations involving these FDA-approved drugs from each drug-related association table. After manually removing generic and nonsensical terms from the association tables (e.g., gene, homologous gene, and protein), we limit the drug-drug, drug-gene, and drug-disease associations to those involving the 1,578 FDA-approved drugs. Based on the filtered drug-gene and drug-disease associations, we generate lists of related genes and diseases, and then obtain gene-gene, disease-disease, and gene-disease associations using these genes and diseases. This filtering strategy enables us to focus only on associations related to FDA-approved drugs in this study. These associations are then analyzed by the proposed network-based approach.
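The filtering strategy above can be sketched in a few lines of Python. This is an illustrative reading of the text rather than the authors' implementation: the function and argument names are hypothetical, and it assumes that drug-drug, gene-gene, and disease-disease pairs are kept only when both endpoints pass the filter, which the text does not state explicitly.

```python
def filter_by_approved_drugs(approved, drug_gene, drug_disease, drug_drug,
                             gene_gene, disease_disease, gene_disease):
    """Restrict six association tables to those tied to approved drugs.

    approved: set of approved drug identifiers.
    Every other argument is an iterable of pairs in the order its name
    suggests, e.g. drug_gene holds (drug, gene) pairs.
    """
    # Step 1: keep drug-centred associations that involve approved drugs.
    dg = {(d, g) for d, g in drug_gene if d in approved}
    ds = {(d, s) for d, s in drug_disease if d in approved}
    # Assumption: both drugs of a drug-drug pair must be approved.
    dd = {(a, b) for a, b in drug_drug if a in approved and b in approved}

    # Step 2: derive the related gene and disease lists.
    genes = {g for _, g in dg}
    diseases = {s for _, s in ds}

    # Step 3: keep the remaining associations among those genes and diseases.
    gg = {(a, b) for a, b in gene_gene if a in genes and b in genes}
    ss = {(a, b) for a, b in disease_disease
          if a in diseases and b in diseases}
    gd = {(g, s) for g, s in gene_disease if g in genes and s in diseases}

    return {"drug_gene": dg, "drug_disease": ds, "drug_drug": dd,
            "gene_gene": gg, "disease_disease": ss, "gene_disease": gd}
```

Using sets for every table also removes duplicate pairs, so the output contains unique associations only.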
Network motif analysis
Network motifs are topologically distinct subnetwork patterns that are present more frequently in true networks than in random networks. They are usually well conserved and perform specific processing tasks in the same types of networks. For example, in gene regulatory networks, the same set of network motifs has been repeatedly identified in diverse organisms from bacteria to human. The hypothesis is that network motifs were independently selected by evolutionary processes in a converging manner and have characteristic dynamical functions. This suggests that network motifs serve as building blocks of gene regulatory networks that are beneficial to the organism.
The statistical significance of each three-node subnetwork is measured by a Z-score,

$$Z = \frac{N_{\mathrm{real}} - \bar{N}_{\mathrm{rand}}}{\sigma_{\mathrm{rand}}},$$

where $N_{\mathrm{real}}$ is the number of times a three-node subnetwork is detected in the real network, $\bar{N}_{\mathrm{rand}}$ is the mean number of times this subnetwork is detected in 1000 randomized networks, and $\sigma_{\mathrm{rand}}$ is the standard deviation of the number of times this subnetwork is detected in the randomized networks. The p value of a motif is the number of random networks in which it occurs more often than in the original network, divided by the total number of random networks. A pattern with p ≤ 0.05 is considered statistically significant. This network motif discovery procedure is performed using the FANMOD tool.
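The two significance measures defined above (the standardized difference between the real count and the mean random count, and the empirical p value) can be computed from plain counts as in the following sketch. The function name is hypothetical, and the use of the population standard deviation is an assumption; FANMOD performs these calculations internally and may differ in detail.

```python
from statistics import mean, pstdev


def motif_significance(n_real, random_counts):
    """Return (z, p) for one subnetwork pattern.

    n_real: occurrences of the pattern in the real network.
    random_counts: occurrences in each randomized network.
    """
    mu = mean(random_counts)
    sigma = pstdev(random_counts)  # assumption: population standard deviation
    # The Z-score is undefined when the random counts do not vary.
    z = (n_real - mu) / sigma if sigma > 0 else None
    # Empirical p value: fraction of random networks in which the pattern
    # occurs more often than in the real network.
    p = sum(1 for c in random_counts if c > n_real) / len(random_counts)
    return z, p
```

For example, with random counts `[2, 4, 4, 4, 5, 5, 7, 9]` (mean 5, standard deviation 2), a real count of 10 gives a Z-score of 2.5 and an empirical p value of 0, because no random network reaches a count above 10.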
Construction of the core drug-disease-gene network
It has been shown that in gene regulatory networks, for each network motif, the majority of matches overlap and aggregate into homologous motif clusters. Many of these motif clusters largely overlap with modules of known biological processes. The clusters of overlapping matches of these motifs aggregate into a superstructure that represents the backbone of the network and is assumed to play a central role in defining the global topological organization. Accordingly, we aggregate matches of significant network motifs into a core drug-disease-gene network. In this core network, we investigate the distribution of the connectivity degree of the different types of nodes. Nodes with a significantly larger number of links in the network are called hub nodes, which are critical for information flow throughout the entire network.
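A minimal sketch of this aggregation step is shown below: motif matches are merged into a core edge set, node degrees are computed, and hubs are flagged. The function names are hypothetical, and the hub rule (degree above the mean plus two standard deviations) is an assumed threshold chosen for illustration; the text does not specify how "significantly larger" is determined.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def build_core_network(motif_matches, edges):
    """Collect the original edges that lie inside at least one motif match.

    motif_matches: iterable of node triples matching significant motifs.
    edges: iterable of (u, v) pairs of the full network (undirected).
    """
    edge_set = {frozenset(e) for e in edges}
    core = set()
    for match in motif_matches:
        for pair in combinations(match, 2):
            if frozenset(pair) in edge_set:
                core.add(frozenset(pair))
    return core


def node_degrees(core_edges):
    """Count how many core edges touch each node."""
    deg = Counter()
    for edge in core_edges:
        for node in edge:
            deg[node] += 1
    return deg


def hub_nodes(deg, n_sd=2.0):
    """Assumed rule: a hub has degree above mean + n_sd * standard deviation."""
    values = list(deg.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return {node for node, d in deg.items() if d > cutoff}
```

Because overlapping matches share edges, the core network is typically much smaller than the sum of all matches, which is what makes it usable as a simplified view of the full network.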
An integrated drug-disease-gene network reconstructed from Semantic MEDLINE
We constructed a drug-disease-gene network with the following two steps:
Table 1. Statistics of the six extracted association types (columns: records in Semantic MEDLINE; associations involving FDA-approved drugs; unique entity number).
Second, we constructed the association data involving FDA-approved drugs. We applied the filtering strategy described in the Materials and methods section to extract association data involving FDA-approved drugs from the unique association data set. As shown in the "Associations involving FDA-approved drugs" column of Table 1, the number of associations in each table was further reduced. We used this focused association data to construct an integrated disease-drug-gene network for downstream network-based analysis.
Network topology analysis of the core drug-disease-gene network
Local network structure: from network to network motif
The five significant network motif patterns in Figure 2 have strong biological meanings and could suggest future research directions to scientists in their fields. In the following sections, we provide three case studies that illustrate results based on three of the significant network motifs.
Case study 1 - prioritization of disease genes
Case study 2 - inference of disease relationships
Enriched disease and disorder categories in IPA analysis: Congenital Heart Anomaly; Liver Necrosis/Cell Death; Cardiac Necrosis/Cell Death; Renal Necrosis/Cell Death; Increased Levels of AST; Increased Levels of Albumin.
Case study 3 - drug repositioning
A three-gene network motif (NM 3) was also identified in this heterogeneous network. This NM is a very common motif pattern in protein-protein interaction networks and gene regulatory networks [37, 38], indicating that NM detection analysis of a heterogeneous association network can identify significant NMs even when they are enriched in a single type of association.
Comparisons of network motifs from different networks
Since all five identified network motifs involve only two of the three node types, we further investigated whether networks involving only two node types can generate the same NMs. To that end, we performed NM analysis on the disease-gene, disease-drug, and gene networks separately. Not all NMs detected in the complete network can be detected in the disease-gene, disease-drug, and gene networks (Additional file 4: File S4). The results indicate that although the NMs do not contain all three node types because of their small size, the additional associations still contribute additional information to the NM detection analysis.
Literature mining approaches have been successful in extracting associations among biological entities over the last decade. However, such information is usually large, complex, and multidimensional, making it impossible for biomedical researchers to investigate the data directly. To bridge the gap between the knowledge needs of translational researchers and existing knowledge discovery services, we have proposed a network-based informatics approach to investigate the underlying relationships among different biological entities based on associations automatically extracted from the literature. The proposed approach has advantages in several aspects.
Our approach is one of the first attempts to investigate disease-drug-gene associations in an integrative manner. To demonstrate the advantage of NM analysis on the heterogeneous network, we performed NM analysis on the disease-gene, disease-drug, and gene networks separately and compared the results with those derived from the complete disease-drug-gene network. Not all network motifs detected in the complete network can be detected in the disease-gene, disease-drug, and gene networks. The results indicate that although the NMs in this study do not contain all three node types because of their small size, the additional associations still contribute additional information to the analysis. In addition, NM analysis of such heterogeneous networks can extract and highlight the hotspots in the network, helping experts in different fields to generate testable hypotheses for their future research.
We are aware that there are many other network analysis approaches for both social networks and biological networks. These approaches are designed for different purposes. For instance, biological networks can be interrogated by their overall properties (e.g., average clustering coefficient and overall distribution of node degrees), significant NMs, or clustered subnetworks/modules. In this work, we focus on identifying statistically significant three-node NM patterns that can help infer novel disease-drug-gene relationships. NM analysis decomposes the whole heterogeneous network into the smallest network patterns that are recurrently discovered in the network, which are considered the backbone associations of diseases, drugs, and genes. For instance, among the NM 1 instances in Figure 2, most share the same two diseases, while the third node, a gene, differs.
By extracting all the associations involving these two diseases from the original association network, we found that while the two diseases share a significant number of associated genes, each also has unique associations with other genes. Based on the assumption that similar diseases are more likely to be associated with the same group(s) of genes or to involve the same biological processes, the genes associated with only one disease can be prioritized as candidate disease genes for the other disease. Such inference is only possible through NM-level analysis that considers significant network patterns (i.e., NMs) as well as their neighborhood in the whole network. In addition, since these NMs are statistically significant subnetworks, they represent the "real" signal in a network that usually contains a considerable number of false positive associations, especially those derived from literature mining techniques. Because of limited computational resources, we did not include NMs with more than three nodes. We plan to extend our work to NMs with more than three nodes when the computational resources become available. We believe that the proposed network-based approach can complement other existing network analysis methods and provide researchers with a unique way to look at these huge heterogeneous networks.
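The prioritization logic described above reduces to set operations on the genes associated with each of the two diseases. The sketch below is an illustration under that reading; the function name and the returned dictionary keys are hypothetical, and a real analysis would also need evidence that the two diseases are similar enough for the inference to be meaningful.

```python
def prioritize_candidate_genes(disease_a, disease_b, gene_disease):
    """Split the genes of two related diseases into shared and unique sets.

    gene_disease: iterable of (gene, disease) association pairs.
    Genes linked to only one disease become candidates for the other.
    """
    genes_by_disease = {}
    for gene, disease in gene_disease:
        genes_by_disease.setdefault(disease, set()).add(gene)

    genes_a = genes_by_disease.get(disease_a, set())
    genes_b = genes_by_disease.get(disease_b, set())

    return {
        "shared": genes_a & genes_b,
        # Known for A but not yet for B: candidates for B, and vice versa.
        "candidates_for_b": genes_a - genes_b,
        "candidates_for_a": genes_b - genes_a,
    }
```

The size of the shared set relative to the unique sets gives a rough indication of how strongly the two diseases overlap, which can help decide how much weight to give the candidates.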
From our preliminary study, we found that Semantic MEDLINE contains relatively few gene-gene associations, since such information is usually described in the main text of articles, whereas the gene-gene interaction data in Semantic MEDLINE come from PubMed abstracts (Figure 2). We included all the associations in Figure 2 in our analysis. However, the number of gene-gene associations in Semantic MEDLINE (2,169 high-confidence pairs) is relatively small compared with other public databases (e.g., HPRD). For instance, we compared the gene-gene associations in Semantic MEDLINE with those in HPRD, a manually curated database of human gene-gene associations. The overlap between the two databases is very small (about 10% of the Semantic MEDLINE associations can be found in HPRD), and HPRD contains many more associations than Semantic MEDLINE (41,327 versus 2,169). Therefore, we believe that combining Semantic MEDLINE with other public resources (such as HPRD and STRING) will increase the coverage of associations and yield a more comprehensive association database. With a linked data approach, it will be relatively easy to link our data graph with such databases.
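A comparison of this kind can be expressed as the fraction of one database's gene pairs that also appear in the other. The following sketch assumes that gene pairs are undirected, that both databases use the same gene identifiers, and that self-pairs are ignored; the function name is hypothetical and no real database content is used.

```python
def overlap_fraction(source_pairs, reference_pairs):
    """Fraction of unique source gene pairs that also occur in the reference.

    Pairs are treated as undirected, so (A, B) and (B, A) count once.
    """
    def normalize(pairs):
        # frozenset removes direction; self-pairs are skipped.
        return {frozenset((a, b)) for a, b in pairs if a != b}

    source = normalize(source_pairs)
    reference = normalize(reference_pairs)
    if not source:
        return 0.0
    return len(source & reference) / len(source)
```

In practice, the identifier mapping between resources is often the harder part of such a comparison, and mismatched gene symbols would lower the measured overlap.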
Conclusions and future work
In this paper, we proposed a network-based computational framework to investigate an integrated heterogeneous network extracted from the MEDLINE literature, including associations among three major entity categories: drugs, genes, and diseases. Five significant NMs were identified and considered as the backbone of the entire network. The potential biological meaning of each network motif was further investigated. The results demonstrate that the proposed approach holds the potential to 1) prioritize candidate disease genes, 2) identify potential disease relationships, and 3) propose novel drug targets, within the context of the entire knowledge base. We believe that such analyses can facilitate the process of inferring novel relationships between drugs, genes, and diseases. One future direction is to develop module-based approaches to understand associations between different biomedical entities. Modules are condensed subnetworks within a network. A module identified in a heterogeneous network is a group of related diseases, drugs, and genes, which gives researchers a focused network view of the associations among these entities. Topology analysis of heterogeneous networks using graph theory can also be applied in future studies, which can lead to the identification of important diseases, drugs, and genes in the context of association networks. Pathway-level information could also be integrated in future analyses to extend the current association network.
This project was supported by the National Institutes of Health grant P30 CA 134274-04 to the University of Maryland Baltimore, the National Science Foundation award 0937060 and the National Center for Biomedical Ontologies (NCBO) to C.T., and the National Institutes of Health grant R01LM009959 and National Science Foundation award 0845523 to H.L.
- 1. Rindflesch TC, Kilicoglu H, Fiszman M, Rosemblat G, Shin D: Semantic MEDLINE: an advanced information management application for biomedicine. Information Services & Use. 2011, 31 (1/2): 15-21.
- 2. Unified Medical Language System® (UMLS). Available from: http://www.nlm.nih.gov/research/umls/
- 7. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ: Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol. 2010, 6 (2): e1000662. 10.1371/journal.pcbi.1000662.
- 11. Bauer-Mehren A, Bundschus M, Rautschka M, Mayer MA, Sanz F, Furlong LI: Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PLoS One. 2011, 6 (6): e20284. 10.1371/journal.pone.0020284.
- 17. RDF graph definition. Cited 2012. Available from: http://www.w3.org/TR/rdf-mt/#graphdefs
- 18. Tao C, Zhang Y, Jiang G, Bouamrane M-M, Chute CG: Optimizing semantic MEDLINE for translational science studies using semantic web technologies. In Proceedings of the 2nd international workshop on Managing interoperability and compleXity in health systems. 2012, Maui, Hawaii, USA: ACM, 53-58.
- 20. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS: DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res. 2011, 39 (Database issue): D1035-D1041.
- 22. Alon U: Network motifs: theory and experimental approaches. Nature reviews. Genetics. 2007, 8 (6): 450-461.
- 24. Yeger-Lotem E, Sattath S, Kashtan N, Itzkovitz S, Milo R, Pinter RY, Alon U, Margalit H: Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. Proc Natl Acad Sci U S A. 2004, 101 (16): 5934-5939. 10.1073/pnas.0306752101.
- 27. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nature reviews. Genetics. 2004, 5 (2): 101-113.
- 29. Kilicoglu H, Beg QK, Barabasi AL, Oltvai ZN: Semantic MEDLINE: a Web application to manage the results of PubMed searches. Proceedings of the Third International Symposium for Semantic Mining in Biomedicine. 2008, 69-76.
- 30. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005, 122 (6): 957-968. 10.1016/j.cell.2005.08.029.
- 31. Igor U, Krishnamurthy A, Karp RM, Shamir R: DEGAS: De Novo Discovery of Dysregulated Pathways in Human Diseases. PLoS One. 2010, 5 (10).
- 33. Kuypers DR: Skin problems in chronic kidney disease. Nature clinical practice. Nephrology. 2009, 5 (3): 157-170.
- 36. Cardinale D, Colombo A, Sandri MT, Lamantia G, Colombo N, Civelli M, Martinelli G, Veglia F, Fiorentini C, Cipolla CM: Prevention of high-dose chemotherapy-induced cardiotoxicity in high-risk patients by angiotensin-converting enzyme inhibition. Circulation. 2006, 114 (23): 2474-2481. 10.1161/CIRCULATIONAHA.106.635144.
- 37. Zhang Y, Xuan J, de Los Reyes BG, Clarke R, Ressom HW: Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data. BMC Bioinformatics. 2008, 9: 203. 10.1186/1471-2105-9-203.
- 38. Zhang Y, Xuan J, de Los Reyes BG, Clarke R, Ressom HW: Network motif-based identification of breast cancer susceptibility genes. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2008, 5696-5699.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.