An integrated database-pipeline system for studying single nucleotide polymorphisms and diseases
Studies on the relationship between disease and genetic variations such as single nucleotide polymorphisms (SNPs) are important, because genetic variations can cause disease by influencing important biological regulation processes. Despite the need to analyze SNP and disease correlations, most existing databases provide information only on functional variants at specific locations on the genome, or deal with only a few genes associated with disease. There is no combined resource that broadly supports gene-, SNP-, and disease-related information and captures the relationships among such data. Therefore, we developed an integrated database-pipeline system for studying SNPs and diseases.
To implement the pipeline system for the integrated database, we first unified complicated and redundant disease terms and gene names using the Unified Medical Language System (UMLS) for classification and noun modification, and the HUGO Gene Nomenclature Committee (HGNC) and NCBI gene databases. Next, we collected and integrated representative databases for three categories of information. For genes and proteins, we examined the NCBI mRNA, UniProt, UCSC Table Track, and MitoDat databases. For genetic variants, we used the dbSNP, JSNP, ALFRED, and HGVbase databases. For disease, we employed the OMIM, GAD, and HGMD databases. The database-pipeline system provides a disease thesaurus, including genes and SNPs associated with disease. The search results for these categories are available on the web page http://diseasome.kobic.re.kr/, and a genome browser is also available to highlight findings and to permit the convenient review of potentially deleterious SNPs among genes strongly associated with specific diseases and clinical phenotypes.
Our system is designed to capture the relationships between SNPs associated with disease and disease-causing genes. The integrated database-pipeline provides a list of candidate genes and SNP markers for evaluation in both epidemiological and molecular biological approaches to disease-gene association studies. Researchers can then select the data set for association studies semi-automatically while considering the relationships between genetic variation and diseases. The database can also make disease-association studies more economical and facilitate an understanding of the processes that cause disease. Currently, the database contains 14,674 SNP records and 109,715 gene records associated with human diseases, and it is updated at regular intervals.
Keywords: Amino Acid Change; Unified Medical Language System; Human Gene Mutation Database; Disease Term; HUGO Gene Nomenclature Committee
Many researchers have studied the relationships between disease and biological variations such as single nucleotide polymorphisms (SNPs), copy number variation, sequence repeats, and genetic rearrangement [1, 2, 3]. Recently, work on genetic variation (SNPs) associated with diseases has become intense, as many genetic variations are thought to affect the structure and function of proteins as a result of amino acid substitutions [4, 5]. Significantly, SNPs, which account for over 90% of genetic variation in the human genome, can have a major impact on how humans respond to disease, to drugs, and to other therapies. Therefore, SNP information is a great resource in biomedical studies, diagnostics, and drug development.
Researchers studying disease-associated SNPs require integrated information on SNPs and disease for two reasons. First, such information helps capture the relationships between SNPs and diseases, and thereby clarifies which genes cause disease and how SNPs affect them. Second, it saves disease-association researchers much time and effort in identifying candidate disease-causing genes.
Despite this need, existing servers contain insufficient information about SNP-disease associations. Because public databases for SNPs and diseases are large, complicated, and difficult to use, their integration is challenging. Therefore, we developed an integrated database-pipeline system for studying SNP-disease associations. We constructed a large database with comprehensive data on genes and SNPs associated with disease. In particular, the database-pipeline system allows biologists to retrieve integrated information on diseases, SNPs, and amino acid changes, along with functional annotation.
Methods and results
Automatic collection and update of public resources
The integrated database-pipeline system uses file transfer protocol (FTP), hypertext transfer protocol (HTTP), and Java-based data-extracting modules. The system also has a support function to design the database schema and to create the modules through a graphical user interface. The integration pipeline checks for updated data, downloads such data automatically from 13 public and private resource servers, and then informs the system administrator by e-mail. We selected the following representative databases for the disease, SNP, and gene resources. The disease category is updated from the Unified Medical Language System (UMLS), Online Mendelian Inheritance in Man (OMIM), Gene Association Database (GAD), and Human Gene Mutation Database (HGMD). The gene and protein category is updated from NCBI, the HUGO Gene Nomenclature Committee (HGNC), UniProt, UCSC, and MitoDat (Mendelian Inheritance and the Mitochondrion). The genetic variation category (SNPs) is updated from dbSNP, JSNP, ALFRED (Allele Frequency Database), and HGVbase (Human Genome Variation database). The system is updated regularly, as the pipeline acquires data automatically.
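The check, download, and notify cycle described above can be illustrated with a small sketch. It is written in Python rather than the Java used by the actual system, and it is not the authors' implementation: the function names (`checksum`, `detect_updates`, `build_notification`), the registry layout, and the use of a content digest to detect changes are all assumptions made for illustration. Downloads are replaced by in-memory byte strings so the example runs without network access.

```python
import hashlib


def checksum(payload: bytes) -> str:
    """Return a SHA-256 hex digest used to detect changed source files."""
    return hashlib.sha256(payload).hexdigest()


def detect_updates(known: dict[str, str], fetched: dict[str, bytes]) -> list[str]:
    """Return names of sources whose fetched content differs from the stored digest.

    `known` maps source name -> last stored digest (hypothetical registry);
    `fetched` maps source name -> freshly downloaded bytes.
    """
    changed = []
    for name, payload in sorted(fetched.items()):
        digest = checksum(payload)
        if known.get(name) != digest:
            known[name] = digest  # remember the new version
            changed.append(name)
    return changed


def build_notification(changed: list[str]) -> str:
    """Compose the body of the message sent to the administrator."""
    if not changed:
        return "No source databases changed."
    return "Updated sources: " + ", ".join(changed)


# Stand-in payloads; a real pipeline would download these over FTP or HTTP.
registry = {"OMIM": checksum(b"omim release 1")}
downloads = {"OMIM": b"omim release 1", "dbSNP": b"dbsnp build A"}
updated = detect_updates(registry, downloads)
message = build_notification(updated)
```

Comparing digests is only one way to detect a new release; a production pipeline might instead rely on release numbers or file timestamps published by each source.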
Defining disease terms based on the UMLS
The disease terms commonly used in research articles and in disease databases such as OMIM, GAD, and HGMD have the character of natural language: many synonyms and slightly different expressions refer to the same concept. To construct a non-redundant disease database, we required a unified controlled vocabulary of disease names and their synonyms. We accomplished this by using UMLS (Release Archive 2007AA), a very large, multi-purpose vocabulary database containing information about biomedical and health-related concepts, the various terms used for them, and the relationships among them. Moreover, UMLS integrates widely used clinical vocabularies such as the Systematized Nomenclature of Medicine – Clinical Terms and Medical Subject Headings, so it was an excellent resource for relating our database terms to medical informatics.
In addition, disease terms appear in various word formations (in particular, noun modification). To address this stemming problem, we normalized the disease terms found in public disease databases in four steps. First, we removed stop words using a stop word list provided by OMIM. Second, we removed suffixes such as -es and -s. Third, we removed typographical errors and special characters. Finally, we mapped the processed disease terms to unique clinical concepts by comparing them with the synonyms and alternative expressions provided by UMLS.
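The four normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented in the system: the stop word set, the synonym table, and the concept identifiers (`CONCEPT_0001`, `CONCEPT_0002`) are hypothetical placeholders standing in for the OMIM stop word list and UMLS content, and correction of typographical errors is not attempted. Special characters are stripped before tokenizing, so the step order differs slightly from the text.

```python
import re

# Hypothetical stop words; the paper used a list provided by OMIM.
STOP_WORDS = {"of", "the", "and", "with"}

# Hypothetical concept identifiers and synonyms standing in for UMLS entries.
SYNONYMS = {
    "CONCEPT_0001": ["diabetes mellitus", "diabetes"],
    "CONCEPT_0002": ["breast neoplasms", "breast cancer"],
}


def strip_suffix(word: str) -> str:
    """Remove a trailing -es or -s from words long enough to keep a stem."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(term: str) -> str:
    """Lowercase, drop special characters and stop words, then strip suffixes."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", term.lower())
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return " ".join(strip_suffix(w) for w in words)


# Normalize the synonyms with the same function so both sides are comparable.
INDEX = {normalize(s): cid for cid, syns in SYNONYMS.items() for s in syns}


def map_to_concept(term: str):
    """Return the concept identifier for a raw disease term, or None."""
    return INDEX.get(normalize(term))
```

Running the synonyms through the same `normalize` function as the query terms means the crude stems never need to be readable; they only need to match.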
Defining genes according to HGNC and NCBI data
To permit the exploration of SNP effects on genetic variation, we adopted various gene annotations, including gene information from NCBI, RefSeq mRNA from the UCSC Table Track, protein information from UniProt, and the mitochondrial biogenesis and function criteria of MitoDat (Mendelian Inheritance and the Mitochondrion). Because the mitochondrion has a central role in cellular metabolism, it is involved in many human diseases. We integrated gene and protein data into the SNP and disease resources based on a gene-synonym table from HGNC and gene information at NCBI. Next, we mapped UniProt proteins onto NCBI genes by BLAST search. Finally, we added mitochondrial gene and protein data from MitoDat, because this database predominantly contains information on human nuclear-encoded mitochondrial proteins.
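Unifying gene names through a synonym table can be sketched as a simple alias lookup. The table contents below are illustrative only: `IRS2` with the alias `IRS-2` follows the example gene discussed in this article, while `GENEA`, `GENEB`, and `ALIAS1` are invented to show how an ambiguous alias might be handled. The function names are assumptions and do not describe the HGNC data format.

```python
def build_alias_index(synonym_table: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every approved symbol and alias (upper-cased) to approved symbols."""
    index: dict[str, set[str]] = {}
    for approved, aliases in synonym_table.items():
        for name in [approved, *aliases]:
            index.setdefault(name.upper(), set()).add(approved)
    return index


def resolve(name: str, index: dict[str, set[str]]):
    """Return the single approved symbol for a name, or None if unknown or ambiguous."""
    hits = index.get(name.strip().upper(), set())
    return next(iter(hits)) if len(hits) == 1 else None


# Illustrative rows; GENEA, GENEB, and ALIAS1 are hypothetical.
table = {
    "IRS2": ["IRS-2"],
    "GENEA": ["ALIAS1"],
    "GENEB": ["ALIAS1"],
}
alias_index = build_alias_index(table)
```

Returning `None` for an alias shared by several genes leaves ambiguous names for manual review rather than guessing.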
Integration of genes, SNPs, and diseases
To construct SNP-related information, we collected representative genetic variation (SNP) resources from dbSNP, JSNP, ALFRED, HGVbase, PolyPhen (Polymorphism Phenotyping), and SIFT (Sorting Intolerant From Tolerant). We then integrated the information to show the interrelationships among SNPs located in genes, genes associated with diseases, and SNPs associated with diseases. HGVbase was incorporated as a curated resource describing human DNA variation and phenotype relationships. To predict the possible impact of an amino acid substitution on protein structure and function, we also linked to PolyPhen and SIFT. ALFRED contains data on allele frequencies at particular SNP loci for diverse populations, with reference to SNPs in dbSNP and JSNP. Our system can update the primary databases automatically in real time. However, integrating the various databases requires tracking the differences among them, so part of the update is performed manually.
Next, we analyzed the influence of SNP location (e.g., CDS, UTR, intron, or promoter) on gene structure, and the effects of synonymous or non-synonymous SNPs on genetic variation and on genes associated with disease, employing BLAST. We explored amino acid changes caused by codon changes, and identified the locations of the altered amino acids in proteins. In addition, we determined whether SNPs were synonymous or non-synonymous, and identified the relationships of SNPs to candidate disease-causing genes.
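Deciding whether a coding SNP is synonymous or non-synonymous comes down to translating the reference and altered codons and comparing the resulting amino acids. The sketch below uses the standard genetic code; the function name `classify_snp` and its interface are assumptions, and the codons in the usage example are chosen only to illustrate a glycine-to-aspartate change, not taken from any specific SNP record.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(codon: str, position: int, alt_base: str):
    """Classify a single-base change within a codon.

    `position` is the 0-based index of the changed base inside the codon.
    Returns (kind, reference amino acid, altered amino acid), using
    one-letter codes and '*' for a stop codon.
    """
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Illustrative codons: GGC (Gly) to GAC (Asp), and GGC to GGT (both Gly).
missense = classify_snp("GGC", 1, "A")
silent = classify_snp("GGC", 2, "T")
```

In this simplified scheme a change that introduces a stop codon is also reported as non-synonymous.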
To illustrate a disease search, we present query results for diabetes. The results from the integrated disease- and genetic variation-related databases are more helpful to researchers than results from one database only, because they provide more comprehensive information on the genes and SNP markers associated with the disease. For example, when using only one disease-related database for diabetes, researchers can obtain either disease-association study information from GAD or information on disease-related literature from OMIM. In contrast, when using the integrated database-pipeline system, we obtain a list of genes associated with diabetes and an SNP marker (rs1805097) associated with diabetes, because the integrated disease and genetic variation information are available simultaneously. This integrated information allows researchers to consider the SNP effects on the gene along with the relationships between SNPs and disease. The SNP marker (rs1805097) is located in the human insulin receptor substrate-2 (IRS-2) gene, which is a primary progesterone response gene. This SNP can cause an amino acid change (GLY1057ASP), and such a substitution may affect the structure and function of the human protein. Because the integrated result also includes the genome locations of disease-associated genes, the effects of non-synonymous SNPs at the protein level, and disease-causing risk scores, users can expect to gain a better understanding of the molecular causes of the disease.
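The kind of integrated query described above can be sketched as a join between a gene-disease table and an SNP-gene table. The schema, table names, and column names below are hypothetical and do not reflect the actual database design; the single example row uses only the gene, SNP, and amino acid change mentioned in this article.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_disease (gene TEXT, disease TEXT);
    CREATE TABLE snp_gene (snp TEXT, gene TEXT, aa_change TEXT);
    INSERT INTO gene_disease VALUES ('IRS2', 'diabetes');
    INSERT INTO snp_gene VALUES ('rs1805097', 'IRS2', 'GLY1057ASP');
    """
)


def snps_for_disease(connection: sqlite3.Connection, disease: str):
    """Return (gene, snp, aa_change) rows for genes linked to a disease.

    A LEFT JOIN keeps genes that have no recorded SNP (their SNP columns
    come back as None), so the gene list stays complete.
    """
    cursor = connection.execute(
        "SELECT gd.gene, sg.snp, sg.aa_change "
        "FROM gene_disease AS gd "
        "LEFT JOIN snp_gene AS sg ON sg.gene = gd.gene "
        "WHERE gd.disease = ? "
        "ORDER BY gd.gene, sg.snp",
        (disease,),
    )
    return cursor.fetchall()


results = snps_for_disease(conn, "diabetes")
```

A single query of this shape returns both the candidate genes and their SNP markers, which is the benefit the integrated system offers over querying each source separately.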
Conclusion and future direction
We constructed an integrated database for the study of genetic variation in disease, using an automatic integration pipeline system. Specifically, the database contains information on 124,389 diseases, 12,445,925 SNP markers, and 38,597 genes, and includes 14,674 SNP records and 109,715 gene records associated with human diseases. A total of 1,319 SNPs cause amino acid changes, which may disrupt protein structure or function.
Consequently, the integrated database-pipeline system can be a valuable resource. The system can economically facilitate disease-association studies by identifying candidate genes and genetic variations associated with disease. It can aid the understanding of the genes that cause diseases and the impact of SNPs on diseases, by showing the relationships among genes, SNPs, and diseases. The tool uses unified disease terms, which facilitates the extension of this database to various other medical sources. As the resources in this database-pipeline system are expanding continuously, we are planning to collect validated resources used in the detection of genetic variation for comparative studies.
We thank our colleagues at KOBIC, especially Areum Han, Jung-Sun Park, and Woo-Yeon Kim. The system was co-developed as part of the Diseasome pipeline by E-Gitec Inc. This work was supported by a grant from the KRIBB Research Initiative Program, and by a Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST) (No.M10869030002-08N6903-00210).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 12, 2008: Asia Pacific Bioinformatics Network (APBioNet) Seventh International Conference on Bioinformatics (InCoB2008). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S12.
- 2. Bae JS, Cheong HS, Kim JO, Lee SO, Kim EM, Lee HW, Kim S, Kim JW, Cui T, Inoue I, et al.: Identification of SNP markers for common CNV regions and association analysis of risk of subarachnoid aneurysmal hemorrhage in Japanese population. Biochem Biophys Res Commun 2008.
- 5. Han A, Kang HJ, Cho Y, Lee S, Kim YJ, Gong S: SNP@Domain: a web resource of single nucleotide polymorphisms (SNPs) within protein domain structures and sequences. Nucleic Acids Res 2006, (34 Web Server):W642–644.
- 8. Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004, (32 Database):D267–270.
- 9. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, (33 Database):D514–517.
- 12. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008, (36 Database):D13–21.
- 13. Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ: The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res 2006, (34 Database):D319–321.
- 15. Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, et al.: The UCSC genome browser database: update 2007. Nucleic Acids Res 2007, (35 Database):D668–673.
- 20. Fredman D, Munns G, Rios D, Sjoholm F, Siegfried M, Lenhard B, Lehvaslaiho H, Brookes AJ: HGVbase: a curated resource describing human DNA variation and phenotype relationships. Nucleic Acids Res 2004, (32 Database):D516–519.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.