Data Integration and Pattern-Finding in Biological Sequence with TESS’s Annotation Grammar and Extraction Language (AnGEL)

  • Jonathan Schug
  • Max Mintz
  • Christian J. Stoeckert Jr.
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4544)

Abstract

Decoding the functional elements in an organism’s genome requires integrating a wide variety of experimental and computational data from many sources. The location of these data, viewed as sequence features in the genome, must serve as one of the essential organizing principles for this integration. It is therefore important to have a data integration system that takes advantage of this fact. As part of the TESS project, we have developed a grammar-based data integration and pattern search tool, Annotation Grammar and Extraction Language (AnGEL), that follows this principle. AnGEL can represent most of the current work in cis-regulatory module (CRM) modelling in an intuitive way and can process data extracted from a variety of sources simultaneously. Here we describe AnGEL’s capabilities and illustrate its use by querying for gene arrangements, CRMs, and protein domain structure.

Keywords

Hidden Markov Model; Path Expression; Grammar Formalism; Collection Production; Positional Weight Matrix
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Jonathan Schug (1)
  • Max Mintz (2)
  • Christian J. Stoeckert Jr. (1)
  1. Department of Genetics in the School of Medicine
  2. Department of Computer and Information Science in the School of Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA
