Data Integration and Pattern-Finding in Biological Sequence with TESS’s Annotation Grammar and Extraction Language (AnGEL)

  • Conference paper
Data Integration in the Life Sciences (DILS 2007)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4544)

Abstract

Decoding the functional elements in an organism’s genome requires integrating a wide variety of experimental and computational data from many sources. The location of these data, viewed as sequence features in the genome, must serve as one of the essential organizing principles for this integration, so a data integration system should take advantage of it. As part of the TESS project, we have developed a grammar-based data integration and pattern search tool, the Annotation Grammar and Extraction Language (AnGEL), that follows this principle. AnGEL can represent most current work in cis-regulatory module (CRM) modelling in an intuitive way and can process data extracted from a variety of sources simultaneously. Here we describe AnGEL’s capabilities and illustrate its use by querying for gene arrangements, CRMs, and protein domain structure.
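
The idea of organizing integration around feature location can be illustrated with a minimal Python sketch. This is not AnGEL syntax, which this page does not show, and it is not the authors' implementation; the `Feature` record, the source labels, the feature kinds, and both functions are hypothetical names chosen for illustration. The sketch merges features from two assumed sources into one position-ordered list and then searches for a simple ordered pattern of the kind a CRM model might express: one feature kind followed by another within a bounded gap.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Feature:
    """A located annotation on one sequence (hypothetical record layout)."""
    source: str  # where the feature came from, e.g. a motif scan or an experiment
    kind: str    # feature type label
    start: int   # 0-based, inclusive
    end: int     # exclusive


def integrate(*sources: Iterable[Feature]) -> List[Feature]:
    """Merge features from several sources into one list ordered by position."""
    merged = [f for src in sources for f in src]
    return sorted(merged, key=lambda f: (f.start, f.end))


def find_ordered_pairs(
    features: List[Feature], first_kind: str, second_kind: str, max_gap: int
) -> Iterator[Tuple[Feature, Feature]]:
    """Yield (a, b) where b starts at or after the end of a, within max_gap bases.

    Assumes `features` is sorted by start, as returned by integrate().
    """
    for i, a in enumerate(features):
        if a.kind != first_kind:
            continue
        for b in features[i + 1:]:
            if b.start - a.end > max_gap:
                break  # sorted by start, so every later feature is farther away
            if b.kind == second_kind and b.start >= a.end:
                yield a, b


# Illustrative data only: coordinates and labels are made up.
predicted = [
    Feature("motif_scan", "TFBS_A", 100, 110),
    Feature("motif_scan", "TFBS_B", 400, 410),
]
experimental = [Feature("chip", "TFBS_B", 130, 142)]

pairs = list(find_ordered_pairs(integrate(predicted, experimental),
                                "TFBS_A", "TFBS_B", max_gap=50))
```

Because every source is reduced to the same located record, the pattern search does not need to know where a feature came from; the match here pairs a predicted site with an experimentally derived one purely by position.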

Editor information

Sarah Cohen-Boulakia, Val Tannen

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Schug, J., Mintz, M., Stoeckert, C.J. (2007). Data Integration and Pattern-Finding in Biological Sequence with TESS’s Annotation Grammar and Extraction Language (AnGEL). In: Cohen-Boulakia, S., Tannen, V. (eds) Data Integration in the Life Sciences. DILS 2007. Lecture Notes in Computer Science, vol 4544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73255-6_16

  • DOI: https://doi.org/10.1007/978-3-540-73255-6_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73254-9

  • Online ISBN: 978-3-540-73255-6

  • eBook Packages: Computer Science, Computer Science (R0)
