Data Integration and Pattern-Finding in Biological Sequence with TESS’s Annotation Grammar and Extraction Language (AnGEL)

Schug, Jonathan; Mintz, Max; Stoeckert, Christian J.

doi:10.1007/978-3-540-73255-6_16

Jonathan Schug¹, Max Mintz² & Christian J. Stoeckert Jr.¹

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4544)

Included in the following conference series:

International Conference on Data Integration in the Life Sciences

Abstract

Decoding the functional elements in an organism’s genome requires integrating a wide variety of experimental and computational data from many sources. The location of these data, viewed as sequence features in the genome, must serve as one of the essential organizing principles for this integration, so a data integration system should take advantage of it. As part of the TESS project, we have developed a grammar-based data integration and pattern search tool, the Annotation Grammar and Extraction Language (AnGEL), that follows this principle. AnGEL can represent most current work in cis-regulatory module (CRM) modelling in an intuitive way and can process data extracted from a variety of sources simultaneously. Here we describe AnGEL’s capabilities and illustrate its use by querying for gene arrangements, CRMs, and protein domain structure.
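
To make the idea of location as an organizing principle concrete, the following is a minimal Python sketch, not AnGEL syntax and not the authors' implementation. It treats annotations from different sources as located features on one sequence and searches for a simple ordered arrangement: one feature type followed by another within a bounded gap. The `Feature` class, the `find_ordered_pairs` function, and the source and feature names (`matrix_scan`, `chip_track`, `site_A`, `site_B`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Feature:
    """A located annotation on a sequence (hypothetical structure)."""
    source: str  # assumed label for where the annotation came from
    kind: str    # assumed feature type, e.g. a binding-site name
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


def find_ordered_pairs(
    features: Iterable[Feature],
    first_kind: str,
    second_kind: str,
    max_gap: int,
) -> List[Tuple[Feature, Feature]]:
    """Return (a, b) pairs where a `first_kind` feature is followed by a
    `second_kind` feature that starts 0..max_gap bases after `a` ends."""
    ordered = sorted(features, key=lambda f: (f.start, f.end))
    firsts = [f for f in ordered if f.kind == first_kind]
    seconds = [f for f in ordered if f.kind == second_kind]
    hits = []
    for a in firsts:
        for b in seconds:
            gap = b.start - a.end
            if 0 <= gap <= max_gap:
                hits.append((a, b))
    return hits


# Features from two made-up sources, merged purely by genomic position.
features = [
    Feature("matrix_scan", "site_A", 100, 110),
    Feature("chip_track", "site_B", 130, 145),
    Feature("matrix_scan", "site_A", 400, 410),
    Feature("chip_track", "site_B", 900, 915),
]
pairs = find_ordered_pairs(features, "site_A", "site_B", max_gap=50)
for a, b in pairs:
    print(f"{a.kind}@{a.start}-{a.end} then {b.kind}@{b.start}-{b.end}")
```

The sketch uses a quadratic scan for clarity. A grammar-based tool such as the one described in the abstract would express arrangements like this declaratively rather than as hand-written loops.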

Author information

Authors and Affiliations

¹ Department of Genetics in the School of Medicine: Jonathan Schug & Christian J. Stoeckert Jr.
² Department of Computer and Information Science in the School of Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA: Max Mintz

Editor information

Sarah Cohen-Boulakia, Val Tannen

About this paper

Cite this paper

Schug, J., Mintz, M., Stoeckert, C.J. (2007). Data Integration and Pattern-Finding in Biological Sequence with TESS’s Annotation Grammar and Extraction Language (AnGEL). In: Cohen-Boulakia, S., Tannen, V. (eds) Data Integration in the Life Sciences. DILS 2007. Lecture Notes in Computer Science, vol 4544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73255-6_16

DOI: https://doi.org/10.1007/978-3-540-73255-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73254-9
Online ISBN: 978-3-540-73255-6
eBook Packages: Computer Science, Computer Science (R0)
