Statistical Models for Biological Sequence Motif Discovery

  • Conference paper
Case Studies in Bayesian Statistics

Part of the book series: Lecture Notes in Statistics (LNS, volume 167)

Abstract

With the genomes of many species now completed and microarray technologies advancing, we are beginning to possess a tremendous amount of valuable biological data, but these raw products are still far from usable. One of the most challenging problems of this century is to decipher this huge amount of biological information and turn the data into knowledge. The past decade has witnessed a number of successful applications of statistical models in computational biology. This article focuses on one of these success stories: using Bayesian models and Monte Carlo methods to find short repetitive patterns in a set of DNA or protein sequences, a task often referred to as motif discovery. We review a few probabilistic models that have recently been shown to be useful for motif discovery and provide a novel framework, based on a Bayesian segmentation model, that unifies these approaches. We show how to combine the dictionary model with the Gibbs sampler and how a segmentation-based data augmentation scheme can be implemented. A few interesting open problems are also discussed.
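To make the task concrete, the following is a minimal Python sketch of a Gibbs site sampler for motif discovery. It is not the dictionary or segmentation model of the chapter. It assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background over the four nucleotides, and symmetric pseudocounts. The function name `gibbs_site_sampler` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random

ALPHABET = "ACGT"


def gibbs_site_sampler(seqs, width, n_iter=200, pseudo=0.5, seed=0):
    """Toy site sampler (assumption: one motif site per sequence)."""
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Column-wise counts from the current sites of all OTHER sequences,
            # initialised with pseudocounts so no probability is zero.
            counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != i:
                    for k in range(width):
                        counts[k][other[starts[j] + k]] += 1
            totals = [sum(col.values()) for col in counts]
            # Weight of each candidate start: motif probability of the window
            # divided by its probability under the uniform background (0.25).
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= (counts[k][seq[p + k]] / totals[k]) / 0.25
                weights.append(w)
            # Resample this sequence's site in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    # Made-up example sequences, for illustration only.
    demo = ["ACGTTGACGTAC", "TTGACGTACCGA", "GGTTGACGTTTA"]
    print(gibbs_site_sampler(demo, width=6, n_iter=50, seed=1))
```

The returned list holds one sampled start position per sequence. Because the sampler is stochastic, a single short run need not land on a shared pattern; in practice several runs from different seeds would be compared.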

Copyright information

© 2002 Springer Science+Business Media New York

About this paper

Cite this paper

Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L., Lawrence, C.E. (2002). Statistical Models for Biological Sequence Motif Discovery. In: Gatsonis, C., et al. Case Studies in Bayesian Statistics. Lecture Notes in Statistics, vol 167. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2078-7_1
