Statistical Models for Biological Sequence Motif Discovery

Liu, Jin S.; Gupta, Mayetri; Liu, Xiaole; Mayerhofere, Linda; Lawrence, Charles E.

doi:10.1007/978-1-4612-2078-7_1

Part of the book series: Lecture Notes in Statistics (LNS, volume 167)

Abstract

With the completion of the genomes of many species and advances in microarray technologies, we are beginning to possess a tremendous amount of valuable biological data, but these raw products are still far from usable. One of the most challenging problems of this century is to decipher this huge amount of biological information, turning the data into knowledge. The past decade has witnessed a number of successful applications of statistical models in computational biology. This article focuses on one of these success stories: using Bayesian models and Monte Carlo methods to find short repetitive patterns in a set of DNA or protein sequences, a task often referred to as motif discovery. We review a few probabilistic models that have recently been shown to be useful for motif discovery and provide a novel framework, based on a Bayesian segmentation model, to unify these approaches. We show how to combine the dictionary model with the Gibbs sampler and how a segmentation-based data augmentation scheme can be implemented. A few interesting open problems are also discussed.

Editor information

Editors and Affiliations

Center for Statistical Science, Brown University, Box G-A416, Providence, RI, 02912, USA
Constantine Gatsonis
Department of Statistics, Carnegie Mellon University, Baker Hall 229, Pittsburgh, PA, 15213, USA
Robert E. Kass
Department of Statistics, Iowa State University, 222 Snedecor Hall, Ames, IA, 50011-1210, USA
Alicia Carriquiry
Department of Statistics, Columbia University, 618 Mathematics Building, New York, NY, 10027, USA
Andrew Gelman
Institute of Statistics and Decision Sciences, Duke University, Durham, NC, 27708-0251, USA
David Higdon
Program in Biostatistics, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, MP-557, Seattle, WA, 98109-1024, USA
Donna K. Pauler
Department of Statistics, Carnegie Mellon University, Baker Hall 232, Pittsburgh, PA, 15213, USA
Isabella Verdinelli

About this paper

Cite this paper

Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L., Lawrence, C.E. (2002). Statistical Models for Biological Sequence Motif Discovery. In: Gatsonis, C., et al. Case Studies in Bayesian Statistics. Lecture Notes in Statistics, vol 167. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2078-7_1

DOI: https://doi.org/10.1007/978-1-4612-2078-7_1
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-95472-1
Online ISBN: 978-1-4612-2078-7
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)
