Abstract
With the completion of the genomes of many species and the advances in microarray technologies, we are beginning to possess a tremendous amount of valuable biological data, but these raw products are still far from usable. One of the most challenging problems of this century is to decipher this huge amount of biological information, turning the data into knowledge. The past decade has witnessed a number of successful applications of statistical models in computational biology. This article focuses on one of these success stories: using Bayesian models and Monte Carlo methods to find short repetitive patterns in a set of DNA or protein sequences, a task often referred to as motif discovery. We review a few probabilistic models that have recently been shown to be useful for motif discovery and provide a novel framework based on a Bayesian segmentation model to unify these approaches. We show how to combine the dictionary model with the Gibbs sampler and how a segmentation-based data augmentation scheme can be implemented. A few interesting open problems are also discussed.
© 2002 Springer Science+Business Media New York
About this paper
Cite this paper
Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L., Lawrence, C.E. (2002). Statistical Models for Biological Sequence Motif Discovery. In: Gatsonis, C., et al. Case Studies in Bayesian Statistics. Lecture Notes in Statistics, vol 167. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2078-7_1
DOI: https://doi.org/10.1007/978-1-4612-2078-7_1
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-95472-1
Online ISBN: 978-1-4612-2078-7
eBook Packages: Mathematics and Statistics (R0)