Abstract
Biologists have determined that the control and regulation of gene expression are governed primarily by relatively short sequences in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation, and base composition. Finding them is a fundamental problem in molecular biology with important applications. Although many approaches to signal (motif) finding exist, Pevzner and Sze reported in 2000 that most existing motif-finding algorithms were incapable of detecting the target signals in their so-called Challenge Problem. In this paper, we show that, using an iterative-restart design, our new algorithm can correctly find the targets. Furthermore, because some transcription factors form dimers or even more complex structures, and the transcription process can involve multiple factors, we extend the original problem to an even more challenging one: combinatorial signals with gaps of variable length. To demonstrate the efficacy of our algorithm, we tested it on a series of the original and the new challenge problems and compared it with some representative motif-finding algorithms. In addition, to verify its feasibility in real-world applications, we tested it on several regulatory families of yeast genes with known motifs. The purpose of this paper is two-fold: first, to introduce an improved biological data mining algorithm capable of dealing with more variable regulatory signals in DNA sequences; second, to propose a new research direction for the general KDD community.
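To make the Challenge Problem concrete, the following is a minimal sketch of how a planted-motif test instance might be generated. It is not the algorithm of this paper. The default parameters (20 sequences of length 600, a motif of length 15 with exactly 4 substitutions per planted copy) follow the setting commonly attributed to Pevzner and Sze and are not stated in the abstract; the function names `mutate`, `planted_instance`, and `hamming` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions changed to a different base."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)


def planted_instance(n_seqs=20, seq_len=600, motif_len=15, d=4, seed=0):
    """Build random sequences, each hiding one mutated copy of a common motif.

    Assumed setting: uniform random background, one planted copy per sequence.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(motif_len))
    seqs, positions = [], []
    for _ in range(n_seqs):
        background = [rng.choice(ALPHABET) for _ in range(seq_len)]
        start = rng.randrange(seq_len - motif_len + 1)
        # Overwrite a window of the background with a mutated motif copy.
        background[start:start + motif_len] = mutate(motif, d, rng)
        seqs.append("".join(background))
        positions.append(start)
    return motif, seqs, positions


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


motif, seqs, positions = planted_instance()
```

A motif finder receives only `seqs`; the returned `motif` and `positions` serve as the ground truth for scoring its predictions. Any two planted copies can differ from each other in up to 8 positions, which illustrates why the signal is subtle.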
References
DeRisi, J., Iyer, V. and Brown, P., “Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale”, Science, Vol 278, (1997) pp. 680–696.
Wodicka, L., Dong, H., Mittmann, M., Ho, M. and Lockhart, D., “Genome-wide Expression Monitoring in Saccharomyces cerevisiae”, Nature Biotechnology, Vol 15, (1997) pp. 1359–1367.
Bailey, T. and Elkan, C., “Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization”, Machine Learning, 21, (1995) pp. 51–80.
Hertz, G., Hartzell III, G. and Stormo, G., “Identification of Consensus Patterns in Unaligned DNA Sequences Known to be Functionally Related”, Computer Applications in the Biosciences, Vol 6, No 2, (1990) pp. 81–92.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A. and Wootton, J., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignments”, Science, Vol 262, (1993) pp. 208–214.
van Helden, J., Andre, B. and Collado-Vides, J., “Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies”, Journal of Molecular Biology, 281, (1998) pp. 827–842.
Hu, Y., Sandmeyer, S. and Kibler, D., “Detecting Motifs from Sequences”, in Proceedings of the 16th International Conference on Machine Learning, (1999) pp. 181–190.
Gelfand, M., Koonin, E. and Mironov, A., “Prediction of Transcription Regulatory Sites in Archaea by a Comparative Genomic Approach”, Nucleic Acids Research, Vol 28(3), (2000), pp. 695–705.
Li, M., Ma, B. and Wang, L. “Finding Similar Regions in Many Strings”, in Proceedings of the 31st ACM Annual Symposium on Theory of Computing, (1999) pp. 473–482.
Pevzner, P. and Sze, S. “Combinatorial Approaches to Finding Subtle Signals in DNA Sequences”, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (2000).
Rocke, E. and Tompa, M. “An Algorithm for Finding Novel Gapped Motifs in DNA Sequences”, in RECOMB-98, (1998) pp. 228–233.
Sinha, S. and Tompa, M. “A Statistical Method for Finding Transcription Binding Sites”, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (2000).
van Helden, J., Rios, A. F. and Collado-Vides, J., “Discovering Regulatory Elements in Non-coding Sequences by Analysis of Spaced Dyads”, Nucleic Acids Research, Vol 28, (2000) pp. 1808–1818.
Bairoch, A. “PROSITE: a dictionary of sites and patterns in proteins”, Nucleic Acids Research, 20, (1992) pp. 2013–2018.
Jonassen, I. “Methods for Finding Motifs in Sets of Related Biosequences”, Dept. of Informatics, Univ. of Bergen, Norway, PhD thesis, 1996.
Hu, Y., Sandmeyer, S., McLaughlin, C. and Kibler, D., “Combinatorial Motif Analysis and Hypothesis Generation on a Genomic Scale”, Bioinformatics, Vol 16, (2000) pp. 222–232.
Stormo, G. “Computer Methods for Analyzing Sequence Recognition of Nucleic Acids”, Annual Review of Biophysics and Biophysical Chemistry, 17, (1988) pp. 241–263.
Lawrence, C. and Reilly, A. “An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences”, Proteins: Structure, Function, and Genetics, 7, (1990) pp. 41–51.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hu, YJ. (2001). Biological Sequence Data Mining. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science(), vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_19
DOI: https://doi.org/10.1007/3-540-44794-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive