Biological Sequence Data Mining

Hu, Yuh-Jyh

doi:10.1007/3-540-44794-6_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2168)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Biologists have established that gene expression is controlled and regulated primarily by relatively short sequences in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation, and bases. Finding them is a fundamental problem in molecular biology with important applications. Although many approaches to signal/motif (i.e., short-sequence) finding exist, Pevzner and Sze reported in 2000 that most current motif-finding algorithms cannot detect the target signals in their so-called Challenge Problem. In this paper, we show that, using an iterative-restart design, our new algorithm correctly finds the targets. Furthermore, because some transcription factors form dimers or even more complex structures, and the transcription process can sometimes involve multiple factors, we extend the original problem to an even more challenging one: combinatorial signals with gaps of variable length. To demonstrate the efficacy of our algorithm, we tested it on a series of the original and the new challenge problems and compared it with several representative motif-finding algorithms. To verify its feasibility in real-world applications, we also tested it on several regulatory families of yeast genes with known motifs. The purpose of this paper is twofold: first, to introduce an improved biological data mining algorithm that can handle more variable regulatory signals in DNA sequences; second, to propose a new research direction for the general KDD community.
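
To make the notion of a combinatorial signal with a variable-length gap concrete, the following minimal Python sketch scans one sequence for two half-sites separated by a gap within a given range, allowing a bounded number of substitutions. It only illustrates the problem setting and is not the algorithm proposed in the paper; the function names (`hamming`, `find_gapped_sites`), their parameters, and the example sequence are hypothetical.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_gapped_sites(seq: str, left: str, right: str,
                      min_gap: int, max_gap: int,
                      max_mismatches: int) -> List[Tuple[int, int]]:
    """Return (start, gap) pairs where `left`, a gap of `gap` bases, and
    `right` occur in `seq` with at most `max_mismatches` substitutions
    in total across both half-sites (hypothetical helper)."""
    hits = []
    for start in range(len(seq) - len(left) + 1):
        d_left = hamming(seq[start:start + len(left)], left)
        if d_left > max_mismatches:
            continue  # left half-site already exceeds the mismatch budget
        for gap in range(min_gap, max_gap + 1):
            r_start = start + len(left) + gap
            r_end = r_start + len(right)
            if r_end > len(seq):
                break  # larger gaps would run past the end of the sequence
            if d_left + hamming(seq[r_start:r_end], right) <= max_mismatches:
                hits.append((start, gap))
    return hits


# Example: half-sites ACGT and GGCA separated by a 3-base gap.
example = "TTACGTAAAGGCATT"
print(find_gapped_sites(example, "ACGT", "GGCA", 1, 5, 0))  # [(2, 3)]
```

A motif finder faces the harder, inverse task: the half-sites, the gap range, and the site positions are all unknown and must be inferred from a set of sequences. This exhaustive scan only shows what a single occurrence of such a signal looks like.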

References

DeRisi, J., Iyer, V. and Brown, P., “Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale”, Science, Vol 278, (1997) pp. 680–686.
Wodicka, L., Dong, H., Mittmann, M., Ho, M. and Lockhart, D., “Genome-wide Expression Monitoring in Saccharomyces cerevisiae”, Nature Biotechnology, Vol 15, (1997) pp. 1359–1367.
Bailey, T. and Elkan, C., “Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization”, Machine Learning, 21, (1995) pp. 51–80.
Hertz, G., Hartzell III, G. and Stormo, G., “Identification of Consensus Patterns in Unaligned DNA Sequences Known to be Functionally Related”, Computer Applications in Biosciences, Vol 6, No 2, (1990) pp. 81–92.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A. and Wootton, J., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignments”, Science, Vol 262, (1993) pp. 208–214.
van Helden, J., Andre, B. and Collado-Vides, J., “Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies”, Journal of Molecular Biology, 281, (1998) pp. 827–842.
Hu, Y., Sandmeyer, S. and Kibler, D., “Detecting Motifs from Sequences”, in Proceedings of the 16th International Conference on Machine Learning, (1999) pp. 181–190.
Gelfand, M., Koonin, E. and Mironov, A., “Prediction of Transcription Regulatory Sites in Archaea by a Comparative Genomic Approach”, Nucleic Acids Research, Vol 28(3), (2000), pp. 695–705.
Li, M., Ma, B. and Wang, L., “Finding Similar Regions in Many Strings”, in Proceedings of the 31st ACM Annual Symposium on Theory of Computing, (1999) pp. 473–482.
Pevzner, P. and Sze, S., “Combinatorial Approaches to Finding Subtle Signals in DNA Sequences”, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (2000).
Rocke, E. and Tompa, M., “An Algorithm for Finding Novel Gapped Motifs in DNA Sequences”, in RECOMB-98, (1998) pp. 228–233.
Sinha, S. and Tompa, M., “A Statistical Method for Finding Transcription Factor Binding Sites”, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, (2000).
van Helden, J., Rios, A. F. and Collado-Vides, J., “Discovering Regulatory Elements in Non-coding Sequences by Analysis of Spaced Dyads”, Nucleic Acids Research, Vol 28, (2000) pp. 1808–1818.
Bairoch, A., “PROSITE: a dictionary of sites and patterns in proteins”, Nucleic Acids Research, 20, (1992) pp. 2013–2018.
Jonassen, I., “Methods for Finding Motifs in Sets of Related Biosequences”, Dept. of Informatics, Univ. of Bergen, Norway, PhD thesis, 1996.
Hu, Y., Sandmeyer, S., McLaughlin, C. and Kibler, D., “Combinatorial Motif Analysis and Hypothesis Generation on A Genomic Scale”, Bioinformatics, Vol 16, (2000) pp. 222–232.
Stormo, G., “Computer Methods for Analyzing Sequence Recognition of Nucleic Acids”, Annual Review of Biophysics and Biophysical Chemistry, 17, (1988) pp. 241–263.
Lawrence, C. and Reilly, A., “An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences”, Proteins: Structure, Function, and Genetics, 7, (1990) pp. 41–51.

Author information

Authors and Affiliations

Computer and Information Science Department, National Chiao-Tung University, 1001 Ta Shueh Rd., Hsinchu, Taiwan
Yuh-Jyh Hu

Editor information

Editors and Affiliations

Department of Computer Science, Albert-Ludwigs University Freiburg, Georges Köhler-Allee, Geb. 079, 79110, Freiburg, Germany
Luc De Raedt
Inst. of Information and Computing Sciences, Dept. of Mathematics and Computer Science, University of Utrecht, Padualaan 14, de Uithof, 3508 TB Utrecht, The Netherlands
Arno Siebes

Cite this paper

Hu, YJ. (2001). Biological Sequence Data Mining. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_19

DOI: https://doi.org/10.1007/3-540-44794-6_19
Published: 28 August 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive
