Algorithms for Finding Maximal-Scoring Segment Sets

  • Miklós Csűrös
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)

Abstract

We examine the problem of finding maximal-scoring sets of disjoint regions in a sequence of scores. The problem arises in DNA and protein segmentation, and in post-processing of sequence alignments. Our key result states a simple recursive relationship between maximal-scoring segment sets. The statement leads to an algorithm that finds such a k-set of segments in a sequence of length n in O(nk) time. We describe linear-time algorithms for finding optimal segment sets using different criteria for choosing k, as well as an algorithm for finding an optimal set of k segments in O(n log n) time, independently of k. We apply our methods to the identification of non-coding RNA genes in thermophiles.
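
To make the problem statement concrete, the following is a minimal sketch of a textbook dynamic program that computes the best total score of at most k disjoint contiguous segments in O(nk) time. It is not the recursive construction described in the paper, whose details are not given in this abstract; the function name `max_k_segments_score` and the choice to allow fewer than k segments are assumptions made for illustration.

```python
from typing import Sequence


def max_k_segments_score(scores: Sequence[float], k: int) -> float:
    """Best total score of at most k disjoint contiguous segments.

    Illustrative O(n*k) dynamic program (hypothetical helper, not the
    paper's algorithm). The empty selection, with score 0, is allowed.
    """
    n = len(scores)
    # prev[i]: best total using at most j-1 segments inside scores[:i]
    prev = [0.0] * (n + 1)
    for _ in range(k):
        # cur[i]: best total using at most j segments inside scores[:i]
        cur = [0.0] * (n + 1)
        # ending: best total in which the j-th segment ends exactly at i-1
        ending = float("-inf")
        for i in range(1, n + 1):
            # Either extend the open segment, or start a new one at i-1
            # on top of the best solution with j-1 segments in scores[:i-1].
            ending = max(ending, prev[i - 1]) + scores[i - 1]
            cur[i] = max(cur[i - 1], ending)
        prev = cur
    return prev[n]


if __name__ == "__main__":
    example = [2, -3, 4, -1, 5, -10, 3]
    for k in range(1, 5):
        print(k, max_k_segments_score(example, k))
```

With k = 1 the loop reduces to the classic maximum-sum segment scan from Bentley's column (reference 1); each further pass adds one more permitted segment, which is where the factor k in the running time comes from.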

Keywords

Hidden Markov Model, Maximal Cover, Maximal Chain, Minimum Description Length, Optimal Cover
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Miklós Csűrös (1)
  1. Département d’informatique et de recherche opérationnelle, Université de Montréal, Montréal, Canada
