Pattern discovery in biosequences

  • Alvis Brāzma
  • Inge Jonassen
  • Jaak Vilo
  • Esko Ukkonen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1433)


We discuss the problem of algorithmic discovery of patterns common to sets of sequences and its applications to computational biology. We formulate a three-step paradigm for pattern discovery, which is based on choosing the hypothesis space, designing a function that rates a pattern with respect to the given sequences, and developing an algorithm that finds the highest-rated patterns. We give some examples of implementing this paradigm and present experimental results of discovering new patterns in sets of biosequences. In these experiments the sets of given sequences are noisy; that is, many of the sequences given as belonging to the family actually do not belong to it. Nevertheless, our algorithms have been able to identify biologically sound patterns. In particular, we present novel results of discovering transcription factor binding sites from the complete set of over 6000 sequences taken from the regions of the yeast genome upstream of the potential genes.
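The three-step paradigm can be illustrated with a minimal sketch. It is not the authors' implementation: the hypothesis space (exact substrings of a fixed length k), the rating function (fraction of sequences containing the pattern), the exhaustive search, the function names, and the toy sequences are all assumptions chosen for illustration only.

```python
def candidate_patterns(sequences, k):
    """Step 1 (hypothesis space, assumed): all length-k substrings of the input."""
    return {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}


def rate(pattern, sequences):
    """Step 2 (rating function, assumed): fraction of sequences containing the pattern."""
    return sum(pattern in seq for seq in sequences) / len(sequences)


def best_patterns(sequences, k, top=3):
    """Step 3 (algorithm, assumed): score every candidate and keep the best ones.

    Ties are broken alphabetically so the result is deterministic.
    """
    scored = [(rate(p, sequences), p) for p in candidate_patterns(sequences, k)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]


# Hypothetical toy data: the last sequence plays the role of noise,
# i.e. a sequence wrongly assigned to the family.
sequences = ["ACGTGACC", "TTGTGACA", "GTGACGGA", "CCCCAAAA"]
result = best_patterns(sequences, k=5, top=1)
print(result)  # [(0.75, 'GTGAC')]
```

Because the rating does not require a pattern to match every sequence, the shared substring still ranks first even though one of the four input sequences does not contain it.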


Yeast Genome; Suffix Tree; Hypothesis Space; Pattern Drive; Minimum Description Length Principle
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Alvis Brāzma (1)
  • Inge Jonassen (2)
  • Jaak Vilo (3)
  • Esko Ukkonen (3)

  1. EMBL Outstation - Hinxton, European Bioinformatics Institute, Hinxton, UK
  2. Department of Informatics, University of Bergen, HIB, Bergen, Norway
  3. Department of Computer Science, University of Helsinki, Finland
