Distributed and Paged Suffix Trees for Large Genetic Databases

  • Raphaël Clifford
  • Marek Sergot
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2676)

Abstract

We present two new variants of the suffix tree that allow much larger genome sequence databases to be handled efficiently. The method is based on a new linear-time construction algorithm for “sparse” suffix trees, which are subtrees of the whole suffix tree. The new data structures are called the paged suffix tree (PST) and the distributed suffix tree (DST). Both tackle the memory bottleneck by constructing subtrees of the full suffix tree independently; the PST is designed for single-processor environments and the DST for distributed-memory parallel computing environments (e.g. Beowulf clusters). The standard operations on suffix trees of biological importance are shown to be easily translatable to these new data structures. None of these operations on the DST requires interprocess communication, and many have optimal expected parallel running times.
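
As a rough illustration of the idea of building subtrees of the full suffix tree independently, the sketch below groups the suffixes of a string by their first k characters and builds a separate structure for each group. It is not the authors' linear-time construction: it builds naive, uncompressed tries in quadratic time, and the partitioning rule, the parameter k, and the names Node, partition_suffixes, build_subtrie, build_all and occurrences are all assumptions made for this example.

```python
from collections import defaultdict


class Node:
    """One node of an uncompressed trie (hypothetical helper for this sketch)."""

    def __init__(self):
        self.children = {}  # next character -> child Node
        self.starts = []    # start positions of suffixes passing through this node


def partition_suffixes(text, k=1):
    """Group suffix start positions by the first k characters of each suffix."""
    groups = defaultdict(list)
    for i in range(len(text)):
        groups[text[i:i + k]].append(i)
    return dict(groups)


def build_subtrie(text, starts):
    """Naively build a trie over the given suffixes only (quadratic, not linear time)."""
    root = Node()
    for i in starts:
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(i)
    return root


def build_all(text, k=1):
    """Build one independent subtrie per prefix group; no group needs any other."""
    return {key: build_subtrie(text, starts)
            for key, starts in partition_suffixes(text, k).items()}


def occurrences(subtries, pattern, k=1):
    """Return sorted start positions of a non-empty pattern, visiting only matching groups."""
    hits = []
    for key, root in subtries.items():
        # Skip groups whose key cannot be the start of a match.
        if not key.startswith(pattern[:k]):
            continue
        node = root
        for ch in pattern:
            node = node.children.get(ch)
            if node is None:
                break
        else:
            hits.extend(node.starts)
    return sorted(hits)


subtries = build_all("banana", k=1)
print(occurrences(subtries, "ana"))  # [1, 3]
```

Because each group is built and queried without reference to the others, the groups could in principle be processed one at a time on a single machine or assigned to separate nodes, which is the property the abstract attributes to the PST and DST respectively.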

Keywords

Computing node; construction algorithm; maximal repeat; suffix tree; input string
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Raphaël Clifford (1)
  • Marek Sergot (1)
  1. Department of Computing, Imperial College, London
