LATIN 1998: LATIN'98: Theoretical Informatics, pp. 374–390

# Spelling approximate repeated or common motifs using a suffix tree

• Marie-France Sagot
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1380)

## Abstract

We present two algorithms in this paper. The first extracts repeated motifs from a sequence defined over an alphabet $\Sigma$. For instance, $\Sigma$ may be equal to {A, C, G, T}, with the sequence representing an encoding of a DNA macromolecule. The motifs sought correspond to words over the same alphabet that occur at least $q$ times in the sequence, with at most $e$ mismatches each time ($q$ is called the quorum constraint). The second algorithm extracts common motifs from a set of $N \geq 2$ sequences. In this case, the motifs must occur, again with at most $e$ mismatches, in $q$ distinct sequences of the set, where $1 \leq q \leq N$. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being “external” objects, and call them “valid models” if they satisfy the quorum constraint $q$. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply “spell” the models. Assuming an alphabet of fixed size, the total time needed is $O(nN^2 V(e, k))$ using $O(nN^2/w)$ space, where $n$ is the (average) length of the sequence(s), $k$ is the length of the models sought or the length of the longest possible valid models, $w$ is the size of a machine word, and $V(e, k)$ is the number of words of length $k$ at a Hamming distance of at most $e$ from another word of length $k$. $V(e, k)$ is bounded above by $k^e |\Sigma|^e$. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the common motifs problem whenever $N < k|\Sigma|$, and a better space bound when $N/w < k$. For the repeated motifs problem, it is a better bound in both time and space in absolute terms: the complexities obtained in this second case are $O(nV(e, k))$ and $O(n)$ respectively.
Finally, we suggest how to extend these algorithms to deal with gaps.
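The definitions above (models, at most e mismatches, quorum q) can be illustrated with a minimal sketch. This is not the suffix-tree algorithm of the paper: it spells models symbol by symbol, as the abstract describes, but tracks candidate start positions directly in the sequence, so it does not attain the stated time or space bounds. The function names `spell_repeated_models` and `neighbourhood_size` are hypothetical, chosen for illustration only.

```python
from math import comb


def neighbourhood_size(e, k, sigma):
    """Count the k-words within Hamming distance e of one fixed k-word.

    Choose i mismatching positions, then one of (sigma - 1) other symbols
    at each; this is the quantity the abstract calls V(e, k).
    """
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(e + 1))


def spell_repeated_models(seq, k, e, q, alphabet="ACGT"):
    """Return {model: start positions} for every valid model of length k.

    A model is valid if it matches at least q windows of seq with at most
    e mismatches. Assumption: occurrences are counted per start position,
    and overlapping occurrences are allowed.
    """
    valid = {}

    def extend(model, occ):
        # occ maps a start position to the mismatches accumulated so far.
        if len(model) == k:
            valid[model] = sorted(occ)
            return
        depth = len(model)
        for symbol in alphabet:
            next_occ = {}
            for start, errors in occ.items():
                cost = errors + (seq[start + depth] != symbol)
                if cost <= e:
                    next_occ[start] = cost
            # Extending a model can only lose occurrences, so a prefix
            # that already fails the quorum can be abandoned.
            if len(next_occ) >= q:
                extend(model + symbol, next_occ)

    # Only positions where a full window of length k fits are candidates.
    extend("", {start: 0 for start in range(len(seq) - k + 1)})
    return valid


models = spell_repeated_models("AAGAATAAC", k=3, e=1, q=7)
print(models)
print(neighbourhood_size(1, 3, 4), "<=", 3 ** 1 * 4 ** 1)
```

In this example the model AAA satisfies the quorum even though it never occurs exactly in the sequence, which is the sense in which models are "external" objects. For the common motifs variant, one would count distinct sequences rather than positions when testing the quorum.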

## Keywords

Valid Model · Space Complexity · Distinct Sequence · Common Motif · Suffix Tree
These keywords were added by machine and not by the authors.

## References

1. R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35:74–82, 1992.
2. P. Bieganski, J. Riedl, J. V. Carlis, and E. M. Retzel. Generalized suffix trees for biological sequence data: applications and implementations. In Proc. of the 27th Hawaii Int. Conf. on Systems Sci., pages 35–44. IEEE Computer Society Press, 1994.
3. B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.
4. A. L. Cobbs. Fast identification of approximately matching substrings. In Z. Galil and E. Ukkonen, editors, Combinatorial Pattern Matching, volume 937 of Lecture Notes in Computer Science, pages 41–54. Springer-Verlag, 1995.
5. M. Crochemore. An optimal algorithm for computing the repetitions in a word. Inf. Proc. Letters, 12:244–250, 1981.
6. M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
7. D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.
8. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, 1997.
9. L. C. K. Hui. Color set size problem with applications to string matching. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, pages 230–243. Springer-Verlag, 1992.
10. C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct., Funct., and Genetics, 7:41–51, 1990.
11. C. Lefevre and J.-E. Ikeda. A fast word search algorithm for the representation of sequence similarity in genomic DNA. Nucleic Acids Res., 22:404–411, 1994.
12. E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, 1976.
13. E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12:345–374, 1994.
14. E. W. Myers. Personal communication, 1997.
15. M.-F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. In R. Baeza-Yates and U. Manber, editors, Second South American Workshop on String Processing, pages 87–100, Viña del Mar, Chile, 1995. University of Chile.
16. M.-F. Sagot and E. W. Myers. Identifying satellites in nucleic acid sequences. Submitted to RECOMB 1998, 1998.
17. M.-F. Sagot and A. Viari. A double combinatorial approach to discovering patterns in biological sequences. In D. Hirschberg and G. Myers, editors, Combinatorial Pattern Matching, volume 1075 of Lecture Notes in Computer Science, pages 186–208. Springer-Verlag, 1996.
18. M.-F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. Theoret. Comput. Sci., 180:115–137, 1997. Presented at Combinatorial Pattern Matching 1995.
19. E. Ukkonen. Constructing suffix trees on-line in linear time, pages 484–492. IFIP'92, 1992.
20. E. Ukkonen. Approximate string matching over suffix trees. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, pages 228–242. Springer-Verlag, 1993.
21. M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.
22. M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.
23. M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.
24. S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool, pages 153–162, San Francisco, CA, 1992. USENIX Technical Conference.
25. S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35:83–91, 1992.