# Spelling approximate repeated or common motifs using a suffix tree

## Abstract

We present two algorithms in this paper. The first extracts repeated motifs from a sequence defined over an alphabet $\sigma$. For instance, $\sigma$ may be equal to {A, C, G, T}, with the sequence representing an encoding of a DNA macromolecule. The motifs sought correspond to words over the same alphabet that occur at least $q$ times in the sequence, with at most $e$ mismatches each time ($q$ is called the quorum constraint). The second algorithm extracts common motifs from a set of $N \ge 2$ sequences. In this case, the motifs must occur, again with at most $e$ mismatches, in at least $q$ distinct sequences of the set, where $1 \le q \le N$. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, whether repeated in a sequence or common to a set of sequences, as “external” objects and call them “valid models” if they satisfy the quorum constraint $q$. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply “spell” the models. Assuming an alphabet of fixed size, the total time needed is $O(nN^{2}V(e, k))$ using $O(nN^{2}/w)$ space, where $n$ is the (average) length of the sequence(s), $k$ is the length of the models sought or the length of the longest possible valid models, $w$ is the size of a machine word, and $V(e, k)$ is the number of words of length $k$ at a Hamming distance of at most $e$ from another word of length $k$. $V(e, k)$ is bounded above by $k^{e}\lvert\sigma\rvert^{e}$. This improves on an algorithm by Waterman [23]. For the common motifs problem, it is also a better time bound than our previous approach [15] whenever $N < k\lvert\sigma\rvert$, and a better space bound whenever $N/w < k$. For the repeated motifs problem, both the time and the space bounds are better unconditionally: the complexities obtained in that case are $O(nV(e, k))$ and $O(n)$ respectively. Finally, we suggest how to extend these algorithms to deal with gaps.
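
To make the notions of quorum, mismatches, and “spelling” concrete, here is a minimal Python sketch. It is not the authors' algorithm: as a simplifying assumption it replaces the suffix tree with a naive trie of all length-$k$ windows, which is far less economical in time and space. The names `build_trie`, `spell_models`, and `neighborhood_size` are hypothetical. The last function uses the standard count of words within Hamming distance $e$ of a fixed word, $\sum_{i=0}^{e}\binom{k}{i}(\lvert\sigma\rvert-1)^{i}$, as a reading of $V(e, k)$.

```python
from math import comb


def build_trie(seqs, k):
    """Naive trie of all length-k windows (a stand-in for the suffix tree).

    Each node stores how many windows pass through it ("count") and a
    bitmask of the sequences those windows come from ("seqs").
    """
    root = {"children": {}, "count": 0, "seqs": 0}
    for idx, s in enumerate(seqs):
        for start in range(len(s) - k + 1):
            node = root
            for ch in s[start:start + k]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "count": 0, "seqs": 0})
                node["count"] += 1
                node["seqs"] |= 1 << idx
    return root


def spell_models(seqs, k, e, q, alphabet="ACGT", common=False):
    """Return all length-k models satisfying the quorum q with <= e mismatches.

    common=False: quorum counts occurrences (repeated motifs).
    common=True:  quorum counts distinct sequences (common motifs).
    """
    root = build_trie(seqs, k)
    valid = []

    def quorum_ok(occ):
        if common:
            mask = 0
            for node, _ in occ:
                mask |= node["seqs"]
            return bin(mask).count("1") >= q
        return sum(node["count"] for node, _ in occ) >= q

    def extend(model, occ):
        # occ: (trie node, mismatches so far) for every approximate
        # occurrence of the current model prefix.
        if len(model) == k:
            valid.append(model)
            return
        for letter in alphabet:
            new_occ = []
            for node, err in occ:
                for ch, child in node["children"].items():
                    cost = err + (ch != letter)
                    if cost <= e:
                        new_occ.append((child, cost))
            # A prefix that fails the quorum cannot be extended into a
            # valid model, so the branch is pruned here.
            if new_occ and quorum_ok(new_occ):
                extend(model + letter, new_occ)

    extend("", [(root, 0)])
    return valid


def neighborhood_size(e, k, sigma_size):
    """Number of length-k words within Hamming distance e of a fixed word."""
    return sum(comb(k, i) * (sigma_size - 1) ** i for i in range(e + 1))


if __name__ == "__main__":
    # "AAA" occurs exactly in none of the three sequences, yet it is a
    # valid model: it matches each of them with one mismatch.
    print(spell_models(["ACA", "AGA", "ATA"], k=3, e=1, q=3, common=True))
    print(neighborhood_size(1, 2, 4), "<=", 2 ** 1 * 4 ** 1)
```

In the first call, the model `AAA` is reported even though it appears in no input sequence, which is the sense in which models are “external” objects. The per-node bitmask mirrors the idea of storing one bit per sequence, which is where a factor of $N/w$ can arise in the space bound.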

## Keywords

Valid Model, Space Complexity, Distinct Sequence, Common Motif, Suffix Tree

## References

- 1. R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. *Commun. ACM*, 35:74–82, 1992.
- 2. P. Bieganski, J. Riedl, J. V. Carlis, and E. M. Retzel. Generalized suffix trees for biological sequence data: applications and implementations. In *Proc. of the 27th Hawaii Int. Conf. on Systems Sci.*, pages 35–44. IEEE Computer Society Press, 1994.
- 3. B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. *Nucleic Acids Res.*, 14:141–158, 1986.
- 4. A. L. Cobbs. Fast identification of approximately matching substrings. In Z. Galil and E. Ukkonen, editors, *Combinatorial Pattern Matching*, volume 937 of *Lecture Notes in Computer Science*, pages 41–54. Springer-Verlag, 1995.
- 5. M. Crochemore. An optimal algorithm for computing the repetitions in a word. *Inf. Proc. Letters*, 12:244–250, 1981.
- 6. M. Crochemore and W. Rytter. *Text Algorithms*. Oxford University Press, 1994.
- 7. D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from *Escherichia coli*. *J. Mol. Biol.*, 186:117–128, 1985.
- 8. D. Gusfield. *Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology*. Cambridge University Press, 1997.
- 9. L. C. K. Hui. Color set size problem with applications to string matching. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, *Combinatorial Pattern Matching*, volume 644 of *Lecture Notes in Computer Science*, pages 230–243. Springer-Verlag, 1992.
- 10. C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. *Proteins: Struct., Funct., and Genetics*, 7:41–51, 1990.
- 11. C. Lefevre and J.-E. Ikeda. A fast word search algorithm for the representation of sequence similarity in genomic DNA. *Nucleic Acids Res.*, 22:404–411, 1994.
- 12. E. M. McCreight. A space-economical suffix tree construction algorithm. *J. ACM*, 23:262–272, 1976.
- 13. E. W. Myers. A sublinear algorithm for approximate keyword searching. *Algorithmica*, 12:345–374, 1994.
- 14. E. W. Myers. Personal communication, 1997.
- 15. M.-F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. In R. Baeza-Yates and U. Manber, editors, *Second South American Workshop on String Processing*, pages 87–100, Viña del Mar, Chile, 1995. University of Chile.
- 16. M.-F. Sagot and E. W. Myers. Identifying satellites in nucleic acid sequences. 1998. Submitted to RECOMB 1998.
- 17. M.-F. Sagot and A. Viari. A double combinatorial approach to discovering patterns in biological sequences. In D. Hirschberg and G. Myers, editors, *Combinatorial Pattern Matching*, volume 1075 of *Lecture Notes in Computer Science*, pages 186–208. Springer-Verlag, 1996.
- 18. M.-F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. *Theoret. Comput. Sci.*, 180:115–137, 1997. Presented at *Combinatorial Pattern Matching 1995*.
- 19. E. Ukkonen. Constructing suffix trees on-line in linear time, pages 484–492. IFIP'92, 1992.
- 20. E. Ukkonen. Approximate string matching over suffix trees. In Z. Galil, A. Apostolico, M. Crochemore, and U. Manber, editors, *Combinatorial Pattern Matching*, volume 684 of *Lecture Notes in Computer Science*, pages 228–242. Springer-Verlag, 1993.
- 21. M. S. Waterman. Multiple sequence alignments by consensus. *Nucleic Acids Res.*, 14:9095–9102, 1986.
- 22. M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, *Mathematical Methods for DNA Sequences*, pages 93–116. CRC Press, 1989.
- 23. M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. *Bull. Math. Biol.*, 46:515–527, 1984.
- 24. S. Wu and U. Manber. Agrep — a fast approximate pattern-matching tool, pages 153–162, San Francisco, CA, 1992. USENIX Technical Conference.
- 25. S. Wu and U. Manber. Fast text searching allowing errors. *Commun. ACM*, 35:83–91, 1992.