Abstract
An absent word in a sequence is a segment that does not occur in the given sequence. It is a minimal absent word if all its proper factors occur in the given sequence.
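The definition above can be checked directly on a small example. The following is a minimal brute-force sketch, not the method of the paper: it relies on the standard observation that every proper factor of a word w occurs in the sequence exactly when both w without its first letter and w without its last letter occur. The function names, the `max_len` bound, and the default DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def factors(seq, max_len):
    """Return all factors (substrings) of seq of length <= max_len, plus the empty word."""
    found = {""}
    for i in range(len(seq)):
        for j in range(i + 1, min(i + max_len, len(seq)) + 1):
            found.add(seq[i:j])
    return found


def minimal_absent_words_naive(seq, max_len, alphabet="ACGT"):
    """Enumerate every candidate word up to max_len and keep the minimal absent ones.

    Exponential in max_len; intended only to illustrate the definition.
    """
    facs = factors(seq, max_len)
    result = set()
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            # w is absent, but its longest proper prefix and suffix both occur.
            if w not in facs and w[1:] in facs and w[:-1] in facs:
                result.add(w)
    return result


print(sorted(minimal_absent_words_naive("AAC", 3)))
```

For the sequence AAC, the word AAA is a minimal absent word because AA occurs, whereas CAA is absent but not minimal because its factor CA is already absent.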
In this paper, we review the concept of minimal absent words, which includes the notion of shortest absent words but is much stronger. We present an efficient method for computing the minimal absent words of bounded length for DNA sequences using a suffix trie of bounded depth, which represents the factors of bounded length. The method outputs the whole set of minimal absent words, and the technique yields a linear-time algorithm that uses less memory than previous solutions.
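The idea of a bounded-depth trie can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the trie is built naively by inserting every window of length `max_len`, suffix links are computed breadth-first, and each node stores its own word for readability, so the sketch does not reproduce the time or memory bounds claimed above. All class and function names are hypothetical.

```python
from collections import deque


class Node:
    """Trie node; `word` is stored only to keep the sketch readable."""
    __slots__ = ("children", "suffix", "word")

    def __init__(self, word=""):
        self.children = {}
        self.suffix = None
        self.word = word


def build_bounded_trie(seq, depth):
    """Insert every window seq[i:i+depth]; the nodes are the factors of length <= depth."""
    root = Node()
    for i in range(len(seq)):
        node = root
        for c in seq[i:i + depth]:
            nxt = node.children.get(c)
            if nxt is None:
                nxt = Node(node.word + c)
                node.children[c] = nxt
            node = nxt
    return root


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Return the minimal absent words of seq whose length is at most max_len."""
    root = build_bounded_trie(seq, max_len)
    maws = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Suffix link of the child for letter c: follow the parent's link, then read c.
        for c, child in node.children.items():
            child.suffix = root if node is root else node.suffix.children[c]
            queue.append(child)
        if len(node.word) >= max_len:
            continue  # extending this node would exceed the length bound
        for b in alphabet:
            if b in node.children:
                continue  # node.word + b occurs, so it is not absent
            # node.word + b is absent; it is minimal if its longest proper suffix occurs.
            if node is root or b in node.suffix.children:
                maws.add(node.word + b)
    return maws


print(sorted(minimal_absent_words("AAC", 3)))
```

The breadth-first order guarantees that a node's suffix link is set before its children are processed, and the link target always exists because the set of bounded-length factors is closed under taking suffixes.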
We also present an approach to distinguishing sequences of different organisms using their minimal absent words. Our solution applies a length-weighted index to discriminate sequences, and the results show that a phylogenetic tree can be built from the collected information.
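The abstract does not define the length-weighted index, so the sketch below rests on an assumption: the index between two sequences is taken to be the sum of 1/|w|^2 over the words w that are minimal absent words of exactly one of the two sequences. The function names and the toy input sets are hypothetical; the resulting pairwise values are the kind of matrix that a standard tree-building method could take as input.

```python
def length_weighted_index(maws_x, maws_y):
    """Assumed index: sum of 1/len(w)^2 over the symmetric difference of the two sets.

    Shorter words that separate the two sequences weigh more than longer ones.
    Both arguments are sets of non-empty strings.
    """
    return sum(1.0 / len(w) ** 2 for w in maws_x ^ maws_y)


def pairwise_index(maw_sets):
    """Map each ordered pair of sequence names to its index value."""
    names = sorted(maw_sets)
    return {
        (a, b): length_weighted_index(maw_sets[a], maw_sets[b])
        for a in names
        for b in names
    }


# Toy data: sets of minimal absent words for two hypothetical sequences.
example = {
    "seq1": {"G", "T", "CA", "CC", "AAA"},
    "seq2": {"G", "CA", "TT"},
}
matrix = pairwise_index(example)
print(matrix[("seq1", "seq2")])
```

Under this assumed definition the index is symmetric and is zero for two sequences with identical sets of minimal absent words.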
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chairungsee, S., Crochemore, M. (2011). Building Phylogeny with Minimal Absent Words. In: Bouchou-Markhoff, B., Caron, P., Champarnaud, JM., Maurel, D. (eds) Implementation and Application of Automata. CIAA 2011. Lecture Notes in Computer Science, vol 6807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22256-6_10
DOI: https://doi.org/10.1007/978-3-642-22256-6_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22255-9
Online ISBN: 978-3-642-22256-6
eBook Packages: Computer Science (R0)