Comparing Sequences with Segment Rearrangements

  • Funda Ergun
  • S. Muthukrishnan
  • S. Cenk Sahinalp
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2914)

Abstract

Computational genomics involves comparing sequences based on “similarity” to detect evolutionary and functional relationships. Until very recently, the available portions of the human genome sequence (and those of other species) were fairly short and sparse. Most sequencing effort was focused on genes and other short units; similarity between such sequences was measured based on character-level differences. However, with the advent of whole genome sequencing technology, there is an emerging consensus that the measure of similarity between long genome sequences must capture the rearrangements of large segments found in abundance in the human genome.

In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem is closely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these other problems, is NP-hard. Our main result here is a simple O(1) factor approximation algorithm for this problem. In contrast, the best known approximations for the related problems are a factor Ω(log n) off from the optimal. Our algorithm works in linear time and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements to each other, with bounds that are tight up to constant factors.
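To make the notion of a segment-level measure concrete, the following is a minimal illustrative sketch, not the algorithm of the paper, which the abstract does not specify. It assumes a toy cost: the number of pieces in a greedy left-to-right parse of a target string, where each piece is the longest prefix of the remaining target that occurs somewhere in the source, and a character absent from the source forms a piece on its own. The function name `greedy_copy_parse` and the example strings are hypothetical, and this naive version is not linear time.

```python
from typing import List


def greedy_copy_parse(source: str, target: str) -> List[str]:
    """Toy segment parse (an assumption for illustration only).

    Splits `target` left to right into pieces. Each piece is the longest
    prefix of the remaining target that occurs as a substring of `source`;
    a character that does not occur in `source` becomes a piece of length 1.
    """
    pieces: List[str] = []
    i = 0
    while i < len(target):
        length = 1
        # Grow the candidate piece while it still occurs in source.
        while i + length <= len(target) and target[i:i + length] in source:
            length += 1
        # The loop overshoots by one whenever at least one character matched.
        piece_len = max(1, length - 1)
        pieces.append(target[i:i + piece_len])
        i += piece_len
    return pieces


# Swapping the two halves of a sequence yields only two pieces.
print(greedy_copy_parse("ACGTTGCA", "TGCAACGT"))  # ['TGCA', 'ACGT']
```

In this example the target is the source with its two halves exchanged, so a segment-aware count stays small even though many individual positions differ.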

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Funda Ergun (1)
  • S. Muthukrishnan (2)
  • S. Cenk Sahinalp (3)
  1. Department of EECS, CWRU
  2. Rutgers Univ. and AT&T Research
  3. Depts of EECS, Genetics and Center for Comp. Genomics, CWRU
