# Simple and Practical Sequence Nearest Neighbors with Block Operations

## Abstract

The sequence nearest neighbors problem can be defined as follows. Given a database *D* of *n* sequences, preprocess *D* so that given any query sequence *Q*, one can quickly find a sequence *S* in *D* for which *d(S, Q) ≤ d(T, Q)* for any other sequence *T* in *D*. Here *d(S, Q)* denotes the “distance” between sequences *S* and *Q*, which can be defined as the minimum number of “edit operations” needed to transform one sequence into the other. The edit operations considered in this paper include single character edits (insertions, deletions, replacements) as well as block (substring) edits (copying, uncopying and relocating blocks).
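To make the problem statement concrete, here is a minimal brute-force sketch. It assumes the classical character-level edit distance (insertions, deletions, replacements only) as a stand-in for *d*, since it can be computed exactly by dynamic programming; it handles no block operations and does no preprocessing. The names `edit_distance` and `nearest_neighbor` are illustrative and do not come from the paper.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, replace).

    Stand-in for d(S, Q); block edits are NOT modelled here.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]


def nearest_neighbor(database: Sequence[str], query: str,
                     dist: Callable[[str, str], int] = edit_distance) -> str:
    """Return S in the database minimizing dist(S, query) by linear scan."""
    return min(database, key=lambda s: dist(s, query))


if __name__ == "__main__":
    db = ["ACGT", "AAGT", "TTTT"]
    print(nearest_neighbor(db, "ACGA"))
```

The linear scan costs one distance computation per database sequence for every query, which is the cost a preprocessed data structure aims to avoid.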

One of the main application domains for the sequence nearest neighbors problem is computational genomics, where available tools for sequence comparison and search usually focus on edit operations involving single characters only. While such tools are useful for capturing certain evolutionary mechanisms (mainly point mutations), they may have limited applicability for understanding the mechanisms of segmental rearrangement (duplications, translocations and deletions) underlying genome evolution. Recent improvements in resolving the composition of the human genome suggest that such segmental rearrangements are much more common than previously estimated. Thus there is a substantial need for incorporating similarity measures that capture block edit operations in genomic sequence comparison and search. Unfortunately, even the computation of a block edit distance between two sequences under any set of non-trivial edit operations is NP-hard.

The first efficient data structure for approximate sequence nearest neighbor search for any set of non-trivial edit operations was described in [11]; the measure considered in this paper is the block edit distance. This method achieves preprocessing time and space polynomial in the size of *D* and query time near-linear in the size of *Q* by allowing an approximation factor of *O*(log *l* (log\* *l*)²). The approach involves embedding sequences into Hamming space so that approximating Hamming distances estimates sequence block edit distances within the approximation ratio above.
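The general idea of comparing sequences through binary vectors can be illustrated with a much simpler stand-in than the transformation of [11]. The sketch below marks which length-*q* substrings occur in each sequence and counts the coordinates where the two indicator vectors differ. This is not the embedding of [11] and carries none of its approximation guarantee; the function names and the choice *q* = 2 are assumptions made for illustration only.

```python
def qgram_set(s: str, q: int = 2) -> set:
    """Set of all length-q substrings of s.

    Equivalent to a binary vector indexed by all possible q-grams,
    with a 1 wherever that q-gram occurs in s.
    """
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def embedded_hamming(s: str, t: str, q: int = 2) -> int:
    """Hamming distance between the q-gram indicator vectors of s and t.

    Two indicator vectors differ exactly on the q-grams present in one
    sequence but not the other, i.e. on the symmetric difference.
    """
    return len(qgram_set(s, q) ^ qgram_set(t, q))


if __name__ == "__main__":
    # Relocating the block "EFGH" to the front changes only the q-grams
    # that straddle the block boundaries.
    print(embedded_hamming("ABCDEFGH", "EFGHABCD"))
```

In this example a single block relocation disturbs only the substrings at the block boundaries, so the vectors stay close even though many character-level edits would be needed to turn one sequence into the other.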

In this study we focus on simplification and experimental evaluation of the method of [11]. We first describe how we implement and test the accuracy of the transformations provided in [11] in terms of estimating the block edit distance on controlled data sets. Then, based on the Hamming distance estimator described in [3], we present a data structure for computing approximate nearest neighbors in Hamming space; this is simpler than the well-known ones in [9,6]. We finally report on how well the combined data structure performs for sequence nearest neighbor search under block edit distance.
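As a rough illustration of nearest neighbor search through Hamming distance estimates, the sketch below samples a fixed set of random coordinates, stores only those coordinates for each vector, and scales the observed mismatch rate back up to the full dimension. This plain coordinate-sampling scheme is an assumption chosen for brevity; it is not the estimator of [3] nor the data structure evaluated in the paper, and all names here are hypothetical.

```python
import random
from typing import List, Sequence


def make_positions(dim: int, k: int, seed: int = 0) -> List[int]:
    """Pick k random coordinates (with replacement) shared by all sketches."""
    rng = random.Random(seed)
    return [rng.randrange(dim) for _ in range(k)]


def sketch(bits: Sequence[int], positions: Sequence[int]) -> List[int]:
    """Keep only the sampled coordinates of a binary vector."""
    return [bits[p] for p in positions]


def estimate_hamming(sk_a: Sequence[int], sk_b: Sequence[int], dim: int) -> float:
    """Scale the mismatch rate on sampled coordinates up to dimension dim."""
    mismatches = sum(x != y for x, y in zip(sk_a, sk_b))
    return dim * mismatches / len(sk_a)


def approx_nearest(db_sketches: Sequence[Sequence[int]],
                   query_sketch: Sequence[int], dim: int) -> int:
    """Index of the database sketch with the smallest estimated distance."""
    return min(range(len(db_sketches)),
               key=lambda i: estimate_hamming(db_sketches[i], query_sketch, dim))


if __name__ == "__main__":
    dim = 64
    positions = make_positions(dim, k=16)
    database = [[0] * dim, [1] * dim]
    db_sketches = [sketch(v, positions) for v in database]
    query = [0] * dim
    print(approx_nearest(db_sketches, sketch(query, positions), dim))
```

Each query then compares sketches of length *k* instead of full vectors, trading exactness for speed; the estimate becomes more reliable as *k* grows.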

### References

- 1. A. N. Arslan, O. Egecioglu, P. A. Pevzner, *A new approach to sequence comparison: normalized sequence alignment*, *Proceedings of RECOMB 2001*.
- 2. Bailey J.A., Yavor A.M., Massa H.F., Trask B.J., Eichler E.E., *Segmental duplications: organization and impact within the current human genome project assembly*, *Genome Research* 11(6), Jun 2001.
- 3. G. Cormode, M. Paterson, S. C. Sahinalp and U. Vishkin. Communication Complexity of Document Exchange. *Proc. ACM-SIAM Symp. on Discrete Algorithms*, 2000.
- 4. G. Cormode, S. Muthukrishnan, S. C. Sahinalp. Permutation editing and matching via Embeddings. *Proc. ICALP*, 2001.
- 5. Feng D.F., Doolittle R.F., *Progressive sequence alignment as a prerequisite to correct phylogenetic trees*, *J Mol Evol.* 1987;25(4):351–60.
- 6. P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. *Proc. ACM Symp. on Theory of Computing*, 1998, 604–613.
- 7. Jackson, Strachan, Dover, *Human Genome Evolution*, Bios Scientific Publishers, 1996.
- 8. Y. Ji, E. E. Eichler, S. Schwartz, R. D. Nicholls, *Structure of Chromosomal Duplications and their Role in Mediating Human Genomic Disorders*, *Genome Research* 10, 2000.
- 9. E. Kushilevitz, R. Ostrovsky and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. *Proc. ACM Symposium on Theory of Computing*, 1998, 614–623.
- 10. D. Lopresti and A. Tomkins. Block edit models for approximate string matching. *Theoretical Computer Science*, 1996.
- 11. S. Muthukrishnan and S. C. Sahinalp, *Approximate nearest neighbors and sequence comparison with block operations*, *Proc. ACM Symposium on Theory of Computing*, 2000.
- 12. V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, *Cybernetics and Control Theory*, 10(8):707–710, 1966.
- 13. V. Bafna, P. A. Pevzner, Sorting by transpositions. *SIAM J. Discrete Math*, 11, 224–240, 1998.
- 14. D. Shapira and J. Storer, Edit distance with move operations, Proceedings of *CPM*, (2002).
- 15. S. C. Sahinalp and U. Vishkin, Approximate and Dynamic Matching of Patterns Using a Labeling Paradigm, Proceedings of *IEEE Symposium on Foundations of Computer Science*, (1996).
- 16. George P. Smith, *Evolution of Repeated DNA Sequences by Unequal Crossover*, *Science*, vol 191, pp 528–535.
- 17. J. D. Thompson, D. G. Higgins, T. J. Gibson, *Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice*, *Nucleic Acids Research* 1994, Vol. 22, No. 22.
- 18. L. Wang and T. Jiang, *On the complexity of multiple sequence alignment*, *Journal of Computational Biology*, 1:337–348, 1994.
- 19.