BFT: Bit Filtration Technique for Approximate String Join in Biological Databases

  • S. Alireza Aghili
  • Divyakant Agrawal
  • Amr El Abbadi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2857)

Abstract

Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to mitigate the curse of dimensionality. This paper maps the problem of pairwise whole-genome comparison onto an approximate join operation in the well-established relational database context. We propose a novel Bit Filtration Technique (BFT) based on vector transformation, and we further apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques as a pre-processing filtration step that effectively reduces the search space and running time of the join operation. Our empirical results on a number of Prokaryote and Eukaryote DNA contig datasets show that the filtration efficiently prunes non-relevant portions of the database while incurring no false negatives, running up to 50 times faster than traditional dynamic programming and q-gram approaches. BFT may easily be incorporated as a pre-processing step for any of the well-known sequence search heuristics, such as BLAST, QUASAR, and FastA, for the purpose of pairwise whole-genome comparison. We analyze the precision of applying BFT and other transformation-based dimensionality reduction techniques, and finally discuss the imposed trade-offs.
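
The abstract does not spell out how BFT itself works, so the following is not the authors' method. It is only a minimal sketch of the general idea of transformation-based filtration with no false negatives, assuming a DNA alphabet of A, C, G, T and using the well-known frequency-vector lower bound on edit distance. The names `frequency_vector`, `frequency_distance`, `edit_distance`, and `filtered_join` are hypothetical and chosen for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols are ignored by the filter


def frequency_vector(s):
    """Map a string to its vector of per-letter counts over ALPHABET."""
    counts = Counter(s)
    return tuple(counts.get(c, 0) for c in ALPHABET)


def frequency_distance(u, v):
    """Lower bound on edit distance computed from two frequency vectors.

    One edit operation changes the surplus and the deficit by at most 1 each,
    so max(surplus, deficit) can never exceed the true edit distance.
    """
    surplus = sum(a - b for a, b in zip(u, v) if a > b)
    deficit = sum(b - a for a, b in zip(u, v) if b > a)
    return max(surplus, deficit)


def edit_distance(s, t):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def filtered_join(left, right, k):
    """Return all pairs (s, t) with edit distance <= k.

    The cheap frequency filter discards a pair only when the lower bound
    already exceeds k, so no true match is lost (no false negatives).
    """
    right_vecs = [(t, frequency_vector(t)) for t in right]
    result = []
    for s in left:
        fs = frequency_vector(s)
        for t, ft in right_vecs:
            if frequency_distance(fs, ft) > k:
                continue  # pruned without running dynamic programming
            if edit_distance(s, t) <= k:
                result.append((s, t))
    return result
```

For example, `filtered_join(["ACGT", "AAAA"], ["ACGA", "TTTT", "AAAT"], 1)` runs the dynamic-programming step on only two of the six candidate pairs. The filter may still admit false positives, which is why the exact check remains as a second stage.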

Keywords

  • Discrete Wavelet Transformation
  • Discrete Fourier Transformation
  • Range Query
  • Edit Distance
  • Frequency Vector
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • S. Alireza Aghili (1)
  • Divyakant Agrawal (1)
  • Amr El Abbadi (1)
  1. Department of Computer Science, University of California-Santa Barbara, Santa Barbara, USA
