Skip to main content

Efficient String Similarity Search: A Cross Pivotal Based Approach

  • Conference paper
  • First Online:
Database Systems for Advanced Applications (DASFAA 2015)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9049))

Included in the following conference series:

  • 1943 Accesses

Abstract

In this paper, we study the problem of string similarity search with edit distance constraint; it retrieves all strings in a string database that are similar to a query string. The state-of-the-art approaches employ the concept of pivotal set, which is a set of non-overlapping signatures, for indexing and query processing. However, they do not fully exploit the pruning power potential of the pivotal sets by using only the pivotal set of the query string or the data strings. To remedy this issue, in this paper we propose a cross pivotal based approach to fully exploiting the pruning power of multiple pivotal sets. We prove theoretically that our cross pivotal filter has stronger pruning power than state-of-the-art filters. We also propose a more efficient algorithm with better time complexity for pivotal selection. Moreover, we further develop two advanced filters to prune unpromising single-match candidates which are the set of candidates introduced by one and only one of the probing signatures. Our experimental results on real datasets demonstrate that our cross pivotal based approach significantly outperforms the state-of-the-art approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB (2006)

    Google Scholar 

  2. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)

    Google Scholar 

  3. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)

    Google Scholar 

  4. Deng, D., Li, G., Feng, J.: A pivotal prefix based filtering algorithm for string similarity search. In: SIGMOD (2014)

    Google Scholar 

  5. Deng, D., Li, G., Feng, J., Li, W.: Top-k string similarity search with edit-distance constraints. In: ICDE (2013)

    Google Scholar 

  6. Forman, G., Eshghi, K., Chiocchetti, S.: Finding similar files in large document repositories. In: SIGKDD (2005)

    Google Scholar 

  7. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB (2001)

    Google Scholar 

  8. Hadjieleftheriou, M., Koudas, N., Srivastava, D.: Incremental maintenance of length normalized indexes for approximate string matching. In: SIGMOD (2009)

    Google Scholar 

  9. Kahveci, T., Singh, A.K.: Efficient index structures for string databases. In: VLDB (2001)

    Google Scholar 

  10. Kim, Y., Shim, K.: Efficient top-k algorithms for approximate substring matching. In: SIGMOD (2013)

    Google Scholar 

  11. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE (2008)

    Google Scholar 

  12. Li, C., Wang, B., Yang, X.: VGRAM: improving performance of approximate queries on string collections using variable-length grams. In: VLDB (2007)

    Google Scholar 

  13. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1) (2001)

    Google Scholar 

  14. Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: SIGMOD (2011)

    Google Scholar 

  15. Qin, J., Wang, W., Xiao, C., Lu, Y., Lin, X., Wang, H.: Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Trans. Database Syst. 38(3) (2013)

    Google Scholar 

  16. Sokol, D., Benson, G., Tojeira, J.: Tandem repeats over the edit distance. Bioinformatics 23(2) (2007)

    Google Scholar 

  17. Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: SIGMOD (2012)

    Google Scholar 

  18. Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1) (2008)

    Google Scholar 

  19. Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW (2008)

    Google Scholar 

  20. Yang, X., Wang, B., Li, C.: Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD (2008)

    Google Scholar 

  21. Zhang, Z., Hadjieleftheriou, M., Ooi, B.C., Srivastava, D.: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: SIGMOD (2010)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Fei Bi .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Bi, F., Chang, L., Zhang, W., Lin, X. (2015). Efficient String Similarity Search: A Cross Pivotal Based Approach. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science(), vol 9049. Springer, Cham. https://doi.org/10.1007/978-3-319-18120-2_32

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-18120-2_32

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18119-6

  • Online ISBN: 978-3-319-18120-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics