Efficient Set Similarity Joins Using Min-prefixes

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5739)

Abstract

Identifying all objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. At a high level of abstraction, most set similarity join methods proposed so far consist of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, with most of the effort spent in candidate generation to obtain better pruning. Here, we propose the opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to previous algorithms.
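The two-phase structure described in the abstract can be illustrated with a minimal sketch. It is not the min-prefix algorithm of the paper; it is a generic self-join that assumes Jaccard similarity as the measure and uses standard prefix filtering for candidate generation. The names `jaccard` and `similarity_self_join`, the token ordering by frequency, and the small rounding guard are all choices made for this illustration.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_self_join(records, threshold):
    """Return all index pairs (i, j), i < j, with Jaccard >= threshold.

    Illustrative two-phase join: prefix-filter candidate generation
    followed by exact verification. Assumes 0 < threshold <= 1 and
    tokens that are mutually comparable (e.g. all strings).
    """
    sets = [set(r) for r in records]

    # Global token order: rare tokens first, ties broken by token value.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of records indexed so far
    results = []
    for i, toks in enumerate(ordered):
        if not toks:
            continue
        # Two sets with Jaccard >= threshold must share a token within
        # their prefixes of this length. The 1e-9 guards against float
        # round-up, which would make the prefix too short.
        n = len(toks)
        prefix = toks[: n - ceil(threshold * n - 1e-9) + 1]

        # Phase 1: candidate generation via the inverted index.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Phase 2: verification with the actual similarity measure.
        for j in candidates:
            if jaccard(sets[j], sets[i]) >= threshold:
                results.append((j, i))

        # Index the current record's prefix for later probes.
        for tok in prefix:
            index[tok].append(i)
    return sorted(results)
```

In this sketch, indexing fewer tokens per record makes phase 1 cheaper but lets more false candidates through to phase 2, which is the kind of trade-off the abstract refers to.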

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ribeiro, L.A., Härder, T. (2009). Efficient Set Similarity Joins Using Min-prefixes. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds) Advances in Databases and Information Systems. ADBIS 2009. Lecture Notes in Computer Science, vol 5739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03973-7_8

  • DOI: https://doi.org/10.1007/978-3-642-03973-7_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03972-0

  • Online ISBN: 978-3-642-03973-7

  • eBook Packages: Computer Science (R0)
