Hash$^{ed}$-Join: Approximate String Similarity Join with Hashing

Yuan, Peisen; Sha, Chaofeng; Sun, Yi

doi:10.1007/978-3-662-43984-5_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8505)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

The string similarity join, which finds similar string pairs in string sets, has received extensive attention in the database and information retrieval fields. Existing work usually adopts the filter-and-refine framework for this problem, and various filtering methods have been proposed. Recently, tree-based index techniques with edit distance constraints have been employed effectively to evaluate the string similarity join. However, they do not scale well to large distance thresholds. In this paper, we propose an approach to approximate string similarity join based on Min-Hashing locality-sensitive hashing and trie-based index techniques. Our approach offers a flexible trade-off between efficiency and performance. An empirical study on real datasets demonstrates that our framework is more efficient and scales better.
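
The abstract names the ingredients (a filter-and-refine framework, Min-Hashing locality-sensitive hashing, an edit distance constraint) but this preview does not include the algorithm itself. The following is only a minimal sketch of how those ingredients can fit together, not the authors' method: it Min-Hashes each string's q-gram set, groups strings whose signatures agree on at least one band, and verifies candidates with the exact edit distance. The trie-based index is omitted, and every name and parameter here (`qgrams`, `approx_similarity_join`, `q`, `bands`, `rows`, `tau`) is an assumption made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def qgrams(s, q=2):
    """Set of overlapping q-grams; a string shorter than q maps to itself."""
    return {s[i:i + q] for i in range(len(s) - q + 1)} or {s}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed (illustrative choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """Min-Hash signature: the minimum hash value under each seeded hash function."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(num_hashes))


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def approx_similarity_join(strings, tau, q=2, bands=16, rows=2):
    """Return index pairs (i, j), i < j, that share an LSH band and have
    edit distance <= tau. Pairs that never collide are missed (approximate)."""
    buckets = defaultdict(list)
    for idx, s in enumerate(strings):
        sig = minhash_signature(qgrams(s, q), bands * rows)
        for b in range(bands):
            # Filter step: strings agreeing on all rows of band b share a bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))
    # Refine step: keep only candidates that pass exact verification.
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= tau)


if __name__ == "__main__":
    data = ["similarity", "similarly", "hashing", "hashes", "trie"]
    for i, j in approx_similarity_join(data, tau=2):
        print(data[i], data[j])
```

Because every candidate is verified with the exact edit distance, the sketch returns no false positives; the approximation lies in the pairs that never share a band. In this sketch, raising `bands` or lowering `rows` admits more candidates and so trades running time for recall.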

Acknowledgments

This work was supported by the 973 project (No. 2010CB328106) and NSFC grants (Nos. 61033007 and 61170085).

Author information

Authors and Affiliations

College of Information Science and Technology, Nanjing Agricultural University, Nanjing, 210095, China
Peisen Yuan
School of Computer Science, Fudan University, Shanghai, 200433, China
Chaofeng Sha & Yi Sun

Corresponding author

Correspondence to Peisen Yuan.

Editor information

Editors and Affiliations

Pohang University of Science and Technology (POSTECH), Pohang, Korea, Republic of (South Korea)
Wook-Shin Han
National University of Singapore, Singapore, Singapore
Mong Li Lee
Udayana University, Badung, Indonesia
Agus Muliantara
Udayana University, Badung, Indonesia
Ngurah Agus Sanjaya
Christian-Albrechts-Universität zu Kiel Institut für Informatik, Kiel, Germany
Bernhard Thalheim
Fudan University, Shanghai, China
Shuigeng Zhou

About this paper

Cite this paper

Yuan, P., Sha, C., Sun, Y. (2014). Hash$^{ed}$-Join: Approximate String Similarity Join with Hashing. In: Han, WS., Lee, M., Muliantara, A., Sanjaya, N., Thalheim, B., Zhou, S. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43984-5_16

DOI: https://doi.org/10.1007/978-3-662-43984-5_16
Published: 11 July 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43983-8
Online ISBN: 978-3-662-43984-5
eBook Packages: Computer Science, Computer Science (R0)
