GFSF: A Novel Similarity Join Method Based on Frequency Vector

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9659)

Abstract

String similarity join is widely used in many fields, e.g. data cleaning, web search, pattern recognition, and DNA sequence matching. In recent years, many similarity join methods have been proposed, such as Pass-Join, Ed-Join, and Trie-Join. Among these, the Pass-Join algorithm, which is based on edit distance, achieves much better overall performance than the others. However, Pass-Join cannot effectively filter out candidate pairs that are only partially similar. This paper proposes a novel algorithm called GFSF, which introduces two additional filtering steps based on character frequency vectors. These steps greatly reduce the number of candidate pairs that are only partially similar, and thereby the total time of the string similarity join. The experimental results show that the overall performance of the proposed method is better than that of Pass-Join.
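
The abstract does not spell out GFSF's two filtering steps, so the following is only a minimal sketch of the general idea of a character-frequency filter, not the authors' algorithm. It relies on a standard property: a single insertion, deletion, or substitution changes the larger of the surplus and deficit character counts between two strings by at most one, so that value is a lower bound on the edit distance. The names `frequency_distance`, `edit_distance`, and `similar`, and the threshold parameter `tau`, are hypothetical and chosen for illustration.

```python
from collections import Counter


def frequency_distance(s: str, t: str) -> int:
    """Lower bound on the edit distance, computed from character counts only."""
    cs, ct = Counter(s), Counter(t)
    surplus = sum((cs - ct).values())  # occurrences s has in excess of t
    deficit = sum((ct - cs).values())  # occurrences t has in excess of s
    return max(surplus, deficit)


def edit_distance(s: str, t: str) -> int:
    """Plain dynamic-programming Levenshtein distance (the expensive check)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def similar(s: str, t: str, tau: int) -> bool:
    """Return True if the edit distance of s and t is at most tau.

    Cheap filters run first; a pair is verified only if it survives them.
    """
    if abs(len(s) - len(t)) > tau:       # length filter
        return False
    if frequency_distance(s, t) > tau:   # frequency-vector filter (assumed form)
        return False
    return edit_distance(s, t) <= tau    # full verification
```

A filter of this kind can only prune pairs, never accept them: two strings that are permutations of each other have a frequency distance of zero but may still be far apart in edit distance, which is why the final verification step remains necessary.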

Supported by the Natural Science Foundation of China (61303004), the National Key Technology Support Program (2015BAH16F00/F01) and the Key Technology Program of Xiamen City (3502Z20151016).

References

  1. Metwally, A., Agrawal, D., Abbadi, A.E.: Detectives: Detecting coalition hit inflation attacks in advertising networks streams. In: Proceedings of 16th International Conference on World Wide Web, pp. 241–250. ACM Press, New York (2007)

  2. Ji, S., Li, G., Li, C., et al.: Efficient interactive fuzzy keyword search. In: Proceedings of the 18th International Conference on World Wide Web, pp. 371–380. ACM Press, New York (2009)

  3. Dong, X., Halevy, A., Yu, C.: Data integration with uncertainty. Int. J. Very Large Data Bases 18(2), 469–500 (2009)

  4. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering, p. 5. IEEE Press (2006)

  5. Xiao, C., Wang, W., Lin, X.: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)

  6. Wang, J., Li, G., Feng, J.: Trie-Join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)

  7. Li, G., Deng, D., Wang, J., et al.: Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow. 5(3), 253–264 (2011)

  8. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 743–754. ACM Press, New York (2004)

  9. Xiao, C., Wang, W., Lin, X., et al.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15 (2011)

  10. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM Press, New York (2010)

  11. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International WWW Conference, pp. 131–140 (2007)

  12. Wang, J., Li, G., Feng, J.: Fast-join: an efficient method for fuzzy token matching based string similarity join. In: Proceedings of the 27th IEEE International Conference on Data Engineering, pp. 458–469. IEEE Press (2011)

  13. Chaudhuri, S., Ganjam, K., Ganti, V., et al.: Robust, efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 313–324. ACM Press, New York (2003)

  14. Gravano, L., Ipeirotis, P., Jagadish, H., et al.: Approximate string joins in a database (almost) for free. In: Proceedings of the International Conference on Very Large Databases, pp. 491–500 (2001)

  15. Metwally, A., Faloutsos, C.: V-SMART-Join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors. PVLDB 5(8), 704–715 (2012)

  16. Deng, D., Li, G., Hao, S., Wang, J., Feng, J.: MassJoin: A MapReduce-based method for scalable string similarity joins. In: ICDE 2014, pp. 340–351 (2014)

  17. Huang, J., Zhang, R., Buyya, R., Chen, J.: MELODY-JOIN: Efficient Earth Mover’s Distance similarity joins using MapReduce. In: ICDE 2014, pp. 808–819 (2014)

  18. Chen, L., Gao, Y., Li, X., Jensen, C.S., Chen, G.: Efficient metric indexing for similarity search. In: Proceedings of IEEE 31st International Conference on Data Engineering, pp. 591–602, April 2015

  19. Maehara, T., Kusumoto, M., Kawarabayashi, K.: Scalable SimRank join algorithm. In: 2015 IEEE 31st International Conference on Data Engineering (ICDE), pp. 603–614 (2015)

Author information

Corresponding author

Correspondence to Ziyu Lin.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Lin, Z., Luo, D., Lai, Y. (2016). GFSF: A Novel Similarity Join Method Based on Frequency Vector. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9659. Springer, Cham. https://doi.org/10.1007/978-3-319-39958-4_40

  • DOI: https://doi.org/10.1007/978-3-319-39958-4_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-39957-7

  • Online ISBN: 978-3-319-39958-4

  • eBook Packages: Computer Science, Computer Science (R0)