Overlapped Hashing: A Novel Scalable Blocking Technique for Entity Resolution in Big-Data Era

  • Rana Khalil
  • Ahmed Shawish
  • Doaa Elzanfaly
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 858)

Abstract

Entity resolution is a critical process for enabling big data integration. It aims to identify records that refer to the same real-world entity across one or several data sources. Over time, entity resolution has become increasingly challenging because of the continuous growth in data volume and variety. Blocking techniques were therefore developed to address these limitations by partitioning datasets into “blocks” of records. This partitioning allows the blocks to be processed in parallel, with entity resolution methods applied within each block individually. Current blocking techniques fall into two main categories: efficient or effective. The effective category comprises techniques that target the accuracy and quality of results. The efficient category, on the other hand, comprises techniques that are fast yet report low accuracy. No technique has so far succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique that fills this gap, achieving high efficiency at no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. It is worth mentioning that canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. The proposed technique is named overlapped hashing. Extensive simulation studies conducted on a benchmark dataset demonstrated that both concepts can be combined in one technique while avoiding their drawbacks. The results show outstanding performance in terms of scalability, efficiency, and effectiveness, and promise a step forward in the entity resolution field.
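
The abstract does not spell out the overlapped hashing algorithm, so the following is only a minimal illustrative sketch of the general idea of blocking, not the authors' method. It contrasts disjoint key-based (hash) blocking, where each record falls into exactly one block, with an overlapping scheme in which a record may fall into several blocks. The function names, the toy records, and the two blocking keys (first letter, whitespace tokens) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def hash_blocks(records, key):
    """Disjoint blocking: each record lands in exactly one block."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        blocks[key(rec)].add(rid)
    return blocks


def overlapping_blocks(records, keys):
    """Overlapping blocking: a record joins one block per key it yields."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for k in keys(rec):
            blocks[k].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Collect the record pairs that share at least one block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical toy data: records 1-3 plausibly describe the same person.
records = {
    1: "john smith",
    2: "jon smith",
    3: "smith john",
    4: "mary jones",
}

# Assumed blocking keys, chosen only for this example.
disjoint = candidate_pairs(hash_blocks(records, lambda name: name[0]))
overlapped = candidate_pairs(overlapping_blocks(records, lambda name: name.split()))

print(sorted(disjoint))    # pairs compared under disjoint blocking
print(sorted(overlapped))  # pairs compared under overlapping blocking
```

In this toy run the disjoint scheme compares only records 1 and 2 and never compares record 3 with either of them, whereas the overlapping scheme compares all three while still skipping every pair that involves record 4. Comparisons are confined to each block, so blocks could in principle be processed independently.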

Keywords

Entity resolution, Blocking techniques, Hashing, Canopy clustering, Scalability, Efficiency, Effectiveness, Big-data

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Informatics, The British University in Egypt, Cairo, Egypt
  2. Faculty of Computer Studies, The Arab Open University, Kuwait City, Kuwait
  3. Ain Shams University, Cairo, Egypt
  4. Helwan University, Helwan, Egypt
