Skip to main content

Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution

  • Conference paper
  • First Online:
Book cover Advances in Databases and Information Systems (ADBIS 2017)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10509))

Included in the following conference series:

Abstract

Entity resolution identifies semantically equivalent entities, e.g., describing the same product or customer. It is especially challenging for big data applications where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all matching entities within clusters. While there are many possible clustering schemes for entity resolution, their relative suitability and scalability is still unclear. We therefore implemented and comparatively evaluate distributed versions of six clustering schemes based on Apache Flink within a new entity resolution framework called Famer. Our evaluation for different real-life and synthetically generated datasets considers both the match quality as well as the scalability for different number of machines and data sizes.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    This is not a main restriction since we could first deduplicate the individual data sources before applying the workflow.

  2. 2.

    OAEI 2011 IM: http://oaei.ontologymatching.org/2011/instance/.

References

  1. Aslam, J., Pelekhov, E., Rus, D.: The star clustering algorithm for static and dynamic information organization. J. Graph Algorithms Appl. 8, 95–129 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  2. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: Proceedings of the Foundations of Computer Science, pp. 238–247. IEEE (2002)

    Google Scholar 

  3. Chierichetti, F., Dalvi, N., Kumar, R.: Correlation clustering in MapReduce. In: Proceedings of the ACM SIGKDD Conference, pp. 641–650 (2014)

    Google Scholar 

  4. Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)

    Book  Google Scholar 

  5. Christen, P., Vatsalan, D.: Flexible and extensible generation and corruption of personal data. In: Proceedings of CIKM, pp. 1165–1168 (2013)

    Google Scholar 

  6. Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Discov. Data (TKDD) 1(1), 4 (2007)

    Article  Google Scholar 

  7. Hassanzadeh, O., Chiang, F., Lee, H., Miller, R.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1), 1282–1293 (2009)

    Google Scholar 

  8. Hassanzadeh, O., Miller, R.: Creating probabilistic databases from duplicated data. VLDB J. 18(5), 1141–1166 (2009)

    Article  Google Scholar 

  9. Hildebrandt, K., Panse, F., Wilcke, N., Ritter, N.: Large-scale data pollution with Apache Spark. IEEE Trans. Big Data (2017)

    Google Scholar 

  10. Junghanns, M., Petermann, A., Neumann, M., Rahm, E.: Management and analysis of big graph data: current systems and open challenges. In: Zomaya, A.Y., Sakr, S. (eds.) Handbook of Big Data Technologies, pp. 457–505. Springer, Cham (2017). doi:10.1007/978-3-319-49340-4_14

    Chapter  Google Scholar 

  11. Junghanns, M., Petermann, A., Teichmann, N., Gómez, K., Rahm, E.: Analyzing extended property graphs with Apache Flink. In: Proceedings of the ACM SIGMOD Workshop on Network Data Analytics (2016)

    Google Scholar 

  12. Kolb, L., Thor, A., Rahm, E.: Dedoop: efficient deduplication with Hadoop. PVLDB 5(12), 1878–1881 (2012)

    Google Scholar 

  13. Köpcke, H., Rahm, E.: Frameworks for entity matching: a comparison. Data Knowl. Eng. 69(2), 197–210 (2010)

    Article  Google Scholar 

  14. Mestre, D., Pires, C., Nascimento, D., de Queriroz, A., Santos, V., Araujo, T.: An efficient Spark-based adaptive windowing for entity matching. J. Syst. Softw. 128, 1–10 (2017)

    Article  Google Scholar 

  15. Nentwig, M., Groß, A., Rahm, E.: Holistic entity clustering for linked data. In: IEEE ICDMW (2016)

    Google Scholar 

  16. Pan, X., Papailiopoulos, D., Oymak, S., Recht, B., Ramchandran, K., Jordan, M.: Parallel correlation clustering on big graphs. In: Advances in Neural Information Processing Systems, pp. 82–90 (2015)

    Google Scholar 

  17. Rahm, E.: The case for holistic data integration. In: Pokorný, J., Ivanović, M., Thalheim, B., Šaloun, P. (eds.) ADBIS 2016. LNCS, vol. 9809, pp. 11–27. Springer, Cham (2016). doi:10.1007/978-3-319-44039-2_2

    Chapter  Google Scholar 

Download references

Acknowledgement

This work was partly funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). Also, evaluations partly performed on the Galaxy-Infrastructure at Leipzig University.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Alieh Saeedi .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Saeedi, A., Peukert, E., Rahm, E. (2017). Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science(), vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-66917-5_19

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66916-8

  • Online ISBN: 978-3-319-66917-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics