Abstract
Entity resolution identifies semantically equivalent entities, e.g., records describing the same product or customer. It is especially challenging for big data applications, where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all matching entities within clusters. While there are many possible clustering schemes for entity resolution, their relative suitability and scalability are still unclear. We therefore implemented and comparatively evaluated distributed versions of six clustering schemes based on Apache Flink within a new entity resolution framework called Famer. Our evaluation on different real-life and synthetically generated datasets considers both match quality and scalability for different numbers of machines and data sizes.
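To make the idea of grouping all matching entities within clusters concrete, the following is a minimal single-machine sketch, not the Famer implementation. It assumes that pairwise match links with similarity scores have already been computed, and it uses connected components of the match graph as a simple baseline; whether this baseline is among the six evaluated schemes is not stated in the abstract. The function name `cluster_matches`, the `(source, id)` entity keys, and the `threshold` value are hypothetical choices for illustration.

```python
from collections import defaultdict


def cluster_matches(entities, links, threshold=0.8):
    """Group entities into clusters using connected components.

    entities:  iterable of hashable entity keys, here (source, id) tuples
    links:     iterable of (entity_a, entity_b, similarity) triples
    threshold: hypothetical cut-off; weaker links are ignored
    """
    # Union-find structure: every entity starts as its own cluster root.
    parent = {e: e for e in entities}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair linked strongly enough.
    for a, b, sim in links:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each root into one cluster.
    clusters = defaultdict(set)
    for e in entities:
        clusters[find(e)].add(e)
    return sorted(sorted(members) for members in clusters.values())


# Made-up entities from three sources A, B, and C.
entities = [("A", "1"), ("B", "7"), ("C", "3"), ("A", "2"), ("B", "9")]
links = [
    (("A", "1"), ("B", "7"), 0.93),
    (("B", "7"), ("C", "3"), 0.88),
    (("A", "2"), ("B", "9"), 0.55),  # below the threshold, so not merged
]
clusters = cluster_matches(entities, links)
print(clusters)
```

Connected components merge transitively, so a single wrong link can join two otherwise unrelated clusters. This is one reason why alternative clustering schemes are worth comparing on match quality.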
Notes
1. This is not a main restriction since we could first deduplicate the individual data sources before applying the workflow.
2. OAEI 2011 IM: http://oaei.ontologymatching.org/2011/instance/.
Acknowledgement
This work was partly funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). The evaluations were partly performed on the Galaxy infrastructure at Leipzig University.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Saeedi, A., Peukert, E., Rahm, E. (2017). Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_19
DOI: https://doi.org/10.1007/978-3-319-66917-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5
eBook Packages: Computer Science (R0)