Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution

Saeedi, Alieh; Peukert, Eric; Rahm, Erhard

doi:10.1007/978-3-319-66917-5_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10509)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Entity resolution identifies semantically equivalent entities, e.g., entities describing the same product or customer. It is especially challenging for big data applications where large volumes of data from many sources have to be matched and integrated. Entity resolution for multiple data sources is best addressed by clustering schemes that group all matching entities within clusters. While there are many possible clustering schemes for entity resolution, their relative suitability and scalability are still unclear. We therefore implemented and comparatively evaluated distributed versions of six clustering schemes based on Apache Flink within a new entity resolution framework called Famer. Our evaluation for different real-life and synthetically generated datasets considers both the match quality and the scalability for different numbers of machines and data sizes.

Notes

1. This is not a main restriction since we could first deduplicate the individual data sources before applying the workflow.
2. OAEI 2011 IM: http://oaei.ontologymatching.org/2011/instance/.

Acknowledgement

This work was partly funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). The evaluations were partly performed on the Galaxy infrastructure at Leipzig University.

Author information

Authors and Affiliations

Database Group, Department of Computer Science, University of Leipzig, Leipzig, Germany
Alieh Saeedi, Eric Peukert & Erhard Rahm

Corresponding author

Correspondence to Alieh Saeedi.

Editor information

Editors and Affiliations

Riga Technical University, Riga, Latvia
Mārīte Kirikova
Norwegian University of Science and Technology, Trondheim, Norway
Kjetil Nørvåg
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos

About this paper

Cite this paper

Saeedi, A., Peukert, E., Rahm, E. (2017). Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_19

DOI: https://doi.org/10.1007/978-3-319-66917-5_19
Published: 25 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5
eBook Packages: Computer Science, Computer Science (R0)
