Abstract
Over the last few years, Linked Data has grown continuously. Today, more than 10,000 datasets are available online following Linked Data standards. These standards make data machine-readable and interoperable. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of Linked Data if it is of low quality. A few approaches exist for the quality assessment of Linked Data, but their performance degrades as data size increases, and the workload quickly grows beyond the capabilities of a single machine. In this paper, we present DistQualityAssessment, an open source implementation of quality assessment for large RDF datasets that can scale out to a cluster of machines. This is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics applicable to big data. The work presented here is integrated with the SANSA framework and has been applied to at least three use cases beyond the SANSA community. The results show that our approach is more generic, efficient, and scalable than previously proposed approaches.
Resource type: Software Framework
Website: http://sansa-stack.net/distqualityassessment/
Permanent URL: https://doi.org/10.6084/m9.figshare.7930139
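The idea of a quality assessment pattern mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a metric can be expressed as a per-partition filter and count (a transformation) followed by a combination of the partial results (an action). It is plain Python and uses neither Apache Spark nor the DistQualityAssessment API; the function names, the round-robin partitioning, and the example "external links" rule are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A triple is modelled as (subject, predicate, object) with plain strings.
Triple = Tuple[str, str, str]


def split_into_partitions(triples: Sequence[Triple], num_partitions: int) -> List[List[Triple]]:
    """Round-robin split; a stand-in for how a cluster would shard the data."""
    return [list(triples[i::num_partitions]) for i in range(num_partitions)]


def ratio_metric(partitions: List[List[Triple]], rule: Callable[[Triple], bool]) -> float:
    """Share of triples that satisfy `rule`, computed partition by partition."""
    # Transformation step: each partition counts its matching triples independently.
    partial = [(sum(1 for t in part if rule(t)), len(part)) for part in partitions]
    # Action step: combine the small per-partition results into a single value.
    matched = sum(m for m, _ in partial)
    total = sum(n for _, n in partial)
    return matched / total if total else 0.0


def links_outside(namespace: str) -> Callable[[Triple], bool]:
    """Hypothetical rule: the object is an HTTP IRI outside the dataset's namespace."""
    def rule(triple: Triple) -> bool:
        _, _, obj = triple
        return obj.startswith("http") and not obj.startswith(namespace)
    return rule


triples = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs", "http://other.example/a"),
    ("http://example.org/b", "http://www.w3.org/2000/01/rdf-schema#label", "B"),
    ("http://example.org/b", "http://example.org/seeAlso", "http://other.example/b"),
]

score = ratio_metric(split_into_partitions(triples, 2), links_outside("http://example.org/"))
print(score)
```

Because each partition only produces a pair of counts, the combination step stays cheap regardless of dataset size. A new metric in this sketch is just a new rule function passed to `ratio_metric`.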
Notes
- 11. We set the timeout delay of the quality assessment evaluation stage to 24 hours.
Acknowledgment
This work was partly supported by the EU Horizon 2020 projects BigDataOcean (GA no. 732310), Boost4.0 (GA no. 780732), QROWD (GA no. 723088) and CLEOPATRA (GA no. 812997).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sejdiu, G., Rula, A., Lehmann, J., Jabeen, H. (2019). A Scalable Framework for Quality Assessment of RDF Datasets. In: Ghidini, C., et al. The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11779. Springer, Cham. https://doi.org/10.1007/978-3-030-30796-7_17
DOI: https://doi.org/10.1007/978-3-030-30796-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30795-0
Online ISBN: 978-3-030-30796-7
eBook Packages: Computer Science, Computer Science (R0)