A Scalable Framework for Quality Assessment of RDF Datasets

  • Conference paper
The Semantic Web – ISWC 2019 (ISWC 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11779)

Abstract

Over recent years, Linked Data has grown continuously. Today, more than 10,000 datasets are available online that follow Linked Data standards. These standards make data machine-readable and interoperable. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of Linked Data if it is of low quality. A few approaches exist for the quality assessment of Linked Data, but their performance degrades as data size increases, and the data quickly grows beyond the capabilities of a single machine. In this paper, we present DistQualityAssessment, an open source implementation of quality assessment of large RDF datasets that can scale out to a cluster of machines. This is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data. The work presented here is integrated with the SANSA framework and has been applied to at least three use cases beyond the SANSA community. The results show that our approach is more generic, efficient, and scalable than previously proposed approaches.
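
As a rough illustration of why a quality metric of this kind can scale out, the sketch below computes a simple ratio metric by counting matching triples per partition and then merging the partial counts. It is a single-machine sketch in plain Python, not Spark code and not the DistQualityAssessment or SANSA API. The names `metric_ratio`, `partition`, and `is_literal`, the tuple representation of triples, and the example rule (share of triples whose object is a literal) are all assumptions made for illustration.

```python
from functools import reduce
from typing import Callable, List, Sequence, Tuple

# Assumed representation: a triple is a (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


def is_literal(term: str) -> bool:
    """Hypothetical convention: a literal is written with a leading double quote."""
    return term.startswith('"')


def partition(items: Sequence[Triple], n: int) -> List[List[Triple]]:
    """Split the triples round-robin into n chunks, standing in for cluster partitions."""
    n = max(1, n)
    return [list(items[i::n]) for i in range(n)]


def metric_ratio(
    triples: Sequence[Triple],
    rule: Callable[[Triple], bool],
    num_partitions: int = 4,
) -> float:
    """Return the fraction of triples that satisfy `rule`.

    Each chunk is counted independently (the per-partition step), and the
    partial (matched, total) pairs are then merged (the aggregation step).
    """
    chunks = partition(list(triples), num_partitions)
    partial = [(sum(1 for t in chunk if rule(t)), len(chunk)) for chunk in chunks]
    matched, total = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]), partial, (0, 0)
    )
    return matched / total if total else 0.0


if __name__ == "__main__":
    sample = [
        ("http://example.org/s1", "http://example.org/p", '"a label"'),
        ("http://example.org/s2", "http://example.org/p", "http://example.org/o"),
    ]
    print(metric_ratio(sample, lambda t: is_literal(t[2])))
```

Because each chunk is counted without reference to the others, the per-chunk step could in principle run on separate machines, with only the small partial counts being combined at the end.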

Resource type Software Framework

Website http://sansa-stack.net/distqualityassessment/

Permanent URL https://doi.org/10.6084/m9.figshare.7930139

Notes

  1. http://lodstats.aksw.org/
  2. https://spark.apache.org/
  3. https://github.com/SANSA-Stack/SANSA-RDF/tree/develop/sansa-rdf-spark/src/main/scala/net/sansa_stack/rdf/spark/qualityassessment
  4. http://sansa-stack.net/
  5. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
  6. https://www.scala-lang.org/
  7. https://www.w3.org/TR/vocab-dqv/
  8. https://github.com/big-data-europe
  9. https://github.com/SANSA-Stack
  10. https://github.com/Luzzu/Framework
  11. We set the timeout for the quality assessment evaluation stage to 24 hours.
  12. https://goo.gl/mJTkPp
  13. http://qrowd-project.eu/
  14. https://aleth.io/
  15. https://medium.com/alethio/ethereum-linked-data-b72e6283812f
  16. https://github.com/ConsenSys/EthOn
  17. http://slipo.eu/

Acknowledgment

This work was partly supported by the EU Horizon2020 projects BigDataOcean (GA no. 732310), Boost4.0 (GA no. 780732), QROWD (GA no. 723088) and CLEOPATRA (GA no. 812997).

Author information

Corresponding author

Correspondence to Gezim Sejdiu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sejdiu, G., Rula, A., Lehmann, J., Jabeen, H. (2019). A Scalable Framework for Quality Assessment of RDF Datasets. In: Ghidini, C., et al. (eds.) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11779. Springer, Cham. https://doi.org/10.1007/978-3-030-30796-7_17

  • DOI: https://doi.org/10.1007/978-3-030-30796-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30795-0

  • Online ISBN: 978-3-030-30796-7

  • eBook Packages: Computer Science, Computer Science (R0)
