Abstract
Cardinality is an important structural aspect of data that has not received enough attention in the context of RDF knowledge bases (KBs). Information about cardinalities can be useful to data users and knowledge engineers when writing queries and when reusing or engineering KBs. Such cardinalities can be declared using OWL and RDF constraint languages as constraints on the usage of properties over instance data. However, their declaration is optional, and consistency with the instance data is not ensured. In this paper, we address the problem of mining cardinality bounds for properties to discover structural characteristics of KBs, and we use these bounds to assess completeness. Because KBs are incomplete and error-prone, we apply statistical methods for filtering property usage and for finding accurate and robust patterns. Accuracy of the cardinality patterns is ensured by properly handling equality axioms (owl:sameAs), and robustness by filtering outliers. We report an implementation of our algorithm in two variants, one using SPARQL 1.1 and one using Apache Spark, and their evaluation on real-world and synthetic data.
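To make the idea of mining cardinality bounds concrete, the following is a minimal sketch in plain Python, not the SPARQL 1.1 or Apache Spark implementation reported in the paper. It assumes triples are given as (subject, property, object) tuples, merges owl:sameAs cliques with a union-find structure before counting, and filters outliers with a median-absolute-deviation rule. The function names (canonicalizer, mine_bounds), the threshold value, the example data, and the choice of outlier test are all assumptions made for illustration; the abstract only states that statistical methods are used.

```python
from collections import defaultdict
from statistics import median


def canonicalizer(same_as_pairs):
    """Return a function mapping each term to one representative of its owl:sameAs clique."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # deterministic representative
    return find


def mine_bounds(triples, same_as_pairs, subjects, threshold=3.0):
    """Mine (min, max) cardinality bounds per property over the given subjects.

    The outlier rule (scaled median absolute deviation) is an assumed stand-in.
    """
    canon = canonicalizer(same_as_pairs)
    values = defaultdict(set)  # (subject, property) -> distinct canonical objects
    properties = set()
    for s, p, o in triples:
        values[(canon(s), p)].add(canon(o))
        properties.add(p)
    entities = {canon(s) for s in subjects}
    bounds = {}
    for p in sorted(properties):
        counts = [len(values.get((e, p), ())) for e in entities]
        med = median(counts)
        mad = median([abs(c - med) for c in counts])
        if mad > 0:
            kept = [c for c in counts if abs(c - med) / (1.4826 * mad) <= threshold]
        else:
            kept = counts  # no spread to judge outliers against
        bounds[p] = (min(kept), max(kept))
    return bounds


# Hypothetical data: seven books, one of which has an implausible 20 authors.
author_counts = {"b1": 1, "b2": 2, "b3": 2, "b4": 3, "b5": 2, "b6": 1, "b7": 20}
triples = [(b, "ex:author", f"{b}_a{i}") for b, n in author_counts.items() for i in range(n)]
# Duplicates that only collapse once owl:sameAs is taken into account.
triples += [("b1_alias", "ex:author", "b1_a0"), ("b2", "ex:author", "b2_a0_alias")]
same_as = [("b1", "b1_alias"), ("b2_a0", "b2_a0_alias")]
subjects = list(author_counts) + ["b1_alias"]

bounds = mine_bounds(triples, same_as, subjects)
print(bounds)  # prints {'ex:author': (1, 3)}
```

In this sketch the book with 20 authors is discarded as an outlier, so the mined bound for ex:author is between 1 and 3 values per book. Counting after canonicalization keeps the aliased subject and the aliased object from inflating the number of entities or values.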
Notes
- 1. Henceforth, we use prefixes for namespaces according to http://prefix.cc/.
- 2. https://www.w3.org/TR/shacl/ (accessed on February 13, 2017).
- 4. OWL allows the expression of cardinalities through the minCardinality, maxCardinality, and cardinality restrictions.
- 7. http://spark.apache.org/ (version 2.1.0).
Acknowledgements
This work has been supported by the TOMOE project, funded by Fujitsu Laboratories Ltd., Japan, and the Insight Centre for Data Analytics at the National University of Ireland Galway, Ireland.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Muñoz, E., Nickles, M. (2017). Mining Cardinalities from Knowledge Bases. In: Benslimane, D., Damiani, E., Grosky, W., Hameurlain, A., Sheth, A., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol 10438. Springer, Cham. https://doi.org/10.1007/978-3-319-64468-4_34
DOI: https://doi.org/10.1007/978-3-319-64468-4_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64467-7
Online ISBN: 978-3-319-64468-4
eBook Packages: Computer Science, Computer Science (R0)