Abstract
There is an increasing number of Semantic Web knowledge bases (KBs) available on the Web, created in academia and industry alike. In this paper, we address the problem of lack of structure in these KBs due to their schema-free nature required for open environments such as the Web. Relation cardinality is an important structural aspect of data that has not received enough attention in the context of KBs. We propose a definition for relation cardinality bounds that can be used to unveil the structure that KBs data naturally exhibit. Information about relation cardinalities such as a person can have two parents and zero or more children, or a book should have one author at least, or a country should have more than two cities can be useful for data users and knowledge engineers when writing queries and reusing or engineering KB systems. Such cardinalities can be declared using OWL and RDF constraint languages as constraints on the usage of properties in the domain of knowledge; however, their declaration is optional and consistency with the instance data is not ensured. We first address the problem of mining relation cardinality bounds by proposing an algorithm that normalises and filters the data to ensure the accuracy and robustness of the mined cardinality bounds. Then we show how these bounds can be used to assess two relevant data quality dimensions: consistency and completeness. Finally, we report that relation cardinality bounds can also be used to expose structural characteristics of a KB by mapping the bounds into a constraint language to declare the actual shape of data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Henceforth, we use prefixes to replace namespaces according to http://prefix.cc/ to shorten the length of URLs.
- 2.
https://www.w3.org/TR/shacl/ (accessed on February 13, 2017).
- 3.
- 4.
OWL allows the expression of cardinalities through the minCardinality, maxCardinality, and cardinality restrictions.
- 5.
- 6.
- 7.
This work extends our previous work in [22].
- 8.
Shape Expressions (ShEx) Primer: http://shex.io/shex-primer/.
- 9.
http://spark.apache.org/ (version 2.1.0).
- 10.
Any complete graph is its own maximal clique.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
References
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
Arenas, M., Conca, S., Pérez, J.: Counting beyond a yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. In: WWW, pp. 629–638. ACM (2012)
Arenas, M., Gutierrez, C., Pérez, J.: Foundations of RDF databases. In: Tessaris, S., et al. (eds.) Reasoning Web 2009. LNCS, vol. 5689, pp. 158–204. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03754-2_4
Boneva, I., Labra Gayo, J.E., Prud’hommeaux, E.G.: Semantics and validation of shapes schemas for RDF. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 104–120. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_7
Bosch, T., Eckert, K.: Guidance, please! towards a framework for RDF-based constraint languages. In: Proceedings of the International Conference on Dublin Core and Metadata Applications (2015)
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)
Christodoulou, K., Paton, N.W., Fernandes, A.A.A.: Structure inference for linked data sources using clustering. Trans. Large-Scale Data Knowl.-Centered Syst. 19, 1–25 (2015)
Ferrarotti, F., Hartmann, S., Link, S.: Efficiency frontiers of XML cardinality constraints. Data Knowl. Eng. 87, 297–319 (2013)
Fleischhacker, D., Paulheim, H., Bryl, V., Völker, J., Bizer, C.: Detecting errors in numerical linked data using cross-checked outlier detection. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 357–372. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_23
Galárraga, L., Razniewski, S., Amarilli, A., Suchanek, F.M.: Predicting completeness in knowledge bases. In: WSDM, pp. 375–383. ACM (2017)
Glimm, B., Hogan, A., Krötzsch, M., Polleres, A.: OWL: yet to arrive on the web of data? In: LDOW. CEUR Workshop Proceedings, vol. 937. CEUR-WS.org (2012)
Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic web. In: LDOW. CEUR Workshop Proceedings, vol. 628. CEUR-WS.org (2010)
Horrocks, I., Tessaris, S.: Querying the semantic web: a formal approach. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 177–191. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-48005-6_15
Kellou-Menouer, K., Kedad, Z.: Evaluating the gap between an RDF dataset and its schema. In: Jeusfeld, M.A., Karlapalem, K. (eds.) ER 2015. LNCS, vol. 9382, pp. 283–292. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25747-1_28
Kostylev, E.V., Reutter, J.L., Romero, M., Vrgoč, D.: SPARQL with property paths. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 3–18. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25007-6_1
Lausen, G., Meier, M., Schmidt, M.: SPARQLing constraints for RDF. In: EDBT, ACM International Conference Proceeding Series, vol. 261, pp. 499–509. ACM (2008)
Liddle, S.W., Embley, D.W., Woodfield, S.N.: Cardinality constraints in semantic data models. Data Knowl. Eng. 11(3), 235–270 (1993)
Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases. J. Web Sem. 7(2), 74–89 (2009)
Motik, B., Nenov, Y., Piro, R.E.F., Horrocks, I.: Handling Owl:sameAs via rewriting. In: AAAI, pp. 231–237. AAAI Press (2015)
Motik, B., Patel-Schneider, P.F., Parsia, B.: OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax, 2nd edn (2012). http://www.w3.org/TR/2012/REC-owl2-syntax-20121211/
Muñoz, E.: On learnability of constraints from RDF data. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 834–844. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-34129-3_52
Muñoz, E., Nickles, M.: Mining cardinalities from knowledge bases. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10438, pp. 447–462. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64468-4_34
Neumann, T., Moerkotte, G.: Characteristic sets: accurate cardinality estimation for RDF queries with multiple joins. In: ICDE, pp. 984–994. IEEE Computer Society (2011)
Olivé, A.: Conceptual Modeling of Information Systems. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-39390-0
Papakonstantinou, V., Flouris, G., Fundulaki, I., Gubichev, A.: Some thoughts on OWL-empowered SPARQL query optimization. In: Sack, H., Rizzo, G., Steinmetz, N., Mladenić, D., Auer, S., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9989, pp. 12–16. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47602-5_3
Paulheim, H.: Knowledge graph refinement: a survey of approaches and evaluation methods. Semant. Web 8(3), 489–508 (2017)
Paulheim, H., Bizer, C.: Improving the quality of Linked Data using statistical distributions. Int. J. Semant. Web Inf. Syst. 10(2), 63–86 (2014)
Pearson, R.K.: Mining imperfect data - dealing with contamination and incomplete records. SIAM (2005)
Polleres, A., Reutter, J.L., Kostylev, E.V.: Nested constructs vs. sub-selects in SPARQL. In: AMW. CEUR Workshop Proceedings, vol. 1644. CEUR-WS.org (2016)
Polleres, A., Scharffe, F., Schindlauer, R.: SPARQL++ for mapping between RDF vocabularies. In: Meersman, R., Tari, Z. (eds.) OTM 2007. LNCS, vol. 4803, pp. 878–896. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76848-7_59
Prud’hommeaux, E., Gayo, J.E.L., Solbrig, H.R.: Shape expressions: an RDF validation and transformation language. In: SEMANTICS, pp. 32–40. ACM (2014)
Rivero, C.R., Hernández, I., Ruiz, D., Corchuelo, R.: Towards discovering ontological models from big RDF data. In: Castano, S., Vassiliadis, P., Lakshmanan, L.V., Lee, M.L. (eds.) ER 2012. LNCS, vol. 7518, pp. 131–140. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33999-8_16
Rosner, B.: Percentage points for a generalized ESD many-outlier procedure. Technometrics 25(2), 165–172 (1983)
Russell, S.J., Norvig, P.: Artificial Intelligence - A Modern Approach, 3rd internat. edn. Pearson Education (2010)
Ryman, A.G., Hors, A.L., Speicher, S.: OSLC resource shape: a language for defining constraints on linked data. In: LDOW. CEUR Workshop Proceedings, vol. 996. CEUR-WS.org (2013)
Schenner, G., Bischof, S., Polleres, A., Steyskal, S.: Integrating distributed configurations with RDFS and SPARQL. In: Configuration Workshop. CEUR Workshop Proceedings, vol. 1220, pp. 9–15. CEUR-WS.org (2014)
Schmidt, M., Lausen, G.: Pleasantly consuming linked data with RDF data descriptions. In: COLD. CEUR Workshop Proceedings, vol. 1034. CEUR-WS.org (2013)
Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: ICDT, pp. 4–33. ACM International Conference Proceeding Series. ACM (2010)
Tanon, T.P., Stepanova, D., Razniewski, S., Mirza, P., Weikum, G.: Completeness-aware rule learning from knowledge graphs. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 507–525. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_30
Thalheim, B.: Fundamentals of cardinality constraints. In: Pernul, G., Tjoa, A.M. (eds.) ER 1992. LNCS, vol. 645, pp. 7–23. Springer, Heidelberg (1992). https://doi.org/10.1007/3-540-56023-8_3
Töpper, G., Knuth, M., Sack, H.: DBpedia ontology enrichment for inconsistency detection. In: I-SEMANTICS, pp. 33–40. ACM (2012)
Vandenbussche, P., Atemezing, G., Poveda-Villalón, M., Vatant, B.: Linked open vocabularies (LOV): a gateway to reusable semantic vocabularies on the web. Semant. Web 8(3), 437–452 (2017)
Völker, J., Niepert, M.: Statistical schema induction. In: Antoniou, G., et al. (eds.) ESWC 2011. LNCS, vol. 6643, pp. 124–138. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21034-1_9
Wienand, D., Paulheim, H.: Detecting incorrect numerical data in DBpedia. In: Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8465, pp. 504–518. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07443-6_34
Acknowledgements
This work has been supported by TOMOE project funded by Fujitsu Laboratories Ltd., Japan and Insight Centre for Data Analytics at National University of Ireland Galway, Ireland.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Muñoz, E., Nickles, M. (2018). Statistical Relation Cardinality Bounds in Knowledge Bases. In: Hameurlain, A., Wagner, R., Benslimane, D., Damiani, E., Grosky, W. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science(), vol 11310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58415-6_3
Download citation
DOI: https://doi.org/10.1007/978-3-662-58415-6_3
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58414-9
Online ISBN: 978-3-662-58415-6
eBook Packages: Computer ScienceComputer Science (R0)