Statistical Relation Cardinality Bounds in Knowledge Bases

  • Chapter
Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX

Abstract

The number of Semantic Web knowledge bases (KBs) available on the Web, created in academia and industry alike, keeps increasing. In this paper, we address the lack of structure in these KBs, which stems from the schema-free nature required for open environments such as the Web. Relation cardinality is an important structural aspect of data that has not received enough attention in the context of KBs. We propose a definition of relation cardinality bounds that can be used to unveil the structure that KB data naturally exhibit. Information about relation cardinalities, for example that a person can have two parents and zero or more children, that a book should have at least one author, or that a country should have more than two cities, can be useful for data users and knowledge engineers when writing queries and when reusing or engineering KB systems. Such cardinalities can be declared using OWL and RDF constraint languages as constraints on the usage of properties in the domain of knowledge; however, their declaration is optional, and consistency with the instance data is not ensured. We first address the problem of mining relation cardinality bounds by proposing an algorithm that normalises and filters the data to ensure the accuracy and robustness of the mined cardinality bounds. Then we show how these bounds can be used to assess two relevant data quality dimensions: consistency and completeness. Finally, we report that relation cardinality bounds can also be used to expose structural characteristics of a KB by mapping the bounds into a constraint language to declare the actual shape of data.
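To make the notion of a relation cardinality bound concrete, the following minimal sketch counts, for one property, how many distinct objects each subject has and reports the smallest and largest count. It is an illustration only, not the algorithm proposed in the chapter: the function name `cardinality_bounds`, the `ex:` identifiers, and the toy triples are all hypothetical, and the sketch performs neither the normalisation nor the filtering that the abstract mentions.

```python
from collections import defaultdict


def cardinality_bounds(triples, prop, subjects):
    """Return (lower, upper): the min and max number of distinct
    objects that the given subjects have for property `prop`.

    Assumption for illustration: `subjects` is non-empty and lists every
    entity of interest, so entities that never use `prop` count as 0.
    """
    objects = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            objects[s].add(o)  # a set ignores duplicate triples
    counts = [len(objects[s]) for s in subjects]
    return min(counts), max(counts)


# Hypothetical toy data, not taken from the chapter.
triples = [
    ("ex:alice", "ex:hasParent", "ex:carol"),
    ("ex:alice", "ex:hasParent", "ex:dave"),
    ("ex:bob", "ex:hasParent", "ex:carol"),
    ("ex:bob", "ex:hasParent", "ex:carol"),  # duplicate statement
    ("ex:bob", "ex:hasParent", "ex:dave"),
    ("ex:erin", "ex:hasParent", "ex:frank"),
    ("ex:alice", "ex:hasChild", "ex:gina"),
]
people = ["ex:alice", "ex:bob", "ex:erin"]

print(cardinality_bounds(triples, "ex:hasParent", people))  # (1, 2)
print(cardinality_bounds(triples, "ex:hasChild", people))   # (0, 1)
```

On this toy data the observed lower bound of 1 for `ex:hasParent` comes from a single entity with one recorded parent, which hints at why raw minima and maxima are fragile and why filtering the data before deriving bounds matters for completeness and consistency checks.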

Notes

  1. Henceforth, we use prefixes to replace namespaces according to http://prefix.cc/ to shorten the length of URLs.

  2. https://www.w3.org/TR/shacl/ (accessed on February 13, 2017).

  3. http://dublincore.org/documents/dc-dsp/.

  4. OWL allows the expression of cardinalities through the minCardinality, maxCardinality, and cardinality restrictions.

  5. http://docs.stardog.com/icv/icv-specification.html.

  6. http://spinrdf.org/.

  7. This work extends our previous work in [22].

  8. Shape Expressions (ShEx) Primer: http://shex.io/shex-primer/.

  9. http://spark.apache.org/ (version 2.1.0).

  10. Any complete graph is its own maximal clique.

  11. http://data.linkedmdb.org/.

  12. http://www.cyc.com/platform/opencyc.

  13. https://www.cs.ox.ac.uk/isg/tools/UOBMGenerator/.

  14. http://www.bl.uk/bibliographic/download.html.

  15. http://www.dbis.informatik.uni-goettingen.de/Mondial/.

  16. https://datahub.io/dataset/nytimes-linked-open-data.

  17. http://data.semanticweb.org/.

References

  1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)

  2. Arenas, M., Conca, S., Pérez, J.: Counting beyond a yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. In: WWW, pp. 629–638. ACM (2012)

  3. Arenas, M., Gutierrez, C., Pérez, J.: Foundations of RDF databases. In: Tessaris, S., et al. (eds.) Reasoning Web 2009. LNCS, vol. 5689, pp. 158–204. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03754-2_4

  4. Boneva, I., Labra Gayo, J.E., Prud’hommeaux, E.G.: Semantics and validation of shapes schemas for RDF. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 104–120. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_7

  5. Bosch, T., Eckert, K.: Guidance, please! towards a framework for RDF-based constraint languages. In: Proceedings of the International Conference on Dublin Core and Metadata Applications (2015)

  6. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)

  7. Christodoulou, K., Paton, N.W., Fernandes, A.A.A.: Structure inference for linked data sources using clustering. Trans. Large-Scale Data Knowl.-Centered Syst. 19, 1–25 (2015)

  8. Ferrarotti, F., Hartmann, S., Link, S.: Efficiency frontiers of XML cardinality constraints. Data Knowl. Eng. 87, 297–319 (2013)

  9. Fleischhacker, D., Paulheim, H., Bryl, V., Völker, J., Bizer, C.: Detecting errors in numerical linked data using cross-checked outlier detection. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 357–372. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_23

  10. Galárraga, L., Razniewski, S., Amarilli, A., Suchanek, F.M.: Predicting completeness in knowledge bases. In: WSDM, pp. 375–383. ACM (2017)

  11. Glimm, B., Hogan, A., Krötzsch, M., Polleres, A.: OWL: yet to arrive on the web of data? In: LDOW. CEUR Workshop Proceedings, vol. 937. CEUR-WS.org (2012)

  12. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic web. In: LDOW. CEUR Workshop Proceedings, vol. 628. CEUR-WS.org (2010)

  13. Horrocks, I., Tessaris, S.: Querying the semantic web: a formal approach. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 177–191. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-48005-6_15

  14. Kellou-Menouer, K., Kedad, Z.: Evaluating the gap between an RDF dataset and its schema. In: Jeusfeld, M.A., Karlapalem, K. (eds.) ER 2015. LNCS, vol. 9382, pp. 283–292. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25747-1_28

  15. Kostylev, E.V., Reutter, J.L., Romero, M., Vrgoč, D.: SPARQL with property paths. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 3–18. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25007-6_1

  16. Lausen, G., Meier, M., Schmidt, M.: SPARQLing constraints for RDF. In: EDBT, ACM International Conference Proceeding Series, vol. 261, pp. 499–509. ACM (2008)

  17. Liddle, S.W., Embley, D.W., Woodfield, S.N.: Cardinality constraints in semantic data models. Data Knowl. Eng. 11(3), 235–270 (1993)

  18. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases. J. Web Sem. 7(2), 74–89 (2009)

  19. Motik, B., Nenov, Y., Piro, R.E.F., Horrocks, I.: Handling owl:sameAs via rewriting. In: AAAI, pp. 231–237. AAAI Press (2015)

  20. Motik, B., Patel-Schneider, P.F., Parsia, B.: OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax, 2nd edn (2012). http://www.w3.org/TR/2012/REC-owl2-syntax-20121211/

  21. Muñoz, E.: On learnability of constraints from RDF data. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 834–844. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-34129-3_52

  22. Muñoz, E., Nickles, M.: Mining cardinalities from knowledge bases. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10438, pp. 447–462. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64468-4_34

  23. Neumann, T., Moerkotte, G.: Characteristic sets: accurate cardinality estimation for RDF queries with multiple joins. In: ICDE, pp. 984–994. IEEE Computer Society (2011)

  24. Olivé, A.: Conceptual Modeling of Information Systems. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-39390-0

  25. Papakonstantinou, V., Flouris, G., Fundulaki, I., Gubichev, A.: Some thoughts on OWL-empowered SPARQL query optimization. In: Sack, H., Rizzo, G., Steinmetz, N., Mladenić, D., Auer, S., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9989, pp. 12–16. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47602-5_3

  26. Paulheim, H.: Knowledge graph refinement: a survey of approaches and evaluation methods. Semant. Web 8(3), 489–508 (2017)

  27. Paulheim, H., Bizer, C.: Improving the quality of Linked Data using statistical distributions. Int. J. Semant. Web Inf. Syst. 10(2), 63–86 (2014)

  28. Pearson, R.K.: Mining imperfect data - dealing with contamination and incomplete records. SIAM (2005)

  29. Polleres, A., Reutter, J.L., Kostylev, E.V.: Nested constructs vs. sub-selects in SPARQL. In: AMW. CEUR Workshop Proceedings, vol. 1644. CEUR-WS.org (2016)

  30. Polleres, A., Scharffe, F., Schindlauer, R.: SPARQL++ for mapping between RDF vocabularies. In: Meersman, R., Tari, Z. (eds.) OTM 2007. LNCS, vol. 4803, pp. 878–896. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76848-7_59

  31. Prud’hommeaux, E., Gayo, J.E.L., Solbrig, H.R.: Shape expressions: an RDF validation and transformation language. In: SEMANTICS, pp. 32–40. ACM (2014)

  32. Rivero, C.R., Hernández, I., Ruiz, D., Corchuelo, R.: Towards discovering ontological models from big RDF data. In: Castano, S., Vassiliadis, P., Lakshmanan, L.V., Lee, M.L. (eds.) ER 2012. LNCS, vol. 7518, pp. 131–140. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33999-8_16

  33. Rosner, B.: Percentage points for a generalized ESD many-outlier procedure. Technometrics 25(2), 165–172 (1983)

  34. Russell, S.J., Norvig, P.: Artificial Intelligence - A Modern Approach, 3rd internat. edn. Pearson Education (2010)

  35. Ryman, A.G., Hors, A.L., Speicher, S.: OSLC resource shape: a language for defining constraints on linked data. In: LDOW. CEUR Workshop Proceedings, vol. 996. CEUR-WS.org (2013)

  36. Schenner, G., Bischof, S., Polleres, A., Steyskal, S.: Integrating distributed configurations with RDFS and SPARQL. In: Configuration Workshop. CEUR Workshop Proceedings, vol. 1220, pp. 9–15. CEUR-WS.org (2014)

  37. Schmidt, M., Lausen, G.: Pleasantly consuming linked data with RDF data descriptions. In: COLD. CEUR Workshop Proceedings, vol. 1034. CEUR-WS.org (2013)

  38. Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: ICDT, pp. 4–33. ACM International Conference Proceeding Series. ACM (2010)

  39. Tanon, T.P., Stepanova, D., Razniewski, S., Mirza, P., Weikum, G.: Completeness-aware rule learning from knowledge graphs. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 507–525. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_30

  40. Thalheim, B.: Fundamentals of cardinality constraints. In: Pernul, G., Tjoa, A.M. (eds.) ER 1992. LNCS, vol. 645, pp. 7–23. Springer, Heidelberg (1992). https://doi.org/10.1007/3-540-56023-8_3

  41. Töpper, G., Knuth, M., Sack, H.: DBpedia ontology enrichment for inconsistency detection. In: I-SEMANTICS, pp. 33–40. ACM (2012)

  42. Vandenbussche, P., Atemezing, G., Poveda-Villalón, M., Vatant, B.: Linked open vocabularies (LOV): a gateway to reusable semantic vocabularies on the web. Semant. Web 8(3), 437–452 (2017)

  43. Völker, J., Niepert, M.: Statistical schema induction. In: Antoniou, G., et al. (eds.) ESWC 2011. LNCS, vol. 6643, pp. 124–138. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21034-1_9

  44. Wienand, D., Paulheim, H.: Detecting incorrect numerical data in DBpedia. In: Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8465, pp. 504–518. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07443-6_34

Acknowledgements

This work has been supported by the TOMOE project funded by Fujitsu Laboratories Ltd., Japan, and the Insight Centre for Data Analytics at the National University of Ireland Galway, Ireland.

Author information

Corresponding author

Correspondence to Emir Muñoz.

Copyright information

© 2018 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Muñoz, E., Nickles, M. (2018). Statistical Relation Cardinality Bounds in Knowledge Bases. In: Hameurlain, A., Wagner, R., Benslimane, D., Damiani, E., Grosky, W. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science, vol 11310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58415-6_3

  • DOI: https://doi.org/10.1007/978-3-662-58415-6_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-58414-9

  • Online ISBN: 978-3-662-58415-6

  • eBook Packages: Computer Science (R0)
