Structure Inference for Linked Data Sources Using Clustering

Christodoulou, Klitos; Paton, Norman W.; Fernandes, Alvaro A. A.

doi:10.1007/978-3-662-46562-2_1

Klitos Christodoulou²²,
Norman W. Paton²² &
Alvaro A. A. Fernandes²²

Part of the book series: Lecture Notes in Computer Science ((TLDKS,volume 8990))

938 Accesses
14 Citations

Abstract

Linked Data (LD) overlays the World Wide Web of documents with a Web of Data. This is becoming significant as shown in the growth of LD repositories available as part of the Linked Open Data (LOD) cloud. At the instance-level, LD sources use a combination of terms from various vocabularies, expressed as RDFS/OWL, to describe data and publish it to the Web. However, LD sources do not organise data to conform to a specific structure analogous to a relational schema; instead data can adhere to multiple vocabularies. Expressing SPARQL queries over LD sources – usually over a SPARQL endpoint that is presented to the user – requires knowledge of the predicates used so as to allow queries to express user requirements as graph patterns. Although LD provides low barriers to data publication using a single language (i.e., RDF), sources organise data with different structures and terminologies. This paper describes an approach to automatically derive structural summaries over instance-level data expressed as RDF triples. The technique builds on a hierarchical clustering algorithm that organises RDF instance-level data into groups that are then utilised to infer a structural summary over a LD source. The resulting structural summaries are expressed in the form of classes, properties and, relationships. Our experimental evaluation shows good results when applied to different types of LD sources.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
http://lod-cloud.net.
2.
See Vocabulary of Interlinked Datasets: http://www.w3.org/TR/void/.
3.
For more statistics, see http://www4.wiwiss.fu-berlin.de/lodcloud/state/.
4.
Is the opposite of a similarity.
5.
Observing the vocabularies listed by the Linked Open Vocabularies (LOV) project: http://lov.okfn.org/dataset/lov/.
6.
http://stats.lod2.eu/.

References

Arenas, M., Gutierrez, C., Pérez, J.: Foundations of RDF databases. In: Tessaris, S., Franconi, E., Eiter, T., Gutierrez, C., Handschuh, S., Rousset, M.-C., Schmidt, R.A. (eds.) Reasoning Web. LNCS, vol. 5689, pp. 158–204. Springer, Heidelberg (2009)
Google Scholar
Bizer, C., Cyganiak, R.: D2r server - publishing relational databases on the semantic web. In: 5th International Semantic Web Conference, p. 26 (2006)
Google Scholar
Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semant. Web Inf. Syst. 5(3), 1–22 (2009)
Article Google Scholar
Fahad, M.: Er2owl: generating owl ontology from er diagram. In: Shi, Z., Mercier-Laurent, E., Leake, D. (eds.) Intelligent Information Processing IV. IFIP, vol. 288, pp. 28–37. Springer, Heidelberg (2008)
Chapter Google Scholar
Franklin, M.J., Halevy, A.Y., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)
Article Google Scholar
Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: Proceedings of the 23rd International Conference on Very Large Data Bases, pp. 436–445. Morgan Kaufmann Publishers Inc. (1997)
Google Scholar
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17(2–3), 107–145 (2001)
Article MATH Google Scholar
Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.-U., Umbrich, J.: Data summaries for on-demand queries over linked data. In: WWW, pp. 411–420 (2010)
Google Scholar
Heath, T., Bizer, C.: Linked Data: evolving the web into a global data space. In: Synthesis Lectures on the Semantic Web. Morgan & Claypool Publishers (2011)
Google Scholar
Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with swse: the semantic web search engine. J. Web Sem. 9(4), 365–401 (2011)
Article Google Scholar
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, New York (1990)
Book Google Scholar
Klyne, G., Carroll, J.J.: Resource description framework (RDF): concepts and abstract syntax. Technical report, W3C (2004)
Google Scholar
Konrath, M., Gottron, T., Staab, S., Scherp, A.: Schemex - efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16, 52–58 (2012)
Article Google Scholar
Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: KDD, pp. 16–22 (1999)
Google Scholar
Ravi Bhushan Mishra and Sandeep Kumar: Semantic web reasoners and languages. Artif. Intell. Rev. 35(4), 339–368 (2011)
Article Google Scholar
Paton, N.W., Christodoulou, K., Fernandes, A.A.A., Parsia, B., Hedeler, C.: Pay-as-you-go data integration for linked data: opportunities, challenges and architectures. In: Proceedings of the 4th International Workshop on Semantic Web Information Management, SWIM 2012, pp. 3:1–3:8. ACM (2012)
Google Scholar
Prasser, F., Kemper, A., Kuhn, K.A.: Efficient distributed query processing for autonomous RDF databases. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT 2012, pp. 372–383. ACM (2012)
Google Scholar
Prud’hommeaux, E., Seaborne, A.: SPARQL query language for RDF. W3C Recommendation 4, 1–106 (2008)
Google Scholar
Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 524–538. Springer, Heidelberg (2008)
Chapter Google Scholar
Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: optimization techniques for federated query processing on linked data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 601–616. Springer, Heidelberg (2011)
Chapter Google Scholar
Umbrich, J., Hose, K., Karnstedt, M., Harth, A., Polleres, A.: Comparing data summaries for processing live queries over linked data. World Wide Web 14(5–6), 495–544 (2011)
Article Google Scholar
Völker, J., Niepert, M.: Statistical schema induction. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643, pp. 124–138. Springer, Heidelberg (2011)
Chapter Google Scholar
Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: CIKM, pp. 515–524 (2002)
Google Scholar
Zong, N., Im, D.-H., Yang, S.-K., Namgoong, H., Kim, H.-G.: Dynamic generation of concepts hierarchies for knowledge discovering in bio-medical linked data sets. In: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC 2012, pp. 12:1–12:5. ACM (2012)
Google Scholar

Download references

Acknowledgement

Klitos Christodoulou has been supported by funding from the UK Engineering and Physical Sciences Research council, whose support we are pleased to acknowledge.

Author information

Authors and Affiliations

School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK
Klitos Christodoulou, Norman W. Paton & Alvaro A. A. Fernandes

Authors

Klitos Christodoulou
View author publications
You can also search for this author in PubMed Google Scholar
Norman W. Paton
View author publications
You can also search for this author in PubMed Google Scholar
Alvaro A. A. Fernandes
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Klitos Christodoulou .

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
University of Brescia, Brescia, Italy
Devis Bianchini
University of Brescia, Brescia, Italy
Valeria De Antonellis
University of Rome III, Rome, Italy
Roberto De Virgilio

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Christodoulou, K., Paton, N.W., Fernandes, A.A.A. (2015). Structure Inference for Linked Data Sources Using Clustering. In: Hameurlain, A., Küng, J., Wagner, R., Bianchini, D., De Antonellis, V., De Virgilio, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XIX. Lecture Notes in Computer Science(), vol 8990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46562-2_1

Download citation

DOI: https://doi.org/10.1007/978-3-662-46562-2_1
Published: 24 February 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46561-5
Online ISBN: 978-3-662-46562-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics