Abstract
The continuously growing volume of linked open data (LOD) makes processing this data a challenge. The output of an RDF graph traversal (e.g. an LOD crawl) can be used as a linearisation of the data and thus serve as the basis for a stream-based processing approach. SchemEX (Konrath et al., J. Web Semantics 2012, to appear) uses such an approach to efficiently compute a schema-based index structure for looking up relevant data sources. In this paper we conduct a detailed analysis of how the stream-based approach affects the accuracy of the computed schema. We investigate the impact of parameter choices as well as the impact of the analysed data set under several application-motivated metrics. All three factors are observed to influence the quality of the schema. In particular, we found that excessive use of blank nodes increases the deviations when SchemEX is used to answer complex queries. Overall, however, stream-based schema approximation is quite accurate: the schema elements deviate by at most 10%, and the information encoded in the schema deviates by less than 4%.
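To make the idea of stream-based schema approximation concrete, the following minimal Python sketch processes a linearised stream of triples with a bounded first-in-first-out cache and compares the resulting schema elements with those obtained without a cache limit. It is an illustration under stated assumptions, not the SchemEX implementation: the function names, the `window_size` parameter, the reduction of a schema element to a set of `rdf:type` values, and the deviation measure are all hypothetical and are not the metrics used in the paper.

```python
from collections import OrderedDict

RDF_TYPE = "rdf:type"  # assumed shorthand for the rdf:type predicate


def type_sets_streaming(triples, window_size):
    """Collect per-subject type sets from a triple stream with a bounded FIFO cache.

    Assumption: a schema element is simply the set of types of one subject.
    When the cache is full, the oldest subject is evicted and its type set is
    emitted, even if more triples about that subject arrive later.
    """
    cache = OrderedDict()  # subject -> set of types, in arrival order
    emitted = set()        # distinct type sets observed as schema elements
    for subject, predicate, obj in triples:
        if subject not in cache:
            if len(cache) >= window_size:
                _, types = cache.popitem(last=False)  # evict the oldest subject
                emitted.add(frozenset(types))
            cache[subject] = set()
        if predicate == RDF_TYPE:
            cache[subject].add(obj)
    # Flush whatever is still cached at the end of the stream.
    for types in cache.values():
        emitted.add(frozenset(types))
    return emitted


def deviation(approx, exact):
    """Hypothetical measure: size of the symmetric difference relative to the exact schema."""
    return len(approx ^ exact) / len(exact)


# Toy stream in which information about subject "a" is split across the stream.
stream = [
    ("a", RDF_TYPE, "Person"),
    ("b", RDF_TYPE, "Person"),
    ("c", RDF_TYPE, "Organisation"),
    ("a", RDF_TYPE, "Author"),
]

exact = type_sets_streaming(stream, window_size=len(stream))  # cache never overflows
approx = type_sets_streaming(stream, window_size=2)           # "a" is evicted too early

print(deviation(approx, exact))
```

In this toy stream the small cache evicts subject `a` before its second type arrives, so the approximation contains two incomplete elements in place of the combined one. Growing `window_size` removes the deviation at the cost of more memory, which is the kind of trade-off the parameter analysis is concerned with.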
Notes
1. Available from: http://km.aifb.kit.edu/projects/btc-2011/.
2. Essentially, the rightmost values of the curves correspond to the metric values displayed in the plots above, i.e. to the situation after the complete 20 million triples of the data segment have been processed.
References
Böhm, C., Freitag, M., Heise, A., Lehmann, C., Mascher, A., Naumann, F., Ercegovac, V., Hernandez, M., Haase, P., Schmidt, M.: Govwild: integrating open government data for transparency. In: Proceedings of the 21st International Conference Companion on World Wide Web, pp. 321–324. WWW ’12 Companion. ACM, New York, NY (2012)
Böhm, C., Lorey, J., Naumann, F.: Creating void descriptions for web-scale data. Web Semant. Sci. Serv. Agents World Wide Web 9(3), 339–345 (2011)
Gallego, M., Fernández, J., Martínez-Prieto, M., de la Fuente, P.: RDF visualization using a three-dimensional adjacency matrix. In: SemSearch’11: Proceedings of 4th International Semantic Search Workshop (2011)
Goldman, R., Widom, J.: Dataguides: Enabling query formulation and optimization in semistructured databases. In: Jarke, M., Carey, M.J., Dittrich, K.R., Lochovsky, F.H., Loucopoulos, P., Jeusfeld, M.A. (eds.) VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece. pp. 436–445. Morgan Kaufmann, San Francisco (1997)
Gottron, T., Knauf, M., Scheglmann, S., Scherp, A.: Explicit and implicit schema information on the linked open data cloud: Joined forces or antagonists? Tech. Rep. 06/2012, Institut WeST, Universität Koblenz-Landau (2012)
Gottron, T., Scherp, A., Krayer, B., Peters, A.: Get the google feeling: Supporting users in finding relevant sources of linked open data at web-scale. In: Semantic Web Challenge, Submission to the Billion Triple Track (2012)
Hausenblas, M., Halb, W., Raimond, Y., Heath, T.: What is the size of the semantic web? In: Proceedings of the International Conference on Semantic Systems (2008)
Heath, T., Bizer, C.: Linked Data: Evolving the Web Into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool (2011)
Isele, R., Harth, A., Umbrich, J., Bizer, C.: LDspider: An open-source crawling framework for the web of linked data. In: Poster, International Semantic Web Conference 2010. Shanghai, China (2010)
Konrath, M., Gottron, T., Scherp, A.: SchemEX – web-scale indexed schema extraction of linked open data. In: Semantic Web Challenge, Submission to the Billion Triple Track (2011)
Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX – efficient construction of a data catalogue by stream-based indexing of linked data. Web Semant. Sci. Serv. Agents World Wide Web 16(5), 52–58 (2012). The Semantic Web Challenge, 2011
Maduko, A., Anyanwu, K., Sheth, A., Schliekelman, P.: Graph summaries for subgraph frequency estimation. In: Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, pp. 508–523, ESWC’08. Springer, Berlin, Heidelberg (2008)
Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 295–306, SIGMOD ’98. ACM, New York, NY (1998)
Nestorov, S., Ullman, J.D., Wiener, J.L., Chawathe, S.S.: Representative objects: Concise representations of semistructured, hierarchical data. In: Proceedings of the Thirteenth International Conference on Data Engineering, pp. 79–90, ICDE ’97. IEEE Computer Society, Washington, DC (1997)
Papakonstantinou, Y., Garcia-Molina, H., Widom, J.: Object exchange across heterogeneous information sources. In: Proceedings of the Eleventh International Conference on Data Engineering, pp. 251–260, ICDE ’95. IEEE Computer Society, Washington, DC (1995)
Wang, Q.Y., Yu, J.X., Wong, K.F.: Approximate graph schema extraction for semi-structured data. In: Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Technology, pp. 302–316, EDBT ’00. Springer, London (2000)
Yan, X., Han, J.: gspan: Graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining, p. 721, ICDM ’02. IEEE Computer Society, Washington, DC (2002)
Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 335–346, SIGMOD ’04. ACM, New York, NY (2004)
Acknowledgements
The research leading to these results has received partial funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257859, ROBUST and grant agreement no. 287975, SocialSensor.
Copyright information
© 2013 Springer Science+Business Media New York
About this paper
Cite this paper
Gottron, T., Pickhardt, R. (2013). A Detailed Analysis of the Quality of Stream-Based Schema Construction on Linked Open Data. In: Li, J., Qi, G., Zhao, D., Nejdl, W., Zheng, HT. (eds) Semantic Web and Web Science. Springer Proceedings in Complexity. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6880-6_8
DOI: https://doi.org/10.1007/978-1-4614-6880-6_8
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-6879-0
Online ISBN: 978-1-4614-6880-6
eBook Packages: Computer Science, Computer Science (R0)