Abstract
While the Web of (entity-centric) data has seen tremendous growth over the past years, take-up and re-use are still limited. Data vary heavily with respect to their scale, quality, coverage or dynamics, which poses challenges for tasks such as entity retrieval or search. This chapter provides an overview of approaches to deal with the increasing heterogeneity of Web data. On the one hand, recommendation, linking, profiling and retrieval can provide efficient means to enable discovery and search of entity-centric data, specifically when dealing with traditional knowledge graphs and linked data. On the other hand, embedded markup such as Microdata and RDFa has emerged as a novel, Web-scale source of entity-centric knowledge. Having seen increasing adoption over the last few years, driven by initiatives such as schema.org, markup constitutes an increasingly important source of entity-centric data on the Web, being in the same order of magnitude as the Web itself with regard to dynamics and scale. To this end, markup data lends itself as a data source for aiding tasks such as knowledge base augmentation, where data fusion techniques are required to address the inherent characteristics of markup data, such as its redundancy, heterogeneity and lack of links. Future directions are concerned with the exploitation of the complementary nature of markup data and traditional knowledge graphs.
Notes
- 1. RDFa W3C recommendation: http://www.w3.org/TR/xhtml-rdfa-primer/.
Acknowledgements
All discussed works are joint research with numerous colleagues, friends and collaborators from a number of research institutions. The author would like to thank all involved researchers for the inspiring and productive work throughout the previous years. In addition, the author expresses his gratitude to all funding bodies that enabled the presented research through a variety of funding programs.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Dietze, S. (2017). Retrieval, Crawling and Fusion of Entity-centric Data on the Web. In: Calì, A., Gorgan, D., Ugarte, M. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2016. Lecture Notes in Computer Science, vol 10151. Springer, Cham. https://doi.org/10.1007/978-3-319-53640-8_1
DOI: https://doi.org/10.1007/978-3-319-53640-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53639-2
Online ISBN: 978-3-319-53640-8
eBook Packages: Computer Science, Computer Science (R0)