Abstract
As the Web of Linked Open Data is growing the problem of crawling that cloud becomes increasingly important. Unlike normal Web crawlers, a Linked Data crawler performs a selection to focus on collecting linked RDF (including RDFa) data on the Web. From the perspectives of throughput and coverage, given a newly discovered and targeted URI, the key issue of Linked Data crawlers is to decide whether this URI is likely to dereference into an RDF data source and therefore it is worth downloading the representation it points to. Current solutions adopt heuristic rules to filter irrelevant URIs. Unfortunately, when the heuristics are too restrictive this hampers the coverage of crawling. In this paper, we propose and compare approaches to learn strategies for crawling Linked Data on the Web by predicting whether a newly discovered URI will lead to an RDF data source or not. We detail the features used in predicting the relevance and the methods we evaluated including a promising adaptation of FTRL-proximal online learning algorithm. We compare several options through extensive experiments including existing crawlers as baseline methods to evaluate their efficacy.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
RFC 3986, section 3(2005).
- 2.
- 3.
Bloom filter may report false positive results (but not false negatives) with a low chance. Thus it is possible that a URI has a wrong content type feature.
- 4.
- 5.
We only consider hard URIs.
References
Berners-Lee, T.: Linked data - design issues (2006). https://www.w3.org/DesignIssues/LinkedData.html
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Burer, S., Monteiro, R.D.C.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Comput. Netw. 31(11–16), 1623–1640 (1999)
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: VLDB, pp. 527–534 (2000)
Dodds, L.: Slug: A Semantic Web Crawler (2006)
Duchi, J.C., Singer, Y.: Efficient learning using forward-backward splitting. In: NIPS, pp. 495–503 (2009)
Ermilov, I., Lehmann, J., Martin, M., Auer, S.: LODStats: the data web census dataset. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9982, pp. 38–46. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46547-0_5
Färber, M., Bartscherer, F., Menne, C., Rettinger, A.: Linked data quality of dbpedia, freebase, opencyc, wikidata, and YAGO. Seman. Web 9(1), 77–129 (2018)
Heath, T., Bizer, C.: Linked Data: Evolving the Web Into a Global Data Space, vol. 1. Morgan & Claypool Publishers, San Rafael (2011)
Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic web. In: LDOW (2010)
Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with SWSE: the semantic web search engine. J. Web Sem. 9(4), 365–401 (2011)
Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDspider: an open-source crawling framework for the web of linked data. In: Proceedings of the ISWC 2010 Posters & Demonstrations Track (2010)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 6, 707–710 (1966)
McMahan, H.B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and L1 regularization. In: AISTATS, pp. 525–533 (2011)
McMahan, H.B., et al.: Ad click prediction: a view from the trenches. In: SIGKDD, pp. 1222–1230 (2013)
Meusel, R., Mika, P., Blanco, R.: Focused crawling for structured data. In: CIKM, pp. 1039–1048 (2014)
Umbrich, J., Harth, A., Hogan, A., Decker, S.: Four heuristics to guide structured content crawling. In: ICWE, pp. 196–202 (2008)
Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A.J., Attenberg, J.: Feature hashing for large scale multitask learning. In: ICML, pp. 1113–1120 (2009)
Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: NIPS, pp. 2116–2124 (2009)
Acknowledgement
This work is supported by the ANSWER project PIA FSN2 \(\text {N}^\circ \)P159564-2661789/DOS0060094 between Inria and Qwant.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, H., Gandon, F. (2019). Learning URI Selection Criteria to Improve the Crawling of Linked Open Data. In: Hitzler, P., et al. The Semantic Web. ESWC 2019. Lecture Notes in Computer Science(), vol 11503. Springer, Cham. https://doi.org/10.1007/978-3-030-21348-0_13
Download citation
DOI: https://doi.org/10.1007/978-3-030-21348-0_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21347-3
Online ISBN: 978-3-030-21348-0
eBook Packages: Computer ScienceComputer Science (R0)