Abstract
This paper considers the problem of refreshing a crawl. More precisely, given a collection of Web pages (with hyperlinks) gathered at some earlier time, we want to identify a significant fraction of the pages that still exist at the present time. The liveness of an old page can be tested through an online query. We call LiveRank a ranking of the old pages that tries to place live pages ahead of dead ones. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the live pages when pages are queried in LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks for Web graphs.
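To make the quality measure concrete, the following is a minimal sketch of the query count described above. It is not taken from the paper: the function name `queries_needed`, the `alive` lookup table (standing in for the online liveness query), and the toy page names are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def queries_needed(order: Sequence[str], alive: Dict[str, bool], fraction: float) -> int:
    """Count the liveness queries needed to find `fraction` of the live pages.

    `order` is a candidate LiveRank (pages listed in the order they are queried).
    `alive` is a hypothetical oracle mapping each page to its current liveness;
    in practice each lookup would be one online query.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    total_alive = sum(1 for page in order if alive[page])
    if total_alive == 0:
        return 0  # nothing to find, so no query is required

    # Number of live pages to identify; the small epsilon guards against
    # floating-point noise such as 0.7 * 10 evaluating slightly above 7.
    target = math.ceil(fraction * total_alive - 1e-9)

    found = 0
    for queries, page in enumerate(order, start=1):
        if alive[page]:
            found += 1
            if found >= target:
                return queries
    return len(order)


# Toy example (assumed data): five old pages, three of which are still live.
alive_now = {"a": True, "b": False, "c": True, "d": True, "e": False}
ranking = ["a", "b", "c", "d", "e"]

print(queries_needed(ranking, alive_now, 0.5))   # half of the live pages
print(queries_needed(ranking, alive_now, 1.0))   # all of the live pages
```

Under this measure a lower count is better: an ordering that puts live pages first reaches the target fraction with fewer queries than one that interleaves many dead pages.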
© 2014 Springer International Publishing Switzerland
Cite this paper
Huynh, T.D., Mathieu, F., Viennot, L. (2014). LiveRank: How to Refresh Old Crawls. In: Bonato, A., Graham, F., Prałat, P. (eds) Algorithms and Models for the Web Graph. WAW 2014. Lecture Notes in Computer Science, vol 8882. Springer, Cham. https://doi.org/10.1007/978-3-319-13123-8_12
DOI: https://doi.org/10.1007/978-3-319-13123-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13122-1
Online ISBN: 978-3-319-13123-8
eBook Packages: Computer Science, Computer Science (R0)