LiveRank: How to Refresh Old Crawls

Huynh, The Dang; Mathieu, Fabien; Viennot, Laurent

doi:10.1007/978-3-319-13123-8_12

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8882)

Included in the following conference series:

International Workshop on Algorithms and Models for the Web-Graph

Abstract

This paper considers the problem of refreshing a crawl. More precisely, given a collection of Web pages (with hyperlinks) gathered at some time, we want to identify a significant fraction of these pages that still exist at the present time. The liveness of an old page can be tested through an online query at the present time. We call LiveRank a ranking of the old pages that tries to place alive pages near the top. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the alive pages when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks for Web graphs.
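The quality measure described in the abstract can be illustrated with a minimal sketch. It assumes a toy link graph held in a dictionary and uses a textbook power-iteration PageRank as the static ordering; the function names `pagerank` and `queries_needed`, the damping value, and the example graph are all hypothetical choices made for illustration, not the authors' implementation or data.

```python
import math
from typing import Dict, Hashable, Iterable, List, Set


def pagerank(
    out_links: Dict[Hashable, List[Hashable]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[Hashable, float]:
    """Textbook power-iteration PageRank with uniform teleportation.

    Assumes every link target also appears as a key of `out_links`.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {v: base for v in nodes}
        for v in nodes:
            targets = out_links[v]
            for w in targets:
                nxt[w] += damping * rank[v] / len(targets)
        rank = nxt
    return rank


def queries_needed(
    order: Iterable[Hashable], alive: Set[Hashable], fraction: float
) -> int:
    """Count liveness queries, issued in `order`, until `fraction` of the
    alive pages has been found. Fewer queries means a better ranking."""
    # Small tolerance so that e.g. 0.7 * 10 is treated as exactly 7.
    target = math.ceil(fraction * len(alive) - 1e-9)
    if target <= 0:
        return 0
    found = 0
    for queries, page in enumerate(order, start=1):
        if page in alive:  # stands in for one online liveness query
            found += 1
            if found >= target:
                return queries
    raise ValueError("order does not contain enough alive pages")


if __name__ == "__main__":
    # Hypothetical old crawl: four pages, two of which are still alive.
    links = {"a": ["b"], "b": ["a"], "c": ["a"], "d": []}
    alive_now = {"a", "b"}
    ranks = pagerank(links)
    static_order = sorted(links, key=ranks.get, reverse=True)
    print(static_order, queries_needed(static_order, alive_now, 1.0))
```

In this sketch the set `alive_now` plays the role of the online liveness test, which in practice would be one network request per page. A dynamic variant, as mentioned in the abstract, would recompute or adjust the order after each query result instead of fixing it up front.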

Author information

Authors and Affiliations

Alcatel-Lucent Bell Labs, Paris, France
The Dang Huynh & Fabien Mathieu
Inria – Univ. Paris Diderot, Paris, France
The Dang Huynh & Laurent Viennot

Corresponding author

Correspondence to Fabien Mathieu.

Editor information

Editors and Affiliations

Ryerson University, Toronto, Ontario, Canada
Anthony Bonato
University of California San Diego, La Jolla, California, USA
Fan Chung Graham
Ryerson University, Toronto, Ontario, Canada
Paweł Prałat

About this paper

Cite this paper

Huynh, T.D., Mathieu, F., Viennot, L. (2014). LiveRank: How to Refresh Old Crawls. In: Bonato, A., Graham, F., Prałat, P. (eds) Algorithms and Models for the Web Graph. WAW 2014. Lecture Notes in Computer Science, vol 8882. Springer, Cham. https://doi.org/10.1007/978-3-319-13123-8_12

DOI: https://doi.org/10.1007/978-3-319-13123-8_12
Published: 13 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13122-1
Online ISBN: 978-3-319-13123-8
eBook Packages: Computer Science, Computer Science (R0)
