LiveRank: How to Refresh Old Crawls

  • Conference paper
Algorithms and Models for the Web Graph (WAW 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8882)

Abstract

This paper considers the problem of refreshing a crawl. More precisely, given a collection of Web pages (with hyperlinks) gathered at some earlier time, we want to identify a significant fraction of the pages that still exist at the present time. The liveness of an old page can be tested with an online query at the present time. We call LiveRank a ranking of the old pages that tries to rank alive pages first. The quality of a LiveRank is measured by the number of queries needed to identify a given fraction of the alive pages when pages are queried in LiveRank order. We study several scenarios, from a static setting where the LiveRank is computed before any query is made to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks for Web graphs.
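The quality measure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `queries_needed`, the toy page names, and the choice to round the target up are assumptions made for illustration only. The sketch counts how many liveness queries are spent, following a given ranking, before a chosen fraction of the alive pages has been found.

```python
from math import ceil


def queries_needed(order, alive, fraction):
    """Number of liveness queries used, following `order`, until
    `fraction` of the alive pages has been identified.

    order    -- list of old pages, best-ranked first (a candidate LiveRank)
    alive    -- set of pages that are actually alive (ground truth,
                known only when evaluating a ranking after the fact)
    fraction -- target share of alive pages, between 0 and 1
    """
    total_alive = sum(1 for page in order if page in alive)
    # Assumption: the target is rounded up to a whole number of pages.
    target = ceil(fraction * total_alive)
    if target == 0:
        return 0
    found = 0
    for queries, page in enumerate(order, start=1):
        # Each step stands for one online query testing the page.
        if page in alive:
            found += 1
            if found == target:
                return queries
    return len(order)


# Hypothetical toy crawl: four old pages, three of them still alive.
alive_pages = {"a", "c", "d"}
good_order = ["a", "c", "d", "b"]  # alive pages ranked first
poor_order = ["b", "a", "c", "d"]  # a dead page ranked first

good_cost = queries_needed(good_order, alive_pages, 1.0)
poor_cost = queries_needed(poor_order, alive_pages, 1.0)
```

Under this measure a lower count is better: a ranking that places alive pages ahead of dead ones reaches the target fraction with fewer queries.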

Author information

Corresponding author

Correspondence to Fabien Mathieu.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Huynh, T.D., Mathieu, F., Viennot, L. (2014). LiveRank: How to Refresh Old Crawls. In: Bonato, A., Graham, F., Prałat, P. (eds) Algorithms and Models for the Web Graph. WAW 2014. Lecture Notes in Computer Science, vol 8882. Springer, Cham. https://doi.org/10.1007/978-3-319-13123-8_12

  • DOI: https://doi.org/10.1007/978-3-319-13123-8_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13122-1

  • Online ISBN: 978-3-319-13123-8

  • eBook Packages: Computer Science, Computer Science (R0)
