Coherence-Oriented Crawling and Navigation Using Patterns for Web Archives

  • Conference paper
Research and Advanced Technology for Digital Libraries (TPDL 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6966)

Abstract

This paper addresses the problem of improving the coherence of web archives under limited resources (e.g., bandwidth and storage space). Coherence measures how well a collection of archived page versions reflects the real state (the snapshot) of a set of related web pages at different points in time. Ideally, coherence would be preserved by preventing page content from changing while a complete collection is crawled. This is infeasible in practice because web sites are autonomous and dynamic. We propose two solutions, one a priori and one a posteriori. As an a priori solution, we crawl sites during off-peak hours (i.e., periods in which very few changes are expected on the pages), which are identified from patterns. A pattern models how the importance of page changes behaves over a period of time. As an a posteriori solution, based on the same patterns, we introduce a novel navigation approach that lets users browse the most coherent page versions at a given query time.
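
To make the a priori idea concrete, the sketch below picks the quietest crawl window from a pattern. It is only an illustration under stated assumptions, not the method defined in the paper: the abstract does not specify how a pattern is represented, so the pattern is assumed here to be a list of per-hour change-importance scores over one day, treated as circular. The function name `quietest_window`, the `window` parameter, and the sample scores are all hypothetical.

```python
from typing import Sequence


def quietest_window(pattern: Sequence[float], window: int) -> int:
    """Return the start slot of the `window`-slot span with the lowest
    total expected change importance.

    Assumption (not from the paper): `pattern` holds one importance score
    per time slot and wraps around, e.g. 24 hourly slots of a day.
    """
    n = len(pattern)
    if not 0 < window <= n:
        raise ValueError("window must be between 1 and len(pattern)")

    best_start = 0
    best_total = float("inf")
    for start in range(n):
        # Sum the scores of the slots covered by this window, wrapping
        # past the end of the pattern when needed.
        total = sum(pattern[(start + k) % n] for k in range(window))
        if total < best_total:  # strict: the earliest window wins a tie
            best_start, best_total = start, total
    return best_start


# Hypothetical hourly scores for one site (index = hour of day).
hourly_pattern = [3, 2, 1, 1, 3, 5, 9, 14, 18, 16, 15, 13,
                  17, 16, 12, 11, 12, 14, 17, 15, 12, 8, 6, 4]

start_hour = quietest_window(hourly_pattern, window=3)
print(f"Crawl during hours {start_hour}..{(start_hour + 2) % 24}")
```

With these made-up scores, the selected three-hour window starts at hour 1. A crawler scheduled this way would visit the site when the fewest important changes are expected, which is the intuition behind crawling during off-peak hours.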

This research is supported by the French National Research Agency ANR in the CARTEC Project (ANR-07-MDCO-016).

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ben Saad, M., Pehlivan, Z., Gançarski, S. (2011). Coherence-Oriented Crawling and Navigation Using Patterns for Web Archives. In: Gradmann, S., Borri, F., Meghini, C., Schuldt, H. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2011. Lecture Notes in Computer Science, vol 6966. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24469-8_42

  • DOI: https://doi.org/10.1007/978-3-642-24469-8_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24468-1

  • Online ISBN: 978-3-642-24469-8

  • eBook Packages: Computer Science, Computer Science (R0)
