Vi-DIFF: Understanding Web Pages Changes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6261)

Abstract

Many applications need to detect changes on the web, both to help users understand page updates and, more generally, to study web dynamics. Web archiving is one field where detecting changes on web pages is important: archiving institutes collect and preserve successive versions of web sites for future generations. A major problem for archiving systems is understanding what happened between two versions of a web page. In this paper, we address this requirement by proposing a new change detection approach that computes the semantic differences between two versions of an HTML web page. Our approach, called Vi-DIFF, detects changes on the visual representation of web pages. It detects two types of changes: content changes and structural changes. Content changes include modifications to text, hyperlinks and images. Structural changes, in contrast, alter the visual appearance of the page and the structure of its blocks. Vi-DIFF can serve various applications, such as crawl optimization, archive maintenance and browsing of web changes. Experiments on Vi-DIFF were conducted and the results are promising.
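The distinction between content and structural changes can be illustrated with a minimal sketch. It is not the Vi-DIFF algorithm itself, whose details the abstract does not give. It assumes each page version has already been segmented into a tree of visual blocks, and the `Block` class, its fields, and the functions `flatten` and `diff_pages` are hypothetical names introduced only for this example. Blocks are matched naively by their path in the tree: a block present in only one version is reported as a structural change, and a matched block whose text, hyperlinks or images differ is reported as a content change.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A visual block of a page (hypothetical model, not the paper's format)."""
    name: str
    text: str = ""
    links: tuple = ()
    images: tuple = ()
    children: list = field(default_factory=list)


def flatten(block, prefix=""):
    """Map each block's path in the tree (e.g. 'page/story') to the block."""
    path = f"{prefix}/{block.name}" if prefix else block.name
    found = {path: block}
    for child in block.children:
        found.update(flatten(child, path))
    return found


def diff_pages(old, new):
    """Return (structural, content) change lists between two block trees."""
    old_blocks, new_blocks = flatten(old), flatten(new)
    # Structural changes: blocks that exist in only one of the two versions.
    structural = [("deleted", p) for p in sorted(old_blocks.keys() - new_blocks.keys())]
    structural += [("inserted", p) for p in sorted(new_blocks.keys() - old_blocks.keys())]
    # Content changes: same block path, different text, hyperlinks or images.
    content = []
    for path in sorted(old_blocks.keys() & new_blocks.keys()):
        before, after = old_blocks[path], new_blocks[path]
        for attr in ("text", "links", "images"):
            if getattr(before, attr) != getattr(after, attr):
                content.append((attr, path))
    return structural, content


# Two made-up versions of the same page.
v1 = Block("page", children=[
    Block("header", text="Daily News"),
    Block("story", text="Old headline", links=("a.html",)),
])
v2 = Block("page", children=[
    Block("header", text="Daily News"),
    Block("story", text="New headline", links=("a.html",)),
    Block("sidebar", images=("ad.png",)),
])

structural, content = diff_pages(v1, v2)
print(structural)  # [('inserted', 'page/sidebar')]
print(content)     # [('text', 'page/story')]
```

Matching blocks by path keeps the sketch short, but it would report a moved or renamed block as a deletion plus an insertion. A realistic detector would need a more tolerant block-matching step.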

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pehlivan, Z., Ben-Saad, M., Gançarski, S. (2010). Vi-DIFF: Understanding Web Pages Changes. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds) Database and Expert Systems Applications. DEXA 2010. Lecture Notes in Computer Science, vol 6261. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15364-8_1

  • DOI: https://doi.org/10.1007/978-3-642-15364-8_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15363-1

  • Online ISBN: 978-3-642-15364-8

  • eBook Packages: Computer Science, Computer Science (R0)
