Incremental Web-Site Boundary Detection Using Random Walks

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6871)

Abstract

The paper describes variations of the classical k-means clustering algorithm that can be used effectively to address the so-called Web-site Boundary Detection (WBD) problem. The suggested advantages of these techniques are that they can quickly identify most of the pages belonging to a web-site and, in the long run, return a solution of accuracy comparable to (if not better than) that of other clustering methods. We analyze our techniques on artificial clones of the web generated using a well-known preferential attachment method.
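As a rough illustration of the kind of test setting the abstract mentions, the sketch below builds a small synthetic graph by preferential attachment and counts the pages a random walk visits from a seed page. It is not the authors' method: the undirected graph, the function names `preferential_attachment_graph` and `random_walk_visits`, and the parameters `n`, `m`, `steps` and `seed` are all assumptions made for this example, and the abstract does not say which preferential attachment model the paper uses.

```python
import random
from collections import Counter


def preferential_attachment_graph(n, m, seed=0):
    """Build an undirected graph on n nodes (assumes n > m >= 1).

    Each new node links to m distinct existing nodes, chosen with
    probability proportional to their current degree.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Every edge contributes both endpoints to this list, so a uniform
    # draw from it picks a node with probability proportional to degree.
    endpoints = []
    # Start from a small clique on m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
            endpoints += [u, v]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in chosen:
            adj[new].add(old)
            adj[old].add(new)
            endpoints += [new, old]
    return adj


def random_walk_visits(adj, start, steps, seed=0):
    """Count how often each node is visited by a simple random walk."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        # Sort the neighbours so the walk is reproducible for a given seed.
        node = rng.choice(sorted(adj[node]))
    return visits


graph = preferential_attachment_graph(n=50, m=2)
visits = random_walk_visits(graph, start=0, steps=1000)
# Pages visited most often from the seed, as a crude proximity ranking.
ranking = [page for page, _ in visits.most_common(10)]
```

Under these assumptions, pages that the walk reaches often from the seed could be read as early candidates for the seed's web-site, with the ranking refined as more steps are taken. How the paper actually combines walks with clustering is not stated in the abstract.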


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Alshukri, A., Coenen, F., Zito, M. (2011). Incremental Web-Site Boundary Detection Using Random Walks. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2011. Lecture Notes in Computer Science, vol 6871. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23199-5_31

  • DOI: https://doi.org/10.1007/978-3-642-23199-5_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23198-8

  • Online ISBN: 978-3-642-23199-5

  • eBook Packages: Computer Science (R0)
