Web-Site Boundary Detection Using Incremental RandomWalk Clustering

Alshukri, Ayesh; Coenen, Frans; Zito, Michele

doi:10.1007/978-1-4471-2318-7_20

Conference paper
First Online: 14 October 2011

Abstract

In this paper we describe a random walk clustering technique to address the Website Boundary Detection (WBD) problem. The technique is fully described and compared with alternative (breadth-first and depth-first) approaches. The reported evaluation demonstrates that the random walk technique produces results comparable to or better than those of the alternative techniques, while visiting fewer ‘noise’ pages. To demonstrate that the good results are not simply a consequence of randomising the input data, we also compare with a random ordering technique.
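The difference between a breadth-first and a random-walk visiting order can be illustrated with a minimal sketch. It is not the authors' implementation: the toy link graph, the page names, the `budget` parameter and all function names are assumptions made for illustration, and the incremental clustering step is omitted entirely. The sketch only shows how each strategy decides which page to visit next and how 'noise' pages in the resulting order could be counted.

```python
import random
from collections import deque

# Hypothetical toy link graph (an assumption, not data from the paper):
# pages s1..s4 belong to the target site, n1..n3 are 'noise' pages.
GRAPH = {
    "s1": ["s2", "s3", "n1"],
    "s2": ["s1", "s3", "s4"],
    "s3": ["s1", "s2", "s4"],
    "s4": ["s2", "s3", "n2"],
    "n1": ["s1", "n3"],
    "n2": ["s4", "n3"],
    "n3": ["n1", "n2"],
}


def bfs_order(graph, start, budget):
    """Return up to `budget` distinct pages in breadth-first order."""
    seen, order, queue = {start}, [start], deque([start])
    while queue and len(order) < budget:
        page = queue.popleft()
        for nxt in graph[page]:
            if nxt not in seen and len(order) < budget:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def random_walk_order(graph, start, budget, max_steps=10_000, seed=0):
    """Return distinct pages in order of first visit by a simple random walk.

    At each step the walk moves to a uniformly chosen out-link of the
    current page; revisits are allowed but recorded only once.
    """
    rng = random.Random(seed)
    seen, order, page = {start}, [start], start
    for _ in range(max_steps):
        if len(order) >= budget:
            break
        page = rng.choice(graph[page])
        if page not in seen:
            seen.add(page)
            order.append(page)
    return order


def noise_count(order):
    """Count visited pages whose (hypothetical) name marks them as noise."""
    return sum(1 for page in order if page.startswith("n"))


if __name__ == "__main__":
    for name, fn in (("bfs", bfs_order), ("random walk", random_walk_order)):
        visited = fn(GRAPH, "s1", budget=4)
        print(name, visited, "noise pages:", noise_count(visited))
```

On a graph this small the two orders say nothing about the paper's reported results; the sketch is only meant to make the two visiting strategies concrete.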

Author information

Authors and Affiliations

Department of Computer Science, The University of Liverpool, Ashton Building, Ashton Street, Liverpool, L69 3BX, UK
Ayesh Alshukri, Frans Coenen & Michele Zito

Corresponding author

Correspondence to Ayesh Alshukri.

Editor information

Editors and Affiliations

University of Portsmouth, Lion Terrace, Portsmouth, PO1 3HE, United Kingdom
Max Bramer
School of Computing & Mathematical Sciences, University of Greenwich, Park Row 30, London, SE10 9LS, United Kingdom
Miltos Petridis
School of Computing and Informatics, Nottingham Trent University, Burton Street, Nottingham, NG1 4BU, United Kingdom
Lars Nolle

About this paper

Cite this paper

Alshukri, A., Coenen, F., Zito, M. (2011). Web-Site Boundary Detection Using Incremental RandomWalk Clustering. In: Bramer, M., Petridis, M., Nolle, L. (eds) Research and Development in Intelligent Systems XXVIII. SGAI 2011. Springer, London. https://doi.org/10.1007/978-1-4471-2318-7_20

DOI: https://doi.org/10.1007/978-1-4471-2318-7_20
Published: 14 October 2011
Publisher Name: Springer, London
Print ISBN: 978-1-4471-2317-0
Online ISBN: 978-1-4471-2318-7
eBook Packages: Computer Science, Computer Science (R0)
