Identification of Underestimated and Overestimated Web Pages Using PageRank and Web Usage Mining Methods

Kapusta, Jozef; Munk, Michal; Drlík, Martin

doi:10.1007/978-3-662-48145-5_7


Part of the book series: Lecture Notes in Computer Science (TCCI, volume 9240)


Abstract

The paper describes an alternative method of website analysis and optimization that combines web usage mining and web structure mining, that is, the discovery of web users’ behaviour patterns together with the discovery of knowledge from the website structure. Its primary objective is to identify web pages whose importance, as estimated by the website developers, does not correspond to the real behaviour of the website visitors. It has previously been shown that the expected visit rate correlates with the observed visit rate of web pages. Consequently, the expected probabilities of a visitor visiting individual web pages were calculated using the PageRank method, and the observed probabilities were obtained from the web server log files using web usage mining. The observed and expected probabilities were compared using residual analysis. While sequence rule analysis can only uncover potential problems in web pages with a higher visit rate, the proposed residual analysis can also consider web pages with a lower visit rate. The results can be used for website optimization and restructuring, for improving website navigation, and for implementing adaptive websites.
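
The comparison outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The link graph, the page names, the visit list, the damping factor of 0.85 and the use of a plain difference (observed minus expected) as the residual are all assumptions made for illustration; the chapter itself may define the residual differently, for example in a standardised form.

```python
from collections import Counter


def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def observed_probabilities(visits, pages):
    """Share of logged page views that fall on each page."""
    counts = Counter(visits)
    total = sum(counts[p] for p in pages)
    return {p: counts[p] / total for p in pages}


def residuals(observed, expected):
    """Assumed residual: observed minus expected probability per page."""
    return {p: observed[p] - expected[p] for p in expected}


# Hypothetical three-page site and hypothetical page views from a server log.
links = {
    "home": ["courses", "contact"],
    "courses": ["home"],
    "contact": ["home"],
}
visits = ["home", "courses", "home", "courses", "courses",
          "home", "contact", "courses", "home", "courses"]

expected = pagerank(links)
observed = observed_probabilities(visits, expected.keys())
diff = residuals(observed, expected)

for page in sorted(diff, key=diff.get):
    print(f"{page:8s} expected={expected[page]:.3f} "
          f"observed={observed[page]:.3f} residual={diff[page]:+.3f}")
```

Under this reading, a clearly positive residual marks a page that visitors reach more often than the link structure predicts (a candidate underestimated page), and a clearly negative residual marks a candidate overestimated page. In practice the observed probabilities would come from preprocessed log files with identified sessions rather than from a raw list of page views.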

Acknowledgements

This paper is published with the financial support of the Scientific Grant Agency (VEGA), project number VEGA 1/0392/13.

Author information

Authors and Affiliations

Constantine the Philosopher University in Nitra, Tr. A. Hlinku 1, Nitra, 949 74, Slovakia
Jozef Kapusta, Michal Munk & Martin Drlík

Corresponding author

Correspondence to Jozef Kapusta.

Editor information

Editors and Affiliations

Wroclaw University of Technology, Department of Information Systems, Wroclaw, Poland
Ngoc Thanh Nguyen

About this chapter

Cite this chapter

Kapusta, J., Munk, M., Drlík, M. (2015). Identification of Underestimated and Overestimated Web Pages Using PageRank and Web Usage Mining Methods. In: Nguyen, N. (ed.) Transactions on Computational Collective Intelligence XVIII. Lecture Notes in Computer Science, vol 9240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48145-5_7

DOI: https://doi.org/10.1007/978-3-662-48145-5_7
Published: 31 July 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48144-8
Online ISBN: 978-3-662-48145-5
eBook Packages: Computer Science, Computer Science (R0)
