Exploiting link structure for web page genre identification

Zhu, Jia; Xie, Qing; Yu, Shoou-I; Wong, Wai Hung

doi:10.1007/s10618-015-0428-8

Published: 07 July 2015

Volume 30, pages 550–575 (2016)

Data Mining and Knowledge Discovery

Abstract

As the World Wide Web grows at an unprecedented pace, web page genre identification has attracted increasing attention because of its importance in web search. A common approach is to extract textual features directly from the web page itself (On-Page features) and feed them into a machine learning classifier. However, this approach may be ineffective when the page contains little textual information (e.g., when it consists mostly of images). In this study, we address genre identification for such pages. We propose a framework that uses On-Page features while also considering information from neighboring pages, that is, pages connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that uses information from the selected neighboring pages together with On-Page features to improve genre identification performance. Experiments on well-known corpora yield favorable results, indicating that the proposed framework is effective, particularly for web pages with limited textual information.
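
The abstract does not say how GenreSim scores neighboring pages or how the classifier outputs are combined, so the following is only an illustrative sketch of the general idea, assuming a simple weighted vote. The function name `combine_votes`, the `on_page_weight` parameter, and the neighbor weights are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def combine_votes(on_page_pred, neighbor_preds, on_page_weight=1.0):
    """Weighted vote over a page's own prediction and its neighbors' predictions.

    on_page_pred: genre label predicted from the page's own (On-Page) features.
    neighbor_preds: list of (genre_label, weight) pairs for the selected
        neighboring pages; the weight is a hypothetical similarity score.
    on_page_weight: hypothetical weight given to the page's own prediction.
    """
    scores = defaultdict(float)
    scores[on_page_pred] += on_page_weight
    for label, weight in neighbor_preds:
        scores[label] += weight
    # Iterate labels in sorted order so ties resolve to the
    # alphabetically first label, which keeps the result deterministic.
    return max(sorted(scores), key=lambda label: scores[label])


if __name__ == "__main__":
    # A page with little text is labeled "blog" by its own classifier,
    # but most of its (hypothetically selected) neighbors look like "shop".
    neighbors = [("shop", 0.9), ("shop", 0.8), ("blog", 0.3)]
    print(combine_votes("blog", neighbors))
```

With no neighbors the function falls back to the On-Page prediction, which mirrors the framing in the abstract: neighboring pages supplement the page's own features rather than replace them.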

Acknowledgments

This work was supported by the Youth Teacher Startup Fund of South China Normal University (No. 14KJ18) and the National High Technology Research and Development Program of China (863, No. 2013AA01A212).

Author information

Authors and Affiliations

School of Computer Science, South China Normal University, Guangzhou, China
Jia Zhu
Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Qing Xie
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Shoou-I Yu
School of Decision Sciences, Hang Seng Management College, Hong Kong, China
Wai Hung Wong

Corresponding author

Correspondence to Jia Zhu.

Additional information

Responsible editor: Thomas Seidl.

About this article

Cite this article

Zhu, J., Xie, Q., Yu, SI. et al. Exploiting link structure for web page genre identification. Data Min Knowl Disc 30, 550–575 (2016). https://doi.org/10.1007/s10618-015-0428-8

Received: 01 December 2013
Accepted: 20 June 2015
Published: 07 July 2015
Issue Date: May 2016
DOI: https://doi.org/10.1007/s10618-015-0428-8
