Exploiting link structure for web page genre identification

Data Mining and Knowledge Discovery

Abstract

As the World Wide Web grows at an unprecedented pace, web page genre identification has attracted increasing attention because of its importance in web search. A common approach is to extract textual features directly from the web page itself, that is, On-Page features, and feed them into a machine learning algorithm that performs the classification. However, this approach may be ineffective when the page contains little textual information (e.g., when the page consists mostly of images). In this study, we address genre identification for such pages. We propose a framework that uses On-Page features while also considering information in neighboring pages, that is, the pages connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that uses information from the selected neighboring pages together with On-Page features to improve genre identification performance. Experiments on well-known corpora yield favorable results, indicating that the proposed framework is effective, particularly for web pages with limited textual information.
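The idea of combining an On-Page prediction with predictions from selected neighboring pages can be illustrated with a minimal sketch. This is not the GenreSim model or the combination module from the paper, whose details are not given in the abstract. It is a generic weighted soft vote, and the function name `combine_predictions`, the genre labels, the neighbor weights, and the probabilities are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple

# Maps a genre label to a classifier's estimated probability for that genre.
Probs = Dict[str, float]


def combine_predictions(
    on_page: Probs,
    neighbors: List[Tuple[float, Probs]],
    on_page_weight: float = 1.0,
) -> str:
    """Return the genre with the highest weighted sum of probabilities.

    `neighbors` holds (weight, probabilities) pairs. The weight is a
    hypothetical relevance score for a selected neighboring page; how such
    scores would be produced is NOT specified here (assumption).
    """
    scores: Dict[str, float] = {}
    # Contribution of the classifier trained on On-Page features.
    for genre, p in on_page.items():
        scores[genre] = scores.get(genre, 0.0) + on_page_weight * p
    # Contributions of the neighboring pages, scaled by their weights.
    for weight, probs in neighbors:
        for genre, p in probs.items():
            scores[genre] = scores.get(genre, 0.0) + weight * p
    return max(scores, key=lambda g: scores[g])


# Hypothetical image-heavy page: the On-Page classifier is uncertain.
on_page = {"shop": 0.40, "blog": 0.35, "portal": 0.25}
# Two hypothetical neighboring pages with illustrative weights.
neighbors = [
    (0.8, {"shop": 0.1, "blog": 0.8, "portal": 0.1}),
    (0.5, {"shop": 0.2, "blog": 0.7, "portal": 0.1}),
]

alone = combine_predictions(on_page, [])
combined = combine_predictions(on_page, neighbors)
print(alone, combined)
```

In this made-up example the On-Page prediction alone favors "shop", while adding the two neighbors shifts the decision to "blog". This is the kind of correction that link information may provide when a page carries little text.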

Notes

  1. http://www.uni-weimar.de/en/media/chairs/webis/research/projects/wega/.

  2. http://www.transformersmovie.com/.

  3. http://www.dreamstime.com/.

  4. http://www.youtube.com.

  5. http://www.music.com.

  6. http://ahrefs.com/.

  7. http://www.textfixer.com/resources/common-english-words.txt.

  8. http://ostatic.com/wvtool.

  9. http://htmlparser.sourceforge.net/.

  10. http://www.google.com/help/features.html.

  11. http://www.cs.waikato.ac.nz/ml/weka/.

  12. http://en.wikipedia.org/wiki/T-test.

Acknowledgments

This work was supported by the Youth Teacher Startup Fund of South China Normal University (No. 14KJ18) and the National High Technology Research and Development Program of China (863, No. 2013AA01A212).

Author information

Corresponding author

Correspondence to Jia Zhu.

Additional information

Responsible editor: Thomas Seidl.

About this article

Cite this article

Zhu, J., Xie, Q., Yu, SI. et al. Exploiting link structure for web page genre identification. Data Min Knowl Disc 30, 550–575 (2016). https://doi.org/10.1007/s10618-015-0428-8

  • DOI: https://doi.org/10.1007/s10618-015-0428-8
