Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio

  • Conference paper
Advances in Information Retrieval (ECIR 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11438)

Abstract

Web genre identification can boost information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. The open-set scenario is more realistic for this task, as web genres evolve over time and it is not feasible to define a universally agreed genre palette. In this work, we propose a novel approach to web genre identification based on distributional features acquired by doc2vec and a recently proposed open-set classification algorithm, the nearest neighbors distance ratio classifier. We present experimental results using a benchmark corpus and a strong baseline and demonstrate that the proposed approach is highly competitive, especially when emphasis is placed on precision.
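To make the open-set decision rule concrete, the following is a minimal sketch of a nearest neighbors distance ratio test, not the authors' implementation. It assumes that documents are already represented as fixed-length vectors (short tuples stand in for doc2vec embeddings here), that Euclidean distance is used, and that the threshold value 0.8 is arbitrary; the function name `nndr_predict` and the genre labels are hypothetical.

```python
import math

UNKNOWN = "unknown"  # label returned when the document is rejected


def nndr_predict(query, train_vectors, train_labels, threshold=0.8):
    """Classify `query` with a nearest neighbors distance ratio rule.

    t is the nearest training sample; u is the nearest training sample
    whose label differs from that of t. The query receives the label of
    t only if d(query, t) / d(query, u) <= threshold; otherwise it is
    rejected as belonging to an unknown genre.
    """
    # Euclidean distance is an assumption of this sketch.
    dists = [math.dist(query, v) for v in train_vectors]

    # Index of the nearest neighbor t.
    t = min(range(len(dists)), key=dists.__getitem__)

    # Indices of samples that belong to any class other than that of t.
    others = [i for i, lab in enumerate(train_labels) if lab != train_labels[t]]
    if not others:
        raise ValueError("at least two known classes are required")

    # Index of the nearest neighbor u from a different class.
    u = min(others, key=dists.__getitem__)

    # If the query coincides with samples of two classes, it is ambiguous.
    if dists[u] == 0.0:
        return UNKNOWN

    ratio = dists[t] / dists[u]
    return train_labels[t] if ratio <= threshold else UNKNOWN


# Toy two-dimensional "embeddings" for two known genres (illustrative only).
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = ["blog", "blog", "eshop", "eshop"]

near_blog = nndr_predict((0.2, 0.1), vectors, labels)   # close to one genre
in_between = nndr_predict((5.0, 5.5), vectors, labels)  # equally far from both
```

In this toy setup, a query close to one genre is accepted, while a query equidistant from both genres has a ratio of 1.0 and is rejected. Lowering the threshold rejects more documents, which corresponds to favoring precision over recall.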

Notes

  1. https://github.com/dpritsos/html2vec

  2. https://github.com/dpritsos/OpenNNDR

Acknowledgement

Prof. Rocha acknowledges the financial support of FAPESP DéjàVu (Grant #2017/12646-3) and the CAPES DeepEyes Grant.

Corresponding author

Correspondence to Efstathios Stamatatos.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Pritsos, D., Rocha, A., Stamatatos, E. (2019). Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11438. Springer, Cham. https://doi.org/10.1007/978-3-030-15719-7_1

  • DOI: https://doi.org/10.1007/978-3-030-15719-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15718-0

  • Online ISBN: 978-3-030-15719-7

  • eBook Packages: Computer Science (R0)
