Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio

  • Dimitrios Pritsos
  • Anderson Rocha
  • Efstathios Stamatatos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11438)

Abstract

Web genre identification can boost information retrieval systems by providing rich descriptions of documents and enabling more specialized queries. The open-set scenario is more realistic for this task, as web genres evolve over time and it is not feasible to define a universally agreed genre palette. In this work, we propose a novel approach to web genre identification that combines distributional features learned by doc2vec with a recently proposed open-set classification algorithm, the nearest neighbors distance ratio classifier. We present experimental results on a benchmark corpus against a strong baseline and demonstrate that the proposed approach is highly competitive, especially when emphasis is placed on precision.
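To make the open-set decision concrete, the following is a minimal sketch of a nearest neighbors distance ratio rule as it is commonly described: a query is assigned the label of its nearest training sample only if that sample is sufficiently closer than the nearest sample of any other class; otherwise it is rejected as unknown. This is an illustration under stated assumptions, not the authors' implementation. The function name `nndr_predict`, the Euclidean distance, the toy 2-D vectors (standing in for doc2vec embeddings), and the threshold value 0.8 are all hypothetical choices.

```python
import math

UNKNOWN = "unknown"  # hypothetical label for documents of no known genre


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nndr_predict(query, train_vectors, train_labels, threshold=0.8):
    """Open-set prediction with a distance ratio test (illustrative sketch).

    threshold is an assumed value; in practice it would be tuned on
    held-out data.
    """
    # Rank all training samples by distance to the query.
    ranked = sorted(
        (euclidean(query, vec), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    d_nearest, nearest_label = ranked[0]

    # Distance to the closest sample whose label differs from the nearest one.
    d_other = next((d for d, label in ranked if label != nearest_label), None)
    if d_other is None:
        # Only one class in the training set: the ratio is undefined.
        return nearest_label
    if d_other == 0:
        # Two classes coincide with the query: ambiguous, so reject.
        return UNKNOWN

    ratio = d_nearest / d_other
    # A small ratio means the nearest class is clearly closer than any other.
    return nearest_label if ratio <= threshold else UNKNOWN


# Toy data: two genres in a 2-D space.
train_vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_labels = ["blog", "blog", "eshop", "eshop"]

print(nndr_predict((0.1, 0.2), train_vectors, train_labels))
print(nndr_predict((2.5, 2.75), train_vectors, train_labels))
```

In this sketch, lowering the threshold makes the classifier reject more documents as unknown, which trades recall for precision on the known genres.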

Keywords

Web genre identification, Open-set classification, Distributional features

Notes

Acknowledgement

Prof. Rocha thanks FAPESP DéjàVu (Grant #2017/12646-3) and the CAPES DeepEyes Grant for their financial support.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Dimitrios Pritsos (1)
  • Anderson Rocha (2)
  • Efstathios Stamatatos (1, corresponding author)
  1. University of the Aegean, Karlovassi, Samos, Greece
  2. Institute of Computing, University of Campinas (Unicamp), Campinas, Brazil
