On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources — A Study on Italian Language

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 806)


Natural Language Processing techniques are of utmost importance for the proper management of Digital Libraries. These techniques rely on language-specific linguistic resources, which may be unavailable for many languages. Since building them manually is costly, time-consuming and error-prone, it would be desirable to learn these resources automatically from sample texts, without any prior knowledge about the language under consideration. In this paper we focus on stopwords, i.e., terms that can be ignored without hindering the understanding of the topic and content of a document. We present an experimental study on the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques for compiling stopword lists from a corpus of documents. The reliability and deficiencies of the stopwords obtained from the experiments are evaluated by comparison to existing linguistic resources. While the study is conducted on texts in Italian, we are confident that the same approach and experimental results may apply to other languages as well.
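To make the general idea concrete, the following is a minimal sketch of a frequency-based procedure for proposing stopword candidates from a corpus and comparing them with an existing list. It is an illustration under stated assumptions, not the method evaluated in the paper: the function names, the `top_k` and `min_doc_ratio` parameters, the toy Italian sentences and the small reference list are all hypothetical.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_k=10, min_doc_ratio=0.5):
    """Return up to top_k terms ranked by total frequency.

    Hypothetical heuristic: a term is kept only if it occurs in at least
    min_doc_ratio of the documents, so that terms frequent in a single
    document are not mistaken for stopwords.
    """
    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the term
    for text in documents:
        # Lowercase and keep runs of letters only (accented letters included).
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    ranked = [
        term
        for term, _ in term_freq.most_common()
        if doc_freq[term] / n_docs >= min_doc_ratio
    ]
    return ranked[:top_k]


def compare_with_reference(candidates, reference):
    """Precision and recall of the candidates against a reference list."""
    cand, ref = set(candidates), set(reference)
    hits = cand & ref
    precision = len(hits) / len(cand) if cand else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


# Toy corpus and reference list, invented for illustration only.
corpus = [
    "il gatto e il cane sono nel giardino",
    "il libro di storia è sul tavolo e la penna è nel cassetto",
    "la biblioteca digitale e il catalogo di libri",
]
reference = ["il", "e", "di", "la", "nel"]

learned = candidate_stopwords(corpus, top_k=5)
precision, recall = compare_with_reference(learned, reference)
print(learned, precision, recall)
```

Combining total frequency with document frequency is only one possible design choice. On a real corpus the thresholds would need tuning, and disagreements with the reference list may point to deficiencies in either the learned list or the existing resource.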


Keywords: Natural Language Processing, Linguistic resources, Stopwords, Keyword extraction



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. University of Bari, Bari, Italy
