On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources — A Study on Italian Language

  • Conference paper
Digital Libraries and Multimedia Archives (IRCDL 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 806)

Abstract

Natural Language Processing techniques are of utmost importance for the proper management of Digital Libraries. These techniques rely on language-specific linguistic resources, which may be unavailable for many languages. Since building them manually is costly, time-consuming and error-prone, it would be desirable to learn these resources automatically from sample texts, without any prior knowledge of the language under consideration. In this paper we focus on stopwords, i.e., terms that can be ignored without hindering the understanding of the topic and content of a document. We propose an experimental study of the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques that compile stopword lists from a corpus of documents. The reliability and deficiencies of the stopwords obtained in the experiments are evaluated by comparison with existing linguistic resources. While the study is conducted on texts in Italian, we are confident that the same approach and experimental results may apply to other languages as well.
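To make the general idea concrete, the following is a minimal sketch of a frequency-based stopword learner in Python. It is not the procedure used in the paper: the function names, the `top_k` cutoff, and the `min_doc_ratio` threshold are all illustrative assumptions. The sketch ranks terms by corpus frequency, keeps those that also occur in a large share of the documents, and then compares the result with an existing stopword list.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def candidate_stopwords(documents, top_k=50, min_doc_ratio=0.8):
    """Return frequent terms that also appear in most documents.

    top_k and min_doc_ratio are hypothetical parameters chosen for
    illustration; they are not values taken from the paper.
    """
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for doc in documents:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    frequent = [term for term, _ in term_freq.most_common(top_k)]
    return [t for t in frequent if doc_freq[t] / n_docs >= min_doc_ratio]


def compare_with_reference(candidates, reference):
    """Split terms into shared, learned-only and reference-only sets."""
    cand, ref = set(candidates), set(reference)
    return {
        "shared": cand & ref,
        "only_learned": cand - ref,
        "only_reference": ref - cand,
    }


if __name__ == "__main__":
    docs = [
        "il gatto e il cane",
        "il libro e la penna",
        "la casa e il giardino",
    ]
    learned = candidate_stopwords(docs, top_k=3, min_doc_ratio=1.0)
    print(learned)
    print(compare_with_reference(learned, {"il", "la", "di"}))
```

On this toy corpus only "il" and "e" occur in every document, so they are the only candidates returned. Terms that appear only in the learned list or only in the reference list are the ones a reliability comparison of this kind would flag for inspection.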

Notes

  1. https://www.gutenberg.org/.

  2. https://www.liberliber.it/.

  3. http://snowball.tartarus.org/algorithms/italian/stop.

Author information

Corresponding author

Correspondence to Stefano Ferilli.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Ferilli, S., Esposito, F. (2018). On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources — A Study on Italian Language. In: Serra, G., Tasso, C. (eds) Digital Libraries and Multimedia Archives. IRCDL 2018. Communications in Computer and Information Science, vol 806. Springer, Cham. https://doi.org/10.1007/978-3-319-73165-0_7

  • DOI: https://doi.org/10.1007/978-3-319-73165-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73164-3

  • Online ISBN: 978-3-319-73165-0

  • eBook Packages: Computer Science, Computer Science (R0)
