On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources — A Study on Italian Language

Ferilli, Stefano; Esposito, Floriana

doi:10.1007/978-3-319-73165-0_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 806)

Included in the following conference series:

Italian Research Conference on Digital Libraries

Abstract

Natural Language Processing techniques are of utmost importance for the proper management of Digital Libraries. These techniques rely on language-specific linguistic resources, which may be unavailable for many languages. Since building them manually is costly, time-consuming and error-prone, it would be desirable to learn these resources automatically from sample texts, without any prior knowledge about the language under consideration. In this paper we focus on stopwords, i.e., terms that can be ignored without hindering the understanding of the topic and content of a document. We propose an experimental study on the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques for the compilation of stopword lists from a corpus of documents. The reliability and/or deficiencies of the stopwords obtained from the experiments are evaluated by comparison to existing linguistic resources. While the study is conducted on texts in Italian, we are confident that the same approach and experimental results may apply to other languages as well.
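
As a rough illustration of the general idea summarized above, the following Python sketch ranks the terms of a corpus by raw frequency, treats the most frequent ones as stopword candidates, and compares them to a reference list. It is not the authors' procedure: the tokenization rule, the `top_k` cutoff, the function names `candidate_stopwords` and `compare_to_reference`, and the precision/recall comparison are all assumptions made only for this example.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_k=10):
    """Return the top_k most frequent terms in the corpus as candidates.

    Assumption: raw corpus frequency is the ranking criterion; the study
    itself may use different or additional frequency measures.
    """
    counts = Counter()
    for text in documents:
        # Lowercase, then keep runs of Unicode letters, so accented
        # Italian characters remain inside tokens.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return [term for term, _ in counts.most_common(top_k)]


def compare_to_reference(candidates, reference):
    """Precision and recall of the candidates against a reference list."""
    cand, ref = set(candidates), set(reference)
    hits = cand & ref
    precision = len(hits) / len(cand) if cand else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = ["il gatto e il cane", "il cane e la casa"]
    learned = candidate_stopwords(docs, top_k=3)
    print(learned, compare_to_reference(learned, ["il", "e", "la"]))
```

In a sketch like this, content words that dominate a small or single-topic corpus can rank as highly as true stopwords, which is one reason a comparison against existing resources is informative.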

Author information

Authors and Affiliations

University of Bari, Bari, Italy
Stefano Ferilli & Floriana Esposito

Corresponding author

Correspondence to Stefano Ferilli.

Editor information

Editors and Affiliations

University of Udine, Udine, Italy
Giuseppe Serra
University of Udine, Udine, Italy
Carlo Tasso

About this paper

Cite this paper

Ferilli, S., Esposito, F. (2018). On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources — A Study on Italian Language. In: Serra, G., Tasso, C. (eds) Digital Libraries and Multimedia Archives. IRCDL 2018. Communications in Computer and Information Science, vol 806. Springer, Cham. https://doi.org/10.1007/978-3-319-73165-0_7

Published: 21 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73164-3
Online ISBN: 978-3-319-73165-0
eBook Packages: Computer Science, Computer Science (R0)