Abstract
Natural Language Processing techniques are of utmost importance for the proper management of Digital Libraries. These techniques rely on language-specific linguistic resources, which may be unavailable for many languages. Since building such resources manually is costly, time-consuming and error-prone, it would be desirable to learn them automatically from sample texts, without any prior knowledge about the language under consideration. In this paper we focus on stopwords, i.e., terms that can be ignored without hindering the understanding of the topic and content of a document. We present an experimental study on the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques for compiling stopword lists from a corpus of documents. The reliability and deficiencies of the stopwords obtained in the experiments are evaluated by comparison with existing linguistic resources. While the study is conducted on texts in Italian, we are confident that the same approach and experimental results may apply to other languages as well.
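As an illustration of the general idea only, the following minimal sketch ranks the terms of a small corpus by raw frequency, takes the most frequent ones as stopword candidates, and compares them with a reference list. It is not the procedure used in the study: the function names `candidate_stopwords` and `compare_to_reference`, the cutoff `top_k`, the toy corpus and the toy reference list are all assumptions introduced here for the example.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_k):
    """Return the top_k most frequent terms as stopword candidates.

    Hypothetical helper: raw corpus frequency is only one possible
    criterion, chosen here for simplicity.
    """
    counts = Counter()
    for text in documents:
        # Lowercase and keep runs of letters only (accented letters included).
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    # Sort by descending frequency, breaking ties alphabetically
    # so that the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


def compare_to_reference(candidates, reference):
    """Compare candidates with an existing stopword list.

    Returns precision, recall, the candidates missing from the
    reference, and the reference entries that were not found.
    """
    cand, ref = set(candidates), set(reference)
    hits = cand & ref
    precision = len(hits) / len(cand) if cand else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall, sorted(cand - ref), sorted(ref - cand)


# Toy Italian corpus and toy reference list (both invented for this sketch).
corpus = [
    "il gatto e il cane",
    "il libro di Maria e di Luca",
    "la casa di Maria",
]
reference_list = ["il", "di", "e", "la"]

candidates = candidate_stopwords(corpus, top_k=4)
precision, recall, extra, missing = compare_to_reference(candidates, reference_list)
print(candidates, precision, recall, extra, missing)
```

On this toy input the frequent content word "maria" enters the candidate list while the genuine stopword "la" is missed, which hints at why a pure frequency cutoff needs the kind of careful evaluation against existing resources described above.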
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Ferilli, S., Esposito, F. (2018). On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources — A Study on Italian Language. In: Serra, G., Tasso, C. (eds) Digital Libraries and Multimedia Archives. IRCDL 2018. Communications in Computer and Information Science, vol 806. Springer, Cham. https://doi.org/10.1007/978-3-319-73165-0_7
DOI: https://doi.org/10.1007/978-3-319-73165-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73164-3
Online ISBN: 978-3-319-73165-0
eBook Packages: Computer Science, Computer Science (R0)