Abstract
Natural Language Processing (NLP) is used to identify key information, generate predictive models, and explain global events or trends. It also supports the process of knowledge creation. It is therefore important to apply refinement techniques in major stages such as preprocessing, where data is frequently produced and processed with poor results. This work analyzes and measures the impact of combinations of preprocessing techniques and libraries on short texts written in Spanish. The techniques were applied to tweets for sentiment analysis, and the evaluation considered the processing time and the characteristics of the techniques in each library. The experiments give readers insight for choosing an appropriate combination of techniques during preprocessing. The results show improvements of up to 5–9% in classification performance.
Acknowledgment
This research was supported by the vice-rectorate of investigations of the Universidad del Azuay. We thank our colleagues from the Laboratorio de Investigación y Desarrollo en Informática (LIDI) at Universidad del Azuay, who provided insight and expertise that greatly assisted this work. Part of this research is supported by the project "Design of architectures and interaction models for assisted living environments aimed at older adults" of the XVIII DIUC Call for Research.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Orellana, M., Trujillo, A., Cedillo, P. (2020). A Comparative Evaluation of Preprocessing Techniques for Short Texts in Spanish. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Advances in Information and Communication. FICC 2020. Advances in Intelligent Systems and Computing, vol 1130. Springer, Cham. https://doi.org/10.1007/978-3-030-39442-4_10
DOI: https://doi.org/10.1007/978-3-030-39442-4_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39441-7
Online ISBN: 978-3-030-39442-4
eBook Packages: Intelligent Technologies and Robotics (R0)