Abstract
Technological advances have allowed fields such as Corpus Linguistics and Computational Linguistics to progress rapidly. Corpora, an essential tool in these fields, have nevertheless evolved mainly in size. The Google Ngram project illustrates this: it has digitized a vast number of books from 1505 to the present day, allowing studies to be carried out on the resulting corpora. Meanwhile, the continuous evolution of new communication media and social networks has given rise to a new genre, called cyber-language, situated between orality and textuality, for which there are no specialized corpora. We propose to design a tool for building a large multidimensional corpus from the social network Twitter, together with a set of specific tools to generate subcorpora, conduct quantitative studies and visualize the stored information, from the perspective of big data manipulation.
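As a rough illustration of what generating a subcorpus and running a simple quantitative study could look like, the following minimal Python sketch filters tweet-like records by metadata and counts word n-grams. It is not the tool described in the paper: the `Tweet` record, its fields, the helper names `subcorpus` and `ngram_counts`, and the sample data are all hypothetical, and tokenization is reduced to lowercasing and whitespace splitting.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record layout; these fields are assumptions,
    # not the schema used by the paper's corpus.
    user: str
    lang: str
    year: int
    text: str


def subcorpus(corpus, **criteria):
    """Return the tweets whose metadata fields equal every given criterion."""
    return [
        t for t in corpus
        if all(getattr(t, key) == value for key, value in criteria.items())
    ]


def ngram_counts(tweets, n=2):
    """Count word n-grams over lowercased, whitespace-split tweet texts."""
    counts = Counter()
    for t in tweets:
        tokens = t.text.lower().split()
        # zip over n shifted views of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Invented sample data, for illustration only.
corpus = [
    Tweet("a", "en", 2018, "big data is big"),
    Tweet("b", "es", 2018, "datos masivos"),
    Tweet("c", "en", 2019, "big data again"),
]

english = subcorpus(corpus, lang="en")
bigrams = ngram_counts(english, n=2)
print(bigrams.most_common(3))
```

Each metadata field acts as one dimension along which the corpus can be sliced, so combining criteria (for example `lang="en", year=2019`) narrows the subcorpus further.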
Notes
1. More information on https://www.ibm.com/internet-of-things.
2. An API is a set of commands, functions, protocols, and objects that programmers can use to create software or interact with an external system. It provides developers with standard commands for performing common operations, so they do not have to write the code from scratch.
3. More information on https://developer.twitter.com/en/docs/api-reference-index.
4. More information on https://www.ibm.com/analytics/hadoop/mapreduce.
5. More information on http://www.nltk.org/.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Maroto Conde, Á.L., Bermúdez Vázquez, M. (2019). MBLA Social Corpus. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science(), vol 11755. Springer, Cham. https://doi.org/10.1007/978-3-030-30135-4_21
DOI: https://doi.org/10.1007/978-3-030-30135-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30134-7
Online ISBN: 978-3-030-30135-4
eBook Packages: Computer Science, Computer Science (R0)