A Low Dimensionality Representation for Language Variety Identification

Rangel, Francisco; Franco-Salvador, Marc; Rosso, Paolo

doi:10.1007/978-3-319-75487-1_13


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics


Abstract

Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with their specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ~35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality to only 6 features per variety, which also increases the suitability for big data. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.

The work of the first author was carried out in the framework of ECOPORTUNITY IPT-2012-1220-430000. The work of the last two authors was carried out in the framework of the SomEMBED MINECO TIN2015-71147-C2-1-P research project. This work has also been supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAPATER (PrometeoII/2014/030).


Notes

1. http://alt.qcri.org/LT4CloseLang/index.html.
2. http://corporavm.uni-koeln.de/vardial/sharedtask.html.
3. http://ttg.uni-saarland.de/lt4vardial2015/dsl.html.
4. This aspect is important from an evaluation perspective in an author profiling scenario. If texts from the same authors appear in both the training and test sets, their particular style and vocabulary choices may help the classifier learn the profile of those authors at training time. Consequently, over-fitting would bias the results.
5. Our hypothesis is that the distribution of weights for a given document should be closer to the weights of its corresponding language variety. Therefore, we use the most common descriptive statistics to measure this variability among language varieties.
6. Using the LDR, a document is represented by a total number of features equal to 6 multiplied by the number of categories (the 5 language varieties), in our case 30 features. This is a considerable dimensionality reduction that may be helpful in big data environments.
7. The HispaBlogs dataset was collected by experts on social media from the Autoritas Consulting company (http://www.autoritas.net). Autoritas experts in the different countries selected popular bloggers related to politics, online marketing, technology or trends. The HispaBlogs dataset is publicly available at: https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs.
8. We used 300-dimensional vectors, context windows of size 10, and 20 negative words for each sample. We preprocessed the text with lowercasing, tokenization, removal of words of length one, and phrase detection using the word2vec tools: https://code.google.com/p/word2vec/.
9. http://www.cs.waikato.ac.nz/ml/weka/.
10. We used SVM with default parameters and exhaustive correction code to transform the multiclass problem into a binary one.
11. http://ttg.uni-saarland.de/lt4vardial2015/dsl.html.
12. http://pan.webis.de.
13. http://www.clef-innitiative.org.
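The feature count in note 6 (6 statistics × 5 varieties = 30 features) can be illustrated with a minimal sketch. The notes only say that "the most common descriptive statistics" are computed over per-variety term weights, so the six statistics chosen here (mean, standard deviation, minimum, maximum, weight sum per token, and share of tokens with a weight) are an assumption made for illustration. The function name `ldr_features` and the toy weights are likewise hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

VARIETIES = ["Argentina", "Chile", "Mexico", "Peru", "Spain"]
STATS_PER_VARIETY = 6  # as stated in note 6


def ldr_features(tokens, weights):
    """Map a tokenised document to 6 statistics per variety.

    `weights` is a hypothetical mapping {variety: {term: weight}} that is
    assumed to have been learned from training data elsewhere.
    """
    features = []
    for variety in VARIETIES:
        term_weights = weights.get(variety, {})
        # Weights of the document's tokens that are known for this variety.
        w = [term_weights[t] for t in tokens if t in term_weights]
        if w:
            n = len(tokens)
            features += [
                mean(w),      # average weight
                pstdev(w),    # population standard deviation
                min(w),       # minimum weight
                max(w),       # maximum weight
                sum(w) / n,   # total weight per token (assumed statistic)
                len(w) / n,   # share of tokens with a weight (assumed statistic)
            ]
        else:
            features += [0.0] * STATS_PER_VARIETY
    return features


# Toy example with made-up weights.
toy_weights = {
    "Spain": {"vosotros": 0.9, "coche": 0.7},
    "Argentina": {"vos": 0.8},
}
doc = ["vosotros", "teneis", "coche"]
vec = ldr_features(doc, toy_weights)
print(len(vec))  # 30
```

Whatever the vocabulary size, the vector length depends only on the number of varieties, which is the dimensionality reduction that note 6 points to.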


Author information

Authors and Affiliations

Universitat Politècnica de València, Valencia, Spain
Francisco Rangel, Marc Franco-Salvador & Paolo Rosso
Autoritas Consulting, Valencia, Spain
Francisco Rangel

Corresponding author

Correspondence to Francisco Rangel.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Rangel, F., Franco-Salvador, M., Rosso, P. (2018). A Low Dimensionality Representation for Language Variety Identification. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_13

DOI: https://doi.org/10.1007/978-3-319-75487-1_13
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)
