Abstract
In this paper, we present the process of training a word embedding (WE) model on a small, domain-specific Malay corpus. The Hansard corpus of the Malaysian Parliament for selected years was trained with the Word2vec model. Obtaining an accurate WE model requires a careful setting of the hyperparameters, because changing any one of them affects the model's performance. We therefore trained the corpus into a series of WE models, where each model differed from the others in the value of a single hyperparameter. The models were intrinsically evaluated using three semantic word relations, namely word similarity, dissimilarity and analogy. The evaluation was performed on the model outputs and analysed by experts (corpus linguists). The experts' evaluation showed that, for this small, domain-specific corpus, the most suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100, and the Skip-gram architecture.
Acknowledgement
This work is supported by Universiti Kebangsaan Malaysia (Research Code: GGPM-2017-025).
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Tiun, S., Nor, N.F.M., Jalaludin, A., Rahman, A.N.C.A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In: Alfred, R., Lim, Y., Haviluddin, H., On, C. (eds) Computational Science and Technology. Lecture Notes in Electrical Engineering, vol 603. Springer, Singapore. https://doi.org/10.1007/978-981-15-0058-9_42
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0057-2
Online ISBN: 978-981-15-0058-9