Abstract
In this paper, we present the process of training a word embedding (WE) model on a small, domain-specific Malay corpus. The Hansard corpus of the Malaysian Parliament for selected years was trained with the Word2vec model. Obtaining an accurate WE model requires a careful setting of the hyperparameters, because changing any one of them affects the model's performance. We therefore trained the corpus into a series of WE models, where each model differed from the others in the value of a single hyperparameter. The models were intrinsically evaluated using three semantic word relations, namely word similarity, dissimilarity and analogy. The evaluation was performed on the model outputs and analysed by experts (corpus linguists). The experts' evaluation showed that, for this small, domain-specific corpus, the most suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100, and the Skip-gram architecture.
Acknowledgement
This work is supported by Universiti Kebangsaan Malaysia (Research Code: GGPM-2017-025).
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Tiun, S., Nor, N.F.M., Jalaludin, A., Rahman, A.N.C.A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In: Alfred, R., Lim, Y., Haviluddin, H., On, C. (eds) Computational Science and Technology. Lecture Notes in Electrical Engineering, vol 603. Springer, Singapore. https://doi.org/10.1007/978-981-15-0058-9_42
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0057-2
Online ISBN: 978-981-15-0058-9