
Word Embedding for Small and Domain-specific Malay Corpus

  • Conference paper
Computational Science and Technology

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 603)

Abstract

In this paper, we present the process of training a word embedding (WE) model for a small, domain-specific Malay corpus. In this study, the Hansard corpus of the Malaysian Parliament for specific years was trained with the Word2vec model. Because changing any single hyperparameter affects the model's performance, a careful hyperparameter setting is required to obtain an accurate WE model. We therefore trained the corpus into a series of WE models, varying one hyperparameter value per model. The models were evaluated intrinsically using three semantic word relations, namely word similarity, word dissimilarity and word analogy. The evaluation was performed on the model output and analysed by experts (corpus linguists). The experts' evaluation showed that, for a small, domain-specific corpus, the suitable hyperparameters were a window size of 5 or 10, a vector size of 50 to 100 and the Skip-gram architecture.
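The window-size hyperparameter discussed in the abstract controls how many neighbouring words each target word is trained to predict under the Skip-gram architecture. The following is a minimal illustrative sketch (not the authors' code; the example sentence and function name are hypothetical, and a real experiment would instead pass `window`, `vector_size` and `sg=1` to a Word2vec implementation such as gensim's) showing how the choice of window changes the (target, context) training pairs:

```python
def skipgram_pairs(tokens, window):
    """Generate (target, context) training pairs as in Skip-gram:
    each word predicts every word within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Toy Malay sentence (hypothetical; not drawn from the Hansard corpus)
sentence = "menteri menjawab soalan di parlimen".split()

# A larger window yields more, broader-context pairs per word; the paper
# found windows of 5 or 10 suitable for a small domain-specific corpus.
print(skipgram_pairs(sentence, window=2)[:4])
```

Since the corpus is small, a wider window (5 or 10) compensates for sparse co-occurrence statistics by letting each word draw on more context.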




Acknowledgement

This work is supported by Universiti Kebangsaan Malaysia (Research Code: GGPM-2017-025).

Author information


Corresponding author

Correspondence to Sabrina Tiun.



Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tiun, S., Nor, N.F.M., Jalaludin, A., Rahman, A.N.C.A. (2020). Word Embedding for Small and Domain-specific Malay Corpus. In: Alfred, R., Lim, Y., Haviluddin, H., On, C. (eds) Computational Science and Technology. Lecture Notes in Electrical Engineering, vol 603. Springer, Singapore. https://doi.org/10.1007/978-981-15-0058-9_42


  • DOI: https://doi.org/10.1007/978-981-15-0058-9_42

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-0057-2

  • Online ISBN: 978-981-15-0058-9

  • eBook Packages: Engineering, Engineering (R0)
