A Comparative Study of Pretrained Language Models on Thai Social Text Categorization

  • Thanapapas Horsuwan
  • Kasidis Kanwatchara
  • Peerapon Vateekul
  • Boonserm Kijsirikul
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12033)

Abstract

The ever-growing volume of user-generated content on social media provides a nearly unlimited corpus of unlabeled data, even in languages where annotated resources are scarce. In this paper, we demonstrate that state-of-the-art results on two Thai social text categorization tasks can be achieved by pretraining a language model on a large, noisy Thai social media corpus of over 1.26 billion tokens and then fine-tuning it on the downstream classification tasks. Because the content is linguistically noisy and domain specific, we apply data preprocessing steps designed specifically for Thai social media to make the text easier for the model to learn from. We compare four modern language models: ULMFiT, ELMo with biLSTM, OpenAI GPT, and BERT, evaluating them systematically across several dimensions, including speed of pretraining and fine-tuning, perplexity, downstream classification benchmarks, and performance under limited pretraining data.
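The pretrain-then-fine-tune workflow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pretrained language-model encoder (such as one of the compared models) that maps token ids to contextual hidden states, attaches a fresh classification head for the downstream Thai social text categorization task, and fine-tunes the whole network with a small learning rate. All names, dimensions, and hyperparameters below are illustrative assumptions.

    # Minimal sketch of reusing a pretrained LM encoder for classification
    # (illustrative only; the encoder and data loader are assumed to exist).
    import torch
    import torch.nn as nn

    class LMClassifier(nn.Module):
        def __init__(self, pretrained_encoder: nn.Module, hidden_dim: int, num_classes: int):
            super().__init__()
            self.encoder = pretrained_encoder               # weights from LM pretraining
            self.head = nn.Linear(hidden_dim, num_classes)  # new task-specific head

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            hidden = self.encoder(token_ids)   # assumed shape: (batch, seq_len, hidden_dim)
            pooled = hidden.mean(dim=1)        # simple mean pooling over tokens
            return self.head(pooled)           # class logits

    def fine_tune(model: nn.Module, loader, epochs: int = 3, lr: float = 2e-5) -> None:
        # A small learning rate helps preserve knowledge acquired during pretraining.
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for token_ids, labels in loader:   # labelled downstream batches
                optimizer.zero_grad()
                loss = loss_fn(model(token_ids), labels)
                loss.backward()
                optimizer.step()

In practice, the choice of pooling, head architecture, and fine-tuning schedule differs across ULMFiT, ELMo, GPT, and BERT; the sketch only conveys the shared idea of transferring pretrained weights to a supervised classifier.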

Keywords

Language model · Pretraining · Thai social media · Comparative study · Data preprocessing

Notes

Acknowledgements

The authors would like to thank Mr. Can Udomcharoenchaikit for his continuous and insightful research suggestions throughout the completion of this paper.

References

  1.
  2.
  3.
  4. Aroonmanakun, W.: Thoughts on word and sentence segmentation in Thai (2007)
  5. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  6. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, June 2014
  7. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. arXiv preprint arXiv:1511.01432, November 2015
  8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, October 2018
  9. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, January 2018
  10. Lertpiya, A., et al.: A preliminary study on fundamental Thai NLP tasks for user-generated web content. In: 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pp. 1–8, November 2018. https://doi.org/10.1109/iSAI-NLP.2018.8692946
  11. Merity, S., Shirish Keskar, N., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, August 2017
  12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, January 2013
  13. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
  14. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365, February 2018
  15.
  16. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
  17. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, September 2014
  18. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762, June 2017

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand