A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text

Marukatat, Rangsipan

doi:10.1007/978-3-030-25217-5_7

A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text

Rangsipan Marukatat³

Chapter
First Online: 22 August 2019

496 Accesses
4 Citations

Part of the book series: Studies in Computational Intelligence ((SCI,volume 847))

Abstract

This research compares the effectiveness of using traditional bag-of-words and word-embedding attributes to classify movie comments into spoiler or non-spoiler. Both approaches were applied to comments in English, an inflectional language; and in Thai, a non-inflectional language. Experimental results suggested that in terms of classification performance, word embedding was not clearly better than bag of words. Yet, a decision to choose it over bag of words could be due to its scalability. Between Word2Vec and FastText embeddings, the former was favorable when few out-of-vocabulary (OOV) words were present. Finally, although FastText was expected to be helpful with a large number of OOV words, its benefit was hardly seen for Thai language.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

1.
Some linguistic researches suggested that modern English is drifting towards analyticity [8, 9]. It has lower degree of inflection than Old English and other languages such as German. However, it is still more synthetic than Thai.
2.
https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2.
3.
https://dumps.wikimedia.org/thwiki/latest/thwiki-latest-pages-articles.xml.bz2.

References

Boyd-Graber, J., Glasgow, K., Zajac, S.J.: Spoiler alerts: machine learning approaches to detect social media posts with revelatory information. In: Proceedings of the American Society for Information Science and Technology (ASIST), Montreal, Quebec, Canada, Nov 2013
Google Scholar
Hijikata, Y., Iwai, H., Nishida, S.: Context-based plot detection from online comments for preventing spoilers. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Omaha, NE, USA, pp. 57–65, Oct 2016
Google Scholar
Chang, B., Kim, H., Kim, R., Kim, D., Kang, J.: A deep neural spoiler detection model using a genre-aware attention mechanism. In: Phung, D., et al. (eds.) PAKDD 2018. Lecture Notes in Artificial Intelligence (LNAI), vol. 10937, pp. 183–195 (2018)
Chapter Google Scholar
Iwai, H., Hijikata, Y., Ikeda, K., Nishida, S.: Sentence-based plot classification for online review comment. In: Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technologies, Warsaw, Poland, pp. 245–253, Aug 2014
Google Scholar
Jeon, S., Kim, S., Yu, H.: Spoiler detection in TV program tweets. Inf. Sci. 329, 220–235 (2016)
Article Google Scholar
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Scottsdale, AZ, USA, May 2013
Google Scholar
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (TACL) 5, 135–146 (2017)
Article Google Scholar
Haspelmath, M., Michaelis, S.M.: Analytic and synthetic: typological change in varieties of European languages. In: Buchstaller, I., Siebenhaar, B. (eds.). Language Variation—European Perspective VI (2017)
Google Scholar
Szmrecsanyi, B.: An analytic-synthetic spiral in the history of English. Linguist. Today 227, 93–112 (2016)
Article Google Scholar
Songram, P., Choompol, A., Thipsanthia, P., Boonjing, V.: Detecting Thai messages leading to deception on Facebook. In: Huynh, V.-N., et al. (eds.) IUKM 2016. Lecture Notes in Artificial Intelligence (LNAI), vol. 9978, pp. 293–304 (2016)
Google Scholar
Tuarob, S., Mitrpanont, J.L.: Automatic discovery of abusive Thai language usages in social networks. In: Choemprayong, S., et al. (eds.) ICADL 2017. Lecture Notes in Computer Science (LNCS), vol. 10647, pp. 267–278 (2017)
Chapter Google Scholar
Porter, M.F.: Snowball: a language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html. Accessed 15 May 2019
Singh, J., Gupta, V.: Text stemming: approaches, applications, and challenges. ACM Comput. Surv. 49(3) (article 45) (2016)
Article Google Scholar
Slayden, G.: Overview of Thai language. http://www.thai-language.com/ref/overview. Accessed 15 Apr 2019
Seneewong Na Ayutthaya, T., Pasupa, K.: Thai sentiment analysis via bidirectional LSTM-CNN model with embedding vectors and sentic features. In: Proceedings of the International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Pattaya, Thailand, Nov 2018
Google Scholar
Promrit, N., Waijanya, S.: Convolutional neural networks for Thai poem classification. In: Cong, F., et al. (eds.) ISNN 2017, Part I. Lecture Notes in Computer Science (LNCS), vol 10261, pp. 449–456 (2017)
Chapter Google Scholar
Polpinij, J., Srikanjanapert, N., Sopon, P.: Word2Vec approach for sentiment classification relating to hotel reviews. In: Meesad, P., et al. (eds.) Recent Advances in Information and Communication Technology 2017. Advances in Intelligence Systems and Computing, vol. 556, pp. 308–316 (2017)
Google Scholar
Facebook Open Source: Word vectors for 157 languages. https://fasttext.cc/docs/en/crawl-vectors.html. Accessed 15 Apr 2019
Horn, F.: Context encoder as a simple but powerful extension of Word2Vec. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada, pp. 10–14, Aug 2017
Google Scholar
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China, pp. 1188–1196, June 2014
Google Scholar
Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543, Oct 2014
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering, Mahidol University, Nakhon Pathom, 73170, Thailand
Rangsipan Marukatat

Authors

Rangsipan Marukatat
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Rangsipan Marukatat .

Editor information

Editors and Affiliations

Software Engineering and Information Technology Institute, Central Michigan University, Mount Pleasant, MI, USA
Roger Lee

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Marukatat, R. (2020). A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text. In: Lee, R. (eds) Applied Computing and Information Technology. ACIT 2019. Studies in Computational Intelligence, vol 847. Springer, Cham. https://doi.org/10.1007/978-3-030-25217-5_7

Download citation

DOI: https://doi.org/10.1007/978-3-030-25217-5_7
Published: 22 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-25216-8
Online ISBN: 978-3-030-25217-5
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)

Publish with us

Policies and ethics