Abstract
This research compares the effectiveness of traditional bag-of-words and word-embedding attributes for classifying movie comments as spoiler or non-spoiler. Both approaches were applied to comments in English, an inflectional language, and in Thai, a non-inflectional language. Experimental results suggested that, in terms of classification performance, word embedding was not clearly better than bag of words. Even so, word embedding might still be chosen over bag of words for its scalability. Between Word2Vec and FastText embeddings, the former was preferable when few out-of-vocabulary (OOV) words were present. Finally, although FastText was expected to help when many OOV words were present, its benefit was hardly seen for the Thai language.
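To make the contrast between the two attribute types concrete, the following is a minimal sketch rather than the chapter's actual pipeline. It builds a bag-of-words count vector and an averaged word-embedding vector for one tokenized comment, with an optional FastText-style fallback that composes a vector for an OOV word from character n-grams. All names, the two-dimensional toy vectors, and the sample tokens are hypothetical, and the fallback is simplified: real FastText composes every word from subword vectors, not only OOV words.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count vector with one position per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of the word padded with boundary markers < and >."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def average(vectors, dim):
    """Element-wise mean; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def embed_comment(tokens, word_vectors, ngram_vectors=None, dim=2):
    """Average the vectors of all tokens that can be represented.

    Without ngram_vectors, OOV tokens are skipped (Word2Vec-style lookup).
    With ngram_vectors, an OOV token is approximated by the mean of its
    known character n-gram vectors (simplified FastText-style fallback).
    """
    vectors = []
    for token in tokens:
        if token in word_vectors:
            vectors.append(word_vectors[token])
        elif ngram_vectors is not None:
            parts = [ngram_vectors[g] for g in char_ngrams(token) if g in ngram_vectors]
            if parts:
                vectors.append(average(parts, dim))
    return average(vectors, dim)


# Hypothetical toy data; "twist" is out of vocabulary for both models.
vocabulary = ["ending", "dies", "great", "movie"]
tokens = ["great", "movie", "great", "twist"]
word_vectors = {"great": [1.0, 0.0], "movie": [0.0, 1.0]}
ngram_vectors = {"<tw": [0.5, 0.5], "ist": [1.5, 0.5]}

bow = bag_of_words(tokens, vocabulary)
w2v_style = embed_comment(tokens, word_vectors)
fasttext_style = embed_comment(tokens, word_vectors, ngram_vectors)
```

In this sketch the bag-of-words vector has one entry per vocabulary word, so it grows with the vocabulary, while both embedding vectors keep a fixed length. That is one plausible reading of the scalability argument, not a claim taken from the chapter. The OOV token contributes nothing to the bag-of-words and Word2Vec-style vectors but shifts the FastText-style average.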
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Marukatat, R. (2020). A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text. In: Lee, R. (eds) Applied Computing and Information Technology. ACIT 2019. Studies in Computational Intelligence, vol 847. Springer, Cham. https://doi.org/10.1007/978-3-030-25217-5_7
DOI: https://doi.org/10.1007/978-3-030-25217-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-25216-8
Online ISBN: 978-3-030-25217-5
eBook Packages: Intelligent Technologies and Robotics (R0)