A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text

Part of the book series: Studies in Computational Intelligence (SCI, volume 847)

Abstract

This research compares the effectiveness of traditional bag-of-words attributes and word-embedding attributes for classifying movie comments as spoiler or non-spoiler. Both approaches were applied to comments in English, an inflectional language, and in Thai, a non-inflectional language. Experimental results suggested that, in terms of classification performance, word embedding was not clearly better than bag of words. Yet word embedding might still be chosen over bag of words for its scalability. Between Word2Vec and FastText embeddings, the former was favorable when few out-of-vocabulary (OOV) words were present. Finally, although FastText was expected to help when OOV words were numerous, its benefit was hardly seen for the Thai language.
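
The three attribute types contrasted in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the toy vectors, and the averaging scheme are assumptions made for illustration. The character n-gram range of 3 to 6 with `<` and `>` boundary markers follows the FastText paper [7]; real FastText also hashes n-grams into buckets and sums rather than averages them, which is omitted here.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count vector: one attribute per vocabulary word, so its length grows with the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def average_vectors(vectors, dim):
    """Component-wise mean of a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def word_level_features(tokens, word_vectors, dim):
    """Word2Vec-style attributes (assumed scheme): average the vectors of known words.

    Out-of-vocabulary (OOV) words have no vector and are skipped.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return average_vectors(known, dim)


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, as in FastText."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def subword_features(tokens, ngram_vectors, dim):
    """FastText-style attributes (simplified): build each word from its known n-grams.

    An OOV word still gets a vector as long as some of its n-grams were seen.
    """
    word_vecs = []
    for token in tokens:
        grams = [ngram_vectors[g] for g in char_ngrams(token) if g in ngram_vectors]
        if grams:
            word_vecs.append(average_vectors(grams, dim))
    return average_vectors(word_vecs, dim)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the chapter.
    vocabulary = ["the", "hero", "dies", "movie"]
    comment = ["the", "hero", "dies", "the"]
    print(bag_of_words(comment, vocabulary))

    word_vectors = {"hero": [1.0, 0.0], "dies": [0.0, 1.0]}
    print(word_level_features(comment, word_vectors, dim=2))

    ngram_vectors = {"<di": [1.0, 0.0], "ie>": [0.0, 1.0]}
    print(subword_features(["die"], ngram_vectors, dim=2))
```

The bag-of-words vector has one entry per vocabulary word, whereas both embedding variants return a vector of fixed length `dim` regardless of vocabulary size, which is one way to read the scalability argument. The last call shows how a word absent from a word-level vocabulary can still receive a subword-based vector.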

Notes

  1. Some linguistic research suggests that modern English is drifting towards analyticity [8, 9]. It has a lower degree of inflection than Old English and other languages such as German. However, it is still more synthetic than Thai.

  2. https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

  3. https://dumps.wikimedia.org/thwiki/latest/thwiki-latest-pages-articles.xml.bz2

References

  1. Boyd-Graber, J., Glasgow, K., Zajac, S.J.: Spoiler alerts: machine learning approaches to detect social media posts with revelatory information. In: Proceedings of the American Society for Information Science and Technology (ASIST), Montreal, Quebec, Canada, Nov 2013

  2. Hijikata, Y., Iwai, H., Nishida, S.: Context-based plot detection from online comments for preventing spoilers. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Omaha, NE, USA, pp. 57–65, Oct 2016

  3. Chang, B., Kim, H., Kim, R., Kim, D., Kang, J.: A deep neural spoiler detection model using a genre-aware attention mechanism. In: Phung, D., et al. (eds.) PAKDD 2018. Lecture Notes in Artificial Intelligence (LNAI), vol. 10937, pp. 183–195 (2018)

  4. Iwai, H., Hijikata, Y., Ikeda, K., Nishida, S.: Sentence-based plot classification for online review comment. In: Proceedings of the IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technologies, Warsaw, Poland, pp. 245–253, Aug 2014

  5. Jeon, S., Kim, S., Yu, H.: Spoiler detection in TV program tweets. Inf. Sci. 329, 220–235 (2016)

  6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Scottsdale, AZ, USA, May 2013

  7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (TACL) 5, 135–146 (2017)

  8. Haspelmath, M., Michaelis, S.M.: Analytic and synthetic: typological change in varieties of European languages. In: Buchstaller, I., Siebenhaar, B. (eds.). Language Variation—European Perspective VI (2017)

  9. Szmrecsanyi, B.: An analytic-synthetic spiral in the history of English. Linguist. Today 227, 93–112 (2016)

  10. Songram, P., Choompol, A., Thipsanthia, P., Boonjing, V.: Detecting Thai messages leading to deception on Facebook. In: Huynh, V.-N., et al. (eds.) IUKM 2016. Lecture Notes in Artificial Intelligence (LNAI), vol. 9978, pp. 293–304 (2016)

  11. Tuarob, S., Mitrpanont, J.L.: Automatic discovery of abusive Thai language usages in social networks. In: Choemprayong, S., et al. (eds.) ICADL 2017. Lecture Notes in Computer Science (LNCS), vol. 10647, pp. 267–278 (2017)

  12. Porter, M.F.: Snowball: a language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html. Accessed 15 May 2019

  13. Singh, J., Gupta, V.: Text stemming: approaches, applications, and challenges. ACM Comput. Surv. 49(3) (article 45) (2016)

  14. Slayden, G.: Overview of Thai language. http://www.thai-language.com/ref/overview. Accessed 15 Apr 2019

  15. Seneewong Na Ayutthaya, T., Pasupa, K.: Thai sentiment analysis via bidirectional LSTM-CNN model with embedding vectors and sentic features. In: Proceedings of the International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Pattaya, Thailand, Nov 2018

  16. Promrit, N., Waijanya, S.: Convolutional neural networks for Thai poem classification. In: Cong, F., et al. (eds.) ISNN 2017, Part I. Lecture Notes in Computer Science (LNCS), vol. 10261, pp. 449–456 (2017)

  17. Polpinij, J., Srikanjanapert, N., Sopon, P.: Word2Vec approach for sentiment classification relating to hotel reviews. In: Meesad, P., et al. (eds.) Recent Advances in Information and Communication Technology 2017. Advances in Intelligent Systems and Computing, vol. 556, pp. 308–316 (2017)

  18. Facebook Open Source: Word vectors for 157 languages. https://fasttext.cc/docs/en/crawl-vectors.html. Accessed 15 Apr 2019

  19. Horn, F.: Context encoder as a simple but powerful extension of Word2Vec. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada, pp. 10–14, Aug 2017

  20. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning, Beijing, China, pp. 1188–1196, June 2014

  21. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543, Oct 2014

Author information

Correspondence to Rangsipan Marukatat.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Marukatat, R. (2020). A Comparative Study of Using Bag-of-Words and Word-Embedding Attributes in the Spoiler Classification of English and Thai Text. In: Lee, R. (eds) Applied Computing and Information Technology. ACIT 2019. Studies in Computational Intelligence, vol 847. Springer, Cham. https://doi.org/10.1007/978-3-030-25217-5_7
