Thai Words Segmentation Using an Unsupervised Learning Technique

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1149)

Abstract

Word segmentation, or tokenization, is the process of determining the most likely sequence of words in a sequence of text. For the Thai language, word segmentation is not a trivial task, as words and sentences in Thai are written continuously without any spaces or delimiters. Most techniques for word segmentation, especially those using machine learning, require a training dataset that has been manually tagged to mark where words begin and end. In this study, an unsupervised machine learning technique that does not require manually tagged data was developed. The technique breaks the input text into syllables and then uses a Genetic Algorithm (GA) to merge the syllables back into words. The GA identifies the best segmentation by minimizing word distance, a novel concept developed in this study: the sum of the syllable distances of every pair of syllables within a word. The syllable distance measures how far apart a pair of syllables is in a document. The implementation was done in Python and achieves 70% accuracy (F1 measure) using a training dataset of 100k untagged words. Performance also improves with more training data and some tuning of the GA parameters.
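
To make the word distance idea concrete, here is a minimal Python sketch. The abstract does not give the exact formula for syllable distance, so the definition used here (the mean gap from each occurrence of one syllable to the nearest occurrence of the other) is an assumption, as are the function names and the romanized toy syllables. It is not the authors' implementation, and the GA search over candidate segmentations is omitted.

```python
from itertools import combinations


def syllable_distance(doc, a, b):
    """Assumed definition: mean gap (in syllable positions) from each
    occurrence of `a` to the nearest occurrence of `b` in the document."""
    pos_a = [i for i, s in enumerate(doc) if s == a]
    pos_b = [i for i, s in enumerate(doc) if s == b]
    if not pos_a or not pos_b:
        # Assumed penalty when a syllable never appears in the document.
        return float(len(doc))
    return sum(min(abs(i - j) for j in pos_b) for i in pos_a) / len(pos_a)


def word_distance(doc, word):
    """Sum of syllable distances over every pair of syllables in `word`,
    following the description in the abstract."""
    return sum(syllable_distance(doc, a, b) for a, b in combinations(word, 2))


# Hypothetical document, already split into syllables (placeholders, not real Thai).
doc = ["ka", "ru", "na", "ma", "ka", "ru", "na", "pai"]

tight = word_distance(doc, ["ka", "ru"])   # syllables always adjacent
loose = word_distance(doc, ["na", "ma"])   # adjacent once, far apart once
print(tight, loose)
```

In this toy document the pair that always occurs together gets the lower score, which is the kind of candidate a minimizing search would prefer. Note that in this sketch a single-syllable word has no pairs and therefore scores 0; how the paper treats that case is not stated in the abstract.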

Author information

Corresponding author

Correspondence to Jirapon Sunkpho.

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sunkpho, J., Hofmann, M. (2020). Thai Words Segmentation Using an Unsupervised Learning Technique. In: Meesad, P., Sodsee, S. (eds) Recent Advances in Information and Communication Technology 2020. IC2IT 2020. Advances in Intelligent Systems and Computing, vol 1149. Springer, Cham. https://doi.org/10.1007/978-3-030-44044-2_9
