Thai Words Segmentation Using an Unsupervised Learning Technique

  • Jirapon Sunkpho
  • Markus Hofmann
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1149)


Word segmentation, or tokenization, is the process of determining the most likely sequence of words in a sequence of text. For the Thai language, word segmentation is not a trivial task, because words and sentences in Thai are written continuously without spaces or delimiters. Most word segmentation techniques, especially those based on machine learning, require a training dataset that has been manually tagged to mark where words begin and end. In this study, an unsupervised machine learning technique that does not require manually tagged data was developed. The technique breaks the input text into syllables and then uses Genetic Algorithms (GA) to merge the syllables back into words. The GA identifies the best segmentation by minimizing word distance, a novel concept developed in this study. Word distance is the sum of the syllable distances of every pair of syllables within a word, where syllable distance measures how far apart a pair of syllables is in a document. The implementation was done in Python and achieves 70% accuracy (F1 measure) with a training dataset of 100k untagged words. Performance improves further with more training data and some tuning of the GA parameters.
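The distance measures described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact formula for syllable distance, so the sketch assumes it is the smallest gap between any two occurrences of the pair in the training syllable sequence. The function names, the merge-bit encoding of a candidate segmentation, and the placeholder syllables are all hypothetical. The abstract also does not say how single-syllable words are scored, so no GA fitness function is shown.

```python
from itertools import combinations


def syllable_positions(syllables):
    """Map each syllable to the list of positions where it occurs."""
    positions = {}
    for index, syllable in enumerate(syllables):
        positions.setdefault(syllable, []).append(index)
    return positions


def syllable_distance(a, b, positions):
    """ASSUMED definition: smallest gap between an occurrence of a and one of b.

    Returns infinity when the pair never co-occurs in the training text.
    """
    gaps = (
        abs(i - j)
        for i in positions.get(a, [])
        for j in positions.get(b, [])
        if i != j
    )
    return min(gaps, default=float("inf"))


def word_distance(word, positions):
    """Sum of syllable distances over every pair of syllables in a word."""
    return sum(syllable_distance(a, b, positions) for a, b in combinations(word, 2))


def decode(syllables, merge_bits):
    """Turn a hypothetical chromosome into words.

    merge_bits[i] == 1 means syllable i + 1 is merged into the same word
    as syllable i; 0 starts a new word.
    """
    words = [[syllables[0]]]
    for syllable, bit in zip(syllables[1:], merge_bits):
        if bit:
            words[-1].append(syllable)
        else:
            words.append([syllable])
    return words


# Placeholder "syllables" standing in for an untagged training text.
training = ["a", "b", "c", "a", "b", "d", "c"]
positions = syllable_positions(training)

candidate = decode(["a", "b", "c", "d"], [1, 0, 1])
print(candidate)                                       # [['a', 'b'], ['c', 'd']]
print(word_distance(["a", "b", "c"], positions))       # 3
print([word_distance(w, positions) for w in candidate])  # [1, 1]
```

Under these assumptions, a GA would search over the merge bits and prefer candidates whose words have small word distances, that is, words built from syllables that tend to occur close together in the training text.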


Word segmentation · Unsupervised learning · Genetic Algorithms



Copyright information

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Thammasat University, Bangkok, Thailand
  2. Technological University Dublin, Blanchardstown, Ireland
