Abstract
Word segmentation, or tokenization, is the process of determining the most likely sequence of words in a stretch of text. For the Thai language, word segmentation is not a trivial task, because words and sentences in Thai are written continuously without spaces or other delimiters. Most word segmentation techniques, especially those based on machine learning, require a training dataset in which the beginning and end of each word have been tagged manually. In this study, an unsupervised machine learning technique that does not require manually tagged data was developed. The technique breaks the input text into syllables and then uses Genetic Algorithms (GA) to merge the syllables back into words. The GA identifies the best segmentation by minimizing word distance, a novel concept developed in this study: the sum of the syllable distances of every pair of syllables within a word. The syllable distance measures how far apart a pair of syllables is in a document. The implementation was done in Python and achieves 70% accuracy (F1 measure) with a training dataset of 100k untagged words. Performance improves further with more training data and some tuning of the GA parameters.
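To make the word-distance objective concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the exact formula for syllable distance, so the definition used here (the mean gap, in syllable positions, from each occurrence of one syllable to the nearest occurrence of the other) is an assumption, as are the function names, the bit-string encoding of a candidate segmentation, and the romanized placeholder syllables. The sketch only scores a candidate segmentation; the GA search itself, and whatever the paper does to keep the minimization from collapsing into single-syllable words, are left out.

```python
from itertools import combinations


def syllable_distance(a, b, corpus):
    """Assumed definition: mean gap (in syllable positions) from each
    occurrence of `a` to the nearest occurrence of `b` in the corpus."""
    pos_a = [i for i, s in enumerate(corpus) if s == a]
    pos_b = [i for i, s in enumerate(corpus) if s == b]
    if not pos_a or not pos_b:
        # Assumption: an unseen syllable is treated as maximally far away.
        return float(len(corpus))
    return sum(min(abs(i - j) for j in pos_b) for i in pos_a) / len(pos_a)


def word_distance(word, corpus):
    """Sum of syllable distances over every pair of syllables in a word."""
    return sum(syllable_distance(a, b, corpus) for a, b in combinations(word, 2))


def decode(syllables, chromosome):
    """Turn a bit string into words: chromosome[i] == 1 places a word
    boundary after syllables[i]. The last syllable always ends a word."""
    words, current = [], []
    for syl, cut in zip(syllables, list(chromosome) + [1]):
        current.append(syl)
        if cut:
            words.append(tuple(current))
            current = []
    return words


def fitness(syllables, chromosome, corpus):
    """Objective a GA would minimize: total word distance of a segmentation."""
    return sum(word_distance(w, corpus) for w in decode(syllables, chromosome))


# Hypothetical untagged training text, already split into syllables.
corpus = ["ka", "ra", "ok", "ka", "ra", "mi"]
sentence = ["ka", "ra", "mi"]

for chromosome in ([0, 1], [1, 0]):
    print(decode(sentence, chromosome), fitness(sentence, chromosome, corpus))
```

In this toy corpus "ka" and "ra" always occur next to each other, so the candidate that merges them into one word scores lower than the candidate that merges "ra" and "mi", which is the kind of preference the objective is meant to express.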
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sunkpho, J., Hofmann, M. (2020). Thai Words Segmentation Using an Unsupervised Learning Technique. In: Meesad, P., Sodsee, S. (eds) Recent Advances in Information and Communication Technology 2020. IC2IT 2020. Advances in Intelligent Systems and Computing, vol 1149. Springer, Cham. https://doi.org/10.1007/978-3-030-44044-2_9
DOI: https://doi.org/10.1007/978-3-030-44044-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44043-5
Online ISBN: 978-3-030-44044-2
eBook Packages: Intelligent Technologies and Robotics (R0)