Abstract
This paper proposes a new hybrid method for Thai word segmentation that integrates Conditional Random Fields (CRFs) with three dictionaries. The method builds on TLex (Thai Lexeme Analyser). First, a pre-processing phase uses a dictionary of unambiguous entries to handle long expressions and long named entities (NEs). The remaining text is then passed to the original CRF-based TLex system to be segmented into words. Next, in a post-processing phase, a second dictionary is used to count the unknown words in each of the top-scoring alternative segmentations produced by TLex, and this count is used to choose the best one. Finally, an NE dictionary is used to merge each named entity that was split during segmentation back into a single word. The results show that this hybrid method improves the precision, recall, and F-measure of TLex from 93.63%, 94.91%, and 94.27% to 97.64%, 97.37%, and 97.50%, respectively, making it the most accurate Thai word segmentation system presently available.
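The three dictionary stages described above can be illustrated with a minimal Python sketch. It is not the TLex+ implementation: the abstract does not specify the matching strategy or the tie-breaking rule, so greedy longest match, "fewest unknown words wins", and "ties go to the higher-scored candidate" are all assumptions here. The function names, the `max_len` parameter, and the toy Latin-script data are hypothetical, and the CRF stage itself is not implemented (its ranked candidate segmentations are supplied by hand). A small helper also checks that the reported F-measures match the harmonic mean of the reported precision and recall.

```python
from typing import List, Sequence, Set, Tuple


def protect_long_entries(text: str, unambiguous: Set[str]) -> List[Tuple[str, bool]]:
    """Pre-processing (assumed greedy longest match): split text into
    (chunk, is_fixed) pieces; fixed chunks bypass the CRF segmenter."""
    entries = sorted(unambiguous, key=len, reverse=True)
    pieces: List[Tuple[str, bool]] = []
    buffer = ""
    i = 0
    while i < len(text):
        match = next((e for e in entries if text.startswith(e, i)), None)
        if match:
            if buffer:
                pieces.append((buffer, False))
                buffer = ""
            pieces.append((match, True))
            i += len(match)
        else:
            buffer += text[i]
            i += 1
    if buffer:
        pieces.append((buffer, False))
    return pieces


def count_unknown(words: Sequence[str], lexicon: Set[str]) -> int:
    """Number of words in a segmentation that are absent from the lexicon."""
    return sum(1 for w in words if w not in lexicon)


def pick_best(candidates: Sequence[Sequence[str]], lexicon: Set[str]) -> List[str]:
    """Post-processing (assumed rule): keep the candidate with the fewest
    unknown words. Candidates are assumed to be ordered best score first;
    min() returns the first minimum, so ties keep the higher-scored one."""
    return list(min(candidates, key=lambda c: count_unknown(c, lexicon)))


def merge_named_entities(words: Sequence[str], ne_dict: Set[str], max_len: int = 5) -> List[str]:
    """Final step: rejoin runs of 2..max_len adjacent words whose
    concatenation is a dictionary named entity, longest run first."""
    out: List[str] = []
    i = 0
    while i < len(words):
        for j in range(min(len(words), i + max_len), i + 1, -1):
            joined = "".join(words[i:j])
            if joined in ne_dict:
                out.append(joined)
                i = j
                break
        else:
            out.append(words[i])
            i += 1
    return out


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy data in unspaced Latin script, standing in for Thai text.
pieces = protect_long_entries("abcXYZdef", {"XYZ"})
candidates = [["new", "york", "ci", "ty"], ["new", "york", "city"]]
best = pick_best(candidates, {"new", "york", "city"})
merged = merge_named_entities(best, {"newyorkcity"})
```

In this toy run the second candidate is selected because it contains no unknown words, and the final step then merges its three words into the single dictionary entity. Applying `f_measure` to the reported precision and recall pairs reproduces the reported F-measures after rounding to two decimals.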
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Kongyoung, S., Rugchatjaroen, A., Kosawat, K. (2018). TLex+: A Hybrid Method Using Conditional Random Fields and Dictionaries for Thai Word Segmentation. In: Theeramunkong, T., Skulimowski, A., Yuizono, T., Kunifuji, S. (eds) Recent Advances and Future Prospects in Knowledge, Information and Creativity Support Systems. KICSS 2015. Advances in Intelligent Systems and Computing, vol 685. Springer, Cham. https://doi.org/10.1007/978-3-319-70019-9_10
DOI: https://doi.org/10.1007/978-3-319-70019-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70018-2
Online ISBN: 978-3-319-70019-9
eBook Packages: Engineering, Engineering (R0)