Thai Word Segmentation with Hidden Markov Model and Decision Tree

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Abstract

Written Thai is one of the languages that do not mark word boundaries. To recover the meaning of a document, its text must first be separated into syllables, words, sentences, and paragraphs. This paper develops a novel method for segmenting Thai text that combines a non-dictionary-based technique with a dictionary-based technique. The method first applies Thai grammar rules to the text to identify syllables. A hidden Markov model is then used to merge candidate syllables into words. The identified words are verified against a lexical dictionary, and a decision tree is employed to discover the words that the dictionary does not identify. Experiments use documents from the litigation process of Thai court proceedings. The word segmentation obtained by the proposed method outperforms that obtained by other existing methods.
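The merge-and-verify steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the syllables are already identified, tags each syllable as B (begins a word) or I (continues the current word) with Viterbi decoding over a two-state hidden Markov model, and then checks the merged words against a small lexicon. All probabilities, the `LEXICON` set, and the function names are hypothetical values chosen for illustration, and the decision-tree step is represented only by the list of words left unverified.

```python
import math

STATES = ("B", "I")  # B: syllable begins a word, I: syllable continues a word

# Toy parameters invented for this example (not taken from the paper).
START = {"B": 1.0, "I": 0.0}
TRANS = {"B": {"B": 0.5, "I": 0.5}, "I": {"B": 0.7, "I": 0.3}}
EMIT = {
    "B": {"ภา": 0.5, "ไทย": 0.4, "ษา": 0.1},
    "I": {"ษา": 0.8, "ภา": 0.1, "ไทย": 0.1},
}
UNSEEN = 1e-6  # fallback emission probability for syllables not in EMIT
LEXICON = {"ภาษา", "ไทย"}  # hypothetical stand-in for a lexical dictionary


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(syllables):
    """Return the most likely B/I tag sequence for the syllables."""
    if not syllables:
        return []
    # Each row maps state -> (best log-probability, previous state).
    rows = [{s: (_log(START[s]) + _log(EMIT[s].get(syllables[0], UNSEEN)), None)
             for s in STATES}]
    for syl in syllables[1:]:
        last = rows[-1]
        row = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: last[p][0] + _log(TRANS[p][s]))
            score = (last[prev][0] + _log(TRANS[prev][s])
                     + _log(EMIT[s].get(syl, UNSEEN)))
            row[s] = (score, prev)
        rows.append(row)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: rows[-1][s][0])
    tags = [state]
    for row in reversed(rows[1:]):
        state = row[state][1]
        tags.append(state)
    return tags[::-1]


def merge(syllables, tags):
    """Join syllables into words, starting a new word at every B tag."""
    words = []
    for syl, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append(syl)
        else:
            words[-1] += syl
    return words


def unverified(words):
    """Words missing from the lexicon; the paper passes such cases to a decision tree."""
    return [w for w in words if w not in LEXICON]


syllables = ["ภา", "ษา", "ไทย"]
tags = viterbi(syllables)
words = merge(syllables, tags)
print(tags, words, unverified(words))
```

With these toy parameters the three syllables are grouped into the two words ภาษา and ไทย, both of which are found in the toy lexicon. In a real system the transition and emission probabilities would be estimated from a segmented corpus rather than set by hand.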

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bheganan, P., Nayak, R., Xu, Y. (2009). Thai Word Segmentation with Hidden Markov Model and Decision Tree. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_10

  • DOI: https://doi.org/10.1007/978-3-642-01307-2_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01306-5

  • Online ISBN: 978-3-642-01307-2

  • eBook Packages: Computer Science, Computer Science (R0)
