Abstract
The main aim of this work is to implement a stochastic Korean word-spacing system that is equally robust for inner data and external data. Word spacing in Korean is influential in determining semantic and syntactic scope. To cope with the various problems caused by word-spacing errors when processing Korean text, this study (a) presents a simple stochastic word-spacing system with only two parameters, using relative word-unigram frequencies and odds favoring the inner-spacing probability of disyllables located at the boundaries of stochastic-based words; (b) endeavors to reduce dependency on the training data by dynamically creating a candidate-word list with the longest-radix-selecting algorithm; and (c) removes noise from the training data by refining the training procedure. The system thus becomes robust against unseen words and offers similar performance on inner and external data: it obtained 98.35% and 97.47% precision in word-unit correction on the inner test data and the external test data, respectively.
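As a rough illustration of the idea in (a), the sketch below segments an unspaced string by combining word-unigram probabilities with a spacing-odds factor for the disyllable that straddles each candidate boundary. It is not the authors' implementation: the function name `segment`, the tables `word_prob` and `spacing_odds`, the `unseen_prob` floor, the `max_len` limit, and the log-linear scoring are all assumptions made for this example, and the paper's two parameters and its longest-radix-selecting step are not reproduced here.

```python
import math


def segment(syllables, word_prob, spacing_odds, unseen_prob=1e-6, max_len=4):
    """Return the highest-scoring word segmentation of an unspaced string.

    word_prob:    hypothetical table of relative word-unigram frequencies.
    spacing_odds: hypothetical table mapping a disyllable to the (positive)
                  odds that a space falls between its two syllables.
    """
    n = len(syllables)
    # best[i] = (best log-score of a segmentation of the first i syllables,
    #            start index of the last word in that segmentation)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = syllables[start:end]
            # Unknown single syllables get a small floor so that some
            # segmentation always exists; unknown longer strings are skipped.
            p = word_prob.get(word, unseen_prob if len(word) == 1 else 0.0)
            if p <= 0.0:
                continue
            score = best[start][0] + math.log(p)
            if start > 0:
                # Disyllable straddling the candidate word boundary.
                pair = syllables[start - 1:start + 1]
                score += math.log(spacing_odds.get(pair, 1.0))
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(syllables[start:end])
        end = start
    return words[::-1]
```

With toy Latin data, `segment("abcd", {"ab": 0.4, "cd": 0.4, "abc": 0.1}, {})` yields `["ab", "cd"]`. The same function applies to Hangul text as long as each syllable is a single precomposed code point (NFC-normalized input), since the slicing works per code point.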
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kang, My., Choi, Sj., Yoon, As., Kwon, Hc. (2005). Korean Stochastic Word-Spacing with Dynamic Expansion of Candidate Words List. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_31
DOI: https://doi.org/10.1007/978-3-540-30211-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science (R0)