A Hybrid Approach to Automatic Word-spacing in Korean

  • Conference paper
Innovations in Applied Artificial Intelligence (IEA/AIE 2004)

Abstract

This paper proposes a hybrid automatic word-spacing system for Korean that combines stochastic and knowledge-based approaches. The system determines the optimal splitting points of an input sentence using two simple parameters: (a) relative word frequency and (b) syllable n-gram statistics, both extracted from large processed corpora containing 33,643,884 word tokens. This method efficiently handles possible data noise, by using processed training data, and data sparseness, by using syllable n-gram statistics and large corpora. The problem of unseen words remains, however, and can hardly be overcome even with a huge corpus. The study therefore supplements the stochastic approach by (a) dynamically expanding the candidate words through longest-radix selection among possible morphemes and (b) treating major and minor lexical categories differently. The combined model remedies the drawbacks of the stochastic word-spacing algorithm and shows encouraging results: it obtained 97.51% precision in word-unit correction on external test data.
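To make the idea of choosing splitting points from relative word frequency concrete, here is a minimal sketch. It is not the authors' system: it covers only the word-frequency parameter, leaves out syllable n-gram statistics and the knowledge-based component, and handles unseen material with a crude single-syllable fallback. The function name `segment`, the `unseen_prob` parameter, and the toy counts are all assumptions made for illustration.

```python
import math

def segment(text, word_freq, total, unseen_prob=1e-6):
    """Split unspaced text into the word sequence with the highest
    product of relative word frequencies (dynamic programming)."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    max_len = max(map(len, word_freq), default=1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_freq:
                logp = math.log(word_freq[piece] / total)
            elif len(piece) == 1:
                # Assumed fallback: an unseen single syllable gets a tiny probability.
                logp = math.log(unseen_prob)
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]

# Toy counts, invented for this example only.
counts = {"아버지가": 5, "방에": 8, "들어가신다": 3, "아버지": 10, "가방에": 2}
result = segment("아버지가방에들어가신다", counts, sum(counts.values()))
print(" ".join(result))  # 아버지가 방에 들어가신다
```

With these toy counts the reading "아버지가 방에 들어가신다" scores higher than "아버지 가방에 들어가신다" because 5 × 8 × 3 exceeds 10 × 2 × 3. A word missing from the counts can only be covered syllable by syllable here, which is the unseen-word weakness the abstract says the knowledge-based component is meant to address.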

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kang, My., Choi, Sw., Kwon, Hc. (2004). A Hybrid Approach to Automatic Word-spacing in Korean. In: Orchard, B., Yang, C., Ali, M. (eds) Innovations in Applied Artificial Intelligence. IEA/AIE 2004. Lecture Notes in Computer Science, vol 3029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24677-0_30

  • DOI: https://doi.org/10.1007/978-3-540-24677-0_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22007-7

  • Online ISBN: 978-3-540-24677-0

  • eBook Packages: Springer Book Archive
