
Abstract

This paper introduces a domain-adapted word segmentation approach for text in which word delimiters are not used consistently. The approach relies on an unknown-word extraction technique. It is essential for adapting language models to new domains, because the vocabulary set is activated in the word segmentation step. We achieved an ERR of 21.22% in Korean word segmentation. In addition, we show that incremental domain adaptation of the word segmentation gradually decreases the perplexity of the input text, which indicates that our approach supports out-of-domain language modeling.
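The link between vocabulary, segmentation, and perplexity described above can be illustrated with a small sketch. It is not the method of the paper: it assumes a plain unigram model, a Viterbi search over dictionary entries, a fixed penalty for out-of-vocabulary characters, and per-character perplexity so that segmentations with different token counts stay comparable. The names `segment`, `char_perplexity`, `to_logp`, the toy counts, and the ASCII example string are all hypothetical.

```python
import math


def to_logp(counts):
    """Turn raw word counts into unigram log probabilities."""
    total = sum(counts.values())
    return {word: math.log(c / total) for word, c in counts.items()}


def segment(text, vocab_logp, unk_logp=-10.0, max_len=6):
    """Viterbi segmentation under a unigram model (illustrative assumption).

    Any single character missing from the vocabulary is allowed as an
    unknown token with log probability unk_logp.
    Returns (tokens, total log probability).
    """
    n = len(text)
    # best[i] = (best score of a segmentation of text[:i], start of last token)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab_logp:
                lp = vocab_logp[piece]
            elif len(piece) == 1:
                lp = unk_logp  # fallback for an unknown character
            else:
                continue
            score = best[start][0] + lp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the text.
    tokens = []
    i = n
    while i > 0:
        start = best[i][1]
        tokens.append(text[start:i])
        i = start
    tokens.reverse()
    return tokens, best[n][0]


def char_perplexity(total_logp, text):
    """Per-character perplexity of the text under the chosen segmentation."""
    return math.exp(-total_logp / len(text))


# Toy data: "gpu" plays the role of an extracted unknown word.
text = "thegpumodel"
base_counts = {"the": 5, "model": 3}
adapted_counts = dict(base_counts, gpu=2)  # vocabulary after adaptation

base_tokens, base_lp = segment(text, to_logp(base_counts))
adapted_tokens, adapted_lp = segment(text, to_logp(adapted_counts))
base_ppl = char_perplexity(base_lp, text)
adapted_ppl = char_perplexity(adapted_lp, text)

print(base_tokens, base_ppl)
print(adapted_tokens, adapted_ppl)
```

In this toy setting the base vocabulary splits the unknown word into single characters, while the adapted vocabulary keeps it as one token and yields a lower per-character perplexity. A real system would use a trained language model and a proper unknown-word extractor in place of the hand-written counts.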



Author information


Corresponding author

Correspondence to Euisok Chung.


Copyright information

© 2011 Springer Science+Business Media, LLC

About this paper

Cite this paper

Chung, E., Jeon, HB., Park, JG., Lee, YK. (2011). Domain-Adapted Word Segmentation for an Out-of-Domain Language Modeling. In: Delgado, RC., Kobayashi, T. (eds) Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1335-6_9


  • DOI: https://doi.org/10.1007/978-1-4614-1335-6_9


  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1334-9

  • Online ISBN: 978-1-4614-1335-6

  • eBook Packages: Engineering, Engineering (R0)
