Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5246)

Abstract

Overlapping ambiguity is a major type of ambiguity in Chinese word segmentation. This paper studies the statistical properties of overlapping ambiguities in depth, based on observations from a very large, balanced, general-purpose Chinese corpus, and reports the relevant statistics from several perspectives. The stability of high-frequency maximal overlapping ambiguities is tested using statistics from both the general-purpose corpus and domain-specific corpora. On this basis, a disambiguation strategy is proposed that assigns a predefined solution to each of 5,507 pseudo overlapping ambiguities, suggesting that over 42% of the overlapping ambiguities in running Chinese text could be resolved without error. Several state-of-the-art word segmenters are compared on these overlapping ambiguities. Preliminary experiments show that about 2% of the 5,507 pseudo ambiguities are mis-segmented by these segmenters and can be handled correctly by the proposed strategy.
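As a rough illustration of the table-lookup idea described in the abstract, the following Python sketch finds strings in which two lexicon words overlap and looks each one up in a table of predefined segmentations. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy lexicon, and the single table entry are all hypothetical, and "pseudo" is assumed to mean that only one segmentation of the string is ever correct in practice. The sketch only detects pairs of overlapping words and does not merge longer chains into maximal ambiguities.

```python
def find_overlaps(text, lexicon, max_word_len=4):
    """Return sorted (start, end) spans where two lexicon words overlap.

    A span text[start:end] is reported when one word covers its beginning,
    another covers its end, and the two words share at least one character.
    """
    # Collect every multi-character substring that is a lexicon word.
    spans = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 2, min(len(text), i + max_word_len) + 1)
        if text[i:j] in lexicon
    ]
    hits = set()
    for s1, e1 in spans:
        for s2, e2 in spans:
            # Crossing condition: the second word starts inside the first
            # word and ends after it.
            if s1 < s2 < e1 < e2:
                hits.add((s1, e2))
    return sorted(hits)


def resolve(text, lexicon, table):
    """Pair each overlapping string with its predefined segmentation.

    Strings missing from the table map to None, meaning they would have to
    be passed to some other disambiguation method.
    """
    return [
        (text[s:e], table.get(text[s:e]))
        for s, e in find_overlaps(text, lexicon)
    ]


# Toy data for illustration only; none of it is taken from the paper.
LEXICON = {"为人", "人民", "服务"}
TABLE = {"为人民": ["为", "人民"]}

if __name__ == "__main__":
    print(resolve("为人民服务", LEXICON, TABLE))
```

Because every table entry has a single fixed answer, lookup of a listed string cannot introduce an error so long as the entry itself is correct, which matches the abstract's claim that these cases could be solved without making any error.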

The research is supported by the National Natural Science Foundation of China under grant number 60573187 and the CINACS project.

Editor information

Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Qiao, W., Sun, M., Menzel, W. (2008). Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2008. Lecture Notes in Computer Science, vol 5246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87391-4_24

  • DOI: https://doi.org/10.1007/978-3-540-87391-4_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87390-7

  • Online ISBN: 978-3-540-87391-4

  • eBook Packages: Computer Science, Computer Science (R0)
