Enhancing Chinese Word Segmentation with Character Clustering

Liu, Yijia; Che, Wanxiang; Liu, Ting

doi:10.1007/978-3-642-41491-6_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8202)

Abstract

In semi-supervised learning frameworks, cluster-based features have proved helpful for improving system performance in NER and other NLP tasks. However, no previous work has employed clustering in word segmentation. In this paper, we propose a new approach that computes clusters of characters and uses them to assist a character-based Chinese word segmentation system. To address character ambiguity, contextual information is taken into account when the character clustering algorithm is run. Experiments show that our character clusters improve performance. We also compare our cluster features with the widely used mutual information (MI) feature. When the two features are integrated, further improvement is achieved.
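
The abstract does not say how the mutual information feature is computed. A common choice in character-based segmentation is pointwise mutual information between adjacent characters, and the following is a minimal sketch under that assumption, not the authors' implementation. The function name `adjacent_char_pmi` and the toy corpus are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def adjacent_char_pmi(corpus):
    """Return PMI (base 2) for every adjacent character pair seen in `corpus`.

    Assumption: probabilities are plain relative frequencies over the
    unlabeled corpus, with no smoothing.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)                    # single-character counts
        bigrams.update(zip(sentence, sentence[1:]))  # adjacent-pair counts

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy unlabeled corpus (hypothetical data, for illustration only).
corpus = ["中国人民", "中国银行", "人民银行"]
scores = adjacent_char_pmi(corpus)
```

On this toy corpus the pair inside a word (中, 国) scores higher than the pair that straddles a word boundary (国, 人), which is the kind of signal a segmenter can use as a feature.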

Author information

Authors and Affiliations

Research Center for Social Computing and Information Retrieval, School of Computer Science and Technology, Harbin Institute of Technology, China
Yijia Liu, Wanxiang Che & Ting Liu

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Maosong Sun
Horizon Doctoral Training Centre, School of Computer Science, University of Nottingham, NG8 1BB, Nottingham, UK
Min Zhang
Google Inc., Mountain View, CA, USA
Dekang Lin
Baidu Inc., Beijing, China
Haifeng Wang

About this paper

Cite this paper

Liu, Y., Che, W., Liu, T. (2013). Enhancing Chinese Word Segmentation with Character Clustering. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_6

DOI: https://doi.org/10.1007/978-3-642-41491-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)
