A Refined HDP-Based Model for Unsupervised Chinese Word Segmentation

Pei, Wenzhe; Han, Dongxu; Chang, Baobao

doi:10.1007/978-3-642-41491-6_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8202)

Abstract

This paper proposes a refined Hierarchical Dirichlet Process (HDP) model for unsupervised Chinese word segmentation. The model uses a dictionary-based model to obtain a better estimate of the base measure in the HDP. We also show that the initial segmentation state of the HDP model strongly affects performance: a better initial segmentation leads to better final results. We test our model on the PKU and MSRA datasets from the Second International Chinese Word Segmentation Bakeoff (SIGHAN 2005) [1], where it outperforms the state-of-the-art systems.
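
For orientation, the role of the base measure can be seen in the bigram HDP segmentation model of Goldwater et al. (2009), which appears in the reference list below. The following is a sketch of that background formulation only, not of the refined model, whose equations are not given on this page. The symbols are the conventional ones and are assumptions relative to this paper.

```latex
% Bigram HDP predictive probability (background formulation, Goldwater et al. 2009)
\begin{align}
P(w_i = w \mid w_{i-1} = w', \mathbf{h}^{-})
  &= \frac{n_{\langle w', w \rangle} + \alpha_1 \, P_1(w \mid \mathbf{h}^{-})}{n_{w'} + \alpha_1} \\
P_1(w \mid \mathbf{h}^{-})
  &= \frac{t_w + \alpha_0 \, P_0(w)}{t + \alpha_0}
\end{align}
```

Here $n_{\langle w', w \rangle}$ and $n_{w'}$ are bigram and unigram counts in the current segmentation $\mathbf{h}^{-}$ (excluding the word being sampled), $t_w$ is the number of tables labelled $w$, $t$ is the total number of tables, and $\alpha_0$, $\alpha_1$ are concentration parameters. $P_0$ is the base measure over character strings. The abstract says this is the quantity estimated with a dictionary-based model, but it does not specify the form of that estimate.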

References

1. Emerson, T.: The second international Chinese word segmentation bakeoff. In: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, vol. 133 (2005)
2. Duan, H., Sui, Z., Tian, Y., Li, W.: The CIPS-SIGHAN CLP 2012 Chinese word segmentation on microblog corpora bakeoff (2012)
3. Zhao, H., Kit, C.: An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In: The Third International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad, India (2008)
4. Magistry, P., Sagot, B.: Unsupervized word segmentation: the case for Mandarin Chinese. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, vol. 2, pp. 383–387. Association for Computational Linguistics (2012)
5. Wang, H., Zhu, J., Tang, S., Fan, X.: A new unsupervised approach to word segmentation. Computational Linguistics 37(3), 421–454 (2011)
6. Goldwater, S., Griffiths, T.L., Johnson, M.: A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112(1), 21–54 (2009)
7. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476) (2006)
8. Casella, G., George, E.I.: Explaining the Gibbs sampler. The American Statistician 46(3), 167–174 (1992)
9. Xu, J., Gao, J., Toutanova, K., Ney, H.: Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1, pp. 1017–1024. Association for Computational Linguistics (2008)
10. Kempe, A.: Experiments in unsupervised entropy-based corpus segmentation. In: Workshop of EACL in Computational Natural Language Learning, pp. 7–13 (1999)
11. Tanaka-Ishii, K.: Entropy as an indicator of context boundaries: An experiment using a web search engine. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS (LNAI), vol. 3651, pp. 93–105. Springer, Heidelberg (2005)
12. Jin, Z., Tanaka-Ishii, K.: Unsupervised segmentation of Chinese text by use of branching entropy. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 428–435. Association for Computational Linguistics (2006)
13. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22–29 (1990)

Author information

Authors and Affiliations

Key Laboratory of Computational Linguistics, Ministry of Education, Institute of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University, China
Wenzhe Pei, Dongxu Han & Baobao Chang

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Maosong Sun
Horizon Doctoral Training Centre, School of Computer Science, University of Nottingham, NG8 1BB, Nottingham, UK
Min Zhang
Google Inc., Mountain View, CA, USA
Dekang Lin
Baidu Inc., Beijing, China
Haifeng Wang

About this paper

Cite this paper

Pei, W., Han, D., Chang, B. (2013). A Refined HDP-Based Model for Unsupervised Chinese Word Segmentation. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2013, CCL 2013. Lecture Notes in Computer Science (LNAI), vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_5

DOI: https://doi.org/10.1007/978-3-642-41491-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)
