Domain-Specific Chinese Word Segmentation with Document-Level Optimization

Yan, Qian; Shen, Chenlin; Li, Shoushan; Xia, Fen; Du, Zekai

doi:10.1007/978-3-319-73618-1_30

Conference paper
First Online: 05 January 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10619)

Abstract

Previous studies typically formulate Chinese word segmentation as a character sequence labeling task and optimize the solution at the sentence level. In this paper, we address Chinese word segmentation as a document-level optimization problem. First, we apply a state-of-the-art approach, long short-term memory (LSTM), to perform character classification. Then, we propose a global objective function based on the character classification results and achieve global optimization via Integer Linear Programming (ILP). Specifically, in addition to label transition knowledge, which supports sentence-level optimization, we propose several kinds of global constraints in ILP that capture segmentation knowledge such as segmentation consistency and domain-specific regulations, which support document-level optimization. Empirical studies demonstrate the effectiveness of the proposed approach to domain-specific Chinese word segmentation.
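
The idea of decoding under document-level constraints can be illustrated with a toy example. The sketch below is not the formulation from the paper, which is not given in this preview. It assumes a BMES label scheme, made-up per-character probabilities standing in for the LSTM classifier output, and one hypothetical consistency constraint requiring two occurrences of the same string to receive the same labels. For brevity it solves the resulting 0-1 program by exhaustive enumeration rather than by calling an ILP solver. All names (`decode`, `doc_probs`, `spans`) are illustrative.

```python
import itertools
import math

LABELS = "BMES"
# Allowed label transitions inside a sentence (standard BMES scheme).
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}


def sentence_ok(tags):
    """Sentence-level constraints: valid start, end, and transitions."""
    if tags[0] not in "BS" or tags[-1] not in "ES":
        return False
    return all((a, b) in ALLOWED for a, b in zip(tags, tags[1:]))


def consistent(doc_tags, spans):
    """Document-level constraint: all linked spans carry identical labels."""
    segments = [doc_tags[sent][start:end] for sent, start, end in spans]
    return all(seg == segments[0] for seg in segments)


def score(probs, tags):
    """Log-probability of one label sequence under the classifier output."""
    return sum(math.log(p[t]) for p, t in zip(probs, tags))


def decode(doc_probs, spans=()):
    """Maximise the total log-probability subject to the constraints."""
    per_sentence = [
        [tags for tags in itertools.product(LABELS, repeat=len(probs))
         if sentence_ok(tags)]
        for probs in doc_probs
    ]
    best, best_score = None, -math.inf
    for doc_tags in itertools.product(*per_sentence):
        if spans and not consistent(doc_tags, spans):
            continue
        total = sum(score(p, t) for p, t in zip(doc_probs, doc_tags))
        if total > best_score:
            best, best_score = doc_tags, total
    return ["".join(tags) for tags in best]


# Hypothetical classifier output for two three-character sentences.
# Characters 0-1 of sentence 0 and characters 1-2 of sentence 1 are assumed
# to be the same two-character string.
doc_probs = [
    [{"B": .9, "M": .03, "E": .03, "S": .04},
     {"B": .03, "M": .03, "E": .9, "S": .04},
     {"B": .03, "M": .03, "E": .04, "S": .9}],
    [{"B": .04, "M": .03, "E": .03, "S": .9},
     {"B": .4, "M": .05, "E": .05, "S": .5},
     {"B": .05, "M": .05, "E": .4, "S": .5}],
]
spans = [(0, 0, 2), (1, 1, 3)]  # (sentence index, start, end)

print(decode(doc_probs))         # sentence-level constraints only
print(decode(doc_probs, spans))  # plus the consistency constraint
```

With these made-up numbers, the second sentence is split into single characters when decoded alone, while the consistency constraint lets the confident first occurrence pull the second one toward the same two-character word. Enumeration is only feasible for tiny inputs; a realistic document would require an actual ILP solver.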

Acknowledgments

This research work has been partially supported by three NSFC grants, No. 61375073, No. 61672366 and No. 61331011.

Author information

Authors and Affiliations

Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, Suzhou, China
Qian Yan, Chenlin Shen & Shoushan Li
Beijing Wisdom Uranium Technology Co., Ltd., Beijing, China
Fen Xia & Zekai Du

Corresponding author

Correspondence to Shoushan Li.

Editor information

Editors and Affiliations

Fudan University, Shanghai, China
Xuanjing Huang
Singapore Management University, Singapore, Singapore
Jing Jiang
Peking University, Beijing, China
Dongyan Zhao
Peking University, Beijing, China
Yansong Feng
Soochow University, Suzhou, China
Yu Hong

About this paper

Cite this paper

Yan, Q., Shen, C., Li, S., Xia, F., Du, Z. (2018). Domain-Specific Chinese Word Segmentation with Document-Level Optimization. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science, vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_30

DOI: https://doi.org/10.1007/978-3-319-73618-1_30
Published: 05 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73617-4
Online ISBN: 978-3-319-73618-1
eBook Packages: Computer Science, Computer Science (R0)
