Abstract
Previous studies normally formulate Chinese word segmentation as a character sequence labeling task and optimize the solution at the sentence level. In this paper, we address Chinese word segmentation as a document-level optimization problem. First, we apply a state-of-the-art approach, i.e., long short-term memory (LSTM), to perform character classification; then, we propose a global objective function on the basis of character classification and achieve global optimization via Integer Linear Programming (ILP). Specifically, we propose several kinds of global constraints in ILP to capture various segmentation knowledge, such as segmentation consistency and domain-specific regulations, to achieve document-level optimization, in addition to label transition knowledge, which supports sentence-level optimization. Empirical studies demonstrate the effectiveness of the proposed approach to domain-specific Chinese word segmentation.
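The idea of decoding a whole document under global constraints can be illustrated with a minimal sketch. It is not the formulation used in the paper, which is not given on this page. Everything here is an assumption made for illustration: the B/M/E/S tag set, the per-character probabilities (stand-ins for LSTM outputs), the single consistency constraint, and the names `decode_document`, `TIES`, and `to_words`. For brevity, exhaustive enumeration of the feasible 0-1 assignments stands in for an ILP solver; it finds the same optimum on an input this small.

```python
import itertools
import math

LABELS = "BMES"  # assumed tag set: Begin, Middle, End, Single
# Label transition knowledge used for sentence-level validity.
ALLOWED_NEXT = {"B": "ME", "M": "ME", "E": "BS", "S": "BS"}


def valid_labelings(n):
    """Yield every length-n label string that satisfies the transition rules."""
    for labels in itertools.product(LABELS, repeat=n):
        if labels[0] not in "BS" or labels[-1] not in "ES":
            continue
        if all(b in ALLOWED_NEXT[a] for a, b in zip(labels, labels[1:])):
            yield "".join(labels)


def score(probs, labels):
    """Sum of log-probabilities of the chosen label at each character."""
    return sum(math.log(p[LABELS.index(l)]) for p, l in zip(probs, labels))


def decode_document(doc_probs, ties=()):
    """Choose one labeling per sentence, maximizing the document-level score.

    ties is a hypothetical consistency constraint list; each entry
    ((s1, start1), (s2, start2), length) forces two spans to share labels.
    """
    candidates = [list(valid_labelings(len(p))) for p in doc_probs]
    best, best_score = None, -math.inf
    for combo in itertools.product(*candidates):
        if any(combo[s1][i1:i1 + n] != combo[s2][i2:i2 + n]
               for (s1, i1), (s2, i2), n in ties):
            continue  # violates a document-level constraint
        total = sum(score(p, c) for p, c in zip(doc_probs, combo))
        if total > best_score:
            best, best_score = combo, total
    return best


def to_words(text, labels):
    """Cut the text after every character labeled E or S."""
    words, current = [], ""
    for ch, label in zip(text, labels):
        current += ch
        if label in "ES":
            words.append(current)
            current = ""
    return words


# Made-up probabilities over (B, M, E, S) for each character.
SENTENCES = ["研究生命", "生命科学"]
DOC_PROBS = [
    [(0.90, 0.03, 0.03, 0.04), (0.05, 0.50, 0.40, 0.05),
     (0.35, 0.05, 0.50, 0.10), (0.05, 0.05, 0.40, 0.50)],
    [(0.90, 0.02, 0.03, 0.05), (0.03, 0.05, 0.90, 0.02),
     (0.90, 0.03, 0.02, 0.05), (0.02, 0.03, 0.90, 0.05)],
]
# "生命" at sentence 0, offset 2 must be labeled like "生命" at sentence 1, offset 0.
TIES = [((0, 2), (1, 0), 2)]

sentence_level = decode_document(DOC_PROBS)
document_level = decode_document(DOC_PROBS, TIES)
for text, labels in zip(SENTENCES, document_level):
    print(to_words(text, labels))
```

With these invented scores, sentence-level decoding alone splits the first sentence as 研究生 / 命, while adding the consistency constraint yields 研究 / 生命, matching how 生命 is segmented in the second sentence. A real system would hand the same objective and constraints to an ILP solver instead of enumerating.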
Acknowledgments
This research work has been partially supported by three NSFC grants, No. 61375073, No. 61672366 and No. 61331011.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Yan, Q., Shen, C., Li, S., Xia, F., Du, Z. (2018). Domain-Specific Chinese Word Segmentation with Document-Level Optimization. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science, vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_30
DOI: https://doi.org/10.1007/978-3-319-73618-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73617-4
Online ISBN: 978-3-319-73618-1
eBook Packages: Computer Science, Computer Science (R0)