Neural Chinese Word Segmentation as Sequence to Sequence Translation

Shi, Xuewen; Huang, Heyan; Jian, Ping; Guo, Yuhang; Wei, Xiaochi; Tang, Yi-Kun

doi:10.1007/978-981-10-6805-8_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 774)

Included in the following conference series:

Chinese National Conference on Social Media Processing

Abstract

Recently, Chinese word segmentation (CWS) methods using neural networks have made impressive progress. Most of them treat CWS as a sequence labeling problem and build models on local features rather than considering global information from the input sequence. In this paper, we cast CWS as a sequence translation problem and propose a novel sequence-to-sequence CWS model with an attention-based encoder-decoder framework. The model captures global information from the input and directly outputs the segmented sequence. It can also tackle other NLP tasks jointly with CWS in an end-to-end mode. Experiments on the Weibo, PKU and MSRA benchmark datasets show that our approach achieves competitive performance compared with state-of-the-art methods. We also successfully applied the proposed model to jointly learning CWS and Chinese spelling correction, which demonstrates its applicability to multi-task fusion.
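
To make the contrast between the two problem framings concrete, the following minimal Python sketch builds both views of one gold-segmented sentence: per-character B/M/E/S tags (the common sequence labeling view) and a source/target pair for sequence translation. The boundary token `</w>`, the function names, and the exact target format are assumptions made for illustration; the abstract does not specify how the paper encodes its output sequence.

```python
from typing import List, Tuple

# Hypothetical word-boundary token; an assumption, not taken from the paper.
SEP = "</w>"


def to_bmes_tags(words: List[str]) -> List[str]:
    """Sequence labeling view: one B/M/E/S tag per input character."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def to_translation_pair(words: List[str]) -> Tuple[List[str], List[str]]:
    """Sequence translation view: source is the raw character sequence,
    target is the same characters with SEP inserted between words."""
    source = [ch for word in words for ch in word]
    target: List[str] = []
    for i, word in enumerate(words):
        if i > 0:
            target.append(SEP)
        target.extend(word)
    return source, target


def decode_target(target: List[str]) -> List[str]:
    """Recover the word list from a generated target sequence."""
    words: List[str] = []
    current: List[str] = []
    for token in target:
        if token == SEP:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(token)
    if current:
        words.append("".join(current))
    return words


gold = ["我", "喜欢", "北京"]
tags = to_bmes_tags(gold)
source, target = to_translation_pair(gold)
print(tags)    # ['S', 'B', 'E', 'B', 'E']
print(source)  # ['我', '喜', '欢', '北', '京']
print(target)  # ['我', '</w>', '喜', '欢', '</w>', '北', '京']
```

In the labeling view the output is tied one-to-one to the input characters. In the translation view the target is a free sequence, so in principle it may also contain characters that differ from the source, which is one way a joint segmentation and spelling correction setup could be expressed under these assumptions.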

Notes

1. Executable source code is available at https://github.com/SourcecodeSharing/CWSpostediting.
2. All data and the program are available at https://github.com/FudanNLP/NLPCC-WordSeg-Weibo.
3. http://www.weibo.com.
4. All data and the program are available at http://sighan.cs.uchicago.edu/bakeoff2005/.
5. Implementations are available at https://github.com/lisa-groundhog/GroundHog.
6. Available online at https://github.com/HIT-SCIR/ltp.

Acknowledgments

This work was supported by the National Basic Research Program (973) of China (No. 2013CB329303) and the National Natural Science Foundation of China (No. 61132009).

Author information

Authors and Affiliations

School of Computer Science and Technology, Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing Institute of Technology, Beijing, 100081, China
Xuewen Shi, Heyan Huang, Ping Jian, Yuhang Guo, Xiaochi Wei & Yi-Kun Tang

Corresponding author

Correspondence to Ping Jian.

Editor information

Editors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xueqi Cheng
Beijing Jinri Toutiao Technology Co. Ltd, Beijing, China
Weiying Ma
Arizona State University, Tempe, Arizona, USA
Huan Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Huawei Shen
Renmin University of China, Beijing, China
Shizheng Feng
Microsoft Asia Research, Beijing, China
Xing Xie

About this paper

Cite this paper

Shi, X., Huang, H., Jian, P., Guo, Y., Wei, X., Tang, YK. (2017). Neural Chinese Word Segmentation as Sequence to Sequence Translation. In: Cheng, X., Ma, W., Liu, H., Shen, H., Feng, S., Xie, X. (eds) Social Media Processing. SMP 2017. Communications in Computer and Information Science, vol 774. Springer, Singapore. https://doi.org/10.1007/978-981-10-6805-8_8

DOI: https://doi.org/10.1007/978-981-10-6805-8_8
Published: 26 October 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6804-1
Online ISBN: 978-981-10-6805-8
eBook Packages: Computer Science, Computer Science (R0)