Domain-Specific Chinese Word Segmentation with Document-Level Optimization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10619)

Abstract

Previous studies normally formulate Chinese word segmentation as a character sequence labeling task and optimize the solution at the sentence level. In this paper, we address Chinese word segmentation as a document-level optimization problem. First, we apply a state-of-the-art approach, long short-term memory (LSTM), to perform character classification. Then, we propose a global objective function on the basis of character classification and achieve global optimization via Integer Linear Programming (ILP). Specifically, we propose several kinds of global constraints in ILP to capture various kinds of segmentation knowledge: segmentation consistency and domain-specific regulations to achieve document-level optimization, in addition to label transition knowledge to achieve sentence-level optimization. Empirical studies demonstrate the effectiveness of the proposed approach to domain-specific Chinese word segmentation.
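
The idea of decoding a whole document under global constraints can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the BMES tag set, the toy probabilities standing in for LSTM outputs, the hypothetical names `decode_document` and `consistent`, and the use of exhaustive search in place of an ILP solver. The objective is the sum of log-probabilities of the chosen tags. Transition rules act as the sentence-level constraint, and a requirement that repeated strings receive identical tags acts as the document-level consistency constraint.

```python
from itertools import product
from math import log

LABELS = "BMES"  # Begin, Middle, End, Single (assumed tag set)
# Tag pairs allowed next to each other inside one sentence.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}


def valid_sentence(tags):
    """True if a tag sequence is a well-formed BMES segmentation."""
    if tags[0] not in "BS" or tags[-1] not in "ES":
        return False
    return all(pair in ALLOWED for pair in zip(tags, tags[1:]))


def consistent(doc_tags, spans):
    """True if every listed occurrence of a repeated string gets the same tags.

    Each span is (sentence index, start, end) with an exclusive end.
    """
    seen = {tuple(doc_tags[s][i:j]) for s, i, j in spans}
    return len(seen) <= 1


def decode_document(probs, spans):
    """Return the best-scoring tagging of the whole document.

    Exhaustive search stands in for an ILP solver here; it is only
    feasible for toy inputs.
    """
    per_sentence = [
        [tags for tags in product(LABELS, repeat=len(sent)) if valid_sentence(tags)]
        for sent in probs
    ]
    best, best_score = None, float("-inf")
    for doc_tags in product(*per_sentence):
        if not consistent(doc_tags, spans):
            continue
        score = sum(log(p[t])
                    for sent, tags in zip(probs, doc_tags)
                    for p, t in zip(sent, tags))
        if score > best_score:
            best, best_score = list(doc_tags), score
    return best


# Toy per-character tag probabilities (made up, standing in for LSTM outputs).
PROBS = [
    [  # sentence 0: two characters forming the string "XY"
        {"B": 0.90, "M": 0.02, "E": 0.02, "S": 0.06},
        {"B": 0.02, "M": 0.02, "E": 0.90, "S": 0.06},
    ],
    [  # sentence 1: "XY" again, followed by one more character
        {"B": 0.40, "M": 0.05, "E": 0.05, "S": 0.50},
        {"B": 0.05, "M": 0.05, "E": 0.40, "S": 0.50},
        {"B": 0.05, "M": 0.05, "E": 0.10, "S": 0.80},
    ],
]
# The same string occupies positions 0..2 in both sentences.
SPANS = [(0, 0, 2), (1, 0, 2)]

sentence_only = decode_document(PROBS, [])       # no document-level constraint
document_level = decode_document(PROBS, SPANS)   # with consistency constraint
print(sentence_only)
print(document_level)
```

On this toy input, decoding without the consistency constraint splits the second occurrence of the repeated string into single characters, whereas the document-level decoding tags both occurrences as one word because the first occurrence is classified with high confidence. A real system would express the same objective and constraints with binary variables in an ILP solver rather than enumerating taggings.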

Acknowledgments

This research work has been partially supported by three NSFC grants, No. 61375073, No. 61672366 and No. 61331011.

Author information

Corresponding author

Correspondence to Shoushan Li.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Yan, Q., Shen, C., Li, S., Xia, F., Du, Z. (2018). Domain-Specific Chinese Word Segmentation with Document-Level Optimization. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science, vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_30

  • DOI: https://doi.org/10.1007/978-3-319-73618-1_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73617-4

  • Online ISBN: 978-3-319-73618-1

  • eBook Packages: Computer Science, Computer Science (R0)
