Effective Neural Solution for Multi-criteria Word Segmentation

He, Han; Wu, Lei; Yan, Hua; Gao, Zhimin; Feng, Yi; Townsend, George

doi:10.1007/978-981-13-1927-3_14

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 105)

Abstract

We present a novel and elegant deep learning solution for training a single joint model on multi-criteria corpora for the Chinese Word Segmentation (CWS) task. Our design requires no private layers in the model architecture; instead, it introduces two artificial tokens, one at the beginning and one at the end of the input sentence, to specify the required target criterion. The rest of the model, including the Long Short-Term Memory (LSTM) layer and the Conditional Random Field (CRF) layer, remains unchanged and is shared across all datasets, keeping the parameter set minimal and constant in size. On the Bakeoff 2005 and Bakeoff 2008 datasets, our design surpasses previous multi-criteria learning results. Test results on two out of four datasets even surpass the latest state-of-the-art single-criterion learning scores. To the best of our knowledge, our design is the first to achieve state-of-the-art performance on such large-scale datasets. Source code and corpora for this paper are available on GitHub (https://github.com/hankcs/multi-criteria-cws).
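The criterion-token idea in the abstract can be illustrated with a minimal sketch. It only shows the input preprocessing, not the LSTM-CRF model itself. The function name `add_criterion_tokens`, the `<name>` / `</name>` token spelling, and the criterion names are assumptions made for illustration; the abstract does not specify them, and the released code may differ.

```python
from typing import List

# Assumed criterion identifiers (hypothetical names for this sketch only).
CRITERIA = {"pku", "msr", "as", "cityu"}


def add_criterion_tokens(chars: List[str], criterion: str) -> List[str]:
    """Wrap a character sequence with two artificial tokens naming the criterion.

    The wrapped sequence is what a shared model would consume, so the same
    parameters can serve every corpus while the tokens select the criterion.
    The token spelling used here is an assumption, not the authors' format.
    """
    if criterion not in CRITERIA:
        raise ValueError(f"unknown criterion: {criterion}")
    return [f"<{criterion}>"] + list(chars) + [f"</{criterion}>"]


# The same sentence prepared for two different segmentation criteria.
sentence = list("他来了")
as_pku = add_criterion_tokens(sentence, "pku")
as_msr = add_criterion_tokens(sentence, "msr")
```

Under these assumptions, the two prepared inputs differ only in their first and last tokens, which is the whole mechanism for telling a shared model which segmentation standard to follow.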

Notes

1. https://github.com/hankcs/HanLP
2. http://www.sighan.org/bakeoff2003/score (this script rounds a score to one decimal place)

Author information

Authors and Affiliations

Computer and Software Engineering, Institutional Research, 2700 Bay Area Blvd., Houston, TX, 77058, USA
Han He, Lei Wu & Hua Yan
Computer Science Department, 3551 Cullen Blvd., Houston, TX, 77204, USA
Zhimin Gao
Department of Computer Science, Algoma University, 1520 Queen Street East, Sault Ste. Marie, ON, P6A 2G4, Canada
Yi Feng & George Townsend

Corresponding author

Correspondence to Lei Wu.

Editor information

Editors and Affiliations

School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, India
Suresh Chandra Satapathy
Department of Electronics and Communication Engineering, Shri Ramswaroop Memorial Group of Professional Colleges, Lucknow, Uttar Pradesh, India
Vikrant Bhateja
Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Swagatam Das

Cite this paper

He, H., Wu, L., Yan, H., Gao, Z., Feng, Y., Townsend, G. (2019). Effective Neural Solution for Multi-criteria Word Segmentation. In: Satapathy, S., Bhateja, V., Das, S. (eds) Smart Intelligent Computing and Applications. Smart Innovation, Systems and Technologies, vol 105. Springer, Singapore. https://doi.org/10.1007/978-981-13-1927-3_14

DOI: https://doi.org/10.1007/978-981-13-1927-3_14
Published: 05 November 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1926-6
Online ISBN: 978-981-13-1927-3
eBook Packages: Engineering, Engineering (R0)
