Abstract
Because professional technical literature contains a large number of complex noun phrases, identifying those phrases has practical value for tasks such as machine translation. Based on an analysis of such phrases in Chinese-English bilingual sentence pairs from aircraft technical publications, we present an annotation specification, built on the existing specification, for labeling those phrases, together with a method for complex noun phrase identification. In addition to the basic features, namely the word and its part-of-speech, we incorporate into the machine learning model word clustering features trained with the Brown clustering model and the Word Vector Class (WVC) model on large unlabeled data. Experimental results indicate that combining different word clustering features with the basic features improves system performance, raising the F-score by 1.83% compared with the method that uses only the basic features.
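As a rough illustration of the feature combination described above, the following Python sketch builds one feature dictionary per token from the word, its part-of-speech, a Brown cluster bit string, and a WVC class ID. It is a minimal sketch, not the authors' implementation: the lookup tables `BROWN_PATHS` and `WVC_CLASSES`, their contents, the function names, and the choice of bit-string prefix lengths are all assumptions made for illustration, and the abstract does not say which sequence labeler consumes the features.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lookup tables. In practice both would be learned from a large
# unlabeled corpus; the entries below are invented for illustration only.
BROWN_PATHS: Dict[str, str] = {
    "landing": "110100",
    "gear": "110101",
    "door": "110111",
}
WVC_CLASSES: Dict[str, int] = {"landing": 17, "gear": 17, "door": 42}


def token_features(
    sent: List[Tuple[str, str]],
    i: int,
    prefix_lengths: Sequence[int] = (2, 4),  # assumed values, not from the paper
) -> Dict[str, str]:
    """Return basic plus word-clustering features for token i of sent."""
    word, pos = sent[i]
    # Basic features: the word itself and its part-of-speech tag.
    feats = {"word": word, "pos": pos}

    # Brown cluster feature: prefixes of the bit string give coarser clusters.
    path = BROWN_PATHS.get(word)
    if path is not None:
        for n in prefix_lengths:
            feats[f"brown_{n}"] = path[:n]

    # WVC feature: a single class ID per word, when the word is known.
    cls = WVC_CLASSES.get(word)
    if cls is not None:
        feats["wvc"] = str(cls)
    return feats


def sentence_features(sent: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one feature dictionary per token of a POS-tagged sentence."""
    return [token_features(sent, i) for i in range(len(sent))]


example = [("the", "DT"), ("landing", "NN"), ("gear", "NN")]
features = sentence_features(example)
```

Words missing from a lookup table simply receive no cluster feature, so the basic features still apply to them. A list of dictionaries like `features` is the input shape that many sequence labeling toolkits accept.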
This work is supported by the Humanities and Social Sciences Foundation for Youth Scholars of the Ministry of Education of China (No. 14YJC740126) and the National Natural Science Foundation of China (No. 61402299).
A Appendix
The Table 6.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Xue, L., Zhang, G., Zhou, Q., Ye, N. (2015). Incorporating Word Clustering into Complex Noun Phrase Identification. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2015, NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_3
DOI: https://doi.org/10.1007/978-3-319-25816-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4
eBook Packages: Computer Science, Computer Science (R0)