Applying Machine Learning to Chinese Entity Detection and Tracking

Qian, Donglei; Li, Wenjie; Yuan, Chunfa; Lu, Qin; Wu, Mingli

doi:10.1007/978-3-540-70939-8_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents a Chinese entity detection and tracking system that takes advantage of character-based models and machine learning approaches. An entity is defined here as a link of all its mentions in the text, together with the associated attributes. Entity mentions of different types normally exhibit quite different linguistic patterns. Six separate Conditional Random Fields (CRF) models that incorporate character N-gram and word knowledge features are built to detect the extent and the head of three types of mentions, namely named, nominal and pronominal mentions. For each type of mention, attributes are identified by Support Vector Machine (SVM) classifiers that take mention heads and their context as classification features. Mentions can then be merged into a unified entity representation by examining their attributes and connections in a rule-based coreference resolution process. The system is evaluated on the ACE 2005 corpus and achieves competitive results.
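
As a rough illustration of the pipeline described in the abstract, the following Python sketch shows how character N-gram features and B/I/O labels for a sequence labeller such as a CRF might be prepared, and how mentions might be grouped by a simple rule. The abstract does not give the actual feature templates, label scheme or coreference rules, so everything here is an assumption: the function names, the window size, the `#` padding symbol, the B/I/O encoding and the "same type and same head string" merging rule are all hypothetical.

```python
from typing import Dict, List, Tuple


def char_ngram_features(sentence: str, i: int, window: int = 2) -> Dict[str, str]:
    """Character unigram and bigram features around position i.

    Hypothetical template: the window size and the '#' padding symbol
    are assumptions, not taken from the paper.
    """
    padded = "#" * window + sentence + "#" * window
    c = i + window  # index of the current character in the padded string
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        feats[f"U[{off}]"] = padded[c + off]            # unigram at offset
    for off in range(-window, window):
        feats[f"B[{off}]"] = padded[c + off:c + off + 2]  # bigram starting at offset
    return feats


def bio_labels(length: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Encode mention spans (start inclusive, end exclusive) as B/I/O labels."""
    labels = ["O"] * length
    for start, end in spans:
        labels[start] = "B"
        for k in range(start + 1, end):
            labels[k] = "I"
    return labels


def merge_mentions(mentions: List[Dict[str, str]]) -> List[List[Dict[str, str]]]:
    """Toy rule: mentions with the same entity type and head string corefer.

    This stands in for the rule-based coreference step; the paper's real
    rules are not given in the abstract.
    """
    entities: Dict[Tuple[str, str], List[Dict[str, str]]] = {}
    for m in mentions:
        entities.setdefault((m["type"], m["head"]), []).append(m)
    return list(entities.values())


sentence = "北京大学的学生"
features = char_ngram_features(sentence, 0)
labels = bio_labels(len(sentence), [(0, 4)])  # "北京大学" as one mention extent
entities = merge_mentions([
    {"head": "北京大学", "type": "ORG"},
    {"head": "学生", "type": "PER"},
    {"head": "北京大学", "type": "ORG"},
])
```

In a setup like this, the per-character feature dictionaries and label sequences would be passed to a CRF trainer, with one model per combination of mention type and extent or head, and the attribute classifiers would run on the detected heads before the merging step.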

Author information

Authors and Affiliations

Department of Computing, The Hong Kong Polytechnic University, Hong Kong
Donglei Qian, Wenjie Li, Qin Lu & Mingli Wu
Department of Computer Science and Technology, Tsinghua University, China
Donglei Qian & Chunfa Yuan

Editor information

Alexander Gelbukh

Cite this paper

Qian, D., Li, W., Yuan, C., Lu, Q., Wu, M. (2007). Applying Machine Learning to Chinese Entity Detection and Tracking. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_14

DOI: https://doi.org/10.1007/978-3-540-70939-8_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)
