Statistical Part-of-Speech Tagging for Classical Chinese

Huang, Liang; Peng, Yinan; Wang, Huan; Wu, Zhenyu

doi:10.1007/3-540-46154-X_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2448)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Classical Chinese differs fundamentally from Modern Chinese in both syntax and morphology. While there have recently been a number of works on part-of-speech (PoS) tagging for Modern Chinese, PoS tagging for Classical Chinese has been largely neglected. To the best of our knowledge, this is the first work in the area. Fortunately, in terms of tagging, Classical Chinese is easier than Modern Chinese in that most Classical Chinese words consist of a single character, so no segmentation is needed. In this paper, we propose and analyze a simple statistical approach to PoS tagging of Classical Chinese. We first design a tagset for Classical Chinese that is later shown to be accurate and efficient. We then apply the hidden Markov model (HMM) with the Viterbi algorithm and make several improvements, such as sparse-data handling and unknown-word guessing, both designed specifically for Classical Chinese. As the training set grows larger, the accuracies of the bigram and trigram models increase to 94.9% and 97.6%, respectively. Our work also contributes by identifying and solving some previously unseen problems in processing Classical Chinese.
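
To make the approach in the abstract concrete, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm, treating each character as one word. It is not the authors' implementation. The tag labels (`n`, `v`, `adv`, `c`, `pron`), the toy corpus, and the function names `train_bigram_hmm` and `viterbi` are all hypothetical, and add-alpha smoothing is only a stand-in assumption for the sparse-data and unknown-word handling, whose actual design the abstract does not describe.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_bigram_hmm(tagged_sentences, alpha=1.0):
    """Count tag bigrams and tag/word pairs; return add-alpha smoothed log-probabilities."""
    trans = defaultdict(dict)  # trans[prev_tag][tag] -> count
    emit = defaultdict(dict)   # emit[tag][word] -> count
    tags, vocab = set(), set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] = trans[prev].get(tag, 0) + 1
            emit[tag][word] = emit[tag].get(word, 0) + 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    tags = sorted(tags)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # +1 reserves probability mass for unknown words

    def log_trans(prev, tag):
        counts = trans.get(prev, {})
        total = sum(counts.values())
        return math.log((counts.get(tag, 0) + alpha) / (total + alpha * n_tags))

    def log_emit(tag, word):
        counts = emit.get(tag, {})
        total = sum(counts.values())
        return math.log((counts.get(word, 0) + alpha) / (total + alpha * n_words))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    # best[i][t] = (log-score of the best path ending in tag t at position i, backpointer)
    best = [{t: (log_trans(START, t) + log_emit(t, words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max((best[i - 1][p][0] + log_trans(p, t), p) for p in tags)
            column[t] = (score + log_emit(t, words[i]), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy corpus with illustrative tags; this is not the paper's tagset or data.
toy_corpus = [
    [("子", "n"), ("曰", "v")],
    [("學", "v"), ("而", "c"), ("時", "adv"), ("習", "v"), ("之", "pron")],
    [("人", "n"), ("不", "adv"), ("知", "v")],
]
tags, log_trans, log_emit = train_bigram_hmm(toy_corpus)
print(viterbi(["子", "不", "知"], tags, log_trans, log_emit))
```

A trigram variant would condition each transition on the two preceding tags, which enlarges the state space and makes smoothing more important. Because no segmentation step is needed, the input here is simply a list of characters.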

Author information

Authors and Affiliations

Department of Computer Science, Shanghai Jiaotong University, No. 1954 Huashan Road, 200030, Shanghai, P.R. China
Liang Huang, Yinan Peng & Zhenyu Wu
Department of Chinese Literature and Linguistics, East China Normal University, No. 3663 North Zhongshan Road, 200062, Shanghai, P.R. China
Huan Wang

Editor information

Editors and Affiliations

Faculty of Informatics Department of Programming Systems and Communication, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics Department of Information Technologies, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Huang, L., Peng, Y., Wang, H., Wu, Z. (2002). Statistical Part-of-Speech Tagging for Classical Chinese. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2002. Lecture Notes in Computer Science, vol 2448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46154-X_15

DOI: https://doi.org/10.1007/3-540-46154-X_15
Published: 23 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44129-8
Online ISBN: 978-3-540-46154-8
eBook Packages: Springer Book Archive