Abstract
Classical Chinese differs fundamentally from Modern Chinese in both syntax and morphology. While a number of recent works address part-of-speech (PoS) tagging for Modern Chinese, PoS tagging for Classical Chinese has been largely neglected; to the best of our knowledge, this is the first work in the area. Fortunately, Classical Chinese is easier to tag than Modern Chinese in one respect: most Classical Chinese words consist of a single character, so no segmentation is needed. In this paper, we propose and analyze a simple statistical approach to PoS tagging of Classical Chinese. We first design a tagset for Classical Chinese, which is later shown to be accurate and efficient. We then apply the hidden Markov model (HMM) with the Viterbi algorithm and make several improvements, such as sparse-data handling and unknown-word guessing, both designed specifically for Classical Chinese. As the training set grows, the accuracies of the bigram and trigram models increase to 94.9% and 97.6%, respectively. Our work also contributes by identifying and solving some previously unseen problems in processing Classical Chinese.
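To make the bigram case concrete, the following is a minimal sketch of HMM tagging with Viterbi decoding over single-character words. It is not the implementation evaluated in the paper: the three-sentence corpus, the five-tag tagset (n, v, adv, conj, pron), and the function names are hypothetical, and plain add-one smoothing stands in for the paper's sparse-data handling and unknown-word guessing, whose details the abstract does not give.

```python
import math
from collections import defaultdict

def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-character emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counts, context, event, num_outcomes, alpha=1.0):
    """Add-alpha smoothed log P(event | context); an assumed stand-in smoother."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + alpha) / (total + alpha * num_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown characters
    # best[i][t] = (log score of the best path ending in tag t at position i, previous tag)
    best = [{} for _ in words]
    for t in tags:
        score = log_prob(trans, "<s>", t, n_tags) + log_prob(emit, t, words[0], n_words)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        for t in tags:
            e = log_prob(emit, t, words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans, p, t, n_tags) + e, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Hypothetical toy corpus; each character is one word, so no segmentation step.
corpus = [
    [("子", "n"), ("曰", "v")],
    [("學", "v"), ("而", "conj"), ("時", "adv"), ("習", "v"), ("之", "pron")],
    [("人", "n"), ("不", "adv"), ("知", "v")],
]
trans, emit, vocab = train_bigram_hmm(corpus)
print(viterbi(list("人不知"), trans, emit, vocab))
```

Because each character is treated as a word, the input sentence is simply split into characters before decoding. A trigram variant would condition each transition on the two preceding tags instead of one.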
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, L., Peng, Y., Wang, H., Wu, Z. (2002). Statistical Part-of-Speech Tagging for Classical Chinese. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2002. Lecture Notes in Computer Science, vol 2448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46154-X_15
DOI: https://doi.org/10.1007/3-540-46154-X_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44129-8
Online ISBN: 978-3-540-46154-8
eBook Packages: Springer Book Archive