Abstract
Using a large synchronous Chinese corpus, we show how word and character entropies vary across time and space for different Chinese speech communities. We find that word entropy values are affected by the quality of the segmentation process. We also note that word entropies can be affected by proper nouns, which are the most volatile segment of the stable lexicon of the language. Our word and character entropy results provide an interesting comparison with earlier results. The average joint character entropies (a.k.a. entropy rates) of Chinese up to order 20 that we provide indicate that the limits of the conditional character entropies of Chinese for the different speech communities should be about 1 (or less). This raises the question of whether early convergence of character entropies would also entail word entropy convergence.
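As an illustration of the quantities named in the abstract, the following is a minimal sketch, not the authors' procedure. It assumes plain maximum-likelihood (relative frequency) estimates, and the function names `entropy`, `block_entropy`, `average_joint_entropy` and `conditional_entropy` are hypothetical. Word entropy is obtained by passing a list of segmented words to `entropy`, which is why the result depends on the segmentation.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy in bits of the empirical distribution of `items`.

    `items` may be a string (characters) or a list of segmented words.
    """
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def block_entropy(text, n):
    """Joint entropy H_n of overlapping n-character blocks (H_0 = 0)."""
    if n == 0:
        return 0.0
    return entropy(text[i:i + n] for i in range(len(text) - n + 1))


def average_joint_entropy(text, n):
    """Average joint entropy H_n / n, in bits per character."""
    return block_entropy(text, n) / n


def conditional_entropy(text, n):
    """Conditional entropy H_n - H_(n-1) of a character given the n-1 before it."""
    return block_entropy(text, n) - block_entropy(text, n - 1)


if __name__ == "__main__":
    sample = "abab"                       # toy text, assumption for illustration
    print(entropy(sample))                # character entropy
    print(entropy(["ab", "ab"]))          # word entropy under one segmentation
    print(average_joint_entropy(sample, 2))
```

Estimates of this kind are biased downward for high orders on small samples, because most long blocks occur only once, so a large corpus is needed before the order-20 values mean anything.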
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tsou, B.K., Lai, T.B.Y., Chow, Kp. (2005). Comparing Entropies within the Chinese Language. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_49
DOI: https://doi.org/10.1007/978-3-540-30211-7_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science, Computer Science (R0)