Abstract
This paper presents statistical characteristics of the Chinese language based on huge text corpora. Our investigation finds that written Chinese tends to use longer words, whereas words in other language styles are shorter. In large text corpora, the numbers of bigrams and trigrams can be estimated from the size of the corpus. In recognition experiments, we find that perplexity correlates only weakly with either the size of the training set or the recognition character error rate. Nevertheless, a large training set of more than tens of millions of words is necessary to attain good performance.
The work described in this paper is funded by the National Key Fundamental Research Program (the 973 Project) under No. G1998030504 and the National 863 High-tech Project under No. 863-306-ZD03-01-1.
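For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of how n-gram counts and bigram perplexity are commonly computed. It is not the authors' toolkit or method: the function names `count_ngrams` and `bigram_perplexity` are hypothetical, add-one smoothing is an assumption chosen only for brevity, and the input is assumed to be text already segmented into words.

```python
import math
from collections import Counter


def count_ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a sequence of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed bigram model on test_tokens.

    Assumption: add-one (Laplace) smoothing stands in for whatever
    smoothing a real system would use.
    """
    bigrams = count_ngrams(train_tokens, 2)
    # How often each word occurs as the left context of a bigram.
    contexts = Counter(train_tokens[:-1])
    vocab_size = len(set(train_tokens) | set(test_tokens))

    pairs = list(zip(test_tokens, test_tokens[1:]))
    if not pairs:
        raise ValueError("test_tokens must contain at least two tokens")

    log_prob = 0.0
    for prev, cur in pairs:
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log2(p)

    # Perplexity is 2 raised to the average negative log2 probability.
    return 2 ** (-log_prob / len(pairs))
```

Calling `len(count_ngrams(tokens, 2))` gives the number of distinct bigrams in a corpus, which is one way to observe how that number grows with corpus size.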
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, H., Xu, B., Huang, T. (2000). Statistical Analysis of Chinese Language and Language Modeling Based on Huge Text Corpora. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_37
DOI: https://doi.org/10.1007/3-540-40063-X_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41180-2
Online ISBN: 978-3-540-40063-9
eBook Packages: Springer Book Archive