Statistical Analysis of Chinese Language and Language Modeling Based on Huge Text Corpora¹

Zhang, Hong; Xu, Bo; Huang, Taiyi

doi:10.1007/3-540-40063-X_37

Conference paper
First Online: 26 October 2001

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1948)

Abstract

This paper presents the statistical characteristics of the Chinese language based on huge text corpora. From our investigation, we find that written Chinese is more likely to use long words, while in other language styles the words are shorter. In large text corpora, the number of bigrams and trigrams can be estimated from the size of the corpus. In the recognition experiments, we find that perplexity correlates only weakly with either the size of the training set or the recognition character error rate. However, to attain good performance, a large training set of more than tens of millions of words is necessary.
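The two quantities the abstract relies on, n-gram counts and perplexity, can be illustrated with a minimal sketch. It is not the authors' method: the helper names `ngram_counts` and `bigram_perplexity`, the toy pre-segmented tokens, and the add-one smoothing are all assumptions chosen for brevity, and the abstract does not say which smoothing or toolkit settings were used.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_perplexity(train, test):
    """Perplexity of an add-one smoothed bigram model on `test`.

    Add-one smoothing is an illustrative assumption, not the paper's choice.
    """
    unigrams = ngram_counts(train, 1)
    bigrams = ngram_counts(train, 2)
    vocab_size = len(set(train) | set(test))
    log_prob = 0.0
    for prev, word in zip(test, test[1:]):
        # P(word | prev) with add-one smoothing over the vocabulary
        p = (bigrams[(prev, word)] + 1) / (unigrams[(prev,)] + vocab_size)
        log_prob += math.log2(p)
    # Perplexity is 2 raised to the average negative log2 probability
    return 2 ** (-log_prob / (len(test) - 1))

# Hypothetical toy corpus: romanized, already word-segmented tokens
train = "wo xihuan yuyan moxing wo xihuan tongji yuyan moxing".split()
test = "wo xihuan yuyan moxing".split()

distinct_bigrams = len(ngram_counts(train, 2))
distinct_trigrams = len(ngram_counts(train, 3))
ppl = bigram_perplexity(train, test)
print(distinct_bigrams, distinct_trigrams, round(ppl, 3))
```

On a real corpus, one would record `distinct_bigrams` and `distinct_trigrams` at increasing corpus sizes to see how the counts grow, and compare `ppl` against the recognizer's character error rate.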

¹ The work described in the paper is funded by the National Key Fundamental Research Program (the 973 Project) under No. G1998030504, and the National 863 High-tech Project under No. 863-306-ZD03-01-1.

Author information

Authors and Affiliations

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, 100080, Beijing, P.R. China
Hong Zhang, Bo Xu & Taiyi Huang

Editor information

Editors and Affiliations

Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, 100080, Beijing, China
Tieniu Tan
Computer Department, Media Laboratory, Tsinghua University, 100084, Beijing, China
Yuanchun Shi
Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, 100080, Beijing, China
Wen Gao

About this paper

Cite this paper

Zhang, H., Xu, B., Huang, T. (2000). Statistical Analysis of Chinese Language and Language Modeling Based on Huge Text Corpora. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_37

DOI: https://doi.org/10.1007/3-540-40063-X_37
Published: 26 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41180-2
Online ISBN: 978-3-540-40063-9
eBook Packages: Springer Book Archive
