Statistical Analysis of Chinese Language and Language Modeling Based on Huge Text Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1948))

Abstract

This paper presents the statistical characteristics of the Chinese language based on huge text corpora. From our investigation, we find that written Chinese is more likely to use long words, while in other language styles the words are shorter. In large text corpora, the number of bigrams and trigrams can be estimated from the size of the corpus. In the recognition experiments, we find that the correlation between perplexity and either the size of the training set or the recognition character error rate is weak. However, to attain good performance, a large training set of more than tens of millions of words is necessary.
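As a rough illustration of two quantities named in the abstract, the sketch below counts how many distinct n-grams have been seen as a corpus grows and computes the perplexity of a bigram model on held-out text. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, tokens are plain Python strings, and add-one smoothing is used only for brevity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token sequence as a list of tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngram_growth(tokens, n, step):
    """Record (n-grams read, distinct n-grams seen) every `step` n-grams.

    Hypothetical helper: it only shows how the number of distinct
    n-grams can be tracked against corpus size.
    """
    seen = set()
    growth = []
    for i, gram in enumerate(ngrams(tokens, n), start=1):
        seen.add(gram)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth


def bigram_perplexity(train, test, vocab_size):
    """Perplexity of `test` under an add-one-smoothed bigram model.

    Assumption: add-one smoothing is chosen for brevity and is not
    claimed to be the smoothing used in the paper.
    """
    history = Counter(train[:-1])        # how often each token starts a bigram
    bigram = Counter(ngrams(train, 2))   # bigram counts from the training text
    pairs = ngrams(test, 2)
    log_prob = 0.0
    for prev, word in pairs:
        p = (bigram[(prev, word)] + 1) / (history[prev] + vocab_size)
        log_prob += math.log2(p)
    # Perplexity is 2 raised to the average negative log2 probability.
    return 2 ** (-log_prob / len(pairs))
```

Calling `distinct_ngram_growth(tokens, 2, 100000)` on a word-segmented corpus gives the points one would plot to see how the distinct bigram count scales with corpus size, and `bigram_perplexity(train, test, vocab_size)` gives the figure that would be compared against a recognizer's character error rate.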

The work described in the paper is funded by the National Key Fundamental Research Program (the 973 Project) under No. G1998030504, and the National 863 High-tech Project under No. 863-306-ZD03-01-1.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, H., Xu, B., Huang, T. (2000). Statistical Analysis of Chinese Language and Language Modeling Based on Huge Text Corpora. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_37

  • DOI: https://doi.org/10.1007/3-540-40063-X_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41180-2

  • Online ISBN: 978-3-540-40063-9

  • eBook Packages: Springer Book Archive
