Statistical Augmentation of a Chinese Machine-Readable Dictionary

  • P. Fung
  • D. Wu
Part of the Text, Speech and Language Technology book series (TLTB, volume 11)


We describe a method that augments a Chinese dictionary with character groups collected statistically from a corpus. The method is particularly useful for extracting domain-specific and regional words that are not readily available in machine-readable dictionaries. We evaluated the output both with human evaluators and against a previously available dictionary, and we also measured the resulting improvement in automatic Chinese tokenization. Results show that the method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were missing from the original dictionary.
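As a rough illustration of the idea in the abstract, the sketch below collects recurring character groups from a toy corpus, keeps those missing from a dictionary, and shows how adding one of them changes a simple tokenizer's output. The abstract does not say which statistic or tokenizer the authors used, so raw n-gram frequency and greedy forward maximum matching are assumptions here, and the function names, thresholds, and sample sentences are all hypothetical.

```python
from collections import Counter


def candidate_groups(corpus, dictionary, min_len=2, max_len=4, min_count=2):
    """Return character n-grams that recur in the corpus but are not in the dictionary.

    Assumption: raw frequency stands in for whatever statistic the chapter uses.
    """
    counts = Counter()
    for sentence in corpus:
        for n in range(min_len, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {g: c for g, c in counts.items()
            if c >= min_count and g not in dictionary}


def max_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching (an assumed tokenizer, not the authors').

    Characters not covered by any dictionary entry become one-character tokens.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in dictionary:
                tokens.append(piece)
                i += n
                break
    return tokens


# Hypothetical toy data: the dictionary lacks the regional term "立法局".
corpus = ["立法局今天开会", "立法局主席发言"]
dictionary = {"今天", "开会", "主席", "发言"}

candidates = candidate_groups(corpus, dictionary)
augmented = dictionary | {"立法局"}

print(candidates)
print(max_match(corpus[0], dictionary))
print(max_match(corpus[0], augmented))
```

On this toy input the candidates include the fragments "立法" and "法局" alongside "立法局", which suggests why raw frequency alone overgenerates and why some evaluation or filtering step, such as the human and dictionary-based evaluation the abstract mentions, is needed before entries are added.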


Keywords: Lexical Item, Chinese Word, Word Boundary, Word Segmentation, Unknown Word





Copyright information

© Springer Science+Business Media Dordrecht 1999
