Statistical Augmentation of a Chinese Machine-Readable Dictionary
We describe a method that augments a Chinese dictionary with character groups collected statistically from a corpus. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. We evaluated the output both with human evaluators and against a previously available dictionary, and we also measured the resulting improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, and technical compounds, many of which were missing from the original dictionary.
Keywords: Lexical Item, Chinese Word, Word Boundary, Word Segmentation, Unknown Word
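The general idea in the abstract, collecting recurring character groups from a corpus and keeping those the dictionary lacks, can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `candidate_words`, the parameters `max_len` and `min_count`, and the use of raw frequency as the only statistic are assumptions made for illustration.

```python
import re
from collections import Counter

# Runs of CJK Unified Ideographs; punctuation and Latin text break a run.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def candidate_words(corpus_lines, dictionary, max_len=4, min_count=2):
    """Return (group, count) pairs for character groups that recur in the
    corpus but are absent from the dictionary (hypothetical helper)."""
    counts = Counter()
    for line in corpus_lines:
        for run in CJK_RUN.findall(line):
            # Count every contiguous group of 2..max_len characters.
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[run[i:i + n]] += 1
    # Keep frequent groups the dictionary does not already contain.
    kept = [(g, c) for g, c in counts.items()
            if c >= min_count and g not in dictionary]
    # Most frequent first; ties broken by the group string for stable output.
    return sorted(kept, key=lambda gc: (-gc[1], gc[0]))

if __name__ == "__main__":
    corpus = ["香港特区政府今天宣布", "特区政府表示", "市民支持特区政府"]
    known = {"政府", "今天", "市民"}
    for group, count in candidate_words(corpus, known, min_count=3):
        print(group, count)
```

Raw frequency alone also surfaces fragments that straddle word boundaries (here, 区政), so any practical use of a sketch like this would need further filtering or the kind of human and dictionary-based evaluation the abstract mentions.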