A new indexing method based on word proximity for Chinese text retrieval
This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. Character-based indexing methods, such as bi-gram or tri-gram indexing, suffer from a high rate of false drops caused by mismatches between queries and documents. Word-based indexing systems, on the other hand, have difficulty efficiently identifying all the proper nouns, domain-specific terminology, and phrases. The new indexing method uses both the proximity and the mutual information of word pairs to represent the text content, so as to overcome the high false drop, new word, and phrase problems that exist in character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
Keywords: information retrieval, vector space model, automatic indexing, proximity-based indexing
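The abstract does not give the formulas used, so the following is only a minimal sketch of the general idea of scoring word pairs by proximity and mutual information. It assumes the text is already segmented into tokens, treats two words as a pair when they occur within a fixed window, and uses standard pointwise mutual information as the score. The function names `proximity_pairs` and `pair_pmi`, the `window` parameter, and the probability estimates are all illustrative assumptions, not the paper's actual definitions.

```python
import math
from collections import Counter


def proximity_pairs(tokens, window=3):
    """Yield ordered (left, right) token pairs at most `window` positions apart.

    `window` is a hypothetical parameter; the source does not state one.
    """
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            yield left, tokens[j]


def pair_pmi(docs, window=3):
    """Score each proximity pair with pointwise mutual information (assumed).

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), where P(a, b) is estimated
    from pair counts and P(a), P(b) from token counts over all documents.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in docs:
        word_counts.update(tokens)
        pair_counts.update(proximity_pairs(tokens, window))

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


if __name__ == "__main__":
    # Toy, pre-segmented documents (illustrative data only).
    docs = [["信息", "检索", "系统"], ["信息", "检索"]]
    for pair, score in pair_pmi(docs, window=1).items():
        print(pair, round(score, 3))
```

In a sketch like this, pairs with a high score could serve as index terms alongside single words, which is one plausible way to capture phrases and new words without requiring a complete dictionary.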