Journal of Computer Science and Technology, Volume 15, Issue 3, pp 280–286

A new indexing method based on word proximity for Chinese text retrieval

  • Du Lin 
  • Sun Yufang 
Article

Abstract

This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. Character-based indexing methods, such as bi-gram or tri-gram indexing, suffer from high false-drop rates because of mismatches between queries and documents. Word-based indexing systems, on the other hand, have difficulty efficiently identifying all the proper nouns, domain-specific terminology, and phrases. The new indexing method uses both the proximity and the mutual information of word pairs to represent the text content, so as to overcome the high false-drop, new-word, and phrase problems that exist in character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
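The abstract does not give the exact formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the text has already been segmented into tokens, treats two words as a "proximity pair" when they occur within a fixed window, and scores each pair with standard pointwise mutual information. The function names `proximity_pairs` and `pair_pmi`, the `window` parameter, and the choice of pointwise mutual information are all assumptions made for illustration.

```python
import math
from collections import Counter


def proximity_pairs(tokens, window=3):
    """Yield unordered word pairs whose positions differ by at most `window`.

    Hypothetical helper: the window size is an assumed parameter,
    not a value taken from the paper.
    """
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            right = tokens[j]
            if left != right:
                # Sort so that (a, b) and (b, a) count as the same pair.
                yield tuple(sorted((left, right)))


def pair_pmi(documents, window=3):
    """Score every proximity pair by pointwise mutual information (PMI).

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), estimated from raw counts.
    This is the textbook definition, used here as a stand-in for whatever
    mutual-information measure the paper actually applies.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in documents:
        word_counts.update(tokens)
        pair_counts.update(proximity_pairs(tokens, window))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


if __name__ == "__main__":
    # Two tiny pre-segmented "documents" (illustrative data only).
    docs = [["信息", "检索", "系统"], ["信息", "检索", "方法"]]
    for pair, score in sorted(pair_pmi(docs, window=1).items()):
        print(pair, round(score, 3))
```

In a sketch like this, the pairs with high scores would serve as index terms alongside or instead of single words, which is one plausible way to capture phrases and new words without relying on a complete dictionary.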

Keywords

information retrieval, vector space model, automatic indexing, proximity-based indexing


Copyright information

© Science Press, Beijing China and Allerton Press Inc. 2000

Authors and Affiliations

  • Du Lin (1)
  • Sun Yufang (1)
  1. Open System & Chinese Information Processing Center, Institute of Software, Chinese Academy of Sciences, Beijing, P.R. China
