A new indexing method based on word proximity for Chinese text retrieval

Journal of Computer Science and Technology

Abstract

This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, Chinese retrieval systems use either character-based or word-based indexing. Character-based methods, such as bi-gram or tri-gram indexing, suffer from a high false-drop rate because of mismatches between queries and documents. Word-based methods, on the other hand, have difficulty efficiently identifying all proper nouns, domain-specific terminology, and phrases. The new indexing method uses both the proximity and the mutual information of word pairs to represent text content, and so addresses the high false-drop, new-word, and phrase problems of character-based and word-based systems. Evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
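The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their formulas, so the window size, the function names (`char_bigrams`, `proximity_pairs`, `pair_pmi`), and the use of standard pointwise mutual information are all assumptions made for illustration. The input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Overlapping character bi-grams, the unit of character-based indexing."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def proximity_pairs(words, window=3):
    """Ordered word pairs (w1, w2, distance), with w2 at most `window`
    positions after w1. The window size is an assumed parameter."""
    pairs = []
    for i, w1 in enumerate(words):
        for d in range(1, window + 1):
            if i + d < len(words):
                pairs.append((w1, words[i + d], d))
    return pairs


def pair_pmi(docs, window=3):
    """Pointwise mutual information of each co-occurring word pair.

    Uses the textbook definition log2(P(a, b) / (P(a) * P(b))), which is
    an assumption here, not necessarily the paper's weighting.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for words in docs:
        word_counts.update(words)
        pair_counts.update((a, b) for a, b, _ in proximity_pairs(words, window))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# A false drop under bi-gram indexing: the query "华人" matches a bi-gram
# of "中华人民共和国", although the text segments as 中华 / 人民 / 共和国.
print("华人" in char_bigrams("中华人民共和国"))

# Word pairs with their distances, and PMI scores over a toy collection.
docs = [["信息", "检索", "系统"], ["信息", "检索"]]
print(proximity_pairs(docs[0], window=2))
print(pair_pmi(docs, window=1))
```

Indexing word pairs together with their distance keeps word-order and adjacency evidence that a plain bag of words discards, while the mutual-information score gives a way to favor pairs that co-occur more often than chance.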

Additional information

This work was supported by the National ‘863’ High-Tech Programme of China under Grant No.863-306-ZD-10-21, and the National Natural Science Foundation of China under Grant No.69983009.

DU Lin was born in 1965. He received the B.S. degree from Chongqing University in 1990 and the Ph.D. degree in computer science from the Institute of Software, Chinese Academy of Sciences, in 1999. Since 1995, he has been working on Chinese information retrieval.

SUN Yufang was born in 1947. He received the M.S. degree from the Institute of Software, Chinese Academy of Sciences, in 1983. Since 1985, he has been working on Chinese information processing.

Cite this article

Du, L., Sun, Y. A new indexing method based on word proximity for Chinese text retrieval. J. Comput. Sci. & Technol. 15, 280–286 (2000). https://doi.org/10.1007/BF02948815
