Chinese Information Retrieval System Based on Vector Space Model

  • Jianfeng Tang
  • Jie Huang
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 905)

Abstract

This paper designs and implements a Chinese information retrieval system based on the vector space model. The system can quickly search a collection of 50 thousand news documents, helping users find the information they need in a short time to support their work and daily life. The pipeline starts with the data acquisition module, which captures the data and forms documents. Each document is then preprocessed: Chinese word segmentation, stop word removal (handled during segmentation), feature word extraction, and vector representation. The resulting document vectors are stored on disk. At the same time, an inverted index over the document terms is built and also stored on disk. When a user submits a query, the query statement is treated as a document and preprocessed in the same way; the index is then read, and the vector similarity between the query and each related document is calculated. Finally, the results are ranked by the calculated scores. In the experimental results and testing section, several examples are used to analyze recall and precision, and one practical problem is solved with an interpolation method.
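The retrieval flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been segmented into space-separated words (a real system would run a Chinese word segmenter first), uses a tiny hypothetical stop word list, keeps everything in memory rather than on disk, and picks one common TF-IDF weighting with cosine similarity, which the abstract does not specify. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny stop word list; a real system would load a full one.
STOP_WORDS = {"的", "了", "是"}


def preprocess(text):
    """Split pre-segmented text on whitespace and drop stop words."""
    return [w for w in text.split() if w not in STOP_WORDS]


def to_vector(terms, idf):
    """Build a length-normalized TF-IDF vector (as a dict) over known terms."""
    tf = Counter(t for t in terms if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def build_index(docs):
    """Return (inverted index, IDF table, document vectors) for a corpus."""
    tokenized = [preprocess(d) for d in docs]
    inverted = defaultdict(set)  # term -> set of document ids
    for doc_id, terms in enumerate(tokenized):
        for t in terms:
            inverted[t].add(doc_id)
    n = len(docs)
    # Assumed weighting: idf = ln(N / df) + 1, so shared terms keep some weight.
    idf = {t: math.log(n / len(ids)) + 1.0 for t, ids in inverted.items()}
    vectors = [to_vector(terms, idf) for terms in tokenized]
    return inverted, idf, vectors


def search(query, inverted, idf, vectors):
    """Treat the query as a document and rank candidates by cosine similarity."""
    q = to_vector(preprocess(query), idf)
    # The inverted index limits scoring to documents sharing a query term.
    candidates = set()
    for t in q:
        candidates |= inverted[t]
    scored = []
    for doc_id in candidates:
        # Both vectors are unit length, so the dot product is the cosine.
        score = sum(w * vectors[doc_id].get(t, 0.0) for t, w in q.items())
        scored.append((doc_id, score))
    # Highest score first; ties broken by document id for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Calling `build_index` once on the corpus and then `search` for each query mirrors the two phases in the abstract: offline preprocessing and indexing, then online query processing and ranking. Queries whose terms never appear in the corpus return an empty list.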

Keywords

Chinese information retrieval, Vector space model, Rule execution, Data acquisition

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Software Engineering, Tongji University, Shanghai, China
