Chinese Information Retrieval System Based on Vector Space Model
This paper designs and implements a Chinese information retrieval system based on the vector space model. The system can quickly search a collection of 50,000 news documents, helping users find the information they need in a short time for use in their work and daily life. The pipeline begins with data capture (the data acquisition module), which collects the raw data and forms the documents. Each document is then preprocessed: Chinese word segmentation is applied, stop words are removed during segmentation, feature terms are extracted, and the document is represented as a vector that is stored on disk. At the same time, an inverted index is built over the document terms and also stored on disk. At query time, the user's query is treated as a document and preprocessed in the same way; the index is then read, the related documents are retrieved, and their vector similarity to the query is calculated. Finally, the results are sorted by the calculated scores. In the experimental results and testing section, several examples are used to analyze recall and precision, and one practical problem is solved by an interpolation method.
Keywords: Chinese information retrieval; Vector space model; Rule execution; Data acquisition
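The retrieval steps summarized in the abstract (stop word removal, term weighting, inverted indexing, and cosine-similarity ranking) can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the text has already been segmented into whitespace-separated words, keeps everything in memory rather than on disk, and uses a common TF-IDF weighting that the abstract does not specify. The stop word list, the toy documents, and the function names `preprocess`, `build_index`, and `search` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop word list; a real system would load a much larger one.
STOP_WORDS = {"的", "了", "在", "是"}


def preprocess(text):
    """Split pre-segmented text on whitespace and drop stop words.

    Assumption: word segmentation has already been done, so words are
    separated by spaces. A real system would call a segmenter here.
    """
    return [term for term in text.split() if term not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index, an IDF table, and document vector norms."""
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf

    n_docs = len(docs)
    # Assumed weighting: idf = log(N / df). Other variants are possible.
    idf = {term: math.log(n_docs / len(postings))
           for term, postings in index.items()}

    squared = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    norms = {doc_id: math.sqrt(total) for doc_id, total in squared.items()}
    return index, idf, norms


def search(query, index, idf, norms):
    """Treat the query as a document and rank matches by cosine similarity."""
    query_tf = Counter(preprocess(query))
    # Terms that never occur in the collection are ignored.
    query_vec = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))

    # Only documents sharing at least one query term are scored,
    # which is the point of reading the inverted index first.
    dot = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, tf in index[term].items():
            dot[doc_id] += q_weight * tf * idf[term]

    results = []
    for doc_id, product in dot.items():
        denom = query_norm * norms[doc_id]
        if denom > 0:
            results.append((doc_id, product / denom))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy collection standing in for the news documents.
docs = {
    "d1": "北京 的 天气 晴朗",
    "d2": "上海 股市 上涨",
    "d3": "北京 股市 在 下跌",
}
index, idf, norms = build_index(docs)
results = search("北京 天气", index, idf, norms)
print(results)
```

In this toy collection, `d1` shares both query terms and ranks first, `d3` shares only one and ranks second, and `d2` shares none and is never scored. Storing the per-document norms at indexing time means a query only has to touch the postings of its own terms.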