How to find the nearest by evaluating only few? Clustering techniques used to improve the efficiency of an Information Retrieval system based on Distributional Semantics
The first objective of this contribution is to give a description of our textual information retrieval system based on distributional semantics. The central idea of the approach is to represent the retrievable units and the user queries in a unified way as projections in a vector space of pertinent terms. The projections are derived from a co-occurrence matrix computed on large reference (textual) corpora collecting the distributional semantic information. A similarity computation based on the cosine measure is then used to characterize the semantic proximity between queries and documents.
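The representation and matching described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the abstract does not specify how projections are computed, so the sketch assumes that each word's profile counts its co-occurrences with the pertinent terms inside the same context unit, and that a document or query is projected as the frequency-weighted sum of its words' profiles. All function names are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_profiles(corpus, pertinent_terms):
    """Map each word to its co-occurrence counts with the pertinent terms.

    Assumption: two words co-occur when they appear in the same context
    unit (here, one string of the reference corpus).
    """
    index = {term: i for i, term in enumerate(pertinent_terms)}
    profiles = {}
    for unit in corpus:
        tokens = unit.lower().split()
        for pos, word in enumerate(tokens):
            row = profiles.setdefault(word, [0.0] * len(pertinent_terms))
            for j, other in enumerate(tokens):
                if j != pos and other in index:
                    row[index[other]] += 1.0
    return profiles


def project(text, profiles, dim):
    """Project a document or a query: weighted sum of its words' profiles."""
    vec = [0.0] * dim
    for word, freq in Counter(text.lower().split()).items():
        # Words unseen in the reference corpus contribute nothing.
        for i, value in enumerate(profiles.get(word, ())):
            vec[i] += freq * value
    return vec


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is null."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents, profiles, dim):
    """Score every document against the query, best match first."""
    q = project(query, profiles, dim)
    scored = [(cosine(q, project(d, profiles, dim)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because queries and documents are compared through co-occurrence profiles rather than shared words, a query can match a document with which it has no term in common, provided their words occur in similar contexts in the reference corpus.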
Retrieval effectiveness can be further improved by the use of relevance feedback techniques. A simple feedback method, in which document relevance judgments are interactively integrated into the original query, will also be presented and evaluated.
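As an illustration of how relevance judgments might be folded into a query, here is a minimal sketch. The abstract does not give the feedback formula, so this assumes a Rocchio-style positive update (the query vector plus a weighted centroid of the documents the user marked relevant); the function name and the weights `alpha` and `beta` are hypothetical.

```python
def feedback_query(query_vec, relevant_vecs, alpha=1.0, beta=0.5):
    """Return a new query vector enriched with the judged-relevant documents.

    Assumption: positive feedback only, using the centroid of the relevant
    document vectors. With no judgments, the query is returned unchanged.
    """
    if not relevant_vecs:
        return list(query_vec)
    n = len(relevant_vecs)
    # Component-wise mean of the relevant document vectors.
    centroid = [sum(column) / n for column in zip(*relevant_vecs)]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]
```

The updated vector can then be matched against the documents with the same cosine measure as the original query.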
Although our first experiments led to quite promising results, one major drawback of our IR system in its original form is that answering a query requires computing the similarity between that query and every document in the textual base. Therefore, the second objective of this contribution is to investigate how clustering techniques can be applied to the textual database so that the documents satisfying a query can be retrieved through a partial exploration of the base. A tentative solution based on hierarchical clustering will be suggested.
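The idea of partial exploration can be sketched as follows. This is only one possible reading, not the solution proposed in the paper: it assumes a binary tree built by agglomerative clustering on cluster centroids, and a greedy descent that compares the query only with the two child centroids at each level. The names `Node`, `build_tree` and `search` are hypothetical, and the document collection is assumed to be non-empty.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is null."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class Node:
    """A cluster: a leaf holds one document, an inner node two sub-clusters."""

    def __init__(self, centroid, size, doc_id=None, children=()):
        self.centroid = centroid
        self.size = size
        self.doc_id = doc_id
        self.children = children


def build_tree(doc_vectors):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    nodes = [Node(list(v), 1, doc_id=i) for i, v in enumerate(doc_vectors)]
    while len(nodes) > 1:
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                sim = cosine(nodes[a].centroid, nodes[b].centroid)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        left, right = nodes[a], nodes[b]
        size = left.size + right.size
        # Size-weighted mean, i.e. the centroid of all documents below.
        centroid = [(left.size * x + right.size * y) / size
                    for x, y in zip(left.centroid, right.centroid)]
        merged = Node(centroid, size, children=(left, right))
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]


def search(root, query):
    """Greedy descent; returns (doc_id, number of similarity evaluations)."""
    node, evaluations = root, 0
    while node.children:
        scored = [(cosine(query, child.centroid), child)
                  for child in node.children]
        evaluations += len(scored)
        node = max(scored, key=lambda pair: pair[0])[1]
    return node.doc_id, evaluations
```

The descent evaluates two similarities per level, so the cost grows with the depth of the tree rather than with the size of the base. The price is that the search is approximate: a greedy descent can miss the true nearest document when it lies in a branch whose centroid looked less similar.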
Keywords: Information Retrieval, Indexing Structure, Relevance Feedback, Information Retrieval System, Original Query