# Inferring Probability of Relevance Using the Method of Logistic Regression

## Abstract

This research evaluates a model for probabilistic text and document retrieval. The model uses *logistic regression* to obtain equations that rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it *logistic inference*. By transforming the distribution of each statistical clue into its standardized distribution (one with mean μ = 0 and standard deviation σ = 1), the method allows one to apply logistic coefficients derived from a training collection to other document collections with little loss of predictive power. The model is applied to three well-known information retrieval test collections, and the results are compared directly to the particular vector space model of retrieval that uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference method performs significantly better than the tfidf/cosine vector space model in two collections and equally well in the third. The differences in performance between the two models were subjected to statistical tests to determine whether they are statistically significant or could have occurred by chance.
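The ranking idea described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the clue values, the intercept, and the coefficients below are hypothetical, and the standardization is computed over the small candidate set passed in rather than over a full collection. The function names (`standardize`, `probability_of_relevance`, `rank_documents`) are likewise invented for this sketch.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Map raw clue values to z-scores (mean 0, standard deviation 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant clue carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def probability_of_relevance(z_clues, intercept, coefficients):
    """Logistic model: P = 1 / (1 + exp(-(c0 + sum(c_i * z_i))))."""
    log_odds = intercept + sum(c * z for c, z in zip(coefficients, z_clues))
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank_documents(raw_clues, intercept, coefficients):
    """Rank documents for one query.

    raw_clues holds one row of clue values per document.
    Returns (document indices ordered by decreasing probability, scores).
    """
    columns = list(zip(*raw_clues))                    # one column per clue
    z_columns = [standardize(list(col)) for col in columns]
    z_rows = list(zip(*z_columns))                     # back to one row per document
    scores = [
        probability_of_relevance(row, intercept, coefficients) for row in z_rows
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical data: three documents, two clues each.
raw_clues = [[3, 0.2], [1, 0.1], [5, 0.9]]
# Hypothetical coefficients, standing in for values fitted on a training collection.
order, scores = rank_documents(raw_clues, intercept=-1.0, coefficients=[1.2, 0.8])
```

Because the coefficients are applied to z-scores rather than raw clue values, the same coefficient vector can in principle be reused on a collection whose raw clue distributions differ, which is the transfer property the abstract describes.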

## Keywords

Logistic Regression, Information Retrieval, Vector Space Model, Logistic Inference, Test Collection
