Abstract
Relevance-based language models operate by estimating the probabilities of observing words in documents relevant (or pseudo-relevant) to a topic. However, these models assume that if a document is relevant to a topic, then all tokens in the document are relevant to that topic, which can limit model robustness and effectiveness. In this study, we propose a Latent Dirichlet relevance model that relaxes this assumption. Our approach derives from current research on Latent Dirichlet Allocation (LDA) topic models. LDA has been extensively explored, especially for discovering a set of topics from a corpus. LDA itself, however, has a limitation that is also addressed in our work: topics generated by LDA from a corpus are synthetic, i.e., they do not necessarily correspond to topics identified by humans for the same corpus. In contrast, our model explicitly considers the relevance relationships between documents and given topics (queries). Unlike standard LDA, our model is therefore directly applicable to goals such as relevance feedback for query modification and text classification, where topics (classes and queries) are provided upfront. Thus, although the focus of our paper is on improving relevance-based language models, our approach in effect bridges relevance-based language models and LDA, addressing limitations of both.
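The estimation step the abstract describes — building a word distribution for a topic from (pseudo-)relevant documents — can be sketched as follows. This is a minimal, illustrative implementation of the classical relevance model (in the style of Lavrenko and Croft's RM1) that the paper builds on, not the proposed Latent Dirichlet relevance model itself; the Dirichlet smoothing parameter and function names are assumptions for the sketch.

```python
from collections import Counter

def relevance_model(query, docs, mu=2000):
    """Estimate P(w | R) in RM1 fashion:
    P(w|R) is proportional to sum_D P(w|D) * P(Q|D), where each document
    language model P(.|D) is Dirichlet-smoothed against the collection,
    and docs is the set of (pseudo-)relevant documents for the query."""
    # Collection statistics used for smoothing
    coll = Counter()
    for d in docs:
        coll.update(d)
    coll_len = sum(coll.values())

    def p_w_given_d(w, tf, dlen):
        # Dirichlet-smoothed document language model P(w|D)
        return (tf.get(w, 0) + mu * coll[w] / coll_len) / (dlen + mu)

    rm = Counter()
    for d in docs:
        tf, dlen = Counter(d), len(d)
        # Query likelihood P(Q|D) under the same smoothed model;
        # note every token in D contributes, which is exactly the
        # all-tokens-are-relevant assumption the paper relaxes.
        p_q = 1.0
        for q in query:
            p_q *= p_w_given_d(q, tf, dlen)
        for w in coll:
            rm[w] += p_w_given_d(w, tf, dlen) * p_q

    total = sum(rm.values())
    return {w: p / total for w, p in rm.items()}
```

Because every token of a relevant document contributes to `rm`, off-topic tokens in relevant documents leak into the topic model — the limitation that motivates the Latent Dirichlet relevance model proposed in the paper.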
© 2009 Springer-Verlag Berlin Heidelberg
Ha-Thuc, V., Srinivasan, P. (2009). A Latent Dirichlet Framework for Relevance Modeling. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_2
DOI: https://doi.org/10.1007/978-3-642-04769-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04768-8
Online ISBN: 978-3-642-04769-5
eBook Packages: Computer Science (R0)