Query expansion using pseudo relevance feedback on wikipedia

Journal of Intelligent Information Systems

Abstract

One of the major challenges in Web search is correctly interpreting users’ intent. Query expansion is a well-known approach for determining user intent by addressing the vocabulary mismatch problem. A limitation of current query expansion approaches is that the relations between the query terms and the expanded terms are limited. In this paper, we capture users’ intent through query expansion. We build on earlier work in the area by adopting a pseudo-relevance feedback approach; however, we advance the state of the art by proposing an approach for feature learning within the query expansion process. Specifically, we use the Wikipedia corpus as the feedback collection and identify the best features in this context for term selection in both supervised and unsupervised models. We compare our work with state-of-the-art query expansion techniques; the results show promising robustness and improved precision.
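
For readers unfamiliar with pseudo-relevance feedback, the general idea can be sketched as follows: run the original query, treat the top-ranked documents as if they were relevant, and add the highest-scoring terms from those documents to the query. The sketch below is a generic illustration only and is not the feature-learning method proposed in the paper. The function names (tokenize, rank, expand_query), the term-overlap ranking, the TF-IDF term score, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def rank(query_terms, docs):
    """Return indices of matching documents, best first.

    Assumption: a document's score is simply how often query terms occur in it.
    """
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(tokenize(doc))
        scored.append((sum(tf[t] for t in query_terms), i))
    ordered = sorted(scored, key=lambda p: (-p[0], p[1]))
    return [i for score, i in ordered if score > 0]


def expand_query(query, docs, k=1, n_terms=2):
    """Append the n_terms best TF-IDF terms from the top-k feedback documents."""
    q_terms = tokenize(query)
    feedback = rank(q_terms, docs)[:k]  # pseudo-relevant documents

    # Document frequency of every term over the whole collection.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))

    # Score candidate terms found in the feedback documents.
    scores = Counter()
    for i in feedback:
        for term, tf in Counter(tokenize(docs[i])).items():
            if term not in q_terms:
                scores[term] += tf * math.log(len(docs) / df[term])

    best = sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:n_terms]
    return q_terms + [term for term, _ in best]


# Hypothetical toy collection standing in for a feedback corpus.
docs = [
    "jaguar cat habitat rainforest cat",
    "jaguar car engine speed",
    "python snake habitat",
    "stock market news",
]
print(expand_query("jaguar cat", docs))
# → ['jaguar', 'cat', 'rainforest', 'habitat']
```

In this toy run the expansion terms come from the single top-ranked document, so the ambiguous query is pushed toward its animal sense. Raising k lets lower-ranked documents contribute terms as well, which is where choosing good term-selection features starts to matter.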

Notes

  1. Wikipedia articles https://en.wikipedia.org/wiki/Obama_Family and https://en.wikipedia.org/wiki/Family_tree respectively.

  2. https://en.wikipedia.org/wiki/Hotel_California_(disambiguation)

  3. https://en.wikipedia.org/wiki/Land_mine_(disambiguation)

Author information

Corresponding author

Correspondence to Ebrahim Bagheri.

About this article

Cite this article

Keikha, A., Ensan, F. & Bagheri, E. Query expansion using pseudo relevance feedback on wikipedia. J Intell Inf Syst 50, 455–478 (2018). https://doi.org/10.1007/s10844-017-0466-3

  • DOI: https://doi.org/10.1007/s10844-017-0466-3
