Employing query disambiguation using clustering techniques

Abstract

Owing to the rapid expansion of the Web over the last decade, the research community has paid significant attention to the problem of searching effectively within the vast amount of available information. In this paper, we introduce a novel framework for improving information retrieval results. First, the relevant documents are organized into clusters using several metrics combined with language modelling tools. Next, a ranked list of the documents is produced and returned to the user for a specific query. To build this list, the scores between the cluster representations and the query representation are computed; the internal rankings of the documents in each cluster are then combined, using these scores as weighting factors. The proposed methodology exploits inter-document similarities (lexical and/or semantic) after a sophisticated pre-processing step. Our experimental evaluation demonstrates that the proposed algorithm can efficiently improve the quality of the retrieved results.
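The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the metrics, the language models, or the exact combination rule, so the sketch assumes plain term-frequency vectors, cosine similarity for both the cluster-to-query and document-to-query scores, clusters that are already given, and a simple product as the weighting. The names `cosine` and `rank_with_clusters` and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_with_clusters(query, docs, clusters):
    """Rank documents by (cluster-query score) * (document-query score).

    query    -- query string
    docs     -- dict mapping doc id -> text
    clusters -- dict mapping cluster id -> list of doc ids
                (assumed to be precomputed by some clustering step)
    """
    q = Counter(query.lower().split())
    vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}
    scored = []
    for members in clusters.values():
        # Assumption: a cluster is represented by the summed term counts
        # of its member documents.
        centroid = Counter()
        for d in members:
            centroid.update(vectors[d])
        cluster_score = cosine(q, centroid)
        # Weight each document's own score by the score of its cluster.
        for d in members:
            scored.append((cluster_score * cosine(q, vectors[d]), d))
    # Highest combined score first; ties are broken by doc id.
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))]


# Toy data (hypothetical): an ambiguous query term with two senses.
docs = {
    "d1": "jaguar car engine speed",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle habitat",
    "d4": "jungle cat prey habitat",
}
clusters = {"cars": ["d1", "d2"], "animals": ["d3", "d4"]}
ranking = rank_with_clusters("jaguar car price", docs, clusters)
print(ranking)
```

In this toy setting, documents in the cluster closest to the query are promoted as a group, which is the intuition behind using cluster scores as weighting factors for an ambiguous query.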

Notes

  1. Google: https://www.google.com/search/about/.

  2. A snippet is usually a short text summarizing the context in which the query words appear in the result page.

  3. http://www.nltk.org/howto/wordnet.html.

  4. http://opennlp.sourceforge.net/models-1.5/.

  5. http://sourceforge.net/projects/jWordNet/.

  6. https://wordnet.princeton.edu/.

  7. http://lemurproject.org/clueweb09/.

Author information

Correspondence to Andreas Kanavos.

About this article

Cite this article

Kanavos, A., Kotoula, P., Makris, C. et al. Employing query disambiguation using clustering techniques. Evolving Systems 11, 305–315 (2020). https://doi.org/10.1007/s12530-019-09292-7

Keywords

  • Query disambiguation
  • Information retrieval
  • Query reformulation
  • Clustering
  • Containment
  • Semantics