Constructing Thesaurus Using TAG Term Weight for Query Expansion in Information Retrieval Application



In information retrieval applications, query expansion is an important procedure for improving retrieval precision. This chapter discusses an N-gram Thesaurus that is generated from the content of web documents and used for expanding queries. The TAGs of HTML pages are parsed, and the text within each TAG is assigned a weight based on the nature of the TAG. The total weight of a text is calculated as the sum of its TAG weight and its frequency of occurrence. The Unigram Thesaurus is updated with single terms or texts. Similarly, the N-gram Thesaurus is updated with N-term texts, each stored along with its total weight. Given a query, its term(s) are looked up in the corresponding Thesaurus to obtain a set of predicted queries. The set is ordered by total weight, and the user selects any of the term(s) as a preference. The benchmark datasets Clueweb09B, WT10g and GOV2 are used for the experiments. A threshold value is fixed as the baseline. The proposed approach gains 8, 19 and 30% on Clueweb09B, WT10g and GOV2, respectively. In addition, KLDCo and BoCo are used as benchmarks for evaluating the performance of the presented approach in terms of query refinement. Both MAP and MRR are higher than the baseline.
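
The pipeline summarised in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the TAG weights in `TAG_WEIGHTS`, the helper names (`TagTextParser`, `build_thesaurus`, `suggest`), the choice of the highest TAG weight seen for a text, and the prefix-based lookup are all hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical TAG weights, chosen only for illustration (not the chapter's values).
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "p": 1.0}


class TagTextParser(HTMLParser):
    """Collect (tag, text) pairs for text found inside weighted TAGs."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.segments = []  # (innermost weighted tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing TAG that has a weight.
        for tag in reversed(self.stack):
            if tag in TAG_WEIGHTS:
                self.segments.append((tag, text))
                break


def build_thesaurus(pages, n):
    """Map each n-gram to total weight = TAG weight + frequency of occurrence.

    Assumption: when an n-gram occurs under several TAGs, the highest
    TAG weight seen is used as its TAG weight.
    """
    tag_weight = defaultdict(float)
    frequency = defaultdict(int)
    for html in pages:
        parser = TagTextParser()
        parser.feed(html)
        parser.close()
        for tag, text in parser.segments:
            tokens = text.lower().split()
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                tag_weight[gram] = max(tag_weight[gram], TAG_WEIGHTS[tag])
                frequency[gram] += 1
    return {gram: tag_weight[gram] + frequency[gram] for gram in frequency}


def suggest(query, thesaurus, top_k=5):
    """Return n-grams that start with the query terms, ordered by total weight."""
    terms = tuple(query.lower().split())
    hits = [(gram, weight) for gram, weight in thesaurus.items()
            if gram[:len(terms)] == terms]
    hits.sort(key=lambda item: (-item[1], item[0]))
    return [" ".join(gram) for gram, _ in hits[:top_k]]


if __name__ == "__main__":
    pages = [
        "<html><head><title>Query Expansion</title></head><body>"
        "<h1>Query expansion methods</h1>"
        "<p>query refinement and query expansion</p></body></html>"
    ]
    bigram_thesaurus = build_thesaurus(pages, 2)
    print(suggest("query", bigram_thesaurus))
    # -> ['query expansion', 'query refinement']
```

In this toy page, "query expansion" appears three times and once inside the title, so its total weight is 5 + 3 = 8 and it is ranked ahead of "query refinement" (1 + 1 = 2). Calling `build_thesaurus(pages, 1)` would give the Unigram counterpart under the same assumptions.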


Keywords: Thesaurus; Corpus; Corpora; N-grams; Query refinement; Query expansion; TAG weight



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Dayananda Sagar University, Bangalore, India
  2. Department of Computer Science and Engineering, SRM University AP, Amaravati, India
