Constructing Thesaurus Using TAG Term Weight for Query Expansion in Information Retrieval Application

  • Chapter

Abstract

In information retrieval applications, query expansion is an important procedure for improving retrieval precision. This chapter discusses an N-gram Thesaurus that is generated from the content of web documents and used to expand queries. The TAGs of HTML pages are parsed, and the text within each TAG is assigned a weight based on the nature of the TAG. The total weight of a text is calculated as the sum of its TAG weight and its frequency of occurrence. The Unigram Thesaurus is updated with single terms and, similarly, the N-gram Thesaurus is updated with N-term texts, each stored along with its total weight. Given a query, its term(s) are looked up in the corresponding Thesaurus to obtain a set of predicted queries. The set is ordered by total weight, and the user selects the preferred term(s) from it. The benchmark datasets Clueweb09B, WT10g and GOV2 are used for the experiments, with a fixed threshold value serving as the baseline. The proposed approach gains 8, 19 and 30% on Clueweb09B, WT10g and GOV2, respectively. In addition, KLDCo and BoCo are used as benchmarks for evaluating the presented approach in terms of query refinement. Both MAP and MRR are higher than the baseline.
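The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the tag weights in `TAG_WEIGHTS`, the helper names `TagTextParser`, `build_thesaurus` and `expand`, and the rule of taking the highest TAG weight when a text occurs under several tags are all assumptions made for illustration. Only the total weight (TAG weight plus frequency of occurrence) and the ordering of predictions by total weight come from the abstract.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "b": 2, "a": 2}
DEFAULT_WEIGHT = 1  # assumed weight for any tag not listed above


class TagTextParser(HTMLParser):
    """Collects (innermost tag, text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.chunks = []  # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.chunks.append((self.stack[-1], data))


def build_thesaurus(html, n):
    """Map each n-gram (tuple of words) to TAG weight + frequency."""
    parser = TagTextParser()
    parser.feed(html)
    parser.close()
    tag_weight = defaultdict(int)
    frequency = defaultdict(int)
    for tag, text in parser.chunks:
        words = text.lower().split()
        weight = TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            frequency[gram] += 1
            # Assumption: keep the highest TAG weight seen for this n-gram.
            tag_weight[gram] = max(tag_weight[gram], weight)
    return {gram: tag_weight[gram] + frequency[gram] for gram in frequency}


def expand(query, thesaurus, top_k=5):
    """Return thesaurus entries that extend the query, heaviest first."""
    prefix = tuple(query.lower().split())
    matches = [
        (gram, weight)
        for gram, weight in thesaurus.items()
        if len(gram) > len(prefix) and gram[:len(prefix)] == prefix
    ]
    # Order by descending total weight; ties are broken alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [" ".join(gram) for gram, _ in matches[:top_k]]


if __name__ == "__main__":
    page = (
        "<html><title>query expansion</title><body>"
        "<h1>query refinement</h1>"
        "<p>query expansion helps query logs</p>"
        "</body></html>"
    )
    bigram_thesaurus = build_thesaurus(page, 2)
    print(expand("query", bigram_thesaurus))
```

In this sketch a single-term query is looked up in the bigram Thesaurus, so every prediction extends the query by one term; a two-term query would be looked up in a trigram Thesaurus built with `n=3` in the same way.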

Author information

Corresponding author

Correspondence to S. G. Shaila.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Shaila, S.G., Vadivel, A. (2018). Constructing Thesaurus Using TAG Term Weight for Query Expansion in Information Retrieval Application. In: Textual and Visual Information Retrieval using Query Refinement and Pattern Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-13-2559-5_3

  • DOI: https://doi.org/10.1007/978-981-13-2559-5_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-2558-8

  • Online ISBN: 978-981-13-2559-5

  • eBook Packages: Computer Science, Computer Science (R0)