
Scalability Influence on Retrieval Models: An Experimental Methodology

  • Conference paper
Advances in Information Retrieval (ECIR 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)


Abstract

Few works in Information Retrieval (IR) have tackled the questions of Information Retrieval Systems (IRS) effectiveness and efficiency in the context of scalability in corpus size.

We propose a general experimental methodology for studying the influence of scalability on IR models. This methodology is based on the construction of a collection in which a given characteristic C is the same regardless of the portion of the collection selected. This new collection, called uniform, can be split into sub-collections of growing size on which given properties can be studied.

We apply our methodology to WT10G (the TREC-9 collection) and take the characteristic C to be the distribution of relevant documents over the collection. We build a uniform WT10G, sample it into sub-collections of increasing size, and use these sub-collections to study the impact of increasing corpus volume on standard IRS evaluation measures (recall/precision, high precision).
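The construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`uniform_order`, `sub_collections`), the random-shuffle construction, and the toy document labels are all assumptions made for the example. The idea is that a "uniform" ordering keeps the proportion of relevant documents roughly constant in every prefix, so nested prefixes of growing size serve as the sub-collections.

```python
import random

def uniform_order(relevant, non_relevant, seed=0):
    """Order documents so every prefix keeps roughly the global relevance
    ratio (a random shuffle already yields near-uniform prefixes).
    Each document is tagged with a relevance flag."""
    rng = random.Random(seed)
    docs = [(d, True) for d in relevant] + [(d, False) for d in non_relevant]
    rng.shuffle(docs)
    return docs

def sub_collections(ordered_docs, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Nested sub-collections of increasing size: each smaller
    sub-collection is a prefix of every larger one."""
    n = len(ordered_docs)
    return {f: ordered_docs[: int(n * f)] for f in fractions}

# Toy collection: 100 relevant documents among 1000.
relevant = [f"rel{i}" for i in range(100)]
non_relevant = [f"doc{i}" for i in range(900)]
ordered = uniform_order(relevant, non_relevant)

for frac, sub in sorted(sub_collections(ordered).items()):
    ratio = sum(1 for _, is_rel in sub if is_rel) / len(sub)
    print(f"{frac:.2f} of collection: {len(sub)} docs, relevant ratio {ratio:.3f}")
```

Because the sub-collections are nested prefixes, any change in an evaluation measure between two sizes can be attributed to the added documents rather than to a different relevance distribution.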





Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Imafouo, A., Beigbeder, M. (2005). Scalability Influence on Retrieval Models: An Experimental Methodology. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_28


  • DOI: https://doi.org/10.1007/978-3-540-31865-1_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25295-5

  • Online ISBN: 978-3-540-31865-1

  • eBook Packages: Computer Science (R0)
