
Scalability Influence on Retrieval Models: An Experimental Methodology

  • Conference paper
Advances in Information Retrieval (ECIR 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)


Abstract

Few works in Information Retrieval (IR) have tackled the questions of Information Retrieval Systems (IRS) effectiveness and efficiency in the context of scalability in corpus size.

We propose a general experimental methodology for studying the influence of scalability on IR models. This methodology is based on the construction of a collection in which a given characteristic C is the same regardless of the portion of the collection selected. This new collection, called uniform, can be split into sub-collections of growing size on which given properties can be studied.

We apply our methodology to WT10G (the TREC-9 collection) and take the characteristic C to be the distribution of relevant documents over the collection. We build a uniform WT10G, sample it into sub-collections of increasing size, and use these sub-collections to study the impact of increasing corpus volume on standard IRS evaluation measures (recall/precision, high precision).
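The construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`uniform_order`, `sub_collections`), the random-shuffle construction, and the toy document labels are all assumptions made for the example. The idea is that a "uniform" ordering keeps the proportion of relevant documents roughly constant in every prefix, so nested prefixes of growing size serve as the sub-collections.

```python
import random

def uniform_order(relevant, non_relevant, seed=0):
    """Order documents so every prefix keeps roughly the global relevance
    ratio (a random shuffle already yields near-uniform prefixes).
    Each document is tagged with a relevance flag."""
    rng = random.Random(seed)
    docs = [(d, True) for d in relevant] + [(d, False) for d in non_relevant]
    rng.shuffle(docs)
    return docs

def sub_collections(ordered_docs, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Nested sub-collections of increasing size: each smaller
    sub-collection is a prefix of every larger one."""
    n = len(ordered_docs)
    return {f: ordered_docs[: int(n * f)] for f in fractions}

# Toy collection: 100 relevant documents among 1000.
relevant = [f"rel{i}" for i in range(100)]
non_relevant = [f"doc{i}" for i in range(900)]
ordered = uniform_order(relevant, non_relevant)

for frac, sub in sorted(sub_collections(ordered).items()):
    ratio = sum(1 for _, is_rel in sub if is_rel) / len(sub)
    print(f"{frac:.2f} of collection: {len(sub)} docs, relevant ratio {ratio:.3f}")
```

Because the sub-collections are nested prefixes, any change in an evaluation measure between two sizes can be attributed to the added documents rather than to a different relevance distribution.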





Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Imafouo, A., Beigbeder, M. (2005). Scalability Influence on Retrieval Models: An Experimental Methodology. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_28


  • DOI: https://doi.org/10.1007/978-3-540-31865-1_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25295-5

  • Online ISBN: 978-3-540-31865-1

  • eBook Packages: Computer Science (R0)
