MapReduce for Information Retrieval Evaluation: “Let’s Quickly Test This on 12 TB of Data”

  • Conference paper
Multilingual and Multimodal Information Access Evaluation (CLEF 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6360)

Abstract

We propose using MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which a cluster of 15 low-cost machines searches a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
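
To make the idea of sequential scanning concrete, the following is a minimal single-process sketch in Python, not the authors' implementation. It assumes a toy scoring function (a raw count of matching terms) and hypothetical names (`map_document`, `reduce_query`, `sequential_scan`); a real experiment would run the map and reduce steps on a MapReduce cluster and plug in the retrieval model under test.

```python
import heapq
from collections import defaultdict


def map_document(doc_id, text, queries):
    """Map step: score one document against every query.

    Emits (query_id, (score, doc_id)) for each query the document matches.
    The score is a toy term-occurrence count (an assumption for illustration).
    """
    tokens = text.lower().split()
    for query_id, terms in queries.items():
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(pairs, k):
    """Reduce step: keep the k highest-scoring documents for one query."""
    return heapq.nlargest(k, pairs)


def sequential_scan(docs, queries, k=2):
    """Scan every document once; the dict stands in for the shuffle phase."""
    grouped = defaultdict(list)
    for doc_id, text in docs:
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {q: reduce_query(pairs, k) for q, pairs in grouped.items()}


# Hypothetical toy collection and queries.
docs = [
    ("d1", "mapreduce makes large scale retrieval experiments simple"),
    ("d2", "sequential scanning of a web crawl"),
    ("d3", "retrieval experiments on a web crawl with mapreduce and more mapreduce"),
]
queries = {"q1": ["mapreduce"], "q2": ["web", "crawl"]}
results = sequential_scan(docs, queries)
```

In this sketch no index is built: every query is evaluated against every document in a single pass, so trying a new retrieval approach only means changing the scoring line in `map_document`.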

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hiemstra, D., Hauff, C. (2010). MapReduce for Information Retrieval Evaluation: “Let’s Quickly Test This on 12 TB of Data”. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_8

  • DOI: https://doi.org/10.1007/978-3-642-15998-5_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15997-8

  • Online ISBN: 978-3-642-15998-5

  • eBook Packages: Computer Science (R0)
