Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs

  • Frank Hopfgartner
  • Krisztian Balog
  • Andreas Lommatzsch
  • Liadh Kelly
  • Benjamin Kille
  • Anne Schuth
  • Martha Larson
Chapter
Part of The Information Retrieval Series book series (INRE, volume 41)

Abstract

A/B testing is increasingly adopted for the evaluation of commercial information access systems with a large user base, since it allows the efficiency and effectiveness of such systems to be observed under real conditions. Unfortunately, unless university-based researchers collaborate closely with industry or develop their own infrastructure and user base, they cannot validate their ideas in live settings with real users. Without online testing opportunities open to the research community, academic researchers are unable to employ online evaluation at a larger scale: they receive no feedback on their ideas and cannot advance their research further. Businesses, in turn, miss the opportunity to improve customer satisfaction through improved systems, and users miss the chance to benefit from better information access systems. In this chapter, we introduce two evaluation initiatives at CLEF, NewsREEL and Living Labs for IR (LL4IR), that aim to address this growing “evaluation gap” between academia and industry. We explain the challenges and discuss the experiences of organizing these living labs.
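To make the A/B testing methodology concrete, the sketch below shows the two ingredients such an online experiment needs: deterministic assignment of users to a control or treatment system, and an outcome metric (here click-through rate) aggregated per bucket. This is a minimal illustration, not the infrastructure of NewsREEL or LL4IR; all function and experiment names are hypothetical.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing the user id together with the experiment name keeps the
    assignment stable across sessions while staying independent of
    assignments made for other experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000  # map to [0, 1)
    return "treatment" if fraction < treatment_share else "control"

def click_through_rate(impressions: int, clicks: int) -> float:
    """Fraction of shown results that were clicked."""
    return clicks / impressions if impressions else 0.0

# Toy interaction log: (user_id, clicked) pairs observed during the test.
log = [("u1", True), ("u2", False), ("u3", True), ("u4", False), ("u5", True)]
stats = {"control": [0, 0], "treatment": [0, 0]}  # [impressions, clicks]
for user, clicked in log:
    bucket = assign_bucket(user, "ranker-v2")
    stats[bucket][0] += 1
    stats[bucket][1] += int(clicked)

for bucket, (imp, clk) in stats.items():
    print(bucket, imp, clk, round(click_through_rate(imp, clk), 2))
```

A real deployment would add significance testing over many more interactions before declaring one system better; the point here is only that the comparison happens with live users rather than offline relevance judgments.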

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Frank Hopfgartner (1, corresponding author)
  • Krisztian Balog (2)
  • Andreas Lommatzsch (3)
  • Liadh Kelly (4)
  • Benjamin Kille (3)
  • Anne Schuth (5)
  • Martha Larson (6)

  1. University of Sheffield, Sheffield, UK
  2. University of Stavanger, Stavanger, Norway
  3. Technische Universität Berlin, Berlin, Germany
  4. Maynooth University, Maynooth, Ireland
  5. De Persgroep, Amsterdam, The Netherlands
  6. Radboud University, Nijmegen, The Netherlands