Online Experimentation for Information Retrieval

  • Katja Hofmann
Part of the Communications in Computer and Information Science book series (CCIS, volume 505)


Online experimentation for information retrieval (IR) focuses on insights that can be gained from user interactions with IR systems, such as web search engines. The most common form of online experimentation, A/B testing, is widely used in practice, and has helped sustain continuous improvement of the current generation of these systems.

As online experimentation takes an increasingly central role in IR research and practice, new techniques are being developed to address questions such as the scale and fidelity of experiments in online settings. This paper gives an overview of the currently available tools, including techniques that are already in wide use, such as A/B testing and interleaved comparisons, and techniques that have been developed more recently, such as bandit approaches for online learning to rank.

This paper summarizes and connects the wide range of techniques and insights that have been developed in this field to date. It concludes with an outlook on open questions and directions for ongoing and future research.


Keywords: Online evaluation, A/B testing, Contextual bandits, Dueling bandits, Interleaved comparison, Online learning to rank, Counterfactual analysis, Experiment design



Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License, which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  1. Microsoft Research, Cambridge, UK
