
A Methodology for Ensuring the Relative Quality of Recommendation Systems in Software Engineering
  • Alan Said
  • Domonkos Tikk
  • Paolo Cremonesi

Abstract

This chapter describes the concepts involved in benchmarking recommendation systems. Benchmarking is used to ensure the quality of a research or production system in comparison to other systems, whether algorithmically, infrastructurally, or according to any other sought-after quality. Specifically, the chapter presents the evaluation of recommendation systems with respect to recommendation accuracy, technical constraints, and business values, in the context of a multi-dimensional benchmarking and evaluation model that combines any number of such quality measures into a single comparable metric. The chapter first introduces concepts related to the evaluation and benchmarking of recommendation systems, continues with an overview of the current state of the art, and then presents the multi-dimensional approach in detail. It concludes with a brief discussion of the introduced concepts and a summary.
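As a minimal sketch of how a multi-dimensional model can fold several quality measures into a single comparable score, consider a weighted-sum aggregation over normalized dimensions. The dimension names, scores, and weights below are hypothetical illustrations, not the chapter's actual model:

    def aggregate_score(metrics, weights):
        """Combine normalized quality measures into one comparable score.

        metrics: dict mapping a quality dimension to a value in [0, 1],
                 where higher is better (invert cost-type measures first).
        weights: dict mapping the same dimensions to non-negative weights.
        """
        total = sum(weights.values())
        return sum(metrics[k] * weights[k] for k in metrics) / total

    # Hypothetical scores for two recommenders along three dimensions.
    system_a = {"accuracy": 0.82, "technical": 0.64, "business": 0.71}
    system_b = {"accuracy": 0.78, "technical": 0.90, "business": 0.69}
    weights  = {"accuracy": 0.5, "technical": 0.2, "business": 0.3}

    print(aggregate_score(system_a, weights))  # 0.751
    print(aggregate_score(system_b, weights))  # 0.777

Under this (assumed) weighting, system B ranks higher overall despite lower accuracy, which is exactly the kind of trade-off a multi-dimensional benchmark is meant to expose.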

Acknowledgments

The authors would like to thank Martha Larson from TU Delft, Brijnesh J. Jain from TU Berlin, and Alejandro Bellogín from CWI for their contributions and suggestions to this chapter.

This work was partially carried out during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 246016.


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
  2. Gravity R&D, Budapest, Hungary
  3. Óbuda University, Budapest, Hungary
  4. Politecnico di Milano, Milano, Italy