Machine Learning, Volume 108, Issue 1, pp 97–126

Incorporating domain knowledge in machine learning for soccer outcome prediction

  • Daniel Berrar
  • Philippe Lopes
  • Werner Dubitzky
Part of the following topical collections:
  1. Special Issue on Machine Learning for Soccer


The task of the 2017 Soccer Prediction Challenge was to use machine learning to predict the outcome of future soccer matches based on a data set describing the outcomes of 216,743 past soccer matches. One of the goals of the Challenge was to gauge where the limits of predictability lie with this type of commonly available data. Another goal was to pose a real-world machine learning challenge with a fixed timeline, involving the prediction of real future events. Here, we present two novel ideas for integrating soccer domain knowledge into the modeling process. Based on these ideas, we developed two new feature engineering methods for match outcome prediction, which we denote as recency feature extraction and rating feature learning. Using these methods, we constructed two learning sets from the Challenge data. The top-ranking model of the 2017 Soccer Prediction Challenge was our k-nearest neighbor model trained on the rating feature learning set. In further experiments, we slightly improved on this performance with an ensemble of extreme gradient boosted trees (XGBoost). Our study suggests that a key factor in soccer match outcome prediction lies in the successful incorporation of domain knowledge into the machine learning modeling process.


2017 Soccer Prediction Challenge; Feature engineering; k-NN; Knowledge representation; Open International Soccer Database; Rating feature learning; Recency feature extraction; Soccer analytics; XGBoost



We thank the three anonymous reviewers for their detailed comments, which helped us greatly to improve this manuscript.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Daniel Berrar (1)
  • Philippe Lopes (2, 3)
  • Werner Dubitzky (4)
  1. Data Science Lab, Department of Information and Communications Engineering, Tokyo Institute of Technology, Tokyo, Japan
  2. Sport and Exercise Science Department, University of Evry-Val d’Essonne, Évry, France
  3. INSERM, Paris Descartes University, Paris, France
  4. Research Unit Scientific Computing, German Research Center for Environmental Health, Helmholtz Zentrum München, Munich, Germany
