Machine Learning, Volume 108, Issue 1, pp 9–28

The Open International Soccer Database for machine learning

  • Werner Dubitzky
  • Philippe Lopes
  • Jesse Davis
  • Daniel Berrar
Part of the following topical collections:
  1. Special Issue on Machine Learning for Soccer


How well can machine learning predict the outcome of a soccer game, given the most commonly and freely available match data? To help answer this question and to facilitate machine learning research in soccer, we have developed the Open International Soccer Database. Version v1.0 of the Database contains essential information from 216,743 league soccer matches from 52 leagues in 35 countries. The earliest entries in the Database are from the year 2000, which is when football leagues generally adopted the “three points for a win” rule. To demonstrate the use of the Database for machine learning research, we organized the 2017 Soccer Prediction Challenge. One of the goals of the Challenge was to estimate where the limits of predictability lie, given the type of match data contained in the Database. Another goal of the Challenge was to pose a real-world machine learning problem with a fixed time line and a genuine prediction task: to develop a predictive model from the Database and then to predict the outcome of the 206 future soccer matches taking place from 31 March 2017 to the end of the regular season. The Open International Soccer Database is released as an open science project, providing a valuable resource for soccer analysts and a unique benchmark for advanced machine learning methods. Here, we describe the Database and the 2017 Soccer Prediction Challenge and its results.


Keywords

  • Open International Soccer Database
  • 2017 Soccer Prediction Challenge
  • Open science
  • Soccer analytics



After we released the Challenge data sets, we received valuable feedback from the participants regarding the evaluation of the predicted outcomes. In particular, we wish to thank team ACC and team FK for their constructive comments. We also thank the three anonymous reviewers for their valuable comments. JD is partially supported by the KU Leuven Research Fund (C14/17/070, C22/15/015, C32/17/036), FWO-Vlaanderen (SBO-150033) and Interreg V A project NANO4Sports.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Werner Dubitzky (1)
  • Philippe Lopes (2)
  • Jesse Davis (3)
  • Daniel Berrar (4)

  1. Research Unit Scientific Computing, German Research Center for Environmental Health, Helmholtz Zentrum München, Munich, Germany
  2. Sport and Exercise Science Department, University of Évry-Val d’Essonne, and INSERM, Paris Descartes University, Paris, France
  3. Department of Computer Science, KU Leuven, Leuven, Belgium
  4. Data Science Lab, Department of Information and Communications Engineering, Tokyo Institute of Technology, Tokyo, Japan
