Incorporating domain knowledge in machine learning for soccer outcome prediction
- 316 Downloads
The task of the 2017 Soccer Prediction Challenge was to use machine learning to predict the outcome of future soccer matches based on a data set describing the match outcomes of 216,743 past soccer matches. One of the goals of the Challenge was to gauge where the limits of predictability lie with this type of commonly available data. Another goal was to pose a real-world machine learning challenge with a fixed time line, involving the prediction of real future events. Here, we present two novel ideas for integrating soccer domain knowledge into the modeling process. Based on these ideas, we developed two new feature engineering methods for match outcome prediction, which we denote as recency feature extraction and rating feature learning. Using these methods, we constructed two learning sets from the Challenge data. The top-ranking model of the 2017 Soccer Prediction Challenge was our k-nearest neighbor model trained on the rating feature learning set. In further experiments, we could slightly improve on this performance with an ensemble of extreme gradient boosted trees (XGBoost). Our study suggests that a key factor in soccer match outcome prediction lies in the successful incorporation of domain knowledge into the machine learning modeling process.
Keywords2017 Soccer Prediction Challenge Feature engineering k-NN Knowledge representation Open International Soccer Database Rating feature learning Recency feature extraction Soccer analytics XGBoost
We thank the three anonymous reviewers for their detailed comments that have helped us a lot to improve this manuscript.
- Berrar, D., Bradbury, I., & Dubitzky, W. (2006). Instance-based concept learning from multiclass DNA microarray data. BMC Bioinformatics, 7(1), 73.Google Scholar
- Brodley, C. E., & Smyth, P. (1997). Applying classification algorithms in practice. Statistics and Computing, 7(1), 45–56.Google Scholar
- Chen, T., & Guestrin, C. (2016). XGBoost: Reliable large-scale tree boosting system. In: M. Shah, A. Smola, C. Aggarwal, D. Shen, & R. Rastogi (Eds.) Proceedings of the 22nd ACM SIGKDD conference on knowledge discovery and data mining, San Francisco, CA, USA (pp. 785–794).Google Scholar
- Constantinou, A. (2018). Dolores: A model that predicts football match outcomes from all over the world. Machine Learning. https://doi.org/10.1007/s10994-018-5703-7.
- Constantinou, A., & Fenton, N. (2012). Solving the problem of inadequate scoring rules for assessing probabilistic football forecast models. Journal of Quantitative Analysis in Sports, 8(1). https://doi.org/10.1515/1559-0410.1418.
- Dixon, M., & Coles, S. (1997). Modelling association football scores and inefficiencies in the football betting market. Applied Statistics, 46(2), 265–280.Google Scholar
- Dubitzky, W., Lopes, P., Davis, J., & Berrar, D. (2018). The Open International Soccer Database. Machine Learning. https://doi.org/10.1007/s10994-018-5726-0.
- Elo, A. E. (1978). The rating of chessplayers, past and present. London: Batsford.Google Scholar
- Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8(6), 985–987.Google Scholar
- Forrest, D., Goddard, J., & Simmons, R. (2005). Odds-setters as forecasters: The case of English football. International Journal of Forecasting, 21(3), 551–564.Google Scholar
- Goddard, J. (2005). Regression models for forecasting goals and match results in association football. International Journal of Forecasting, 21(2), 331–340.Google Scholar
- Gómez, M., Pollard, R., & Luis-Pascual, J. (2011). Comparison of the home advantage in nine different professional team sports in Spain. Perceptual and Motor Skills, 113(1), 150–156.Google Scholar
- Hill, I. (1974). Association football and statistical inference. Applied Statistics, 23(2), 203–208.Google Scholar
- Hubáček, O., Šourek, G., & Železný, F. (2018). Learning to predict soccer results from relational data with gradient boosted trees. Machine Learning. https://doi.org/10.1007/s10994-018-5704-6.
- Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match result prediction in association football. International Journal of Forecasting, 26(3), 460–470.Google Scholar
- Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of IEEE international conference on neural networks (Vol. 4, pp. 1942–1948).Google Scholar
- Maher, M. (1982). Modelling association football scores. Statistica Neerlandica, 36(3), 109–118.Google Scholar
- O’Donoghue, P., Dubitzky, W., Lopes, P., Berrar, D., Lagan, K., Hassan, D., et al. (2004). An evaluation of quantitative and qualitative methods of predicting the 2002 FIFA World Cup. Journal of Sports Sciences, 22(6), 513–514.Google Scholar
- R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed 24 July 2018.
- Reep, C., & Benjamin, B. (1968). Skill and chance in association football. Journal of the Royal Statistical Society, Series A (General), 131(4), 581–585.Google Scholar
- Shi, Y., & Eberhart, R. (1998). A modified particle swarm optimizer. In Proceedings of IEEE international conference on evolutionary computation (pp. 69–73).Google Scholar
- Tsokos, A., Narayanan, S., Kosmidis, I., Baio., G., Cucuringu, M., Whitaker, G., & Király, F. (2018). Modeling outcomes of soccer matches. Machine Learning. (to appear).Google Scholar
- Van Haaren, J., Dzyuba, V., Hannosset, S., & Davis, J. (2015). Automatically discovering offensive patterns in soccer match data. In E. Fromont, T. De Bie, & M. van Leeuwen (Eds.) International symposium on intelligent data analysis. Lecture notes in computer science, Saint-Étienne, France, October 22–24, 2015 (pp. 286–297). Springer, Berlin.Google Scholar
- Van Haaren, J., Hannosset, S., & Davis, J. (2016). Strategy discovery in professional soccer match data. In Proceedings of the KDD-16 workshop on large-scale sports analytics (LSSA-2016) (pp. 1–4).Google Scholar
- Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., & Hea, M. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.Google Scholar
- Zambrano-Bigiarini, M., & Rojas, R. (2013). A model-independent particle swarm optimisation software for model calibration. Environmental Modelling & Software, 43, 5–25.Google Scholar