1 Introduction

Association football (hereafter referred to simply as “football”) is currently the most popular spectator sport in the world [25]. This popularity can be partly explained by its unpredictability [15]. Because football is such a low-scoring game, a single event can make the difference between a win, a draw and a loss. Especially at the top level, many games are decided by an extraordinary action of a single player, a rare defensive slip, a refereeing error, or plain luck. As a consequence, many football games are ultimately won by the proclaimed underdog.

Although a large part of the outcome of football matches is governed by chance, every team has its strengths and weaknesses, and most results reflect these qualities. Eventually, skill does prevail and the best teams typically distinguish themselves over the course of a season or a tournament. This suggests that statistical forecasting methods could be used to predict the outcome of football matches.

Making such predictions is also one of the favorite pastimes of many football fans. However, they typically base their predictions on subjective opinions; quantifying the strength of a team objectively is a challenging task. There has been a fair amount of previous research on this topic. The Elo rating system is one of the most widely adopted approaches. While its origins lie in chess [10], Elo ratings are commonly used for other sports, including football. Hvattum and Arntzen [24] have shown for English league football that an ordered logit model with the difference between the Elo ratings of two competing teams as a single covariate is a highly significant predictor of match outcomes. While the Elo system is essentially a result-based rating (i.e., Elo ratings are computed from the win-draw-loss records of a team), other rating systems are goal-based (i.e., they are based on the number of goals a team scores). These approaches typically extend ratings to two parameters: an offensive and a defensive rating.

In this paper, we compare both rating systems (as well as their combination) in terms of their predictive performance on previous World Cups, and we provide our predictions for the 2018 World Cup. Our approach consists of three steps: First, we estimate the strength of the participating teams using past results. Second, we estimate the probabilities of match results between each pair of competing nations based on their pairwise rating differences. Finally, these predictions are used to determine the most likely tournament outcome in a Monte Carlo simulation.

The remainder of this paper is structured as follows: In Sect. 2, we discuss various proposals that have been made for modeling the outcome of football matches, as well as for rating the strength of teams. Next, Sect. 3 describes our models, followed in Sect. 4 by a discussion of three performance metrics used to compare the predictive power of these models. Finally, Sects. 5 and 6 validate the predictive strength of our models on international and domestic league football, respectively.

2 Related Work

The literature on modeling the outcomes of football games can be divided into two broad categories, namely goal-based and result-based models. The first category models the number of goals scored and conceded by both competing teams. Predictions of win-draw-loss outcomes can then be derived indirectly by aggregating the probabilities assigned to all possible scorelines. The second category models win-draw-loss outcomes directly.

The simplest goal-based models assume that the numbers of goals scored by both teams are independent and can be modeled with two separate models. Poisson regression models are used predominantly, but the negative binomial distribution [2] and the Gaussian distribution (by employing the least-squares regression method) [40] appear in the literature as well. For example, Lee [31] applied a Poisson regression model to data from the 1995/96 Premier League season, using the offensive and defensive strengths of both teams and the home advantage as parameters. These parameters are estimated using maximum-likelihood estimation on historical data. Although these independent Poisson models give a reasonably accurate description of football scores, they tend to underestimate the proportion of draws. Maher [34] was the first to identify a slight correlation between scores and therefore proposed bivariate Poisson models as an alternative, showing that a bivariate Poisson distribution gives a better fit on score differences than an independent model. Yet, this insight was not used to predict the results of future matches. Dixon and Coles [7] made the same observation, but instead of using a bivariate Poisson model, they extended the independent Poisson model with an ad-hoc adjustment to the probabilities of low-scoring games. Karlis and Ntzoufras [26] further developed the idea of bivariate Poisson distributions for forecasting football games.

Approaches that model the match outcome directly are a more recent development. Apart from their computational simplicity, these models have the advantage of avoiding the problem of modeling the interdependence between the scores of both teams [17]. Most studies in this category use discrete choice regression models, such as the ordered probit model and the ordered logit model [18, 28]. Goddard [17] compared a bivariate Poisson regression model with an ordered probit regression model and found that a hybrid model, in which goal-based team performance covariates are used to forecast win-draw-loss match results, yields the best performance.

Although regression models are the most common in the literature, any machine learning model could be used. For example, Groll et al. [21] found that a random forest model generally outperforms the conventional regression methods. Another popular class of models is that of Bayesian networks. Rue and Salvesen [38] proposed a Dynamic Bayesian Network (DBN) that takes the time-dependent strength parameters of all teams in a league into account simultaneously. Baio and Blangiardo [1] extended this to a hierarchical goal-based Bayesian model. By assuming a common distribution at a higher level, they avoid the use of a more complex bivariate structure for the number of goals scored.

While the above studies focus on the actual prediction models, other studies have investigated the feasibility of possible covariates. Bookmaker odds are a first popular covariate. They reflect the (expert) predictions of bookmakers [37], who have strong economic incentives to make accurate predictions. Several studies have found that they are an efficient forecasting instrument [4, 14, 39].

A second popular type of covariate is a rating or ranking. The main idea is to estimate adequate ability parameters that reflect a team’s current strength, based on a set of recent matches. A widely accepted approach in sports forecasting is the Elo rating system [10]. It has several generalisations, including the Glicko [16] and TrueSkill [22] rating systems. Besides Elo, there are numerous other approaches that fit into many categories [29]. Closely related to the regression-based forecasting models are the regression-based ranking methods. These methods use maximum-likelihood estimation to find adequate strength parameters for each team that explain the number of goals scored or the win-tie-loss outcomes in past games [33]. Other approaches include Markov chain based models, such as the one by Keener [27] and the Power Rank rating system [45]; network-based rating systems, such as the one by Park and Newman [36]; and Sinkhorn-Knopp based ranking models, such as the Offense Defense Model (ODM) [19]. Lasek et al. [30] found that Elo ratings outperform several of these other ranking methods when predicting the outcomes of individual games. Additionally, Van Haaren and Davis [43] found that Elo ratings perform well when predicting the final league tables in domestic football.

3 The Models

In this section, we describe the different components of our approach for predicting the outcome of football games. We compute both result-based Elo ratings and goal-based ODM ratings for each team based on past results. These ratings are then combined in an ordered logit regression or bivariate Poisson regression model to make predictions for future games. In the following sections, we compare various combinations of these rating systems and regression models.

3.1 The Elo Rating System

We briefly introduce both the basic Elo rating system and two football-specific modifications. An Elo rating system assigns a single number to each team that corresponds to the team’s current strength. These numbers increase or decrease depending on the outcomes of games and the ratings of the opponents in these games. To this end, the Elo system defines an expected score for each team in a game, based on the rating difference with the opponent. Let \(R^H\) be the current rating of the home team and \(R^A\) that of the away team; the exact formulae for the expected score \(E^H\) and the actual score \(S^H\) of the home team are then:

$$\begin{aligned} E^H = \frac{1}{1+c^{(R^{A} - R^{H})/d}} \qquad \text {and} \qquad S^H = {\left\{ \begin{array}{ll} 1\quad &{}\text {if the home team won} \\ 0.5\quad &{}\text {if the match ended in a draw}\\ 0\quad &{}\text {otherwise} \end{array}\right. } \end{aligned}$$

The expected score and actual score for the away team are then respectively \(E^A = 1-E^H\) and \(S^A = 1 - S^H\).

When a team’s actual score exceeds its expected score, this is seen as evidence that the team’s current rating is too low and needs to be adjusted upward. Similarly, when a team’s actual score is below its expected score, that team’s rating is adjusted downward. Elo’s original suggestion, which is still widely used, is a simple linear adjustment proportional to the amount by which a team overperformed or underperformed its expected score. The formula for updating the rating is

$$\begin{aligned} R^{\prime H} = R^H + k (S^H - E^H). \end{aligned}$$

How much a team’s rating increases or decreases is determined by both its expected score and the k-factor. The rating of a team that was expected to win by a large margin will therefore decrease by a correspondingly large amount if it actually loses. The k-factor is often called the recentness factor, because it determines how much weight is given to the results of recent matches. We added two additional factors: the competitiveness factor and the margin of victory. First, one of the difficulties in evaluating international football is that not all games are contested with the same seriousness. Friendlies, for example, are often used to experiment with new line-ups, and players tend not to exert themselves fully. Therefore, when computing the Elo ratings, we weight games differently depending on the importance of the competition. Second, because the best-performing international teams play most of their matches against weak opponents (especially in the European qualifiers) and record very few losses, we take the margin of victory (i.e., the absolute goal difference) into account [24]. Hence, we replace k by the expression

$$\begin{aligned} k = k_0w_i(1+\delta )^{\gamma } \end{aligned}$$
(1)

with \(\delta \) the absolute goal difference, \(w_i > 0\) a weight factor corresponding to the competitiveness of the competition, \(k_0\) the recentness factor and \(\gamma > 0\) a fixed parameter determining the impact of the margin of victory on the update rule.

There are five parameters in this rating system. The parameters c and d determine the scale of the ratings. We set them to 10 and 400, respectively. Other values lead to the same rating system, but one then has to determine matching weight parameters \(k_0\), w and \(\gamma \). The optimal values for \(k_0\) and \(\gamma \) are determined from historical data; we explain this procedure in Sect. 3.3. The values for w are application-dependent and based on expert knowledge.
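To make the update rule concrete, the following is a minimal Python sketch of a single rating update. The default values for \(k_0\) and \(\gamma \) are placeholders; the actual values are tuned as described in Sect. 3.3.

```python
def expected_score(r_home, r_away, c=10, d=400):
    """Expected score of the home team, given both current ratings."""
    return 1 / (1 + c ** ((r_away - r_home) / d))

def elo_update(r_home, r_away, s_home, goal_diff, w, k0=10, gamma=1.0):
    """Update both ratings after one game.

    s_home is 1 (home win), 0.5 (draw) or 0 (home loss);
    goal_diff is the absolute goal difference (delta);
    w weighs the competitiveness of the competition;
    k0 and gamma are placeholder values to be tuned (Sect. 3.3).
    """
    e_home = expected_score(r_home, r_away)
    k = k0 * w * (1 + goal_diff) ** gamma  # Formula 1
    r_home_new = r_home + k * (s_home - e_home)
    r_away_new = r_away + k * ((1 - s_home) - (1 - e_home))
    return r_home_new, r_away_new
```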

3.2 Offense-Defense Ratings

An important factor in football games is the playing style of both teams and the balance between offense and defense. We argue that, besides the relative strength of both teams, this difference in playing style might be an important factor in deciding the final outcome of a game. For example, a game between two teams that are known to rely on a very strong defense might have a higher probability of ending in a draw than a game between two teams that are known to play very offensively.

A rating system that can capture these differences in offensive and defensive strength is the Offense Defense Model (ODM) [19]. As opposed to the Elo system discussed above, it captures the offensive and defensive strength of a team as two separate parameters. To this end, it uses goals scored as a measure of offensive strength and goals conceded as a measure of defensive strength. Whether a game eventually results in a win, a tie or a loss does not affect the ratings. We again briefly discuss the basic ODM rating system, followed by a couple of modifications for its application to international football.

Define \(A_{ij}\) as the number of goals scored by team j against team i. The offensive and defensive ratings of team j are then

$$\begin{aligned} o_{j} = \sum \limits _{i=1}^{n} \frac{A_{ij}}{d_{i}} \qquad \text {and} \qquad d_{j} = \sum \limits _{i=1}^{n} \frac{A_{ji}}{o_{i}}. \end{aligned}$$

Since the offensive and defensive ratings are interdependent, they must be approximated by an iterative refinement procedure. We refer to the original paper for details.
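For illustration, a minimal sketch of this refinement procedure, assuming a complete goal matrix; the convergence check is our addition.

```python
import numpy as np

def odm_ratings(A, num_iter=100, tol=1e-8):
    """Iteratively refine the ODM ratings.

    A[i, j] holds the goals scored by team j against team i.
    Returns the offensive and defensive rating vectors.
    """
    n = A.shape[0]
    o, d = np.ones(n), np.ones(n)
    for _ in range(num_iter):
        o_new = (A / d[:, None]).sum(axis=0)      # o_j = sum_i A_ij / d_i
        d_new = (A / o_new[None, :]).sum(axis=1)  # d_j = sum_i A_ji / o_i
        if np.allclose(o_new, o, atol=tol) and np.allclose(d_new, d, atol=tol):
            o, d = o_new, d_new
            break
        o, d = o_new, d_new
    return o, d
```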

This approach works for domestic league football, where each team plays the same number of games against every other team. In international football, however, there can be large disparities in the number of games played and the strength of the opponents. A team that plays few games against strong opponents will likely score fewer goals and concede more goals, which leads to both a weaker offensive and a weaker defensive rating. To address these problems, we update the ratings sequentially. In each game, a team has two sets of ratings: pre-game ratings and post-game ratings. The pre-game ratings are a weighted sum of a team’s post-game ratings in previous games. Similar to what we did for the Elo rating system, these weights are determined by the recentness of a game and its competitiveness. To compute the post-game ratings, we apply the iterative procedure from the original ODM as if the competition had only two teams, using the pre-game ratings as initial estimates for the offensive and defensive ratings. Algorithm 1 below defines the exact procedure.

[Algorithm 1: sequential computation of the pre-game and post-game ODM ratings]

3.3 Match Result Predictions

The rating systems defined above can be combined with a regression model to obtain predictions for future matches. To this end, we consider the rating differences \(R^H - R^A\), \(o^H - d^A\) and \(d^H - o^A\) as covariates. Additionally, a fourth covariate indicates whether a home advantage applies to the home team. In Sect. 5, we compare an ordered logit regression model [35] and a bivariate Poisson regression model [26], as well as various combinations of these covariates, in terms of their predictive power on previous World Cups.

We use the L-BFGS-B algorithm [5] with the Ranked Probability Score (RPS) [11] as loss function to determine the optimal set of parameters for these models. This approach allows us to jointly optimize the parameters of both the rating systems and the regression models. To this end, we order the games in our dataset chronologically and define two subsets: a validation set and a test set. The test set contains the matches that we would like to predict; ideally, the validation set contains matches from previous editions of the same tournament or league. We then repeatedly replay the complete dataset: each match is evaluated sequentially, the ratings are updated based on its actual outcome, and a prediction is recorded for each match in the validation set. Once all matches are evaluated, we compute the RPS on the validation set and update the parameters of both the rating system and the regression model so as to minimize this RPS.
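The following is a minimal sketch of this procedure. The `replay` generator, which reruns the rating system and the regression model with a candidate parameter vector and yields one (forecast, outcome) pair per game, as well as the parameter layout and bounds, are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def validation_loss(params, games, val_idx, replay, loss_fn):
    """Replay all games chronologically with the candidate parameters,
    score the forecasts for the validation games with loss_fn (the RPS
    from Sect. 4) and return the average."""
    losses = [loss_fn(p, y)
              for i, (p, y) in enumerate(replay(params, games))
              if i in val_idx]
    return float(np.mean(losses))

# Joint optimisation of the rating parameters (e.g. k0 and gamma) and
# the regression coefficients; starting values and bounds are assumed:
# result = minimize(validation_loss, x0=np.array([10.0, 1.0, 0.0, 0.5]),
#                   args=(games, val_idx, replay, rps),
#                   method="L-BFGS-B",
#                   bounds=[(1, 100), (0, 3), (None, None), (None, None)])
```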

4 Evaluation Procedures

In this section, we consider several evaluation measures to compare our models with each other and with the odds set by bookmakers. These odds serve as a natural benchmark for our models. In contrast to the methods above, bookmaker odds are not solely based on the results of past games: they include the expert judgments of the bookmakers, who have a strong economic motivation to rate the competitors accurately [32]. After removing the profit margin of the bookmaker, the inverted odds can be interpreted as outcome probabilities [20].
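As an aside, the basic normalisation that removes this margin is straightforward; reference [20] discusses more refined alternatives.

```python
def odds_to_probabilities(odds):
    """Convert decimal three-way odds into outcome probabilities
    by normalising away the bookmaker's profit margin."""
    inverse = [1 / o for o in odds]
    margin = sum(inverse)  # > 1 because of the profit margin
    return [p / margin for p in inverse]

# e.g. home/draw/away odds of 2.10, 3.30 and 3.80:
# odds_to_probabilities([2.10, 3.30, 3.80])  ->  approx. [0.457, 0.291, 0.252]
```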

Both our models and the bookmaker odds assign a probability to each of the three possible outcomes of a match. One can evaluate these probabilities in three ways: First, one can consider the outcome with the highest assigned probability as the predicted outcome. Second, one can look at the probability that was assigned to the true outcome. Third, one can judge the three probabilities as a whole. Each of these evaluations leads to a different evaluation measure, which we define below.

In the following paragraphs, we use the ordered vector \(\hat{p} = (\hat{p}_1, \hat{p}_2, \hat{p}_3)\) to denote a probability forecast over the possible match outcomes \(r = (\text {win}, \text {tie}, \text {loss})\). Additionally, \(y = (y_1, y_2, y_3)\) denotes the true outcome of a match, with \(y_i\) a binary indicator of whether or not i is the true outcome.

  • Accuracy. This measure compares the outcome with the highest assigned probability to the true outcome. The accuracy for a single match is computed using the following formula:

    $$\begin{aligned} \mathbbm {1}[\mathop {\hbox {arg max}}\limits _{i} y_i = \mathop {\hbox {arg max}}\limits _{i} \hat{p}_{i}] \end{aligned}$$
    (2)

    where \(\mathbbm {1}{[.]}\) is the indicator function that equals 1 if the statement between brackets holds, and 0 otherwise.

  • Logarithmic loss. This measures the uncertainty of the prediction based on how much it deviates from the actual outcome. The logarithmic loss is computed as

    $$\begin{aligned} - \sum \limits _{i=1}^{\left| {r}\right| } y_{i} \log {\hat{p}_{i}} \end{aligned}$$
    (3)

    with \(\left| {r}\right| \) the number of possible outcomes. A perfect classifier would have a logarithmic loss of precisely zero. Less ideal classifiers have progressively larger values.

  • Ranked Probability Score (RPS). The ranked probability score (RPS) was introduced in 1969 by Epstein [11] to evaluate probability forecasts of ranked categories. In contrast to the two previous measures, it explicitly accounts for the ordinal structure of the predictions. This means that predicting a tie when the actual outcome is a loss is considered a better prediction than predicting a win. For our purpose, it can be defined as

    $$\begin{aligned} \frac{1}{\left| {r}\right| -1} \sum \limits _{k = 1}^{\left| {r}\right| -1} (\sum \limits _{l = 1}^k (\hat{p}_{l} - y_{l}))^2 \end{aligned}$$
    (4)

    As the RPS is an error measure, a lower value corresponds to a better fit.
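The three measures translate directly into code; a minimal sketch for a single match, assuming the forecast and the true outcome are given as ordered (win, tie, loss) vectors:

```python
import numpy as np

def accuracy(p, y):
    """1 if the most probable outcome is the true outcome (Eq. 2)."""
    return float(np.argmax(p) == np.argmax(y))

def log_loss(p, y):
    """Logarithmic loss of a single forecast (Eq. 3)."""
    return -np.sum(y * np.log(p))

def rps(p, y):
    """Ranked probability score of a single forecast (Eq. 4)."""
    return np.sum(np.cumsum(p - y)[:-1] ** 2) / (len(p) - 1)

# A forecast of (0.5, 0.3, 0.2) for a match that ends in a draw:
# accuracy -> 0.0, log_loss -> 1.20, rps -> 0.145
```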

We will use the same metrics to evaluate the prediction of a tournament outcome. For that purpose, we define the set of possible outcomes as \(r = (\text {elimination in the group stage}, \text {elimination in the round of 16}, \ldots , \text {win})\).

5 Validation on Previous World Cups

This section evaluates the predictive performance of our models on the four FIFA World Cups between 2002 and 2014. To do so, we adopt a leave-one-out procedure in which we iteratively tune the parameters of our model on three of the four World Cups and evaluate the performance on the left-out one. For example, while predicting the matches of the 2010 World Cup (i.e., the test set), we use all available international games played between the end of the Second World War and the start of the 2018 World Cup to determine the ratings of each team, but only the matches of the 2002, 2006 and 2014 World Cups to tune the parameters (i.e., the validation set). Although we use data from future World Cups to determine the parameter settings for a previous World Cup, note that we only include information from past games when rating teams. Therefore, the predictions do not depend on the results of future matches.

Our dataset was scraped from http://eloratings.net and includes all international games played between the end of the Second World War (January 1, 1946) and the start of the 2018 World Cup (June 13, 2018). For each of these games, we scraped the competing teams, the date of the game, the competition, the outcome after regular time, the outcome after extra time or penalties, and whether a home advantage applies. Additionally, we scraped the three-way odds, averaged over multiple bookmakers, from http://betexplorer.com for all World Cup matches between 2002 and 2014.

We classified all international competitions into five categories, corresponding to their competitiveness and relative importance. Each category is assigned a weight, which corresponds to w in Formula 1. We assigned a “very high” weight (\(w = 1\)) to World Cup games; a “high” weight (\(w = 0.833\)) to each of the six (AFC, CAF, CONCACAF, CONMEBOL, OFC and UEFA) continental championships; a “medium” weight (\(w = 0.66\)) to their qualifiers as well as to the World Cup qualifiers; a “low” weight (\(w = 0.5\)) to the less important tournaments, such as the African Games, the Balkan Cup, ...; and finally a “very low” weight (\(w = 0.33\)) to friendlies. These categories are derived from the FIFA/Coca-Cola World Ranking, but we added an extra category of very low importance for friendlies (the FIFA ranking gives friendlies the same weight as games in small tournaments) and used different weights.
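In an implementation, this classification amounts to a simple lookup table; the category labels below are ours.

```python
# Competition weights w used in Formula 1 (values as listed above):
COMPETITION_WEIGHTS = {
    "world_cup": 1.0,
    "continental_championship": 0.833,  # AFC, CAF, CONCACAF, CONMEBOL, OFC, UEFA
    "qualifier": 0.66,                  # World Cup and continental qualifiers
    "minor_tournament": 0.5,            # e.g. African Games, Balkan Cup
    "friendly": 0.33,
}
```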

Table 1 shows the predictive performance of the models introduced in Sect. 3 on the individual games of these four World Cups. Additionally, we include the averaged bookmaker predictions and the predictions of the best-performing random forest model from Groll et al. [21] as baselines. This random forest model includes, amongst others, the bookmaker odds and Elo ratings as covariates. We updated the ratings of each team after each game based on the true outcome; this is in line with how bookmaker odds are updated until shortly before the start of a game. It turns out that the simple ordered logit regression model with the Elo rating difference and home advantage as covariates outperforms the bookmaker predictions and all other models in terms of accuracy, logarithmic loss and RPS. Furthermore, we notice that the 2006 and 2014 World Cups are considerably easier to predict than the 2002 and 2010 World Cups. The ODM-based models perform slightly better on these hard-to-predict World Cups.

Table 1. Validation of different predictive models on individual games in the 2002, 2006, 2010 and 2014 World Cups. The models are ordered by increasing average RPS value. The last column shows the average RPS for each World Cup.

Next, we used these models to predict the tournament course of the World Cups between 2002 and 2014. Given match outcome probabilities for each possible match-up, we ran 20,000 Monte Carlo simulations per World Cup. Occasionally, two or more teams finish the group phase with the same points tally. In that case, FIFA defines a number of tie-breakers to determine the ranking order, primarily based on the number of goals scored. However, since the ordered logit models can only sample win, draw and loss results, we resolve these ties randomly. Similarly, in the case of a draw in the knockout stage, we simulate the extra time by sampling a second result. The bivariate Poisson models do make it possible to resolve equal points tallies according to the official rules, but this does not lead to more accurate predictions.
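A minimal sketch of how a single knockout game can be resolved in such a simulation; the shoot-out fallback (a win/loss draw proportional to the respective probabilities) is our assumption, as the text above only specifies the extra-time resampling.

```python
import random

def sample_outcome(probs):
    """Sample 'win', 'tie' or 'loss' from a three-way forecast."""
    return random.choices(["win", "tie", "loss"], weights=probs)[0]

def knockout_winner(home, away, probs):
    """Resolve one knockout game from the home team's forecast."""
    outcome = sample_outcome(probs)
    if outcome == "tie":  # extra time: sample a second result
        outcome = sample_outcome(probs)
    if outcome == "tie":  # shoot-out fallback (our assumption)
        p_win, _, p_loss = probs
        outcome = random.choices(["win", "loss"], weights=[p_win, p_loss])[0]
    return home if outcome == "win" else away

# Repeating this over the full bracket (after simulating the group
# stage) 20,000 times yields the advancement probabilities.
```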

Table 2 presents the performance of the Elo-based and Elo+ODM-based ordered logit models on the past World Cups. In contrast to the previous experiment, these are pre-tournament predictions, meaning that only games preceding the corresponding World Cup are considered when making the predictions. For comparison, we added the 2014 World Cup predictions by FiveThirtyEight [12]. The model using the Elo ratings as a single covariate is again the best-performing one, but it turns out that these tournament forecasts are quite inaccurate: we could predict the actual round of elimination correctly for only about half of the participating teams.

Table 2. Performance of a Monte Carlo simulation on previous World Cups, using the Elo and Elo+ODM-based ordered logit models. Additionally, we provide the 2014 World Cup predictions of FiveThirtyEight [12] for comparison.

Finally, we applied the ordered logit model with both the Elo and ODM ratings as covariates to forecast the 2018 World Cup. According to our simulation, Brazil was the clear favorite with a win probability of 33%, followed by Germany, Spain, France and Argentina. These results were in line with the bookmaker odds, although the bookmakers were more conservative about the win probability of Brazil. Table 3 shows these probabilities for the five favorites. For a detailed overview, we refer to our interactive visualization at https://dtai.cs.kuleuven.be/sports/worldcup18/.

Table 3. Estimated probabilities for reaching the different stages in the 2018 World Cup for the five most-likely winners based on 20,000 simulations.

The 2018 World Cup caught the attention of several other data scientists who tried to forecast the tournament outcome. Based on their SPI rating system, FiveThirtyEight [13] forecasted Brazil (19%) to win the World Cup, followed by Spain (17%) and Germany (13%). The same teams were identified as the major favorites by Zeileis, Leitner and Hornik [44]. By aggregating the winning odds of several bookmakers and transforming those into winning probabilities, they obtained a win probability of 16.6% for Brazil, 15.8% for Germany and 12.5% for Spain. The Swiss bank UBS [42] came up with the same three favorites, but in a different order: Germany as the main favorite (24.0%), followed by Brazil (19.8%) and Spain (12.5%). Groll et al. [21], who combined a large set of 16 features in a random forest approach, likewise arrived at Spain (17.8%), Germany (17.1%) and Brazil (12.3%) as the main favorites. As far as we know, only EA Sports [9] correctly predicted France as the World Cup winner; however, they did not publish any win probabilities. In Table 4, we compare these models with ours, looking both at the predictive accuracy for individual games (if available) and the accuracy of the pre-tournament simulation. To allow comparison with FiveThirtyEight’s predictions, we convert the win-tie-loss probabilities for games in the knockout stage to win-loss probabilities. To this end, we use the formula \(p'_{win} = p_{win} + p_{win} / (p_{win} + p_{loss}) * p_{tie}\), and analogously for \(p'_{loss}\).
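As a small worked example of this conversion:

```python
def to_win_loss(p_win, p_tie, p_loss):
    """Redistribute the tie probability over win and loss,
    proportionally to their probabilities (the formula above)."""
    p2_win = p_win + p_win / (p_win + p_loss) * p_tie
    return p2_win, 1 - p2_win

# to_win_loss(0.5, 0.3, 0.2)  ->  approx. (0.714, 0.286)
```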

Table 4. A comparison of our model’s predictions with others for the 2018 World Cup. Both the predictive accuracy of individual games (if available) and of the pre-tournament forecast are listed.

6 Validation on Domestic League Football

We also validated our models on The Open International Soccer Database, which was provided as part of the 2017 Soccer Prediction Challenge [8]. The training set comprises 216,743 match outcomes, with missing data (as part of the challenge), from 52 football leagues from all over the world, covering the seasons 2000–2001 until 2017–2018. The challenge involved using a single model to predict 206 future match outcomes from 26 different leagues. We used the 2010–2011 until 2017–2018 seasons of the training set as our validation set to determine the optimal parameters for our models. This corresponds to about half of the training data.

Table 5 compares the performance of our best-performing models to the four best results of the competition and to the bookmaker odds. As for the World Cup predictions, these bookmaker odds are the three-way odds averaged over multiple bookmakers, scraped from http://betexplorer.com. Although we did not optimize our approach for domestic league football, our relatively simple model outperforms all other, more complex models in terms of RPS. Three models achieve a better accuracy. While we did not verify this, we believe that the predictive accuracy could be further improved by incorporating league-specific home advantages and by allowing faster rating updates after the summer and winter breaks (because of transfers).

Table 5. Our approach compared to the best-performing models from the 2017 Soccer Prediction Challenge [8]. We omit the logarithmic loss because it was not reported for the 2017 Soccer Prediction Challenge submissions.

7 Conclusion

In this work, we compared several models for match outcome prediction and tournament simulation in football. We considered all possible combinations of a result-based and a goal-based regression model, with result-based Elo rating differences and goal-based ODM rating differences as covariates. In conclusion, we found that a very basic Elo-based ordered logit model outperforms all other models, including more complex models from the literature. Apparently, the outcome of an individual match is unpredictable enough to confound even sophisticated computer models.