Dolores: a model that predicts football match outcomes from all over the world
Abstract
The paper describes Dolores, a model designed to predict football match outcomes in one country by observing football matches in multiple other countries. The model is a mixture of two methods: (a) dynamic ratings and (b) Hybrid Bayesian Networks. It was developed as part of the international special issue competition Machine Learning for Soccer. Unlike past academic literature which tends to focus on a single league or tournament, Dolores is trained with a single dataset that incorporates match outcomes, with missing data (as part of the challenge), from 52 football leagues from all over the world. The challenge involved using a single model to predict 206 future match outcomes from 26 different leagues, played from March 31 to April 9 in 2017. Dolores ranked 2nd in the competition with a predictive error 0.94% higher than the top and 116.78% lower than the bottom participants. The paper extends the assessment of the model in terms of profitability against published market odds. Given that the training dataset incorporates a number of challenges as part of the competition, the results suggest that the model generalised well over multiple leagues, divisions, and seasons. Furthermore, while detailed historical performance for each team helps to maximise predictive accuracy, Dolores provides empirical proof that a model can make a good prediction for a match outcome between teams x and y even when the prediction is derived from historical match data that neither x nor y participated in. While this agrees with past studies in football and other sports, this paper extends the empirical evidence to historical training data that does not just include match results from a single competition but contains results spanning different leagues and divisions from 35 different countries. 
This implies that we can still predict, for example, the outcome of English Premier League matches, based on training data from Japan, New Zealand, Mexico, South Africa, Russia, and other countries in addition to data from the English Premier League.
Keywords
Association football, Bayesian Networks, Dynamic ratings, Football betting, Soccer prediction, Time-series analysis
1 Introduction
Association football, more commonly known as football or soccer (hereafter referred to as ‘football’), is the world’s most popular sport (Dunning 1999). At the turn of the twenty-first century, FIFA estimated that there were approximately 250 million football players in over 200 countries, and over 1.3 billion football fans (Britannica 2017). From a financial perspective, the European football market alone is projected to exceed €25 billion in 2016/17 (Deloitte 2016), whereas the global sports gambling market is estimated to be worth up to $3 trillion, with football betting representing 65% of this figure (Daily Mail 2015).
1. Statistical models: Applications to football match prediction typically include ordered probit regression models (Kuypers 2000; Goddard and Asimakopoulos 2004; Forrest et al. 2005; Goddard 2005) and Poisson models (Maher 1982; Dixon and Coles 1997; Lee 1997; Karlis and Ntzoufras 2003; Angelini and Angelis 2017). These studies are typically published in statistical journals.
2. Machine learning and probabilistic graphical models: Applications to football match prediction typically include genetic algorithms (Tsakonas et al. 2002; Rotshtein et al. 2005), Bayesian or Markov methods (Joseph et al. 2006; Baio and Blangiardo 2010; Rue and Salvesen 2010; Constantinou et al. 2012, 2013) and neural networks (Cheng et al. 2003; Huang and Chang 2010; Arabzad et al. 2014). These studies are typically published in computer science and artificial intelligence journals.
3. Rating systems: Applications to football match prediction are mainly based on variants of the widely known Elo rating system (Elo 1978; Leitner et al. 2010; Hvattum and Arntzen 2010), which was initially developed for assessing the strength of chess players, and include the official FIFA/Coca-Cola World Ranking (FIFA 2017). A rather different rating method, the pi-rating (Constantinou and Fenton 2013a), provides relative measures of superiority between football teams solely on the basis of the relative discrepancies in scores between adversaries. These studies also tend to be published in statistical journals.
This paper describes a model that combines a rating system with a Hybrid Bayesian Network (BN). The rating system, which is partly based on the pi-rating system mentioned above, generates a rating score that captures the ability of a team relative to the residual teams within a particular league. The resulting ratings are then used as input to the BN model for match prediction.
A BN is a well-established graphical formalism for representing and reasoning under uncertainty. It is a type of probabilistic graphical model (Koller and Friedman 2009) introduced by Pearl (1982, 1985, 2009), where variables are represented by nodes and influential links by arcs. A BN model encodes the conditional probabilistic relationships amongst random variables under the assumptions of a Directed Acyclic Graph (DAG), which satisfies the Markov condition of conditional independence. Hybrid BNs are simply BN models that incorporate both discrete and continuous variables.
The paper is structured as follows: Sect. 2 describes the data engineering approach, Sect. 3 describes the model, Sect. 4 provides a worked example of the model, Sect. 5 evaluates the model and discusses the results, and Sect. 6 provides the concluding remarks.
2 Data engineering
The football leagues captured by the training and test datasets. The code ENG1 represents the top division in England (i.e., English Premier League) and ENG5 the fifth division in England (i.e., Conference League); the same reasoning applies to each of the residual coded leagues. A cell in yellow background indicates that the league is captured by the training dataset; grey indicates that the league is not captured by any of the datasets; red indicates missing data (whole league); and blue indicates ongoing leagues captured by the test dataset (Color table online)

Yellow represents leagues captured by the training dataset.
Grey represents leagues not captured by any of the datasets.
Red represents missing data; i.e., missing match results for a whole season. A total of seven seasons of match results are omitted from model training as part of the challenge in the competition, which is expected to negatively influence the predictive accuracy of the model.
Blue represents ongoing leagues captured by the test dataset.
In predicting the outcome of a match for team x, a possible starting point is to base the prediction on recent historical results of x. Such an approach typically requires statistical profiles related to the historical performances for each team. In contrast, this paper adopts the approach of Constantinou and Fenton (2012), where team ratings are based on recent historical match results, but where match predictions are derived from historical observations which include different teams. This implies that a match prediction between teams x and y is often based on historical results that include neither x nor y. In this paper, this approach is extended to different divisions and different countries.
Since part of the overall model is based on a rating system, it naturally shares similarities with other rating-based approaches, which demonstrate varying degrees of success. These include the Elo variants and pi-rating in football (Leitner et al. 2010; Hvattum and Arntzen 2010; Constantinou and Fenton 2012, 2013a; FIFA 2017), the ‘adjusted offensive and defensive efficiencies’ in basketball (Gelman et al. 2003; Piette et al. 2011), the points scored or ‘runs scored and runs allowed’ in baseball, hockey, and basketball (Oliver 2004; Miller 2006; Dayaratna and Miller 2013), and the ‘defence-adjusted value over average’ statistics in Australian and American Football (O’Shaughnessy 2006; Schatz 2006).
An illustration of the data engineering approach which enables us to generate identical predictions for match instances which share identical rating difference (RD), where identical RDs are derived from teams with different home (HT) and away (AT) ratings
Case  League  Match date  HT  AT  HT rating  AT rating  RD (BN input)  Model prediction [1X2] (%)  Bookmakers’ prediction [1X2] (%)  Match result [Goals]
1 (HT favourite)  ENG2  01/04/17  Newcastle  Wigan  0.98  −0.34  1.31  70-18-12  69-21-10  1 [2-1]
1 (HT favourite)  ECU1  09/04/17  CS Emelec  C Juvenil  1.16  −0.23  1.39  71-18-11  77-16-8  1 [2-0]
2 (No favourite)  MEX1  02/04/17  Guadalajara  C Tijuana  0.12  0.23  −0.11  38-27-35  38-29-33  X [3-3]
2 (No favourite)  ISR1  01/04/17  M Haifa  Beitar J  0.53  0.62  −0.09  38-27-35  35-31-34  1 [3-2]
3 (AT favourite)  ITA1  02/04/17  Pescara  AC Milan  −0.99  0.68  −1.68  9-17-74  16-21-62  X [1-1]
3 (AT favourite)  SPA1  02/04/17  Granada  Barcelona  −0.30  1.47  −1.77  9-17-74  7-13-80  2 [1-4]
1. Temporal data: Consider a match between teams x and y in season 2016/17, where y has a +1 advantage in rating over x. The historical performances of x and y in past seasons are not only sparse, but also become increasingly less relevant the further away they are from season 2016/17. This implies that the data are temporally dependent, which makes recent data more important than old data. The approach adopted here avoids this drawback: instead of searching for historical match instances between x and y, and having to weight discovered observations in terms of relevance in the temporal space, the algorithm searches for historical match instances where any away team had a +1 rating advantage over the home team, regardless of the date, the place, or the teams of the match.
2. New team data: When a team is promoted or relegated to a division for the first time, there may be no relevant data available in terms of how this team performs against teams that already participate in that division. This approach partly addresses the issue: because a team that joins a league for the first time does so with a default rating value of 0, the challenge reduces to rapidly optimising the rating of the newly promoted/relegated team for that division.
3. Different leagues: A particularly important benefit of this approach is that historical observations of match instances from one league can be used to predict match results for teams in another league. This is because while a team with rating R in league A is in no way equivalent to a team with rating R in league B, a match instance in league A with rating difference D exhibits strong similarities with a match instance in league B with rating difference D.
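The pooling idea behind the three points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the record layout, the function name `pooled_outcome_frequencies`, the tolerance `tol`, and the toy match records (with hypothetical league codes) are all invented for the example.

```python
from collections import Counter

def pooled_outcome_frequencies(matches, target_rd, tol=0.05):
    """Empirical 1X2 frequencies over all historical matches whose rating
    difference (home minus away) lies within `tol` of `target_rd`.

    League, date and team identities are deliberately ignored, so a match
    in one league can inform a prediction in another.
    """
    counts = Counter(
        m["result"] for m in matches if abs(m["rd"] - target_rd) <= tol
    )
    total = sum(counts.values())
    if total == 0:
        return None  # no comparable historical match instance
    return {outcome: counts[outcome] / total for outcome in ("1", "X", "2")}

# Toy records, invented for illustration only (leagues A-D are hypothetical).
history = [
    {"league": "A", "rd": 1.31, "result": "1"},
    {"league": "B", "rd": 1.33, "result": "1"},
    {"league": "C", "rd": -0.11, "result": "X"},
    {"league": "D", "rd": 1.29, "result": "X"},
    {"league": "A", "rd": 1.30, "result": "2"},
]

freqs = pooled_outcome_frequencies(history, target_rd=1.30)
```

Four of the five toy records fall within the tolerance, even though they come from four different leagues; the record with a rating difference of −0.11 is ignored.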
3 The overall model
1. A dynamic rating system that provides relative measures of superiority between adversaries for each league, and which represents an extended version of the pi-rating system (Constantinou and Fenton 2013a). Note that because in this paper the rating method is extended to multiple leagues, a team can participate in different leagues through promotion or relegation. Since a team’s rating converges relative to the adversaries in a particular league, each team has distinct ratings corresponding to each of their participating leagues. As discussed in the previous section, when a team joins a league for the first time it is assigned a default initial rating of 0 for that league. The old rating is saved for the old league as the new default rating for that specific team, in case they ever return to that league.
2. A Hybrid BN model that takes the resulting ratings from (1) as input to infer the predictive distribution of 1X2, also known as HDA (i.e., home win, draw, and away win), as indicated in Table 2.
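One simple way to hold a separate rating per league, with the default of 0 on first entry, is a dictionary keyed by team and league. The sketch below is an assumption about bookkeeping only; the class name `TeamRating`, the key layout, the team name, and the example value 0.45 are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TeamRating:
    home: float = 0.0  # default rating on first entry to a league
    away: float = 0.0

# Keyed by (team, league): a team keeps a separate rating pair per league.
ratings = defaultdict(TeamRating)

ratings[("Team X", "ENG2")].home = 0.45   # rating earned in the second tier
promoted = ratings[("Team X", "ENG1")]    # first appearance in ENG1: starts at 0

# The ENG2 entry is left untouched, so it serves as the starting rating
# should the team ever return to that league.
saved = ratings[("Team X", "ENG2")]
```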
3.1 The rating system
1. Learning rate λ: Determines to what extent the new match results influence the team ratings. The higher the learning rate λ, the more important the recent match results become and hence, the higher their impact is on revising team ratings. This parameter is based on the fact that recent match results are more relevant than older match results, in terms of generating team ratings that reflect a team’s ability at a given point in time. However, one limitation is that the parameter does not account for the temporal difference between matches; whether the last game was played in the preceding season or 1 week ago, it is discounted equally in both cases.
2. Diminishing function ψ: A function of the difference between the observed and the expected goals. It aims to diminish the impact each additional goal difference error has on team ratings. For example, a win by 2 goals influences team ratings less than twice as much as a win by 1 goal. This parameter is based on the fact that a win is more important for a team than increasing goal difference.
3. Learning rate γ: A team has two ratings, one for home and another for away grounds. The learning parameter γ determines to what extent performances at the home grounds influence away team ratings and vice versa. A higher learning rate γ indicates a greater influence. This parameter is based on the well-known phenomenon of home advantage, under the assumption that the home advantage is not invariant between teams. While there is a single learning rate \( \gamma \) for all teams, Sect. 3.1.1 describes how the home/away effect for every team is treated individually.
In addition to the original three features of the pi-rating, this extended version incorporates the team form factor. This factor is introduced based on the assumption that team performances may dramatically decrease or increase for a short period of time, and such performances do not necessarily reflect the true long-term ability of the team. This assumption shares similarities with the Pythagorean expectation proposed in baseball, which provides an estimate of the games a baseball team should have won based on the number of runs they scored and allowed (Miller 2006). In essence, the Pythagorean expectation is a probabilistic estimation of team results based on run statistics, and it could be used to estimate under/overperformances. It has been applied to other sports such as basketball (Oliver 2004) and hockey (Dayaratna and Miller 2013) with varying degrees of success. It has also been applied successfully in college basketball based on points scored (Pomeroy 2017), by simply predicting the team with the higher expected win percentage as the likely winner. Applications to football have not been met with similar success, though a considerably more complicated extension of the Pythagorean expectation was shown to perform reasonably well in predicting total league points at the end of a football season (Hamilton 2011).
1. Form threshold ϕ: Represents the number of continuous performances, above or below expectations, which do not trigger the form factor, under the assumption that the original implementation of the pi-ratings fails to adapt quickly to such dramatic changes. For example, if ϕ is set to 1, the form factor will trigger only after observing more than one continuous under/overperformance.
2. Rating impact μ: This parameter comes as a natural consequence of parameter ϕ above. It represents the rating difference used to establish provisional ratings from background ratings, once the form factor is triggered.
3. Diminishing factor δ: This parameter is based on the assumption^{1} that the background ratings ‘catch up’ with each continuous over/underperformance and hence, the form impact diminishes with each ϕ + 1. It represents the level by which rating impact μ diminishes with each additional continuous over/underperformance.
In brief, the algorithm searches for patterns of continuous over/underperformances. If more than ϕ are discovered, the form factor is triggered and causes the provisional ratings to change and evolve differently from the background ratings, as long as the form factor remains active. In the case of continuous underperformances, the provisional ratings decrease faster relative to the background ratings, with a diminishing decrease with each ϕ + 1, and vice versa for overperformances. Otherwise, the provisional ratings remain equal to the background ratings. When an over/underperformance occurs for a team, the match prediction is based on the team’s provisional rating; otherwise, on the team’s background rating.
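The streak detection described above can be sketched as a signed counter. This is a minimal illustration under assumptions: the function names are hypothetical, the sign convention (positive for overperformance, negative for underperformance) follows the counter values reported in Sect. 4.2, and resetting to zero when the observation exactly equals the expectation is an assumption not stated in the text. The adjustment that turns background ratings into provisional ratings via μ and δ is not reproduced here.

```python
def update_streak(streak, observed_gd, expected_gd):
    """Signed count of continuous over- (positive) or under- (negative)
    performances, from the point of view of one team.

    `observed_gd` and `expected_gd` are goal differences in favour of that team.
    """
    if observed_gd > expected_gd:   # overperformance
        return streak + 1 if streak > 0 else 1
    if observed_gd < expected_gd:   # underperformance
        return streak - 1 if streak < 0 else -1
    return 0                        # assumption: an exact match resets the streak

def form_factor_active(streak, phi=1):
    """The form factor triggers only after more than `phi` continuous
    over- or underperformances."""
    return abs(streak) > phi

# A team on three continuous overperformances beats an expectation of 0.397
# goals with a 2-goal win; its opponent was on one underperformance.
streak_x = update_streak(3, observed_gd=2, expected_gd=0.397)
streak_y = update_streak(-1, observed_gd=-2, expected_gd=-0.397)
```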
3.1.1 Description of the rating system

Revised (at time t) home (H) background rating (br) for home team x, given respective prior (at time t − 1) home background rating \( \text{br}_{\text{xH}_{t-1}} \):
$$ \text{br}_{\text{xH}_{t}} = \text{br}_{\text{xH}_{t-1}} + \psi_{x}\left( e \right) \times \lambda $$
Revised (at time t) away (A) background rating (br) for home team x, given respective prior (at time t − 1) away background rating \( \text{br}_{\text{xA}_{t-1}} \):
$$ \text{br}_{\text{xA}_{t}} = \text{br}_{\text{xA}_{t-1}} + \left( \text{br}_{\text{xH}_{t}} - \text{br}_{\text{xH}_{t-1}} \right) \times \gamma $$
Revised (at time t) away (A) background rating (br) for away team y, given respective prior (at time t − 1) away background rating \( \text{br}_{\text{yA}_{t-1}} \):
$$ \text{br}_{\text{yA}_{t}} = \text{br}_{\text{yA}_{t-1}} + \psi_{y}\left( e \right) \times \lambda $$
Revised (at time t) home (H) background rating (br) for away team y, given respective prior (at time t − 1) home background rating \( \text{br}_{\text{yH}_{t-1}} \):
$$ \text{br}_{\text{yH}_{t}} = \text{br}_{\text{yH}_{t-1}} + \left( \text{br}_{\text{yA}_{t}} - \text{br}_{\text{yA}_{t-1}} \right) \times \gamma $$
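The four update rules translate directly into code. The sketch below is illustrative rather than the authors' implementation: the function name and argument order are assumptions, and the values of ψ are passed in as inputs because the form of the diminishing function is not given in this section. The usage lines plug in the numbers from the worked example in Sect. 4.2 (λ = 0.054, γ = 0.79).

```python
def update_background_ratings(br_xH, br_xA, br_yA, br_yH, psi_x, psi_y, lam, gamma):
    """Apply the four background-rating updates after a match in which
    x is the home team and y is the away team.

    psi_x, psi_y: values of the diminishing function for each team (inputs here).
    lam:          learning rate lambda.
    gamma:        learning rate gamma (home/away cross-influence).
    """
    new_xH = br_xH + psi_x * lam
    new_xA = br_xA + (new_xH - br_xH) * gamma
    new_yA = br_yA + psi_y * lam
    new_yH = br_yH + (new_yA - br_yA) * gamma
    return new_xH, new_xA, new_yA, new_yH

# Inputs taken from the worked example in Sect. 4.2.
xH, xA, yA, yH = update_background_ratings(
    br_xH=0.463014, br_xA=0.208624, br_yA=0.037819, br_yH=0.537708,
    psi_x=1.246290, psi_y=-1.246290, lam=0.054, gamma=0.79,
)
```

To six decimal places these agree with the revised ratings reported in Sect. 4.2.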
3.1.2 Parameter optimisation
The learning rates are optimised on the global scale over all football leagues considered by the dataset, and are somewhat higher than the learning rates of λ = 0.035 and γ = 0.7 reported in the original pi-rating version (Constantinou and Fenton 2013a), which were based solely on the English Premier League (EPL). Note that the missing data incorporated into the training dataset as part of this competition, in the form of entire football seasons, is expected to have marginally inflated the global optimal learning rates. This is because, in the case where season t is missing, the team ratings at the start of season t + 1 are still strongly influenced by match results at the end of season t − 1 and hence, need to ‘catch up’ to current performance.
Similarly, and as shown in Fig. 2, the parameters with respect to the provisional ratings are optimised at δ = 2.5, μ = 0.01, and ϕ = 1, at which point the provisional ratings minimise the RPS at 0.211198. Note that while the average difference in RPS between the background and provisional ratings is rather marginal, it is still important because the form factor only affects a part of the 44,264 match instances considered for optimisation (i.e., teams that satisfy the ϕ criterion). In fact, the results show that the provisional ratings have influenced the predictive distribution 1X2 by up to a maximum of 2.75, 2.15 and 4.73 percentage points for each respective state of the distribution.
3.2 The Hybrid Bayesian Network model
1. AD: Captures 42 distinct ranks of ability difference between adversaries, driven by rating discrepancies. At its prior state, AD outputs a data-driven histogram of the predetermined ranks (see Fig. 6). Since the ranks are inferred from ratings, it makes sense that each rank is represented by an equal interval width, rather than by clusters (note that no visible clusters exist). The deterministic ranks enable us to capture extreme rating discrepancies between adversaries which, as shown in Fig. 4, are very important in determining extreme favourites and outsiders. Each rank spans a rating difference of 0.1, determined by the granularity of the 42 levels, which has been chosen to ensure that for any rating discrepancy there are sufficient data points for a reasonably well-informed prior.^{3} This level of complexity is significantly higher relative to the 28 ranks introduced in the original pi-rating system (Constantinou and Fenton 2013a). The relatively big dataset made available for this study, as part of the competition, has made it possible for the ranks of team ability difference to increase from 28 to 42.
2. RD: Represents a mixture of 42 ~ Gaussian distributions (one for each state of AD). At its prior state, RD represents the average discrepancy between home and away ratings, and assumes that the difference follows a ~ Gaussian distribution since the actual data-driven histogram of ancestor AD resembles a perfect ~ Gaussian distribution (see AD and RD in Fig. 6). This node takes the resulting provisional team ratings as input in the form of \( \text{pr}_{\text{xH}} - \text{pr}_{\text{yA}} \).
3. GH/GA: Represent discrete distributions which capture the data-driven histogram of goals scored for each team at home and away grounds, given AD. Note that while these distributions are not meant to be used as predictors for the number of goals scored by each team, they can be used to predict the score difference (in addition to the outcome of interest 1X2).
4. P: Represents a discrete probability distribution for the prediction of interest, with probabilities assigned to each of the three states of the 1X2 distribution.
The parameter learning of the BN model is restricted to match instances where both the home and away teams have already played a minimum of 50^{4} match instances for each specific league and division they participate in. This restriction ensures that team ratings have converged well prior to being considered as training samples by the model. As a result, the size of the training dataset is reduced from 216,743 to 149,772 samples. Tables 9, 10, 11, 12 and 13, in “Appendix A”, present the Conditional Probability Tables (CPTs) for each of the BN model nodes, which are learnt using Maximum Likelihood Estimation for parameter learning, based on the data provided for the competition. Figure 6, in “Appendix B”, illustrates the prior outputs of the BN model.
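The 50-match restriction amounts to a chronological filter over the match list. The sketch below shows one way to express it; the record layout and function name are assumptions, and "already played" is interpreted as matches completed before the match in question.

```python
from collections import Counter

def eligible_training_matches(matches, min_played=50):
    """Yield matches where both teams had already played at least `min_played`
    matches in that league before the match. `matches` must be in date order."""
    played = Counter()  # (team, league) -> matches played so far
    for m in matches:
        home_key = (m["home"], m["league"])
        away_key = (m["away"], m["league"])
        if played[home_key] >= min_played and played[away_key] >= min_played:
            yield m
        played[home_key] += 1
        played[away_key] += 1

# Toy fixture list with a threshold of 2 instead of 50 (hypothetical data).
fixtures = [
    {"league": "L1", "home": "A", "away": "B"},
    {"league": "L1", "home": "B", "away": "A"},
    {"league": "L1", "home": "A", "away": "B"},  # both teams have now played 2
    {"league": "L2", "home": "A", "away": "C"},  # new league: counts restart
]
kept = list(eligible_training_matches(fixtures, min_played=2))
```

Because the counter is keyed by team and league, a team that changes division starts accumulating matches again in the new league, in line with the per-league ratings.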
Predetermined levels of team ability difference, where R is the rank of rating difference, C is the rating condition, and S is the sample size of match instances that satisfy C
R  1  2  …  22  23  …  41  42
C  > 2.1  > 2 and ≤ 2.1  …  > 0 and ≤ 0.1  > −0.1 and ≤ 0  …  > −1.9 and ≤ −1.8  ≤ −1.9
S  201  145  …  9554  8680  …  32  50
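The rank conditions in the table follow a regular pattern (intervals of width 0.1, open on the left and closed on the right, with open-ended extremes), so the mapping from a rating difference to a rank can be sketched as below. The function name `ability_rank` is hypothetical, and the intermediate ranks elided in the table are assumed to continue the same pattern.

```python
from bisect import bisect_left

# Upper bounds of ranks 42 down to 2: -1.9, -1.8, ..., 2.1. They are built
# from integers so that each edge is the closest float to its decimal value.
_EDGES = [k / 10 for k in range(-19, 22)]

def ability_rank(rd):
    """Map a rating difference (home minus away) to a rank in 1..42.

    Rank 1 is rd > 2.1, rank 42 is rd <= -1.9, and every rank in between
    covers an interval (lower, upper] of width 0.1.
    """
    return 42 - bisect_left(_EDGES, rd)
```

For instance, `ability_rank(0.05)` returns 22 and `ability_rank(0.0)` returns 23, matching the two central columns of the table.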
4 Worked example of Dolores
4.1 Predicting match outcomes from team ratings

Home prior background rating for team x: \( \text{br}_{\text{xH}_{t-1}} = 0.463014 \).
Away prior background rating for team x: \( \text{br}_{\text{xA}_{t-1}} = 0.208624 \).
Away prior background rating for team y: \( \text{br}_{\text{yA}_{t-1}} = 0.037819 \).
Home prior background rating for team y: \( \text{br}_{\text{yH}_{t-1}} = 0.537708 \).
4.2 Revising team ratings from match results
$$ \text{br}_{\text{xH}_{t}} = \text{br}_{\text{xH}_{t-1}} + \psi_{x}\left( e \right) \times \lambda = 0.463014 + 1.246290 \times 0.054 = 0.530314 $$
$$ \text{br}_{\text{xA}_{t}} = \text{br}_{\text{xA}_{t-1}} + \left( \text{br}_{\text{xH}_{t}} - \text{br}_{\text{xH}_{t-1}} \right) \times \gamma = 0.208624 + \left( 0.530314 - 0.463014 \right) \times 0.79 = 0.261791 $$
$$ \text{br}_{\text{yA}_{t}} = \text{br}_{\text{yA}_{t-1}} + \psi_{y}\left( e \right) \times \lambda = 0.037819 + \left( -1.246290 \right) \times 0.054 = -0.029481 $$
$$ \text{br}_{\text{yH}_{t}} = \text{br}_{\text{yH}_{t-1}} + \left( \text{br}_{\text{yA}_{t}} - \text{br}_{\text{yA}_{t-1}} \right) \times \gamma = 0.537708 + \left( -0.029481 - 0.037819 \right) \times 0.79 = 0.484541 $$
Finally, we need to update the form counters for both teams x and y. This would be the fourth continuous overperformance for team x, because the expectation was a goal difference of 0.397 in favour of team x, relative to the observed goal difference of 2 in favour of team x. Similarly, this would be the second continuous underperformance for team y. As a result, \( \phi_{\text{cx}} = 4 \) and \( \phi_{\text{cy}} = -2 \). Now the ratings are ready to be used for future match prediction (i.e., repeat of Sect. 4.1) and later revised based on future match results (i.e., repeat of Sect. 4.2).
5 Evaluation and discussion
The model is evaluated in terms of both predictive accuracy and profitability against published market odds. This section covers these two methods of predictive evaluation in turn.
5.1 Predictive accuracy
As part of the competition, the RPS function (Epstein 1969) is selected to determine the predictive accuracy of the models. The RPS is shown to be more appropriate in assessing probabilistic football match predictions than other more popular metrics, such as the RMS and Brier score (Constantinou and Fenton 2012). This is because the RPS is a scoring function suitable for evaluating probabilistic outcomes of ordinal, rather than nominal, scale. For example, in the case of predicting the winning lottery number, if the winning number is 10 then a prediction of 11 is no better than a prediction of 49; i.e., they are both equally wrong. However, in the case of football match prediction, if the observed outcome is a home win, then a prediction of a draw is less inaccurate than a prediction of an away win, even though neither of those outcomes occurred; i.e., they are not equally wrong.
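For a single match with r ordered outcomes, the RPS is commonly written as
$$ \text{RPS} = \frac{1}{r - 1} \sum_{i=1}^{r-1} \left( \sum_{j=1}^{i} \left( p_{j} - e_{j} \right) \right)^{2} $$
where \( p_{j} \) is the predicted probability of outcome j and \( e_{j} \) is 1 for the observed outcome and 0 otherwise. The short Python sketch below implements this form for the ordering home win, draw, away win; the function name and argument layout are assumptions made for illustration.

```python
def rps(probs, outcome_index):
    """Rank Probability Score for one match.

    probs:         predicted probabilities in ordinal order, e.g. (home, draw, away).
    outcome_index: position of the observed outcome in that ordering.
    Lower is better; 0 is a perfect prediction.
    """
    r = len(probs)
    cum_pred = 0.0
    cum_obs = 0.0
    total = 0.0
    for i in range(r - 1):
        cum_pred += probs[i]
        cum_obs += 1.0 if i == outcome_index else 0.0
        total += (cum_pred - cum_obs) ** 2
    return total / (r - 1)

# Observed outcome: home win (index 0).
draw_error = rps((0.0, 1.0, 0.0), 0)   # all mass on a draw
away_error = rps((0.0, 0.0, 1.0), 0)   # all mass on an away win
```

With a home win observed, placing all probability on a draw scores 0.5 while placing it on an away win scores 1.0, which is the ordinal behaviour described above.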
The results from the international special issue competition Machine Learning for Soccer (Berrar et al. 2017), determined by the RPS function. ‘Team ACC’ represents Dolores described in this paper
Position  Participant  RPS  Relative performance (%) 

1  Team OH  0.206307  100 
2  Team ACC  0.208256  99.06 
3  Team FK  0.208651  98.88 
4  Team HEM  0.217665  94.78 
5  Team EB  0.225827  91.36 
6  Team LJ^{a}  0.231297  89.2 
7  Team AT  0.398058  51.83 
8  Team LHE  0.451456  45.7 
9  Team EDS  0.451456  45.7 
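The relative performance column and the percentage gaps quoted in the abstract can be reproduced from the RPS values in the table. The formula below (best RPS divided by a participant's RPS) is inferred from the tabulated numbers rather than stated in the text, so it should be read as an assumption; the function name is hypothetical.

```python
def relative_performance(best_rps, rps_value):
    """Relative performance in percent, assuming best_rps / rps_value."""
    return 100.0 * best_rps / rps_value

best, dolores, bottom = 0.206307, 0.208256, 0.451456

rel = relative_performance(best, dolores)          # Team ACC row of the table
gap_to_top = 100.0 * (dolores / best - 1.0)        # error relative to the winner
gap_to_bottom = 100.0 * (bottom / dolores - 1.0)   # error of the last place relative to Dolores
```

Rounded to two decimals these give 99.06, 0.94 and 116.78, matching the table and the figures quoted in the abstract.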
The 52 leagues ranked by the model’s ability to correctly predict match outcomes in each of those leagues, as determined by the RPS (Color table online)
5.2 Profitability
Naturally, the performance of a football model can also be determined by its ability to generate profit against published market odds. In Constantinou et al. (2013) we argued that it can be misleading to focus the evaluation of a football model solely on maximising or minimising a scoring function because (a) different scoring functions can generate different conclusions about which model is ‘best’, and (b) in financial domains researchers demonstrated a weak relationship between the various accuracy metrics and actual profitability (Leitch and Tanner 1991).
1. The published market odds, which differ depending on the selected bookmaker for validation purposes. However, in Constantinou and Fenton (2013b) we showed that the divergence in odds between bookmaking firms is limited to the point that arbitrage opportunities are eliminated or, otherwise, minimised.
2. The bookmakers’ incorporated profit margin, which is also known as the ‘overround’, and represents the ‘unfair’ advantage introduced in published market odds, to practically guarantee profit for the house^{5} over time. In Constantinou and Fenton (2013b) we showed that while the discrepancy in profit margins between bookmakers decreases over time due to competition, they can still differ considerably between online bookmakers and hence, the selection of the bookmaker can have a significant impact on profitability.
3. The betting strategy, which is an important decision-making problem. Betting decision making is normally based on a discrepancy threshold associated with the difference between predicted and bookmakers’ probabilities (converted from odds), in favour of the model in terms of payoff. The value of the bet is either fixed throughout the betting simulation, or determined by the Kelly criterion (Kelly 1956).
4. The interpretation of the results, which is typically based on the return-on-investment (ROI) or the net profits. In Constantinou et al. (2013) we argued that ROI can be a misleading figure. Consider the following two scenarios:
 a. Model A suggests two £100 bets and both are successful (100% winning rate), returning a net profit of £200, which represents a ROI of 100%.
 b. Model B suggests five £100 bets and four of them are successful (80% winning rate), returning a net profit of £300, which represents a ROI of 60%.
A profitability evaluator based on ROI would have erroneously considered model B inferior to model A at maximising profit. This is because it fails to consider the possibility that model A might have failed to discover all of the potential betting opportunities in the same way model B did. Conversely, a model which maximises ROI can still be useful in cases where we are interested in minimising the risk of negative returns in exchange for a lower expected net profit.
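The arithmetic behind the two scenarios can be checked with a few lines of Python. The figures imply that every winning bet pays even money (a £100 stake returns £200); that reading, like the function name, is an assumption made for the illustration.

```python
def roi_and_profit(stakes, returns):
    """Return (ROI in percent, net profit) for a sequence of settled bets.

    stakes:  amount staked on each bet.
    returns: amount paid back on each bet (stake included), 0 for a losing bet.
    """
    total_stake = sum(stakes)
    profit = sum(returns) - total_stake
    return 100.0 * profit / total_stake, profit

# Model A: two £100 bets, both successful.
roi_a, profit_a = roi_and_profit([100, 100], [200, 200])

# Model B: five £100 bets, four successful.
roi_b, profit_b = roi_and_profit([100] * 5, [200, 200, 200, 200, 0])
```

Model A has the higher ROI (100% against 60%) while model B has the higher net profit (£300 against £200), which is why ranking the two models by ROI alone would be misleading.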
1. Are based on Football-Data (2017), which captures the published market odds offered by a number of bookmakers over many leagues. The odds are recorded on Friday afternoons for weekend games and on Tuesday afternoons for midweek games.
2. Consider the maximum bookmakers’ odds, which represent the best available odds over a number of fixed odds bookmakers (e.g. excluding Betfair Exchange odds).
3. Are based on all the leagues offered by Football-Data (2017); a total of 21 leagues, where 11 are top divisions and 10 are lower divisions, starting from season 2010/11 to March 2017.
4. Do not assume that the profit margin, which ranges between −0.04 and 1.63% for the best available odds (as discussed in (2) above), is eliminated.
5. Do not take advantage of any arbitrage opportunities that may arise between bookmakers’ odds.
6. Are based on the typical betting decision strategy whereby a bet is simulated on the outcome of a match instance that offers a payoff which exceeds a predetermined level of discrepancy between predicted and offered odds, in terms of probability. The discrepancy threshold found to maximise overall net profits is 8% (absolute). If more than one outcome meets the discrepancy threshold, only the outcome with the highest discrepancy is chosen for betting.
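The decision rule in point 6 and the profit margin in point 4 can be sketched as follows. This is an illustration under assumptions, not the simulation code used in the paper: decimal odds are converted to probabilities as 1/odds without removing the margin, the function names are hypothetical, and the example odds and probabilities are invented.

```python
def choose_bet(model_probs, best_odds, threshold=0.08):
    """Return the index of the outcome to bet on (0 = home, 1 = draw, 2 = away),
    or None if no outcome exceeds the discrepancy threshold.

    model_probs: predicted 1X2 probabilities.
    best_odds:   best available decimal odds for the same three outcomes.
    """
    implied = [1.0 / o for o in best_odds]   # margin left in, not normalised
    gaps = [p - q for p, q in zip(model_probs, implied)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return best if gaps[best] > threshold else None

def profit_margin(best_odds):
    """Bookmakers' margin (overround) implied by a set of decimal odds."""
    return sum(1.0 / o for o in best_odds) - 1.0

# Invented example: the model rates the home win 10 points above the market.
pick = choose_bet((0.50, 0.27, 0.23), (2.5, 3.4, 3.0))
no_pick = choose_bet((0.42, 0.29, 0.29), (2.5, 3.4, 3.0))
```

Only the single outcome with the largest discrepancy is returned, so at most one bet is simulated per match.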
Profitability for top division leagues in Europe, ranked by the bookmakers’ built-in profit margin
League  Bets simulated  Average betting odds  Win rate (%)  Returns  Profit  ROI (%)  Profit margin (%) 

GER1  559  5.38  29.87  £570.89  £11.89  2.13  − 0.04 
SPA1  667  6.31  25.04  £743.46  £76.46  11.46  0.00 
ENG1  686  5.20  28.57  £823.72  £137.72  20.08  0.05 
ITA1  767  5.43  23.47  £697.94  − £69.06  − 9.00  0.12 
FRA1  707  4.80  27.44  £672.22  − £34.78  − 4.92  0.32 
HOL1  442  4.42  28.96  £402.42  − £39.58  − 8.95  0.83 
SCO1  325  5.16  29.54  £371.70  £46.70  14.37  0.84 
POR1  510  6.79  21.57  £483.85  − £26.15  − 5.13  0.95 
TUR1  393  5.29  22.90  £369.86  − £23.14  − 5.89  1.03 
BEL1  382  4.29  31.15  £392.17  £10.17  2.66  1.10 
GRE1  556  7.33  18.53  £531.15  − £24.85  − 4.47  1.20 
Overall  5994  5.54  26.09  £6059.38  £65.38  1.09  0.58 
Profitability for lower division leagues in Europe, ranked by the bookmakers’ built-in profit margin
League  Bets simulated  Average betting odds  Win rate (%)  Returns  Profit  ROI (%)  Profit margin (%) 

ENG2  750  3.28  31.33  £655.74  − £94.26  − 12.57  0.61 
FRA2  735  3.82  25.85  £631.67  − £103.33  − 14.06  0.67 
GER2  474  3.71  28.06  £419.46  − £54.54  − 11.51  0.73 
ENG3  927  3.38  33.76  £959.12  £32.12  3.46  0.77 
ENG4  838  3.40  32.10  £823.99  − £14.01  − 1.67  0.79 
ITA2  860  4.04  32.56  £937.62  £77.62  9.03  1.20 
SCO2  281  5.24  26.33  £308.48  £27.48  9.78  1.27 
SPA2  673  4.12  28.68  £631.36  − £41.64  − 6.19  1.33 
SCO3  343  4.58  33.24  £373.42  £30.42  8.87  1.44 
SCO4  225  4.91  34.67  £269.41  £44.41  19.74  1.63 
Overall  6106  3.83  30.66  £6010.27  − £95.73  − 1.57  1.04 
It has long been assumed that enormous betting volumes dictate a part of the odds, a way for bookmakers to trade marginal levels of predictive accuracy for higher profits. Odds which are biased due to betting volumes can be exploited by predictive models. This study supports this assumption, based on the high profitability generated on match instances of the EPL, which is by far the most popular football league. It is also crucial to note that the popularity of the EPL has made it the most likely choice for assessing football match prediction models in the academic literature. This is problematic because, as shown in Tables 6 and 7, the level of profitability observed on match instances of the EPL is not repeated in any of the remaining 20 leagues. Additionally, the results show that profitability is not consistent between seasons: based on 76 to 135 bets per EPL season, it ranges between −6.4% and 38% ROI, or between −£6.5 and £39.7 in net profits.
Table 8 Overall profitability generated per football season, over all of the 21 leagues
Season  Bets simulated  Average betting odds  Win rate (%)  Returns  Profit  ROI (%)  Profit margin (%)
2010/2011  1475  4.62  27.93  £1469.06  −£5.94  −0.40  1.37
2011/2012  1562  5.12  26.06  £1538.89  −£23.11  −1.48  0.98
2012/2013  1691  4.42  29.92  £1758.98  £67.98  4.02  0.86
2013/2014  1713  4.75  28.96  £1738.09  £25.09  1.46  0.54
2014/2015  2099  4.63  28.68  £2097.33  −£1.67  −0.08  0.15
2015/2016  2054  4.68  28.24  £2074.75  £20.75  1.01  0.71
2016/2017  1506  4.58  28.29  £1392.55  −£113.45  −7.53  0.49
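The relationship between the Returns, Profit and ROI columns can be checked with a short Python sketch. It assumes a stake of £1 per simulated bet, which is consistent with the tables (in every row, profit equals returns minus the number of bets) but is not stated explicitly in this section. The names `SEASONS`, `STAKE`, `profit`, `roi_percent` and `overall` are illustrative only.

```python
from typing import List, Tuple

# (season, bets simulated, returns in GBP), copied from the per-season table.
SEASONS: List[Tuple[str, int, float]] = [
    ("2010/2011", 1475, 1469.06),
    ("2011/2012", 1562, 1538.89),
    ("2012/2013", 1691, 1758.98),
    ("2013/2014", 1713, 1738.09),
    ("2014/2015", 2099, 2097.33),
    ("2015/2016", 2054, 2074.75),
    ("2016/2017", 1506, 1392.55),
]

STAKE = 1.0  # assumption: one pound staked on every simulated bet


def profit(bets: int, returns: float) -> float:
    """Net profit: total returns minus the total amount staked."""
    return returns - bets * STAKE


def roi_percent(bets: int, returns: float) -> float:
    """Return on investment, as a percentage of the total amount staked."""
    return 100.0 * profit(bets, returns) / (bets * STAKE)


def overall(rows: List[Tuple[str, int, float]]) -> Tuple[int, float, float]:
    """Pool all rows and return (total bets, net profit, ROI in percent)."""
    total_bets = sum(bets for _, bets, _ in rows)
    total_returns = sum(returns for _, _, returns in rows)
    return (total_bets,
            profit(total_bets, total_returns),
            roi_percent(total_bets, total_returns))


if __name__ == "__main__":
    for season, bets, returns in SEASONS:
        print(f"{season}: ROI {roi_percent(bets, returns):.2f}%")
    total_bets, net, roi = overall(SEASONS)
    print(f"All seasons: {total_bets} bets, net {net:.2f}, ROI {roi:.2f}%")
```

Pooled over the seven seasons, the table rows sum to 12,100 bets and a net loss of £30.35, which agrees with the combined totals of the top division and lower division tables (£65.38 − £95.73). Pooling weights each season by its number of bets, so the pooled ROI differs from a simple average of the seasonal ROI figures.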
6 Concluding remarks
The paper described Dolores, which is a model designed to predict football match outcomes from all over the world, as part of the international special issue competition Machine Learning for Soccer. The model is novel in its approach which is based on (a) dynamic ratings for temporal analysis, and (b) a hybrid BN model that takes the resulting ratings from (a) as input to infer the 1X2 distribution. The model was trained with a dataset of 52 leagues, which includes different divisions from 35 countries. Unlike past relevant literature, this model is designed in a way that enables it to predict football match outcomes of teams in one country by observing match outcomes of teams in multiple countries.
The predictive accuracy of Dolores was assessed as part of the competition, which involved predicting 206 future match instances from 26 different leagues, played from March 31 to April 9 in 2017. The paper extends the assessment of the model to a profitability-based validation, based on bookmakers’ odds from 21 different leagues and over a period of approximately seven football seasons. The results indicate marginal profits of 1.09% ROI over all top divisions, and marginal losses of −1.57% ROI over all lower divisions. While the overall ROI^{7} is not impressive, it still serves as empirical proof that the model, which was solely based on goal data, has generalised well over all leagues and divisions, even accounting for the missing data incorporated into the dataset as part of the challenge. Furthermore, while detailed historical performance for each team is typically required to maximise predictive accuracy, Dolores provides empirical proof that a model can make a good prediction for a match outcome between teams x and y even when the prediction is derived from historical match data that neither x nor y participated in.
Further to profitability, it is important to note that relevant academic literature is often driven by profitability from betting simulations on match instances of the EPL. In many cases, these results are based on a single season of the EPL. Interestingly, Dolores generated an ROI of more than 20% over approximately seven seasons of the EPL, a rather impressive performance. However, as shown in Tables 6 and 7, this level of profitability is not repeated in any of the remaining 20 leagues taken into consideration. Given that the EPL is the most popular league, this reinforces the popular hypothesis that the enormous betting volumes dictate part of the published market odds, and that this enables predictive models to exploit the resulting inaccuracies. Moreover, the results show that profitability between seasons of the same league is not consistent. In the case of the EPL, and over seven seasons of betting simulations, annual profitability ranges between −6.4% and 38% ROI. These all-inclusive results raise some concerns about the validity of conclusions in past relevant literature. This is because, while there is nothing wrong with demonstrating that a model can identify such (possibly) biased odds and generate profit from bets on match instances of the EPL, there is still a risk that such results will be misinterpreted as generic and independent of the EPL. The results from this study also suggest that it would be best to extend assessments of profitability over multiple seasons.
Finally, past studies have shown that it is possible to increase the predictive accuracy of a model by incorporating other key factors, such as player transfers, availability of key players, participation in international competitions, new coach, level of injuries, attack and defence ratings, and even team motivation/psychology in the form of expert knowledge (Constantinou et al. 2012; Pena 2014; Szczepanski and McHale 2015; Constantinou and Fenton 2017). Because of the competition requirements and the multiple leagues captured by the dataset, the model presented in this paper had to be restricted to goal scoring data. Future work will investigate ways to extend Dolores towards accounting for such additional key factors of interest.
Footnotes
 1. The reverse assumption had also been examined and was found to decrease predictive accuracy.
 2. Constantinou and Fenton (2013a) proposed the function ψ(e) to diminish the importance of high score differences. While in both studies the function appears to adequately capture the importance of high score differences, a weakness of this function is that it is deterministic, in exchange for reduced model complexity.
 3. The minimum sample size is 32 at R = 41.
 4. In Constantinou and Fenton (2013a), 30 iterations of rating development were found to be sufficient in the case of the EPL. In this study, the number of iterations has been increased to 50, even though the learning rates are higher and promise faster convergence of the ratings. This is because, in this study, we generalise the model over 52 leagues and hence, it is more than likely that some leagues exist in which the rating difference between the strongest and weakest teams is considerably higher relative to the respective difference when only focusing on the EPL, as in the original study.
 5. This ‘unfairness’ is similar to the payoffs offered on roulette where the house has an edge, or a profit margin, of 1/37 (or 2.7%) in the case of the European roulette, and 2/38 (or 5.26%) in the case of the American version.
 6. ROI has been chosen over the net profits to ensure that the graphs in Fig. 5 do not generate a trend that is biased towards the number of bets simulated per league.
 7. Note that the betting strategy was optimised for net profits rather than ROI [refer to Sect. 5.2, point (iv)].
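The roulette figures quoted in footnote 5 follow from a one-line expected-value calculation, sketched below in Python. The sketch assumes the standard single-number payout of 35 to 1. The second function shows one common way of expressing a bookmaker's margin on a 1X2 market, as the sum of the implied probabilities minus one; whether this is exactly how the profit margin column of the tables was computed is an assumption, and both function names are illustrative.

```python
from typing import Iterable


def roulette_house_edge(pockets: int, payout_to_one: int = 35) -> float:
    """House edge on a single-number bet, as a fraction of the stake."""
    p_win = 1.0 / pockets
    # Expected net return per unit staked: win the payout, or lose the stake.
    expected_return = p_win * payout_to_one - (1.0 - p_win)
    return -expected_return


def bookmaker_margin(decimal_odds: Iterable[float]) -> float:
    """Overround of a market: sum of implied probabilities minus one.

    Assumption: implied probability is 1 / decimal odds for each outcome.
    """
    return sum(1.0 / o for o in decimal_odds) - 1.0


if __name__ == "__main__":
    print(f"European roulette: {roulette_house_edge(37):.4f}")  # 1/37
    print(f"American roulette: {roulette_house_edge(38):.4f}")  # 2/38
    # Hypothetical 1X2 odds for a single match.
    print(f"Example margin: {bookmaker_margin([2.00, 3.50, 4.00]):.4f}")
```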
Acknowledgements
This study was partly supported by the European Research Council (ERC), Research Project ERC-2013-AdG339182-BAYES_KNOWLEDGE.
References
 Angelini, G., & Angelis, L. D. (2017). PARX model for football match predictions. Journal of Forecasting, 36, 795.
 Arabzad, S. M., Araghi, M. E. T., Sadi-Nezhad, S., & Ghofrani, N. (2014). Football match results prediction using artificial neural networks; The case of Iran Pro League. International Journal of Applied Research on Industrial Engineering, 1(3), 159–179.
 Baio, G., & Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics, 37(2), 253–264.
 Berrar, D., Dubitzky, W., Davis, J., & Lopes, P. (2017). Machine learning for soccer. Retrieved September 1, 2017 from https://osf.io/ftuva/.
 Britannica. (2017). Football (Association Football, Soccer). In Encyclopaedia Britannica, Retrieved April 19, 2017 from https://www.britannica.com/sports/footballsoccer.
 Cheng, T., Cui, D., Fan, Z., Zhou, J., & Lu, S. (2003). A new model to forecast the results of matches based on hybrid neural networks in the soccer rating system. In IEEE Xplore.
 Constantinou, A. C., & Fenton, N. E. (2012). Solving the Problem of Inadequate Scoring Rules for Assessing Probabilistic Football Forecast Models. Journal of Quantitative Analysis in Sports, 8(1), 1–14.
 Constantinou, A. C., & Fenton, N. E. (2013a). Determining the level of ability of football teams by dynamic ratings based on the relative discrepancies in scores between adversaries. Journal of Quantitative Analysis in Sports, 9(1), 37–50.
 Constantinou, A. C., & Fenton, N. E. (2013b). Profiting from arbitrage and odds biases of the European football gambling market. The Journal of Gambling Business and Economics, 7(2), 41–70.
 Constantinou, A., & Fenton, N. (2017). Towards smart-data: Improving predictive accuracy in long-term football team performance. Knowledge-Based Systems, 124, 93–104.
 Constantinou, A. C., Fenton, N. E., & Neil, M. (2012). pi-football: A Bayesian network model for forecasting Association Football match outcomes. Knowledge-Based Systems, 36, 322–339.
 Constantinou, A. C., Fenton, N. E., & Neil, M. (2013). Profiting from an inefficient Association Football gambling market: Prediction, Risk and Uncertainty using Bayesian networks. Knowledge-Based Systems, 50, 60–86.
 Daily Mail. (2015). Global sports gambling worth ‘up to $3 trillion’. Daily Mail. Retrieved April 19, 2017 from http://www.dailymail.co.uk/wires/afp/article3040540/Globalsportsgamblingworth3trillion.html.
 Dayaratna, K. D., & Miller, S. J. (2013). The Pythagorean won-loss formula and hockey: A statistical justification for using the classic baseball formula as an evaluative tool in hockey (pp. 193–209). XVI: The Hockey Research Journal.
 Deloitte. (2016). Annual Review of Football Finance 2016. Deloitte. Retrieved April 19, 2017 from https://www2.deloitte.com/uk/en/pages/sportsbusinessgroup/articles/annualreviewoffootballfinance.html.
 Dixon, M. J., & Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Applied Statistics, 46(2), 265–280.
 Dunning, E. (1999). The development of soccer as a world game. In Sports Matters: Sociological Studies of Sport Violence and Civilisation. London: Routledge.
 Elo, A. E. (1978). The rating of chess players, past and present. New York: Arco Publishing.
 Epstein, E. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8, 985–987.
 FIFA. (2017). FIFA/Coca-Cola World Ranking. FIFA. Retrieved April 19, 2017 from http://www.fifa.com/fifaworldranking/procedure/men.html.
 Football-Data. (2017). Historical Football Results and Betting Odds Data. Retrieved April 4, 2017 from http://www.footballdata.co.uk/data.php.
 Forrest, D., Goddard, J., & Simmons, R. (2005). Odds-setters as forecasters: The case of English football. International Journal of Forecasting, 21, 551–564.
 Gelman, A., Carlin, J., Stern, H., & Rubin, D. (2003). Bayesian data analysis (2nd ed.). Boca Raton: Chapman and Hall/CRC.
 Goddard, J. (2005). Regression models for forecasting goals and match results in association football. International Journal of Forecasting, 21, 331–340.
 Goddard, J., & Asimakopoulos, I. (2004). Forecasting football results and the efficiency of fixed-odds betting. Journal of Forecasting, 23, 51–66.
 Hamilton, H. (2011). An extension of the pythagorean expectation for association football. Journal of Quantitative Analysis in Sports, 7(2), 1–18.
 Huang, K., & Chang, W. (2010). A neural network method for prediction of 2006 World Cup Football Game. In IEEE Xplore.
 Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match result prediction in association football. International Journal of Forecasting, 26, 460–470.
 Joseph, A., Fenton, N., & Neil, M. (2006). Predicting football results using Bayesian nets and other machine learning techniques. Knowledge-Based Systems, 7, 544–553.
 Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D (The Statistician), 52(3), 381–393.
 Kelly, J. L. (1956). A new interpretation of information rate. Bell System Technical Journal, 35(4), 917–926.
 Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. Cambridge: The MIT Press.
 Kuypers, T. (2000). Information and efficiency: An empirical study of a fixed odds betting market. Applied Economics, 32, 1353–1363.
 Lee, A. J. (1997). Modeling scores in the Premier League: Is Manchester United really the best? Chance, 10(1), 15–19.
 Leitch, G., & Tanner, J. E. (1991). Economic forecast evaluation: Profits versus the conventional error measures. American Economic Association, 81(3), 580–590.
 Leitner, C., Zeileis, A., & Hornik, K. (2010). Forecasting sports tournaments by ratings of (prob)abilities: A comparison for the EURO 2008. International Journal of Forecasting, 26, 471–481.
 Maher, M. J. (1982). Modelling association football scores. Statistica Neerlandica, 36(3), 109–111.
 Miller, S. J. (2006). A derivation of the pythagorean won-loss formula in baseball. arXiv:math/0509698 [math.ST].
 O’Shaughnessy, D. (2006). Possession versus position: Strategic evaluation in AFL. Journal of Sports Science & Medicine, 5(4), 533–540.
 Oliver, D. (2004). Basketball on paper: Rules and tools for performance analysis. Washington, DC: Brassey’s Inc.
 Pearl, J. (1982). Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI-82 Proceedings (pp. 133–136).
 Pearl, J. (1985). A model of activated memory for evidential reasoning. In Proceedings of the cognitive science society (pp. 329–334).
 Pearl, J. (2009). Causality: Models, reasoning and inference (2nd ed.). Cambridge: Cambridge University Press.
 Pena, J. L. (2014). A Markovian model for association football possession and its outcomes. arXiv:1403.7993 [math.PR].
 Piette, J., Pham, L., & Anand, S. (2011). Evaluating basketball player performance via statistical network modeling. In MIT Sloan Sports Analytics Conference 2011, Boston, MA, USA.
 Pomeroy, K. (2017). 2018 Pomeroy College Basketball Ratings. Retrieved November 30, 2017 from https://kenpom.com/.
 Rotshtein, A., Posner, M., & Rakytyanska, A. (2005). Football predictions based on a fuzzy model with genetic and neural tuning. Cybernetics and Systems Analysis, 41(4), 619–630.
 Rue, H., & Salvesen, O. (2010). Prediction and retrospective analysis of soccer matches in a league. Journal of the Royal Statistical Society: Series D (The Statistician), 49(3), 399–418.
 Schatz, A. (2006). Pro football prospectus 2006: Statistics, analysis, and insight for the information age. New York: Workman Publishing Company.
 Szczepanski, L., & McHale, I. (2015). Beyond completion rate: Evaluating the passing ability of footballers. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(2), 513–533.
 Tsakonas, A., Dounias, G., Shtovba, S., & Vivdyuk, V. (2002). Soft computing-based result prediction of football games. In The first international conference on inductive modelling (ICIM2002), Lviv, Ukraine.