Is La Liga the most predictable European football league?

I’ve always been curious to know if any of the 4 major European leagues (Serie A, Bundesliga, Premiership, La Liga) are more predictable than others. La Liga certainly has a reputation as being dull and predictable, although this is due to the sheer dominance of Barcelona and Real Madrid in recent years. I’ve increased my database of football matches in order to improve my football prediction bot this summer, and so now have sufficient data to investigate. In this analysis my measure of predictability is whether the favourite according to the bookies won the match, so in essence I’m measuring the number of upsets. I have access to a data set comprising full match results from the last 11 seasons (20042005 through to 20152016) for the four major European leagues, along with the prior odds assigned to each of the three possible match outcomes. Note that the particular bookies I’m using is William Hill, and so in theory there could be an element of bias here as this is an English bookmakers, who presumably would have additional expertise in the English football leagues since this will be by far their biggest market. In the modern globalised world this could be completely wrong as they could contract out markets in different countries to local bookmakers, or form their odds by analysing the market - I have no idea, but just worth bearing in mind.

Initial look

I’ll be running my analysis in R, so I need to load some libraries as well as load my data (not shown).

library(dplyr)
library(ggplot2)
library(ggthemes)

Let’s have a quick look at the raw data (already collated in a variable called res) to check it looks sensible first:

knitr::kable(head(res))
idleagueseasonhomeoddsdrawoddsawayoddsoutcome
1premiership2012-20130.57803470.27777780.2083333h
2premiership2012-20130.80000000.18181820.0833333h
3premiership2012-20130.75187970.20000000.1111111a
4premiership2012-20130.18181820.23094690.6535948a
5premiership2012-20130.55555560.30769230.2000000d
6premiership2012-20130.55555560.27777780.2309469a
summary(res)
##        id                league           season        homeodds      
##  Min.   :    1   bundesliga1:3366   2005-2006:1446   Min.   :0.03846  
##  1st Qu.: 3882   laliga     :3800   2006-2007:1446   1st Qu.:0.37037  
##  Median : 8144   premiership:4180   2008-2009:1446   Median :0.47619  
##  Mean   : 8198   seriea     :4180   2009-2010:1446   Mean   :0.48377  
##  3rd Qu.:12405                      2010-2011:1446   3rd Qu.:0.59880  
##  Max.   :16286                      2011-2012:1446   Max.   :0.98039  
##                                     (Other)  :6850                    
##     drawodds          awayodds       outcome 
##  Min.   :0.05882   Min.   :0.01961   a:4358  
##  1st Qu.:0.27778   1st Qu.:0.20000   d:3955  
##  Median :0.30303   Median :0.29412   h:7213  
##  Mean   :0.28945   Mean   :0.31188           
##  3rd Qu.:0.32258   3rd Qu.:0.38462           
##  Max.   :0.98039   Max.   :0.92593           
## 

I now need to make a new column indicating whether the bookies got the prediction right or not, stored as an integer. This will be my dependent variable for much of the subsequent analysis, with the league factor being my principal independent variable of interest. I’m also curious to see how this predictability varies with time, and so I’ve included the season for each match under the season predictor.

res_clean <- res %>% 
                group_by(id) %>%
                mutate(predicted = c('h', 'd', 'a')[which.max(c(homeodds, drawodds, awayodds))],
                              correct = as.numeric(predicted==outcome))

Perfect, the data is in a format suitable for analysis.

knitr::kable(head(res_clean))
idleagueseasonhomeoddsdrawoddsawayoddsoutcomepredictedcorrect
1premiership2012-20130.57803470.27777780.2083333hh1
2premiership2012-20130.80000000.18181820.0833333hh1
3premiership2012-20130.75187970.20000000.1111111ah0
4premiership2012-20130.18181820.23094690.6535948aa1
5premiership2012-20130.55555560.30769230.2000000dh0
6premiership2012-20130.55555560.27777780.2309469ah0

Exploratory Analysis

Let’s do some quick summary statistics then. Firstly we can see the overall % of matches correctly predicted by the bookies:

mean(res_clean$correct)
## [1] 0.5314311

This is lower than I’d expected, considering that in the last two seasons Predictaball has got 48% and 43% I expected the bookies to be more accurate. It goes to show that football prediction is extremely challenging even for the professionals, although it is important to bear in mind that a bookies objective isn’t to maximise prediction accuracy, but rather profits by offering worse value odds (from a customer’s perspective).

We can break this down year on year, pooling together all the four leagues:

Across Europe there doesn’t appear to be any longitudinal shifts in bookie accuracy, I imagine that the league itself is a source of greater variation.

And what about accuracy by division?

res_clean %>%
        group_by(league) %>%
        summarise(accuracy = mean(correct) * 100) %>%
        arrange(-accuracy) %>%
        knitr::kable()
leagueaccuracy
laliga54.3
premiership53.9
seriea53.1
bundesliga151.0

Interestingly there doesn’t seem to be much difference between every league except the Bundesliga which is noticeably less accurate. People often deride La Liga for being too predictable, and while the main league outcomers tend to be one of 3 teams, the individual matches themselves look as tough to predict as those from any other league, with only a very slightly higher number of upsets (occasions where the favourite didn’t win).

Finally, we can look at accuracy by league and season, to see if there are any interesting patterns evident here which were masked when pooling the leagues together.

There are some interesting trends here, for instance the Premiership and Bundesliga follow a similar trend, with the Premiership having larger accuracies. The bookie’s accuracy in La Liga followed a very linear trend from 20052006 season culminating in the 20102011 season, before returning to similar levels as the other three leagues. Overall however, there does not appear to be a significant temporal aspect to the predictability of the leagues, with the league itself seemingly playing a more important role in the bookies’ accuracy than any time related factor.

Inference

From the earlier work we’re left with the impression of a subtle, but not large, effect of league on number of upsets (measured by whether the bookies’ favourite won), with seemingly little temporal element. Now let’s investigate these factors in more depth.

Chi-square

I’ll firstly run a $\chi^{2}$ test to see if there’s a statistically significant difference in the number of correct predictions by league, with the null hypothesis being that the proportion of upsets in each league is equal.

chisq.test(table(res_clean$correct, res_clean$league))
## 
##  Pearson's Chi-squared test
## 
## data:  table(res_clean$correct, res_clean$league)
## X-squared = 9.4037, df = 3, p-value = 0.02438

There does appear to be a statistically significant difference in proportion of upsets across the European leagues, even if it isn’t overly large in magnitude.

We can also investigate for a linear trend in seasons, using prop.trend.test to account for the fact that the independent variable is ordinal:

season_tabulated <- res_clean %>%
            group_by(season) %>%
            summarise(num_correct = sum(correct),
                      num = length(correct),
                      num_wrong = num - num_correct)
prop.trend.test(season_tabulated$num_correct, season_tabulated$num)
## 
##  Chi-squared Test for Trend in Proportions
## 
## data:  season_tabulated$num_correct out of season_tabulated$num ,
##  using scores: 1 2 3 4 5 6 7 8 9 10 11
## X-squared = 0.48574, df = 1, p-value = 0.4858

Here we don’t have sufficient evidence to reject the null hypothesis that there’s a linear trend which supports my initial conclusions from inspection of the time-series.

Logistic regression modelling

Another way of investigating these two factors is in a logistic regression model, since the outcome is the proportion of favourites which won (the complement of number of upsets) which can be modelled using a binomial distribution.

Firstly I’ll create the null model for comparison later on. Quite a high deviance, indicating the model isn’t fitting too well.

mod_null <- glm(as.factor(correct) ~ 1, data=res_clean, family='binomial')
summary(mod_null)
## 
## Call:
## glm(formula = as.factor(correct) ~ 1, family = "binomial", data = res_clean)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.231  -1.231   1.124   1.124   1.124  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.12589    0.01608   7.828 4.97e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21462  on 15525  degrees of freedom
## Residual deviance: 21462  on 15525  degrees of freedom
## AIC: 21464
## 
## Number of Fisher Scoring iterations: 3

Adding the league predictor only reduces the deviance by 10 (ignore the AIC for now), with the coefficients for each league showing that each league has a non-zero impact upon the predictability. Since the baseline group here is the Bundesliga, the positive magnitude of the betas shows that all three of the remaining leagues are more predictable, with La Liga being the most.

mod_league <- glm(as.factor(correct) ~ league, data=res_clean, family='binomial')
summary(mod_league)
## 
## Call:
## glm(formula = as.factor(correct) ~ league, family = "binomial", 
##     data = res_clean)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.251  -1.231   1.105   1.125   1.161  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        0.03803    0.03448   1.103  0.27000   
## leaguelaliga       0.13397    0.04743   2.825  0.00473 **
## leaguepremiership  0.11730    0.04638   2.529  0.01144 * 
## leagueseriea       0.08749    0.04636   1.887  0.05915 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21462  on 15525  degrees of freedom
## Residual deviance: 21453  on 15522  degrees of freedom
## AIC: 21461
## 
## Number of Fisher Scoring iterations: 3

A univariate model of the season predictor decreases the deviance further as expected due to having more dummy predictors (and therefore fewer degrees of freedom) than league. None of the individual seasons have statistically significant effects, aside from 2013-2014.

mod_season <- glm(as.factor(correct) ~ season, data=res_clean, family='binomial')
summary(mod_season)
## 
## Call:
## glm(formula = as.factor(correct) ~ season, family = "binomial", 
##     data = res_clean)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.296  -1.214   1.063   1.121   1.166  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      7.195e-02  5.263e-02   1.367  0.17157   
## season2006-2007 -4.429e-02  7.441e-02  -0.595  0.55170   
## season2007-2008  7.842e-02  8.089e-02   0.969  0.33230   
## season2008-2009  1.111e-01  7.456e-02   1.490  0.13611   
## season2009-2010  1.307e-01  7.460e-02   1.752  0.07982 . 
## season2010-2011  6.102e-02  7.449e-02   0.819  0.41265   
## season2011-2012 -1.385e-02  7.442e-02  -0.186  0.85241   
## season2012-2013  6.102e-02  7.449e-02   0.819  0.41265   
## season2013-2014  2.036e-01  7.476e-02   2.724  0.00645 **
## season2014-2015  1.385e-02  7.444e-02   0.186  0.85237   
## season2015-2016  7.767e-16  7.443e-02   0.000  1.00000   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21462  on 15525  degrees of freedom
## Residual deviance: 21443  on 15515  degrees of freedom
## AIC: 21465
## 
## Number of Fisher Scoring iterations: 3

The next step is to add both the league and season into a combined model, as shown below. This model again increases deviance as expected, but not by a large amount. Of the predictors, the dummy variables representing belonging to each league are statistically significant, while only one of the seasons is, reinforcing previous findings.

mod_seasonleague <- glm(as.factor(correct) ~ season + league, data=res_clean, family='binomial')
summary(mod_seasonleague)
## 
## Call:
## glm(formula = as.factor(correct) ~ season + league, family = "binomial", 
##     data = res_clean)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.318  -1.226   1.064   1.130   1.204  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       -1.818e-02  6.098e-02  -0.298  0.76556   
## season2006-2007   -4.432e-02  7.443e-02  -0.595  0.55157   
## season2007-2008    9.555e-02  8.153e-02   1.172  0.24120   
## season2008-2009    1.112e-01  7.458e-02   1.491  0.13598   
## season2009-2010    1.308e-01  7.462e-02   1.752  0.07972 . 
## season2010-2011    6.106e-02  7.451e-02   0.819  0.41251   
## season2011-2012   -1.385e-02  7.444e-02  -0.186  0.85236   
## season2012-2013    6.106e-02  7.451e-02   0.819  0.41251   
## season2013-2014    2.038e-01  7.478e-02   2.725  0.00643 **
## season2014-2015    1.386e-02  7.446e-02   0.186  0.85232   
## season2015-2016   -3.274e-16  7.445e-02   0.000  1.00000   
## leaguelaliga       1.381e-01  4.782e-02   2.888  0.00388 **
## leaguepremiership  1.174e-01  4.641e-02   2.531  0.01139 * 
## leagueseriea       8.760e-02  4.639e-02   1.888  0.05899 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21462  on 15525  degrees of freedom
## Residual deviance: 21433  on 15512  degrees of freedom
## AIC: 21461
## 
## Number of Fisher Scoring iterations: 3

We can also compare the models by AIC, which again reaffirms the belief that the league itself plays a far greater role in the number of upsets than the season.

data.frame(model=c('null', 'league', 'season', 'season+league'), aic=c(mod_null$aic, mod_league$aic, mod_season$aic, mod_seasonleague$aic)) %>%
            arrange(aic) %>%
            knitr::kable()
modelaic
league21460.82
season+league21461.22
null21464.21
season21464.91

Overall conclusions

The findings support the hypothesis that La Liga contains the fewest upsets out of the four major European football leagues, although the actual magnitude of the difference isn’t that large. Perhaps more interesting, is why the Bundesliga contains noticeably fewer upsets than the other leagues. An longitudinal effect has also been investigated but can be safely ignored.

When I come to forming predictive models of football leagues outside of the Premiership I’ll want to bear these subtle differences in mind, most likely by using a multi-level model to account for the fundamental hierarchical nature. If I decide to go down the Bayesian route this will be a simple addition, whereby the league can be modelled as a random effect.

Related

comments powered by Disqus