Implementing an Elo rating system for European football

Last updated on Jan 8, 2018 13 min read

My football prediction has previously relied upon a Bayesian approach to quantify a team’s skill level, by modelling it as a random intercept in a hierarchical model of the outcome of a match. While this model performed very well (62% accuracy last season), I was never fully satisfied since this measure of skill is an average across the last ten seasons that I had data for, rather than being updated to reflect the time-varying nature of form. This meant that a team which had experienced a large skill change in recent years and was heading on the same trajectory was poorly predicted by the model, such as was the case with Red Bull Leipzig, who came second in the Bundesliga in their first season after promotion.

In an attempt to combat this I added extra predictors to each match that measured the form of each team by counting the number of wins and losses from their previous 5 games. However, this method doesn’t take into account the oppponents of those games or whether they had home advantage or not. Instead, what I really wanted was a longitudinal measure of skill, such as the Elo system. This post provides a brief introduction to the Elo scoring system before going into detail on my implementation for European club football.

Elo

The Elo rating system is a method of quantifying the skill of competitors in a head-to-head competition, which was initially designed for chess. It assumes that a team’s performance is a random variable centered on its rating, in the original implementation a normal distribution was described although the logistic distribution is also commonly used as it allows for more upsets due its fatter tails.

The system is started by determining what an average rating is, with 1500 typically being chosen. This is assigned to every newcomer and is updated after each match to reflect the team’s performance. The amount to update by takes into account the expected match outcome, i.e. if Leeds beat Man Utd they would receive far more Elo points than if they beat another team in the Championship, since beating Man Utd is far more of an upset. It is a zero-sum system: the amount of Elo points won by the winning side are equal to the amount of points lost by the losers. Overall, the mean Elo score of all teams will be equal to the starting value.

Mathematically, the system is characterised by 2 equations. The expected outcome $E$ is given by the following equation, where $dr = elo_{home}-elo_{away}$. A value of $E$ = 1 indicates a home win, 0.5 indicates a draw and 0 is an away win.

$$E = \frac{1}{1+10^\frac{-dr}{400}}$$

This derivation assumes a team’s performance is distributed according to a logistic distribution, so the difference in performance ($dr$) also shares the same distribution with $\mu=0$. A value of 400 for the scale has been previously identified to work well. Therefore, $dr \sim Logistic(0, 400)$, with the CDF giving the expected outcome as above.

The rating update is calculated as follows:

$$elo_{home}^{′}=elo_{home}+KG(O − E)$$

We’ve already seen $E$, and $O$ is the observed outcome where $O \in $ {0, 0.5, 1} for loss, draw, win respectively.

This leaves 2 parameters $K$ and $G$ that govern the system stability and incorporate margin of victory respectively. I’ll later discuss how to select values for these. There are two other considerations of a football prediction model: incorporating home advantage and how to handling promotion and relegation.

I’ve found several implementations of Elo to club sports and make reference to these throughout the rest of the document. I’d highly recommend reading them if you’re interested, the FiveThirtyEight articles are particularly well written.

World Football Ratings (Soccer national teams)
Club Elo (Soccer club teams)
FiveThirtyEight’s NBA
FiveThirtyEight’s NFL

Considerations

$K$

Let’s start with an easy one. $K$ is a scaling parameter that dictates how much impact recent results should have on Elo. For sports with a small number of games that correlate strongly with a team’s ability, such as the football World Cup, use a large value. For assessing the strength of a team over a long season, where each individual match has less bearing on a team’s ability then a smaller value is more appropriate. I’m going to use $K = 20$ as used by all the four implementations referencd above.

Home advantage

Home advantage is typically incorporated into an Elo scoring method by manually adding a constant value to the home team’s Elo, for example: $$dr = elo_{home} − elo_{away} + HA$$ where $HA$ is a constant. This value can be determined by looking at historical data to identify a value of $E$ assuming both teams are of equal skill, and solving for $HA$.

$$dr = -400\log_{10}(\frac{1}{E} - 1) = HA$$ when $elo_{home} = elo_{away}$.

Different rating systems use different values:

World Football and NBA use 100
FiveThirtyEight’s NFL uses 65
Club Elo uses a dynamic system that is updated each week to account for historical trends in home advantage. I don’t think it’s necessary to update it this frequently, but it is definitely worth recalculating it every season and keeping a separate value for each league.

I’m going to use the football results database I’ve built up to identify appropriate values of HA, exploring whether there’s significant inter-league variation and any temporal trends.

The plot below displays home advantage calculated every 4 weeks across the main 4 European leagues with an overlaid smooth. It highlights that while the magnitude of home advantage does fluctuate over time, it does not appear to change quickly enough to necessitate weekly updates as Club Elo do. It’s also interesting to view how home advantage differs across leagues. For example, from 2007 to 2013 the Bundesliga had substantially lower home advantage than the other leagues, with all leagues experiencing similar levels of home advantage from this point onwards.

Home advantage over time per league

The plot highlights that there are definitely inter-league differences in home advantage, but updating the home advantage constant more frequently than season. FiveThirtyEight investigated using dynamic home advantage but didn’t notice any significant improvement over a static constant. As a result, I’ll stratify my home advantage by league and update it each season.

Margin of victory

$G$ is a multiplier that accounts for the Margin of Victory (MOV), i.e. assigning more Elo points for a 3-0 win than 1-0. Two considerations of $G$ are that it doesn’t need to be linear, as the difference between a 5-0 win and a 6-0 win is not the same as the difference between a 1-0 and 2-0. Also note that while the MOV in a football (soccer) game is a continuous variable, in practice it can be considered discrete taking on values $[0, X]$, where $X$ is a large number that will almost never be met in normal circumstances, say, 15.

The World Football Elo system uses the following formula:

$G = 1 \ if \ MOV \in {0, 1}$

$G = 1.5 \ if \ MOV = 2$

$G = \frac{11 + MOV}{8} \ \ otherwise$

World football's G function

This allows for MOV to play a bigger role in small victories, such as by 1 or 2 goal leads, with the effect decreasing at larger MOVs. I may even be tempted to use a more rapid tailing off for higher victories, with no additional benefit gained from beyond say 7 goals, since in my opinion it is not the skill difference between two teams that determines whether the MOV is 5 or 7, but other factors, such as red cards.

However, FiveThirtyEight identified the problem of auto-correlation, i.e. that strong teams are more likely to win by a large margin and so inflate their rating. To combat this, a penalty term is added to $G$ that is a function of the Elo difference between the two teams. They use two different functional forms for $G$ for NBA and NFL respectively:

$G_{nba} = \frac{(MOV + 3)^{0.8}}{7.5 + 0.006dr}$

$G_{nfl} = ln(abs(MOV)+1) \frac{2.2}{2.2 + 0.001dr}$

I’ll firstly plot $G_{nba}$ for a range of $dr$. I’m not familiar with basketball at all, but according to this handy website, a MOV of 68 is the highest recorded. The plot shows a far more linear curve than I expected, indeed it is more linear than World Football’s formulation for G which applies the same multiplier for a 2-0 win as a 1-0. The effect of including $dr$ to counteract autocorrelation is also highlighted, so that highly skilled teams receive fewer Elo points for winning by a high margin than evenly matched teams.

G nba

The highest MOV in an NFL game is 58 (from here). $G_{nfl}$ has a far steeper shape than the NBA owing to its use of the log, perhaps reflecting that like soccer, MOV is less uniformly distributed.

G nfl

I’m going to take inspiration from these approaches to develop my own formula for $G$. My main requirement is to have a similar log shape to $G_{nfl}$, to better reward a win at small MOVs, while not over-rewarding blowouts, since I don’t believe a MOV of 7 goals means the winner is that much better than if they had won by 5. I also definitely want to reward a 2-0 win more than a 1-0 win, unlike the World Football implementation.

I will tune it so that as with the NBA and NFL, the value for $G$ at the highest recorded MOV is ~4 for two teams of equal skill. It’s hard to find a reliable historic record MOV for competitive club football that still has relevance for today’s game. I’ve found the maximum Premiership MOV to be 9-0 (Man Utd vs Ipswich), in La Liga it’s a MOV of 11 from a 12-1 win for Bilbao over Barcelona. However, there are from the 1930s when the game was played far differently. In my 11 season data set the highest MOV is 8-0 from Chelsea vs Aston Villa so I’ll aim to fix $G = 4$ at this point. I’ll also add in a penalty term similar to FiveThirtyEight to handle autocorrelation.

Also note that unlike American sports where there are no ties and thus $MOV \neq 0$, in football this is possible. This is why the World Football method specifies two functional forms of $G$, one for $MOV \leq 1$ and one for $MOV \gt 1$. I’ll take the same approach of splitting the function up to ensure $G = 1$ at $MOV \leq 1$ and then adding on a curve rather than the linear relationship in the original implementation.

After trial and error I was satisfied with the following expression.

$$\log_{2}(1.7MOV) \frac{2}{2 + 0.001dr}$$

The plot below highlights my method, which meets its requirements of having a log shape, better rewarding a MOV of 2 than 1, and having a value of $G \approx 4$ at the highest recorded MOV value ($G = 3.77$ for $MOV = 8$, which is close enough). I’ve used the same penalty as $G_{nfl}$ since I have a similar logarithmic curve.

My final Elo function

Handling promotion and relegation

The idea of promotion and relegation is one that isn’t typically present in American sports, where there are a set number of franchises in the league. The league gets expanded every now and then with new ones, but as far as I’m aware franchises never leave the league without a replacement. This constant expansion is different to football, where a league will have the same number of teams across seasons (with minor exceptions now and then) with the teams themselves changing in each off season. A model will need to be able to handle promotion and relegation, particularly since I won’t initially be tracking Elo for lower divisions. In this situation, the system needs to adapt to ensure that the average Elo of the Premiership is 1500, this can be achieved by assigning the Elo of the promoted teams to the average rating of the relegated teams. If I were to track ratings across the entire football league then 1500 would refer to the global average and so promoted and relegated teams would simply keep their current rating.

I’ll also employ the decay method of FiveThirtyEight, whereby at the end of a season each team’s rating is brought back towards the mean, although I’ll slightly reduce this effect (0.80 Elo retained vs 0.75) since I’d imagine teams in American sports are more likely to be pulled to the average thanks to the draft system and the salary cap:

$elo^{′} = 0.8elo + 0.2 \times 1500$

Final system

The final system then uses $K = 20$, the formulation of $G$ above, and will track each top-tier European league independently, so that relegated teams stop being monitored and promoted teams are assumed to not have a known rating. Once I’ve got everything running smoothly I’ll look into tracking as many complete league structures as possible, so that an Elo of 1500 refers to an average team across Europe, rather than just an average Serie A team for example. This will also facilitate easy predictions for the Champions League.

The plot below shows the change in rating of 4 Premiership teams over the last 2 seasons using both my Elo system and the one employed by World Football. The log shape of $G$ chosen for Predictaball results as expected results in an amplified signal, with losses and wins of more than 1 goal having a greater impact on the rating change. However, they otherwise are well correlated. The inter-season decay is also visible in the middle of the x-axis and appears to be well justified. For example, Leceister finished the 2015-2016 season at a peak rating (owing to having won the Premiership), but then immediately start off on a downwards trend the next season, continuing the regression to the mean that the inter-season decay forms. Liverpool seemed to have started off the 2016-2017 season at the same skill that they ended the previous season and so the weight decay is not strictly accurate here. However, in the absence of information suggesting whether the team will start the next season where they left off or not, the regression to the mean is appropriate.

Elo of 4 Premier League teams

Implementation in Predictaball

Now that I have a longitudinal rating system, rather than the previous static one, it can be incorporated into the overall predictive modelling system that I’ve described variously elsewhere. The idea is that the predictive model itself can be rather simple, since factors such as home advantage, margin of victory, and of course skill, are included in the Elo rating. I may still want to include other details, such as: an indication of recent form measured by the overall rating change over the last 5 games, if the manager has changed recently, or the availability of key players. However, for now I’ll just make a simple ordinal regression.

The second aspect of Predictaball is identifying value bets offered by the bookies. The previous method was profitable on the Premiership, but not on the other major European leagues. This season I’m going to improve on the betting system in 2 ways: firstly so that it outputs a fraction of the bankroll to predict rather than an absolute value, and secondly I’ll stratify it by league as there seems to be variation in how the odds are calculated.

Once I’ve finalised my match prediction and betting models I’ll write about them, and will also try to provide more progress updates throughout the season. My aim for this season is to make the system robust and reliable, so that I can expand to other leagues and sports with confidence.