What are Expected Goals?
Ever watched a match and gone, “Oh, he should have scored from there!”? Expected Goals (xG) is a stat that tells you how likely a player is to score from a given chance (you can also call it chance quality).
A player can get many chances in a game or series of games. When you add up the xG for each chance that a player received, you get the total number of goals he should have scored.
The beauty of xG is how contextual it is: a player’s xG is the result of the specific chances that came that player’s way. You can use this in many ways, such as comparing a player’s xG to his goals scored to see how well he’s performing, or using xG to identify trends in a team’s play or form.
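As a rough illustration of the two ideas above, the sketch below adds up per-chance xG into a total and compares it with the goals actually scored. The chance values and the function name are made up for the example; they are not data from the model.

```python
def xg_summary(chances):
    """Summarise a list of (xg, scored) pairs, one pair per chance."""
    total_xg = sum(xg for xg, _ in chances)
    goals = sum(1 for _, scored in chances if scored)
    # Positive difference: more goals than the chances were worth.
    return total_xg, goals, goals - total_xg


# Hypothetical chances for one player: (xG of the chance, did he score?)
chances = [(0.10, False), (0.40, True), (0.05, False), (0.25, False)]

total_xg, goals, diff = xg_summary(chances)
print(f"xG {total_xg:.2f}, goals {goals}, difference {diff:+.2f}")
```

A positive difference suggests the player is scoring more than his chances were worth, a negative one that he is scoring less. Over a handful of shots either can simply be luck.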
Why do we need Expected Goals?
Football is a sport with a high degree of variance. Unlike basketball, where the average team takes around 85 shots a game and converts close to half of them, in football you’d be lucky to get 10 chances and a goal from each team in a game. With so few goals, the scoreline alone says little about how well a team actually played, which is the gap xG tries to fill by valuing every chance. In Indian football, this is a bigger problem because we don’t play enough games, let alone score enough goals.
How does one calculate Expected Goals?
One of the first things one needs to understand is that not all shots are equal. A shot from the halfway line is obviously less likely to result in a goal than a shot from inside the six-yard box.
To calculate xG, a set of features describing each shot is captured and passed through a logistic regression model, after excluding penalties. In the new model, every penalty is assigned a default of 0.75 xG. Below is the equation for all other types of chances. Because the model is a logistic regression, the weighted sum gives the log-odds of scoring, z, and the xG of the chance is 1 / (1 + e^(-z)).
Log-odds of scoring (z) = – 0.349971
+ 0.006271 * Minute
+ 0.096542 * Game_state
+ 1.341620 * if the shot was in “prime”
+ 0.340696 * if the shot was in “inside box – right”
– 0.834023 * if the shot was in “outside box – left wing”
– 1.186690 * if the shot was in “outside box – right wing”
+ 0.049426 * if the shot was in “inside box – central”
– 1.366107 * if the shot was in “outside box – central”
+ 0.208562 * Headed Chance
+ 0.458224 * Through Ball
– 0.013551 * Cross
+ 0.424485 * Counter-Attack
– 1.463867 * Open Play Chance
– 2.054052 * Defensive Pressure
– 1.364752 * Free-kick Chance
– 1.653561 * Corner Chance
– 0.242841 * if the assist was in “inside box – left”
+ 1.002087 * if the assist was in “prime”
+ 0.104116 * if the assist was in “inside box – right”
– 1.180954 * if the assist was in “outside box – left wing”
– 0.683533 * if the assist was in “outside box – right wing”
– 0.768507 * if the assist was in “inside box – central”
– 0.138973 * if the assist was in “outside box – central”
– 0.203478 * if the assist was in “Defensive Half”
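To make the equation concrete, here is a minimal sketch of how it could be evaluated for one chance. It rests on several assumptions: the yes/no features are 0/1 flags, a zone that is not listed in the equation contributes nothing, Game_state is already a number (the article does not say how it is encoded), and the weighted sum is passed through the logistic function, as a logistic regression requires. The function name, parameter names and the plain hyphens in the zone labels are illustrative choices.

```python
import math

INTERCEPT = -0.349971
MINUTE_COEF = 0.006271
GAME_STATE_COEF = 0.096542
PENALTY_XG = 0.75  # flat value the article assigns to every penalty

SHOT_ZONE = {
    "prime": 1.341620,
    "inside box - right": 0.340696,
    "outside box - left wing": -0.834023,
    "outside box - right wing": -1.186690,
    "inside box - central": 0.049426,
    "outside box - central": -1.366107,
}
ASSIST_ZONE = {
    "inside box - left": -0.242841,
    "prime": 1.002087,
    "inside box - right": 0.104116,
    "outside box - left wing": -1.180954,
    "outside box - right wing": -0.683533,
    "inside box - central": -0.768507,
    "outside box - central": -0.138973,
    "defensive half": -0.203478,
}
FLAGS = {  # yes/no features, assumed to be coded 1 when present
    "headed": 0.208562,
    "through_ball": 0.458224,
    "cross": -0.013551,
    "counter_attack": 0.424485,
    "open_play": -1.463867,
    "defensive_pressure": -2.054052,
    "free_kick": -1.364752,
    "corner": -1.653561,
}


def expected_goals(minute, game_state, shot_zone=None, assist_zone=None,
                   is_penalty=False, **flags):
    """Return the xG of one chance under the assumptions stated above."""
    if is_penalty:
        return PENALTY_XG
    z = INTERCEPT + MINUTE_COEF * minute + GAME_STATE_COEF * game_state
    # Zones missing from the tables are treated as contributing 0 (assumption).
    z += SHOT_ZONE.get(shot_zone, 0.0)
    z += ASSIST_ZONE.get(assist_zone, 0.0)
    for name, present in flags.items():
        if name not in FLAGS:
            raise ValueError(f"unknown feature: {name}")
        if present:
            z += FLAGS[name]
    # Logistic function turns the log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical chance: open-play shot from the prime zone in the 30th minute.
print(round(expected_goals(30, 0, shot_zone="prime", open_play=True), 3))
```

Keeping the coefficients in lookup tables rather than in one long expression makes it easier to check them against the equation line by line.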
OK, but does it make footballing sense?
To demonstrate the underlying football logic and that the model isn’t throwing up random numbers, we generated some graphs to display its accuracy.
It’s widely accepted that goals in football follow a Poisson distribution. So, it’s not unreasonable to expect Expected Goals to follow the same pattern.
As shown in the graph below, the output of the model closely resembles a Poisson distribution.
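For reference, a Poisson distribution with mean λ gives probability λ^k e^(-λ) / k! to seeing exactly k goals. The sketch below tabulates those probabilities so they can be set beside an observed histogram. The mean of 1.3 is a placeholder chosen for the example, not a figure from this dataset.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


mean_goals = 1.3  # placeholder mean, not taken from the article's data
for k in range(6):
    print(k, round(poisson_pmf(k, mean_goals), 3))
```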
One of the fundamental rules of Expected Goals is that the closer you get to the goal, the easier it should be to convert the chance. Once again, the model seems to agree with the principle.
And finally, the state of the game should also affect the ability to convert a particular chance. In the graph below, it’s clear that the more comfortable a team is, the more likely they are to score.
Accuracy and Testing
Garry Gelade published this piece, in which he compares well-known models that are publicly quoted on the internet. He posts some accuracy and error measures, which are reproduced here with this model added to the mix. While this model isn’t even close to being the best, it’s definitely in the same ballpark.
A standard way to measure how well a logistic regression model classifies is to plot an ROC curve and then measure the Area Under the Curve (AUC). The closer the curve gets to the top-left corner and the further it sits from the diagonal line, the better the model separates goals from non-goals.
Below is the ROC curve. The diagonal line represents random guessing while the curve represents the model. As you can see, the model functions pretty well as a classifier.
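The area under an ROC curve has a handy interpretation: it equals the probability that a randomly chosen goal was given a higher xG than a randomly chosen non-goal, with ties counted as half. The sketch below computes it that way, by comparing every goal with every non-goal. The labels and scores are invented for the example, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of goals (1) against non-goals (0)."""
    goals = [s for y, s in zip(labels, scores) if y == 1]
    misses = [s for y, s in zip(labels, scores) if y == 0]
    if not goals or not misses:
        raise ValueError("need at least one goal and one non-goal")
    wins = 0.0
    for g in goals:
        for m in misses:
            if g > m:
                wins += 1.0   # the goal was ranked above the non-goal
            elif g == m:
                wins += 0.5   # a tie counts as half
    return wins / (len(goals) * len(misses))


# Hypothetical shots: 1 = goal, 0 = no goal, with the model's xG for each.
labels = [1, 0, 1, 0, 0]
scores = [0.6, 0.2, 0.3, 0.4, 0.1]
print(round(roc_auc(labels, scores), 3))
```

A value of 0.5 corresponds to the diagonal line (random guessing) and 1.0 to a perfect ranking. The pairwise loop is fine for a few thousand shots; a sort-based version would be preferable for much larger samples.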
The obvious weakness in the model is that zones are used as a categorical input rather than a quantitative one. Another key weakness is the small sample size. Each Indian league gives us 90 games and just over 2,000 shots per season, which is nowhere near enough.