<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generalised linear model for football match prediction</article-title>
      </title-group>
      <contrib-group>
        <aff>KU Leuven, Department of Computer Science, Celestijnenlaan, Leuven, Belgium</aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the method we used in the prediction challenge organised by the Sports Analytics Lab of the KU Leuven for the European football (soccer) championship. We built a generalised linear model to predict the score of a match. This score was modelled as the joint probability of a Poisson distribution, representing the total number of goals, and a binomial distribution, representing the goals of one team given that total number of goals. This model was trained on the matches of the past year using gradient descent to maximise the log-likelihood with L2 regularisation. Special care was taken to construct a model that is symmetrical and does not involve any home advantage, with the exception of the host team. The features considered were both team-based and player-based, using a randomised approach to select the players based on their past selections. A simulation of the tournament was then built on this match model to predict how far each team would go in the tournament.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Prediction of football (soccer) matches is becoming more and more popular. On
the occasion of the European football championship, the Sports Analytics Lab of
the KU Leuven organised a prediction challenge which consisted of two parts.
The first challenge was to predict the outcome of matches between the 24 teams
of the tournament. The prediction consisted in giving the probability for each
team to win, lose or draw against any other team. The second challenge was to
predict how far each team would go in the tournament. The prediction consisted
in giving the probability for each team to be eliminated in the group phase,
round of 16, quarter-final, semi-final or final, or to win the tournament.</p>
      <p>In order to simulate the tournament for this second part, a model that gives
the full score of a match, and not only the winner, was needed. Traditional
approaches for that task predict the goals scored by the two teams. We propose
to introduce an intermediate random variable representing the total number of
goals and then, given that number, predict how many were scored by each team.
As a team consists of a selection of players, features were built on characteristics
of players and not only on those of the team, using a selection process among
the players of a team.</p>
      <p>In the next section, the probabilistic model for predicting the match outcome
is explained, as well as how it was learned. The feature construction is then
explained in section 3. Finally, the performance of the model in the challenge is
reported in the evaluation section.</p>
    </sec>
    <sec id="sec-2">
      <title>Probabilistic model</title>
      <p>The goal of the first challenge was to predict match outcomes as the probability
for one team of winning, losing or drawing against any other team of the
tournament. To do so, we chose to model the probability distribution of the score
of a match.</p>
      <sec id="sec-2-1">
        <title>Model definition</title>
        <p>
          An intuitive idea would be to represent the number of goals of each team by
two Poisson distributions. However, as we do not want to consider these two
variables as independent, the combination of the two Poisson distributions is
not trivial. What is more, we want our model to be symmetrical: the match
team A vs team B should produce the same result as team B vs team A. A
possible solution is to use a bivariate Poisson distribution [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Instead, we chose
to model the total number of goals in a match as a Poisson distribution. Then,
given that number of goals, the number of goals of each team can be modelled
as a binomial distribution. Each team is described by a number of features that
will be explained in the next section. Thus, for a specific match, we have a
feature vector X consisting of the features of both teams. Following the idea of
Generalised Linear Models, the parameters of the distributions are formulated
as the composition of a linear regression and an activation function. If we call g<sub>i</sub>
the number of goals scored by team i, the probability distribution of the score
of a match between team A and team B is then defined as:
        </p>
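        <p>
          <disp-formula><tex-math><![CDATA[
\begin{aligned}
P(g_A, g_B \mid X) &= P(g \mid X)\, P(g_A \mid g, X) \\
                   &= \mathrm{Poisson}(g \mid \lambda(X)) \cdot \mathrm{Binomial}(g_A \mid n = g,\ p(X))
\end{aligned}
]]></tex-math></disp-formula>
with
          <disp-formula><tex-math><![CDATA[
\begin{aligned}
g &= g_A + g_B \\
\lambda(X) &= \exp(U^T X + u_0) \\
p(X) &= \frac{1}{1 + \exp(-(V^T X + v_0))}
\end{aligned}
]]></tex-math></disp-formula>
        </p>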
        <p>The symmetry of the model for the matches A vs B or B vs A can then
be guaranteed by ensuring that λ<sub>AvsB</sub> = λ<sub>BvsA</sub> and p<sub>AvsB</sub> = 1 − p<sub>BvsA</sub>. The
vectors U and V are the coefficients of the linear regression for parameters λ
and p, while u<sub>0</sub> and v<sub>0</sub> are the intercepts. The choice of the activation functions,
exponential and sigmoid, was guided by the desired value range of λ and p: λ must be positive and p must lie between 0 and 1.</p>
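        <p>The following minimal Python sketch evaluates the score distribution defined above. It is an illustration under stated assumptions, not the implementation used for the challenge: the function names (rates, score_probability) and the toy coefficients are hypothetical, and feature vectors are plain lists of numbers.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def poisson_pmf(g, lam):
    """Probability of observing g total goals under a Poisson with rate lam."""
    return math.exp(-lam) * lam ** g / math.factorial(g)


def binomial_pmf(k, n, p):
    """Probability that team A scores k of the n goals, each with probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def rates(x, U, u0, V, v0):
    """Return (lambda, p) for feature vector x: exponential and sigmoid activations."""
    lam = math.exp(sum(u * xi for u, xi in zip(U, x)) + u0)
    p = 1.0 / (1.0 + math.exp(-(sum(v * xi for v, xi in zip(V, x)) + v0)))
    return lam, p


def score_probability(g_a, g_b, x, U, u0, V, v0):
    """P(g_A, g_B | X) = Poisson(g | lambda(X)) * Binomial(g_A | n = g, p(X))."""
    lam, p = rates(x, U, u0, V, v0)
    g = g_a + g_b
    return poisson_pmf(g, lam) * binomial_pmf(g_a, g, p)


# Toy (hypothetical) coefficients: one feature per team, x = [feature_A, feature_B].
U, u0 = [0.3, 0.3], 0.9     # equal weights, so lambda is unchanged when teams swap
V, v0 = [0.5, -0.5], 0.0    # opposite weights and zero intercept, so p becomes 1 - p

x_ab = [1.0, -1.0]          # team A vs team B
x_ba = [-1.0, 1.0]          # team B vs team A
print(score_probability(2, 1, x_ab, U, u0, V, v0))
print(score_probability(1, 2, x_ba, U, u0, V, v0))
```
]]></preformat>
        <p>With these toy coefficients the two printed values coincide, which is the symmetry condition stated above: swapping the teams leaves λ unchanged and turns p into 1 − p.</p>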
        <p>To learn U, u<sub>0</sub>, V and v<sub>0</sub>, we built a training set based on the matches
of the past year, competitive and friendly, between the 24 teams participating
in the tournament. Unlike the tournament matches, those matches were not played on
neutral ground but at home for one of the two teams. To remove this bias, we
produced 2 symmetrical examples from each past match. For example,
the match France-Belgium on June 7th 2015 that Belgium won 4-3 produces 2
training examples (X, g, g<sub>A</sub>):
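          <list list-type="bullet">
            <list-item><p>X = ((features of France), (features of Belgium)), g = 7, g<sub>A</sub> = 3</p></list-item>
            <list-item><p>X = ((features of Belgium), (features of France)), g = 7, g<sub>A</sub> = 4</p></list-item>
          </list>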
Since this training set is symmetrical, so is the learned model. In particular, v<sub>0</sub>,
which represents the prior of the home advantage, always stayed close to 0.</p>
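        <p>The duplication step can be sketched as follows. This is a hedged illustration: the function names and the toy feature values are hypothetical, and it assumes that g<sub>A</sub> counts the goals of the team listed first in X.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def symmetric_examples(features_home, features_away, goals_home, goals_away):
    """Turn one played match into two mirrored training examples (X, g, g_A).

    Assumption: g_A is the number of goals of the team listed first in X.
    """
    total = goals_home + goals_away
    return [
        (features_home + features_away, total, goals_home),
        (features_away + features_home, total, goals_away),
    ]


def home_first_examples(features_home, features_away, goals_home, goals_away):
    """Keep only the example with the home team first (host-country model)."""
    return symmetric_examples(features_home, features_away, goals_home, goals_away)[:1]


# Hypothetical match: the home team wins 2-1; each team has two toy features.
home, away = [1.0, 0.2], [0.4, 0.9]
for example in symmetric_examples(home, away, 2, 1):
    print(example)
```
]]></preformat>
        <p>Because every match appears once in each orientation, the data carry no information that distinguishes the first-listed team from the second, which is why the fitted intercept of p stays near 0.</p>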
        <p>The training set consisted of M = 62 matches. We used gradient descent
to maximise the log-likelihood with L2 regularisation to avoid overfitting. The
function to maximise is the following:
          <disp-formula><tex-math><![CDATA[
\sum_{k=1}^{2M} \log P(g_{A,k}, g_{B,k} \mid X_k) \;-\; \alpha \left( \lVert U \rVert_2^2 + \lVert V \rVert_2^2 \right)
]]></tex-math></disp-formula>
with α the regularisation coefficient.</p>
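        <p>A minimal sketch of this optimisation is given below, assuming plain batch gradient steps on the penalised log-likelihood. The step size, the number of iterations, the function names and the toy data are assumptions made for illustration; the paper does not specify them. The gradients follow from the log link for λ and the logit link for p: (g − λ)X for U and (g<sub>A</sub> − g·p)X for V, minus the penalty term.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(data, U, u0, V, v0, alpha):
    """Penalised log-likelihood: sum of log P(g_A, g_B | X) minus the L2 penalty."""
    total = 0.0
    for x, g, g_a in data:
        lam = math.exp(dot(U, x) + u0)
        p = 1.0 / (1.0 + math.exp(-(dot(V, x) + v0)))
        log_poisson = g * math.log(lam) - lam - math.lgamma(g + 1)
        log_binomial = (math.log(math.comb(g, g_a))
                        + g_a * math.log(p) + (g - g_a) * math.log(1.0 - p))
        total += log_poisson + log_binomial
    return total - alpha * (dot(U, U) + dot(V, V))


def gradient_step(data, U, u0, V, v0, alpha, lr):
    """One step in the direction of the gradient of the objective."""
    grad_U = [-2.0 * alpha * u for u in U]   # penalty applies to U and V only
    grad_V = [-2.0 * alpha * v for v in V]
    grad_u0 = 0.0
    grad_v0 = 0.0
    for x, g, g_a in data:
        lam = math.exp(dot(U, x) + u0)
        p = 1.0 / (1.0 + math.exp(-(dot(V, x) + v0)))
        err_lam = g - lam          # derivative of the Poisson term w.r.t. log(lambda)
        err_p = g_a - g * p        # derivative of the binomial term w.r.t. logit(p)
        for j, xj in enumerate(x):
            grad_U[j] += err_lam * xj
            grad_V[j] += err_p * xj
        grad_u0 += err_lam
        grad_v0 += err_p
    new_U = [u + lr * d for u, d in zip(U, grad_U)]
    new_V = [v + lr * d for v, d in zip(V, grad_V)]
    return new_U, u0 + lr * grad_u0, new_V, v0 + lr * grad_v0


# Toy symmetric data set: one hypothetical match (2-1) in both orientations.
data = [([1.0, -1.0], 3, 2), ([-1.0, 1.0], 3, 1)]
params = ([0.0, 0.0], 0.0, [0.0, 0.0], 0.0)
for _ in range(200):
    params = gradient_step(data, *params, alpha=0.1, lr=0.01)
print(params, objective(data, *params, alpha=0.1))
```
]]></preformat>
        <p>On this symmetric toy set the intercept of p remains at 0 up to rounding, in line with the observation that v<sub>0</sub> stays close to 0 when every match is duplicated.</p>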
        <p>The exception of France. As the tournament was held in France, the French
team was an exception to this symmetry requirement. To take into
account the home advantage for France, a second model was trained, identical
to the first one in its structure. However, its training set was built only from
examples putting the home country as the first country. In the previously given
example, this implies that only the first training example was used.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Simulations</title>
        <p>The learned model can then be used to run simulations of matches. Given two
teams, we first build our feature vector and compute the parameters λ and p of the
distributions. We can then sample the total number of goals and, given
that, the goals of each team.</p>
        <p>For the first challenge, we needed to predict for each match the probability of
either one team winning, the other team winning, or having a draw between the
2 teams. To get these probabilities, we simply sampled ten thousand matches
and counted the outcomes.</p>
        <p>For the second challenge however, we needed to predict the probability for
each team to reach each phase of the tournament. To evaluate these, we built a
simulation of the whole tournament. As a match in the final phase cannot end
in a draw, we adopted the following method for that phase:
          <list list-type="order">
            <list-item><p>Sample the score for regular time as before.</p></list-item>
            <list-item><p>If it is a draw, sample the total number of goals again and divide it by 3, as
extra time is three times shorter than regular time. Then, given that
new number, sample the score as before using the binomial distribution.</p></list-item>
            <list-item><p>If it is still a draw, randomly select a team as winner, as penalty shoot-outs
are random enough.</p></list-item>
          </list>
        </p>
        <p>Using this method, we ran ten thousand simulations and counted the results.</p>
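        <p>The sampling procedure for a single match can be sketched as below, taking λ and p as already computed from the features. Several details are assumptions made for the sketch: the function names, the use of Knuth's method to sample the Poisson distribution, the fixed seed, and the rounding of the extra-time goal count (integer division by 3), which the text does not specify.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates such as goal counts."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_binomial(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n))


def sample_score(lam, p, rng):
    """Sample the total number of goals, then split it between teams A and B."""
    g = sample_poisson(lam, rng)
    g_a = sample_binomial(g, p, rng)
    return g_a, g - g_a


def outcome_probabilities(lam, p, n_sims=10_000, seed=0):
    """Monte Carlo estimate of win / draw / loss probabilities for team A."""
    rng = random.Random(seed)
    counts = {"win_a": 0, "draw": 0, "win_b": 0}
    for _ in range(n_sims):
        g_a, g_b = sample_score(lam, p, rng)
        key = "win_a" if g_a > g_b else "win_b" if g_a < g_b else "draw"
        counts[key] += 1
    return {key: count / n_sims for key, count in counts.items()}


def knockout_winner(lam, p, rng):
    """Regular time, then extra time, then a coin flip for the penalty shoot-out."""
    g_a, g_b = sample_score(lam, p, rng)
    if g_a == g_b:
        extra = sample_poisson(lam, rng) // 3    # assumed rounding: integer division
        extra_a = sample_binomial(extra, p, rng)
        g_a, g_b = g_a + extra_a, g_b + extra - extra_a
    if g_a == g_b:
        return rng.choice(["A", "B"])
    return "A" if g_a > g_b else "B"


print(outcome_probabilities(lam=2.5, p=0.6))
print(knockout_winner(2.5, 0.6, random.Random(1)))
```
]]></preformat>
        <p>A full tournament simulation would call knockout_winner along the bracket and count, over many runs, the phase reached by each team.</p>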
        <p>These simulations approximate the probabilities, but the lack of time
prevented the implementation of exact computations. Exact computation is however not trivial
because of the way the player-dependent features are built, which will be explained
in the next section.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Feature construction</title>
      <p>We will now explain how we built the feature vector used in the model, designated
by X in the previous section. The features can be divided into two categories:
team-based and player-based. While team-based features are characteristics of a team,
player-based features are aggregates of characteristics of the players of a team. Here
is a detailed list of the features.</p>
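      <p>Team-based features:
        <list list-type="bullet">
          <list-item><p>fifarank = FIFA rank in June 2016 [<xref ref-type="bibr" rid="ref2">2</xref>]</p></list-item>
          <list-item><p>fifatrend = (FIFA rank in June 2016) - (FIFA rank in January 2016)</p></list-item>
          <list-item><p>uefarank = UEFA rank in February 2016<sup>1</sup></p></list-item>
          <list-item><p>elorank = ELO rank in February 2016<sup>1</sup></p></list-item>
        </list>
      </p>
      <p>Player-based features:
        <list list-type="bullet">
          <list-item><p>barometer = 101 - (position<sup>2</sup> in UEFA barometer [<xref ref-type="bibr" rid="ref3">3</xref>] on the 10th of June 2016)</p></list-item>
          <list-item><p>goals = number of goals scored for the national team over the whole career<sup>1</sup></p></list-item>
          <list-item><p>value = Transfermarkt value<sup>1</sup></p></list-item>
        </list>
      </p>
      <p>Due to the lack of time, only the average was used as an aggregate on the
player-based features.</p>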
      <p>In football as in other team sports, the players of one team vary over different
matches. This is something we wanted to include in the model, for instance to
take into account injured players that would not participate in the tournament.
Also, some players play more often than others. For past matches, we
know which players played, and the average was weighted by their time on the
pitch. For simulated matches, we drew 11 players for each team from the 23
selected for the tournament. The drawing was done with a roulette strategy
with weights based on the past selections of players. To avoid selecting too many
players in the same position (like 2 goalkeepers), the 11 players were sampled
as 1 goalkeeper, 4 defenders, 4 midfielders and 2 attackers. This composition
could be improved by making it more team-specific.</p>
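      <p>The line-up draw can be sketched as a roulette-wheel selection without replacement inside each position group. The data layout (dictionaries with the keys name, position, past_selections and value), the function names, and the choice of using the raw number of past selections as the weight are assumptions; the text only states that the weights were based on past selections.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random

# Line-up shape stated in the text: 1 goalkeeper, 4 defenders, 4 midfielders, 2 attackers.
FORMATION = {"goalkeeper": 1, "defender": 4, "midfielder": 4, "attacker": 2}


def roulette_pick(players, weights, k, rng):
    """Draw k distinct players; each spin is proportional to the remaining weights.

    Assumes k does not exceed the number of players in the group.
    """
    pool = list(zip(players, weights))
    chosen = []
    for _ in range(k):
        spin = rng.random() * sum(weight for _, weight in pool)
        cumulative = 0.0
        for index, (_, weight) in enumerate(pool):
            cumulative += weight
            if spin < cumulative:
                break
        # Remove the selected player so that nobody is drawn twice.
        chosen.append(pool.pop(index)[0])
    return chosen


def draw_lineup(squad, rng):
    """Sample 11 players from the squad, position group by position group."""
    lineup = []
    for position, count in FORMATION.items():
        group = [player for player in squad if player["position"] == position]
        weights = [player["past_selections"] for player in group]  # assumed weighting
        lineup.extend(roulette_pick(group, weights, count, rng))
    return lineup


def team_average(lineup, feature):
    """Unweighted average of a player-based feature over the drawn line-up."""
    return sum(player[feature] for player in lineup) / len(lineup)


# Hypothetical 23-player squad with made-up selection counts and values.
positions = ["goalkeeper"] * 3 + ["defender"] * 8 + ["midfielder"] * 8 + ["attacker"] * 4
squad = [
    {"name": f"player{i}", "position": pos, "past_selections": 1 + i % 5, "value": float(i)}
    for i, pos in enumerate(positions)
]
lineup = draw_lineup(squad, random.Random(0))
print([player["name"] for player in lineup], team_average(lineup, "value"))
```
]]></preformat>
      <p>Drawing a fresh line-up for every simulated match makes the player-based features random, which is one reason why the match probabilities were estimated by simulation.</p>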
      <p>A main drawback of the features is that these were fixed, either in training or
simulations, apart from the player selection strategy. Thus, some features, like
the FIFA ranking, were partially built on the results of the training matches,
which probably overestimated their influence on the score. It would be easy to
adapt the features to correct this bias. For instance, instead of the fifarank in
June, use the fifarank before each match to build the training set. Again, this was not
done for lack of time. For the tournament prediction, these features could
also be updated during the simulation. While this would be easy for team-based
features, player-based ones would be more tricky: the exact algorithm of the
UEFA barometer is not public, and the model does not predict which player
scored the goals of a team. Also, as the tournament takes place in a relatively
short period of time, we consider updating these features during that period less
crucial than in the training examples.</p>
      <p><sup>1</sup> As provided by the challenge organisers.</p>
      <p><sup>2</sup> The position is considered equal to 101 if the player does not appear on the
barometer.</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        The model was evaluated in the context of the prediction challenge. For each
challenge, the participants were scored using the multi-class logarithmic loss [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
At the time the model was submitted, we used a random approach to select the
home team when building the training set instead of duplicating the example.
This however always resulted in an asymmetric dataset leading to an advantage
for the home or away team. As a result, the submitted prediction is biased. The
results we present here include the scores of the correct model as presented in
this paper.
      </p>
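      <p>For reference, the multi-class logarithmic loss is the negative mean log-probability assigned to the observed outcome. The sketch below is a generic implementation, not the organisers' scoring code; the clipping constant and the renormalisation of each prediction are common conventions assumed here, and the function name is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def multiclass_log_loss(predictions, outcomes, eps=1e-15):
    """Mean negative log-probability given to the outcome that actually occurred.

    predictions: list of probability lists, e.g. [P(win), P(draw), P(loss)]
    outcomes:    list of indices of the observed class for each prediction
    """
    total = 0.0
    for probs, outcome in zip(predictions, outcomes):
        # Clip to avoid log(0), then renormalise so the probabilities sum to 1.
        clipped = [min(max(q, eps), 1.0 - eps) for q in probs]
        norm = sum(clipped)
        total += math.log(clipped[outcome] / norm)
    return -total / len(outcomes)


# Hypothetical predictions for two matches (win, draw, loss) and observed outcomes.
predictions = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
outcomes = [0, 1]
print(multiclass_log_loss(predictions, outcomes))
```
]]></preformat>
      <p>A uniform prediction over the three outcomes scores log 3, which gives a simple baseline against which the reported losses can be read.</p>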
      <p>Figure 1 shows the results of the challenge for all participants (black lines) and
for the proposed model for different values of the regularisation parameter alpha.
The performance of 3 models is shown: one using only the team attributes,
a second one using only the player attributes and a final one using all of them. As
just explained, the submitted prediction (in purple) is biased. For that
submission, alpha was set to 10. We can see that regularisation improves the
result by reducing the overfitting that probably occurs due to the small size of
the training set. It should also reduce the bias introduced by not updating the
features in the training set. The combination of both sets of attributes is the best
for challenge two, while using only the player attributes seems to perform better
for the first challenge.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We presented a new model for predicting football matches. The new idea is
to first predict the total number of goals and then, given that number, how many
goals were scored by each team. The model is built on both player-based and
team-based features. This model performed decently in the prediction challenge
organised by the Sports Analytics Lab of the KU Leuven.</p>
      <p>Several improvements could be brought to the model. While preserving the
structure of first predicting the total number of goals and then how many belong to each
team, other distributions might fit better than Poisson and binomial. A
Poisson distribution represents the number of occurrences of independent events, and
goals can hardly be called independent. Although detailed experiments were not
run, the variance of the binomial distribution seemed slightly too high.</p>
      <p>Additional features, like assists or passes, would also help to improve the
model. Applying other aggregates than the average for player-based features
could also be interesting. The multiplicity of features would however require
some feature selection process.</p>
      <p>Fig. 1: Results of prediction challenges; panel (a) Challenge 1: match outcome. The lower the logarithmic loss, the better
the prediction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>McCullagh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Nelder</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          (
          <year>1989</year>
          ).
          <article-title>Generalized linear models</article-title>
          (Vol.
          <volume>37</volume>
          ). CRC press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. FIFA ranking. http://www.fifa.com/fifa-world-ranking/index.html.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. UEFA Barometer (2016). http://www.uefa.com/uefaeuro/season=2016/players/faq/index.html.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Multi-class logarithmic loss. https://www.kaggle.com/wiki/MultiClassLogLoss.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Karlis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ntzoufras</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Analysis of sports data by using bivariate Poisson models</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series D (The Statistician)</source>
          ,
          <volume>52</volume>
          (
          <issue>3</issue>
          ),
          <fpage>381</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>