<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph-based Approaches for Analyzing Team Interaction on the Example of Soccer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Brandt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulf Brefeld</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIPF</institution>
          ,
          <addr-line>Frankfurt am Main</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universitat Darmstadt</institution>
          ,
          <addr-line>Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a graph-based approach to analyzing player interaction in team sports. A simple pass-based representation is presented that is subsequently used together with the PageRank algorithm to identify the importance of the players. Aggregating player scores to team values allows for turning our approach into a predictor of the winning team. We report on empirical results on ve German Bundesliga seasons.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The remainder is structured as follows. Section 2 introduces graph-based
representations for soccer and metrics that are used in the remainder. Section 3
reports on our empirical results and Section 4 concludes.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Viewing Soccer as a Graph</title>
      <sec id="sec-2-1">
        <title>Representation</title>
        <p>We study three representations of passes that di er in the direction of edges.
Assume that the event on the pitch is a pass from player A to player B. The rst
representation is perhaps the most intuitive one and implements the direction of
the pass. Thus, if player A passes the ball to player B, the event is represented
by a directed edge pointing from A to B. The second representation focuses on
the receiving player and the direction of the edge is inverted, so that it points
now from B to A. Finally, the third representation measures that there has
been an interaction between A and B which is expressed by two directed edges,
one pointing from A to B and the other from B to A. We refer to the three
representations as pass, receive, and interaction, respectively. Figure 1 shows a
visualization.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Player Metrics</title>
        <p>We now introduce simple metrics to compute scores for players. To this end, we
compute pass chains, that is, we join all successive passes of a team in a single
graph. Figure 2 shows an exemplary pass chain that consists of ve passes, that
is, A ! B ! C ! D ! B ! E. For simplicity, we denote the number of passes
of the chain as the chain length; the example in Figure 2 possesses a chain length
of ve. In the remainder, let C denote the set of all pass chains of a team and
C(p) C the subset of chains that involve player p.</p>
        <p>Chain Scores: The chain length for a player p is given by the average chain
score of all chains c 2 C(p) he is involved in, that is,
cs(p) =</p>
        <p>X length(c):
1
jC(p)j c2C(p)
(1)
Note that the chain length of a player is oblivious of the actual number of times
he receives/completes a pass within a chain. Also note that the representation of
the passes is negligible since only the number of edges are counted, irrespectively
of their direction.</p>
        <p>
          PageRank: The PageRank algorithm has originally been devised to analyze
the link structure of the Web. Scores for web pages are computed by simulating
a random surfer who navigates through the directed graph. Let p be a node
(player) and Fp be the set of nodes that p points to and Bp be the set of nodes
that point to p. A simpli ed variant of the PageRank [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] computes scores R(p)
for p according to
        </p>
        <p>R(p) = c X R(q)</p>
        <p>:
q2Bp</p>
        <p>Nq
(2)
The quantity R(p) of a node corresponds to the sum of the ranks R(q) of all
nodes pointing to p, weighted by the amount of total links of q (Nq = jFpj). The
factor c is used for normalization. The computation of Equation (2) is repeated
until convergence.</p>
        <p>
          Note that some implementations of the PageRank algorithm also include
dampening factors in accordance with the metaphor of the behavior of a random
surfer [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. However, modeling random passes is certainly not in the scope of this
paper and the thus left out.
2.3
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Team Metrics</title>
        <p>The previous metrics compute scores for every player. To obtain a team score,
the individual player scores need to be aggregated accordingly. For simplicity, we
deploy the average of all players of a team as well as the sum of the individual
scores.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Empirical Results</title>
      <p>Our empirical results are based on data from the German Bundesliga that has
kindly been provided by Opta.4 We use all matches from the seasons 2009/2010
to 2013/2014. Every match is given in form of a temporally annotated sequence
of events. We focus on completed passes in the following analysis.
4 http://optasports.com
0.40
In total, there are 1530 matches containing 255,231 pass chains ranging from
a length from 2 to 65. Figure 3 (top) displays the distribution of all passes.
From this distribution, the probability of whether the next pass is
successful/unsuccessful can be computed, respectively. Figure 3 (bottom) shows the
results for the latter. While the probability of an unsuccessful succeeding pass
is 40.79% after the initiating pass, the average probability is 29.17%.</p>
      <p>Figure 4 shows the top-ranked players according to chain lengths (left) and
PageRank (right). The latter scores are normalized values and computed using
the pass-based representation. The lists di er from each other. The chain length
favors players that take part in many chains. However, it is unclear if these
players play a major role in the game as they could just be passing the ball
to each other. PageRank, on the contrary, favors players that are important
for team interaction and playmaking. They perform many passes in the chains.
Analogous results for receive-based and interaction-based representations lead
to an clearer picture, with internationally well-known players as is shown in
Figure 5; the displayed rankings are very similar.
3.2</p>
      <sec id="sec-3-1">
        <title>Predicting the Winning Team</title>
        <p>We now turn our approach into a predictor of match outcomes. To this end, we
split the data into training (seasons 2009/10{2012/13) and test (season 2013/14)
sets. The outcomes are derived in terms of the following features: home team wins
(positive di erence), draw (no di erence), home team loses (negative di erence).</p>
        <sec id="sec-3-1-1">
          <title>S. Benyamina</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>P.-E. Hojbjerg</title>
        </sec>
        <sec id="sec-3-1-3">
          <title>J. Mart nez Aguinaga</title>
        </sec>
        <sec id="sec-3-1-4">
          <title>M. Spiranovic</title>
        </sec>
        <sec id="sec-3-1-5">
          <title>P. Jensen</title>
        </sec>
        <sec id="sec-3-1-6">
          <title>C. Buchtmann</title>
        </sec>
        <sec id="sec-3-1-7">
          <title>X. Shaqiri</title>
        </sec>
        <sec id="sec-3-1-8">
          <title>D. Sereinig</title>
        </sec>
        <sec id="sec-3-1-9">
          <title>M. Titsch-Rivero</title>
        </sec>
        <sec id="sec-3-1-10">
          <title>D. Thomalla</title>
        </sec>
        <sec id="sec-3-1-11">
          <title>Player</title>
        </sec>
        <sec id="sec-3-1-12">
          <title>P. Lahm</title>
        </sec>
        <sec id="sec-3-1-13">
          <title>B. Schweinsteiger</title>
        </sec>
        <sec id="sec-3-1-14">
          <title>H. Badstuber</title>
        </sec>
        <sec id="sec-3-1-15">
          <title>D. Bon m Costa Santos</title>
        </sec>
        <sec id="sec-3-1-16">
          <title>H. Westermann</title>
        </sec>
        <sec id="sec-3-1-17">
          <title>N. Subotic</title>
        </sec>
        <sec id="sec-3-1-18">
          <title>D. van Buyten</title>
        </sec>
        <sec id="sec-3-1-19">
          <title>T. Kroos</title>
        </sec>
        <sec id="sec-3-1-20">
          <title>M. Hummels</title>
          <p>J. Boateng</p>
          <p>To predict the outcome, classi ers are trained to predict one of these three
classes.</p>
          <p>Note that alternative straw men could be easily computed. For instance,
extracting the number of wins/draws/losses yields Table 1 for the training and
test data. The table not only con rms that there is a home advantage but the
numbers could be turned into a predictor that always picks the majority class,
in this case a win for the home team. The accuracy on the test data is 47.39%.
Another straight forward straw man is to choose the team that won more games
in the training data. Although, the ranking of the teams for training and test
set di ers slightly, an accuracy of 52.94% is obtained.</p>
          <p>
            We deploy C5.0 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] and SVMs with RBF kernels [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] as the underlying
classiers. User parameters are optimized by a four-fold cross validation. Since SVMs
only support binary classi cation, the one-against-one strategy is used where
models are created for each pair of classes and the prediction is made using a
majority vote of these pairs. Both methods are trained on only two features,
the team scores of the home and the away team, respectively. To compute team
scores, scores of all players of each team are used, irrespectively of whether they
perform in the match or not. The score of a team is therefore computed by
summing/averaging the individual PageRank scores of all players, respectively. As
an additional baseline, we include the number of completed passes by every team
as an additional baseline (also referred to as baseline). This metric is used to
evaluate how the proposed graphs metrics compare to simple descriptive
statistics of the same attributes. As features for the learning algorithms, the number
of the passes, receives and their sum are used, respectively, for the home and
away team, so that again two-dimensional vectors are obtained.
          </p>
          <p>The results of the classi ers for the baseline, chain lengths, and PageRank
are depicted in Table 6. Chain scores perform worst for all representations and
classi ers and stay clearly below the second straw man. The best results are
51.31% by the C5.0 classi er and 51.96% by the SVM, both using average of
player chain lengths. In contrast to the chain length, the baseline surprisingly
shows that simple descriptive statistics can e ectively be exploited by the
learning algorithms. The best results are 53.27% for C5.0 and 55.56% for the SVM.
Both surpass the two straw men but like the chain length predictions, the
prediction of the classi ers does not contain draw games, even if they are trained
to predict three classes.</p>
          <p>Similarly, the results for PageRank provide higher accuracies than using the
chain lengths. Accuracies of 54.25% for C5.0 and 55.88% for the SVM are
observed. Note that identical values for the baseline and PageRank in Table 6 are
caused by normalization that leads to identical representations for
interactionbased representations.</p>
          <p>Although the accuracy of C5.0 is generally lower than that of the SVM,
the resulting decision trees reveal an interesting fact. By focusing on only the
PageRank, received-based, sum-aggregated scores, of the away team, an accuracy
of 53.59% is obtained. This may be an indicator for the home advantage as a
certain PageRank of the away team is seemingly required to beat the home team.</p>
        </sec>
        <sec id="sec-3-1-21">
          <title>P. Lahm</title>
        </sec>
        <sec id="sec-3-1-22">
          <title>B. Schweinsteiger</title>
        </sec>
        <sec id="sec-3-1-23">
          <title>H. Badstuber</title>
        </sec>
        <sec id="sec-3-1-24">
          <title>F. Ribery</title>
        </sec>
        <sec id="sec-3-1-25">
          <title>T. Muller</title>
        </sec>
        <sec id="sec-3-1-26">
          <title>A. Ottl</title>
        </sec>
        <sec id="sec-3-1-27">
          <title>D. van Buyten</title>
        </sec>
        <sec id="sec-3-1-28">
          <title>A. Tymoshchuk</title>
        </sec>
        <sec id="sec-3-1-29">
          <title>D. Alaba</title>
        </sec>
        <sec id="sec-3-1-30">
          <title>A. Robben</title>
        </sec>
        <sec id="sec-3-1-31">
          <title>P. Lahm</title>
        </sec>
        <sec id="sec-3-1-32">
          <title>B. Schweinsteiger</title>
        </sec>
        <sec id="sec-3-1-33">
          <title>H. Badstuber</title>
        </sec>
        <sec id="sec-3-1-34">
          <title>D. van Buyten</title>
        </sec>
        <sec id="sec-3-1-35">
          <title>F. Ribery</title>
        </sec>
        <sec id="sec-3-1-36">
          <title>A. Tymoshchuk</title>
        </sec>
        <sec id="sec-3-1-37">
          <title>T. Muller</title>
        </sec>
        <sec id="sec-3-1-38">
          <title>D. Alaba</title>
        </sec>
        <sec id="sec-3-1-39">
          <title>A. Robben</title>
        </sec>
        <sec id="sec-3-1-40">
          <title>A. Ottl</title>
          <p>where the maximal and minimal values for the sum-aggregated, receive-based
PageRank are 766 and 44, respectively; the average value is 255. Recall that in
the receive-based representation the player that passes gets the credit from the
player the ball has been passed to. In sum, the result underlines that successful
passing is very important as no other single feature in our study achieves an
accuracy in this range. However, the PageRank of players can also be utilized
for other purposes than predicting outcomes such as establishing a ranking of
players as shown in Figure 7.
As mentioned earlier, the feature representation of baseline and PageRank can
be identical for the interaction-based representation when normalized while the
pass- and receive-based representations are always di erent. This means that
both metrics can be combined, despite the fact that they have been generated
from the same data. Using an augmented three-dimensional feature
representation assembled by the number of successful passes of the home team
(interactionbased representation), the summed PageRanks of the home team (again using
the interaction-based representation) and the summed PageRanks for the away
team (receive-based representation), the predictive accuracy could be improved
to 57.19%.
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We presented a graph-based approach to soccer by viewing passes as edges
between player nodes. Depending on the direction of the edges, we showed that
either the passing or the receiving player bene ts from the pass in the analysis.
Empirically, we compared several variants including the PageRank algorithm
with appropriate baselines on soccer data from ve Bundesliga seasons. Turning
the approach into a predictor of the winning team, we showed that the best
results are obtained with only three features and observed accuracies of more
than 57%.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The anatomy of a large-scale hypertextual web search engine</article-title>
          .
          <source>Comput. Netw. ISDN Syst</source>
          .
          <volume>30</volume>
          (
          <issue>1-7</issue>
          ),
          <volume>107</volume>
          {
          <fpage>117</fpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <issue>2</issue>
          .
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Trans. Intell. Syst. Technol</source>
          .
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <volume>27</volume>
          :1{
          <fpage>27</fpage>
          :
          <fpage>27</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lazova</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basnarkov</surname>
          </string-name>
          , L.:
          <article-title>PageRank approach to ranking national football teams</article-title>
          .
          <source>CoRR abs/1503</source>
          .01331 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winograd</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The pagerank citation ranking: Bringing order to the web</article-title>
          .
          <source>Tech. Rep. 1999-66</source>
          , Stanford InfoLab (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Pen~a,
          <string-name>
            <given-names>J.L.</given-names>
            ,
            <surname>Touchette</surname>
          </string-name>
          , H.:
          <article-title>A network theory analysis of football strategies</article-title>
          . In: C. Clanet (ed.),
          <source>Sports Physics: Proc. 2012 Euromech Physics of Sports Conference</source>
          . pp.
          <volume>517</volume>
          {
          <issue>528</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Data Mining Tools See5 and C5</source>
          .
          <volume>0</volume>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>