<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimating the Maximal Speed of Soccer Players on Scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laszlo Gyarmati</string-name>
          <email>lgyarmati@qf.org.qa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Hefeeda</string-name>
          <email>mhefeeda@qf.org.qa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Qatar Computing Research Institute</institution>
          ,
          <addr-line>HBKU</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Excellent physical performance of soccer players is essential for the success of a team. Despite this, a large-scale, quantitative analysis of the maximal speed of the players is missing due to the sensitive nature of trajectory datasets. We propose a novel method to derive the in-game speed profile of soccer players from event-based datasets, which are widely accessible. We show that eight games are enough to derive an accurate speed profile. We also reveal team-level discrepancies: to estimate the maximal speed of the players of some teams, 50% more games may be necessary. The speed characteristics of the players provide valuable insights for domains such as player scouting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Quantitative performance analysis in sports has become mainstream in the last
decade. The focus of the analyses is shifting towards more sport-specific metrics
due to novel technologies. These systems measure the movements of the players
and the events happening during trainings and games. This allows for a more
detailed evaluation of the professional athletes with implications on areas such
as opponent scouting, planning of training sessions, or player scouting.</p>
      <p>
        Previous works that analyze soccer-related logs focus on the game-related
performance of the players and teams. The vast majority of these methodologies
concentrate on descriptive statistics that capture some part of the strategy of
the players. For example, in the case of soccer, the average numbers of shots, goals,
fouls, and passes are derived both for the teams and the players [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Other works
identify and analyze the outcome of the strategies that teams apply [
        <xref ref-type="bibr" rid="ref10 ref4 ref6 ref8">10, 8, 6,
4</xref>
        ]. However, the physical performance of the players has not received detailed
attention from the research community.
      </p>
      <p>It is a challenging task to get access to metrics related to the physical
performance of soccer players. The teams consider such information highly confidential,
especially if it covers in-game performance. Despite the fact that numerous teams
have deployed player tracking systems in their stadiums, datasets of this nature are
not available to researchers or the public. It is nearly impossible to obtain
quantitative information on the physical performance of all the teams in a
competition. Hence, most analyses and evaluations of the players' performance
contain little information on the physical aspect of the game.</p>
      <p>We address this issue by proposing a methodology that is able to derive the
in-game speed profile of soccer players, i.e., how much time a player needs to
cover a certain distance in the best-case scenario. In other words, we determine
the relation between the range of a movement and the maximal speed of the player
over that range. In addition, we are able to do this at scale: our method is able to
analyze the physical performance of the players across multiple seasons and
competitions without any major investment. It is not required to have an expensive,
dedicated player tracking system deployed in the stadium. Instead, if the game is
broadcast, our methodology can be used. As a consequence, our technique does not require
the consent of the involved teams, yet it provides insights on the physical
performance of the players of both teams. Soccer data companies cover 50+
leagues, providing the potential to analyze the speed profile of tens of thousands
of players. The main contribution of our work is threefold:
1. we propose a methodology to extract the maximal speed characteristics of
the players,
2. we determine the minimal number of games necessary to determine the
physical capabilities of a player,
3. and we show that the playing style of a team has a significant impact on the
accuracy of the speed estimation.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>In this section we introduce our methodology used to extract the movements of
the players and then to estimate their maximal speed. Our final goal is to derive
a regression model between the distance of the movement and the minimal time
necessary for it. We use an event-based dataset throughout our analyses that we
describe next.</p>
      <p>
        Dataset. We use an event-based dataset generated by Opta [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] covering the
2012/13 season of La Liga (i.e., the first division soccer league of Spain). The
dataset contains all the major events of a soccer game including passes, shots,
dribbles, tackles, etc. For example, the dataset has more than 300,000 passes
and nearly 10,000 shots. The feature of the dataset we exploit is that it contains
the time and the location of these events along with the identity of the
involved players. Hence, it is possible to derive a coarse-grained time series of the
movements of the players. We note that the precision of the time annotation is
one second. The procedure uses all the (x, y) positions a player has during a game
and creates a movement vector from each consecutive pair of (x, y) coordinates and
timestamps. We illustrate the derived movements of
a player in Figure 1 given a single game. This is the first step of our methodology:
extracting the movement vectors of the players. The event-based dataset we use
is sparse in terms of the position of the players, i.e., the physical location of a
player is only recorded when the player was involved in some ball-related event.
This is a consequence of the data acquisition process: the games are annotated based
on the television broadcast, which focuses on the ball all the time.
As such, the elapsed time between two events of a player can be as low as a couple
of seconds but it can reach several minutes too. This introduces significant noise
to the data that we have to handle in the regression model.
      </p>
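      <p>The extraction step described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the Event and Movement record layouts, the player identifiers, and the sample values are assumptions made for the example, since the schema of the dataset is not given in the text.</p>
      <preformat>
```python
from collections import defaultdict, namedtuple

# Hypothetical record layouts; the real dataset schema is not given in the text.
Event = namedtuple("Event", "player t x y")  # t is a timestamp in seconds
Movement = namedtuple("Movement", "player t0 x0 y0 t1 x1 y1")


def movement_vectors(events):
    """Pair each player's consecutive events of one game into movement vectors."""
    by_player = defaultdict(list)
    for ev in events:
        by_player[ev.player].append(ev)

    movements = []
    for player, evs in by_player.items():
        evs.sort(key=lambda ev: ev.t)  # chronological order per player
        for a, b in zip(evs, evs[1:]):
            movements.append(Movement(player, a.t, a.x, a.y, b.t, b.x, b.y))
    return movements


# Made-up events of a single game: player p7 has three, player p9 only one.
events = [
    Event("p7", 10, 50.0, 50.0),
    Event("p9", 12, 60.0, 40.0),
    Event("p7", 14, 56.0, 58.0),
    Event("p7", 95, 20.0, 30.0),
]
vectors = movement_vectors(events)
```
      </preformat>
      <p>In this toy input, the second movement of p7 spans 81 seconds, which illustrates the kind of noisy vector that the later filtering step has to handle.</p>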
      <p>Fig. 1. Movement vectors of a player derived from an event-based dataset (axes: start_x and start_y). Not only
the locations of the end points are present in the data but the speed of the movement
too.</p>
      <p>Handling passes. It is straightforward to determine the timestamp and the
position of the players in case of single-player events (i.e., all the events except
passes). In terms of passes, we have a complete datapoint for the initiator of the
pass (i.e., timestamp and location); however, at the receiving end, the dataset
does not contain a timestamp. To overcome this issue, and to enrich
the extracted time series, we apply four methods to estimate the time when
a pass was received. The four options are:
0. Neglect. The event of receiving a pass is neglected, i.e., we do not use this
partial information.
1. Previous event. The timestamp of the previous event is used, i.e., the
initiation of the pass. This is a lower-bound estimation of the time of reception.
2. Next event. The timestamp of the next event is applied. This timestamp
is an upper bound on the reception of the pass.
3. Regression. Two passes may follow each other immediately in soccer,
i.e., when a player receives a pass, handles the ball, and passes the ball
forward with a single touch. We can select these consecutive passes from
the dataset, i.e., in this case the (x, y) coordinates and the identity of the
player are the same (the receiver of the first pass and the initiator of the
next one). Therefore, in case of the first passes we know the timestamp of
both the initiation and the reception. Based on these accurate
ball movements, we build a linear regression model between the range and
the elapsed time of the passes. We apply a 10-fold cross validation of the
model; the accuracy score is 33.26%, while in 73.2% of the cases we are able to
estimate the time duration of the pass with an error of at most one second.
Using this regression model, we estimate the speed of the passes and, as such,
the time of the pass reception to increase the number of instances where the position
of the players is known.</p>
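      <p>The four options can be summarized in a short sketch. It assumes a plain least-squares line between pass range and elapsed time for option 3; the function names, the method labels, and the sample one-touch passes are hypothetical and only serve the illustration.</p>
      <preformat>
```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def reception_time(method, t_pass, t_next, distance, model=None):
    """Estimate when a pass was received; returns None for 'neglect'."""
    if method == "neglect":      # option 0: drop the partial datapoint
        return None
    if method == "previous":     # option 1: lower bound (pass initiation)
        return t_pass
    if method == "next":         # option 2: upper bound (next event)
        return t_next
    if method == "regression":   # option 3: predicted duration of the pass
        intercept, slope = model
        return t_pass + intercept + slope * distance
    raise ValueError("unknown method: " + method)


# Made-up one-touch passes: range in meters and elapsed time in seconds.
ranges = [5.0, 10.0, 20.0, 30.0]
durations = [1.0, 1.5, 2.5, 3.5]
model = fit_line(ranges, durations)
estimate = reception_time("regression", 100, 104, 15.0, model)
```
      </preformat>
      <p>With these illustrative numbers the regression estimate falls between the lower bound of option 1 and the upper bound of option 2, which is the behavior one would hope for from the third option.</p>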
      <p>
        At the end of the data extraction step, for each game and each player we have a
list of movements made during the game. Each movement is a tuple that contains the start and end
(x, y) coordinates of the player along with the corresponding timestamps.
Diverse field sizes. An interesting property of the rules of soccer is that the
size of the field is not fixed; there is some room to design a soccer pitch even in
the case of international matches. According to the first law of the game, the length
of the pitch shall be between 100 and 110 meters, while the width between 64
and 75 meters [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There is an ongoing standardization effort: most of the newly
constructed stadiums have a pitch with a size of 105x68 m [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Pitch sizes vary in Spain as well:
the pitch of Elche's stadium measures 108x70 m,
while that of Rayo Vallecano measures 100x65 m [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The dataset
uses relative coordinates, i.e., both dimensions of the pitch are measured between 0
and 100 units. We transform these relative units into meters using the
sizes of the stadiums. At the end of this transformation, the end points of the
movement vectors are measured in meters.
      </p>
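      <p>A minimal sketch of the coordinate transformation follows. It assumes that the x coordinate runs along the length of the pitch and the y coordinate along its width, which the text does not state explicitly; the lookup table and the function names are hypothetical, while the two pitch sizes are the ones quoted above.</p>
      <preformat>
```python
import math

# Pitch sizes in meters as (length, width), taken from the text above.
PITCH_SIZE_M = {
    "Elche": (108.0, 70.0),
    "Rayo Vallecano": (100.0, 65.0),
}


def to_meters(x_rel, y_rel, home_team):
    """Map relative (0..100) coordinates to meters on the home team's pitch.

    Assumption: x runs along the pitch length and y along its width.
    """
    length, width = PITCH_SIZE_M[home_team]
    return x_rel / 100.0 * length, y_rel / 100.0 * width


def distance_m(start_rel, end_rel, home_team):
    """Length in meters of a movement given its relative end points."""
    x0, y0 = to_meters(start_rel[0], start_rel[1], home_team)
    x1, y1 = to_meters(end_rel[0], end_rel[1], home_team)
    return math.hypot(x1 - x0, y1 - y0)


# The same relative movement is longer on the larger pitch.
start, end = (10.0, 20.0), (60.0, 20.0)
d_elche = distance_m(start, end, "Elche")
d_rayo = distance_m(start, end, "Rayo Vallecano")
```
      </preformat>
      <p>The example shows why the transformation matters: an identical movement in relative units corresponds to about 54 meters in Elche but about 50 meters at Rayo Vallecano, so speeds computed from relative units would not be comparable across stadiums.</p>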
      <p>
        Filtering. Before building the regression model of the maximal speed, we apply
a data cleaning step. As we mentioned above, the derived movement dataset
contains a lot of noise. On the one hand, this is due to the methodology we use
to derive the movement vectors, while on the other hand the time is annotated
only at a resolution of one second. As a sanity check, we apply two filters to remove the obvious flaws
from the dataset. First, we filter out all the movement vectors that span more than
20 seconds. Our choice of this constraint is based on the fact that professional
sprinters are able to run 100 meters in less than 10 seconds, and it is
reasonable to assume that the maximal speed of soccer players is above 50% of that of the
sprinters, so a player moving at maximal speed covers roughly the length of the pitch
within 20 seconds. The second filter is based on the speed of the movement: we remove
those movements where the speed of the player is larger than 15 m/s.
Quantile regression. We use the filtered movement vectors to build a
regression model that estimates the maximal speed of the players depending on the
distance they cover. Our goal is to determine the minimal time a player needs to
cover a certain distance. For this purpose, we apply the techniques of quantile
regression where the regression model estimates a specific quantile of the dataset
(instead of the mean as in the case of linear regression) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We show the speed of all
the movement vectors of a player throughout a whole season in Figure 2 along
with the 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5 quantile regression lines. We note that
the 0.5 quantile regression model estimates the conditional median and is thus the
closest counterpart of the regular linear model. Due to the lack
of accessible ground truth, it is challenging to evaluate quantitatively which
quantile is the best estimator of the players' maximal speed. Based on extensive
qualitative analysis we decided to use the 0.05 quantile regression model for the
speed estimation (marked by the red solid line in the figure).
      </p>
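      <p>The two filters and the 0.05 quantile fit can be illustrated with the sketch below. It is written under several assumptions: elapsed time is regressed on distance, movements with a non-positive duration are discarded, and the fit uses a brute-force search over pairs of points (a valid but slow way to solve a one-predictor quantile regression) rather than whatever solver the authors used. All names and sample values are hypothetical.</p>
      <preformat>
```python
from itertools import combinations


def keep(distance_m, duration_s, max_duration=20.0, max_speed=15.0):
    """Apply the two sanity filters: the duration cap and the speed cap.

    Assumption: movements with a non-positive duration are dropped as well.
    """
    if duration_s > max_duration or not duration_s > 0:
        return False
    return max_speed >= distance_m / duration_s


def pinball_loss(points, intercept, slope, q):
    """Quantile (pinball) loss of the line time = intercept + slope * distance."""
    total = 0.0
    for d, t in points:
        r = t - (intercept + slope * d)
        total += q * r if r >= 0 else (q - 1.0) * r
    return total


def quantile_line(points, q=0.05):
    """Brute-force quantile regression with a single predictor.

    Some optimal line passes through two data points, so every pair is tried.
    The cost is cubic in the number of points: fine for a sketch only.
    """
    best = None
    for (d0, t0), (d1, t1) in combinations(points, 2):
        if d0 == d1:
            continue  # a vertical line cannot be a regression line
        slope = (t1 - t0) / (d1 - d0)
        intercept = t0 - slope * d0
        loss = pinball_loss(points, intercept, slope, q)
        if best is None or best[0] > loss:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Made-up movements as (distance in meters, elapsed time in seconds).
raw = [(5, 2), (10, 3), (20, 5), (30, 7), (10, 8),
       (20, 12), (30, 18), (5, 9), (40, 1), (50, 30)]
points = [(d, t) for d, t in raw if keep(d, t)]
intercept, slope = quantile_line(points, q=0.05)
```
      </preformat>
      <p>In this toy sample the filters drop the 40 m/s movement and the 30-second movement, and the fitted line follows the fastest movements, which is the intent behind choosing a low quantile.</p>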
      <p>We evaluate the accuracy of the derived regression models based on their
consistency, i.e., how stable the parameters of the regression model are. If the
parameters of the regression model (namely, the intercept and the slope) are
similar irrespective of which subset of the dataset we use, the model can be
considered sound.</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>The evaluation of the proposed methodology is twofold. First we focus on the
overall performance of the speed estimators and then we analyze the scalability of
the methods. To investigate the accuracy of the regression models (i.e., the four
variants of how we handle the passes), we derive the quantile regression model of all
the players in the dataset using all the movements the players had throughout the
season. For each player, we randomly divide the movement vectors into two subsets and then
compute the parameters of the quantile regression line for each. Afterwards, we determine
the standard deviation of the parameters for each player separately. In
Figure 3 we show the cumulative distribution function of these deviations for
the four methods. For both parameters, the previous event method (#1) provides
the best accuracy, i.e., it has the lowest deviation in the parameters given the
random subsets of the sample. Not only is the precision of the speed estimation the
highest with the previous event method, but the method also enables us to investigate
the maximal speed of more players compared to the neglect version (539 vs. 529
players). The next event method (#2) does not enhance the accuracy of the
speed estimation, as the results reveal.</p>
      <p>We next focus on the following question: how many games do we need to
accurately estimate the maximal speed of the players? We answer this question
by analyzing the standard deviation of the parameters of the quantile regression
models given different subsets of the games a player was involved in. Specifically,
we randomly select n = 1, 2, ... games from the ones the player participated in
and derive the regression model; we repeat this ten times for each n and for each
player. We show the deviation of the parameters in Figure 4, where we focus on
the best estimator we have seen above. As the results reveal, the accuracy of the
previous event method is stable if we have data from at least 8 games. This
result implies that we are able to characterize the in-game maximal
speed of a player based on less than one quarter of a season, which has 38 games.</p>
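      <p>The resampling procedure could look like the sketch below. The fit argument stands for any estimator that returns an intercept and a slope; here an ordinary least-squares line is used purely as a stand-in for the quantile regression, and the use of the population standard deviation, the fixed random seed, and the sample games are assumptions of the example.</p>
      <preformat>
```python
import random
import statistics


def ols_fit(movements):
    """Stand-in estimator: least-squares line time = intercept + slope * distance."""
    n = len(movements)
    mean_d = sum(d for d, _ in movements) / n
    mean_t = sum(t for _, t in movements) / n
    sdd = sum((d - mean_d) ** 2 for d, _ in movements)
    sdt = sum((d - mean_d) * (t - mean_t) for d, t in movements)
    slope = sdt / sdd
    return mean_t - slope * mean_d, slope


def parameter_deviation(games, fit, n, repeats=10, rng=None):
    """Std. dev. of (intercept, slope) over repeated random n-game subsets.

    games is a list of per-game movement lists; fit maps a list of
    movements to an (intercept, slope) pair.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    intercepts, slopes = [], []
    for _ in range(repeats):
        sample = rng.sample(games, n)
        movements = [m for game in sample for m in game]
        intercept, slope = fit(movements)
        intercepts.append(intercept)
        slopes.append(slope)
    return statistics.pstdev(intercepts), statistics.pstdev(slopes)


# Made-up games, each a list of (distance in meters, time in seconds).
games = [
    [(5, 2.0), (10, 3.0)],
    [(20, 5.0), (30, 7.5)],
    [(10, 3.5), (40, 9.0)],
    [(15, 4.0), (25, 6.5)],
]
sd_two_games = parameter_deviation(games, ols_fit, n=2)
sd_all_games = parameter_deviation(games, ols_fit, n=len(games))
```
      </preformat>
      <p>When every subset contains all the games, the deviation collapses to (numerically) zero; plotting the deviation against n for smaller subsets is one way to reproduce the kind of curve discussed for Figure 4.</p>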
      <p>We analyze the accuracy and the information need of the different methods in
Figure 5. Here we apply thresholds for the deviation of the parameters. For each
player we determine the minimal number of games that enables us to estimate the
maximal speed of the player with the given accuracy. Specifically, the thresholds
are 0.25 and 0.025 for the intercept and the slope, respectively. For
50 percent of the players, data from five games is enough for an
accurate enough estimation of their maximal speed (with the previous event
method). There are large discrepancies among the methods, e.g., the neglect
and next event methods need twice as many games to provide an accurate speed
estimation for 80 percent of the players compared to the previous event method.
Based on the results we can draw the following conclusion: one should use the
previous event or the regression methods.</p>
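      <p>One possible reading of the thresholding step is sketched below: given the parameter deviations of a player for each number of games, return the smallest number of games that satisfies both thresholds. The function name and the sample deviations are hypothetical; only the two threshold values come from the text.</p>
      <preformat>
```python
def games_needed(deviation_by_n, max_intercept_sd=0.25, max_slope_sd=0.025):
    """Smallest number of games whose deviations meet both thresholds.

    deviation_by_n maps a number of games n to (intercept_sd, slope_sd).
    Returns None if no n satisfies the thresholds.
    """
    for n in sorted(deviation_by_n):
        intercept_sd, slope_sd = deviation_by_n[n]
        if max_intercept_sd >= intercept_sd and max_slope_sd >= slope_sd:
            return n
    return None


# Made-up deviations of one player for n = 1..6 games.
player = {
    1: (0.9, 0.08),
    2: (0.5, 0.03),
    3: (0.3, 0.02),
    4: (0.2, 0.03),
    5: (0.2, 0.02),
    6: (0.1, 0.01),
}
needed = games_needed(player)
```
      </preformat>
      <p>For this made-up player the result is five games: with three games the intercept is still too unstable, and with four games the slope is. A stricter variant could require that all larger n satisfy the thresholds too.</p>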
      <p>There are team-specific discrepancies in the information need of the
methods. Table 1 presents the mean number of games required to estimate the
speed of the players of a given team accurately. In general, we need the fewest
games for the players of FC Barcelona. This is in line with the
fact that FC Barcelona dominates the ball possession in its games; as such, its
players have numerous ball-related events and, consequently, movement vectors.
However, with the previous event method, we need only 2.6 games to estimate
the speed of the players of Celta de Vigo too. In some cases, the discrepancy
in the required number of games is significant, e.g., we need 50% more games
for Espanyol and Valencia using the neglect methodology compared to
FC Barcelona. These differences have a crucial impact on one of the application
domains of the methodology: player scouting (i.e., one has to analyze more games
if the player is part of a specific team).</p>
      <p>The proposed methodology can indeed be used for player scouting. As
Figure 6 shows, the maximal in-game speed characteristics of the players are diverse;
hence, they provide an additional facet for performance evaluation. One can
identify a suitable candidate to sign who has the physical capabilities necessary for the
playing style of a given team.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We proposed a new technique to estimate the maximal speed of soccer players.
Using event-based datasets of eight games we are able to accurately determine
the speed profile of the players. The investigations revealed that teams require
datasets of diverse sizes for a precise speed estimation. As future work, we
plan to analyze the discrepancies of the estimations across players and leagues.
Our method provides a new way to evaluate the performance of soccer players,
particularly from a physical performance point of view. Such insights can be
used as a competitive advantage for opponent and player scouting.</p>
      <p>Table 1 (columns: Team; intercept for methods #0, #1, #2, #3).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sally</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Numbers Game: Why Everything You Know about Football is Wrong</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Duch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waitzman</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amaral</surname>
            ,
            <given-names>L.A.N.</given-names>
          </string-name>
          :
          <article-title>Quantifying the performance of individual players in a team activity</article-title>
          .
          <source>PLoS ONE</source>
          <volume>5</volume>
          (
          <issue>6</issue>
          ),
          <elocation-id>e10937</elocation-id>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. FIFA:
          <article-title>Laws of the Game</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gyarmati</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Searching for a unique style in soccer</article-title>
          .
          <source>In: Proc. 2014 KDD Workshop on Large-Scale Sports Analytics</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Koenker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hallock</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Quantile regression: An introduction</article-title>
          .
          <source>Journal of Economic Perspectives</source>
          <volume>15</volume>
          (
          <issue>4</issue>
          ),
          <fpage>43</fpage>
          –
          <lpage>56</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lucey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carr</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matthews</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Assessing team strategy using spatiotemporal data</article-title>
          .
          <source>In: Proc. 19th ACM SIGKDD. ACM</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Marca:
          <article-title>Cual es el campo mas grande de la Liga?</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Narizuka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamamoto</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamazaki</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Statistical properties of position-dependent ball-passing networks in football games</article-title>
          .
          <source>arXiv:1311.0641</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. OptaPro: http://optasportspro.com (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Peña</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Touchette</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A network theory analysis of football strategies</article-title>
          .
          <source>arXiv preprint arXiv:1206.6904</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. UEFA: Guide to Quality Stadiums (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>