<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting the Potential of Professional Soccer Players</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruben Vroonen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom Decroos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Van Haaren</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesse Davis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven, Department of Computer Science</institution>
          ,
          <addr-line>Celestijnenlaan 200A, 3001 Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SciSports</institution>
          ,
          <addr-line>Hengelosestraat 500, 7251 AN Enschede</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Projecting how a player's skill level will evolve in the future is a crucial problem faced by sports teams. Traditionally, player projections have been evaluated by human scouts, who are subjective and may suffer from biases. More recently, there has been interest in automated projection systems such as the PECOTA system for baseball and the CARMELO system for basketball. In this paper, we present a projection system for soccer players called APROPOS, which is inspired by the CARMELO and PECOTA systems. APROPOS predicts the potential of a soccer player by searching a historical database to identify similar players of the same age. It then bases its prediction for the target player's progression on how the similar previous players actually evolved. We evaluate APROPOS on players from the five biggest European soccer leagues and show that it clearly outperforms a more naive baseline.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With more than 250 million players, soccer is the most popular sport in the world.
Due to technological advances, new soccer data sources such as event streams
and optical tracking from matches are rapidly becoming available. This has led
to an explosion of interest in the area of soccer analytics. Most research tends to
focus on analyzing soccer gameplay (e.g., [11, 7, 8, 9, 10]). This work ranges from
formation identification [7] to evaluating the quality of shots [9, 10] and detecting
commonly employed offensive strategies [8].</p>
      <p>Another relevant problem in soccer analytics is projecting how a player's
skill level will change over time. This is particularly important for clubs, as
it can influence a club's player acquisition and retention strategies. In other
sports, projection systems have been developed that predict a player's future
performance. Two well-known examples of such systems are PECOTA [5] for
MLB baseball and CARMELO [2] for NBA basketball. In a similar spirit, this
paper proposes APROPOS, a system that can predict the future potential of
professional soccer players. Like past approaches, we project a target player's
potential by searching a historical database to identify other players with a
similar profile to the target player when they were the target player's age. Then,
the target player's evolution is predicted based on the observed evolutions of the
identified similar players. However, one challenge in soccer is the relative paucity
of events, particularly those that can be related to a match outcome. Thus, we use
a set of expert ratings for a number of skills that are available on the SoFIFA.com
website to compare the similarity between two players. This contrasts with past
systems (e.g., PECOTA and CARMELO) that measure similarity based on past
statistics and personal descriptive characteristics such as height and weight.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Multiple projection systems exist that try to predict a player's future level of
performance. We focus on two specific systems: PECOTA and CARMELO. Both
follow the same high-level outline that our system uses. First, given a target
player, they compare the target player's profile to previous players' profiles when
they were at the same stage of development as the target player. Second, they
project the target player's future performance based on how the similar previous
players evolved.</p>
      <sec id="sec-2-1">
        <title>PECOTA</title>
        <p>The PECOTA system (Player Empirical Comparison and Optimization Test
Algorithm), named after former professional baseball player Bill Pecota, is a
projection system used within Major League Baseball (MLB). It predicts the
career path of a baseball player by fitting his previous statistics to those of similar
players using Bill James's similarity scores [6]. Originally developed by Nate Silver in
2002, it is currently managed by Baseball Prospectus [1], which releases
seasonal predictions for every MLB player each year.</p>
      </sec>
      <sec id="sec-2-2">
        <title>CARMELO</title>
        <p>The CARMELO system is a simplified version of the PECOTA projection model
and is adapted to NBA (National Basketball Association) players. The system,
named after basketball player Carmelo Anthony, gathers player statistics,
characteristics and vital attributes. Every player starts with a similarity score of 100
and points are subtracted for each difference in 19 weighted statistics. The final
prediction of the level of a player is made by taking the weighted average of the
Wins Above Replacement (WAR) of the players with a score above 0, where
the similarity scores are used as weights. The system is maintained by the site
fivethirtyeight.com [3].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>To develop the projection system, we need data to measure the level of
players. For this, we used the expert ratings on the SoFIFA.com website [4], which
provides the player ratings that are included in the realistic FIFA video games
published by EA Sports.<sup>3</sup> Each player is rated on 24 different skills and each skill
is rated on a 0 to 100 scale. On the SoFIFA.com website, each player has a card
which displays his ratings. Figure 1 shows Lionel Messi's card. The SoFIFA.com
website has been publishing FIFA player ratings since 2007. Initially, the player
cards were updated semi-annually. Since 2014, the ratings have been released weekly.</p>
      <p>We use data for players from the English, French, German, Italian and
Spanish competitions. These competitions are the most popular and have the most
accurate and complete information. Our database contains 57 860 player cards
for 10 247 players. On average, there are 5.65 years of data for each player.</p>
    </sec>
    <sec id="sec-4">
      <title>The APROPOS projection system</title>
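      <p>The algorithm we designed is called APROPOS (Algorithm for PRediction Of
the Potential Of Soccer players). Like the PECOTA and CARMELO projection
systems, it uses a nearest neighbors approach to predict how a soccer player's
skill will evolve over time. Formally, the task can be defined as follows:</p>
      <p><bold>Given:</bold> A player <italic>p</italic>, a set of skill ratings <inline-formula><tex-math>V_{p_{a_1}}</tex-math></inline-formula> for <italic>p</italic> at his current age <inline-formula><tex-math>a_1</tex-math></inline-formula>, and a
future age <inline-formula><tex-math>a_2</tex-math></inline-formula>;</p>
      <p><bold>Predict:</bold> <inline-formula><tex-math>V_{p_{a_2}}</tex-math></inline-formula>, which is <italic>p</italic>'s set of skill ratings at age <inline-formula><tex-math>a_2</tex-math></inline-formula>.</p>
      <p><sup>3</sup> https://www.easports.com/fifa</p>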
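      <p>To tackle this problem, the APROPOS projection system also requires as
input a similarity metric <italic>sim</italic>, a similarity threshold <italic>t</italic>, and a database of players
<italic>D</italic>. Then, given a player <italic>p</italic> and a future age <inline-formula><tex-math>a_2</tex-math></inline-formula>, it works as follows.</p>
      <p>1. Add all players <inline-formula><tex-math>p'</tex-math></inline-formula> in <italic>D</italic> to set <italic>S</italic> if (1) the data for the age <inline-formula><tex-math>a_2</tex-math></inline-formula> season of <inline-formula><tex-math>p'</tex-math></inline-formula> is
available in <italic>D</italic>, and (2) when the age of <inline-formula><tex-math>p'</tex-math></inline-formula> is <inline-formula><tex-math>a_1</tex-math></inline-formula>, <inline-formula><tex-math>\mathit{sim}(p, p') \geq t</tex-math></inline-formula>. If <italic>S</italic> contains fewer
than ten players, then <italic>S</italic> consists of the ten most similar players.</p>
      <p>2. Predict <inline-formula><tex-math>V_{p_{a_2}}</tex-math></inline-formula> by combining the ratings of all players in <italic>S</italic>.</p>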
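      <p>The first step above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the callables <monospace>sim</monospace> and <monospace>has_future_data</monospace>, and the parameter names are hypothetical. We also assume that the target player is excluded from his own neighbor set and that the ten-player fallback is restricted to players whose data at the future age is available, neither of which the text states explicitly.</p>
      <preformat>
```python
from typing import Callable, Hashable, List, Tuple


def select_similar_players(
    target: Hashable,
    candidates: List[Hashable],
    sim: Callable[[Hashable, Hashable], float],
    has_future_data: Callable[[Hashable], bool],
    threshold: float = 0.9,
    min_neighbors: int = 10,
) -> List[Tuple[Hashable, float]]:
    """Build the set S of (player, similarity) pairs for the target player."""
    # Condition (1): keep only candidates whose season at the future age
    # is available. Excluding the target himself is an assumption.
    scored = [
        (candidate, sim(target, candidate))
        for candidate in candidates
        if candidate != target and has_future_data(candidate)
    ]
    # Most similar candidates first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Condition (2): the similarity must reach the threshold t.
    selected = [pair for pair in scored if pair[1] >= threshold]
    # Fallback: with fewer than ten matches, take the ten most similar players.
    if min_neighbors > len(selected):
        selected = scored[:min_neighbors]
    return selected
```
      </preformat>
      <p>Passing the similarity metric in as a function keeps this selection step independent of which of the two similarity scores is used.</p>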
      <p>Next, we describe in detail how we perform each step.</p>
      <sec id="sec-4-1">
        <title>Similarity scores</title>
        <p>We have developed two different scores to measure the similarity between two
players. Both scores compare the similarity for every single skill rating reported
on SoFIFA.com and combine them into a final similarity score. The final score
is a real number between 0 and 1, where 0 means not similar at all and 1 means
completely similar (i.e., the two players are identical in all their skill ratings).</p>
        <p><bold>Absolute similarity score</bold> The absolute similarity score first calculates a
similarity score for each skill <inline-formula><tex-math>V_r</tex-math></inline-formula> as
          <disp-formula><label>(1)</label><tex-math>\mathit{absolute}_{V_r}(p, p', a_1, y) = 1 - \frac{\sqrt{\sum_{a=a_1-y+1}^{a_1} \left(v_{r,p_a} - v_{r,p'_a}\right)^2}}{\sqrt{\sum_{a=a_1-y+1}^{a_1} \max\left(v_{r,p_a},\, 100 - v_{r,p_a}\right)^2}}</tex-math></disp-formula>
where <inline-formula><tex-math>v_{r,p_a}</tex-math></inline-formula> represents player <italic>p</italic>'s observed rating for skill <inline-formula><tex-math>V_r</tex-math></inline-formula> at age <italic>a</italic>, and <italic>y</italic>
is the number of years of data that the similarity metric considers. The
denominator normalizes the score relative to the maximum Euclidean distance
possible for player <italic>p</italic> to reflect the percentage of similarity.
        </p>
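        <p>As a worked illustration of the per-skill absolute score, the following Python sketch computes it for two rating histories. It assumes that each player's ratings for one skill are given as a list with one value per season, covering the same <italic>y</italic> seasons; the function name and this data layout are hypothetical.</p>
        <preformat>
```python
import math
from typing import Sequence


def absolute_skill_similarity(target: Sequence[float], other: Sequence[float]) -> float:
    """Per-skill absolute similarity over y seasons (one rating per season)."""
    if len(target) != len(other) or not target:
        raise ValueError("both players need the same, non-zero number of seasons")
    # Numerator: Euclidean distance between the two rating histories.
    distance = math.sqrt(sum((t - o) ** 2 for t, o in zip(target, other)))
    # Denominator: the largest distance any player could have from the target
    # on the 0 to 100 scale, i.e. sitting at the far end of the scale each season.
    max_distance = math.sqrt(sum(max(t, 100 - t) ** 2 for t in target))
    return 1 - distance / max_distance
```
        </preformat>
        <p>For instance, a target rated 80, 82 and 85 over three seasons and a candidate rated 78, 82 and 88 obtain a score of roughly 0.975, while identical histories obtain exactly 1.</p>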
        <p>The final absolute similarity score is computed as the average over all skills:
          <disp-formula><label>(2)</label><tex-math>\mathit{sim}_{abs}(p, p', a_1, y) = \frac{\sum_{V_r \in V} \mathit{absolute}_{V_r}(p, p', a_1, y)}{|V|}</tex-math></disp-formula>
        </p>
        <p><bold>Evolutionary similarity score</bold> The skill level of a player is also partially
dependent on his team and the competition in which he plays. Some competitions
and teams are stronger than others. This may introduce a bias in the ratings, as a
player's skill may be overestimated (underestimated) because his skill is rated relative to
his less (more) talented teammates or opponents. To attempt to control for this,
instead of comparing the absolute value of the skill rating, we look at changes
in a player's skill rating between two consecutive years. The evolutionary similarity
score first calculates a similarity score for each skill <inline-formula><tex-math>V_r</tex-math></inline-formula> as:
          <disp-formula><label>(3)</label><tex-math>\mathit{evolution}_{V_r}(p, p', a_1, y) = \sqrt{\sum_{a=a_1-y+2}^{a_1} \left(\left(v_{r,p_a} - v_{r,p_{a-1}}\right) - \left(v_{r,p'_a} - v_{r,p'_{a-1}}\right)\right)^2}</tex-math></disp-formula>
where <inline-formula><tex-math>v_{r,p_a}</tex-math></inline-formula> represents player <italic>p</italic>'s observed rating for skill <inline-formula><tex-math>V_r</tex-math></inline-formula> at age <italic>a</italic>, and <italic>y</italic>
is the number of years of past data that the similarity metric considers.
Because the metric considers the change in skill between two consecutive years,
the measure only looks at <inline-formula><tex-math>y - 1</tex-math></inline-formula> values when comparing each skill.
        </p>
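        <p>The total evolutionary score is computed by summing over all skill values:
          <disp-formula><label>(4)</label><tex-math>\mathit{evo}_{tot}(p, p', a_1, y) = \sum_{V_r \in V} \mathit{evolution}_{V_r}(p, p', a_1, y)</tex-math></disp-formula>
        </p>
        <p>Then, the final score is computed by normalizing the total score relative to the
range of total scores for all players in set <italic>S</italic>:
          <disp-formula><label>(5)</label><tex-math>\mathit{sim}_{evo}(p, p', a_1, y) = 1 - \frac{\mathit{evo}_{tot}(p, p', a_1, y) - \min_{p'' \in S} \mathit{evo}_{tot}(p, p'', a_1, y)}{\max_{p'' \in S} \mathit{evo}_{tot}(p, p'', a_1, y) - \min_{p'' \in S} \mathit{evo}_{tot}(p, p'', a_1, y)}</tex-math></disp-formula>
where <italic>S</italic> is the set of similar players for <italic>p</italic>. This normalization maps the least
similar player's score to 0 and the most similar player's score to 1. This similarity score no longer
reflects the percentage of similarity. Instead, it can only be used to rank players
according to their similarity. If player <italic>p</italic> has a higher evolutionary similarity to
player <inline-formula><tex-math>p'</tex-math></inline-formula> than to player <inline-formula><tex-math>p''</tex-math></inline-formula>, we can conclude that player <italic>p</italic> is more similar to
player <inline-formula><tex-math>p'</tex-math></inline-formula> than to player <inline-formula><tex-math>p''</tex-math></inline-formula>.</p>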
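        <p>The evolutionary similarity score can be illustrated with the following Python sketch. It assumes that each player is represented as a dictionary mapping a skill name to his ratings over the same <italic>y</italic> seasons; the function names and this data layout are hypothetical. The text does not say what happens when all candidates have the same total score, so the sketch treats them as equally similar, which is our assumption.</p>
        <preformat>
```python
import math
from typing import Dict, Hashable, Sequence

# Skill name -> ratings over the last y seasons (hypothetical layout).
Ratings = Dict[str, Sequence[float]]


def evolution_distance(target: Sequence[float], other: Sequence[float]) -> float:
    """Per-skill distance between year-over-year changes (y - 1 differences)."""
    target_changes = [b - a for a, b in zip(target, target[1:])]
    other_changes = [b - a for a, b in zip(other, other[1:])]
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(target_changes, other_changes)))


def total_evolution_distance(target: Ratings, other: Ratings) -> float:
    """Sum of the per-skill distances over all skills."""
    return sum(evolution_distance(target[skill], other[skill]) for skill in target)


def evolutionary_similarities(
    target: Ratings, candidates: Dict[Hashable, Ratings]
) -> Dict[Hashable, float]:
    """Min-max normalise the total distances so the closest candidate scores 1."""
    totals = {
        name: total_evolution_distance(target, ratings)
        for name, ratings in candidates.items()
    }
    low, high = min(totals.values()), max(totals.values())
    if high == low:
        # Assumption: ties are not covered by the text; treat all as equally similar.
        return {name: 1.0 for name in totals}
    return {name: 1 - (total - low) / (high - low) for name, total in totals.items()}
```
        </preformat>
        <p>Note that a candidate who is rated ten points lower than the target in every season but improves by the same amount each year obtains a total score of 0 and hence, after normalization, the highest similarity.</p>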
      </sec>
      <sec id="sec-4-2">
        <title>Prediction methods</title>
        <p>We consider two ways to predict a player's future rating for a given skill: the
absolute prediction and the evolutionary prediction.</p>
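        <p><bold>Absolute prediction method</bold> The absolute prediction simply computes <italic>p</italic>'s
expected rating for a skill at age <inline-formula><tex-math>a_2</tex-math></inline-formula> as a weighted average of the observed ratings
for each similar player found in the data. Specifically, the predicted rating for
skill <inline-formula><tex-math>V_r</tex-math></inline-formula> for player <italic>p</italic> at age <inline-formula><tex-math>a_2</tex-math></inline-formula> is:
          <disp-formula><label>(6)</label><tex-math>\hat{v}^{abs}_{r,p_{a_2}} = \frac{\sum_{p' \in S} \mathit{sim}(p, p', a_1, y) \cdot v_{r,p'_{a_2}}}{\sum_{p' \in S} \mathit{sim}(p, p', a_1, y)}</tex-math></disp-formula>
where <italic>sim</italic> is the chosen similarity metric, <italic>S</italic> is the set of similar players, and <italic>y</italic>
is the number of years of data that the similarity metric considers.</p>
        <p><bold>Evolutionary prediction method</bold> Because players' skill levels can vary, an
alternative idea is to consider <italic>p</italic>'s current skill level as a baseline and predict how
this will evolve over time. This can be done by adjusting <italic>p</italic>'s current skill by a
weighted average of the observed difference in rating for the skill at age <inline-formula><tex-math>a_2</tex-math></inline-formula> and
age <inline-formula><tex-math>a_1</tex-math></inline-formula> for each similar player found in the data. Thus the predicted rating for
skill <inline-formula><tex-math>V_r</tex-math></inline-formula> for a player <italic>p</italic> at age <inline-formula><tex-math>a_2</tex-math></inline-formula> is:
          <disp-formula><label>(7)</label><tex-math>\hat{v}^{evo}_{r,p_{a_2}} = v_{r,p_{a_1}} + \frac{\sum_{p' \in S} \mathit{sim}(p, p', a_1, y) \cdot \left(v_{r,p'_{a_2}} - v_{r,p'_{a_1}}\right)}{\sum_{p' \in S} \mathit{sim}(p, p', a_1, y)}</tex-math></disp-formula>
        </p>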
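        <p>Both prediction methods reduce to similarity-weighted averages, which the following Python sketch illustrates for a single skill. It assumes that each similar player is given as a triple of his similarity to the target, his rating at the current age and his rating at the future age; the names and this layout are hypothetical.</p>
        <preformat>
```python
from typing import Sequence, Tuple

# (similarity to target, rating at age a1, rating at age a2) for one skill.
Neighbor = Tuple[float, float, float]


def _total_weight(neighbors: Sequence[Neighbor]) -> float:
    total = sum(sim for sim, _, _ in neighbors)
    if total == 0:
        raise ValueError("similarity weights sum to zero")
    return total


def predict_absolute(neighbors: Sequence[Neighbor]) -> float:
    """Weighted average of the neighbours' observed ratings at age a2."""
    weighted = sum(sim * future for sim, _, future in neighbors)
    return weighted / _total_weight(neighbors)


def predict_evolutionary(current_rating: float, neighbors: Sequence[Neighbor]) -> float:
    """Target's current rating plus the weighted average change of his neighbours."""
    weighted_change = sum(sim * (future - past) for sim, past, future in neighbors)
    return current_rating + weighted_change / _total_weight(neighbors)
```
        </preformat>
        <p>With two neighbors (1.0, 70, 74) and (0.5, 80, 81) and a target currently rated 75, the absolute prediction is about 76.33, whereas the evolutionary prediction is 78 because the neighbors improved by a weighted average of three points.</p>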
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>We now evaluate the predictive accuracy of the APROPOS projection system.
Our goal is to answer the following four questions:</p>
      <p>Q1 How well can APROPOS predict ratings one year in the future?</p>
      <p>Q2 How does APROPOS' predictive accuracy vary with how far into the future
it projects ratings?</p>
      <p>Q3 What is the effect of the number of years of data used to compute the
similarity between two players on APROPOS' predictive performance?</p>
      <p>Q4 How does the threshold used to identify similar players affect APROPOS'
predictive performance?</p>
      <sec id="sec-5-1">
        <title>Experimental setup</title>
        <p>We compare four different systems:</p>
        <p>Baseline: Given a prediction age <inline-formula><tex-math>a_2</tex-math></inline-formula>, the baseline finds all players of the same
age and for each skill simply predicts the average rating over all players.</p>
        <p>ABS-ABS: This uses our absolute similarity metric and absolute prediction
mechanism.</p>
        <p>ABS-EVO: This uses our absolute similarity metric and evolutionary prediction
mechanism.</p>
        <p>EVO-EVO: This uses our evolutionary similarity metric and evolutionary
prediction mechanism.</p>
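        <p>The baseline can be sketched in a few lines of Python. The sketch assumes that the database is a collection of player cards, each consisting of the player's age and a dictionary of skill ratings; this layout and the function name are hypothetical.</p>
        <preformat>
```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# One player card: (age, {skill name: rating}). The layout is an assumption.
Card = Tuple[int, Dict[str, float]]


def baseline_prediction(cards: Iterable[Card], age: int) -> Dict[str, float]:
    """Average rating per skill over all cards of players with the given age."""
    ratings_by_skill = defaultdict(list)
    for card_age, skills in cards:
        if card_age == age:
            for skill, rating in skills.items():
                ratings_by_skill[skill].append(rating)
    return {skill: mean(values) for skill, values in ratings_by_skill.items()}
```
        </preformat>
        <p>Because this prediction ignores the target player entirely, every player of a given age receives the same predicted ratings.</p>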
        <p>To run the experiments, we predict the potential ratings of 1000 players in the
English and German competitions. For each player, we use data from 2012 and
earlier to compute similarities and make predictions for 2013 onwards.
We select this cutoff as it yields five years of data in both the train and test sets,
which allows us to vary how much data is used to identify similar players while
also making predictions up to five years in the future. As an error metric, we
report the mean absolute error (MAE), which is an average over all players and all
skills. Recall that each skill is scored from 0 to 100.</p>
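        <p>The error metric can be made concrete with a short Python sketch. It assumes that predictions and observations are stored as nested dictionaries that map a player to his skill ratings; the names and this layout are hypothetical.</p>
        <preformat>
```python
from typing import Dict, Hashable

SkillRatings = Dict[str, float]


def mean_absolute_error(
    predicted: Dict[Hashable, SkillRatings],
    observed: Dict[Hashable, SkillRatings],
) -> float:
    """Average absolute rating error over all players and all skills."""
    errors = [
        abs(predicted[player][skill] - actual)
        for player, skills in observed.items()
        for skill, actual in skills.items()
    ]
    return sum(errors) / len(errors)
```
        </preformat>
        <p>Since every skill is rated on the same 0 to 100 scale, the resulting MAE can be read directly as an average number of rating points.</p>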
        <p>To evaluate the first question, we predict the ratings for 2013. We use three
years to compute player similarities and the threshold for selecting the best
players is set to 0.9. For the second question, we use an identical setup except
we predict the results for each year in the period from 2013 to 2017 inclusive.
For the third question, we predict the 2013 rating using a similarity threshold of
0.9 and vary the number of years used to compute player similarities from one
to five. Finally, for the fourth question, we predict the 2013 rating using three
years to compute player similarities and vary the threshold used to identify
similar players from 0.7 to 0.9 in increments of 0.05.</p>
        <p>We presented a first approach for predicting the potential of professional soccer
players. We developed and evaluated the APROPOS projection system, which
makes predictions for the potential using a k-nearest neighbors approach. We
introduced multiple metrics to measure the similarities between players and
multiple methods to predict player potentials leveraging the resulting similarities.
Our best models predict the player potentials sufficiently accurately. The most
influential parameter is the choice of the predictive method. The best model has
a maximum mean absolute error of only 2.15 on the 100-point rating scale.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Acknowledgements</title>
        <p>Tom Decroos is supported by the KU Leuven Research Fund (C22/15/015) and
FWO-Vlaanderen (G.0356.12). Jan Van Haaren was supported by the Agency
for Innovation by Science and Technology in Flanders (IWT). Jesse Davis is
partially supported by the KU Leuven Research Fund (C22/15/015) and
FWO-Vlaanderen (G.0356.12, SBO-150033).</p>
        <p>[4] SoFIFA. URL: https://sofifa.com, last checked on 2017-6-5</p>
        <p>[5] Wikipedia: PECOTA. URL: https://en.wikipedia.org/wiki/PECOTA, last
checked on 2017-6-5</p>
        <p>[6] Wikipedia: Similarity score. URL: https://en.wikipedia.org/wiki/Similarity_score, last checked on 2017-6-5</p>
        <p>[7] Alina Bialkowski, Patrick Lucey, Peter Carr, Yisong Yue, Sridha Sridharan and
Iain Matthews: Large-Scale Analysis of Soccer Matches Using Spatiotemporal
Tracking Data. Proceedings of the IEEE International Conference
on Data Mining (ICDM) (2014)</p>
        <p>[8] Jan Van Haaren, Vladimir Dzyuba, Siebe Hannosset and Jesse Davis:
Automatically Discovering Offensive Patterns in Soccer Match Data. Advances in Intelligent
Data Analysis XIV, pp. 286-297 (2015)</p>
        <p>[9] Martin Eastwood: Expected Goals and Support Vector Machines. URL: http://pena.lt/y/2015/07/13/expected-goals-svm/, last checked on 2017-6-5</p>
        <p>[10] Patrick Lucey, Alina Bialkowski, Mathew Monfort, Peter Carr and Iain Matthews:
Quality vs Quantity: Improved Shot Prediction in Soccer Using Strategic Features
from Spatiotemporal Data. MIT Sloan Sports Analytics Conference (2015)</p>
        <p>[11] Tom Decroos, Vladimir Dzyuba, Jan Van Haaren and Jesse Davis: Predicting Soccer
Highlights from Spatio-temporal Match Event Streams. Proceedings of the AAAI
Conference on Artificial Intelligence (2017)</p>
        <p>Fig. 4. The MAE for predicting skill ratings for 2013 using 0.9 as the threshold for
identifying similar players and varying the number of years used to compute similarity
scores from one to five.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Baseball Prospectus. URL: <uri>http://www.baseballprospectus.com</uri>, last checked on 2017-6-5</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] CARMELO Projection System. URL: <uri>https://fivethirtyeight.com/features/how-were-predicting-nba-player-career/</uri>, last checked on 2017-6-5</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] FiveThirtyEight (538). URL: <uri>https://fivethirtyeight.com</uri>, last checked on 2017-6-5</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>