<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Sports Analytics Models: Challenges, Approaches, and Lessons Learned</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jesse Davis</string-name>
          <email>jesse.davis@kuleuven.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lotte Bransen</string-name>
          <email>lotte.bransen@kuleuven.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurens Devos</string-name>
          <email>laurens.devos@kuleuven.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wannes Meert</string-name>
          <email>wannes.meert@kuleuven.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Robberechts</string-name>
          <email>pieter.robberechts@kuleuven.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Van Haaren</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maaike Van Roy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Club Brugge</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Leuven.AI, KU Leuven</institution>
          ,
          <addr-line>Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SciSports</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>There has been an explosion of data collected about sports. Because such data is extremely rich and complicated, machine learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in turn used to inform decision-making at professional clubs. Unfortunately, evaluating the use of machine learning in the context of sports remains extremely challenging. On the one hand, it is necessary to evaluate the developed indicators themselves, where one is confronted by a lack of labels and small sample sizes. On the other hand, it is necessary to evaluate the models themselves, which is complicated by the noisy and non-stationary nature of sports data. In this paper, we highlight the inherent evaluation challenges in sports and discuss a variety of approaches for evaluating both indicators and models. In particular, we highlight how reasoning techniques, such as verification, can be used to aid in the evaluation of learned models.</p>
      </abstract>
      <kwd-group>
        <kwd>sports analytics</kwd>
        <kwd>challenges with evaluation</kwd>
        <kwd>indicator evaluation</kwd>
        <kwd>model evaluation</kwd>
        <kwd>model verification</kwd>
        <kwd>reliability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>At a high level, ML plays a role in team sports in three areas:</title>
      <sec id="sec-1-1">
        <title>1. Introduction</title>
        <p>Sports is becoming an increasingly data-driven field, as there are now large amounts of data about both the physical states of athletes, such as heart rate, GPS, and inertial measurement units (e.g., Catapult Sports), and technical performances in matches, such as play-by-play (e.g., Stats Perform, StatsBomb) or optical tracking data (e.g., TRACAB, Second Spectrum, SkillCorner). The volume, complexity and richness of these data sources have made machine learning (ML) an increasingly important analysis tool. Consequently, ML is being used to inform decision-making in professional sports. On the one hand, it is used to extract actionable insights from the large volumes of data related to player performance, tactical approaches, and the physical status of players. On the other hand, it is used to partially automate tasks such as video analysis that are typically done manually.</p>
        <p>Player recruitment. Ultimately, recruitment involves (1) assessing a player’s skills and capabilities on a technical, tactical, physical and mental level and how they will evolve, (2) projecting how the player will fit within the team, and (3) forecasting how their financial valuation will develop (cf. [1, 2, 3, 4]).</p>
        <p>Match preparation. Preparing for a match requires performing an extensive analysis of the opposing team to understand their tendencies and tactics. This can be viewed as a SWOT analysis, which particularly focuses on the opportunities and threats. How can we punish the opponent? How can the opponent punish us? These findings are used by the coaching staff to prepare a game plan. Typically, such reports are prepared by analysts who spend many hours watching videos of upcoming opponents. The analysts must annotate footage and recognize reoccurring patterns, which is a very time-consuming task. Learned models can automatically identify patterns that are missed or not apparent to humans (e.g., subtle patterns in big data) [5], automate tasks (e.g., tagging of situations) [6, 7] that are done by human analysts, and give insights into players’ skills.</p>
        <p>Management of players’ health and fitness. Building up and maintaining a player’s fitness level is crucial for achieving good performances [8, 9]. However, training and matches place athletes’ bodies under tremendous stress. It is crucial to monitor fitness, have a sense of how much an athlete can do or, most importantly, when they need to rest and recover. Moreover, managing and preventing injuries is crucial for a team’s stability and continuity, which is linked to success.</p>
        <p>One of the most common uses of ML for addressing the aforementioned tasks is developing novel indicators for quantifying performances. Typically, machine-learned models are trained on large historical databases of past matches. Afterwards, the indicator is derived from the model, as the indicator cannot be used directly as a target for training because it is not in the data. One prominent example of such an indicator is expected goals (xG) [10], which is used in soccer and ice hockey to quantify the quality of the scoring opportunities that a team or player created. The underlying model is a binary classifier that predicts the outcome of a shot based on features such as the distance and angle to the goal, the assist type and the passage of play.<sup>1</sup> It is typically a more consistent measure of performance than actual goals, which are extremely important in these sports but also very rare. Even shots are relatively infrequent, and their conversion is subject to variance. The idea of xG is to separate the ability to get into good scoring positions from the inherent randomness (e.g., deflections) of converting them into goals.</p>
        <p>Typically, an indicator should satisfy several properties. First, it should provide insights that are not currently available. For example, xG should tell you something beyond looking at goals scored. Second, the indicator should be based on domain knowledge and concepts from sports such that it is intuitive and easy for non-ML experts to understand. Finally, the domain experts need to trust the indicator. This often boils down to being able to contextualize when the indicator is useful and ensuring some level of robustness in its value (i.e., it should not wildly fluctuate).</p>
        <p>These desiderata illustrate that a key challenge in developing indicators is how to evaluate them: none of the desiderata naturally align with the standard performance metrics used to evaluate learned models. This does not imply that standard evaluation metrics are not important. In particular, ensuring that probability estimates are well-calibrated is crucial in many sports analytics tasks. It is simply that one must evaluate both the indicator itself and the models used to compute the indicator’s value. The goal of this paper is three-fold.</p>
        <p>First, we will highlight some of the challenges that arise when trying to evaluate work in the context of sports data. Second, we will discuss the various ways that indicator evaluation has been approached. Third, we will give an overview of how the learned models that the indicators rely upon have been evaluated. While we will briefly discuss some standard evaluation metrics, we will focus on a more speculative use of reasoning techniques for model evaluation. This paper focuses on the context of professional soccer, where we have substantial experience. However, we believe the lessons and insights are applicable to other team sports, or to domains other than sports.</p>
        <p><sup>1</sup>For an interactive discussion of xG, see: https://dtai.cs.kuleuven.be/sports/blog/illustrating-the-interplay-between-features-and-models-in-xg</p>
        <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
        <p>2. Common Sports Data and Analytics Tasks</p>
        <p>This section serves as a short, high-level primer on the data collected from sports matches as well as typical styles of performance indicators and tactical analyses.</p>
        <p>2.1. Data</p>
        <p>While there are a variety of sources of data collected about sports, we will discuss three broad categories: physical data, play-by-play data and optical tracking data.</p>
        <p>During training and matches, athletes often wear a GPS tracker with accelerometer technology (e.g., from Catapult Sports). These systems measure various physical parameters such as distance covered, number of high-speed sprints, and high-intensity accelerations. These parameters are often augmented with questionnaire data [11] to obtain subjective measurements about the difficulty of training such as the rating of perceived exertion (RPE) [12]. Such approaches are used to optimize an athlete’s fitness level and ensure their availability and ability to compete.</p>
        <p>Play-by-play or event stream data tracks actions that occur with the ball. Each such action is annotated with information such as the type of the action, the start and end locations of the action, the result of the action (i.e., successful or not), the time at which the action was performed, the player who performed the action, and the team of which the acting player is a part. Figure 1 illustrates six actions that are part of the game between Brazil and Belgium at the 2018 World Cup as they were recorded in the event stream data format. This data is collected for a variety of sports by vendors such as Stats Perform, who typically employ human annotators to collect the data by watching broadcast video.</p>
        <p>Optical tracking data reports the locations of all the players and the ball multiple times per second (typically between 10 and 25 Hz). This data is collected using a fixed installation of high-resolution cameras in a team’s stadium. Such a setup is expensive and typically only used in top leagues. There is now also extensive work on tracking solutions based on broadcast video [13, 14]. Figure 2 shows a frame of tracking data.</p>
        <p>2.2. Individual Performance Indicators</p>
        <p>Performance indicators for individual players usually fall in one of two categories. The first type focuses on a single action such as a pass or shot. The second type takes a holistic approach by developing a unifying framework that can value a wide range of action types.</p>
        <p>Single action. Single action indicators typically take the form of expected value-based statistics: they measure the expected chance that a typical player would successfully execute the considered action in a specific game context. For example, the aforementioned xG model in soccer assigns a probability to each shot that represents its chance of directly resulting in a goal. These models are learned using standard probabilistic classifiers such as logistic regression or tree ensembles from large historical datasets of shots. Each shot is described by the game context from when it was taken, and how this is represented is the key difference among existing models [10, 15, 16].</p>
        <p>Such indicators exist for a variety of sports including American football (e.g., expected completion percentage for quarterbacks and expected yards after the catch for receivers),<sup>2</sup> basketball (e.g., expected field goal percentage [17]), and ice hockey (expected goals [18]).</p>
        <p><sup>2</sup>https://nextgenstats.nfl.com/glossary</p>
        <p>Figure 3: Lukaku’s dribble (<inline-formula><tex-math>a_i</tex-math></inline-formula>) changes the game state from the pre-action state <inline-formula><tex-math>s_i</tex-math></inline-formula> to the post-action state <inline-formula><tex-math>s_{i+1}</tex-math></inline-formula>.</p>
        <p>All actions. Instead of building bespoke models for each action, these indicators use the same framework to aggregate a player’s contributions over a set of action types. Regardless of sport, almost all approaches exploit the fact that each action <inline-formula><tex-math>a_i</tex-math></inline-formula> changes the game state from <inline-formula><tex-math>s_i</tex-math></inline-formula> to <inline-formula><tex-math>s_{i+1}</tex-math></inline-formula> (as illustrated in Figure 3). These approaches value the contribution of an action <inline-formula><tex-math>a_i</tex-math></inline-formula> as:
<disp-formula><label>(1)</label><tex-math>C(s_i, a_i) = V(s_{i+1}) - V(s_i),</tex-math></disp-formula>
where <inline-formula><tex-math>V(\cdot)</tex-math></inline-formula> is the value of a game state and <inline-formula><tex-math>s_{i+1}</tex-math></inline-formula> is the game state that results from executing action <inline-formula><tex-math>a_i</tex-math></inline-formula> in game state <inline-formula><tex-math>s_i</tex-math></inline-formula>.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Approaches differ on how they value game states, with</title>
      <p>two dominant paradigms emerging: scoring-based and win-based. Scoring-based approaches take a narrower possession-based view. These approaches value a game state by estimating the probability that the team possessing the ball will score. In soccer, this may entail looking at the near-term probability of a goal in the next 10 actions or 10 seconds [2] or the long-term probability of scoring the next goal [19]. Win-based approaches value actions by assessing a team’s chance of winning the match in each game state. That is, these approaches look at the difference in in-game win-probability between two consecutive game states [20, 21, 22, 23]. Such models have been developed for many sports, including basketball [24], American football [25], ice hockey [26, 3] and rugby [27].</p>
      <p>a match or training session, some measures are invasive (e.g., blood lactate or creatine kinase). Similarly, in endurance sports such as distance running and cycling, monitoring athletes’ aerobic fitness levels is important, which is often measured in terms of the maximal oxygen uptake (VO2max) [36]. However, the test to measure this variable is extremely strenuous and disrupts training regimes, so it can only be measured sporadically.</p>
      <p>2.3. Tactical Analyses</p>
    </sec>
    <sec id="sec-3">
      <title>Tactics are short-term behaviors that are used to achieve</title>
      <p>a strategic objective such as winning or scoring a goal.</p>
      <p>At a high level, AI/ML is used for tactical analyses in two ways: to discover patterns and to evaluate the efficacy of a tactic.</p>
      <p>Discovering patterns is a broad task that may range from simply trying to understand where on the field certain players tend to operate and who tends to pass to whom, to more complicated analyses that involve identifying sequences of reoccurring actions. Typically, techniques such as clustering, non-negative matrix factorization, and pattern mining are used to find such reoccurring behaviors [28, 29, 30].</p>
      <p>Evaluating the efficacy of tactics is an equally broad task that can generally be split up into two parts: evaluating the efficacy of (1) a current and (2) a counterfactual tactic. Assessing the efficacy of currently employed tactics is typically done by focusing on a specific tactic (e.g., counterattack, pressing) and relating it to other success indicators (e.g., goals, wins) [31, 32]. In contrast, assessing the efficacy of counterfactual tactics is more challenging as it entails understanding what would happen if a team (or player) employed different tactics than those that were observed. This is extremely interesting and challenging from an AI/ML and evaluation perspective as it involves both (1) accurately modeling the current behavior of teams, and (2) reasoning in a counterfactual way about alternative behaviors. Such approaches have been developed in basketball and soccer to assess counterfactual shot [33, 34] and movement<sup>3</sup> [35] tactics.</p>
      <p>Credit assignment. It is often unclear why an action succeeded or failed. For example, did a pass not reach a teammate because the passer mishit the ball or did their teammate simply make the wrong run? Similarly, for those actions that are observed, we are unsure why they arose. For example, does a player make a lot of tackles in a soccer match because they are covering for a teammate who is constantly out of position? Or is the player a weak defender that is being targeted by the opposing team?</p>
      <p>Noisy features and/or labels. When monitoring the health status of players, teams often partially rely on questionnaires [11] and subjective measures like the rating of perceived exertion [12]. Players respond to such questionnaires in different ways, with some being more honest than others. There is a risk of deception (e.g., players want to play, and may downplay injuries). There are also well-known challenges when working with subjective data. Similarly, play-by-play data is often collected by human annotators, who make mistakes. Moreover, the definitions of events and actions can change over time.</p>
    </sec>
    <sec id="sec-4">
      <title>Small sample sizes. There may only be limited data</title>
      <p>about teams and players. For example, a top flight soccer team plays between 34 and 38 league games in a season and will perform between 1500 and 3000 on-the-ball actions in a game.<sup>5</sup> Even top players do not appear in every game and sit out matches strategically for rest.</p>
      <p>3. Challenges with Evaluation</p>
      <p>The nature of sports data and the tasks typically considered within sports analytics and science pose many challenges from an evaluation and analysis perspective.</p>
      <p>Lack of ground truth. For many variables of interest,
there are simply very few or even no labels, which arises
when analyzing both match and physical data. When
analyzing matches, a team’s specific tactical plan is
unknown to outside observers. One can make educated
guesses on a high level, but often not for fine-grained
decisions. Similarly, when trying to assign ratings to
players’ actions in a soccer match, there is no variable
that directly records this. In fact, in this case, no such
objective rating even exists.</p>
    </sec>
    <sec id="sec-5">
      <title>Physical parameters can also be difficult to collect. For</title>
      <p>example, if one is interested in measuring fatigue<sup>4</sup> during</p>
    </sec>
    <sec id="sec-6">
      <title>3https://grantland.com/features/the-toronto-raptors-sportvu-cameras-nba-analytical-revolution/</title>
      <p><sup>4</sup>Note that there are different types of fatigue that could be monitored such as musculoskeletal or cardiovascular fatigue.</p>
      <p>Non-stationary data. The sample size issues are compounded by the fact that sports is a very non-stationary setting, meaning data that is more than one or two seasons old may not be relevant. On a team level, playing styles tend to vary over time due to changes in playing and management personnel. On a player level, skills evolve over time, often improving until a player reaches their peak, prior to an age-related decline. More generally, tactics evolve and change.</p>
      <p>Counterfactuals. Many evaluation questions in sports involve reasoning about outcomes that were not observed. This is most notable in the case of defense, where defensive tactics are often aimed at preventing dangerous actions from arising, such as wide-open three-point shots in the NBA or one vs. the goalie in soccer. Unfortunately, it is hard to know why certain actions were or were not taken. For example, it is difficult to estimate whether the goalie would have saved the shot if they had been positioned slightly differently. Similarly, evaluating tactics also involves counterfactual reasoning as a coach is often interested in knowing what would have happened if another policy had been followed, such as shooting more or less often from outside the penalty box in soccer.</p>
    </sec>
    <sec id="sec-7">
      <title>5The number depends on what is annotated in the data (e.g.,</title>
      <p>pressure events) and modeling choices such as whether a pass
reception is treated as a separate action.</p>
      <p>Interventions. The data is observational and teams constantly make decisions that affect what is observed. This is particularly true for injury risk assessment and load management, where the staff will alter players’ training regime if they are worried about the risk of injury. Managers also change tactics during the course of the game, depending on the score and the team’s performance.</p>
      <p>4. Evaluating an Indicator</p>
      <p>A novel indicator should capture something about a player’s (or team’s) performance or capabilities. Evaluating a novel indicator’s usefulness is difficult as it is unclear what it should be compared against. This problem is addressed in multiple different ways in the literature.</p>
      <sec id="sec-7-1">
        <title>4.2. The Messi Test</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>When evaluating indicators about player performance,</title>
      <p>one advantage is that there is typically consensus on who are among the very top players. While experts, pundits, and fans may debate the relative merits of Lionel Messi and Cristiano Ronaldo, there is little debate that they fall in the very top set of offensive players. An offensive metric where neither of those players scores well is not likely to convince practitioners. In other words, if a metric blatantly contradicts most conventional wisdom, there is likely a problem with it. This is also called face validity [40]. Of course, some unexpected or more surprising names could appear towards the top of such a ranking, but one would be wary if all such names were surprising.</p>
      <p>Unfortunately, this style of evaluation is most suited to analyzing offensive contributions. In general, there is more consensus on the offensive performances of individual players than on their defensive performances, as good defense is a collective endeavor and more heavily reliant on tactics.</p>
      <p>4.1. Correlation with Existing Success Indicators</p>
    </sec>
    <sec id="sec-9">
      <title>In all sports, a variety of indicators exist that denote whether a player (or team) is considered or perceived to be good. Such indicators can be on either the individual or team level.</title>
      <p>When evaluating individual players, there is a wealth of existing indicators that are commonly reported and used. First, there are indirect indicators such as a player’s market value, salary, playing time, or draft position. Second, there are indicators derived from competition such as goals and assists in soccer (or ice hockey). It is therefore possible to design an evaluation by looking at the correlation between each indicator’s value for every player [20, 26, 3]. Alternatively, it is possible to produce two rank-ordered lists of players (or teams): one based on an existing success indicator and another based on a newly designed indicator. Then the correlation between rankings can be computed.</p>
      <p>Arguably, an evaluation that strives for high correlations with existing indicators misses the point: the goal is to design indicators that provide insights that current ones do not. If a new indicator simply yields the same ranking as looking at goals, then it does not provide any new information. Moreover, some existing success indicators capture information that is not related to performance. For example, salary can be tied to draft position and years of service. Similarly, a soccer player’s market value or transfer fee also encompasses their commercial appeal. Even playing time is not necessarily merit-based. Other work tries to associate performance and/or presence in the game with winning. This is appealing as the ultimate goal is to win a game.<sup>6</sup> For example, indicators can be based on correlating how often certain actions are performed with match outcomes, points scored, or score differentials [37, 38]. An alternative approach is to build a predictive model based on the indicators and see if it can be used to predict the outcomes of future matches [39].</p>
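      <p>The comparison of two rank-ordered lists described in this section can be sketched in a few lines. The following is a minimal illustration, assuming Spearman’s rank correlation as the measure; the player names and indicator values are invented, and the helper functions are hypothetical rather than taken from any of the cited works.</p>
      <preformat>
```python
from math import sqrt
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_correlation(existing: Dict[str, float], novel: Dict[str, float]) -> float:
    """Spearman correlation over the players present in both indicators."""
    players = sorted(set(existing).intersection(novel))
    return pearson(
        average_ranks([existing[p] for p in players]),
        average_ranks([novel[p] for p in players]),
    )


# Hypothetical players and values, invented purely for illustration.
goals = {"A": 20, "B": 15, "C": 9, "D": 4, "E": 1}
new_indicator = {"A": 0.61, "B": 0.70, "C": 0.35, "D": 0.40, "E": 0.05}
rho = rank_correlation(goals, new_indicator)
```
      </preformat>
      <p>For these invented values the two rankings differ only by two swaps of adjacent players, giving a coefficient of 0.8. A value close to 1 would suggest that the new indicator largely reproduces the existing one, which is exactly the concern raised above.</p>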
      <sec id="sec-9-1">
        <title>4.3. Make a Prediction</title>
        <p>While backtesting indicators (and models) is clearly a key
component of development, sports does offer the
possibility for real-world predictions on unseen data. One can
predict, and most importantly publish, the outcomes of
matches or tournaments prior to their start. In fact, there
have been several competitions designed around this
principle [41], and people have collected predictions
online.<sup>7</sup></p>
        <p>This is even possible for player indicators, and is often done in the work on quantifying player performance [2, 3]. Decroos et al. [2] included lists of the top under-21 players in several of the major European soccer leagues for the 2017/2018 season. It is interesting to look back on the list and see that there were both hits and misses. For example, the list had some players who were less heralded then, such as Mason Mount and Mikel Oyarzabal, who are now key players. Similarly, it had several recognized prospects such as Kylian Mbappé, Trent Alexander-Arnold, and Frenkie de Jong who have ascended. Finally, there were misses like Jonjoe Kenny and David Neres. While one has to wait, it does give an immutable forecast that can be evaluated.</p>
        <p><sup>6</sup>This is not always the case: sometimes teams play for draws, rest players for strategic reasons, prioritize getting young players experience or try to lose to improve draft position.</p>
        <p><sup>7</sup>https://twitter.com/TonyElHabr/status/1414619621659971588</p>
        <p>Because they do not allow for immediate results, such evaluations tend to be done infrequently. However, we believe this is something that should be done more often. It avoids the possibility of cherry-picking results and overfitting by forcing one to commit to a result with an unknown outcome. This may also encourage more critical thinking about the utility of the developed indicator. The caveat is that the predictions must be revisited and discussed in the future, which also implies that publication venues would need to be open to such submissions. Beyond the time delay, another drawback is that they involve small sample sizes such as one match day, one tournament, or a short list of players.</p>
        <p>Figure 4: Pearson correlation between player performance indicators for ten pairs of successive seasons in the English Premier League (2009/10 – 2019/20). The diamond shape indicates the mean correlation. The simple “minutes played” indicator is the least reliable, while the Atomic-VAEP<sup>8</sup> indicator is more reliable than its VAEP [2] predecessor and xT [45]. As shots are infrequent and have a variable outcome, omitting them increases an indicator’s reliability. The xT indicator does not value shots. Only players that played at least 900 minutes (the equivalent of ten games) in each of the successive seasons are included.</p>
      </sec>
      <sec id="sec-9-2">
        <title>4.5. Reliability</title>
      </sec>
      <sec id="sec-9-3">
        <title>4.4. Ask an Expert</title>
        <p>Developed indicators and approaches can be validated
by comparing them to an external source provided by
domain experts. This goes beyond the Messi test as it
requires both deeper domain expertise and a more
extensive evaluation such as comparing tactical patterns
discovered by an automated system to those found by a
video analyst. Pappalardo et al. [38] compared a player
ranking system they developed to rankings produced by
scouts. Similarly, Dick et al. [42] asked soccer coaches
to rate how available players were to receive a pass in
game situations and compared this assessment to a novel
indicator they developed.</p>
        <p>Ideally, such an expert-based evaluation considers
aspects beyond model accuracy. Ultimately, an indicator
should provide “value” to the workflow of practitioners.
Hence, it is relevant to measure how much time it saves
an analyst in their workflow, whether an indicator can
provide relevant new insights and whether the expert
can correctly interpret the model’s output. This type of
evaluation checks whether indicators fulfill the needs
of users (i.e., usefulness and usability) and also arises in
human-computer interaction [43].</p>
        <p>However, this type of evaluation can be difficult as not
all researchers have access to domain experts, particularly
when it comes to high-level sports. Moreover, teams want
to maintain a competitive advantage, so one may not be
able to publish such an evaluation.</p>
        <p>Indicators are typically developed to measure a skill or
capability such as shooting ability in basketball or
offensive contributions. While these skills can and do change
over a longer time frame (multiple seasons), they
typically are consistent within a season or even across two
consecutive seasons. Therefore, an indicator should take
on similar values in such a time frame.</p>
        <p>One approach [39, 44] to measure an indicator’s
reliability is to split the data set into two, and then compute
the correlation between the indicators computed on each
half. An example of such an evaluation is shown in
Figure 4. Methodologically, one consideration is how to
partition the available data. Typically, one is concerned
with respecting chronological orderings in temporal data.
However, in this setting, such a division is likely
suboptimal. First, games missed through injury will be clustered
and players likely perform differently right when they
come back. Second, the difficulty of a team’s schedule
is not uniformly spread over a season. Third, if the time
horizon is long enough, there will be aging effects.</p>
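        <p>A split-half reliability check of this kind can be sketched as follows. This is a minimal illustration under several assumptions: the indicator is a simple per-match mean, each player’s matches are split at random rather than chronologically (in line with the concerns above), every player has at least two matches, and the data are synthetic. The function names are hypothetical.</p>
        <preformat>
```python
import random
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def split_half_reliability(per_match: Dict[str, List[float]], seed: int = 0) -> float:
    """Correlate a mean-based indicator computed on two random halves of each player's matches."""
    rng = random.Random(seed)
    first_half, second_half = [], []
    for player in sorted(per_match):
        values = list(per_match[player])  # assumes at least two matches per player
        rng.shuffle(values)               # random rather than chronological split
        cut = len(values) // 2
        first_half.append(sum(values[:cut]) / cut)
        second_half.append(sum(values[cut:]) / (len(values) - cut))
    return pearson(first_half, second_half)


# Synthetic data: a hypothetical latent skill per player plus match-to-match noise.
data_rng = random.Random(42)
skills = {"P%d" % k: 0.1 * k for k in range(1, 21)}
ratings = {p: [s + data_rng.gauss(0.0, 0.3) for _ in range(30)] for p, s in skills.items()}
reliability = split_half_reliability(ratings)
```
        </preformat>
        <p>Repeating the split with different seeds and averaging the resulting correlations would reduce the dependence on one particular partition.</p>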
        <p>Franks et al. [46] propose a metric to capture an
indicator’s stability. It tries to assess how much an indicator’s
value depends on context (e.g., a team’s tactical system,
quality of teammates) and changes in skill (e.g.,
improvement through practice). It does so by looking at the
variance of indicators using a within-season bootstrap
procedure.</p>
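        <p>The general idea of a within-season bootstrap can be sketched as follows. This is not the metric of Franks et al. [46]; it is only an illustrative stand-in that assumes a mean-based indicator, invented per-match values, and a hypothetical summary (<monospace>noise_share</monospace>) that compares the resampling variance of each player’s indicator to the variance between players.</p>
        <preformat>
```python
import random
from statistics import mean, pvariance
from typing import Dict, List


def bootstrap_variance(values: List[float], n_resamples: int = 500, seed: int = 0) -> float:
    """Variance of a mean-based indicator over bootstrap resamples of one player's matches."""
    rng = random.Random(seed)
    replicates = [mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)]
    return pvariance(replicates)


def noise_share(per_match: Dict[str, List[float]]) -> float:
    """Average within-player bootstrap variance divided by the between-player variance."""
    within = mean([bootstrap_variance(v) for v in per_match.values()])
    between = pvariance([mean(v) for v in per_match.values()])
    return within / between


# Hypothetical per-match indicator values for three players.
ratings = {
    "A": [0.2, 0.4, 0.3, 0.5, 0.1, 0.3],
    "B": [0.9, 1.1, 1.0, 0.8, 1.2, 1.0],
    "C": [0.5, 0.7, 0.6, 0.4, 0.8, 0.6],
}
share = noise_share(ratings)
```
        </preformat>
        <p>A share close to zero would indicate that differences between players dominate the sampling noise within a season, whereas a share near or above one would suggest that the indicator struggles to tell players apart.</p>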
      </sec>
    </sec>
    <sec id="sec-10">
      <title>8https://dtai.cs.kuleuven.be/sports/blog/introducing-atomic-spadl-a-new-way-to-represent-event-stream-data/</title>
      <p>Another approach [29] is to look at consecutive seasons and pose the evaluation as a nearest neighbors problem. That is, based on the indicators computed from one season of data for a specific player, find a rank-ordered list of the most similar players in the subsequent (or preceding) season. The robustness of the indicator is then related to the position of the chosen player in the ranking.</p>
      <p>5. Evaluating a Model</p>
      <p>a causal counterfactual (because the considered models are not causal models).</p>
      <p>• Does the model behave as expected in scenarios where we have strong intuitions based on domain knowledge? For example, one can analyze what values the model can predict for shots that are taken from a very tight angle or very far away from the goal. One can then check whether the predictions for the generated game situations are realistic.</p>
    </sec>
    <sec id="sec-11">
      <title>Evaluating the models used to produce the indicator in</title>
      <p>volves two key aspects. First, it is important to ensure Typical aggregated test metrics do not reveal the answers
that the model will behave as expected on unseen data. to these questions. Nevertheless, the answers can be very
This is particularly important for sports since the data can valuable because they provide insights into the model
have errors or noise (e.g., incorrect annotations, sensor and can reveal problems with the model or the data.
failures, errors in tracking data) and rare or unexpected We have used verification to evaluate soccer models
events. Hence, one wants to reason about the model. in two novel ways. First, we show how it is possible to
Second, there are standard evaluation metrics that are debug the training data and pinpoint labeling errors (or
important to use to ensure, e.g., that probability estimates inconsistencies). Second, we identify scenarios where the
are accurate. model produces unexpected and undesired predictions.
These are shortcomings in the model itself. We use
Veritas [51] to analyze two previously mentioned soccer
5.1. Reasoning about Learned Models analytics models: xG and the VAEP holistic action-value
Verification is a powerful alternative to traditional aggre- model.
gated metrics to evaluate and inspect a learned model. First, we analyzed an xG model in order to identify
Verification attempts to reason about how a learned “what are the optimal locations to shoot from outside the
model will behave [47, 48, 49, 50]. Given a desired tar- penalty box?”. We used Veritas to generate 200 examples
get value (i.e., prediction), and possibly some constraints of shots from outside the penalty box that would have
on the values that the features can take on, a verifica- the highest probability of resulting in a goal, which are
tion algorithm either generates one or more instances shown as a heatmap in Figure 5. The cluster in front
that satisfy the constraints, or it proves that no such in- of the goal is expected as it corresponds to the areas
stance exists. This is similar to satisfiability checking. In most advantageous to shoot from. The locations near the
practice, verification allows users to query a model, i.e., corners of the pitch are unexpected. We looked at the
reason about the model’s possible outputs and examine shots from the 5 meter square area touching the corner
what the model has learned from the data. It can be used and counted 11 shots and 8 goals, yielding an extremely
to investigate how a model behaves in certain sub-areas high 72% conversion rate. This reveals an unexpected
of the input space. Examples of verification questions labeling behavior by the human annotators. Given the
are: distance to the goal and the tight angle, one would expect
a much lower conversion rate. A plausible explanation
is that annotators are only labeling actions as a shot in
the rare situations where the action results in a goal or
• Is a model robust to small changes to the inputs?</p>
      <p>For example, does a small change in the time of
the game and the position of the ball significantly
change the probability that a shot will result in
a goal? This relates to adversarial examples (c.f.</p>
      <p>image recognition).
• Related to the previous question, but with a
different interpretation: given a specific example of
interest, can one or more attributes be (slightly)
changed so that the indicator is maximized? This
is often called a counterfactual explanation, e.g.,
if the goalie would have been positioned closer
to the near post, how would that have afected
the estimated probability of the shot resulting in
a goal? We want to emphasize that, this is not
be evaluated in a number of diferent ways such as using
reliability diagrams [52], the Brier score [53],
logarithmic loss, and the multi-class expected calibration error
(ECE) [54]. It is less clear when one of these metrics may
be more appropriate than another. Here, it may be worth
considering if the probabilities will be summed (e.g., for
computing player ratings) or multiplied (e.g., modeling
decision making) [55]. It is important to remember that
these metrics depend on the class distribution, and hence
their values need to be interpreted in this context. This is
important as scoring rates can vary by competition (e.g.,
men’s leagues vs. women’s leagues) [56].</p>
      <sec id="sec-11-1">
        <title>6. Discussion</title>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Evaluating learning systems in the context of sports is an</title>
      <p>a save. Otherwise, the actions are labeled as a pass or a extremely tricky endeavor that largely relies on expertise
cross. gained through experience. On the one hand, the outputs</p>
      <p>Second, we analyzed VAEP [2], a holistic-action model of learned models are often combined in order to
confor soccer. The models underlying this indicator look at a struct novel indicators of performance, and the validity
short sequence of consecutive game actions and predict of these indicators needs to be assessed. Here, we would
the probability of a goal in the next 10 actions. Unlike like to caution against looking at correlations to other
xG models, all possible actions (passes, dribbles, tack- success metrics as we believe that a high correlation to
les, . . . ) are considered, not just shots. For the data in an existing indicator fails the central goal: gaining new
an unseen test set, the model produces well-calibrated insights. We also believe that the reliability and stability
probability estimates in aggregate. However, we looked of indicators is important, and should be more widely
for specific scenarios where the model performs badly studied. Still, what remains the best approach for
evaluand found several instances that are technically possible, ating a specific problem is often not clear, and the field
but very unlikely. More interestingly, Veritas gener- would benefit from a broader discussion of best practices.
ated instances where all the values of all features were On the other hand, it is also necessary to evaluate the
ifxed except for the time in the match, and found that models used to construct the underlying systems and
the probability of scoring varied dramatically according indicators. Here, we believe that evaluating models by
to match time. Figure 6 shows this variability for one reasoning about their behavior is crucial: this changes the
such instance. The probability gradually increases over focus from a purely data-based evaluation perspective
time, which is not necessarily unexpected as scoring rates to one that considers the efect of the data on the model.
tend to slightly increase as a match progresses. However, The ability to have insight into a model’s behavior also
about 27 minutes into the first half the probability of facilitates interactions with domain experts. Critically
scoring dramatically spikes. Clearly, this behavior is un- reflecting on what situations a model will work well in
desirable: we would not expect such large variations. and which situations it may struggle in, helps build trust
This suggests that time should probably be handled dif- and set appropriate expectations.
ferently in the model, e.g., by discretizing it to make it Still, using reasoning is not a magic solution. When
less fine-grained. a reasoner identifies unexpected behaviors, there are at</p>
      <p>Such an evaluation is still challenging. One has to least two possible causes. One cause is errors in the
trainknow what to look for, which typically requires signifi- ing data which are picked up by the model and warp the
cant domain expertise or access to a domain expert. More- decision boundary in unexpected ways (e.g., Figure 5).
over, the process is exploratory: there is a huge space Some errors can be found by inspecting the data, but
of scenarios to consider and the questions have to be given the nature of the data, it can be challenging to know
iteratively refined. where to look. The other cause is peculiarities with the
model itself, the learning algorithm that constructed the
model, or the biases resulting from the model
represen5.2. Standard Metrics tation (e.g., Figure 6). Traditional evaluation metrics are
Many novel indicators involve using a learned model that completely oblivious to these issues. They can only be
makes probabilistic predictions, making calibration the discovered by reasoning about the model. Unfortunately,
standard choice for a classical evaluation. Calibration can it remains dificult to correct a model that has picked up</p>
      <sec id="sec-12-1">
        <title>Acknowledgments</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>This work was supported by iBOF/21/075, Research</title>
      <p>Foundation-Flanders (EOS No. 30992574, 1SB1320N to
LD) and the Flemish Government under the
“Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen”
program.
on an unwanted pattern. For example, the time’s efect
on the probability of scoring can only be resolved via
representing the feature in a diferent way, relearning
the model, and reassessing its performance. Alas, this
is an iterative guess-and-check approach. We believe
that reasoning approaches to evaluation are only in their
infancy and need to be further explored.</p>
      <p>While this paper discussed evaluation in the context
of sports, we do feel that some of the challenges and
insights are relevant for other application domains where
machine learning is applied. For example, evaluation
challenges also arise in prognostics, especially when it is
impossible to directly collect data about a target such as
time until failure. In both domains, we do not want to let
the athlete nor machine be damaged beyond repair. Also,
we perform multiple actions to avoid failure, making it
dificult to attribute value to individual actions or
identify root causes. Another example is how to deal with
subjective ratings provided by users, which often occurs
when monitoring players’ fitness and was also a key issue
in the Netflix challenge. Finally, in terms of approaches
to evaluation, there is also more emphasis within ML in
general on trying to ensure the robustness of learned
models by checking, for example, how susceptible they
are to adversarial attacks.
</p>
      <p>of NHL players using in-game win probabilities, in: MIT Sloan Sports Analytics Conference, 2015.</p>
      <p>[21] B. Burke, WPA explained, 2010. URL: http://archive.advancedfootballanalytics.com/2010/01/win-probability-added-wpa-explained.html.</p>
      <p>[22] P. Robberechts, J. Van Haaren, J. Davis, A bayesian approach to in-game win probability in soccer, in: Proc. of 27th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2021, pp. 3512–3521.</p>
      <p>[23] M. Bouey, NBA win probability added, 2013. URL: https://www.inpredictable.com/2013/06/nba-win-probability-added.html.</p>
      <p>[24] D. Cervone, A. D’Amour, L. Bornn, K. Goldsberry, POINTWISE: Predicting Points and Valuing Decisions in Real Time with NBA Optical Tracking Data, in: MIT Sloan Sports Analytics Conference, 2014.</p>
      <p>[25] D. Romer, Do firms maximize? Evidence from professional football, Journal of Political Economy 114 (2006) 340–365.</p>
      <p>[26] K. Routley, O. Schulte, A Markov game model for valuing player actions in ice hockey, in: Proc. 31st Conference on Uncertainty in Artificial Intelligence, 2015, pp. 782–791.</p>
      <p>[27] T. Kempton, N. Kennedy, A. J. Coutts, The expected value of possession in professional rugby league match-play, Journal of sports sciences 34 (2016) 645–650.</p>
      <p>[28] Q. Wang, H. Zhu, W. Hu, Z. Shen, Y. Yao, Discerning tactical patterns for professional soccer teams: An enhanced topic model with applications, in: Proc. of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 2197–2206.</p>
      <p>[29] T. Decroos, J. Davis, Player vectors: Characterizing soccer players’ playing style from match event streams, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 569–584.</p>
      <p>[30] J. Bekkers, S. S. Dabadghao, Flow motifs in soccer: What can passing behavior tell us?, Journal of Systems Architecture 5 (2019) 299–311.</p>
      <p>[31] J. Fernandez-Navarro, L. Fradua, A. Zubillaga, A. P. McRobert, Evaluating the effectiveness of styles of play in elite soccer, International Journal of Sports Science &amp; Coaching 14 (2019) 514–527.</p>
      <p>[32] S. Merckx, P. Robberechts, Y. Euvrard, J. Davis, Measuring the effectiveness of pressing in soccer, in: Workshop on Machine Learning and Data Mining for Sports Analytics, 2021.</p>
      <p>[33] N. Sandholtz, L. Bornn, Markov decision processes with dynamic transition probabilities: An analysis of shooting strategies in basketball, Annals of App Stat 14 (2020) 1122–1145.</p>
      <p>[34] M. Van Roy, P. Robberechts, W.-C. Yang, L. De Raedt, J. Davis, Leaving goals on the pitch: Evaluating decision making in soccer, in: MIT Sloan Sports Analytics Conference, 2021.</p>
      <p>[35] H. M. Le, Y. Yue, P. Carr, P. Lucey, Coordinated multi-agent imitation learning, in: Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1995–2003.</p>
      <p>[36] M. J. Joyner, Modeling: optimal marathon performance on the basis of physiological factors, Journal of Applied Physiology 70 (1991) 683–687.</p>
      <p>[37] I. McHale, P. Scarf, D. Folker, On the development of a soccer player performance rating system for the english premier league, Interfaces 42 (2012) 339–351.</p>
      <p>[38] L. Pappalardo, P. Cintia, P. Ferragina, E. Massucco, D. Pedreschi, F. Giannotti, Playerank: Data-driven performance evaluation and player ranking in soccer via a machine learning approach, ACM Trans. Intell. Syst. Technol. 10 (2019) 59:1–59:27.</p>
      <p>[39] L. M. Hvattum, Offensive and defensive plus–minus player ratings for soccer, Applied Sciences 10 (2020).</p>
      <p>[40] A. Z. Jacobs, H. Wallach, Measurement and fairness, in: Proc. of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 375–385.</p>
      <p>[41] W. Dubitzky, P. Lopes, J. Davis, D. Berrar, The open international soccer database for machine learning, Machine learning 108 (2019) 9–28.</p>
      <p>[42] U. Dick, D. Link, U. Brefeld, Who can receive the pass? A computational model for quantifying availability in soccer, Data Mining and Knowledge Discovery (2022).</p>
      <p>[43] W. Xu, Toward human-centered AI: A perspective from human-computer interaction, Interactions 26 (2019) 42–46.</p>
      <p>[44] M. Van Roy, P. Robberechts, T. Decroos, J. Davis, Valuing on-the-ball actions in soccer: A critical comparison of xT and VAEP, in: 2020 AAAI Workshop on AI in Team Sports, 2020.</p>
      <p>[45] K. Singh, Introducing expected threat, 2019. URL: https://karun.in/blog/expected-threat.html.</p>
      <p>[46] A. M. Franks, A. D’Amour, D. Cervone, L. Bornn, Meta-analytics: Tools for understanding the statistical properties of sports metrics, Journal of Quantitative Analysis in Sports 12 (2016) 151–165.</p>
      <p>[47] M. Kwiatkowska, G. Norman, D. Parker, PRISM 4.0: Verification of probabilistic real-time systems, in: Proc. 23rd Int. Conf. on Computer Aided Verification, 2011, pp. 585–591.</p>
      <p>[48] S. Russell, D. Dewey, M. Tegmark, Research priorities for robust and beneficial artificial intelligence, AI Magazine 36 (2015) 105–114.</p>
      <p>[49] A. Kantchelian, J. D. Tygar, A. Joseph, Evasion and hardening of tree ensemble classifiers, in: Proc. of the 33rd International Conference on Machine Learning, 2016, pp. 2387–2396.</p>
      <p>[50] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An efficient smt solver for verifying deep neural networks, in: Computer Aided Verification, 2017, pp. 97–117.</p>
      <p>[51] L. Devos, W. Meert, J. Davis, Versatile verification of tree ensembles, in: Proc. of the 38th International Conference on Machine Learning, 2021, pp. 2654–2664.</p>
      <p>[52] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proc. of the 22nd Int. Conf. on Machine learning, 2005, pp. 625–632.</p>
      <p>[53] G. W. Brier, Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3.</p>
      <p>[54] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: Proc. of the 34th Int. Conf. on Machine Learning, 2017, pp. 1321–1330.</p>
      <p>[55] T. Decroos, J. Davis, Interpretable prediction of goals in soccer, in: AAAI 2020 Workshop on AI in Team Sports, 2020.</p>
      <p>[56] L. Pappalardo, A. Rossi, M. Natilli, P. Cintia, Explaining the difference between men’s and women’s football, PLoS ONE 16 (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>