Evaluating Sports Analytics Models: Challenges,
Approaches, and Lessons Learned
Jesse Davis1 , Lotte Bransen1,2 , Laurens Devos1 , Wannes Meert1 , Pieter Robberechts1 , Jan Van
Haaren1,3 and Maaike Van Roy1
1
  Department of Computer Science, Leuven.AI, KU Leuven, Leuven, Belgium
2
  SciSports, The Netherlands
3
  Club Brugge, Belgium


                                             Abstract
                                             There has been an explosion of data collected about sports. Because such data is extremely rich and complicated, machine
                                             learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models
                                             and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in
                                             turn used to inform decision-making at professional clubs. Unfortunately, how to evaluate the use of machine learning in
                                             the context of sports remains extremely challenging. On the one hand, it is necessary to evaluate the developed indicators
                                             themselves, where one is confronted by a lack of labels and small sample sizes. On the other hand, it is necessary to evaluate
                                             the models themselves, which is complicated by the noisy and non-stationary nature of sports data. In this paper, we highlight
                                             the inherent evaluation challenges in sports and discuss a variety of approaches for evaluating both indicators and models. In
                                             particular, we highlight how reasoning techniques, such as verification can be used to aid in the evaluation of learned models.

                                             Keywords
                                             sports analytics, challenges with evaluation, indicator evaluation, model evaluation, model verification, reliability


1. Introduction                                                                                                         At a high level, ML plays a role in team sports in three
                                                                                                                      areas:
Sports is becoming an increasingly data-driven field as
there are now large amounts of data about both the phys-                                                              Player recruitment. Ultimately, recruitment involves
ical states of athletes such as heart rate, GPS, and iner-                                                            (1) assessing a player’s skills and capabilities on a techni-
tial measurement units (e.g., Catapult Sports) as well as                                                             cal, tactical, physical and mental level and how they will
technical performances in matches such as play-by-play                                                                evolve, (2) projecting how the player will fit within the
(e.g., Stats Perform, StatsBomb) or optical tracking data                                                             team, and (3) forecasting how their financial valuation
(e.g., TRACAB, Second Spectrum, SkillCorner). The vol-                                                                will develop. (c.f., [1, 2, 3, 4])
ume, complexity and richness of these data sources have
                                                                                                                      Match preparation. Preparing for a match requires
made machine learning (ML) an increasingly important
                                                                                                                      performing an extensive analysis of the opposing team
analysis tool. Consequently, ML is being used to inform
                                                                                                                      to understand their tendencies and tactics. This is can be
decision-making in professional sports. On the one hand,
                                                                                                                      viewed as a SWOT analysis, which particularly focuses
it is used to extract actionable insights from the large
                                                                                                                      on the opportunities and threats. How can we punish the
volumes of data related to player performance, tactical
                                                                                                                      opponent? How can the opponent punish us? These find-
approaches, and the physical status of players. On the
                                                                                                                      ings are used by the coaching staff to prepare a game plan.
other hand, it is used to partially automate tasks such as
                                                                                                                      Typically, such reports are prepared by analysts who
video analysis that are typically done manually.
                                                                                                                      spent many hours watching videos of upcoming oppo-
EBeM’22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022,                                                     nents. The analysts must annotate footage and recognize
Vienna, Austria                                                                                                       reoccurring patterns, which is a very time-consuming
$ jesse.davis@kuleuven.be (J. Davis); lotte.bransen@kuleuven.be                                                       task. Learned models can automatically identify patterns
(L. Bransen); laurens.devos@kuleuven.be (L. Devos);
wannes.meert@kuleuven.be (W. Meert);
                                                                                                                      that are missed or not apparent to humans (e.g., subtle
pieter.robberechts@kuleuven.be (P. Robberechts);                                                                      patterns in big data) [5], automate tasks (e.g., tagging of
jan.vanhaaren@kuleuven.be (J. Van Haaren);                                                                            situations) [6, 7] that are done by human analysts, and
maaike.vanroy@kuleuven.be (M. Van Roy)                                                                                give insights into players’ skills.
 0000-0002-3748-9263 (J. Davis); 0000-0002-0612-7999
(L. Bransen); 0000-0002-1549-749X (L. Devos); 0000-0001-9560-3872                                                     Management of player’s health and fitness. Build-
(W. Meert); 0000-0002-3734-0047 (P. Robberechts);                                                                     ing up and maintaining a player’s fitness level is crucial
0000-0001-7737-5490 (J. Van Haaren); 0000-0001-8959-3575 (M. Van
Roy)
                                                                                                                      for achieving good performances [8, 9]. However, train-
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative   ing and matches place athletes’ bodies under tremendous
                                       Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)                                        stress. It is crucial to monitor fitness, have a sense of
how much an athlete can do or, most importantly, when                 cuss some standard evaluation metrics, we will focus
they need to rest and recover. Moreover, managing and                 on a more speculative use of reasoning techniques for
preventing injuries is crucial for a team’s stability and             model evaluation. This paper focuses on the context of
continuity which is linked to success.                                professional soccer, where we have substantial experi-
                                                                      ence. However, we believe the lessons and insights are
   One of the most common uses of ML for addressing the               applicable to other team sports, or other domains than
aforementioned tasks is developing novel indicators for               sports.
quantifying performances. Typically, machine-learned
models are trained on large historical databases of past
matches. Afterwards, the indicator is derived from the                2. Common Sports Data and
model as the indicator cannot be used directly as a target
for training because it is not in the data. One prominent
                                                                         Analytics Tasks
example of such an indicator is expected goals (xG) [10],             This section serves as a short, high-level primer on the
which is used in soccer and ice hockey to quantify the                data collected from sports matches as well as typical
quality of the scoring opportunities that a team or player            styles of performance indicators and tactical analyses.
created. The underlying model is a binary classifier that
predicts the outcome of a shot based on features such as
the distance and angle to the goal, the assist type and the           2.1. Data
passage of play.1 It is typically a more consistent measure           While there are a variety of sources of data collected
of performance than actual goals, which are extremely                 about sports, we will discuss three broad categories: phys-
important in these sports but also very rare. Even shots              ical data, play-by-play data and optical tracking data.
are relatively infrequent, and their conversion is subject               During training and matches, athletes often wear a
to variance. The idea of xG is to separate the ability to             GPS tracker with accelerometer technology (e.g., from
get into good scoring positions from the inherent ran-                Catapult Sports). These systems measure various physi-
domness (e.g., deflections) of converting them into goals.            cal parameters such as distance covered, number of high-
   Typically, an indicator should satisfy several properties.         speed sprints, and high-intensity accelerations. These
First, it should provide insights that are not currently              parameters are often augmented with questionnaire
available. For example, xG should tell you something                  data [11] to obtain subjective measurements about the
beyond looking at goals scored. Second, the indicator                 difficulty of training such as the rating of perceived ex-
should be based on domain knowledge and concepts from                 ertion (RPE) [12]. Such approaches are used to optimize
sports such that it is intuitive and easy for non ML experts          an athlete’s fitness level and ensure their availability and
to understand. Finally, the domain experts need to trust              ability to compete.
the indicator. This often boils down to being able to                    Play-by-play or event stream data tracks actions that
contextualize when the indicator is useful and ensuring               occur with the ball. Each such action is annotated with
some level of robustness in its value (i.e., it should not            information such as the type of the action, the start and
wildly fluctuate).
   These desiderata illustrate that a key challenge in de-
veloping indicators is in how to evaluate them: none of
the desiderata naturally align with the standard perfor-
mance metrics used to evaluate learned models. This
does not imply that standard evaluation metrics are not
important. In particular, ensuring that probability esti-
mates are well-calibrated is crucial in many sports an-
alytics tasks. It is simply that one must both evaluate
the indicator itself and the models used to compute the
indicator’s value. The goal of this paper is three-fold.
First, we will highlight some of the challenges that arise
when trying to evaluate work in the context of sports
data. Second, we will discuss the various ways that in-
dicator evaluation has been approached. Third, we will
overview how learned models that the indicators rely
upon have been evaluated. While we will briefly dis-                  Figure 1: The sequence of actions leading up to Belgium’s
                                                                      second goal during the 2018 World Cup quarter-final. Each
     1
       For an interactive discussion of xG, see:             https:   on-the-ball action is annotated with a couple of attributes, as
//dtai.cs.kuleuven.be/sports/blog/illustrating-the-interplay-         illustrated for Lukaku’s dribble. (Data source: StatsBomb)
between-features-and-models-in-xg
                                                                  is the key difference among existing models [10, 15, 16].
                                                                     Such indicators exist for a variety of sports including
                                                                  American football (e.g., expected completion percentage
                                                                  for quarterbacks and expected yards after the catch
                                                                  for receivers),2 basketball (e.g., expected field goal
                                                                  percentage [17]), and ice hockey (expected goals [18]).

                                                                     All actions. Instead of building bespoke models for
                                                                  each action, these indicators use the same framework to
                                                                  aggregate a player’s contributions over a set of action
Figure 2: Illustration of a tracking data frame for the first     types. Regardless of sport, almost all approaches exploit
goal of Liverpool against Bournemouth on Dec 7, 2019. The         the fact that each action 𝑎𝑖 changes the game state from
black lines represent each player’s and the ball’s trajectories   𝑠𝑖 to 𝑠𝑖+1 (as illustrated in Figure 3). These approaches
during the previous 1.5 seconds. (Data source: Last Row)          value the contribution of an action 𝑎𝑖 as:

                                                                                   𝐶(𝑠𝑖 , 𝑎𝑖 ) = 𝑉 (𝑠𝑖+1 ) − 𝑉 (𝑠𝑖 ),      (1)
end locations of the action, the result of the action (i.e.,      where 𝑉 (.) is the value of a game state and 𝑠𝑖+1 is the
successful or not), the time at which the action was per-         game state that results from executing action 𝑎𝑖 in game
formed, the player who performed the action, and the              state 𝑠𝑖 .
team of which the acting player is a part of. Figure 1 illus-
trates six actions that are part of the game between Brazil
and Belgium at the 2018 World Cup as they were recorded
in the event stream data format. This data is collected
for a variety of sports by vendors such as Stats Perform
who typically employ human annotators to collect the
data by watching broadcast video.
   Optical tracking data reports the locations of all the
players and the ball multiple times per second (typically
between 10 and 25 Hz). This data is collected using a fixed
installation in a team’s stadium using high-resolution
cameras. Such a setup is expensive and typically only
used in top leagues. There is now also extensive work             Figure 3: Lukaku’s dribble (𝑎𝑖 ) changes the game state from
on tracking solutions based on broadcast video [13, 14].          the pre-action state 𝑠𝑖 to the post-action state 𝑠𝑖+1 .
Figure 2 shows a frame of tracking data.
                                                                Approaches differ on how they value game states, with
2.2. Individual Performance Indicators                       two dominant paradigms emerging: scoring-based and
                                                             win-based. Scoring-based approaches take a narrower
Performance indicators for individual players usually
                                                             possession-based view. These approaches value a game
fall in one of two categories. The first type focuses
                                                             state by estimating the probability that the team possess-
on a single action such as a pass or shot. The second
                                                             ing the ball will score. In soccer, this may entail looking
type takes a holistic approach by developing a unifying
                                                             at the near-term probability of a goal in the next 10 ac-
framework that can value a wide range of action types.
                                                             tions or 10 seconds [2] or the long-term probability of
                                                             scoring the next goal [19]. Win-based approaches look at
   Single action. Single action indicators typically take
                                                             valuing actions by assessing a team’s chance of winning
the form of expected value-based statistics: they measure
                                                             the match in each game state. That is, these approaches
the expected chance that a typical player would success-
                                                             look at the difference in in-game win-probability between
fully execute the considered action in a specific game
                                                             two consecutive game states [20, 21, 22, 23]. Such models
context. For example, the aforementioned xG model in
                                                             have been developed for many sports, including basket-
soccer assigns a probability to each shot that represents
                                                             ball [24], American football [25], ice hockey [26, 3] and
its chance of directly resulting in a goal. These models
                                                             rugby [27].
are learned using standard probabilistic classifiers such as
logistic regression or tree ensembles from large historical
datasets of shots. Each shot is described by the game con-
text from when it was taken, and how this is represented
                                                                      2
                                                                          https://nextgenstats.nfl.com/glossary
2.3. Tactical Analyses                                         a match or training session, some measures are invasive
                                                               (e.g., blood lactate or creatine kinase). Similarly, in en-
Tactics are short-term behaviors that are used to achieve
                                                               durance sports such as distance running and cycling,
a strategic objective such as winning or scoring a goal.
                                                               monitoring athletes’ aerobic fitness levels is important,
At a high level, AI/ML is used for tactical analyses in two
                                                               which is often measured in terms of the maximal oxy-
ways: to discover patterns and to evaluate the efficacy of
                                                               gen uptake (VO2max ) [36]. However, the test to measure
a tactic.
                                                               this variable is extremely strenuous and disrupts training
   Discovering patterns is a broad task that may range
                                                               regimes, so it can only be measured sporadically.
from simply trying to understand where on the field cer-
tain players tend to operate and who tends to pass to Credit assignment. It is often unclear why an action
whom, to more complicated analyses that involve identi- succeeded or failed. For example, did a pass not reach a
fying sequences of reoccurring actions. Typically, tech- teammate because the passer mishit the ball or did their
niques such as clustering, non-negative matrix factoriza- teammate simply make the wrong run? Similarly, for
tion, and pattern mining are used to find such reoccurring those actions that are observed, we are unsure why they
behaviors [28, 29, 30].                                        arose. For example, does a player make a lot of tackles in
   Evaluating the efficacy of tactics is an equally broad a soccer match because they are covering for a teammate
task that can generally be split up into two parts: evalu- who is constantly out of position? Or is the player a weak
ating the efficacy of (1) a current and (2) a counterfactual defender that is being targeted by the opposing team?
tactic. Assessing the efficacy of currently employed tac-
tics is typically done by focusing on a specific tactic (e.g., Noisy features and/or labels. When monitoring the
counterattack, pressing) and relating it to other success health status of players, teams often partially rely on
indicators (e.g., goals, wins) [31, 32]. In contrast, assess- questionnaires [11] and subjective measures like the rat-
ing the efficacy of counterfactual tactics is more challeng- ing of perceived exertion [12]. Players respond to such
ing as it entails understanding what would happen if a questionnaires in different ways, with some being more
team (or player) employed different tactics than those honest than others. There is a risk for deception (e.g.,
that were observed. This is extremely interesting and players want to play, and may downplay injuries). There
challenging from an AI/ML and evaluation perspective are also well-known challenges when working with sub-
as it involves both (1) accurately modeling the current jective data. Similarly, play-by-play data is often collected
behavior of teams, and (2) reasoning in a counterfactual by human annotators, who make mistakes. Moreover,
way about alternative behaviors. Such approaches have the definitions of events and actions can change over
been developed in basketball and soccer to assess coun- time.
terfactual shot [33, 34] and movement3 [35] tactics.
                                                               Small sample sizes. There may only be limited data
                                                               about teams and players. For example, a top flight soccer
3. Challenges with Evaluation                                  team plays between 34 and 38 league games in a season
                                                               and will perform between 1500 and 3000 on-the-ball ac-
The nature of sports data and the tasks typically con- tions in a game.5 Even top players do not appear every
sidered within sports analytics and science pose many game and sit out matches strategically for rest.
challenges from an evaluation and analysis perspective.
                                                               Non-stationary data. The sample size issues are com-
Lack of ground truth. For many variables of interest, pounded by the fact that sports is a very non-stationary
there are simply very few or even no labels, which arises setting, meaning data that is more than one or two sea-
when analyzing both match and physical data. When sons old may not be relevant. On a team level, playing
analyzing matches, a team’s specific tactical plan is un- styles tend to vary over time due to changes in playing
known to outside observers. One can make educated and management personnel. On a player level, skills
guesses on a high level, but often not for fine-grained evolve over time, often improving until a player reaches
decisions. Similarly, when trying to assign ratings to their peak, prior to an age-related decline. More gener-
players’ actions in a soccer match, there is no variable ally, tactics evolve and change.
that directly records this. In fact, in this case, no such
                                                               Counterfactuals. Many evaluation questions in sports
objective rating even exists.
                                                               involve reasoning about outcomes that were not ob-
Physical parameters can also be difficult to collect. For served. This is most notable in the case of defense, where
example, if one is interested in measuring fatigue4 during defensive tactics are often aimed at preventing dangerous
   3
     https://grantland.com/features/the-toronto-raptors-sportvu-
                                                                        5
cameras-nba-analytical-revolution/                                        The number depends on what is annotated in the data (e.g.,
   4
     Note that there are different types of fatigue that could be   pressure events) and modeling choices such as whether a pass re-
monitored such as musculoskeletal or cardiovascular fatigue.        ceival is treated as a separate action.
actions from arising such as wide-open three-point shots  mance. For example, salary can be tied to draft position
in the NBA or one vs. the goalie in soccer. Unfortunately,and years of service. Similarly, a soccer player’s market
it is hard to know why certain actions were or were not   value or transfer fee also encompasses their commercial
taken. For example, it is difficult to estimate whether   appeal. Even playing time is not necessarily merit-based.
the goalie would have saved the shot if they had been        Other work tries to associate performance and/or pres-
positioned slightly differently. Similarly, evaluating tac-
                                                          ence in the game with winning. This is appealing as the
tics also involves counterfactual reasoning as a coach is ultimate goal is to win a game.6 For example, indicators
often interested in knowing what would have happened      can be based on correlating how often certain actions
if another policy had been followed, such as shooting     are performed with match outcomes, points scored, or
more or less often from outside the penalty box in soccer.score differentials [37, 38]. An alternative approach is
                                                          to build a predictive model based on the indicators and
Interventions. The data is observational and teams con- see if it can be used to predict the outcomes of future
stantly make decisions that affect what is observed. This matches [39].
is particularly true for injury risk assessment and load
management, where the staff will alter players’ training
regime if they are worried about the risk of injury. Man- 4.2. The Messi Test
agers also change tactics during the course of the game, When evaluating indicators about player performance,
depending on the score and the team’s performance.        one advantage is that there is typically consensus on
                                                              who are among the very top players. While experts,
4. Evaluating an Indicator                                    pundits, and fans may debate the relative merits of Lionel
                                                              Messi and Cristiano Ronaldo, there is little debate that
A novel indicator should capture something about a they fall in the very top set of offensive players. An
player’s (or team’s) performance or capabilities. Eval- offensive metric where neither of those players scores
uating a novel indicator’s usefulness is difficult as it is well, is not likely to convince practitioners. In other
unclear what it should be compared against. This prob- words, if a metric blatantly contradicts most conventional
lem is addressed in multiple different ways in the litera- wisdom, there is likely a problem with it. This is also
ture.                                                         called face validity [40]. Of course, some unexpected or
                                                              more surprising names could appear towards the top of
                                                              such a ranking, but one would be wary if all such names
4.1. Correlation with Existing Success                        were surprising.
       Indicators                                                Unfortunately, this style of evaluation is most suited
In all sports, a variety of indicators exist that denote to analyzing offensive contributions. In general, their
whether a player (or team) is considered or perceived to is more consensus on the offensive performances of in-
be good. Such indicators can be on either the individual dividual players than their defensive performances, as
or team level.                                                good defense is a collective endeavor and more heavily
   When evaluating individual players, there are a wealth reliant on tactics.
of existing indicators that are commonly reported and
used. First, there are indirect indicators such as a player’s 4.3. Make a Prediction
market value, salary, playing time, or draft position.
Second, there are indicators derived from competition While backtesting indicators (and models) is clearly a key
such as goals and assists in soccer (or ice hockey). It is component of development, sports does offer the possi-
therefore possible to design an evaluation by looking at bility for real-world predictions on unseen data. One can
the correlation between each indicator’s value for every predict, and most importantly publish, the outcomes of
player [20, 26, 3]. Alternatively, it is possible to produce matches or tournaments prior to their start. In fact, there
two rank-ordered lists of players (or teams): one based have been several competitions designed around this
on an existing success indicator and another based on a principle     7
                                                                        [41] or people who have collected predictions
newly designed indicator. Then the correlation between online.
rankings can be computed.                                        This is even possible for player indicators, and is
   Arguably, an evaluation that strives for high correla-     often  done in the work on quantifying player perfor-
tions with existing indicators misses the point: the goal mance [2, 3]. Decroos et al. [2] included lists of the top
is to design indicators that provide insights that current
ones do not. If a new indicator simply yields the same         6
                                                                 This is not always the case: Sometimes teams play for draws,
ranking as looking at goals, then it does not provide any rest players for strategic reasons, prioritize getting young players
new information. Moreover, some existing success indi- experience or try to lose to improve draft position.
cators capture information that is not related to perfor-      7
                                                                 https://twitter.com/TonyElHabr/status/1414619621659971588
under-21 players in several of the major European soc-
cer leagues for the 2017/2018 season. It is interesting
to look back on the list, and see that there were both
hits and misses. For example, the list had some players
who were less heralded then such as Mason Mount and
Mikel Oyarzabal, who are now key players. Similarly, it
had several recognized prospects such as Kylian Mbappé,
Trent Alexander-Arnold, and Frenkie de Jong who have
ascended. Finally, there were misses like Jonjoe Kenny
and David Neres. While one has to wait, it does give an
immutable forecast that can be evaluated.
   Because they do not allow for immediate results, such        Figure 4: Pearson correlation between player performance
evaluations tend to be done infrequently. However, we           indicators for ten pairs of successive seasons in the English
believe this is something that should be done more of-          Premier League (2009/10 – 2019/20). The diamond shape in-
ten. It avoids the possibility of cherry-picking results and    dicates the mean correlation. The simple “minutes played”
overfitting by forcing one to commit to a result with an        indicator is the least reliable, while the Atomic-VAEP 8 indica-
unknown outcome. This may also encourage more criti-            tor is more reliable than its VAEP [2] predecessor and xT [45].
cal thinking about the utility of the developed indicator.      As shots are infrequent and have a variable outcome, omitting
                                                                them increases an indicator’s reliability. The xT indicator does
The caveat is that the predictions must be revisited and
                                                                not value shots. Only players that played at least 900 minutes
discussed in the future, which also implies that publica-
                                                                (the equivalent of ten games) in each of the successive seasons
tion venues would be open to such submissions. Beyond           are included.
the time delay, another drawback is that they involve
sample sizes such as one match day, one tournament, or
a short list of players.
                                                                4.5. Reliability
4.4. Ask an Expert                                           Indicators are typically developed to measure a skill or
                                                             capability such as shooting ability in basketball or offen-
Developed indicators and approaches can be validated sive contributions. While these skills can and do change
by comparing them to an external source provided by over a longer timeframe (multiple seasons), they typi-
domain experts. This goes beyond the Messi test as it cally are consistent within a season or even across two
requires both deeper domain expertise and a more ex- consecutive seasons. Therefore, an indicator should take
tensive evaluation such as comparing tactical patterns on similar values in such a time frame.
discovered by an automated system to those found by a          One approach [39, 44] to measure an indicator’s relia-
video analyst. Pappalardo et al. [38] compared a player bility is to split the data set into two, and then compute
ranking system they developed to rankings produced by the correlation between the indicators computed on each
scouts. Similarly, Dick et al. [42] asked soccer coaches dataset. An example of such an evaluation is shown in
to rate how available players were to receive a pass in Figure 4. Methodologically, one consideration is how to
game situations and compared this assessment to a novel partition the available data. Typically, one is concerned
indicator they developed.                                    with respecting chronological orderings in temporal data.
   Ideally, such an expert-based evaluation considers as- However, in this setting, such a division is likely sub-
pects beyond model accuracy. Ultimately, an indicator optimal. First, games missed by injury will be clustered
should provide “value” to the workflow of practitioners. and players likely perform differently right when they
Hence, it is relevant to measure how much time it saves come back. Second, the difficulty of a team’s schedule
an analyst in his workflow, whether an indicator can is not uniformly spread over a season. Third, if the time
provide relevant new insights and whether the expert horizon is long enough, there will be aging effects.
can correctly interpret the model’s output. This type of       Franks et al. [46] propose a metric to capture an indica-
evaluation checks whether indicators fulfill the needs tor’s stability. It tries to assess how much an indicator’s
of users (i.e., usefulness and usability) and also arises in value depends on context (e.g., a team’s tactical system,
human-computer interaction [43].                             quality of teammates) and changes in skill (e.g., improve-
   However, this type of evaluation can be difficult as not ment through practice). It does so by looking at the
all researchers have access to domain experts, particularly variance of indicators using a within-season bootstrap
when it comes to high-level sports. Moreover, teams want procedure.
to maintain a competitive advantage, so one may not be
able to publish such an evaluation.                              8
                                                                   https://dtai.cs.kuleuven.be/sports/blog/introducing-atomic-
                                                                spadl-a-new-way-to-represent-event-stream-data/
   Another approach [29] is to look at consecutive sea-                a causal counterfactual (because the considered
sons and pose the evaluation as a nearest neighbors prob-              models are not causal models).
lem. That is, based on the indicators computed from one
season of data for a specific player, find a rank-ordered            • Does the model behave as expected in scenarios
list of the most similar players in the subsequent (or pre-            where we have strong intuitions based on domain
ceding) season. The robustness of the indicator is then                knowledge? For example, one can analyze what
related to the position of the chosen player in the ranking.           values the model can predict for shots that are
                                                                       taken from a very tight angle or very far away
                                                                       from the goal. One can then check whether the
5. Evaluating a Model                                                  predictions for the generated game situations are
                                                                       realistic.
Evaluating the models used to produce the indicator in-
volves two key aspects. First, it is important to ensure        Typical aggregated test metrics do not reveal the answers
that the model will behave as expected on unseen data.          to these questions. Nevertheless, the answers can be very
This is particularly important for sports since the data can    valuable because they provide insights into the model
have errors or noise (e.g., incorrect annotations, sensor       and can reveal problems with the model or the data.
failures, errors in tracking data) and rare or unexpected          We have used verification to evaluate soccer models
events. Hence, one wants to reason about the model.             in two novel ways. First, we show how it is possible to
Second, there are standard evaluation metrics that are          debug the training data and pinpoint labeling errors (or
important to use to ensure, e.g., that probability estimates    inconsistencies). Second, we identify scenarios where the
are accurate.                                                   model produces unexpected and undesired predictions.
                                                                These are shortcomings in the model itself. We use Ver-
                                                                itas [51] to analyze two previously mentioned soccer
5.1. Reasoning about Learned Models                             analytics models: xG and the VAEP holistic action-value
Verification is a powerful alternative to traditional aggre-    model.
gated metrics to evaluate and inspect a learned model.             First, we analyzed an xG model in order to identify
Verification attempts to reason about how a learned             “what are the optimal locations to shoot from outside the
model will behave [47, 48, 49, 50]. Given a desired tar-        penalty box?”. We used Veritas to generate 200 examples
get value (i.e., prediction), and possibly some constraints     of shots from outside the penalty box that would have
on the values that the features can take on, a verifica-        the highest probability of resulting in a goal, which are
tion algorithm either generates one or more instances           shown as a heatmap in Figure 5. The cluster in front
that satisfy the constraints, or it proves that no such in-     of the goal is expected as it corresponds to the areas
stance exists. This is similar to satisfiability checking. In   most advantageous to shoot from. The locations near the
practice, verification allows users to query a model, i.e.,     corners of the pitch are unexpected. We looked at the
reason about the model’s possible outputs and examine           shots from the 5 meter square area touching the corner
what the model has learned from the data. It can be used        and counted 11 shots and 8 goals, yielding an extremely
to investigate how a model behaves in certain sub-areas         high 72% conversion rate. This reveals an unexpected
of the input space. Examples of verification questions          labeling behavior by the human annotators. Given the
are:                                                            distance to the goal and the tight angle, one would expect
                                                                a much lower conversion rate. A plausible explanation
     • Is a model robust to small changes to the inputs?        is that annotators are only labeling actions as a shot in
       For example, does a small change in the time of          the rare situations where the action results in a goal or
       the game and the position of the ball significantly
       change the probability that a shot will result in
       a goal? This relates to adversarial examples (c.f.
       image recognition).
     • Related to the previous question, but with a dif-
       ferent interpretation: given a specific example of
       interest, can one or more attributes be (slightly)
       changed so that the indicator is maximized? This
       is often called a counterfactual explanation, e.g.,
       if the goalie would have been positioned closer
       to the near post, how would that have affected Figure 5: A heatmap showing where Veritas generates in-
       the estimated probability of the shot resulting in stances of shots from outside the penalty box with the highest
       a goal? We want to emphasize that, this is not xG values.
                                                                   be evaluated in a number of different ways such as using
                                                                   reliability diagrams [52], the Brier score [53], logarith-
                                                                   mic loss, and the multi-class expected calibration error
                                                                   (ECE) [54]. It is less clear when one of these metrics may
                                                                   be more appropriate than another. Here, it may be worth
                                                                   considering if the probabilities will be summed (e.g., for
                                                                   computing player ratings) or multiplied (e.g., modeling
                                                                   decision making) [55]. It is important to remember that
                                                                   these metrics depend on the class distribution, and hence
                                                                   their values need to be interpreted in this context. This is
                                                                   important as scoring rates can vary by competition (e.g.,
Figure 6: For specific action sequences, the time remaining in     men’s leagues vs. women’s leagues) [56].
the game has a large variable effect on the probability of scor-
ing in the VAEP action model. This variability is unexpected
and reveals a robustness issue with the model.                     6. Discussion
                                                                   Evaluating learning systems in the context of sports is an
a save. Otherwise, the actions are labeled as a pass or a          extremely tricky endeavor that largely relies on expertise
cross.                                                             gained through experience. On the one hand, the outputs
   Second, we analyzed VAEP [2], a holistic-action model           of learned models are often combined in order to con-
for soccer. The models underlying this indicator look at a         struct novel indicators of performance, and the validity
short sequence of consecutive game actions and predict             of these indicators needs to be assessed. Here, we would
the probability of a goal in the next 10 actions. Unlike           like to caution against looking at correlations to other
xG models, all possible actions (passes, dribbles, tack-           success metrics as we believe that a high correlation to
les, . . . ) are considered, not just shots. For the data in       an existing indicator fails the central goal: gaining new
an unseen test set, the model produces well-calibrated             insights. We also believe that the reliability and stability
probability estimates in aggregate. However, we looked             of indicators is important, and should be more widely
for specific scenarios where the model performs badly              studied. Still, what remains the best approach for evalu-
and found several instances that are technically possible,         ating a specific problem is often not clear, and the field
but very unlikely. More interestingly, Veritas gener-              would benefit from a broader discussion of best practices.
ated instances where all the values of all features were              On the other hand, it is also necessary to evaluate the
fixed except for the time in the match, and found that             models used to construct the underlying systems and
the probability of scoring varied dramatically according           indicators. Here, we believe that evaluating models by
to match time. Figure 6 shows this variability for one             reasoning about their behavior is crucial: this changes the
such instance. The probability gradually increases over            focus from a purely data-based evaluation perspective
time, which is not necessarily unexpected as scoring rates         to one that considers the effect of the data on the model.
tend to slightly increase as a match progresses. However,          The ability to have insight into a model’s behavior also
about 27 minutes into the first half the probability of            facilitates interactions with domain experts. Critically
scoring dramatically spikes. Clearly, this behavior is un-         reflecting on what situations a model will work well in
desirable: we would not expect such large variations.              and which situations it may struggle in, helps build trust
This suggests that time should probably be handled dif-            and set appropriate expectations.
ferently in the model, e.g., by discretizing it to make it            Still, using reasoning is not a magic solution. When
less fine-grained.                                                 a reasoner identifies unexpected behaviors, there are at
   Such an evaluation is still challenging. One has to             least two possible causes. One cause is errors in the train-
know what to look for, which typically requires signifi-           ing data which are picked up by the model and warp the
cant domain expertise or access to a domain expert. More-          decision boundary in unexpected ways (e.g., Figure 5).
over, the process is exploratory: there is a huge space            Some errors can be found by inspecting the data, but
of scenarios to consider and the questions have to be              given the nature of the data, it can be challenging to know
iteratively refined.                                               where to look. The other cause is peculiarities with the
                                                                   model itself, the learning algorithm that constructed the
                                                                   model, or the biases resulting from the model represen-
5.2. Standard Metrics                                              tation (e.g., Figure 6). Traditional evaluation metrics are
Many novel indicators involve using a learned model that           completely oblivious to these issues. They can only be
makes probabilistic predictions, making calibration the            discovered by reasoning about the model. Unfortunately,
standard choice for a classical evaluation. Calibration can        it remains difficult to correct a model that has picked up
on an unwanted pattern. For example, the time’s effect       [5] L. Shaw, S. Gopaladesikan, Routine inspection: A
on the probability of scoring can only be resolved via           playbook for corner kicks, in: MIT Sloan Sports
representing the feature in a different way, relearning          Analytics Conference, 2021.
the model, and reassessing its performance. Alas, this       [6] P. Bauer, G. Anzer, Data-driven detection of coun-
is an iterative guess-and-check approach. We believe             terpressing in professional football, Data Mining
that reasoning approaches to evaluation are only in their        and Knowledge Discovery 35 (2021) 2009—-2049.
infancy and need to be further explored.                     [7] A. Miller, L. Bornn, Possession sketches: Mapping
   While this paper discussed evaluation in the context          NBA strategies, in: MIT Sloan Sports Analytics
of sports, we do feel that some of the challenges and in-        Conference, 2017.
sights are relevant for other application domains where      [8] S. L. Halson, Monitoring training load to under-
machine learning is applied. For example, evaluation             stand fatigue in athletes, Sports Med 44 (2014) 139–
challenges also arise in prognostics, especially when it is      147.
impossible to directly collect data about a target such as   [9] P. C. Bourdon, M. Cardinale, A. Murray, P. Gastin,
time until failure. In both domains, we do not want to let       M. Kellmann, M. C. Varley, T. J. Gabbett, A. J. Coutts,
the athlete nor machine be damaged beyond repair. Also,          D. J. Burgess, W. Gregson, N. T. Cable, Monitoring
we perform multiple actions to avoid failure, making it          athlete training loads: consensus statement, Int J
difficult to attribute value to individual actions or iden-      Sports Physiol Perform 12 (2017) 161–170.
tify root causes. Another example is how to deal with [10] S. Green, Assessing the performance of Pre-
subjective ratings provided by users, which often occurs         mier League goalscorers, 2012. URL: https:
when monitoring players’ fitness and was also a key issue        //www.statsperform.com/resource/assessing-the-
in the Netflix challenge. Finally, in terms of approaches        performance-of-premier-league-goalscorers/.
to evaluation, there is also more emphasis within ML in [11] M. Buchheit, Y. Cholley, P. Lambert, Psychometric
general on trying to ensure the robustness of learned            and physiological responses to a preseason compet-
models by checking, for example, how susceptible they            itive camp in the heat with a 6-hour time difference
are to adversarial attacks.                                      in elite soccer players, Int J Sports Physiol Perform
                                                                 11 (2016) 176–181.
                                                            [12] G. Borg, Psychophysical bases of perceived exer-
Acknowledgments                                                  tion, Med sci sports exer 14 (1982) 377–381.
                                                            [13] N. Johnson, Extracting player tracking data from
This work was supported by iBOF/21/075, Research
                                                                 video using non-stationary cameras and a combina-
Foundation-Flanders (EOS No. 30992574, 1SB1320N to
                                                                 tion of computer vision techniques, in: MIT Sloan
LD) and the Flemish Government under the “Onder-
                                                                 Sports Analytics Conference, 2020.
zoeksprogramma Artificiële Intelligentie (AI) Vlaanderen”
                                                            [14] A. Arbués Sangüesa, A journey of computer vision
program.
                                                                 in sports: from tracking to orientation-base metrics,
                                                                 Ph.D. thesis, Universitat Pompeu Fabra, 2021.
References                                                  [15] P. Lucey, A. Bialkowski, M. Monfort, P. Carr,
                                                                 I. Matthews, Quality vs quantity: Improved shot
  [1] L. Bransen, P. Robberechts, J. Van Haaren, J. Davis,       prediction in soccer using strategic features from
      Choke or shine? quantifying soccer players’ abil-          spatiotemporal data, in: MIT Sloan Sports Analytics
      ities to perform under mental pressure, in: MIT            Conference, 2015.
      Sloan Sports Analytics Conference, 2019.              [16] P. Robberechts, J. Davis, How data availability af-
  [2] T. Decroos, L. Bransen, J. Van Haaren, J. Davis, Ac-       fects the ability to learn good xg models, in: Work-
      tions Speak Louder Than Goals: Valuing Player              shop on Machine Learning and Data Mining for
      Actions in Soccer, in: Proc. of 25th ACM SIGKDD            Sports Analytics, 2020, pp. 17–27.
      International Conference on Knowledge Discovery [17] V. Sarlis, C. Tjortjis, Sports analytics — evaluation
      & Data Mining, 2019, pp. 1851–1861.                        of basketball players and team performance, Infor-
  [3] G. Liu, O. Schulte, Deep reinforcement learning            mation Systems 93 (2020) 101562.
      in ice hockey for context-aware player evaluation, [18] B. Macdonald, An expected goals model for evalu-
      in: Proc. of 27th Int. Joint Conference on Artificial      ating nhl teams and players, in: MIT Sloan Sports
      Intelligence, 2018, pp. 3442–3448.                         Analytics Conference, 2012.
  [4] A. Franks, A. Miller, L. Bornn, K. Goldsberry, Coun- [19] J. Fernández, L. Bornn, D. Cervone, A framework
      terpoints: Advanced defensive metrics for NBA bas-         for the fine-grained evaluation of the instantaneous
      ketball, in: MIT Sloan Sports Analytics Conference,        expected value of soccer possessions, Machine
      2015.                                                      Learning 110 (2021) 1389–1427.
                                                            [20] S. Pettigrew, Assessing the offensive productivity
     of NHL players using in-game win probabilities, in:          J. Davis, Leaving goals on the pitch: Evaluating
     MIT Sloan Sports Analytics Conference, 2015.                 decision making in soccer, in: MIT Sloan Sports
[21] B. Burke, WPA explained, 2010. URL: http:                    Analytics Conference, 2021.
     //archive.advancedfootballanalytics.com/2010/01/        [35] H. M. Le, Y. Yue, P. Carr, P. Lucey, Coordinated
     win-probability-added-wpa-explained.html.                    multi-agent imitation learning, in: Proceedings
[22] P. Robberechts, J. Van Haaren, J. Davis, A bayesian          of the 34th International Conference on Machine
     approach to in-game win probability in soccer, in:           Learning, 2017, pp. 1995–2003.
     Proc. of 27th ACM SIGKDD International Confer-          [36] M. J. Joyner, Modeling: optimal marathon perfor-
     ence on Knowledge Discovery & Data Mining, 2021,             mance on the basis of physiological factors, Journal
     pp. 3512–3521.                                               of Applied Physiology 70 (1991) 683–687.
[23] M. Bouey, NBA win probability added, 2013.              [37] I. McHale, P. Scarf, D. Folker, On the development
     URL: https://www.inpredictable.com/2013/06/nba-              of a soccer player performance rating system for
     win-probability-added.html.                                  the english premier league, Interfaces 42 (2012)
[24] D. Cervone, A. D’Amour, L. Bornn, K. Goldsberry,             339–351.
     POINTWISE: Predicting Points and Valuing Deci-          [38] L. Pappalardo, P. Cintia, P. Ferragina, E. Massucco,
     sions in Real Time with NBA Optical Tracking Data,           D. Pedreschi, F. Giannotti, Playerank: Data-driven
     in: MIT Sloan Sports Analytics Conference, 2014.             performance evaluation and player ranking in soc-
[25] D. Romer, Do firms maximize? Evidence from                   cer via a machine learning approach, ACM Trans.
     professional football, Journal of Political Economy          Intell. Syst. Technol. 10 (2019) 59:1–59:27.
     114 (2006) 340–365.                                     [39] L. M. Hvattum, Offensive and defensive plus–minus
[26] K. Routley, O. Schulte, A Markov game model for              player ratings for soccer, Applied Sciences 10
     valuing player actions in ice hockey, in: Proc. 31st         (2020).
     Conference on Uncertainty in Artificial Intelligence,   [40] A. Z. Jacobs, H. Wallach, Measurement and fairness,
     2015, pp. 782–791.                                           in: Proc. of the 2021 ACM Conference on Fairness,
[27] T. Kempton, N. Kennedy, A. J. Coutts, The expected           Accountability, and Transparency, 2021, p. 375–385.
     value of possession in professional rugby league        [41] W. Dubitzky, P. Lopes, J. Davis, D. Berrar, The open
     match-play, Journal of sports sciences 34 (2016)             international soccer database for machine learning,
     645–650.                                                     Machine learning 108 (2019) 9–28.
[28] Q. Wang, H. Zhu, W. Hu, Z. Shen, Y. Yao, Discerning     [42] U. Dick, D. Link, U. Brefeld, Who can receive the
     tactical patterns for professional soccer teams: An          pass? A computational model for quantifying avail-
     enhanced topic model with applications, in: Proc.            ability in soccer, Data Mining and Knowledge Dis-
     of the 21th ACM SIGKDD International Conference              covery (2022).
     on Knowledge Discovery and Data Mining, ACM,            [43] W. Xu, Toward human-centered AI: A perspective
     2015, pp. 2197–2206.                                         from human-computer interaction, Interactions 26
[29] T. Decroos, J. Davis, Player vectors: Characteriz-           (2019) 42–46.
     ing soccer players’ playing style from match event      [44] M. Van Roy, P. Robberechts, T. Decroos, J. Davis,
     streams, in: Joint European Conference on Machine            Valuing on-the-ball actions in soccer: A critical com-
     Learning and Knowledge Discovery in Databases,               parison of xT and VAEP, in: 2020 AAAI Workshop
     Springer, 2019, pp. 569–584.                                 on AI in Team Sports, 2020.
[30] J. Bekkers, S. S. Dabadghao, Flow motifs in soc-        [45] K. Singh, Introducing expected threat, 2019. URL:
     cer: What can passing behavior tell us?, Journal of          https://karun.in/blog/expected-threat.html.
     Systems Architecture 5 (2019) 299–311.                  [46] A. M. Franks, A. D’Amour, D. Cervone, L. Bornn,
[31] J. Fernandez-Navarro, L. Fradua, A. Zubillaga, A. P.         Meta-analytics: Tools for understanding the statis-
     McRobert, Evaluating the effectiveness of styles of          tical properties of sports metrics, Journal of Quan-
     play in elite soccer, International Journal of Sports        titative Analysis in Sports 12 (2016) 151–165.
     Science & Coaching 14 (2019) 514–527.                   [47] M. Kwiatkowska, G. Norman, D. Parker, PRISM 4.0:
[32] S. Merckx, P. Robberechts, Y. Euvrard, J. Davis, Mea-        Verification of probabilistic real-time systems, in:
     suring the effectiveness of pressing in soccer, in:          Proc. 23rd Int. Conf. on Computer Aided Verifica-
     Workshop on Machine Learning and Data Mining                 tion, 2011, pp. 585–591.
     for Sports Analytics, 2021.                             [48] S. Russell, D. Dewey, M. Tegmark, Research priori-
[33] N. Sandholtz, L. Bornn, Markov decision processes            ties for robust and beneficial artificial intelligence,
     with dynamic transition probabilities: An analysis           AI Magazine 36 (2015) 105–114.
     of shooting strategies in basketball, Annals of App     [49] A. Kantchelian, J. D. Tygar, A. Joseph, Evasion and
     Stat 14 (2020) 1122–1145.                                    hardening of tree ensemble classifiers, in: Proc.
[34] M. Van Roy, P. Robberechts, W.-C. Yang, L. De Raedt,         of the 33rd International Conference on Machine
     Learning, 2016, pp. 2387–2396.
[50] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochen-
     derfer, Reluplex: An efficient smt solver for veri-
     fying deep neural networks, in: Computer Aided
     Verification, 2017, pp. 97–117.
[51] L. Devos, W. Meert, J. Davis, Versatile verification
     of tree ensembles, in: Proc. of the 38th International
     Conference on Machine Learning, 2021, pp. 2654–
     2664.
[52] A. Niculescu-Mizil, R. Caruana, Predicting good
     probabilities with supervised learning, in: Proc. of
     the 22nd Int. Conf. on Machine learning, 2005, p.
     625–632.
[53] G. W. Brier, Verification of forecasts expressed in
     terms of probability, Monthly weather review 78
     (1950) 1–3.
[54] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On
     calibration of modern neural networks, in: Proc. of
     the 34th Int. Conf. on Machine Learning, 2017, pp.
     1321–1330.
[55] T. Decroos, J. Davis, Interpretable prediction of
     goals in soccer, in: AAAI 2020 Workshop on AI in
     Team Sports, 2020.
[56] L. Pappalardo, A. Rossi, M. Natilli, P. Cintia, Ex-
     plaining the difference between men’s and women’s
     football, PLoS ONE 16 (2021).