
Evaluating Sports Analytics Models: Challenges, Approaches, and Lessons Learned

Jesse Davis

jesse.davis@kuleuven.be 1

Lotte Bransen

lotte.bransen@kuleuven.be 1 2

Laurens Devos

laurens.devos@kuleuven.be 1

Wannes Meert

wannes.meert@kuleuven.be 1

Pieter Robberechts

pieter.robberechts@kuleuven.be 1

Jan Van Haaren

0 1

Maaike Van Roy

1

0 Club Brugge, Belgium

1 Department of Computer Science, Leuven.AI, KU Leuven, Leuven, Belgium

2 SciSports, The Netherlands

There has been an explosion of data collected about sports. Because such data is extremely rich and complicated, machine learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in turn used to inform decision-making at professional clubs. Unfortunately, evaluating the use of machine learning in the context of sports remains extremely challenging. On the one hand, it is necessary to evaluate the developed indicators themselves, where one is confronted by a lack of labels and small sample sizes. On the other hand, it is necessary to evaluate the models themselves, which is complicated by the noisy and non-stationary nature of sports data. In this paper, we highlight the inherent evaluation challenges in sports and discuss a variety of approaches for evaluating both indicators and models. In particular, we highlight how reasoning techniques, such as verification, can be used to aid in the evaluation of learned models.

Keywords: sports analytics, challenges with evaluation, indicator evaluation, model evaluation, model verification, reliability

1. Introduction

Sports is becoming an increasingly data-driven field as there are now large amounts of data about both the physical states of athletes such as heart rate, GPS, and inertial measurement units (e.g., Catapult Sports) as well as technical performances in matches such as play-by-play (e.g., Stats Perform, StatsBomb) or optical tracking data (e.g., TRACAB, Second Spectrum, SkillCorner). The volume, complexity and richness of these data sources have made machine learning (ML) an increasingly important analysis tool. Consequently, ML is being used to inform decision-making in professional sports. On the one hand, it is used to extract actionable insights from the large volumes of data related to player performance, tactical approaches, and the physical status of players. On the other hand, it is used to partially automate tasks such as video analysis that are typically done manually.

At a high level, ML plays a role in team sports in three areas:

Player recruitment. Ultimately, recruitment involves (1) assessing a player’s skills and capabilities on a technical, tactical, physical and mental level and how they will evolve, (2) projecting how the player will fit within the team, and (3) forecasting how their financial valuation will develop (cf. [1, 2, 3, 4]).

Match preparation. Preparing for a match requires performing an extensive analysis of the opposing team to understand their tendencies and tactics. This can be viewed as a SWOT analysis, which particularly focuses on the opportunities and threats. How can we punish the opponent? How can the opponent punish us? These findings are used by the coaching staff to prepare a game plan. Typically, such reports are prepared by analysts who spend many hours watching videos of upcoming opponents. The analysts must annotate footage and recognize reoccurring patterns, which is a very time-consuming task. Learned models can automatically identify patterns that are missed or not apparent to humans (e.g., subtle patterns in big data) [5], automate tasks (e.g., tagging of situations) [6, 7] that are done by human analysts, and give insights into players’ skills.

Management of players’ health and fitness. Building up and maintaining a player’s fitness level is crucial for achieving good performances [8, 9]. However, training and matches place athletes’ bodies under tremendous stress. It is crucial to monitor fitness, have a sense of how much an athlete can do or, most importantly, when they need to rest and recover. Moreover, managing and preventing injuries is crucial for a team’s stability and continuity which is linked to success.

One of the most common uses of ML for addressing the aforementioned tasks is developing novel indicators for quantifying performances. Typically, machine-learned models are trained on large historical databases of past matches. Afterwards, the indicator is derived from the model as the indicator cannot be used directly as a target for training because it is not in the data. One prominent example of such an indicator is expected goals (xG) [10], which is used in soccer and ice hockey to quantify the quality of the scoring opportunities that a team or player created. The underlying model is a binary classifier that predicts the outcome of a shot based on features such as the distance and angle to the goal, the assist type and the passage of play.1 It is typically a more consistent measure of performance than actual goals, which are extremely important in these sports but also very rare. Even shots are relatively infrequent, and their conversion is subject to variance. The idea of xG is to separate the ability to get into good scoring positions from the inherent randomness (e.g., deflections) of converting them into goals.

Typically, an indicator should satisfy several properties. First, it should provide insights that are not currently available. For example, xG should tell you something beyond looking at goals scored. Second, the indicator should be based on domain knowledge and concepts from sports such that it is intuitive and easy for non-ML experts to understand. Finally, the domain experts need to trust the indicator. This often boils down to being able to contextualize when the indicator is useful and ensuring some level of robustness in its value (i.e., it should not wildly fluctuate).

These desiderata illustrate that a key challenge in developing indicators is in how to evaluate them: none of the desiderata naturally align with the standard performance metrics used to evaluate learned models. This does not imply that standard evaluation metrics are not important. In particular, ensuring that probability estimates are well-calibrated is crucial in many sports analytics tasks. It is simply that one must evaluate both the indicator itself and the models used to compute the indicator’s value. The goal of this paper is three-fold.

First, we will highlight some of the challenges that arise when trying to evaluate work in the context of sports data. Second, we will discuss the various ways that indicator evaluation has been approached. Third, we will give an overview of how the learned models that the indicators rely upon have been evaluated. While we will briefly discuss some standard evaluation metrics, we will focus on a more speculative use of reasoning techniques for model evaluation. This paper focuses on the context of professional soccer, where we have substantial experience. However, we believe the lessons and insights are applicable to other team sports, or other domains than sports.

1 For an interactive discussion of xG, see: https://dtai.cs.kuleuven.be/sports/blog/illustrating-the-interplay-between-features-and-models-in-xg

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

2. Common Sports Data and Analytics Tasks

This section serves as a short, high-level primer on the data collected from sports matches as well as typical styles of performance indicators and tactical analyses.

2.1. Data

While there are a variety of sources of data collected about sports, we will discuss three broad categories: physical data, play-by-play data and optical tracking data.

During training and matches, athletes often wear a GPS tracker with accelerometer technology (e.g., from Catapult Sports). These systems measure various physical parameters such as distance covered, number of high-speed sprints, and high-intensity accelerations. These parameters are often augmented with questionnaire data [11] to obtain subjective measurements about the difficulty of training such as the rating of perceived exertion (RPE) [12]. Such approaches are used to optimize an athlete’s fitness level and ensure their availability and ability to compete.

Play-by-play or event stream data tracks actions that occur with the ball. Each such action is annotated with information such as the type of the action, the start and end locations of the action, the result of the action (i.e., successful or not), the time at which the action was performed, the player who performed the action, and the team of which the acting player is a part. Figure 1 illustrates six actions that are part of the game between Brazil and Belgium at the 2018 World Cup as they were recorded in the event stream data format. This data is collected for a variety of sports by vendors such as Stats Perform who typically employ human annotators to collect the data by watching broadcast video.

Optical tracking data reports the locations of all the players and the ball multiple times per second (typically between 10 and 25 Hz). This data is collected using a fixed installation in a team’s stadium using high-resolution cameras. Such a setup is expensive and typically only used in top leagues. There is now also extensive work on tracking solutions based on broadcast video [13, 14]. Figure 2 shows a frame of tracking data.

2.2. Individual Performance Indicators

Performance indicators for individual players usually fall in one of two categories. The first type focuses on a single action such as a pass or shot. The second type takes a holistic approach by developing a unifying framework that can value a wide range of action types.

Single action. Single action indicators typically take the form of expected value-based statistics: they measure the expected chance that a typical player would successfully execute the considered action in a specific game context. For example, the aforementioned xG model in soccer assigns a probability to each shot that represents its chance of directly resulting in a goal. These models are learned using standard probabilistic classifiers such as logistic regression or tree ensembles from large historical datasets of shots. Each shot is described by the game context from when it was taken, and how this is represented is the key difference among existing models [10, 15, 16].

Such indicators exist for a variety of sports including American football (e.g., expected completion percentage for quarterbacks and expected yards after the catch for receivers),2 basketball (e.g., expected field goal percentage [17]), and ice hockey (expected goals [18]).

All actions. Instead of building bespoke models for each action, these indicators use the same framework to aggregate a player’s contributions over a set of action types. Regardless of sport, almost all approaches exploit the fact that each action $a_i$ changes the game state from $s_i$ to $s_{i+1}$ (as illustrated in Figure 3). These approaches value the contribution of an action as:

$$V(s_i, a_i) = V(s_{i+1}) - V(s_i), \tag{1}$$

where $V(\cdot)$ is the value of a game state and $s_{i+1}$ is the game state that results from executing action $a_i$ in game state $s_i$.

Figure 3: Lukaku’s dribble ($a_i$) changes the game state from the pre-action state $s_i$ to the post-action state $s_{i+1}$.

Approaches differ on how they value game states, with two dominant paradigms emerging: scoring-based and win-based. Scoring-based approaches take a narrower possession-based view. These approaches value a game state by estimating the probability that the team possessing the ball will score. In soccer, this may entail looking at the near-term probability of a goal in the next 10 actions or 10 seconds [2] or the long-term probability of scoring the next goal [19]. Win-based approaches look at valuing actions by assessing a team’s chance of winning the match in each game state. That is, these approaches look at the difference in in-game win-probability between two consecutive game states [20, 21, 22, 23]. Such models have been developed for many sports, including basketball [24], American football [25], ice hockey [26, 3] and rugby [27].

2 https://nextgenstats.nfl.com/glossary

2.3. Tactical Analyses

Tactics are short-term behaviors that are used to achieve a strategic objective such as winning or scoring a goal. At a high level, AI/ML is used for tactical analyses in two ways: to discover patterns and to evaluate the efficacy of a tactic.

Discovering patterns is a broad task that may range from simply trying to understand where on the field certain players tend to operate and who tends to pass to whom, to more complicated analyses that involve identifying sequences of reoccurring actions. Typically, techniques such as clustering, non-negative matrix factorization, and pattern mining are used to find such reoccurring behaviors [28, 29, 30].

Evaluating the efficacy of tactics is an equally broad task that can generally be split up into two parts: evaluating the efficacy of (1) a current and (2) a counterfactual tactic. Assessing the efficacy of currently employed tactics is typically done by focusing on a specific tactic (e.g., counterattack, pressing) and relating it to other success indicators (e.g., goals, wins) [31, 32]. In contrast, assessing the efficacy of counterfactual tactics is more challenging as it entails understanding what would happen if a team (or player) employed different tactics than those that were observed. This is extremely interesting and challenging from an AI/ML and evaluation perspective as it involves both (1) accurately modeling the current behavior of teams, and (2) reasoning in a counterfactual way about alternative behaviors. Such approaches have been developed in basketball and soccer to assess counterfactual shot [33, 34] and movement3 [35] tactics.

3 https://grantland.com/features/the-toronto-raptors-sportvu-cameras-nba-analytical-revolution/

3. Challenges with Evaluation

The nature of sports data and the tasks typically considered within sports analytics and science pose many challenges from an evaluation and analysis perspective.

Lack of ground truth. For many variables of interest, there are simply very few or even no labels, which arises when analyzing both match and physical data. When analyzing matches, a team’s specific tactical plan is unknown to outside observers. One can make educated guesses on a high level, but often not for fine-grained decisions. Similarly, when trying to assign ratings to players’ actions in a soccer match, there is no variable that directly records this. In fact, in this case, no such objective rating even exists.

Physical parameters can also be difficult to collect. For example, if one is interested in measuring fatigue4 during a match or training session, some measures are invasive (e.g., blood lactate or creatine kinase). Similarly, in endurance sports such as distance running and cycling, monitoring athletes’ aerobic fitness levels is important, which is often measured in terms of the maximal oxygen uptake (VO2max) [36]. However, the test to measure this variable is extremely strenuous and disrupts training regimes, so it can only be measured sporadically.

Credit assignment. It is often unclear why an action succeeded or failed. For example, did a pass not reach a teammate because the passer mishit the ball or did their teammate simply make the wrong run? Similarly, for those actions that are observed, we are unsure why they arose. For example, does a player make a lot of tackles in a soccer match because they are covering for a teammate who is constantly out of position? Or is the player a weak defender that is being targeted by the opposing team?

Noisy features and/or labels. When monitoring the health status of players, teams often partially rely on questionnaires [11] and subjective measures like the rating of perceived exertion [12]. Players respond to such questionnaires in different ways, with some being more honest than others. There is a risk of deception (e.g., players want to play, and may downplay injuries). There are also well-known challenges when working with subjective data. Similarly, play-by-play data is often collected by human annotators, who make mistakes. Moreover, the definitions of events and actions can change over time.

Small sample sizes. There may only be limited data about teams and players. For example, a top flight soccer team plays between 34 and 38 league games in a season and will perform between 1500 and 3000 on-the-ball actions in a game.5 Even top players do not appear every game and sit out matches strategically for rest.

Non-stationary data. The sample size issues are compounded by the fact that sports is a very non-stationary setting, meaning data that is more than one or two seasons old may not be relevant. On a team level, playing styles tend to vary over time due to changes in playing and management personnel. On a player level, skills evolve over time, often improving until a player reaches their peak, prior to an age-related decline. More generally, tactics evolve and change.

Counterfactuals. Many evaluation questions in sports involve reasoning about outcomes that were not observed. This is most notable in the case of defense, where defensive tactics are often aimed at preventing dangerous actions from arising such as wide-open three-point shots in the NBA or one vs. the goalie in soccer. Unfortunately, it is hard to know why certain actions were or were not taken. For example, it is difficult to estimate whether the goalie would have saved the shot if they had been positioned slightly differently. Similarly, evaluating tactics also involves counterfactual reasoning as a coach is often interested in knowing what would have happened if another policy had been followed, such as shooting more or less often from outside the penalty box in soccer.

Interventions. The data is observational and teams constantly make decisions that affect what is observed. This is particularly true for injury risk assessment and load management, where the staff will alter players’ training regime if they are worried about the risk of injury. Managers also change tactics during the course of the game, depending on the score and the team’s performance.

4 Note that there are different types of fatigue that could be monitored such as musculoskeletal or cardiovascular fatigue.

5 The number depends on what is annotated in the data (e.g., pressure events) and modeling choices such as whether a pass receival is treated as a separate action.

4. Evaluating an Indicator

A novel indicator should capture something about a player’s (or team’s) performance or capabilities. Evaluating a novel indicator’s usefulness is difficult as it is unclear what it should be compared against. This problem is addressed in multiple different ways in the literature.

4.1. Correlation with Existing Success Indicators

In all sports, a variety of indicators exist that denote whether a player (or team) is considered or perceived to be good. Such indicators can be on either the individual or team level.

When evaluating individual players, there is a wealth of existing indicators that are commonly reported and used. First, there are indirect indicators such as a player’s market value, salary, playing time, or draft position. Second, there are indicators derived from competition such as goals and assists in soccer (or ice hockey). It is therefore possible to design an evaluation by looking at the correlation between each indicator’s value for every player [20, 26, 3]. Alternatively, it is possible to produce two rank-ordered lists of players (or teams): one based on an existing success indicator and another based on a newly designed indicator. Then the correlation between rankings can be computed.

Arguably, an evaluation that strives for high correlations with existing indicators misses the point: the goal is to design indicators that provide insights that current ones do not. If a new indicator simply yields the same ranking as looking at goals, then it does not provide any new information. Moreover, some existing success indicators capture information that is not related to performance. For example, salary can be tied to draft position and years of service. Similarly, a soccer player’s market value or transfer fee also encompasses their commercial appeal. Even playing time is not necessarily merit-based.

Other work tries to associate performance and/or presence in the game with winning. This is appealing as the ultimate goal is to win a game.6 For example, indicators can be based on correlating how often certain actions are performed with match outcomes, points scored, or score differentials [37, 38]. An alternative approach is to build a predictive model based on the indicators and see if it can be used to predict the outcomes of future matches [39].

4.2. The Messi Test

When evaluating indicators about player performance, one advantage is that there is typically consensus on who are among the very top players. While experts, pundits, and fans may debate the relative merits of Lionel Messi and Cristiano Ronaldo, there is little debate that they fall in the very top set of offensive players. An offensive metric where neither of those players scores well is not likely to convince practitioners. In other words, if a metric blatantly contradicts most conventional wisdom, there is likely a problem with it. This is also called face validity [40]. Of course, some unexpected or more surprising names could appear towards the top of such a ranking, but one would be wary if all such names were surprising.

Unfortunately, this style of evaluation is most suited to analyzing offensive contributions. In general, there is more consensus on the offensive performances of individual players than their defensive performances, as good defense is a collective endeavor and more heavily reliant on tactics.
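As a rough illustration of the two checks discussed in Sections 4.1 and 4.2, the sketch below computes a Spearman rank correlation between an existing success indicator and a new indicator, and then checks whether a set of consensus top players lands in the top k of the new ranking. The player labels, the indicator values, the cut-off `top_k`, and all helper names are hypothetical placeholders chosen for illustration; they are not data or code from the works cited here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def passes_face_validity(scores, consensus_stars, top_k):
    """True if every consensus top player is within the top_k of the new ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(consensus_stars) <= set(ranked[:top_k])


if __name__ == "__main__":
    # Made-up values for six anonymous players (assumption, not real data).
    players = ["player_a", "player_b", "player_c", "player_d", "player_e", "player_f"]
    goals = [30, 25, 12, 10, 8, 3]
    new_indicator = [0.90, 0.95, 0.40, 0.60, 0.30, 0.10]

    print("rank correlation with goals:", spearman(goals, new_indicator))
    scores = dict(zip(players, new_indicator))
    print("face validity:", passes_face_validity(scores, {"player_a", "player_b"}, top_k=3))
```

Read together with the caveats above, a very high rank correlation would suggest that the new indicator adds little beyond the existing one, whereas a failed face-validity check would suggest a problem with the indicator; neither check is sufficient on its own.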

4.3. Make a Prediction

While backtesting indicators (and models) is clearly a key component of development, sports does offer the possibility for real-world predictions on unseen data. One can predict, and most importantly publish, the outcomes of matches or tournaments prior to their start. In fact, there have been several competitions designed around this principle [41] or people who have collected predictions online.7

This is even possible for player indicators, and is often done in the work on quantifying player performance [2, 3]. Decroos et al. [2] included lists of the top under-21 players in several of the major European soccer leagues for the 2017/2018 season. It is interesting to look back on the list, and see that there were both hits and misses. For example, the list had some players who were less heralded then such as Mason Mount and Mikel Oyarzabal, who are now key players. Similarly, it had several recognized prospects such as Kylian Mbappé, Trent Alexander-Arnold, and Frenkie de Jong who have ascended. Finally, there were misses like Jonjoe Kenny and David Neres. While one has to wait, it does give an immutable forecast that can be evaluated.

Because they do not allow for immediate results, such evaluations tend to be done infrequently. However, we believe this is something that should be done more often. It avoids the possibility of cherry-picking results and overfitting by forcing one to commit to a result with an unknown outcome. This may also encourage more critical thinking about the utility of the developed indicator. The caveat is that the predictions must be revisited and discussed in the future, which also requires that publication venues be open to such submissions. Beyond the time delay, another drawback is that they involve small sample sizes such as one match day, one tournament, or a short list of players.

6 This is not always the case: sometimes teams play for draws, rest players for strategic reasons, prioritize getting young players experience or try to lose to improve draft position.

7 https://twitter.com/TonyElHabr/status/1414619621659971588

4.4. Ask an Expert

Developed indicators and approaches can be validated by comparing them to an external source provided by domain experts. This goes beyond the Messi test as it requires both deeper domain expertise and a more extensive evaluation such as comparing tactical patterns discovered by an automated system to those found by a video analyst. Pappalardo et al. [38] compared a player ranking system they developed to rankings produced by scouts. Similarly, Dick et al. [42] asked soccer coaches to rate how available players were to receive a pass in game situations and compared this assessment to a novel indicator they developed.

Ideally, such an expert-based evaluation considers aspects beyond model accuracy. Ultimately, an indicator should provide “value” to the workflow of practitioners. Hence, it is relevant to measure how much time it saves an analyst in their workflow, whether an indicator can provide relevant new insights and whether the expert can correctly interpret the model’s output. This type of evaluation checks whether indicators fulfill the needs of users (i.e., usefulness and usability) and also arises in human-computer interaction [43].

However, this type of evaluation can be difficult as not all researchers have access to domain experts, particularly when it comes to high-level sports. Moreover, teams want to maintain a competitive advantage, so one may not be able to publish such an evaluation.

4.5. Reliability

Indicators are typically developed to measure a skill or capability such as shooting ability in basketball or offensive contributions. While these skills can and do change over a longer timeframe (multiple seasons), they typically are consistent within a season or even across two consecutive seasons. Therefore, an indicator should take on similar values in such a time frame.

Figure 4: Pearson correlation between player performance indicators for ten pairs of successive seasons in the English Premier League (2009/10 – 2019/20). The diamond shape indicates the mean correlation. The simple “minutes played” indicator is the least reliable, while the Atomic-VAEP 8 indicator is more reliable than its VAEP [2] predecessor and xT [45]. As shots are infrequent and have a variable outcome, omitting them increases an indicator’s reliability. The xT indicator does not value shots. Only players that played at least 900 minutes (the equivalent of ten games) in each of the successive seasons are included.

One approach [39, 44] to measure an indicator’s reliability is to split the data set into two, and then compute the correlation between the indicator values computed on each half. An example of such an evaluation is shown in Figure 4. Methodologically, one consideration is how to partition the available data. Typically, one is concerned with respecting chronological orderings in temporal data. However, in this setting, such a division is likely suboptimal. First, games missed by injury will be clustered and players likely perform differently right when they come back. Second, the difficulty of a team’s schedule is not uniformly spread over a season. Third, if the time horizon is long enough, there will be aging effects.

Franks et al. [46] propose a metric to capture an indicator’s stability. It tries to assess how much an indicator’s value depends on context (e.g., a team’s tactical system, quality of teammates) and changes in skill (e.g., improvement through practice). It does so by looking at the variance of indicators using a within-season bootstrap procedure.
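The split-based reliability check described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of any cited work: the indicator is taken to be a player's mean per-game value, each player's games are shuffled before being split into two halves (so that neither half is a contiguous block of the season), and the function and variable names are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def split_half_reliability(game_values, seed=0):
    """Correlate per-player indicator means computed on two random halves.

    game_values maps a player label to a list of per-game indicator values.
    At least two players with differing means are needed, otherwise the
    correlation is undefined.
    """
    rng = random.Random(seed)
    first_half, second_half = [], []
    for values in game_values.values():
        if len(values) < 2:
            continue  # a player needs at least one game in each half
        shuffled = list(values)
        rng.shuffle(shuffled)
        cut = len(shuffled) // 2
        first_half.append(mean(shuffled[:cut]))
        second_half.append(mean(shuffled[cut:]))
    return pearson(first_half, second_half)


if __name__ == "__main__":
    # Made-up per-game values for three anonymous players (assumption).
    toy = {
        "player_a": [0.1, 0.3, 0.2, 0.4],
        "player_b": [0.5, 0.7, 0.6, 0.9],
        "player_c": [0.2, 0.1, 0.3, 0.2],
    }
    print("split-half correlation:", split_half_reliability(toy))
```

Averaging the result over several seeds would reduce the dependence on one particular partition; splitting by successive seasons instead, as in Figure 4, only requires replacing the shuffle with a season-based grouping.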

Another approach [29] is to look at consecutive seasons and pose the evaluation as a nearest neighbors problem. That is, based on the indicators computed from one season of data for a specific player, find a rank-ordered list of the most similar players in the subsequent (or preceding) season. The robustness of the indicator is then related to the position of the chosen player in the ranking.

8 https://dtai.cs.kuleuven.be/sports/blog/introducing-atomic-spadl-a-new-way-to-represent-event-stream-data/

5. Evaluating a Model

Evaluating the models used to produce the indicator involves two key aspects. First, it is important to ensure that the model will behave as expected on unseen data. This is particularly important for sports since the data can have errors or noise (e.g., incorrect annotations, sensor failures, errors in tracking data) and rare or unexpected events. Hence, one wants to reason about the model. Second, there are standard evaluation metrics that are important to use to ensure, e.g., that probability estimates are accurate.

5.1. Reasoning about Learned Models

Verification is a powerful alternative to traditional aggregated metrics to evaluate and inspect a learned model. Verification attempts to reason about how a learned model will behave [47, 48, 49, 50]. Given a desired target value (i.e., prediction), and possibly some constraints on the values that the features can take on, a verification algorithm either generates one or more instances that satisfy the constraints, or it proves that no such instance exists. This is similar to satisfiability checking. In practice, verification allows users to query a model, i.e., reason about the model’s possible outputs and examine what the model has learned from the data. It can be used to investigate how a model behaves in certain sub-areas of the input space. Examples of verification questions are:

• Is a model robust to small changes to the inputs? For example, does a small change in the time of the game and the position of the ball significantly change the probability that a shot will result in a goal? This relates to adversarial examples (cf. image recognition).

• Related to the previous question, but with a different interpretation: given a specific example of interest, can one or more attributes be (slightly) changed so that the indicator is maximized? This is often called a counterfactual explanation, e.g., if the goalie would have been positioned closer to the near post, how would that have affected the estimated probability of the shot resulting in a goal? We want to emphasize that this is not a causal counterfactual (because the considered models are not causal models).

• Does the model behave as expected in scenarios where we have strong intuitions based on domain knowledge? For example, one can analyze what values the model can predict for shots that are taken from a very tight angle or very far away from the goal. One can then check whether the predictions for the generated game situations are realistic.

Typical aggregated test metrics do not reveal the answers to these questions. Nevertheless, the answers can be very valuable because they provide insights into the model and can reveal problems with the model or the data.

We have used verification to evaluate soccer models in two novel ways. First, we show how it is possible to debug the training data and pinpoint labeling errors (or inconsistencies). Second, we identify scenarios where the model produces unexpected and undesired predictions. These are shortcomings in the model itself. We use Veritas [51] to analyze two previously mentioned soccer analytics models: xG and the VAEP holistic action-value model.

First, we analyzed an xG model in order to identify “what are the optimal locations to shoot from outside the penalty box?”. We used Veritas to generate 200 examples of shots from outside the penalty box that would have the highest probability of resulting in a goal, which are shown as a heatmap in Figure 5. The cluster in front of the goal is expected as it corresponds to the areas most advantageous to shoot from. The locations near the corners of the pitch are unexpected. We looked at the shots from the 5 meter square area touching the corner and counted 11 shots and 8 goals, yielding an extremely high conversion rate of about 73%. This reveals an unexpected labeling behavior by the human annotators. Given the distance to the goal and the tight angle, one would expect a much lower conversion rate. A plausible explanation is that annotators are only labeling actions as a shot in the rare situations where the action results in a goal or a save. Otherwise, the actions are labeled as a pass or a cross.

Second, we analyzed VAEP [2], a holistic-action model for soccer. The models underlying this indicator look at a short sequence of consecutive game actions and predict the probability of a goal in the next 10 actions. Unlike xG models, all possible actions (passes, dribbles, tackles, ...) are considered, not just shots. For the data in an unseen test set, the model produces well-calibrated probability estimates in aggregate. However, we looked for specific scenarios where the model performs badly and found several instances that are technically possible, but very unlikely. More interestingly, Veritas generated instances where the values of all features were fixed except for the time in the match, and found that the probability of scoring varied dramatically according to match time. Figure 6 shows this variability for one such instance. The probability gradually increases over time, which is not necessarily unexpected as scoring rates tend to slightly increase as a match progresses.

[...] be evaluated in a number of different ways such as using reliability diagrams [52], the Brier score [53], logarithmic loss, and the multi-class expected calibration error (ECE) [54]. It is less clear when one of these metrics may be more appropriate than another. Here, it may be worth considering if the probabilities will be summed (e.g., for computing player ratings) or multiplied (e.g., modeling decision making) [55]. It is important to remember that these metrics depend on the class distribution, and hence their values need to be interpreted in this context. This is important as scoring rates can vary by competition (e.g., men’s leagues vs. women’s leagues) [56].

6. Discussion

Evaluating learning systems in the context of sports is an extremely tricky endeavor that largely relies on expertise gained through experience. On the one hand, the outputs of learned models are often combined in order to construct novel indicators of performance, and the validity of these indicators needs to be assessed. Here, we would like to caution against looking at correlations to other success metrics as we believe that a high correlation to an existing indicator fails the central goal: gaining new insights. We also believe that the reliability and stability of indicators are important, and should be more widely studied. Still, the best approach for evaluating a specific problem is often not clear, and the field would benefit from a broader discussion of best practices. On the other hand, it is also necessary to evaluate the models used to construct the underlying systems and indicators. Here, we believe that evaluating models by reasoning about their behavior is crucial: this changes the focus from a purely data-based evaluation perspective to one that considers the effect of the data on the model.
However, The ability to have insight into a model’s behavior also about 27 minutes into the first half the probability of facilitates interactions with domain experts. Critically scoring dramatically spikes. Clearly, this behavior is un- reflecting on what situations a model will work well in desirable: we would not expect such large variations. and which situations it may struggle in, helps build trust This suggests that time should probably be handled dif- and set appropriate expectations. ferently in the model, e.g., by discretizing it to make it Still, using reasoning is not a magic solution. When less fine-grained. a reasoner identifies unexpected behaviors, there are at

Such an evaluation is still challenging. One has to least two possible causes. One cause is errors in the trainknow what to look for, which typically requires signifi- ing data which are picked up by the model and warp the cant domain expertise or access to a domain expert. More- decision boundary in unexpected ways (e.g., Figure 5). over, the process is exploratory: there is a huge space Some errors can be found by inspecting the data, but of scenarios to consider and the questions have to be given the nature of the data, it can be challenging to know iteratively refined. where to look. The other cause is peculiarities with the model itself, the learning algorithm that constructed the model, or the biases resulting from the model represen5.2. Standard Metrics tation (e.g., Figure 6). Traditional evaluation metrics are Many novel indicators involve using a learned model that completely oblivious to these issues. They can only be makes probabilistic predictions, making calibration the discovered by reasoning about the model. Unfortunately, standard choice for a classical evaluation. Calibration can it remains dificult to correct a model that has picked up

Acknowledgments This work was supported by iBOF/21/075, Research

Foundation-Flanders (EOS No. 30992574, 1SB1320N to LD) and the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program. on an unwanted pattern. For example, the time’s efect on the probability of scoring can only be resolved via representing the feature in a diferent way, relearning the model, and reassessing its performance. Alas, this is an iterative guess-and-check approach. We believe that reasoning approaches to evaluation are only in their infancy and need to be further explored.

While this paper discussed evaluation in the context of sports, we do feel that some of the challenges and insights are relevant for other application domains where machine learning is applied. For example, evaluation challenges also arise in prognostics, especially when it is impossible to directly collect data about a target such as time until failure. In both domains, we do not want to let the athlete nor machine be damaged beyond repair. Also, we perform multiple actions to avoid failure, making it dificult to attribute value to individual actions or identify root causes. Another example is how to deal with subjective ratings provided by users, which often occurs when monitoring players’ fitness and was also a key issue in the Netflix challenge. Finally, in terms of approaches to evaluation, there is also more emphasis within ML in general on trying to ensure the robustness of learned models by checking, for example, how susceptible they are to adversarial attacks. of NHL players using in-game win probabilities, in: J. Davis, Leaving goals on the pitch: Evaluating MIT Sloan Sports Analytics Conference, 2015. decision making in soccer, in: MIT Sloan Sports [21] B. Burke, WPA explained, 2010. URL: http: Analytics Conference, 2021. //archive.advancedfootballanalytics.com/2010/01/ [35] H. M. Le, Y. Yue, P. Carr, P. Lucey, Coordinated win-probability-added-wpa-explained.html. multi-agent imitation learning, in: Proceedings [22] P. Robberechts, J. Van Haaren, J. Davis, A bayesian of the 34th International Conference on Machine approach to in-game win probability in soccer, in: Learning, 2017, pp. 1995–2003.

Proc. of 27th ACM SIGKDD International Confer- [36] M. J. Joyner, Modeling: optimal marathon perforence on Knowledge Discovery & Data Mining, 2021, mance on the basis of physiological factors, Journal pp. 3512–3521. of Applied Physiology 70 (1991) 683–687. [23] M. Bouey, NBA win probability added, 2013. [37] I. McHale, P. Scarf, D. Folker, On the development URL: https://www.inpredictable.com/2013/06/nba- of a soccer player performance rating system for win-probability-added.html. the english premier league, Interfaces 42 (2012) [24] D. Cervone, A. D’Amour, L. Bornn, K. Goldsberry, 339–351.

POINTWISE: Predicting Points and Valuing Deci- [38] L. Pappalardo, P. Cintia, P. Ferragina, E. Massucco, sions in Real Time with NBA Optical Tracking Data, D. Pedreschi, F. Giannotti, Playerank: Data-driven in: MIT Sloan Sports Analytics Conference, 2014. performance evaluation and player ranking in soc[25] D. Romer, Do firms maximize? Evidence from cer via a machine learning approach, ACM Trans. professional football, Journal of Political Economy Intell. Syst. Technol. 10 (2019) 59:1–59:27. 114 (2006) 340–365. [39] L. M. Hvattum, Ofensive and defensive plus–minus [26] K. Routley, O. Schulte, A Markov game model for player ratings for soccer, Applied Sciences 10 valuing player actions in ice hockey, in: Proc. 31st (2020).

Conference on Uncertainty in Artificial Intelligence, [40] A. Z. Jacobs, H. Wallach, Measurement and fairness, 2015, pp. 782–791. in: Proc. of the 2021 ACM Conference on Fairness, [27] T. Kempton, N. Kennedy, A. J. Coutts, The expected Accountability, and Transparency, 2021, p. 375–385. value of possession in professional rugby league [41] W. Dubitzky, P. Lopes, J. Davis, D. Berrar, The open match-play, Journal of sports sciences 34 (2016) international soccer database for machine learning, 645–650. Machine learning 108 (2019) 9–28. [28] Q. Wang, H. Zhu, W. Hu, Z. Shen, Y. Yao, Discerning [42] U. Dick, D. Link, U. Brefeld, Who can receive the tactical patterns for professional soccer teams: An pass? A computational model for quantifying availenhanced topic model with applications, in: Proc. ability in soccer, Data Mining and Knowledge Disof the 21th ACM SIGKDD International Conference covery (2022). on Knowledge Discovery and Data Mining, ACM, [43] W. Xu, Toward human-centered AI: A perspective 2015, pp. 2197–2206. from human-computer interaction, Interactions 26 [29] T. Decroos, J. Davis, Player vectors: Characteriz- (2019) 42–46.

ing soccer players’ playing style from match event [44] M. Van Roy, P. Robberechts, T. Decroos, J. Davis, streams, in: Joint European Conference on Machine Valuing on-the-ball actions in soccer: A critical comLearning and Knowledge Discovery in Databases, parison of xT and VAEP, in: 2020 AAAI Workshop Springer, 2019, pp. 569–584. on AI in Team Sports, 2020. [30] J. Bekkers, S. S. Dabadghao, Flow motifs in soc- [45] K. Singh, Introducing expected threat, 2019. URL: cer: What can passing behavior tell us?, Journal of https://karun.in/blog/expected-threat.html.

Systems Architecture 5 (2019) 299–311. [46] A. M. Franks, A. D’Amour, D. Cervone, L. Bornn, [31] J. Fernandez-Navarro, L. Fradua, A. Zubillaga, A. P. Meta-analytics: Tools for understanding the statisMcRobert, Evaluating the efectiveness of styles of tical properties of sports metrics, Journal of Quanplay in elite soccer, International Journal of Sports titative Analysis in Sports 12 (2016) 151–165.

Science & Coaching 14 (2019) 514–527. [47] M. Kwiatkowska, G. Norman, D. Parker, PRISM 4.0: [32] S. Merckx, P. Robberechts, Y. Euvrard, J. Davis, Mea- Verification of probabilistic real-time systems, in: suring the efectiveness of pressing in soccer, in: Proc. 23rd Int. Conf. on Computer Aided VerificaWorkshop on Machine Learning and Data Mining tion, 2011, pp. 585–591.

for Sports Analytics, 2021. [48] S. Russell, D. Dewey, M. Tegmark, Research priori[33] N. Sandholtz, L. Bornn, Markov decision processes ties for robust and beneficial artificial intelligence, with dynamic transition probabilities: An analysis AI Magazine 36 (2015) 105–114. of shooting strategies in basketball, Annals of App [49] A. Kantchelian, J. D. Tygar, A. Joseph, Evasion and Stat 14 (2020) 1122–1145. hardening of tree ensemble classifiers, in: Proc. [34] M. Van Roy, P. Robberechts, W.-C. Yang, L. De Raedt, of the 33rd International Conference on Machine

Learning, 2016, pp. 2387–2396. [50] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An eficient smt solver for verifying deep neural networks, in: Computer Aided

Verification, 2017, pp. 97–117. [51] L. Devos, W. Meert, J. Davis, Versatile verification of tree ensembles, in: Proc. of the 38th International Conference on Machine Learning, 2021, pp. 2654– 2664. [52] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proc. of the 22nd Int. Conf. on Machine learning, 2005, p.

625–632. [53] G. W. Brier, Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3. [54] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: Proc. of the 34th Int. Conf. on Machine Learning, 2017, pp.

1321–1330. [55] T. Decroos, J. Davis, Interpretable prediction of goals in soccer, in: AAAI 2020 Workshop on AI in

Team Sports, 2020. [56] L. Pappalardo, A. Rossi, M. Natilli, P. Cintia, Explaining the diference between men’s and women’s football, PLoS ONE 16 (2021).
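
As a supplementary illustration of the calibration metrics named in Section 5.2 (Brier score, logarithmic loss, and expected calibration error), the following minimal Python sketch shows how each could be computed for a binary goal/no-goal model. It is not the authors' implementation: the function names and the toy data are hypothetical, and the ECE shown is a simplified binary variant that bins the predicted goal probability into equal-width bins, rather than the multi-class formulation cited in the text.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the observed binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that predictions of exactly 0 or 1 do not produce log(0).
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Binary ECE sketch (assumption: equal-width bins on the goal probability).

    For each bin, compare the observed goal rate with the mean predicted
    probability, and weight the absolute gap by the share of samples in the bin.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges maps every probability to a bin index
    # in 0..n_bins-1; a prediction of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(p_pred, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - p_pred[mask].mean())
        ece += mask.mean() * gap
    return float(ece)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = shot resulted in a goal, 0 = no goal.
    outcomes = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    predicted = [0.05, 0.10, 0.60, 0.10, 0.20, 0.05, 0.30, 0.10, 0.15, 0.05]
    print("Brier score:", brier_score(outcomes, predicted))
    print("Log loss   :", log_loss(outcomes, predicted))
    print("ECE        :", expected_calibration_error(outcomes, predicted))
```

Because all three quantities depend on the base rate of goals, values computed on competitions with different scoring rates are not directly comparable, which is the caveat raised in Section 5.2.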