Evaluating Sports Analytics Models: Challenges, Approaches, and Lessons Learned Jesse Davis1 , Lotte Bransen1,2 , Laurens Devos1 , Wannes Meert1 , Pieter Robberechts1 , Jan Van Haaren1,3 and Maaike Van Roy1 1 Department of Computer Science, Leuven.AI, KU Leuven, Leuven, Belgium 2 SciSports, The Netherlands 3 Club Brugge, Belgium Abstract There has been an explosion of data collected about sports. Because such data is extremely rich and complicated, machine learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in turn used to inform decision-making at professional clubs. Unfortunately, how to evaluate the use of machine learning in the context of sports remains extremely challenging. On the one hand, it is necessary to evaluate the developed indicators themselves, where one is confronted by a lack of labels and small sample sizes. On the other hand, it is necessary to evaluate the models themselves, which is complicated by the noisy and non-stationary nature of sports data. In this paper, we highlight the inherent evaluation challenges in sports and discuss a variety of approaches for evaluating both indicators and models. In particular, we highlight how reasoning techniques, such as verification can be used to aid in the evaluation of learned models. Keywords sports analytics, challenges with evaluation, indicator evaluation, model evaluation, model verification, reliability 1. Introduction At a high level, ML plays a role in team sports in three areas: Sports is becoming an increasingly data-driven field as there are now large amounts of data about both the phys- Player recruitment. Ultimately, recruitment involves ical states of athletes such as heart rate, GPS, and iner- (1) assessing a player’s skills and capabilities on a techni- tial measurement units (e.g., Catapult Sports) as well as cal, tactical, physical and mental level and how they will technical performances in matches such as play-by-play evolve, (2) projecting how the player will fit within the (e.g., Stats Perform, StatsBomb) or optical tracking data team, and (3) forecasting how their financial valuation (e.g., TRACAB, Second Spectrum, SkillCorner). The vol- will develop. (c.f., [1, 2, 3, 4]) ume, complexity and richness of these data sources have Match preparation. Preparing for a match requires made machine learning (ML) an increasingly important performing an extensive analysis of the opposing team analysis tool. Consequently, ML is being used to inform to understand their tendencies and tactics. This is can be decision-making in professional sports. On the one hand, viewed as a SWOT analysis, which particularly focuses it is used to extract actionable insights from the large on the opportunities and threats. How can we punish the volumes of data related to player performance, tactical opponent? How can the opponent punish us? These find- approaches, and the physical status of players. On the ings are used by the coaching staff to prepare a game plan. other hand, it is used to partially automate tasks such as Typically, such reports are prepared by analysts who video analysis that are typically done manually. spent many hours watching videos of upcoming oppo- EBeM’22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, nents. The analysts must annotate footage and recognize Vienna, Austria reoccurring patterns, which is a very time-consuming $ jesse.davis@kuleuven.be (J. Davis); lotte.bransen@kuleuven.be task. Learned models can automatically identify patterns (L. Bransen); laurens.devos@kuleuven.be (L. Devos); wannes.meert@kuleuven.be (W. Meert); that are missed or not apparent to humans (e.g., subtle pieter.robberechts@kuleuven.be (P. Robberechts); patterns in big data) [5], automate tasks (e.g., tagging of jan.vanhaaren@kuleuven.be (J. Van Haaren); situations) [6, 7] that are done by human analysts, and maaike.vanroy@kuleuven.be (M. Van Roy) give insights into players’ skills.  0000-0002-3748-9263 (J. Davis); 0000-0002-0612-7999 (L. Bransen); 0000-0002-1549-749X (L. Devos); 0000-0001-9560-3872 Management of player’s health and fitness. Build- (W. Meert); 0000-0002-3734-0047 (P. Robberechts); ing up and maintaining a player’s fitness level is crucial 0000-0001-7737-5490 (J. Van Haaren); 0000-0001-8959-3575 (M. Van Roy) for achieving good performances [8, 9]. However, train- © 2022 Copyright for this paper by its authors. Use permitted under Creative ing and matches place athletes’ bodies under tremendous Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) stress. It is crucial to monitor fitness, have a sense of how much an athlete can do or, most importantly, when cuss some standard evaluation metrics, we will focus they need to rest and recover. Moreover, managing and on a more speculative use of reasoning techniques for preventing injuries is crucial for a team’s stability and model evaluation. This paper focuses on the context of continuity which is linked to success. professional soccer, where we have substantial experi- ence. However, we believe the lessons and insights are One of the most common uses of ML for addressing the applicable to other team sports, or other domains than aforementioned tasks is developing novel indicators for sports. quantifying performances. Typically, machine-learned models are trained on large historical databases of past matches. Afterwards, the indicator is derived from the 2. Common Sports Data and model as the indicator cannot be used directly as a target for training because it is not in the data. One prominent Analytics Tasks example of such an indicator is expected goals (xG) [10], This section serves as a short, high-level primer on the which is used in soccer and ice hockey to quantify the data collected from sports matches as well as typical quality of the scoring opportunities that a team or player styles of performance indicators and tactical analyses. created. The underlying model is a binary classifier that predicts the outcome of a shot based on features such as the distance and angle to the goal, the assist type and the 2.1. Data passage of play.1 It is typically a more consistent measure While there are a variety of sources of data collected of performance than actual goals, which are extremely about sports, we will discuss three broad categories: phys- important in these sports but also very rare. Even shots ical data, play-by-play data and optical tracking data. are relatively infrequent, and their conversion is subject During training and matches, athletes often wear a to variance. The idea of xG is to separate the ability to GPS tracker with accelerometer technology (e.g., from get into good scoring positions from the inherent ran- Catapult Sports). These systems measure various physi- domness (e.g., deflections) of converting them into goals. cal parameters such as distance covered, number of high- Typically, an indicator should satisfy several properties. speed sprints, and high-intensity accelerations. These First, it should provide insights that are not currently parameters are often augmented with questionnaire available. For example, xG should tell you something data [11] to obtain subjective measurements about the beyond looking at goals scored. Second, the indicator difficulty of training such as the rating of perceived ex- should be based on domain knowledge and concepts from ertion (RPE) [12]. Such approaches are used to optimize sports such that it is intuitive and easy for non ML experts an athlete’s fitness level and ensure their availability and to understand. Finally, the domain experts need to trust ability to compete. the indicator. This often boils down to being able to Play-by-play or event stream data tracks actions that contextualize when the indicator is useful and ensuring occur with the ball. Each such action is annotated with some level of robustness in its value (i.e., it should not information such as the type of the action, the start and wildly fluctuate). These desiderata illustrate that a key challenge in de- veloping indicators is in how to evaluate them: none of the desiderata naturally align with the standard perfor- mance metrics used to evaluate learned models. This does not imply that standard evaluation metrics are not important. In particular, ensuring that probability esti- mates are well-calibrated is crucial in many sports an- alytics tasks. It is simply that one must both evaluate the indicator itself and the models used to compute the indicator’s value. The goal of this paper is three-fold. First, we will highlight some of the challenges that arise when trying to evaluate work in the context of sports data. Second, we will discuss the various ways that in- dicator evaluation has been approached. Third, we will overview how learned models that the indicators rely upon have been evaluated. While we will briefly dis- Figure 1: The sequence of actions leading up to Belgium’s second goal during the 2018 World Cup quarter-final. Each 1 For an interactive discussion of xG, see: https: on-the-ball action is annotated with a couple of attributes, as //dtai.cs.kuleuven.be/sports/blog/illustrating-the-interplay- illustrated for Lukaku’s dribble. (Data source: StatsBomb) between-features-and-models-in-xg is the key difference among existing models [10, 15, 16]. Such indicators exist for a variety of sports including American football (e.g., expected completion percentage for quarterbacks and expected yards after the catch for receivers),2 basketball (e.g., expected field goal percentage [17]), and ice hockey (expected goals [18]). All actions. Instead of building bespoke models for each action, these indicators use the same framework to aggregate a player’s contributions over a set of action Figure 2: Illustration of a tracking data frame for the first types. Regardless of sport, almost all approaches exploit goal of Liverpool against Bournemouth on Dec 7, 2019. The the fact that each action 𝑎𝑖 changes the game state from black lines represent each player’s and the ball’s trajectories 𝑠𝑖 to 𝑠𝑖+1 (as illustrated in Figure 3). These approaches during the previous 1.5 seconds. (Data source: Last Row) value the contribution of an action 𝑎𝑖 as: 𝐶(𝑠𝑖 , 𝑎𝑖 ) = 𝑉 (𝑠𝑖+1 ) − 𝑉 (𝑠𝑖 ), (1) end locations of the action, the result of the action (i.e., where 𝑉 (.) is the value of a game state and 𝑠𝑖+1 is the successful or not), the time at which the action was per- game state that results from executing action 𝑎𝑖 in game formed, the player who performed the action, and the state 𝑠𝑖 . team of which the acting player is a part of. Figure 1 illus- trates six actions that are part of the game between Brazil and Belgium at the 2018 World Cup as they were recorded in the event stream data format. This data is collected for a variety of sports by vendors such as Stats Perform who typically employ human annotators to collect the data by watching broadcast video. Optical tracking data reports the locations of all the players and the ball multiple times per second (typically between 10 and 25 Hz). This data is collected using a fixed installation in a team’s stadium using high-resolution cameras. Such a setup is expensive and typically only used in top leagues. There is now also extensive work Figure 3: Lukaku’s dribble (𝑎𝑖 ) changes the game state from on tracking solutions based on broadcast video [13, 14]. the pre-action state 𝑠𝑖 to the post-action state 𝑠𝑖+1 . Figure 2 shows a frame of tracking data. Approaches differ on how they value game states, with 2.2. Individual Performance Indicators two dominant paradigms emerging: scoring-based and win-based. Scoring-based approaches take a narrower Performance indicators for individual players usually possession-based view. These approaches value a game fall in one of two categories. The first type focuses state by estimating the probability that the team possess- on a single action such as a pass or shot. The second ing the ball will score. In soccer, this may entail looking type takes a holistic approach by developing a unifying at the near-term probability of a goal in the next 10 ac- framework that can value a wide range of action types. tions or 10 seconds [2] or the long-term probability of scoring the next goal [19]. Win-based approaches look at Single action. Single action indicators typically take valuing actions by assessing a team’s chance of winning the form of expected value-based statistics: they measure the match in each game state. That is, these approaches the expected chance that a typical player would success- look at the difference in in-game win-probability between fully execute the considered action in a specific game two consecutive game states [20, 21, 22, 23]. Such models context. For example, the aforementioned xG model in have been developed for many sports, including basket- soccer assigns a probability to each shot that represents ball [24], American football [25], ice hockey [26, 3] and its chance of directly resulting in a goal. These models rugby [27]. are learned using standard probabilistic classifiers such as logistic regression or tree ensembles from large historical datasets of shots. Each shot is described by the game con- text from when it was taken, and how this is represented 2 https://nextgenstats.nfl.com/glossary 2.3. Tactical Analyses a match or training session, some measures are invasive (e.g., blood lactate or creatine kinase). Similarly, in en- Tactics are short-term behaviors that are used to achieve durance sports such as distance running and cycling, a strategic objective such as winning or scoring a goal. monitoring athletes’ aerobic fitness levels is important, At a high level, AI/ML is used for tactical analyses in two which is often measured in terms of the maximal oxy- ways: to discover patterns and to evaluate the efficacy of gen uptake (VO2max ) [36]. However, the test to measure a tactic. this variable is extremely strenuous and disrupts training Discovering patterns is a broad task that may range regimes, so it can only be measured sporadically. from simply trying to understand where on the field cer- tain players tend to operate and who tends to pass to Credit assignment. It is often unclear why an action whom, to more complicated analyses that involve identi- succeeded or failed. For example, did a pass not reach a fying sequences of reoccurring actions. Typically, tech- teammate because the passer mishit the ball or did their niques such as clustering, non-negative matrix factoriza- teammate simply make the wrong run? Similarly, for tion, and pattern mining are used to find such reoccurring those actions that are observed, we are unsure why they behaviors [28, 29, 30]. arose. For example, does a player make a lot of tackles in Evaluating the efficacy of tactics is an equally broad a soccer match because they are covering for a teammate task that can generally be split up into two parts: evalu- who is constantly out of position? Or is the player a weak ating the efficacy of (1) a current and (2) a counterfactual defender that is being targeted by the opposing team? tactic. Assessing the efficacy of currently employed tac- tics is typically done by focusing on a specific tactic (e.g., Noisy features and/or labels. When monitoring the counterattack, pressing) and relating it to other success health status of players, teams often partially rely on indicators (e.g., goals, wins) [31, 32]. In contrast, assess- questionnaires [11] and subjective measures like the rat- ing the efficacy of counterfactual tactics is more challeng- ing of perceived exertion [12]. Players respond to such ing as it entails understanding what would happen if a questionnaires in different ways, with some being more team (or player) employed different tactics than those honest than others. There is a risk for deception (e.g., that were observed. This is extremely interesting and players want to play, and may downplay injuries). There challenging from an AI/ML and evaluation perspective are also well-known challenges when working with sub- as it involves both (1) accurately modeling the current jective data. Similarly, play-by-play data is often collected behavior of teams, and (2) reasoning in a counterfactual by human annotators, who make mistakes. Moreover, way about alternative behaviors. Such approaches have the definitions of events and actions can change over been developed in basketball and soccer to assess coun- time. terfactual shot [33, 34] and movement3 [35] tactics. Small sample sizes. There may only be limited data about teams and players. For example, a top flight soccer 3. Challenges with Evaluation team plays between 34 and 38 league games in a season and will perform between 1500 and 3000 on-the-ball ac- The nature of sports data and the tasks typically con- tions in a game.5 Even top players do not appear every sidered within sports analytics and science pose many game and sit out matches strategically for rest. challenges from an evaluation and analysis perspective. Non-stationary data. The sample size issues are com- Lack of ground truth. For many variables of interest, pounded by the fact that sports is a very non-stationary there are simply very few or even no labels, which arises setting, meaning data that is more than one or two sea- when analyzing both match and physical data. When sons old may not be relevant. On a team level, playing analyzing matches, a team’s specific tactical plan is un- styles tend to vary over time due to changes in playing known to outside observers. One can make educated and management personnel. On a player level, skills guesses on a high level, but often not for fine-grained evolve over time, often improving until a player reaches decisions. Similarly, when trying to assign ratings to their peak, prior to an age-related decline. More gener- players’ actions in a soccer match, there is no variable ally, tactics evolve and change. that directly records this. In fact, in this case, no such Counterfactuals. Many evaluation questions in sports objective rating even exists. involve reasoning about outcomes that were not ob- Physical parameters can also be difficult to collect. For served. This is most notable in the case of defense, where example, if one is interested in measuring fatigue4 during defensive tactics are often aimed at preventing dangerous 3 https://grantland.com/features/the-toronto-raptors-sportvu- 5 cameras-nba-analytical-revolution/ The number depends on what is annotated in the data (e.g., 4 Note that there are different types of fatigue that could be pressure events) and modeling choices such as whether a pass re- monitored such as musculoskeletal or cardiovascular fatigue. ceival is treated as a separate action. actions from arising such as wide-open three-point shots mance. For example, salary can be tied to draft position in the NBA or one vs. the goalie in soccer. Unfortunately,and years of service. Similarly, a soccer player’s market it is hard to know why certain actions were or were not value or transfer fee also encompasses their commercial taken. For example, it is difficult to estimate whether appeal. Even playing time is not necessarily merit-based. the goalie would have saved the shot if they had been Other work tries to associate performance and/or pres- positioned slightly differently. Similarly, evaluating tac- ence in the game with winning. This is appealing as the tics also involves counterfactual reasoning as a coach is ultimate goal is to win a game.6 For example, indicators often interested in knowing what would have happened can be based on correlating how often certain actions if another policy had been followed, such as shooting are performed with match outcomes, points scored, or more or less often from outside the penalty box in soccer.score differentials [37, 38]. An alternative approach is to build a predictive model based on the indicators and Interventions. The data is observational and teams con- see if it can be used to predict the outcomes of future stantly make decisions that affect what is observed. This matches [39]. is particularly true for injury risk assessment and load management, where the staff will alter players’ training regime if they are worried about the risk of injury. Man- 4.2. The Messi Test agers also change tactics during the course of the game, When evaluating indicators about player performance, depending on the score and the team’s performance. one advantage is that there is typically consensus on who are among the very top players. While experts, 4. Evaluating an Indicator pundits, and fans may debate the relative merits of Lionel Messi and Cristiano Ronaldo, there is little debate that A novel indicator should capture something about a they fall in the very top set of offensive players. An player’s (or team’s) performance or capabilities. Eval- offensive metric where neither of those players scores uating a novel indicator’s usefulness is difficult as it is well, is not likely to convince practitioners. In other unclear what it should be compared against. This prob- words, if a metric blatantly contradicts most conventional lem is addressed in multiple different ways in the litera- wisdom, there is likely a problem with it. This is also ture. called face validity [40]. Of course, some unexpected or more surprising names could appear towards the top of such a ranking, but one would be wary if all such names 4.1. Correlation with Existing Success were surprising. Indicators Unfortunately, this style of evaluation is most suited In all sports, a variety of indicators exist that denote to analyzing offensive contributions. In general, their whether a player (or team) is considered or perceived to is more consensus on the offensive performances of in- be good. Such indicators can be on either the individual dividual players than their defensive performances, as or team level. good defense is a collective endeavor and more heavily When evaluating individual players, there are a wealth reliant on tactics. of existing indicators that are commonly reported and used. First, there are indirect indicators such as a player’s 4.3. Make a Prediction market value, salary, playing time, or draft position. Second, there are indicators derived from competition While backtesting indicators (and models) is clearly a key such as goals and assists in soccer (or ice hockey). It is component of development, sports does offer the possi- therefore possible to design an evaluation by looking at bility for real-world predictions on unseen data. One can the correlation between each indicator’s value for every predict, and most importantly publish, the outcomes of player [20, 26, 3]. Alternatively, it is possible to produce matches or tournaments prior to their start. In fact, there two rank-ordered lists of players (or teams): one based have been several competitions designed around this on an existing success indicator and another based on a principle 7 [41] or people who have collected predictions newly designed indicator. Then the correlation between online. rankings can be computed. This is even possible for player indicators, and is Arguably, an evaluation that strives for high correla- often done in the work on quantifying player perfor- tions with existing indicators misses the point: the goal mance [2, 3]. Decroos et al. [2] included lists of the top is to design indicators that provide insights that current ones do not. If a new indicator simply yields the same 6 This is not always the case: Sometimes teams play for draws, ranking as looking at goals, then it does not provide any rest players for strategic reasons, prioritize getting young players new information. Moreover, some existing success indi- experience or try to lose to improve draft position. cators capture information that is not related to perfor- 7 https://twitter.com/TonyElHabr/status/1414619621659971588 under-21 players in several of the major European soc- cer leagues for the 2017/2018 season. It is interesting to look back on the list, and see that there were both hits and misses. For example, the list had some players who were less heralded then such as Mason Mount and Mikel Oyarzabal, who are now key players. Similarly, it had several recognized prospects such as Kylian Mbappé, Trent Alexander-Arnold, and Frenkie de Jong who have ascended. Finally, there were misses like Jonjoe Kenny and David Neres. While one has to wait, it does give an immutable forecast that can be evaluated. Because they do not allow for immediate results, such Figure 4: Pearson correlation between player performance evaluations tend to be done infrequently. However, we indicators for ten pairs of successive seasons in the English believe this is something that should be done more of- Premier League (2009/10 – 2019/20). The diamond shape in- ten. It avoids the possibility of cherry-picking results and dicates the mean correlation. The simple “minutes played” overfitting by forcing one to commit to a result with an indicator is the least reliable, while the Atomic-VAEP 8 indica- unknown outcome. This may also encourage more criti- tor is more reliable than its VAEP [2] predecessor and xT [45]. cal thinking about the utility of the developed indicator. As shots are infrequent and have a variable outcome, omitting them increases an indicator’s reliability. The xT indicator does The caveat is that the predictions must be revisited and not value shots. Only players that played at least 900 minutes discussed in the future, which also implies that publica- (the equivalent of ten games) in each of the successive seasons tion venues would be open to such submissions. Beyond are included. the time delay, another drawback is that they involve sample sizes such as one match day, one tournament, or a short list of players. 4.5. Reliability 4.4. Ask an Expert Indicators are typically developed to measure a skill or capability such as shooting ability in basketball or offen- Developed indicators and approaches can be validated sive contributions. While these skills can and do change by comparing them to an external source provided by over a longer timeframe (multiple seasons), they typi- domain experts. This goes beyond the Messi test as it cally are consistent within a season or even across two requires both deeper domain expertise and a more ex- consecutive seasons. Therefore, an indicator should take tensive evaluation such as comparing tactical patterns on similar values in such a time frame. discovered by an automated system to those found by a One approach [39, 44] to measure an indicator’s relia- video analyst. Pappalardo et al. [38] compared a player bility is to split the data set into two, and then compute ranking system they developed to rankings produced by the correlation between the indicators computed on each scouts. Similarly, Dick et al. [42] asked soccer coaches dataset. An example of such an evaluation is shown in to rate how available players were to receive a pass in Figure 4. Methodologically, one consideration is how to game situations and compared this assessment to a novel partition the available data. Typically, one is concerned indicator they developed. with respecting chronological orderings in temporal data. Ideally, such an expert-based evaluation considers as- However, in this setting, such a division is likely sub- pects beyond model accuracy. Ultimately, an indicator optimal. First, games missed by injury will be clustered should provide “value” to the workflow of practitioners. and players likely perform differently right when they Hence, it is relevant to measure how much time it saves come back. Second, the difficulty of a team’s schedule an analyst in his workflow, whether an indicator can is not uniformly spread over a season. Third, if the time provide relevant new insights and whether the expert horizon is long enough, there will be aging effects. can correctly interpret the model’s output. This type of Franks et al. [46] propose a metric to capture an indica- evaluation checks whether indicators fulfill the needs tor’s stability. It tries to assess how much an indicator’s of users (i.e., usefulness and usability) and also arises in value depends on context (e.g., a team’s tactical system, human-computer interaction [43]. quality of teammates) and changes in skill (e.g., improve- However, this type of evaluation can be difficult as not ment through practice). It does so by looking at the all researchers have access to domain experts, particularly variance of indicators using a within-season bootstrap when it comes to high-level sports. Moreover, teams want procedure. to maintain a competitive advantage, so one may not be able to publish such an evaluation. 8 https://dtai.cs.kuleuven.be/sports/blog/introducing-atomic- spadl-a-new-way-to-represent-event-stream-data/ Another approach [29] is to look at consecutive sea- a causal counterfactual (because the considered sons and pose the evaluation as a nearest neighbors prob- models are not causal models). lem. That is, based on the indicators computed from one season of data for a specific player, find a rank-ordered • Does the model behave as expected in scenarios list of the most similar players in the subsequent (or pre- where we have strong intuitions based on domain ceding) season. The robustness of the indicator is then knowledge? For example, one can analyze what related to the position of the chosen player in the ranking. values the model can predict for shots that are taken from a very tight angle or very far away from the goal. One can then check whether the 5. Evaluating a Model predictions for the generated game situations are realistic. Evaluating the models used to produce the indicator in- volves two key aspects. First, it is important to ensure Typical aggregated test metrics do not reveal the answers that the model will behave as expected on unseen data. to these questions. Nevertheless, the answers can be very This is particularly important for sports since the data can valuable because they provide insights into the model have errors or noise (e.g., incorrect annotations, sensor and can reveal problems with the model or the data. failures, errors in tracking data) and rare or unexpected We have used verification to evaluate soccer models events. Hence, one wants to reason about the model. in two novel ways. First, we show how it is possible to Second, there are standard evaluation metrics that are debug the training data and pinpoint labeling errors (or important to use to ensure, e.g., that probability estimates inconsistencies). Second, we identify scenarios where the are accurate. model produces unexpected and undesired predictions. These are shortcomings in the model itself. We use Ver- itas [51] to analyze two previously mentioned soccer 5.1. Reasoning about Learned Models analytics models: xG and the VAEP holistic action-value Verification is a powerful alternative to traditional aggre- model. gated metrics to evaluate and inspect a learned model. First, we analyzed an xG model in order to identify Verification attempts to reason about how a learned “what are the optimal locations to shoot from outside the model will behave [47, 48, 49, 50]. Given a desired tar- penalty box?”. We used Veritas to generate 200 examples get value (i.e., prediction), and possibly some constraints of shots from outside the penalty box that would have on the values that the features can take on, a verifica- the highest probability of resulting in a goal, which are tion algorithm either generates one or more instances shown as a heatmap in Figure 5. The cluster in front that satisfy the constraints, or it proves that no such in- of the goal is expected as it corresponds to the areas stance exists. This is similar to satisfiability checking. In most advantageous to shoot from. The locations near the practice, verification allows users to query a model, i.e., corners of the pitch are unexpected. We looked at the reason about the model’s possible outputs and examine shots from the 5 meter square area touching the corner what the model has learned from the data. It can be used and counted 11 shots and 8 goals, yielding an extremely to investigate how a model behaves in certain sub-areas high 72% conversion rate. This reveals an unexpected of the input space. Examples of verification questions labeling behavior by the human annotators. Given the are: distance to the goal and the tight angle, one would expect a much lower conversion rate. A plausible explanation • Is a model robust to small changes to the inputs? is that annotators are only labeling actions as a shot in For example, does a small change in the time of the rare situations where the action results in a goal or the game and the position of the ball significantly change the probability that a shot will result in a goal? This relates to adversarial examples (c.f. image recognition). • Related to the previous question, but with a dif- ferent interpretation: given a specific example of interest, can one or more attributes be (slightly) changed so that the indicator is maximized? This is often called a counterfactual explanation, e.g., if the goalie would have been positioned closer to the near post, how would that have affected Figure 5: A heatmap showing where Veritas generates in- the estimated probability of the shot resulting in stances of shots from outside the penalty box with the highest a goal? We want to emphasize that, this is not xG values. be evaluated in a number of different ways such as using reliability diagrams [52], the Brier score [53], logarith- mic loss, and the multi-class expected calibration error (ECE) [54]. It is less clear when one of these metrics may be more appropriate than another. Here, it may be worth considering if the probabilities will be summed (e.g., for computing player ratings) or multiplied (e.g., modeling decision making) [55]. It is important to remember that these metrics depend on the class distribution, and hence their values need to be interpreted in this context. This is important as scoring rates can vary by competition (e.g., Figure 6: For specific action sequences, the time remaining in men’s leagues vs. women’s leagues) [56]. the game has a large variable effect on the probability of scor- ing in the VAEP action model. This variability is unexpected and reveals a robustness issue with the model. 6. Discussion Evaluating learning systems in the context of sports is an a save. Otherwise, the actions are labeled as a pass or a extremely tricky endeavor that largely relies on expertise cross. gained through experience. On the one hand, the outputs Second, we analyzed VAEP [2], a holistic-action model of learned models are often combined in order to con- for soccer. The models underlying this indicator look at a struct novel indicators of performance, and the validity short sequence of consecutive game actions and predict of these indicators needs to be assessed. Here, we would the probability of a goal in the next 10 actions. Unlike like to caution against looking at correlations to other xG models, all possible actions (passes, dribbles, tack- success metrics as we believe that a high correlation to les, . . . ) are considered, not just shots. For the data in an existing indicator fails the central goal: gaining new an unseen test set, the model produces well-calibrated insights. We also believe that the reliability and stability probability estimates in aggregate. However, we looked of indicators is important, and should be more widely for specific scenarios where the model performs badly studied. Still, what remains the best approach for evalu- and found several instances that are technically possible, ating a specific problem is often not clear, and the field but very unlikely. More interestingly, Veritas gener- would benefit from a broader discussion of best practices. ated instances where all the values of all features were On the other hand, it is also necessary to evaluate the fixed except for the time in the match, and found that models used to construct the underlying systems and the probability of scoring varied dramatically according indicators. Here, we believe that evaluating models by to match time. Figure 6 shows this variability for one reasoning about their behavior is crucial: this changes the such instance. The probability gradually increases over focus from a purely data-based evaluation perspective time, which is not necessarily unexpected as scoring rates to one that considers the effect of the data on the model. tend to slightly increase as a match progresses. However, The ability to have insight into a model’s behavior also about 27 minutes into the first half the probability of facilitates interactions with domain experts. Critically scoring dramatically spikes. Clearly, this behavior is un- reflecting on what situations a model will work well in desirable: we would not expect such large variations. and which situations it may struggle in, helps build trust This suggests that time should probably be handled dif- and set appropriate expectations. ferently in the model, e.g., by discretizing it to make it Still, using reasoning is not a magic solution. When less fine-grained. a reasoner identifies unexpected behaviors, there are at Such an evaluation is still challenging. One has to least two possible causes. One cause is errors in the train- know what to look for, which typically requires signifi- ing data which are picked up by the model and warp the cant domain expertise or access to a domain expert. More- decision boundary in unexpected ways (e.g., Figure 5). over, the process is exploratory: there is a huge space Some errors can be found by inspecting the data, but of scenarios to consider and the questions have to be given the nature of the data, it can be challenging to know iteratively refined. where to look. The other cause is peculiarities with the model itself, the learning algorithm that constructed the model, or the biases resulting from the model represen- 5.2. Standard Metrics tation (e.g., Figure 6). Traditional evaluation metrics are Many novel indicators involve using a learned model that completely oblivious to these issues. They can only be makes probabilistic predictions, making calibration the discovered by reasoning about the model. Unfortunately, standard choice for a classical evaluation. Calibration can it remains difficult to correct a model that has picked up on an unwanted pattern. For example, the time’s effect [5] L. Shaw, S. Gopaladesikan, Routine inspection: A on the probability of scoring can only be resolved via playbook for corner kicks, in: MIT Sloan Sports representing the feature in a different way, relearning Analytics Conference, 2021. the model, and reassessing its performance. Alas, this [6] P. Bauer, G. Anzer, Data-driven detection of coun- is an iterative guess-and-check approach. We believe terpressing in professional football, Data Mining that reasoning approaches to evaluation are only in their and Knowledge Discovery 35 (2021) 2009—-2049. infancy and need to be further explored. [7] A. Miller, L. Bornn, Possession sketches: Mapping While this paper discussed evaluation in the context NBA strategies, in: MIT Sloan Sports Analytics of sports, we do feel that some of the challenges and in- Conference, 2017. sights are relevant for other application domains where [8] S. L. Halson, Monitoring training load to under- machine learning is applied. For example, evaluation stand fatigue in athletes, Sports Med 44 (2014) 139– challenges also arise in prognostics, especially when it is 147. impossible to directly collect data about a target such as [9] P. C. Bourdon, M. Cardinale, A. Murray, P. Gastin, time until failure. In both domains, we do not want to let M. Kellmann, M. C. Varley, T. J. Gabbett, A. J. Coutts, the athlete nor machine be damaged beyond repair. Also, D. J. Burgess, W. Gregson, N. T. Cable, Monitoring we perform multiple actions to avoid failure, making it athlete training loads: consensus statement, Int J difficult to attribute value to individual actions or iden- Sports Physiol Perform 12 (2017) 161–170. tify root causes. Another example is how to deal with [10] S. Green, Assessing the performance of Pre- subjective ratings provided by users, which often occurs mier League goalscorers, 2012. URL: https: when monitoring players’ fitness and was also a key issue //www.statsperform.com/resource/assessing-the- in the Netflix challenge. Finally, in terms of approaches performance-of-premier-league-goalscorers/. to evaluation, there is also more emphasis within ML in [11] M. Buchheit, Y. Cholley, P. Lambert, Psychometric general on trying to ensure the robustness of learned and physiological responses to a preseason compet- models by checking, for example, how susceptible they itive camp in the heat with a 6-hour time difference are to adversarial attacks. in elite soccer players, Int J Sports Physiol Perform 11 (2016) 176–181. [12] G. Borg, Psychophysical bases of perceived exer- Acknowledgments tion, Med sci sports exer 14 (1982) 377–381. [13] N. Johnson, Extracting player tracking data from This work was supported by iBOF/21/075, Research video using non-stationary cameras and a combina- Foundation-Flanders (EOS No. 30992574, 1SB1320N to tion of computer vision techniques, in: MIT Sloan LD) and the Flemish Government under the “Onder- Sports Analytics Conference, 2020. zoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” [14] A. Arbués Sangüesa, A journey of computer vision program. in sports: from tracking to orientation-base metrics, Ph.D. thesis, Universitat Pompeu Fabra, 2021. References [15] P. Lucey, A. Bialkowski, M. Monfort, P. Carr, I. Matthews, Quality vs quantity: Improved shot [1] L. Bransen, P. Robberechts, J. Van Haaren, J. Davis, prediction in soccer using strategic features from Choke or shine? quantifying soccer players’ abil- spatiotemporal data, in: MIT Sloan Sports Analytics ities to perform under mental pressure, in: MIT Conference, 2015. Sloan Sports Analytics Conference, 2019. [16] P. Robberechts, J. Davis, How data availability af- [2] T. Decroos, L. Bransen, J. Van Haaren, J. Davis, Ac- fects the ability to learn good xg models, in: Work- tions Speak Louder Than Goals: Valuing Player shop on Machine Learning and Data Mining for Actions in Soccer, in: Proc. of 25th ACM SIGKDD Sports Analytics, 2020, pp. 17–27. International Conference on Knowledge Discovery [17] V. Sarlis, C. Tjortjis, Sports analytics — evaluation & Data Mining, 2019, pp. 1851–1861. of basketball players and team performance, Infor- [3] G. Liu, O. Schulte, Deep reinforcement learning mation Systems 93 (2020) 101562. in ice hockey for context-aware player evaluation, [18] B. Macdonald, An expected goals model for evalu- in: Proc. of 27th Int. Joint Conference on Artificial ating nhl teams and players, in: MIT Sloan Sports Intelligence, 2018, pp. 3442–3448. Analytics Conference, 2012. [4] A. Franks, A. Miller, L. Bornn, K. Goldsberry, Coun- [19] J. Fernández, L. Bornn, D. Cervone, A framework terpoints: Advanced defensive metrics for NBA bas- for the fine-grained evaluation of the instantaneous ketball, in: MIT Sloan Sports Analytics Conference, expected value of soccer possessions, Machine 2015. Learning 110 (2021) 1389–1427. [20] S. Pettigrew, Assessing the offensive productivity of NHL players using in-game win probabilities, in: J. Davis, Leaving goals on the pitch: Evaluating MIT Sloan Sports Analytics Conference, 2015. decision making in soccer, in: MIT Sloan Sports [21] B. Burke, WPA explained, 2010. URL: http: Analytics Conference, 2021. //archive.advancedfootballanalytics.com/2010/01/ [35] H. M. Le, Y. Yue, P. Carr, P. Lucey, Coordinated win-probability-added-wpa-explained.html. multi-agent imitation learning, in: Proceedings [22] P. Robberechts, J. Van Haaren, J. Davis, A bayesian of the 34th International Conference on Machine approach to in-game win probability in soccer, in: Learning, 2017, pp. 1995–2003. Proc. of 27th ACM SIGKDD International Confer- [36] M. J. Joyner, Modeling: optimal marathon perfor- ence on Knowledge Discovery & Data Mining, 2021, mance on the basis of physiological factors, Journal pp. 3512–3521. of Applied Physiology 70 (1991) 683–687. [23] M. Bouey, NBA win probability added, 2013. [37] I. McHale, P. Scarf, D. Folker, On the development URL: https://www.inpredictable.com/2013/06/nba- of a soccer player performance rating system for win-probability-added.html. the english premier league, Interfaces 42 (2012) [24] D. Cervone, A. D’Amour, L. Bornn, K. Goldsberry, 339–351. POINTWISE: Predicting Points and Valuing Deci- [38] L. Pappalardo, P. Cintia, P. Ferragina, E. Massucco, sions in Real Time with NBA Optical Tracking Data, D. Pedreschi, F. Giannotti, Playerank: Data-driven in: MIT Sloan Sports Analytics Conference, 2014. performance evaluation and player ranking in soc- [25] D. Romer, Do firms maximize? Evidence from cer via a machine learning approach, ACM Trans. professional football, Journal of Political Economy Intell. Syst. Technol. 10 (2019) 59:1–59:27. 114 (2006) 340–365. [39] L. M. Hvattum, Offensive and defensive plus–minus [26] K. Routley, O. Schulte, A Markov game model for player ratings for soccer, Applied Sciences 10 valuing player actions in ice hockey, in: Proc. 31st (2020). Conference on Uncertainty in Artificial Intelligence, [40] A. Z. Jacobs, H. Wallach, Measurement and fairness, 2015, pp. 782–791. in: Proc. of the 2021 ACM Conference on Fairness, [27] T. Kempton, N. Kennedy, A. J. Coutts, The expected Accountability, and Transparency, 2021, p. 375–385. value of possession in professional rugby league [41] W. Dubitzky, P. Lopes, J. Davis, D. Berrar, The open match-play, Journal of sports sciences 34 (2016) international soccer database for machine learning, 645–650. Machine learning 108 (2019) 9–28. [28] Q. Wang, H. Zhu, W. Hu, Z. Shen, Y. Yao, Discerning [42] U. Dick, D. Link, U. Brefeld, Who can receive the tactical patterns for professional soccer teams: An pass? A computational model for quantifying avail- enhanced topic model with applications, in: Proc. ability in soccer, Data Mining and Knowledge Dis- of the 21th ACM SIGKDD International Conference covery (2022). on Knowledge Discovery and Data Mining, ACM, [43] W. Xu, Toward human-centered AI: A perspective 2015, pp. 2197–2206. from human-computer interaction, Interactions 26 [29] T. Decroos, J. Davis, Player vectors: Characteriz- (2019) 42–46. ing soccer players’ playing style from match event [44] M. Van Roy, P. Robberechts, T. Decroos, J. Davis, streams, in: Joint European Conference on Machine Valuing on-the-ball actions in soccer: A critical com- Learning and Knowledge Discovery in Databases, parison of xT and VAEP, in: 2020 AAAI Workshop Springer, 2019, pp. 569–584. on AI in Team Sports, 2020. [30] J. Bekkers, S. S. Dabadghao, Flow motifs in soc- [45] K. Singh, Introducing expected threat, 2019. URL: cer: What can passing behavior tell us?, Journal of https://karun.in/blog/expected-threat.html. Systems Architecture 5 (2019) 299–311. [46] A. M. Franks, A. D’Amour, D. Cervone, L. Bornn, [31] J. Fernandez-Navarro, L. Fradua, A. Zubillaga, A. P. Meta-analytics: Tools for understanding the statis- McRobert, Evaluating the effectiveness of styles of tical properties of sports metrics, Journal of Quan- play in elite soccer, International Journal of Sports titative Analysis in Sports 12 (2016) 151–165. Science & Coaching 14 (2019) 514–527. [47] M. Kwiatkowska, G. Norman, D. Parker, PRISM 4.0: [32] S. Merckx, P. Robberechts, Y. Euvrard, J. Davis, Mea- Verification of probabilistic real-time systems, in: suring the effectiveness of pressing in soccer, in: Proc. 23rd Int. Conf. on Computer Aided Verifica- Workshop on Machine Learning and Data Mining tion, 2011, pp. 585–591. for Sports Analytics, 2021. [48] S. Russell, D. Dewey, M. Tegmark, Research priori- [33] N. Sandholtz, L. Bornn, Markov decision processes ties for robust and beneficial artificial intelligence, with dynamic transition probabilities: An analysis AI Magazine 36 (2015) 105–114. of shooting strategies in basketball, Annals of App [49] A. Kantchelian, J. D. Tygar, A. Joseph, Evasion and Stat 14 (2020) 1122–1145. hardening of tree ensemble classifiers, in: Proc. [34] M. Van Roy, P. Robberechts, W.-C. Yang, L. De Raedt, of the 33rd International Conference on Machine Learning, 2016, pp. 2387–2396. [50] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochen- derfer, Reluplex: An efficient smt solver for veri- fying deep neural networks, in: Computer Aided Verification, 2017, pp. 97–117. [51] L. Devos, W. Meert, J. Davis, Versatile verification of tree ensembles, in: Proc. of the 38th International Conference on Machine Learning, 2021, pp. 2654– 2664. [52] A. Niculescu-Mizil, R. Caruana, Predicting good probabilities with supervised learning, in: Proc. of the 22nd Int. Conf. on Machine learning, 2005, p. 625–632. [53] G. W. Brier, Verification of forecasts expressed in terms of probability, Monthly weather review 78 (1950) 1–3. [54] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: Proc. of the 34th Int. Conf. on Machine Learning, 2017, pp. 1321–1330. [55] T. Decroos, J. Davis, Interpretable prediction of goals in soccer, in: AAAI 2020 Workshop on AI in Team Sports, 2020. [56] L. Pappalardo, A. Rossi, M. Natilli, P. Cintia, Ex- plaining the difference between men’s and women’s football, PLoS ONE 16 (2021).