=Paper=
{{Paper
|id=Vol-1663/bmaw2016_extended-abstract-1
|storemode=property
|title=Improving Predictive Accuracy Using Smart-Data rather than Big-Data: A Case Study of Soccer Teams’ Evolving Performance
|pdfUrl=https://ceur-ws.org/Vol-1663/bmaw2016_extended-abstract-1.pdf
|volume=Vol-1663
|authors=Anthony Constantinou,Norman Fenton
|dblpUrl=https://dblp.org/rec/conf/uai/ConstantinouF16
}}
==Improving Predictive Accuracy Using Smart-Data rather than Big-Data: A Case Study of Soccer Teams’ Evolving Performance==
Anthony Constantinou, Queen Mary University of London, London, UK, E1 4NS, a.constantinou@qmul.ac.uk

Norman Fenton, Queen Mary University of London, London, UK, E1 4NS, n.fenton@qmul.ac.uk

===EXTENDED ABSTRACT===
(this paper is published as extended abstract only)

In an era of big-data the general consensus is that relationships between variables of interest surface almost by themselves. Sufficient amounts of data can nowadays reveal new insights that would otherwise have remained unknown. Inferring knowledge from data, however, imposes further challenges. For example, the 2007-08 financial crisis revealed that big-data models used by investment banks and rating agencies for decision making failed to predict real-world financial risk. This is because while such big-data models are excellent at predicting past events, they may fail to predict similar future events that are influenced by new and hence, previously unseen factors.

In many real-world domains, experts comprehend vital influential processes which data alone may fail to discover. Yet, such knowledge is normally disregarded in favor of automated learning, even when the data are limited. While automation provides major benefits, these benefits sometimes come at a cost for accuracy. This study focuses on a prediction problem that has similarities to financial risk, namely predicting evolving soccer team performance. Soccer is the world’s most popular sport and constitutes an important share of the gambling market. Just like in financial risk, future team performance can be suddenly and dramatically affected by rarely seen, or previously unseen, events and so both require smarter ways of data engineering and modeling, rather than just larger amounts of data.

Most of the previous extensive work on soccer has focused on results predictions based on historical data of relevant match instances. In this study we do not consider individual match results, but rather exploit external factors which may influence the strength of a team and its resulting performance. The aim is to predict a soccer team’s performance for a whole season (measured by total number of league points won) before the season starts. This is an important and enormous gambling market in itself - betters start placing bets such as which team will win the title, finish in top positions, or be relegated, as soon as the previous season ends. The need for greater accuracy in such predictions has become the subject of international interest following the 2015-16 English Premier League (EPL) season when Leicester City finished top of the league, having been priced at 5,000 to 1 to do so by many bookmakers.

We use a data and knowledge engineering approach that puts greater emphasis on applying causal knowledge and real-world ‘facts’ to the process of model development for real-world decision making, driven by what data are really required for inference, rather than blindly seeking ‘bigger’ data. We refer to this as the ‘smart data’ approach. We use a Bayesian network (BN) as the appropriate modelling method. Based on the soccer case study, we illustrate the reasoning towards this smart-data approach to BN modeling with two subsystems:

1. A knowledge-based intervention for informing the model about real-world time-series facts; and

2. A knowledge-based intervention for data-engineering purposes to ensure data adhere to the structure of the model.

The BN model incorporates factors such as player injuries, managerial changes, team involvement in other European competitions, and financial investments relative [1] to adversaries. The BN model is based on three distinct time components:

1. Observed events from previous season that have influenced team performance;

2. Observed events during the summer break that are expected to influence team performance;

3. Expected performance for next season, accounting for the uncertainty which arises from other unknown events which may influence team performance, such as injuries.

This process is repeated for each new season, for a total of 15 seasons. This approach enabled us to provide far more accurate predictions compared to purely data-driven standard non-linear regression models, which still represent the standard method for prediction in critical real-world risk assessment problems, such as in medical decision analysis (Kendrick, 2014). Specifically, we demonstrate how we managed to generate accurate predictions of the evolving performance of soccer teams based on limited data that enables us to predict, before a season starts, the total league points to be accumulated. Predictive validation over a series of 15 EPL seasons demonstrates a mean error of 4.06 points (the possible range of points a team can achieve is 0 to 114). In contrast, for two different regression based methods, the mean errors are 7.27 and 7.30.

The implications of the paper are two-fold. First, with respect to the application domain, the current state-of-the-art is extended as follows:

1. This is the first study to present a model for accurate time-series forecasting in terms of how the strength of soccer teams evolves over adjacent soccer seasons, without the need to generate predictions for individual matches.

2. Previously published match-by-match prediction models (some of them include: Karlis & Ntzoufras, 2003; Rotshtein et al., 2005; Baio & Blangiardo, 2010; Hvattum & Arntzen, 2010; Constantinou & Fenton, 2012; Constantinou & Fenton, 2013b) which fail to account for the external factors influencing team strength, are prone to an error of 8.51 [2] league points accumulated per team, in terms of prior belief for team strength, and for each subsequent season. Therefore, one could improve match-by-match predictions by reducing the error in terms of prior belief.

3. Studies which assess the efficiency of the soccer gambling market (Dixon & Pope, 2004; Goddard & Asimakopoulos, 2004; Graham & Stott, 2008; Constantinou & Fenton, 2013b) may find the BN model helpful in the sense that it could help in explaining previously unexplained fluctuations in published market odds.

Second, with respect to the general strategy for learning from data, we demonstrate that seeking ‘bigger’ data is not always the path to follow. The model presented in this paper, for instance, is based on just 300 data instances generated over a period of 15 years. With a smart-data approach, one should aim to improve the quality, as opposed to the quantity, of a dataset which also directly influences the quality of the model. We highlight the importance of developing models based on what data we really require for inference, rather than generating a model based on what data are available which represents the conventional approach to big-data solutions. With smart-data one has to have a clear understanding of the inferences of interest. Inferring knowledge from data imposes further challenges and requires skills that merge the quantitative as well as qualitative aspects of data.

For future research, we question whether automated learning of the available data is capable of inferring real-world facts such as those incorporated into the BN model presented in this paper. It may be the case that, for many real-world problems, resulting inferences will be limited in the absence of expert intervention for data engineering as well as modeling purposes. Future research will examine the capability of causal discovery algorithms in terms of realizing various real-world facts from data, and the impact various data-engineering interventions may have on the results.

Keywords: data engineering; dynamic Bayesian networks; expert systems; football predictions; smart data; soccer predictions; temporal Bayesian networks.

[1] Team A may spend £20m to improve their squad, but if the average adversary spends £30m, then the strength of Team A is expected to diminish relative to the average adversary.

[2] Note that this error assumes EPL teams, and is dependent on the size of the league. For instance, the EPL consists of 20 teams and each team has to play 38 matches. Hence, the maximum possible accumulation of points is 114.

===ACKNOWLEDGEMENTS===
We acknowledge the financial support by the European Research Council (ERC) for funding this research project, ERC-2013-AdG339182-BAYES_KNOWLEDGE, and Agena Ltd for software support.

===REFERENCES===
Baio, G., & Blangiardo, M. (2010). Bayesian hierarchical model for the prediction of football results. Journal of Applied Statistics, 37: 2, 253-264.

Constantinou, A., Fenton, N., & Neil, M. (2012). pi-football: A Bayesian network model for forecasting Association Football match outcomes. Knowledge-Based Systems, 36: 322-339.

Constantinou, A., & Fenton, N. (2013a). Profiting from an inefficient Association Football gambling market: Prediction risk and Uncertainty using Bayesian networks. Knowledge-Based Systems, 50: 60-86.

Constantinou, A., & Fenton, N. (2013b). Profiting from arbitrage and odds biases of the European football gambling market. The Journal of Gambling Business and Economics, Vol. 7, 2: 41-70.

Dixon, M., & Pope, P. (2004). The value of statistical forecasts in the UK association football betting market. International Journal of Forecasting, 20, 697-711.

Goddard, J., & Asimakopoulos, I. (2004). Forecasting Football Results and the Efficiency of Fixed-odds Betting. Journal of Forecasting, 23, 51-66.

Graham, I., & Stott, H. (2008). Predicting bookmaker odds and efficiency for UK football. Applied Economics, 40, 99-109.

Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match result prediction in association football. International Journal of Forecasting, 26, 460-470.

Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. The Statistician, 52: 3, 381-393.

Kendrick, M. (2014). Doctoring Data: How to sort out medical advice from medical nonsense. UK, Columbus Publishing.

Rotshtein, A., Posner, M., & Rakytyanska, A. (2005). Football predictions based on a fuzzy model with genetic and neural tuning. Cybernetics and Systems Analysis, 41: 4, 619-630.

BMAW 2016 - Page 54 of 59
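
The point-based figures quoted in the abstract (a maximum of 114 points per season, mean errors of 4.06 versus 7.27 and 7.30, and the relative spending example in footnote 1) can be illustrated with a minimal Python sketch. It assumes the reported "mean error" is a mean absolute error in league points, which the abstract does not state. The function names and sample numbers are hypothetical and are not taken from the authors' model or data.

```python
# Illustrative sketch only. Assumption: "mean error" is treated as a mean
# absolute error in league points; the abstract does not name the metric.
# All function names and sample values below are hypothetical.

POINTS_PER_WIN = 3        # standard league scoring: 3 points for a win
MATCHES_PER_SEASON = 38   # EPL: 20 teams, each team plays 38 matches


def max_season_points(matches=MATCHES_PER_SEASON, points_per_win=POINTS_PER_WIN):
    """Upper bound on the points one team can accumulate in a season."""
    return matches * points_per_win


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual season points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_investment(team_spend, average_adversary_spend):
    """Spend relative to the average adversary (same units as the inputs).

    A negative value means the team invested less than its average
    adversary, so its relative strength would be expected to diminish.
    """
    return team_spend - average_adversary_spend


if __name__ == "__main__":
    print(max_season_points())                        # 38 matches * 3 points
    print(mean_absolute_error([60, 45], [64, 43]))    # made-up predictions
    print(relative_investment(20, 30))                # footnote 1, in £m
```

With the footnote 1 figures (£20m against an average of £30m), the relative investment is negative, which matches the statement that Team A's strength is expected to diminish relative to the average adversary.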