=Paper= {{Paper |id=Vol-1663/bmaw2016_extended-abstract-1 |storemode=property |title=Improving Predictive Accuracy Using Smart-Data rather than Big-Data: A Case Study of Soccer Teams’ Evolving Performance |pdfUrl=https://ceur-ws.org/Vol-1663/bmaw2016_extended-abstract-1.pdf |volume=Vol-1663 |authors=Anthony Constantinou,Norman Fenton |dblpUrl=https://dblp.org/rec/conf/uai/ConstantinouF16 }} ==Improving Predictive Accuracy Using Smart-Data rather than Big-Data: A Case Study of Soccer Teams’ Evolving Performance== https://ceur-ws.org/Vol-1663/bmaw2016_extended-abstract-1.pdf
  Improving Predictive Accuracy Using Smart-Data rather than
 Big-Data: A Case Study of Soccer Teams’ Evolving Performance


                            Anthony Constantinou                                  Norman Fenton
                         Queen Mary University of London,                  Queen Mary University of London,
                               London, UK, E1 4NS                                London, UK, E1 4NS
                            a.constantinou@qmul.ac.uk                            n.fenton@qmul.ac.uk



EXTENDED ABSTRACT

(this paper is published as extended abstract only)

In an era of big-data the general consensus is that relationships between
variables of interest surface almost by themselves. Sufficient amounts of data
can nowadays reveal new insights that would otherwise have remained unknown.
Inferring knowledge from data, however, imposes further challenges. For
example, the 2007-08 financial crisis revealed that big-data models used by
investment banks and rating agencies for decision making failed to predict
real-world financial risk. This is because, while such big-data models are
excellent at predicting past events, they may fail to predict similar future
events that are influenced by new, and hence previously unseen, factors.

In many real-world domains, experts comprehend vital influential processes
which data alone may fail to discover. Yet, such knowledge is normally
disregarded in favor of automated learning, even when the data are limited.
While automation provides major benefits, these benefits sometimes come at a
cost to accuracy. This study focuses on a prediction problem that has
similarities to financial risk, namely predicting evolving soccer team
performance. Soccer is the world’s most popular sport and constitutes an
important share of the gambling market. Just like in financial risk, future
team performance can be suddenly and dramatically affected by rarely seen, or
previously unseen, events, and so both require smarter ways of data
engineering and modeling, rather than just larger amounts of data.

Most of the previous extensive work on soccer has focused on predicting
results from historical data of relevant match instances. In this study we do
not consider individual match results, but rather exploit external factors
which may influence the strength of a team and its resulting performance. The
aim is to predict a soccer team’s performance for a whole season (measured by
total number of league points won) before the season starts. This is an
important and enormous gambling market in itself: bettors start placing bets
on outcomes such as which team will win the title, finish in the top
positions, or be relegated, as soon as the previous season ends. The need for
greater accuracy in such predictions has become the subject of international
interest following the 2015-16 English Premier League (EPL) season, when
Leicester City finished top of the league, having been priced at 5,000 to 1 to
do so by many bookmakers.

We use a data and knowledge engineering approach that puts greater emphasis on
applying causal knowledge and real-world ‘facts’ to the process of model
development for real-world decision making, driven by what data are really
required for inference, rather than blindly seeking ‘bigger’ data. We refer to
this as the ‘smart data’ approach. We use a Bayesian network (BN) as the
appropriate modeling method. Based on the soccer case study, we illustrate the
reasoning towards this smart-data approach to BN modeling with two subsystems:

1.     A knowledge-based intervention for informing the model about
       real-world time-series facts; and
2.     A knowledge-based intervention for data-engineering purposes to ensure
       data adhere to the structure of the model.

The BN model incorporates factors such as player injuries, managerial changes,
team involvement in other European competitions, and financial investments
relative[1] to adversaries. The BN model is based on three distinct time
components:

1.     Observed events from the previous season that have influenced team
       performance;
2.     Observed events during the summer break that are expected to influence
       team performance;
3.     Expected performance for the next season, accounting for the
       uncertainty which arises from other unknown events which may influence
       team performance, such as injuries.

This process is repeated for each new season, for a total of 15 seasons. This
approach enabled us to provide far more accurate predictions compared to
purely data-driven standard

[1] Team A may spend £20m to improve their squad, but if the average adversary
    spends £30m, then the strength of Team A is expected to diminish relative
    to the average adversary.


                                                      BMAW 2016 - Page 54 of 59
non-linear regression models, which still represent the standard method for
prediction in critical real-world risk assessment problems, such as in medical
decision analysis (Kendrick, 2014). Specifically, we demonstrate how we
managed to generate accurate predictions of the evolving performance of soccer
teams based on limited data that enable us to predict, before a season starts,
the total league points to be accumulated. Predictive validation over a series
of 15 EPL seasons demonstrates a mean error of 4.06 points (the possible range
of points a team can achieve is 0 to 114). In contrast, for two different
regression-based methods, the mean errors are 7.27 and 7.30.

The implications of the paper are two-fold. First, with respect to the
application domain, the current state of the art is extended as follows:

1.     This is the first study to present a model for accurate time-series
       forecasting in terms of how the strength of soccer teams evolves over
       adjacent soccer seasons, without the need to generate predictions for
       individual matches.

2.     Previously published match-by-match prediction models (including
       Karlis & Ntzoufras, 2003; Rotshtein et al., 2005; Baio & Blangiardo,
       2010; Hvattum & Arntzen, 2010; Constantinou et al., 2012;
       Constantinou & Fenton, 2013b), which fail to account for the external
       factors influencing team strength, are prone to an error of 8.51[2]
       league points accumulated per team, in terms of prior belief for team
       strength, for each subsequent season. Therefore, one could improve
       match-by-match predictions by reducing the error in terms of prior
       belief.

3.     Studies which assess the efficiency of the soccer gambling market
       (Dixon & Pope, 2004; Goddard & Asimakopoulos, 2004; Graham & Stott,
       2008; Constantinou & Fenton, 2013b) may find the BN model helpful in
       the sense that it could help in explaining previously unexplained
       fluctuations in published market odds.

Second, with respect to the general strategy for learning from data, we
demonstrate that seeking ‘bigger’ data is not always the path to follow. The
model presented in this paper, for instance, is based on just 300 data
instances generated over a period of 15 years. With a smart-data approach, one
should aim to improve the quality, as opposed to the quantity, of a dataset,
which also directly influences the quality of the model. We highlight the
importance of developing models based on what data we really require for
inference, rather than generating a model based on what data are available,
which represents the conventional approach to big-data solutions. With
smart-data one has to have a clear understanding of the inferences of
interest. Inferring knowledge from data imposes further challenges and
requires skills that merge the quantitative as well as qualitative aspects of
data.

For future research, we question whether automated learning of the available
data is capable of inferring real-world facts such as those incorporated into
the BN model presented in this paper. It may be the case that, for many
real-world problems, resulting inferences will be limited in the absence of
expert intervention for data engineering as well as modeling purposes. Future
research will examine the capability of causal discovery algorithms in terms
of realizing various real-world facts from data, and the impact various
data-engineering interventions may have on the results.

Keywords: data engineering; dynamic Bayesian networks; expert systems;
football predictions; smart data; soccer predictions; temporal Bayesian
networks.

ACKNOWLEDGEMENTS

We acknowledge the financial support of the European Research Council (ERC)
for this research project, ERC-2013-AdG339182-BAYES_KNOWLEDGE, and Agena Ltd
for software support.

REFERENCES

Baio, G., & Blangiardo, M. (2010). Bayesian hierarchical model for the
    prediction of football results. Journal of Applied Statistics, 37(2),
    253-264.
Constantinou, A., Fenton, N., & Neil, M. (2012). pi-football: A Bayesian
    network model for forecasting Association Football match outcomes.
    Knowledge-Based Systems, 36, 322-339.
Constantinou, A., & Fenton, N. (2013a). Profiting from an inefficient
    Association Football gambling market: Prediction risk and Uncertainty
    using Bayesian networks. Knowledge-Based Systems, 50, 60-86.
Constantinou, A., & Fenton, N. (2013b). Profiting from arbitrage and odds
    biases of the European football gambling market. The Journal of Gambling
    Business and Economics, 7(2), 41-70.
Dixon, M., & Pope, P. (2004). The value of statistical forecasts in the UK
    association football betting market. International Journal of
    Forecasting, 20, 697-711.
Goddard, J., & Asimakopoulos, I. (2004). Forecasting Football Results and the
    Efficiency of Fixed-odds Betting. Journal of Forecasting, 23, 51-66.
Graham, I., & Stott, H. (2008). Predicting bookmaker odds and efficiency for
    UK football. Applied Economics, 40, 99-109.
Hvattum, L. M., & Arntzen, H. (2010). Using ELO ratings for match result
    prediction in association football. International Journal of
    Forecasting, 26, 460-470.
Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using
    bivariate Poisson models. The Statistician, 52(3), 381-393.
Kendrick, M. (2014). Doctoring Data: How to sort out medical advice from
    medical nonsense. UK: Columbus Publishing.
Rotshtein, A., Posner, M., & Rakytyanska, A. (2005). Football predictions
    based on a fuzzy model with genetic and neural tuning. Cybernetics and
    Systems Analysis, 41(4), 619-630.

[2] Note that this error assumes EPL teams, and is dependent on the size of
    the league. For instance, the EPL consists of 20 teams and each team has
    to play 38 matches. Hence, the maximum possible accumulation of points
    is 114.
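
The points ceiling given in footnote 2 and the reported mean errors can be
checked with a few lines of arithmetic. The sketch below is illustrative only
and is not the authors' code. It assumes three points per win (the standard
league scoring rule, which the abstract does not state), and it assumes that
"mean error" means the mean absolute difference between predicted and actual
season points, which the abstract does not define. All function names are
hypothetical.

```python
# Illustrative sketch only; not the authors' code.
POINTS_PER_WIN = 3       # assumption: standard league scoring rule
MATCHES_PER_TEAM = 38    # from footnote 2: 20 teams, 38 matches each


def max_season_points(matches=MATCHES_PER_TEAM, points_per_win=POINTS_PER_WIN):
    """Upper bound on season points if a team wins every match."""
    return matches * points_per_win


def mean_absolute_error(predicted, actual):
    """Assumed reading of 'mean error': average |predicted - actual| points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(model_error, baseline_error):
    """Fraction by which model_error is lower than baseline_error."""
    return 1 - model_error / baseline_error


if __name__ == "__main__":
    print(max_season_points())  # 114, matching footnote 2
    # Reported mean errors: 4.06 (BN model) vs 7.27 and 7.30 (regression).
    for baseline in (7.27, 7.30):
        print(f"{relative_reduction(4.06, baseline):.1%}")
```

Under these assumptions, the reported BN error of 4.06 points is roughly 44%
lower than either regression baseline.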

