<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Predictive Accuracy Using Smart-Data rather than Big-Data: A Case Study of Soccer Teams' Evolving Performance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Constantinou</string-name>
          <email>a.constantinou@qmul.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norman Fenton</string-name>
          <email>n.fenton@qmul.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary University of London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
          ,
          <addr-line>E1 4NS</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>(This paper is published as an extended abstract only.) In an era of big data, the general consensus is that relationships between variables of interest surface almost by themselves. Sufficient amounts of data can nowadays reveal new insights that would otherwise have remained unknown. Inferring knowledge from data, however, imposes further challenges. For example, the 2007-08 financial crisis revealed that big-data models used by investment banks and rating agencies for decision making failed to predict real-world financial risk. This is because, while such big-data models are excellent at predicting past events, they may fail to predict similar future events that are influenced by new, and hence previously unseen, factors. In many real-world domains, experts comprehend vital influential processes which data alone may fail to discover. Yet such knowledge is normally disregarded in favor of automated learning, even when the data are limited. While automation provides major benefits, these benefits sometimes come at a cost to accuracy. This study focuses on a prediction problem that has similarities to financial risk, namely predicting evolving soccer team performance. Soccer is the world's most popular sport and constitutes an important share of the gambling market. Just as in financial risk, future team performance can be suddenly and dramatically affected by rarely seen, or previously unseen, events, and so both require smarter ways of data engineering and modeling, rather than just larger amounts of data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Extended Abstract</title>
      <p>
Most of the extensive previous work on soccer has focused on
predicting results from historical data of relevant match
instances. In this study we do not consider individual match
results, but rather exploit external factors which may
influence the strength of a team and its resulting
performance. The aim is to predict a soccer team’s
performance for a whole season (measured by total number of
league points won) before the season starts. This is an
important and enormous gambling market in itself: bettors
start placing bets, such as which team will win the title, finish
in top positions, or be relegated, as soon as the previous
season ends. The need for greater accuracy in such
predictions has become the subject of international interest
following the 2015-16 English Premier League (EPL) season
when Leicester City finished top of the league, having been
priced at 5,000 to 1 to do so by many bookmakers.
We use a data and knowledge engineering approach that puts
greater emphasis on applying causal knowledge and
real-world ‘facts’ to the process of model development for
real-world decision making, driven by what data are really
required for inference, rather than blindly seeking ‘bigger’
data. We refer to this as the ‘smart data’ approach. We use a
Bayesian network (BN) as the modeling method.
Based on the soccer case study, we illustrate the reasoning
behind this smart-data approach to BN modeling with two
subsystems:</p>
      <list list-type="order">
        <list-item>
          <p>A knowledge-based intervention for informing the model
about real-world time-series facts; and</p>
        </list-item>
        <list-item>
          <p>A knowledge-based intervention for data-engineering
purposes to ensure data adhere to the structure of the
model.</p>
        </list-item>
      </list>
      <p>The BN model incorporates factors such as player injuries,
managerial changes, team involvement in other European
competitions, and financial investments relative<sup>1</sup> to
adversaries. The BN model is based on three distinct time
components:</p>
      <list list-type="order">
        <list-item>
          <p>Observed events from the previous season that have
influenced team performance;</p>
        </list-item>
        <list-item>
          <p>Observed events during the summer break that are
expected to influence team performance;</p>
        </list-item>
        <list-item>
          <p>Expected performance for the next season, accounting for the
uncertainty which arises from other unknown events
which may influence team performance, such as injuries.</p>
        </list-item>
      </list>
      <p>
        This process is repeated for each new season, for a total of 15
seasons. This approach enabled us to provide far more
accurate predictions than purely data-driven standard
non-linear regression models, which still represent the
standard method for prediction in critical real-world risk
assessment problems, such as in medical decision analysis
        <xref ref-type="bibr" rid="ref10">(Kendrick, 2014)</xref>
        . Specifically, we demonstrate how we
generated accurate predictions of the evolving
performance of soccer teams from limited data,
predicting, before a season starts, the total league
points to be accumulated. Predictive validation over a series
of 15 EPL seasons demonstrates a mean error of 4.06 points
(the possible range of points a team can achieve is 0 to 114).
In contrast, for two different regression-based methods, the
mean errors are 7.27 and 7.30.
      </p>
      <p><sup>1</sup> Team A may spend £20m to improve their squad, but if the average
adversary spends £30m, then the strength of Team A is expected to
diminish relative to the average adversary.</p>
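      <p>The mean error quoted for the predictive validation can be illustrated with a minimal sketch. It assumes the error is the mean absolute difference, in league points, between the pre-season forecast and the end-of-season total; the text does not spell out the exact definition. The function name and the four forecast and outcome values are hypothetical and are not taken from the study.</p>
      <preformat>
```python
from statistics import mean

MATCHES_PER_SEASON = 38  # from the text: each EPL team plays 38 matches
POINTS_PER_WIN = 3       # a win is worth three league points
MAX_POINTS = MATCHES_PER_SEASON * POINTS_PER_WIN  # 114, the upper bound quoted in the text


def mean_absolute_error(predicted, actual):
    """Average absolute gap, in league points, between forecasts and outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one forecast is required")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


# Hypothetical pre-season forecasts and end-of-season totals for four teams.
# These numbers are invented for illustration and are NOT from the study.
predicted_points = [78.0, 61.5, 45.0, 38.5]
actual_points = [81, 57, 47, 33]

error = mean_absolute_error(predicted_points, actual_points)
print(f"Mean absolute error: {error:.2f} points (scale 0 to {MAX_POINTS})")
```
      </preformat>
      <p>Under this reading, a mean error of 4.06 means that a season forecast misses the final points total by about four points on average, on a scale that runs from 0 to 114.</p>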
      <p>The implications of the paper are twofold. First, with respect
to the application domain, the current state of the art is
extended as follows:</p>
      <p>This is the first study to present a model for accurate
time-series forecasting in terms of how the strength of
soccer teams evolves over adjacent soccer seasons,
without the need to generate predictions for individual
matches.</p>
      <p>
        Previously published match-by-match prediction models
        <xref ref-type="bibr" rid="ref1 ref11 ref2 ref3 ref4 ref8 ref9">(some of them include: Karlis &amp; Ntzoufras, 2003;
Rotshtein et al., 2005; Baio &amp; Blangiardo, 2010;
Hvattum &amp; Arntzen, 2010; Constantinou &amp; Fenton,
2012; Constantinou &amp; Fenton, 2013b)</xref>
        , which fail to
account for the external factors influencing team
strength, are prone to an error of 8.51<sup>2</sup> league points
accumulated per team, in terms of prior belief for team
strength, and for each subsequent season. Therefore, one
could improve match-by-match predictions by reducing
the error in this prior belief.
      </p>
      <p>
        Studies which assess the efficiency of the soccer gambling
market
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">(Dixon &amp; Pope, 2004; Goddard &amp;
Asimakopoulos, 2004; Graham &amp; Stott, 2008;
Constantinou &amp; Fenton, 2013b)</xref>
        may find the BN model
helpful for explaining
previously unexplained fluctuations in published market
odds.
      </p>
      <p>Second, with respect to the general strategy for learning from
data, we demonstrate that seeking ‘bigger’ data is not always
the path to follow. The model presented in this paper, for
instance, is based on just 300 data instances generated over a
period of 15 years. With a smart-data approach, one should
aim to improve the quality, as opposed to the quantity, of a
dataset, which also directly influences the quality of the
model. We highlight the importance of developing models
based on what data we really require for inference, rather
than generating a model based on what data are available,
which represents the conventional approach to big-data
solutions. With smart data, one has to have a clear
understanding of the inferences of interest. Inferring
knowledge from data imposes further challenges and requires
skills that merge the quantitative as well as qualitative
aspects of data.</p>
      <p><sup>2</sup> Note that this error assumes EPL teams, and is dependent on the size
of the league. For instance, the EPL consists of 20 teams and each
team has to play 38 matches. Hence, the maximum possible
accumulation of points is 114.</p>
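      <p>The league-size arithmetic behind the figures in the text (38 matches, a 114-point maximum, 300 data instances) can be sketched as follows. The sketch assumes a double round-robin format with three points per win, and it assumes one data instance per team per season, which is consistent with 20 teams over 15 seasons but is not stated explicitly in the text. The function names and the 18-team comparison league are hypothetical.</p>
      <preformat>
```python
def matches_per_team(teams):
    """Matches each team plays in a double round-robin (home and away)."""
    if 2 > teams:
        raise ValueError("a league needs at least two teams")
    return 2 * (teams - 1)


def max_season_points(teams, points_per_win=3):
    """Upper bound on the league points one team can accumulate in a season."""
    return matches_per_team(teams) * points_per_win


def error_as_share_of_range(error_points, teams):
    """Express an error in league points as a fraction of the possible range."""
    return error_points / max_season_points(teams)


EPL_TEAMS = 20  # from the text
SEASONS = 15    # from the text

print(matches_per_team(EPL_TEAMS))   # 38 matches per team
print(max_season_points(EPL_TEAMS))  # 114 points at most
print(EPL_TEAMS * SEASONS)           # 300 team-season instances (assumed mapping)

# Mean errors reported in the text, shown relative to the 0 to 114 range.
for label, err in [("BN model", 4.06), ("regression", 7.27)]:
    share = error_as_share_of_range(err, EPL_TEAMS)
    print(f"{label}: {err} points is {share:.1%} of the range")

# Hypothetical 18-team league: the range shrinks, so the same error weighs more.
print(max_season_points(18))
```
      </preformat>
      <p>Because the maximum depends on the number of teams, an error expressed in league points is only comparable across leagues after normalizing by the possible range in this way.</p>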
      <p>For future research, we question whether automated learning
from the available data is capable of inferring real-world facts
such as those incorporated into the BN model presented in
this paper. It may be the case that, for many real-world
problems, resulting inferences will be limited in the absence of
expert intervention for data engineering as well as modeling
purposes. Future research will examine the capability of
causal discovery algorithms in terms of realizing various
real-world facts from data, and the impact various
data-engineering interventions may have on the results.</p>
      <p>Keywords: data engineering; dynamic Bayesian networks;
expert systems; football predictions; smart data; soccer
predictions; temporal Bayesian networks.</p>
      <p>ACKNOWLEDGEMENTS</p>
      <p>We acknowledge the financial support of the European
Research Council (ERC) for this research project,
ERC-2013-AdG339182-BAYES_KNOWLEDGE, and Agena
Ltd for software support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baio</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Blangiardo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Bayesian hierarchical model for the prediction of football results</article-title>
          .
          <source>Journal of Applied Statistics</source>
          ,
          <volume>37</volume>
          :
          <issue>2</issue>
          ,
          <fpage>253</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Constantinou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fenton</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Neil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>pi-football: A Bayesian network model for forecasting Association Football match outcomes</article-title>
          .
          <source>Knowledge-Based Systems</source>
          ,
          <volume>36</volume>
          :
          <fpage>322</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Constantinou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Fenton</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2013a</year>
          ).
          <article-title>Profiting from an inefficient Association Football gambling market: Prediction risk and Uncertainty using Bayesian networks</article-title>
          .
          <source>Knowledge-Based Systems</source>
          ,
          <volume>50</volume>
          :
          <fpage>60</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Constantinou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Fenton</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2013b</year>
          ).
          <article-title>Profiting from arbitrage and odds biases of the European football gambling market</article-title>
          .
          <source>The Journal of Gambling Business and Economics</source>
          , Vol.
          <volume>7</volume>
          ,
          <issue>2</issue>
          :
          <fpage>41</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Dixon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pope</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>The value of statistical forecasts in the UK association football betting market</article-title>
          .
          <source>International Journal of Forecasting</source>
          ,
          <volume>20</volume>
          ,
          <fpage>697</fpage>
          -
          <lpage>711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Goddard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Asimakopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Forecasting Football Results and the Efficiency of Fixed-odds Betting</article-title>
          .
          <source>Journal of Forecasting</source>
          ,
          <volume>23</volume>
          ,
          <fpage>51</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stott</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Predicting bookmaker odds and efficiency for UK football</article-title>
          .
          <source>Applied Economics</source>
          ,
          <volume>40</volume>
          ,
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Hvattum</surname>
            ,
            <given-names>L. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Arntzen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Using ELO ratings for match result prediction in association football</article-title>
          .
          <source>International Journal of Forecasting</source>
          ,
          <volume>26</volume>
          ,
          <fpage>460</fpage>
          -
          <lpage>470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Karlis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ntzoufras</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Analysis of sports data by using bivariate Poisson models</article-title>
          .
          <source>The Statistician</source>
          ,
          <volume>52</volume>
          :
          <issue>3</issue>
          ,
          <fpage>381</fpage>
          -
          <lpage>393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Kendrick</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Doctoring Data: How to sort out medical advice from medical nonsense</article-title>
          . UK, Columbus Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Rotshtein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rakytyanska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Football predictions based on a fuzzy model with genetic and neural tuning</article-title>
          .
          <source>Cybernetics and Systems Analysis</source>
          ,
          <volume>41</volume>
          :
          <issue>4</issue>
          ,
          <fpage>619</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>