<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Artificial Neural Network-based Prediction Model for Underdog Teams in NBA Matches</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Camerino, School of Science and Technology, Computer Science Division</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we present an artificial neural network-based prediction model for underdog teams in NBA matches (ANNUT). We describe the steps of our supervised algorithm, from data acquisition to prediction selection. We talk about prediction selection because the final stage of our model is a filtering phase, in which the outputs returned by the neural network are evaluated according to how the events are quoted by one of the most famous bookmakers. Experimental results show that the model is able to select winning teams with good accuracy. In particular, it reaches excellent results when we restrict the selection to underdogs (teams which probably will not win). Furthermore, we show that a meaningful sports prediction model cannot ignore the bookmaker's odds.</p>
      </abstract>
      <kwd-group>
        <kwd>sports prediction model</kwd>
        <kwd>artificial neural network</kwd>
        <kwd>bookmaker's odd analysis</kwd>
        <kwd>supervised classification</kwd>
        <kwd>basketball</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The world of sports betting is a real jungle: there exists a huge number of bookmakers and prediction models for every sport. In this paper, we deal with outcome predictions for basketball events. We consider the most famous basketball league, the National Basketball Association (NBA), played in North America. A notable strength of the NBA is its incredible amount of matches: during the regular season, each team plays 82 games, 41 at home and 41 away. Furthermore, most of those matches end with a small winning gap, which makes the NBA one of the most exciting leagues in all sports. Due to this success, bookmakers and bettors follow the NBA with extreme interest. These are some of the reasons why we chose the NBA; in addition, we can count on a well-updated statistics database provided directly by the NBA.</p>
      <p>In this work, we present a prediction model based on the training and application of artificial neural networks (ANNs), with the final aim of predicting whether the home or the away team is going to win the match. In detail, we show why not all predictions are equal: we add another ingredient, the bookmaker's odds. We describe all the needed analysis steps and the experimental results. Using classical techniques, we train the ANN with data related to the last regular season, ended in April 2017, and we show its performance. Moreover, we test our ANNUT model on data concerning the last part of the NBA regular season, showing its worth also in practice, i.e., money gained by betting on these predictions. We show how our model, according to experimental results, is able to select underdog teams which have an underestimated odd (according to the ANN output).</p>
      <p>The paper is organized in the following way. In Section II, we report a brief background on neural networks and machine learning in general, and we provide a description of sports betting. In Section III, we describe the state of the art and similar works. In Section IV, we explain in detail the phases of our analysis, from data acquisition to outputs. In Section V, we present the experimental results of applying our prediction model to matches of the last part of the regular season. Finally, in Section VI, we conclude the paper, also with some possible future works.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section, we briefly introduce machine learning, in particular data classification. Furthermore, we provide a description of the main actors involved in a betting scenario.</p>
      <sec id="sec-2-1">
        <title>Machine Learning: Data Classification</title>
        <p>
          Machine learning [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is the subfield of computer science concerned with the
ability of a computing system to learn without being explicitly programmed. In
machine learning, classification is the problem of identifying to which of a set of
categories (sub-populations) a new observation belongs, on the basis of a
training set of data containing observations whose category membership is known
(supervised learning [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]).
        </p>
        <p>
          Artificial neural networks (ANNs) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] are computing systems inspired by the
biological neural networks that constitute animal brains. Such systems learn
(progressively improve performance) to do tasks by considering examples,
generally without task-specific programming. For example, in image recognition,
they might learn to identify images that contain cats by analysing example
images that have been manually labelled as "cat" or "no cat" and using the analytic
results to identify cats in other images.
        </p>
        <p>
          In particular, a feed-forward neural network is an artificial neural network in which
connections between the units do not form a cycle. In this network, the
information moves in only one direction, forward, from the input nodes, through the
hidden nodes (if any) and to the output nodes. It is common to use a
backpropagation method [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as a part of algorithms that optimize the performance of the
network by adjusting the weights. This approach calculates the gradient of the
loss function with respect to the weights in an artificial neural network.
        </p>
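<p>To make the feed-forward pass and the backpropagation weight update concrete, the following is a minimal sketch in plain Python. It is only an illustration of the general technique described above, not the tool used later in the paper; the network sizes, weights, and training example are all hypothetical.</p>

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
n_in, n_hid = 4, 3
# hypothetical weights: one hidden layer of sigmoid units, one sigmoid output
w_hid = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
w_out = [random.uniform(-1, 1) for _ in range(n_hid)]

def forward(x):
    # information moves in only one direction: input, hidden, output
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    o = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    return h, o

def backprop_step(x, target, lr=0.5):
    # gradient of the squared loss w.r.t. the weights, chained through the sigmoids
    h, o = forward(x)
    delta_o = (o - target) * o * (1 - o)
    for j in range(n_hid):
        delta_h = delta_o * w_out[j] * h[j] * (1 - h[j])
        w_out[j] -= lr * delta_o * h[j]
        for i in range(n_in):
            w_hid[j][i] -= lr * delta_h * x[i]
    return o

x, t = [0.2, 0.7, 0.1, 0.9], 1.0
before = forward(x)[1]
for _ in range(100):
    backprop_step(x, t)
after = forward(x)[1]
print(before, after)  # the output moves toward the target 1.0
```

Each update nudges the weights against the loss gradient, which is the essence of the backpropagation method cited above.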
      </sec>
      <sec id="sec-2-2">
        <title>Sports Betting</title>
        <p>Sports betting is the activity of predicting sports results and placing a wager on the outcome. The frequency of sports bet upon varies by culture, with the vast majority of bets being placed on association football, American football, basketball, baseball, hockey, track cycling, auto racing, mixed martial arts and boxing, at both the amateur and professional levels. Sports betting can also extend to non-athletic events, such as reality show contests and political elections, and non-human contests such as horse racing, greyhound racing and illegal, underground dog fighting.</p>
        <p>The bookmaker functions as a market maker for sports wagers, most of which have a binary outcome: a team either wins or loses. The bookmaker accepts both wagers and maintains a spread which ensures a profit regardless of the outcome of the wager (the bookmaker's fee). In other words, bookmakers have a fixed income for every event: they underestimate every possible outcome of the event, i.e. they calculate the odd according to a probability higher than the expected one, meaning a lower odd for the outcome.</p>
        <p>Odds for different outcomes in a single bet are presented either in European format (decimal odds), UK format (fractional odds), or American format (moneyline odds). The European format (decimal odds) is used in continental Europe, Canada, and Australia. Decimal odds are the ratio of the full payout to the stake, in decimal format. Decimal odds of 2.00 are an even bet, with a theoretical implied probability of 50% (without considering the bookmaker's fee).</p>
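<p>The relation between decimal odds and their theoretical implied probability can be sketched as a one-line helper; this ignores the bookmaker's fee, which the model accounts for later.</p>

```python
def implied_probability(decimal_odds):
    # theoretical implied probability of a decimal odd, ignoring the bookmaker's fee
    return 1.0 / decimal_odds

print(implied_probability(2.00))            # 0.5: an even bet
print(round(implied_probability(4.00), 2))  # 0.25
```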
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        In the state of the art concerning NBA prediction models we can find several works. For instance, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors present a comparison between the NBA and the National College Athletics Association Basketball (NCAAB). They evaluate implementations of multilayer perceptron, random forest, and Naive Bayes classifiers. They used fewer variables than our model, without considering every match as a single record, and they do not take into account the distinction between home and road team. Furthermore, they do not consider the bookmaker's expectation of events.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors formalize the problem of predicting NBA game results as a classification problem and apply the principle of Maximum Entropy to construct an NBA Maximum Entropy (NBAME) model that fits discrete statistics for NBA games, and then predict the outcomes of the NBA playoffs using the model.
In the journal paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors propose two network-based models to predict the behaviour of teams in sports leagues. This represents a different approach from our model: they do not start from team statistics or other parameters, but model sports leagues as networks of players and teams where the only information available is the work relationships among them.
      </p>
      <p>Table 1 (list of features provided by the NBA database): GP (Games Played), W (Wins), L (Losses), WIN% (Win Percentage), MIN (Minutes Played), FGM (Field Goals Made), FGA (Field Goals Attempted), FG% (Field Goal Percentage), 3PM (3-Point Field Goals Made), 3PA (3-Point Field Goals Attempted), 3P% (3-Point Field Goal Percentage), FTM (Free Throws Made), FTA (Free Throws Attempted), FT% (Free Throw Percentage), OREB (Offensive Rebounds), DREB (Defensive Rebounds), REB (Rebounds), AST (Assists), TOV (Turnovers), STL (Steals), BLK (Blocks), PF (Personal Fouls), PFD (Personal Fouls Drawn), PTS (Points), +/- (Plus/Minus).</p>
      <sec id="sec-3-12">
        <title>Analysis Process</title>
        <p>In this section, we describe the analysis process flow. Our model is essentially composed of four phases. First of all, we collect data. Then, in the filtering phase, we process this raw dataset by removing irrelevant features. After that, we train the ANN using the resulting dataset as training source, and we evaluate the performance of the generated network. Finally, we compare the ANN output, computed on a new input set of data referring to future events, with the bookmaker's odds of those events. In Fig. 2 we present a diagram describing the ANNUT model.</p>
        <sec id="sec-3-12-2">
          <title>Data Acquisition</title>
          <p>
            In our model, every record represents a match. In every row, we insert home team statistics and road (away) team statistics. Statistics refer to previous matches played during the season up to the day of the match. It is important to note that we take into account only home matches for the home team and only away matches for the away team, because we assume that teams behave slightly differently when playing home or away. Data are extracted directly from the official NBA site [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. In Table 1, we can see all the features provided by the NBA data source.
          </p>
          <p>The NBA calendar is full of matches, and it is common for two teams to have a different number of played matches in the season. For this reason, to have a balanced dataset, we normalize our data according to the actual played minutes, in this way also considering overtime periods. Thus, we have values per minute, for instance: points scored per minute. The two chosen categories for classification are win home and win road; they are boolean variables set to 1 if the home team or the road team won the match. Data refer to the second half of the season, allowing us to have a set of data based on a significant amount of already played matches.</p>
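<p>The per-minute normalization described above can be sketched as follows; the raw totals are hypothetical, standing in for the season totals taken from the NBA database.</p>

```python
# hypothetical raw season totals for one team, as provided by the NBA database
team_totals = {"MIN": 1985, "PTS": 4312, "REB": 1788, "AST": 940}

def per_minute(totals):
    # divide every statistic by the actual minutes played, so teams with a
    # different number of played matches (and overtimes) stay comparable
    minutes = totals["MIN"]
    return {k: v / minutes for k, v in totals.items() if k != "MIN"}

print(per_minute(team_totals))  # e.g. points scored per minute
```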
        </sec>
        <sec id="sec-3-12-3">
          <title>Data Filtering</title>
          <p>Once the raw dataset is prepared, the next step is the filtering phase. We decided to select a subset of the available variables. In detail, we select ten features which describe almost completely the characteristics of a team. The selected features are shown in Tab. 2. First of all, we choose to remove the statistics concerning won or lost matches, because they do not describe the playing styles of teams. Then, we remove the number of shots: we judge that having the shot success percentage and the points scored is sufficient for a consistent description of how much, and how, a team scores points. Furthermore, we remove foul statistics because they do not improve the network performance. Finally, we keep variables (such as the number of turnovers, blocks, etc.) giving information on the ability of teams, mostly concerning defensive play. We get records composed of twenty fields, ten fields for each team.</p>
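<p>A sketch of how such a twenty-field record could be assembled from the ten selected features of each team; the per-minute values below are hypothetical placeholders.</p>

```python
SELECTED = ["PTS", "FG%", "3P%", "FT%", "REB", "AST", "TOV", "STL", "BLK", "+/-"]

def build_record(home_stats, road_stats):
    # one record per match: ten features for the home team, then ten for the road team
    return [home_stats[k] for k in SELECTED] + [road_stats[k] for k in SELECTED]

# hypothetical per-minute values
home = {k: 0.5 for k in SELECTED}
road = {k: 0.4 for k in SELECTED}
record = build_record(home, road)
print(len(record))  # 20
```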
          <p>Table 2 (list of selected variables provided by the NBA database): PTS (Points), FG% (Field Goal Percentage), 3P% (3-Point Field Goal Percentage), FT% (Free Throw Percentage), REB (Rebounds), AST (Assists), TOV (Turnovers), STL (Steals), BLK (Blocks), +/- (Plus/Minus).</p>
        </sec>
      </sec>
      <sec id="sec-3-21">
        <title>Neural Network Training and Prediction Selection</title>
        <p>
          Our ANNUT model is based on the deployment of an ANN for pattern recognition. Our aim is to predict which team, home or road, will probably win the match. In order to build the network, we deploy the "Neural Network" tool provided by Matlab [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The algorithm implemented by the tool is based on a
two-layer feed-forward network, with sigmoid hidden and output neurons. It is
used to classify vectors into speci ed target categories. The network is trained
with scaled conjugate gradient backpropagation. The training set is composed
of 566 matches, played between late December 2016 and late March 2017. In
the training phase, we select the following split of input data: 396 matches for
training, 141 matches for validation, 28 matches for testing. We follow a
classical neural network training approach by using also validation data to select
the best performing network. Furthermore, we do not select a bigger training
dataset because we note that using a bigger one will train a network with worse
or at least equal performance.
        </p>
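<p>The training/validation/test partitioning described above can be sketched as follows; the shuffle seed is arbitrary, and the test partition is simply the remainder of the dataset.</p>

```python
import random

def split_dataset(records, n_train=396, n_val=141, seed=42):
    # shuffle, then carve out training / validation / test partitions
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(566)))
print(len(train), len(val), len(test))
```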
        <p>
          In order to evaluate the performance of the generated network, we show the receiver operating characteristic (ROC) curve presented in Fig. 1. By looking at the area under the ROC curve (AUROC) in Tab. 3 (computed by constructing trapezoids under the curve), we have a measure of the predictive accuracy of the model. We can see that the neural network has good prediction capabilities; in fact, we have to consider that we are in a non-deterministic context (sports events) with a huge amount of involved variables. Our AUROC values are compared in Tab. 3 with the classifiers presented in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This table shows a significant gap between our ANNUT model and other approaches, underlining the good results reached by our trained network. In detail, we compare our ANNUT model (ANNUTH for classification of the win of the home team and ANNUTR for the road team) with Naive Bayes (NB), logistic regression (LR), backpropagation neural networks (BP-ANN), random forest (RF) and the NBAME model. At this point, the trained ANN can be used to classify new data, and we can go further with the last analysis step.
Now, we want to introduce a different perspective in order to evaluate the output of the network. Other approaches simply take the output of the network as a prediction, whereas our model evaluates the network output against the betting odds (in decimal format) of the match. In other words, we compare the real normalized value returned by the ANN (between zero and one) with the implied bookmaker's probabilities. In fact, bookmakers select their odds according to the expectation of an event (implied probability) and the market. Starting from the bookmaker's quotation, we can compute the implied probability p of an event by the equation:
p = 1 / (q · c)   (1)
where q is the bookmaker's odd for that event and c is a fixed parameter indicating the bookmaker's commission (the average bookmaker's gain). The value p should return the implied probability computed by the bookmaker, i.e. the bookmaker's expectation on that outcome; for this reason, we also consider the bookmaker's commission c. Once the value p is computed for the event, we compare it to the normalized ANN output o. If the difference between o and p exceeds a predefined threshold t, we choose this prediction. In other words, our expectation of that event is greater than the bookmaker's by at least t.</p>
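<p>The filtration rule above (Eq. 1 plus the threshold test) can be sketched as follows; the ANN outputs and odds are hypothetical examples.</p>

```python
def select_prediction(ann_output, odd, c=1.041, t=0.10):
    # implied bookmaker probability per Eq. 1, then the threshold test
    p = 1.0 / (odd * c)
    return (ann_output - p) > t

# hypothetical underdog quoted at 4.50 with a strong normalized ANN score
print(select_prediction(0.45, 4.50))  # True: 0.45 exceeds the implied 0.213 by more than 0.10
print(select_prediction(0.25, 4.50))  # False: difference below the threshold
```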
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>
        We are now able to present the experimental results. We build the network with a dataset composed of 566 matches extracted from the official NBA database, while odds are extracted from one of the most famous bookmakers, Bet365 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
After training the ANN, we apply our ANNUT model to another set of matches, referring to the last part of the regular season 2016/2017, the first two weeks of April. This test dataset has 117 records (matches), in which every record always contains statistics for the home and the away team. Concerning Eq. 1, the parameter c is set to 1.041, meaning a bookmaker's fee of 4.1%. This value comes from the computation of the average commission of ten similar events quoted on Bet365. Furthermore, the selection threshold is equal to 0.10. Thus, for every normalized output o, if o &gt; p + 0.10 we keep the prediction, otherwise we discard it. We are interested not only in the pure prediction: in some way, we are looking for a quality prediction. In fact, our aim is to design a profitable model, because our perspective is to gain money by betting (or at least not to lose it). Given this consideration, a profitable model cannot neglect the bookmaker's odd of the event.
In order to evaluate our ANNUT model, accuracy was used as a performance measure, calculated by the following formula:</p>
      <p>Accuracy = (number of correct predictions / number of predictions) × 100   (2)
In Tab. 4, we report our results according to our prediction schema. In the first row, we consider every range of odds: we collect 69 event predictions over the 117 considered matches, their winning rate, average odd and the balance (computed assuming that we bet one unit per prediction). In the next rows, we restrict the selected predictions according to the odd; for instance, in the second row we consider only events listed with an odd greater than 1.99. The winning percentage must be evaluated considering that our model selects events with high odds, i.e., low probability but high profitability. In order to better show the performance of our model, we also compute the implied probability according to the odds of the selected matches. The implied probability is exactly the value p computed by Eq. 1 with c = 1.041 and q the average odd. Tab. 4 shows that the accuracy of our model has a good margin over the implied probability, especially in the case of odds greater than 2.99, where we have a significant percentage of correct predictions with respect to the implied probability.</p>
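<p>Eq. 2 as a one-line helper; the counts in the example are hypothetical.</p>

```python
def accuracy(correct, total):
    # Eq. 2: percentage of correct predictions
    return correct / total * 100

print(round(accuracy(10, 26), 1))  # 38.5
```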
      <p>In the last two rows, we can see a consistent gain in units. We can interpret this result as the capability of our model to predict the win of an underdog team. Considering only odds between 3.00 and 10.00, our model selects 26 events with an average odd of 4.63. 10 out of 26 events are winning predictions, with a winning average odd of 4.89. Assuming we bet one unit on each event, we have a total balance of +22.89 units. Taking into account the matches referring to the last row of Tab. 4 and assuming a base bet amount of 10.00 € on each match, in Fig. 3 we present the trend of our balance.</p>
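<p>The unit-balance bookkeeping used above (total return of winning bets minus the total amount staked) can be sketched as follows, on a hypothetical slate of selections rather than the paper's actual 26 events.</p>

```python
def betting_balance(selected_odds, winners, stake=1.0):
    # total return of the winning bets minus the total amount staked
    returns = sum(odd * stake for odd, won in zip(selected_odds, winners) if won)
    return returns - stake * len(selected_odds)

# hypothetical slate: three selections, one winner at odd 4.80
odds = [3.20, 4.80, 5.10]
results = [False, True, False]
print(round(betting_balance(odds, results), 2))  # 1.8 units
```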
      <p>In conclusion, in this paper we describe the ANNUT model, showing its profitability on real events. The ANNUT model reveals good performance in terms of the ANN ROC curve, but also concerning the selection of winning underdog teams, which leads to interesting money earnings, showing its effectiveness in practice.</p>
      <p>The interpretation of the success of our framework can be related to the consensus phenomenon, which shows what percentage of the general betting public is on each side of the game. Thus, in order to balance the market, bookmakers change odds according to the sentiment of bettors. Considering this scenario, our ANNUT model shows its ability in spotting undervalued teams.
There are several future works that we can address; for instance, we can make an additional analysis on how to choose the threshold t. Furthermore, one weak point is that the model does not consider missing players. This can lead to a wrong prediction when one or more relevant players will probably not play the game, because the odd will be very high for that team.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Bet365 sports bookmaker, https://mobile.bet365.com/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>Matlab neural network toolbox</article-title>
          , https://it.mathworks.com/products/neural-network.html
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. NBA statistics team database, http://stats.nba.com/teams/traditional/#!?Season=2016-17&amp;SeasonType=Regular%20Season&amp;PerMode=Totals&amp;sort=W_PCT&amp;dir=-1
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Cheng, G.,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kyebambe</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kimbugwe</surname>
          </string-name>
          , N.:
          <article-title>Predicting the outcome of NBA playo s based on the maximum entropy principle</article-title>
          .
          <source>Entropy</source>
          <volume>18</volume>
          (
          <issue>12</issue>
          ),
          <volume>450</volume>
          (
          <year>2016</year>
          ), https://doi.org/10.3390/e18120450
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cilimkovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Neural networks and back propagation algorithm</article-title>
          , http://dataminingmasters.com/uploads/studentProjects/NeuralNetworks.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. de Melo,
          <string-name>
            <given-names>P.O.S.V.</given-names>
            ,
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.A.F.</given-names>
            ,
            <surname>Loureiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.F.</given-names>
            ,
            <surname>Faloutsos</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Forecasting in the NBA and other team sports: Network effects in action</article-title>
          .
          <source>TKDD</source>
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>13:1</fpage>
          –
          <lpage>13:27</lpage>
          (
          <year>2012</year>
          ), http://doi.acm.org/10.1145/2362383.2362387
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mohri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rostamizadeh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwalkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Foundations of Machine Learning</article-title>
          . The MIT Press (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Munoz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Machine learning and optimization</article-title>
          , https://www.cims.nyu.edu/~munoz/files/ml_optimization.pdf (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Basketball predictions in the NCAAB and NBA: similarities and differences</article-title>
          .
          <source>Statistical Analysis and Data Mining</source>
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>350</fpage>
          –
          <lpage>364</lpage>
          (
          <year>2016</year>
          ), https:// doi.org/10.1002/sam.11319
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>