=Paper=
{{Paper
|id=Vol-1971/paper-06
|storemode=property
|title=Dynamic Winner Prediction in Twenty20 Cricket: Based on Relative Team Strengths
|pdfUrl=https://ceur-ws.org/Vol-1971/paper-06.pdf
|volume=Vol-1971
|authors=Sasank Viswanadha,Kaustubh Sivalenka,Madan Gopal Jhawar,Vikram Pudi
|dblpUrl=https://dblp.org/rec/conf/pkdd/ViswanadhaSJP17
}}
==Dynamic Winner Prediction in Twenty20 Cricket: Based on Relative Team Strengths==
<pdf width="1500px">https://ceur-ws.org/Vol-1971/paper-06.pdf</pdf>
<pre>
       Dynamic Winner Prediction in Twenty20
      Cricket: Based on Relative Team Strengths

    Sasank Viswanadha^1, Kaustubh Sivalenka^1, Madan Gopal Jhawar^2,*, and
                               Vikram Pudi^3

              1 Mahindra École Centrale, Hyderabad, India,
            sasank14168@mechyd.ac.in, kaustubh14161@mechyd.ac.in
                              2 Microsoft, India
                            majhawar@microsoft.com
       3 DSAC, Kohli Centre on Intelligent Systems, IIIT Hyderabad, India
                               vikram@iiit.ac.in


        Abstract. Predicting the outcome of a match has always been at the
        center of sports analytics. Indian Premier League (IPL), a professional
        Twenty20 (T20) cricket league in India, has established itself as one of
        the biggest tournaments in cricket history. In this paper, we propose a
        model to predict the winner at the end of each over in the second
        innings of an IPL cricket match. Our methodology not only incorporates
        the dynamically updating game context as the game progresses, but also
        includes the relative strength between the two teams playing the match.
        Estimating the relative strength between two teams involves modeling
        the individual participating players’ potentials. To model a player, we
        use his career as well as recent performance statistics. Using the various
        dynamic features, we evaluate several supervised learning algorithms to
        predict the winner of the match. Finally, using the Random Forest
        Classifier (RFC), we have achieved an accuracy of 65.79% - 84.15% over the
        course of the second innings, with an overall accuracy of 75.68%.

        Keywords: Winner Prediction, Sports Analytics, Supervised Learning,
        Player Modeling, Cricket


1     Introduction
The use of statistical analysis in sports has been growing rapidly over the past
decade. It has not only changed the way game strategies are formed and players
are evaluated, but has also changed the way sports are viewed by the audience.
Cricket is one of the most followed team games in the world with billions of fans
all across the globe. The complex rules governing the game along with many
other player-dependent and natural parameters provide ample opportunities to
model the game from various perspectives.
    Cricket has evolved over time. Today, it is played in three major formats:
Test matches, One Day Internationals (ODIs) and T20 cricket. T20 cricket
is the latest and the most exciting format of the game. Ever since its inception
in 2007, the IPL has been a huge success and has generated a billion-dollar
industry. It is played during April and May of every year by teams representing
Indian cities and has already completed 10 successful seasons. Therefore, in this
paper, we focus our study on IPL cricket matches. We propose a dynamic model to
predict the winner of a match at the end of each over in the second innings of the
match. Apart from game-dependent features such as the number of balls
remaining, the number of runs still to be scored, and the number of wickets
remaining, we use the relative strength of the competing teams
as a distinctive feature in predicting the winner of the match. A team is composed
of players; hence, estimating the relative strength of two competing
teams requires us to estimate the potential of the players. Therefore, using the
recent and career performance statistics of a player, we define novel methods to
represent his batting and bowling capabilities, the two major roles of a player in
the game of cricket. Using these features, we have evaluated various supervised
learning algorithms to predict the winner of the match at the end of each over
as the match progresses.

* This work was done when the author (Madan Gopal Jhawar) was a student at
  IIIT-Hyderabad.


2   Related Work


Over the last decade, the application of statistical methods in cricket analysis
has grown manifold, particularly in the context of winner prediction. Khan and
Shah [6] apply supervised learning techniques, namely Support Vector Machines
(SVM) and Naive Bayes classification, to the predictive analysis of ODI matches,
considering factors such as the coin toss outcome, the competing teams and the
home venue. Kaluarachchi and Varde [7] studied the
impact of several factors in predicting the outcome of ODI cricket matches using
Bayesian classifiers. Jhawar and Pudi [8] propose an approach
to predict the winner of ODI cricket matches based on the team composition of
the competing teams. Prakash et al. [9] present an
approach to winner prediction for the ninth season of the IPL, at the start of the
season, by modeling the individual player strengths as cumulative batting and
bowling scores.
    However, the problem of winner prediction while the game is in progress
has not been studied in detail. Sankaranarayanan et al. [11] consider both
historical data and instantaneous match states for ODI cricket to predict
the match winner using nearest-neighbor clustering and linear regression
algorithms. They also introduce the idea of using segments to
break down an innings and make predictions for each segment. Bailey and
Clarke [12] identified a range of variables that independently
explain statistically significant proportions of the variation in
predicted run totals and match outcomes, and used a linear
regression model to predict the winner.

3     Problem Formulation and Notation

3.1   Overview of T20 Cricket: Rules
In the T20 format of cricket, each of the two playing teams bats for a maximum
of 120 deliveries and bowls for a maximum of 120 deliveries. The team that scores
more runs, within its 120 deliveries or before losing all 10 of its
wickets, wins the match.
    Over: A sequence of six balls bowled by a bowler from one end of the pitch
is called an over in cricket terminology.
    Innings: An innings is one of the divisions of a cricket match during which
one team takes its turn to bat. There are two innings in a game of cricket. In
this paper, we restrict our study to the second innings of a match.
    State: In our study, we define state to represent the different stages in the
match at which we make the predictions using our model. We consider 21 states
for each match; 1 at the beginning of the second innings and 20 at the end of
each over of the second innings. It is to be noted here that the number of states
considered to make predictions can be changed.

3.2   Notation
In this section, we introduce the notation to be used throughout this paper. We
use m to represent a match, innings_1 and innings_2 to denote the first and second
innings respectively. We use Team_A to represent the team batting in innings_1
and Team_B to represent the team batting in innings_2. Score_A denotes the runs
scored by Team_A in innings_1. Target denotes the number of runs that Team_B
needs to score to win the match, Target = Score_A + 1. S_i, 0 ≤ i ≤ 20, represents
the states in a match m. S_0 corresponds to the state at the end of innings_1 and
the remaining states S_i, 1 ≤ i ≤ 20, each correspond to the state at the end of
over i in innings_2. S_20 has been considered for training examples so as to make
sure that the model learns which team has won the game. S_20 has been used in the
testing set as well and it serves as a confirmation that the model is working as
expected. Pl_A^m denotes the set of 11 players in Team_A playing in m and Pl_B^m
denotes the set of 11 players in Team_B playing in m.
    C(p) denotes the set of career statistics of a player p and F(p) denotes the set
of recent statistics (recent 4 games) or form of a player p. The career statistics are
shown in Table 1 and recent statistics are similar to career statistics, replacing
the subscript C with F.
    At each state, there are 3 parameters along with the relative team scores that
we use in our model to make predictions.

 – R^i_{runs remaining} denotes the number of runs Team_B needs to get to win
   m at state i. R^i_{runs remaining} = Target − runs scored by Team_B at state i
 – R^i_{wickets remaining} denotes the number of wickets Team_B has in hand at
   state i. R^i_{wickets remaining} = 10 − wickets lost by Team_B at state i
 – R^i_{balls remaining} denotes the number of balls Team_B is yet to play at
   state i. R^i_{balls remaining} = 120 − balls played by Team_B at state i
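
The three state parameters can be sketched in a few lines of Python. This is a
minimal illustration of the definitions above; the function name, field names and
the example scoreline are assumptions made for the sketch and do not come from
the paper.

```python
from dataclasses import dataclass

BALLS_PER_INNINGS = 120    # 20 overs of 6 balls each
WICKETS_PER_INNINGS = 10


@dataclass
class MatchState:
    runs_remaining: int
    wickets_remaining: int
    balls_remaining: int


def match_state(score_a, runs_b, wickets_lost_b, balls_played_b):
    """State parameters for Team_B at a given state, per Section 3.2."""
    target = score_a + 1  # Target = Score_A + 1
    return MatchState(
        runs_remaining=target - runs_b,
        wickets_remaining=WICKETS_PER_INNINGS - wickets_lost_b,
        balls_remaining=BALLS_PER_INNINGS - balls_played_b,
    )


# Hypothetical scoreline: Team_A made 160; after 10 overs Team_B is 78 for 3.
state = match_state(score_a=160, runs_b=78, wickets_lost_b=3, balls_played_b=60)
print(state)
```

At state S_0 the same function is called with zero runs, wickets lost and balls
played, which gives the full target, 10 wickets and 120 balls.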

                             Table 1. Career Statistics

         Notation    Description
         MP_C        Number of matches played by the player
         BaI_C       Number of matches in which the player has batted
         RS_C        Number of runs scored by the player
         OB_C        Number of overs in which the player has batted
         NO_C        Number of innings in which the player remained not out
         Ba_C        Average number of runs scored by the player before getting out
         BaSR_C      Average number of runs scored by the player per 100 balls
         BoI_C       Number of matches in which the player has bowled
         WT_C        Number of wickets taken by the player
         RC_C        Number of runs conceded by the player
         OB_C        Number of overs in which the player has bowled
         BE_C        Number of runs conceded by the player per over
         BoSR_C      Number of balls bowled by the player per wicket taken

         Note: OB_C denotes overs batted in the batting formulas and overs
         bowled in the bowling formulas.


4     Methodology

4.1   Batsman Rating

Calculation of Batting Average: Batting Average is defined as the average
number of runs scored by the batsman before he gets out. Batting average for
the career statistics is calculated in the following way

    Ba_C = \frac{RS_C}{BaI_C - NO_C}                                          (1)

Calculation of Batting Strike Rate: Batting Strike Rate is defined as the
average number of runs scored by the batsman per 100 balls faced. Batting
strike rate for the career statistics is calculated in the following way

    BaSR_C = \frac{RS_C}{OB_C \cdot 6} \cdot 100                              (2)

   The batting average and strike rate for the recent statistics are calculated
in the same way as in Equations 1 and 2.


Calculation of Batsman Score: The quality of the batsmen a team possesses
can greatly affect the outcome of a game. Consistency and fast run-scoring ability
are two traits common to all good batsmen. Batting average is a measure
of the consistency of the batsman and batting strike rate is a measure of his fast
run-scoring ability. As illustrated in [10], batting average and batting strike rate
can be used to effectively estimate the batting scores of participating players.
Career and recent scores of a player are calculated as shown in Equations 3 and
4.

    \phi^{p}_{\text{career batting score}} = \sqrt{\frac{BaI_C}{MP_C}} \cdot Ba_C \cdot BaSR_C      (3)

    \phi^{p}_{\text{recent batting score}} = \sqrt{\frac{BaI_F}{n}} \cdot Ba_F \cdot BaSR_F         (4)

   The final batting score φ^p_{final batting score} of a player, considering his
career and recent statistics, is given by Equation 5

    \phi^{p}_{\text{final batting score}} = \mu \cdot \phi^{p}_{\text{career batting score}} + (1 - \mu) \cdot \phi^{p}_{\text{recent batting score}}      (5)

where n represents the number of recent matches considered and µ represents
the weight assigned to the career score in calculating the final batting score of a
player.
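
The batting equations can be sketched in Python as follows. This is an
illustrative sketch rather than the authors' implementation: the function names,
the handling of players who were never dismissed or never played, and the example
figures are assumptions. The default µ = 0.8 and the recent window n = 4 are the
values reported in Section 5.2.

```python
import math


def batting_average(runs, innings_batted, not_outs):
    """Eq. (1): runs scored per dismissal."""
    dismissals = innings_batted - not_outs
    # Assumption: the paper does not cover a player who was never dismissed;
    # this sketch falls back to the raw run total in that case.
    return runs / dismissals if dismissals > 0 else float(runs)


def batting_strike_rate(runs, overs_batted):
    """Eq. (2): runs per 100 balls, counting 6 balls per over batted."""
    balls = overs_batted * 6
    return runs / balls * 100 if balls > 0 else 0.0


def batting_score(runs, innings_batted, not_outs, overs_batted, matches):
    """Eqs. (3)/(4): sqrt(innings batted / matches) * average * strike rate."""
    if matches == 0:
        return 0.0  # assumption: no matches played gives a score of 0
    average = batting_average(runs, innings_batted, not_outs)
    strike_rate = batting_strike_rate(runs, overs_batted)
    return math.sqrt(innings_batted / matches) * average * strike_rate


def final_batting_score(career_score, recent_score, mu=0.8):
    """Eq. (5): weighted combination of career and recent scores."""
    return mu * career_score + (1 - mu) * recent_score


# Hypothetical player: career totals, then totals over the last n = 4 matches.
career = batting_score(runs=1200, innings_batted=40, not_outs=10,
                       overs_batted=150, matches=40)
recent = batting_score(runs=90, innings_batted=4, not_outs=1,
                       overs_batted=15, matches=4)
print(final_batting_score(career, recent))
```

For the recent score, the `matches` argument plays the role of n in Equation 4.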


4.2   Bowler Rating

Calculation of Bowling Economy: Bowling Economy is defined as the average
number of runs conceded by the bowler per over he bowls. Bowling economy for
the career statistics is calculated in the following way

    BE_C = \frac{RC_C}{OB_C}                                                  (6)

Calculation of Bowling Strike Rate: Bowling Strike Rate is defined as the
average number of balls bowled by the bowler per wicket taken. Bowling strike
rate for the career statistics is calculated in the following way

    BoSR_C = \frac{OB_C \cdot 6}{WT_C}                                        (7)

   The bowling economy and strike rate for the recent statistics are calculated
in the same way as in Equations 6 and 7.


Calculation of Bowler Score: The quality of the bowlers a team possesses
also has a significant impact on the game’s outcome. Economical bowling and
high wicket-taking ability are two traits common to all good bowlers. Bowling
economy is a measure of economical bowling,
while bowling strike rate is a measure of the bowler’s wicket-taking
ability. As illustrated in [10], bowling economy and bowling strike rate can be used
to effectively estimate the bowling scores of participating players. Career and recent
scores of a player are calculated as shown in Equations 8 and 9.

    \phi^{p}_{\text{career bowling score}} = \sqrt{\frac{BoI_C}{MP_C}} \cdot \left( \frac{1}{BE_C \cdot BoSR_C} \right)      (8)

    \phi^{p}_{\text{recent bowling score}} = \sqrt{\frac{BoI_F}{n}} \cdot \left( \frac{1}{BE_F \cdot BoSR_F} \right)         (9)

The final bowling score φ^p_{final bowling score} of a player, considering his career
and recent statistics, is given by Equation 10

    \phi^{p}_{\text{final bowling score}} = \mu \cdot \phi^{p}_{\text{career bowling score}} + (1 - \mu) \cdot \phi^{p}_{\text{recent bowling score}}      (10)

where n represents the number of recent matches considered and µ represents
the weight assigned to the career score in calculating the final bowling score of a
player; they are the same as the ones introduced in Equations 4 and 5, respectively.
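
The bowling equations admit a similar sketch. As before, the function names, the
zero-score fallback for players with no bowling record, and the example figures
are assumptions for illustration; only the formulas themselves follow the text
above.

```python
import math


def bowling_economy(runs_conceded, overs_bowled):
    """Eq. (6): runs conceded per over."""
    return runs_conceded / overs_bowled


def bowling_strike_rate(overs_bowled, wickets):
    """Eq. (7): balls bowled per wicket, counting 6 balls per over."""
    return overs_bowled * 6 / wickets


def bowling_score(runs_conceded, overs_bowled, wickets, innings_bowled, matches):
    """Eqs. (8)/(9): sqrt(innings bowled / matches) / (economy * strike rate)."""
    # Assumption: the paper does not define the score when any of these
    # quantities is zero; this sketch returns 0 in that case.
    if min(runs_conceded, overs_bowled, wickets, matches) <= 0:
        return 0.0
    economy = bowling_economy(runs_conceded, overs_bowled)
    strike_rate = bowling_strike_rate(overs_bowled, wickets)
    return math.sqrt(innings_bowled / matches) / (economy * strike_rate)


def final_bowling_score(career_score, recent_score, mu=0.8):
    """Eq. (10): weighted combination of career and recent scores."""
    return mu * career_score + (1 - mu) * recent_score


# Hypothetical bowler: career totals, then totals over the last n = 4 matches.
career = bowling_score(runs_conceded=800, overs_bowled=100, wickets=40,
                       innings_bowled=25, matches=25)
recent = bowling_score(runs_conceded=90, overs_bowled=12, wickets=6,
                       innings_bowled=3, matches=4)
print(final_bowling_score(career, recent))
```

Note that the product of economy and strike rate simplifies to 6 ∗ RC / WT, so
the score rewards a low number of runs conceded per wicket.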

4.3   Calculation of Relative Team Strength
A team’s batting and bowling strength is a consolidated measure of the
batting and bowling strengths of the 11 players playing in that match. Algorithm
1 illustrates the computation of Relative strength_{Team_B/Team_A}. Lines 1-4
normalize φ^p_{batting score} and φ^p_{bowling score} for all the players. As the match
progresses through innings_2, some batsmen will get
out and some bowlers will use up their quota of deliveries (24 balls). Thus,
we compute φ^{Team}_{batting score} and φ^{Team}_{bowling score} of a team as a weighted
sum over the players (batsmen) who are not yet out and the players (bowlers) who
still retain part of their quota of deliveries respectively (Lines 5-8). This
introduces dynamism in the team scores by removing players who cannot contribute
to the game any longer (in terms of batting and bowling) from the respective
batting and bowling scores of the team. Line 9 computes
Relative strength_{Team_B/Team_A}. We
only calculate the relative strength with respect to Team_B because Team_B bats in
innings_2 according to our notation and we make predictions only for innings_2
in our model. The Team_A bowling score has a negative impact on the Team_B batting
score and vice-versa in the formula in Line 9.

4.4   Features
To predict the outcome of an ongoing T20 (IPL) match, we first split innings_2
into 21 states: S_0 at the end of innings_1 and S_i (1 ≤ i ≤ 20) at the end of i overs.
At a state S_i we use the following dynamic features to make the prediction:

 – Runs remaining to be scored to win the match, R^i_{runs remaining}
 – Wickets that Team_B still has in hand, R^i_{wickets remaining}
 – Balls remaining to be played by Team_B in the innings, R^i_{balls remaining}
 – Relative strength_{Team_B/Team_A}, serving as a dynamic metric of team strengths
   to improve the predictions.

The aforementioned features capture the state of the match at any given instant
while the match is in progress, and these features change as the match progresses
towards completion. All these features are passed to a classifier along with the
label (1 if Team_B wins, 0 otherwise) to predict the match winner.
Venue or home advantage is not used as a feature because most of the pitches
in IPL are somewhat similar and the crowd support is even. Also, the shorter
format of the game makes this feature negligible.

Algorithm 1   Modeling Teams at the beginning of an over
Input: Pl_A^m, Pl_B^m, φ^p_{batting score}, φ^p_{bowling score} ∀ p ∈ (Pl_A^m ∪ Pl_B^m)
Output: Relative strength_{Team_B/Team_A}
 1: for p ∈ (Pl_A^m ∪ Pl_B^m) do
 2:    φ^p_{batting score} ← φ^p_{batting score} / max(φ^p_{batting score})
 3:    φ^p_{bowling score} ← φ^p_{bowling score} / max(φ^p_{bowling score})
 4: end for
 5: φ^{Team_A}_{batting score} = Σ_{p ∈ Pl_A^m} φ^p_{batting score}
 6: φ^{Team_A}_{bowling score} = Σ_{p ∈ Pl_A^m} ((24 − balls bowled) / 24) ∗ φ^p_{bowling score}
 7: φ^{Team_B}_{batting score} = Σ_{p ∈ Pl_B^m (not yet out)} φ^p_{batting score}
 8: φ^{Team_B}_{bowling score} = Σ_{p ∈ Pl_B^m} φ^p_{bowling score}
 9: Relative strength_{Team_B/Team_A} = φ^{Team_B}_{batting score} / φ^{Team_A}_{bowling score}
                                       − φ^{Team_A}_{batting score} / φ^{Team_B}_{bowling score}
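
Algorithm 1 can be sketched in Python as below. The data layout (dictionaries
keyed by player id, a set of dismissed batsmen) is an assumption made for the
sketch, as is reading the max in Lines 2-3 as the maximum over all 22 players.
The sketch also assumes that both team bowling sums are non-zero.

```python
def normalize(scores):
    """Lines 1-4: divide each score by the maximum over all players (assumed)."""
    top = max(scores.values())
    return {p: (s / top if top > 0 else 0.0) for p, s in scores.items()}


def relative_strength(team_a, team_b, batting, bowling, balls_bowled, dismissed):
    """Relative strength of Team_B over Team_A at the start of an over.

    team_a, team_b : lists of player ids
    batting, bowling : dict player id -> final batting / bowling score
    balls_bowled : dict player id -> balls bowled so far in innings_2
    dismissed : set of Team_B batsmen who are already out
    """
    bat = normalize(batting)
    bowl = normalize(bowling)
    bat_a = sum(bat[p] for p in team_a)                              # Line 5
    bowl_a = sum((24 - balls_bowled.get(p, 0)) / 24 * bowl[p]
                 for p in team_a)                                    # Line 6
    bat_b = sum(bat[p] for p in team_b if p not in dismissed)        # Line 7
    bowl_b = sum(bowl[p] for p in team_b)                            # Line 8
    return bat_b / bowl_a - bat_a / bowl_b                           # Line 9


# Hypothetical two-player teams, just to exercise the arithmetic.
team_a, team_b = ["a1", "a2"], ["b1", "b2"]
batting = {"a1": 4.0, "a2": 2.0, "b1": 4.0, "b2": 1.0}
bowling = {"a1": 1.0, "a2": 2.0, "b1": 2.0, "b2": 1.0}
balls_bowled = {"a2": 12}  # a2 has used half of the 24-ball quota

before = relative_strength(team_a, team_b, batting, bowling, balls_bowled, set())
after = relative_strength(team_a, team_b, batting, bowling, balls_bowled, {"b2"})
print(before, after)
```

In this toy case the value drops once b2 is dismissed, which is the dynamic
behaviour described in Section 4.3.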


5     Experiments and Results

5.1    Dataset

The dataset can be broadly divided into two categories: historical data,
pertaining to the career statistics of players, and ball-by-ball data, pertaining to
the various states of a match. The dataset for career statistics has been scraped
from the Cricinfo website [13] for all the matches played in seasons 3-10 of
the IPL. The ball-by-ball data for each match in seasons 3-10 of the IPL has been
provided by the Cricsheet website [14]. This dataset comprises the match statistics
recorded after each ball, including runs scored, wickets lost, current batsmen,
current bowler, winner of the match, date of fixture, etc. We combined data
from both these sources to build our prediction model. IPL seasons 3-7 have
been used for training our model, season 8 has been used for validating the
parameters, and seasons 9 and 10 have been used as the test data, where each
season consists of 59 matches.


5.2    Learning Parameters

To learn the values of the parameters n and µ, used in Equations 4, 5, 9 and
10, we used all the matches played in seasons 3-8. We did a grid search over the
values of n and µ. For each combination of n and µ, we ranked the players in
the order of their estimated batting scores and estimated bowling scores, which
are compared against the actual ranked batting and bowling score lists in the
last n matches. Finally, n = 4 and µ = 0.8 yielded the least squared error in
terms of the rank difference between estimated and the actual lists.
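
The grid search over n and µ can be sketched as follows. The paper does not give
implementation details, so the two callables that supply estimated and actual
scores, the candidate values and the toy data are all assumptions; the sketch
only illustrates picking the pair with the least squared rank difference.

```python
from itertools import product


def ranks(scores):
    """Map each player to a rank, 1 being the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {player: position for position, player in enumerate(ordered, start=1)}


def rank_squared_error(estimated, actual):
    """Sum of squared rank differences between two score tables."""
    est_rank, act_rank = ranks(estimated), ranks(actual)
    return sum((est_rank[p] - act_rank[p]) ** 2 for p in estimated)


def grid_search(n_values, mu_values, estimate_scores, actual_scores):
    """Return (error, n, mu) for the combination with the least error."""
    best = None
    for n, mu in product(n_values, mu_values):
        error = rank_squared_error(estimate_scores(n, mu), actual_scores(n))
        if best is None or error < best[0]:
            best = (error, n, mu)
    return best


# Toy data: hypothetical career and recent scores for three players.
career = {"a": 10.0, "b": 5.0, "c": 1.0}
recent = {"a": 1.0, "b": 5.0, "c": 10.0}


def estimate_scores(n, mu):
    # In this toy example n is unused; real recent scores would depend on it.
    return {p: mu * career[p] + (1 - mu) * recent[p] for p in career}


def actual_scores(n):
    return {"a": 3.0, "b": 2.0, "c": 1.0}


print(grid_search([4], [0.2, 0.8], estimate_scores, actual_scores))
```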

[Figure 5.2. Accuracy of different classifiers. y-axis: Accuracy % (ticks 71 to 76);
x-axis: Decision Tree, KNN, Gradient Boost, SVM, Logistic Regression, Random Forest]

5.3   Results
[Figure 5.3. Accuracy after each over in innings_2. y-axis: Accuracy % (ticks 60 to 100);
x-axis: Overs (0 to 20)]

   Using the features described and match outcome as the label, we evaluated
various binary classifiers such as SVM, Random Forests, k Nearest Neighbors
(kNN), Logistic Regression and Decision Trees using their scikit-learn [15]
implementations. The ParameterGrid mechanism has been used to evaluate all
possible combinations of parameters for all the above listed algorithms.
Figure 5.2 shows the accuracies of the different classifiers. The small differences in
the accuracy of the different classifiers suggest that the predictive power lies
in the features and not the classifier used. The Random Forest algorithm with
parameter n_estimators = 28 has yielded the highest accuracy on the
validation set among the best models of all the classifiers. The results for this
model are shown in Figure 5.3.
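
A parameter sweep of this kind can be sketched with scikit-learn as follows. The
data here is synthetic and the labelling rule is invented purely so that the
example runs; the parameter grid is also an assumption, since the paper reports
only n_estimators = 28 for the selected Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

rng = np.random.default_rng(0)


def synthetic_states(n_rows):
    """Placeholder rows: runs, wickets, balls remaining and relative strength."""
    runs = rng.integers(1, 200, n_rows)
    wickets = rng.integers(1, 11, n_rows)
    balls = rng.integers(1, 121, n_rows)
    strength = rng.normal(0.0, 1.0, n_rows)
    X = np.column_stack([runs, wickets, balls, strength])
    # Invented rule for illustration: Team_B wins if the required rate is low.
    y = (runs / balls < 1.4 + 0.1 * strength).astype(int)
    return X, y


X_train, y_train = synthetic_states(600)
X_val, y_val = synthetic_states(200)

# Hypothetical grid; every combination is fitted and scored on validation data.
grid = ParameterGrid({"n_estimators": [10, 28, 50], "max_depth": [None, 5]})
best_params, best_acc = None, -1.0
for params in grid:
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)  # mean accuracy on the validation rows
    if acc > best_acc:
        best_params, best_acc = params, acc

print(best_params, round(best_acc, 3))
```

Selecting on a held-out validation set, as here, mirrors the paper's use of
season 8 for validating parameters.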
    From the plot in Figure 5.3, we observe an increasing trend in prediction
accuracy after each over as the match progresses towards completion. This shows
that our classifier predicts the winner with increasing accuracy after
each over. It also agrees with the common intuition that as the game nears its
end, it is easier to predict the winner based on a given match state. While
the overall trend of prediction accuracy in Figure 5.3 is increasing, some
fluctuations are observed around the middle overs. This is because a game need
not progress with steadily increasing chances of one team’s victory. Most of
the time, the game swings between the two teams based on their very recent
(last few overs) performance in the match. However, when generalized over a set
of matches, the probability of accurately predicting the winner increases as the
game progresses towards its end.
    The overall prediction accuracy obtained regardless of the match state is
75.68%, with an accuracy of 65.79% at the beginning of the second innings
which increases to 84.15% at the end of the 19th over.
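
Per-state and overall accuracy of this kind can be computed from a list of
predictions as sketched below. The record layout (state index, predicted label,
actual label) and the toy records are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_state(records):
    """records: iterable of (state_index, predicted_label, actual_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for state, predicted, actual in records:
        total[state] += 1
        correct[state] += int(predicted == actual)
    per_state = {s: correct[s] / total[s] for s in sorted(total)}
    overall = sum(correct.values()) / sum(total.values())
    return per_state, overall


# Toy records: two matches observed at states 0 and 1.
records = [(0, 1, 1), (0, 0, 1), (1, 1, 1), (1, 0, 0)]
per_state, overall = accuracy_by_state(records)
print(per_state, overall)
```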
    There have been several works, such as [9] and [11], specifically addressing the
problem of winner prediction in ODI and Twenty20 cricket. However, our study
cannot be directly compared to them, as we make predictions only from the
beginning of innings_2 and our model cannot be mapped onto their settings
for comparison. Nevertheless, Table 2 summarizes some of the previous works
and their stated accuracies.


                  Table 2. Various Winner Prediction Models in Cricket

Model           Description                                                                      Accuracy
[9]             Winner prediction for IPL Season 9 (2016), at the start of the season            69.64%
[11]            A dynamic winner prediction model for ODIs, January 2011 to July 2012            68%-70%
Baseline model  Only #runs remaining, #wickets remaining, and #balls remaining used as features  69.37%
Our model       Dynamic winner prediction in IPL matches, for seasons 3-10 (2010-2017)           75.68%


    The accuracy of Our model is greater than the accuracy of Baseline model
in Table 2. This shows the significance of Relative strengthT eamB /T eamA as a
feature for making robust predictions.


6    Conclusion and Future Work

The problem of dynamic winner prediction in a Twenty20 cricket match has been
addressed in this paper. A combination of features that capture the
state of the match has furnished promising results. Relative strength_{Team_B/Team_A}
has been shown to be an important feature that quantifies and
compares the strengths of the playing teams. To make the prediction
model address the entire match, we intend to extend our
approach to account for the innings_1 dynamics as well. The primary
challenge here is to estimate the score that the team
batting first is expected to reach.

References
1. Duckworth, Frank C., Anthony J. Lewis: A fair method for resetting the target
  in interrupted one-day cricket matches. Journal of the Operational Research Society
  49.3 (1998): 220-227.
2. Beaudoin, David, Tim B. Swartz: The best batsmen and bowlers in one-day cricket.
  South African Statistical Journal 37.2 (2003): 203.
3. Kimber, Alan: A graphical display for comparing bowlers in cricket. Teaching
  Statistics 15.3 (1993): 84-86.
4. Van Staden, Paul Jacobus: Comparison of cricketers bowling and batting
  performances using graphical displays. (2009).
5. Lemmer, Hermanus H.: The allocation of weights in the calculation of batting
  and bowling performance measures. South African Journal for Research in Sport,
  Physical Education and Recreation (SAJR SPER) 29.2 (2007).
6. Khan, Mehvish, Riddhi Shah: Role of External Factors on Outcome of a One Day
  International Cricket (ODI) Match and Predictive Analysis.
7. Kaluarachchi, Amal, Varde Aparna S.: CricAI: A classification based tool to predict
  the outcome in ODI cricket. 2010 Fifth International Conference on Information and
  Automation for Sustainability. IEEE, 2010.
8. Madan Gopal Jhawar, Vikram Pudi: Predicting the Outcome of ODI Cricket
  Matches: A Team Composition Based Approach. European Conference on Machine
  Learning and Principles and Practice of Knowledge Discovery in Databases
  (ECML-PKDD 2016), September 2016, Conference Center, Riva del Garda. Report no:
  IIIT/TR/2016/32
9. Deep C Prakash, C Patvardhan, Vasantha C Lakshmi: Data Analytics based Deep
  Mayo Predictor for IPL-9. International Journal of Computer Applications
  152(6):6-11, October 2016.
10. Barr, G. D. I., B. S. Kantor: A criterion for comparing and selecting batsmen
  in limited overs cricket. Journal of the Operational Research Society 55.12 (2004):
  1266-1274.
11. Sankaranarayanan, Vignesh Veppur, Junaed Sattar, Laks VS Lakshmanan:
  Auto-play: A Data Mining Approach to ODI Cricket Simulation and Prediction. SDM.
  2014.
12. Michael Bailey, Stephen R. Clarke: Predicting the match outcome in One Day
  International cricket matches, while the game is in progress. The 8th Australasian
  Conference on Mathematics and Computers in Sport, 3-5 July 2006, Queensland,
  Australia, 5 December 2006.
13. ESPN Cricinfo: http://www.espncricinfo.com
14. IPL data: http://cricsheet.org
15. Pedregosa, Fabian, et al.: Scikit-learn: Machine learning in Python. Journal of
  Machine Learning Research 12.Oct (2011): 2825-2830.

</pre>