Analyzing and predicting NCAA volleyball match outcome using machine learning techniques

Dhvanil Sanghvi, Priya Deshpande, Suhas Shanbhogue and Vishwa Shah
BITS Pilani, India
ICAIW 2021: Workshops at the Fourth International Conference on Applied Informatics 2021, October 28-30, 2021, Buenos Aires, Argentina
f20170379@goa.bits-pilani.ac.in (D. Sanghvi); f20170776@goa.bits-pilani.ac.in (P. Deshpande); f20170769@goa.bits-pilani.ac.in (S. Shanbhogue); f20180109@goa.bits-pilani.ac.in (V. Shah)

Abstract
In this paper, we perform a thorough match prediction analysis of our newly mined NCAA (National Collegiate Athletic Association) volleyball data set. We also investigate the comparative power of two distinct yet comparable models, namely team aggregates and player aggregates, for predicting the outcome of an NCAA volleyball match. The predictor variables for both models are mainly hitting rates of serves, recepts, attacks, and assists; the output variable is the winning team. Apart from features specific to volleyball, we also incorporate a few general match statistics. Among the multitude of machine learning models available for classification, the study settles on three primary ones, viz. Logistic Regression, Decision Trees, and Neural Networks. Results show that Decision Trees and Neural Networks perform considerably well in both the team and player models on the ROC metric and accuracy, with Neural Networks giving marginally better results. Logistic Regression on team aggregates performs only slightly better than randomized outcomes, whereas for the player model it performs considerably better. In terms of model structure, player aggregates give much better classification than team aggregates, with a maximum ROC of 0.98. This shows that volleyball, despite being a team sport, is intrinsically more impacted by the players who make up the team than by the team as a whole. Our model accuracy suggests that this model can be successfully used to predict the outcome of an NCAA volleyball match.

Keywords
NCAA, Data Mining, Volleyball, Machine Learning

1. Introduction
One of the most common tasks in supervised machine learning is classification [1]. This is largely due to its direct applicability to sports prediction. Sports prediction is part of an enormous market and forms the crux of a team analyst's work. In some sports, getting the strategy and team composition right can make the difference between winning and losing a whole tournament. Stakeholders of a team, such as owners, coaches and analysts, rely on computer simulations and models to predict team performance with respect to strategies and tactics. The large monetary rewards in betting further underline the need for accurate models for predicting sports matches. [2] states that betting markets are highly volatile and are subject to negative returns in the long run. In most of the literature, historical statistics, player performance statistics and opposition information have traditionally been used as features [1].
[3], in their paper 'Using Bookmaker Odds to Predict the Final Result of Football Matches', state that bookmakers' odds correlate significantly with match outcomes and can be used for predicting matches. [4] looks into numeric prediction, where the authors model winning margins in college football. [5] uses a unique ranking method to predict and model the English Premier League. The authors of [4] also make the keen observation that treating the prediction as a classification problem, rather than deriving the outcome from a regression model, gave higher accuracy. Our paper treats match prediction as a classification problem of win and loss. While most existing literature focuses on mainstream sports, we turn our attention to another popular sport, volleyball, and fill the gap in the current literature. Additionally, we compare how holistic team features stack up against aggregated individual player features.
In recent times, volleyball has gained immense popularity in the world of both professional sports and recreational leagues. The sport is played at the Olympics and also has many European and American leagues associated with it. Volleyball is played both on turf and on the beach. It is important to note that these are entirely different sports, and our focus in this paper is turf volleyball in the popular NCAA (National Collegiate Athletic Association) league.

1.1. Volleyball: Rules and Regulations
Before we dive deep into the machine learning aspects, let us briefly go over the structure and terminology of volleyball to build an intuition for the game. The volleyball rules, as stated in the NCAA Women's Volleyball Handbook, are as follows. A typical volleyball game has 6 players on each side of the court; the full squad, however, consists of 10-12 players with rotating substitutions. Every time a side wins back the serve, there is a rotation among the 6 players on court so that no player serves twice consecutively. The sport can be played indoors as well as outdoors; most of the basic rules remain the same, although the size of the court and the position of boundaries change slightly [6]. The rules of volleyball are the same for women and men, with the only exception that the official height of the net is lower for women. The two ends of the net must be at the same height, and the net cannot exceed the official height by more than 2 cm. Some basic rules of volleyball include:
• A team cannot touch the ball more than 3 times before it crosses the net.
• A player cannot touch the ball twice in succession.
• The ball may not be lifted, held or carried.
Two forms of scoring are observed in volleyball. Under side-out scoring, a serving team loses the serve if it makes a mistake; a point is awarded to the serving team only if the non-serving team makes an error, otherwise the rally leads to a service change. Under the formally adopted rally scoring mechanism, every serve results in a point for one of the two teams. Matches are played to 25 points and up to 3 games. In order to win a game, a team must have a 2-point lead; otherwise the score keeps accumulating beyond 25 until one team leads by two. The match ends when a team wins the majority of the games, i.e. 2. For this research we have used historical data on both teams and players to predict the winning team, using the features described in the following sections. Important terms and definitions used in this text can be found in the appendix.
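To make the scoring rules concrete, the following minimal Python sketch (our own illustration, not taken from the paper) encodes the rally-scoring, win-by-two logic for a single game:

```python
def game_is_won(score_a: int, score_b: int, points_to_win: int = 25) -> bool:
    """Return True once a game is over under rally scoring with a win-by-two rule."""
    leader, trailer = max(score_a, score_b), min(score_a, score_b)
    return leader >= points_to_win and leader - trailer >= 2

# Example: 25-24 does not end the game, 26-24 does.
assert not game_is_won(25, 24)
assert game_is_won(26, 24)
```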
2. Review of existing literature
In [7], two models were developed for forecasting the point spread in women's volleyball: a regression model for predicting the point spread, and a logistic regression model for predicting the probability of winning. The differences between the two teams' averaged in-game statistics were calculated and placed in the model. The score margin model had an accuracy of 68% when the differences in the averages of the in-game statistics were used. [8] uses Logistic Regression for predicting football results from the Barclays Premier League and sofifa.com. The authors highlight the most significant features used by previous researchers, which include Home Offense, Home Defense, Away Offense, and Away Defense. The paper gives additional insight into the coefficients obtained from logistic regression, concluding that the most significant variables are Home Defense and Away Defense. In [9], the authors discuss using a machine learning approach, an ANN, to predict the outcomes of one week of matches, specifically applied to the Iran Pro League (IPL) 2013-2014 football season. Data obtained from matches in the last seven league seasons are used to make better predictions for future matches. Some unique features used as inputs to the ANNs are the quality of opponents in recent matches, the condition of teams in recent matches, and the condition of teams in the overall league.
[10] gives good insight into our dataset and the box-score statistics of a volleyball match. The paper used the 1994 NCAA Women's volleyball tournament and calculated the mean and standard deviation across each division along with the correlation coefficients. The authors use multiple regression to predict a match using the aforementioned features. The paper found that the attack coefficient correlates best with a team's success. Blocking statistics were the next most important for Divisions I and II, while the serve was the next most important statistic for Division III. This simple regression method explained 60% of the variance in team success across the divisions.
[11] is a review paper which gave us a bird's-eye view of the existing research on match prediction: which sports researchers most often choose for prediction, and how frequently each ML algorithm is used in the existing literature. The paper shows that ANNs are the most frequently used models for predicting team sport matches: 65% of the papers consider ANN models as part of their experiments and 23% of the papers use ANNs exclusively. However, the authors note that using ANN models does not necessarily lead to high prediction accuracy, and it is unclear why ANN models have historically been so widely used. They also point out that ANN models are black boxes, and it is very difficult for analysts and coaches to reverse-engineer the outcomes of ANN predictions.
[12] uses multiple algorithms, and even multiple sports, in order to make sports analytics more accurate. Although volleyball is not among the sports discussed, the paper was instrumental in pointing us towards using individual player statistics to predict the winning team. The paper also supports our belief that individual player feature estimates are strongly correlated with team features.
[13] provides the framework and basis for using Artificial Neural Networks to predict men's professional volleyball league rankings.
It also gives some insight into the features that can be used to predict the rankings, such as wins, defeats, home/away matches, etc. The paper concludes by suggesting that the best ANN is a single-hidden-layer, 4-neuron model with the "logsig" transfer function, "trainlm" training function, and "learngmd" adaptive learning function.
[14] proposes a Bayesian hierarchical model to predict the rankings of volleyball national teams. The model also allows the estimation of the result of each match played in the league. The model consumes efficiencies in four categories - Serve, Attack, Defense and Block - calculated as follows:

$$\text{Efficiency} = \frac{\text{Perfect attempts} - \text{Total errors}}{\text{Total attempts}} \tag{1}$$

[15] adopts a Logistic Regression model based on the efficiencies of the players at the different positions. The efficiency is calculated in a similar manner to [14]. The different efficiency variables used were libero efficiency, middle blocker efficiency, setter efficiency, outside hitter efficiency and universal hitter efficiency.

3. Dataset
3.1. Raw data
The data was obtained from the National Collegiate Athletic Association's (NCAA) official website [16]. It includes data for all the Division I Women's Volleyball matches played from 2011 through 2015. Generally, volleyball statistics are split into the following six categories: Attacking, Serving, Setting, Passing, Defending and Blocking [6]. The analysis in this paper consumes statistics from the following four categories:
• Attacks
• Assists
• Serves
• Blocks
First, we built a web crawler to crawl the NCAA website and extract the links to the statistics pages of individual matches. The data was then scraped from these links using BeautifulSoup, a package available in Python. The code for this scraper is available in this repository. We use an HTML parser to parse each statistics page; all the different tables were extracted into Pandas dataframes and stored in a dictionary object. The data on the website is organized match-wise, so the links of all matches were extracted first and then iterated over to obtain the data for each match. We maintain a dictionary holding the statistics of all matches up to (but not including) the current match, and this dictionary serves as our database for generating priors and features for the current match prediction. The key of the dictionary is either a team name, to hold team statistics, or a tuple of (team name, player ID, player name), to hold player statistics. For a given match we look into this dictionary database and engineer the features for the current match; this ensures that there is no data leakage whatsoever into our model predictions. After the feature engineering is done, we feed the current match's data into the dictionary so that its statistics can also be used as a prior for the next upcoming match. We discuss the feature engineering in the next section.
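The scraper itself lives in the authors' repository; the following is only a minimal sketch of the pipeline described above, with the URL pattern, the "box_score" link filter and the page layout being our assumptions (requests, BeautifulSoup and pandas are the only dependencies):

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_match_links(index_url: str) -> list[str]:
    """Collect links to individual match statistics pages from an index page."""
    soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    # The "box_score" filter is an assumption about how match links are named.
    return [a["href"] for a in soup.find_all("a", href=True) if "box_score" in a["href"]]

def scrape_match(match_url: str) -> list[pd.DataFrame]:
    """Parse every HTML table on a match page into a Pandas dataframe."""
    return pd.read_html(requests.get(match_url).text)

# Dictionary "database" described in the text:
#   team name                           -> accumulated team statistics
#   (team name, player id, player name) -> accumulated player statistics
history: dict = {}
```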
3.2. Generating Priors for the Current Match
This is the generation of features from the primary dataset to make it ready for consumption by the machine learning model. It is natural to assume that, to predict Match_i (match number i of the tournament), the only data available to us is up to Match_{i-1}.
Phase 1: In phase 1, only the overall team statistics from the past matches are used to predict the outcome of the match. Therefore, a weighted average of the different statistics available was taken. A weight of one would give recent matches the same importance as older matches; instead, the decay factor was taken to be 0.9 so as to incorporate a decaying effect with the age of the match statistic. This gives performances in recent matches more weight than performances in matches played some time back. This can be understood with the help of the following expression:

$$F_{m=i} = \frac{F_{m=i-1} + 0.9\,F_{m=i-2} + 0.9^{2}\,F_{m=i-3} + \dots + 0.9^{i}\,F_{m=0}}{1 + 0.9 + 0.9^{2} + \dots + 0.9^{i}} \tag{2}$$

Here, F refers to a particular feature corresponding to that match and m refers to the match number. Therefore, for match number i played by a team, say Arizona, the features for that match would be a weighted average running from Arizona's previous match back to Arizona's first match in that season.
Phase 2: In phase 2, player-wise statistics were used for the study. For each player of a particular team, the features of that player were engineered from their performance in the previous matches played for the same team. The features were weighted in the same way as in phase 1.
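A minimal sketch of the decayed weighted average of Eq. (2), used for both the team and player priors; it assumes past feature values are stored most-recent-first, and the neutral value returned for an empty history is our own assumption:

```python
def decayed_average(past_values: list[float], decay: float = 0.9) -> float:
    """Weighted average of past feature values (most recent first), as in Eq. (2)."""
    if not past_values:
        return 0.0  # assumed neutral prior when a team/player has no history yet
    weights = [decay ** k for k in range(len(past_values))]
    return sum(w * v for w, v in zip(weights, past_values)) / sum(weights)

# Example: three past Attack PCT values, most recent first; the most recent counts most.
print(decayed_average([0.30, 0.25, 0.20]))
```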
4. Methodology and Feature Engineering
In order to predict the wins in a volleyball match, we settled on two different paradigms. The first is a structure that lays emphasis only on the team features and decides the match outcome from team statistics, without any direct focus on the players. The other approach is to treat the players as the strength of the team and incorporate the impact of the stronger players more directly, rather than through team-level averages. The final features of both approaches are calculated as historical averages. There are broadly six skills in volleyball that are considered crucial to a player's or team's strength; four of these are considered in this paper due to their relative importance: Attack, Assist, Serve, and Recept [17]. The concept of PCT (a term popularly used in volleyball to abbreviate percentage) is used to calculate both team and player features [7]. Next, we select the machine learning models on which to train the data. The review in [11] captures the frequency of usage of the different machine learning models applied to match prediction in the existing literature. We will train the same machine learning models for both phases so that we can compare and contrast their performances later.

[Figure 1: Flowchart of Data Scraping, Feature Engineering and Class Balancing]

The first model that we choose is Logistic Regression. Our label is a binary variable and hence the Logistic Regression model calculates the probability in the following way:

$$P(Y = 1 \mid X, \alpha) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_d X_d)}} \tag{3}$$

Here, $\alpha$ is the parameter set for the model. The probability of Y belonging to label 0 is simply $P(Y = 0 \mid X, \alpha) = 1 - P(Y = 1 \mid X, \alpha)$.
As our second machine learning model, we choose Artificial Neural Networks, or, to be more precise, feed-forward neural networks. [11] suggests that this is the most widely used model for match prediction. The third model is Decision Trees. A minimal cost-complexity approach is used for pruning the tree. The splitting criterion is to randomly initialize the threshold for each feature and then iterate to find the best split; this helps reduce overfitting.

Overcoming class imbalance. For cultural or circumstantial reasons, the NCAA data inherently listed the second team as the winner in 90% of the matches. Upon further analysis we found no reasonable explanation for this; moreover, most of the NCAA matches were played at neutral venues. In our dataset, a label of 0 implies team 1 won and a label of 1 implies team 2 won. To remove this bias from our data set, we balance the number of 0s and 1s by randomly shuffling the ordering of the two teams (using a random number generator) so that there is an equal probability of either team winning. The shuffling leads to no loss of generality, and the resulting data set is balanced.
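A minimal sketch of the random team-order swap described above; the team1_/team2_ column naming convention and the "label" column name are our own assumptions, not the authors' schema:

```python
import numpy as np
import pandas as pd

def balance_by_swapping(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Randomly swap the team1/team2 feature columns and flip the label for ~half
    of the rows, so that labels 0 and 1 become roughly equally frequent."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    team1_cols = [c for c in df.columns if c.startswith("team1_")]
    team2_cols = [c.replace("team1_", "team2_") for c in team1_cols]
    swap = rng.random(len(df)) < 0.5
    out.loc[swap, team1_cols + team2_cols] = df.loc[swap, team2_cols + team1_cols].to_numpy()
    out.loc[swap, "label"] = 1 - df.loc[swap, "label"]
    return out
```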
4.1. Phase 1
In phase 1, we focus on team aggregates as a whole to predict wins. Individual match data was collected from the NCAA website, and the training set consists of the following features, calculated from the match statistics. The data is cumulative in nature, in the sense that every following year contains the information accumulated up to, but not including, that year. The following features are used:
• Attack PCT: Attack PCT measures the average attacking power of the team. Attacks are the most important way for an offensive team to win points, and Attack PCT is directly proportional to the win rate [18].

$$\text{AttackPCT} = \frac{\text{TotalKills} - \text{TotalErrors}}{\text{TotalAttempts}} \tag{4}$$

• Serve PCT: Serve PCT is a measure of the number of service aces by a team per set.

$$\text{ServePCT} = \frac{\text{TotalServiceAces}}{\text{TotalSets}} \tag{5}$$

• Assist PCT: A player is awarded an assist if he/she passes the ball to a teammate who then closes in on a kill or attack.

$$\text{AssistPCT} = \frac{\text{TotalAssists} - \text{TotalErrors}}{\text{TotalAttempts}} \tag{6}$$

• Recept PCT: How well a team handles a potential service ace is measured by the Recept PCT.

$$\text{ReceptPCT} = \frac{\text{TotalServeRecepts}}{\text{TotalSets}} \tag{7}$$

• Set Win Ratio: The ratio of total set wins to all sets played.

$$\text{SetWinRatio} = \frac{\text{TotalSetWins}}{\text{TotalSetsPlayed}} \tag{8}$$

[Figure 2: Correlation Heatmap for NCAA Volleyball match features]

One important thing to note is that the features mentioned above show a high level of correlation with each other, as shown in Figure 2. This can be attributed to the fact that a proficient player might possess more than one skill to a reasonable extent [19], and a similar reasoning applies to team features as well. We have nevertheless kept these as individual features and maintain that they individually add soundness and robustness to the model [20].

4.2. Phase 2
In phase 2, player-wise statistics were used for the study. To represent the efficiency and strength of a player we use AttackPCT, AssistPCT and ServePCT, measured in the same way as for teams. It is assumed that the squad playing a given match is not known a priori. For each match played, we use the 10 players of each team, and for each player of a particular team, the features of that player were engineered from their performance in the previous matches played for the same team. The features were generated with a decay over past performances (similar to phase 1). Similar to [21], we define the Block Efficiency of a player from blocks, i.e., balls that have been touched by blockers and then played by the defence:

$$\text{BlockEfficiency}_{ij} = \frac{BD_{ij} + BS_{ij} + BA_{ij} - BE_{ij} - BHE_{ij}}{BD_{ij} + BS_{ij} + BA_{ij} + BE_{ij} + BHE_{ij}} \tag{9}$$

where BD is block digs, BS is block solos, BA is block assists, BE is block errors and BHE is ball handling errors; the subscript ij refers to player j of a particular team i.
An ace is a serve which lands in the opponent's court without being touched, or is touched but cannot be kept in play by one or more receiving-team players, resulting in a point for the serving team. Since an ace is a special kind of serve, the following ratio is representative of a player's serving skill and is hence used as a feature:

$$\text{AceStrength}_{ij} = \frac{\text{ServeAces}_{ij}}{\text{TotalServes}_{ij}} \tag{10}$$

4.2.1. Overcoming high-dimensional data
The authors collected data for 299 matches spread across five years; for a particular year, there are fifty to sixty matches. Using player-wise data, however, leads to an explosion in the number of features, because there are five features each for the twenty players who are going to play in that match (ten per team), summing up to a total of 100 features. This is a severe problem [22] because the number of data points in a given year is much smaller than the number of features. Hence, it is imperative to reduce the dimension of the data [23]. It can be observed that all five player features are efficiency ratios, so they are already normalized between zero and one. To reduce the number of features for prediction without losing important information, a single metric is allotted to every player. This metric (playerScore) acts as a proxy for player strength and indicates how valuable the player is to the team. It is calculated in the following way:

$$\text{playerScore}_{ij} = \frac{\text{AttackPCT}_{ij} + \text{AssistPCT}_{ij} + \text{ServePCT}_{ij} + \text{AceStrength}_{ij} + \text{BlockEfficiency}_{ij}}{5} \tag{11}$$

Here, following the convention above, j represents a particular player of team i; the score is simply the mean of those five features. These scores are generated for every player who is (predicted to) play in that match. The data set is then transformed into a new, lower-dimensional data set which contains the five features mentioned above (of which the mean is taken) for three players of each team. These three players are the representatives of the team for that match. The first representative player's statistics are an average over the best three players of the team on the basis of the calculated playerScore. Similarly, the second representative is an average over the three players of medium strength (the fourth, fifth and sixth best players of the team), and the third representative is an average over the poorer-performing teammates, i.e. the four lowest playerScores. The three representatives are calculated in the same manner for the second team. Thus, for each match we have a total of 30 features, containing the features of the 3 representatives of each team. This allows us to capture important relationships between the match winner and the 3 representatives: using the coefficients of the logistic regression and the feature importances of the decision tree, we can empirically infer relations between the match winner and the best players, the average players and the worst players.
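A sketch of how a 10-player roster could be collapsed into the three representatives described above (3 representatives x 5 features per team, i.e. 30 features per match); the data structures and names are illustrative, not the authors' code:

```python
import numpy as np

FEATURES = ["attack_pct", "assist_pct", "serve_pct", "ace_strength", "block_efficiency"]

def player_score(player: dict) -> float:
    """Mean of the five efficiency ratios, as in Eq. (11)."""
    return float(np.mean([player[f] for f in FEATURES]))

def team_representatives(players: list[dict]) -> list[dict]:
    """Collapse a 10-player roster into 3 representatives: the averages of the
    top-3, middle-3 and bottom-4 players ranked by playerScore."""
    ranked = sorted(players, key=player_score, reverse=True)
    groups = [ranked[:3], ranked[3:6], ranked[6:]]
    return [{f: float(np.mean([p[f] for p in g])) for f in FEATURES} for g in groups]
```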
4.2.2. Model construction
In this section we give details of the architecture of our models. Three models were constructed and trained on the data sets for our prediction task, and each model was tuned for the best hyperparameters:
1. Logistic Regression model
2. Decision Tree Classifier
3. Artificial Neural Network

Logistic Regression. A tuned Logistic Regression model was used as a baseline for model training. The limit on the maximum number of iterations was set to 100000, and the optimizer was set to limited-memory BFGS (lbfgs).

Decision Tree Classifier. The DecisionTreeClassifier from the sklearn library was used to implement the decision tree. A critical factor is that, with such limited data points and without any hyperparameter tuning, the tree completely over-fits the training set. We therefore set a particular value of ccp_alpha, the hyperparameter for cost-complexity pruning provided by sklearn's decision tree. To observe how the impurity of the leaves varies with ccp_alpha, we plot this relationship in Figure 3. We set ccp_alpha to 0.06 in our final model so that the model does not overfit and remains robust. The model uses entropy as the criterion to judge the quality of a split, and the class weight is set to "balanced" so that both class labels are learned equally.

[Figure 3: Impurities in leaf nodes vs. ccp_alpha]

Artificial Neural Networks. After extensive experimentation with regularization in our ANN models, we settled on the architectures reported in the results. We use the Adam optimizer and disable shuffling to prevent data leakage, as our data is time-series in nature.

5. Results
Since this is a classification task, we use the ROC-AUC metric to analyze our models. The second metric we use is the F1 score, the harmonic mean of precision and recall, which conveys the balance between the two. We use data spanning 2011 to 2015, and since our data depends on time and on the sequence in which the matches were played, we use the following train-test splits:

Table 1: Year-wise train-test splits
Split    Train                     Test
Split1   2011                      2012
Split2   2011, 2012                2013
Split3   2011, 2012, 2013          2014
Split4   2011, 2012, 2013, 2014    2015

For example, in Split2, data from matches played in 2011 and 2012 is used to predict matches played in 2013. We report the average of the ROC-AUC scores obtained in the four cases above. We have also taken care that, within a season, the match data is not shuffled and is ordered according to the dates on which the matches took place.
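The evaluation protocol in Table 1 is an expanding-window, year-wise split. A minimal sketch (with Logistic Regression standing in for any of the three models, and assumed "year", "date" and "label" column names) could look like this:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

def yearwise_evaluation(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Train on all seasons before the test year, test on that year (Splits 1-4)."""
    rows = []
    for test_year in (2012, 2013, 2014, 2015):
        train = df[df["year"] < test_year].sort_values("date")  # keep chronological order
        test = df[df["year"] == test_year].sort_values("date")
        model = LogisticRegression(max_iter=100000, class_weight="balanced")
        model.fit(train[feature_cols], train["label"])
        prob = model.predict_proba(test[feature_cols])[:, 1]
        rows.append({"test_year": test_year,
                     "roc_auc": roc_auc_score(test["label"], prob),
                     "f1": f1_score(test["label"], (prob >= 0.5).astype(int))})
    return pd.DataFrame(rows)
```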
5.1. Phase 1 - Team Data
In this section we compare the results of the models developed using team data and the features described in Section 4.1.

5.1.1. Logistic Regression
We develop a Logistic Regression model for binary classification. We set the maximum number of iterations to 100 and set the class weight parameter to "balanced" to automatically adjust weights inversely proportional to class frequencies in the input data. The value of C was chosen by tuning it across various values (200 values on a logarithmic scale) to obtain appropriate regularization. The output label is 1 or 0 depending on which team is predicted to win (0: team 1 wins, 1: team 2 wins).

Table 2: Split-wise ROC-AUC and F1 score using Logistic Regression on Team Data
Split    ROC AUC Score    F1 Score
Split1   0.57856          0.59701
Split2   0.59833          0.51064
Split3   0.72945          0.64407
Split4   0.66181          0.61290
Mean     0.64204          0.59116

5.1.2. Decision Tree
We use the DecisionTreeClassifier with the splitting criterion set to gini and the splitter set to 'random', so that split thresholds are drawn randomly rather than always taking the best split. As we have limited data, we want to ensure that the decision tree does not overfit; to this end we tuned max_depth and min_samples_leaf to 10 and 20 respectively. The ccp_alpha parameter here was chosen as 0.01.

Table 3: Split-wise ROC-AUC and F1 score using Decision Tree on Team Data
Split    ROC AUC Score    F1 Score
Split1   0.5              0.66667
Split2   0.65917          0.51429
Split3   0.89022          0.89655
Split4   0.97607          0.91228
Mean     0.75636          0.74744

5.1.3. Artificial Neural Networks
We use the Sequential model provided by the Keras library for the Artificial Neural Network. The 5 x 2 features from the two teams, i.e. 10 units, form the input layer. We use the ReLU (Rectified Linear Unit) activation function in both hidden layers to learn a non-linear mapping for the classification task. The final output is passed through a sigmoid function, which gives an output in the range (0, 1) denoting the probability of a team winning in our binary classification task. We train the neural network for 100 epochs and set shuffle = False, as we want to prevent data leakage since our data is time-series in nature.

Input (10 units) → Dense (15 units, Actn: ReLU) → Dense (25 units, Actn: ReLU) → Output (1 unit, Actn: Sigmoid)

Table 4: Split-wise ROC-AUC and F1 score using Artificial Neural Networks on Team Data
Split    ROC AUC Score    F1 Score
Split1   0.82258          0.81356
Split2   0.81667          0.81633
Split3   0.85483          0.84746
Split4   0.90323          0.9
Mean     0.84933          0.84434

In all three models above there is a similar trend in the ROC/F1 score versus the split being trained on: as the training data increases, the metrics of the model improve, since the model has learned from more data.
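A minimal Keras sketch of the Phase 1 network described in Section 5.1.3; the binary cross-entropy loss and the batch size are assumptions, since they are not reported, and the training data below is only a random placeholder:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Phase 1 network: 10 team features in, two ReLU hidden layers, sigmoid win probability out.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(15, activation="relu"),
    layers.Dense(25, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])

# Placeholder data; shuffle=False preserves the chronological order of matches.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 10)), rng.integers(0, 2, 100)
model.fit(X_train, y_train, epochs=100, shuffle=False, verbose=0)
```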
5.2. Phase 2 - Player-wise Data
In this section, we document the results obtained from the different models trained on player-wise statistics for a given volleyball match.

5.2.1. Logistic Regression
Similar to the Phase 1 Logistic Regression on team data, we tuned and trained a Logistic Regression model on the player data to set a baseline. We set the class weight parameter to "balanced" to automatically adjust weights inversely proportional to class frequencies in the input data, and tuned across multiple values of C (2000 of them on a logarithmic scale) to obtain the best regularization.

Table 5: Split-wise ROC-AUC and F1 score using Logistic Regression on Player Data
Split    ROC AUC Score    F1 Score
Split1   0.98439          0.93103
Split2   0.95833          0.84615
Split3   0.98387          0.98412
Split4   1.0              1.0
Mean     0.98568          0.94032

5.2.2. Decision Tree
The Decision Tree is trained with ccp_alpha set to 0.06. The class weights are set to balanced so that both classes are learned equally, and the criterion used to judge the quality of a split is entropy.

Table 6: Split-wise ROC-AUC and F1 score using Decision Tree on Player Data
Split    ROC AUC Score    F1 Score
Split1   0.81842          0.76667
Split2   0.93583          0.87719
Split3   0.93184          0.95385
Split4   0.99844          0.98413
Mean     0.92113          0.89546

Table 6 clearly suggests that the decision tree model performs better as the number of training instances increases: from an F1 score of 0.76667 when testing on the 2012 data, it rises to an F1 score of 0.98413 when testing on 2015. We observe an approximately linear growth in the ROC AUC and F1 scores across the years.

5.2.3. Artificial Neural Networks
We used player vectors of 30 features as input to the Neural Network and set shuffle = False, as we want to prevent data leakage since our data is time-series in nature. We trained for 10 epochs with the Adam optimizer. The architecture below gave the best results after tuning:

Input (30 features) → Dense (20 units, Actn: ReLU, Dropout: 0.25) → Dense (25 units, Actn: ReLU, Dropout: 0.25) → Dense (30 units, Actn: ReLU, Dropout: 0.25) → Output (1 unit, Actn: Sigmoid)

Table 7: Split-wise ROC-AUC and F1 score using Artificial Neural Networks on Player Data
Split    ROC AUC Score    F1 Score
Split1   0.96357          0.85714
Split2   0.97166          0.88000
Split3   1.0              1.0
Split4   1.0              1.0
Mean     0.98381          0.93428

In all three models above there is a clear trend of better metrics as more data becomes available with every passing year.
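A matching Keras sketch of the Phase 2 network of Section 5.2.3, assuming the 30 representative-player features as input and, again, a binary cross-entropy loss that the paper does not report:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Phase 2 network: 30 representative-player features in, three ReLU blocks with dropout.
player_model = keras.Sequential([
    keras.Input(shape=(30,)),
    layers.Dense(20, activation="relu"), layers.Dropout(0.25),
    layers.Dense(25, activation="relu"), layers.Dropout(0.25),
    layers.Dense(30, activation="relu"), layers.Dropout(0.25),
    layers.Dense(1, activation="sigmoid"),
])
player_model.compile(optimizer="adam", loss="binary_crossentropy",
                     metrics=[keras.metrics.AUC()])
# As in Section 5.2.3: trained for 10 epochs with shuffle=False.
```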
6. Discussion and Conclusion
The results show that volleyball, despite being a team sport, is intrinsically more impacted by the players who make up the team than by the team as a whole. In other words, models trained on player data contain finer-grained statistics than models trained on team data. This suggests that, in sports prediction, breaking the team down into the features of its constituent players gives the model better information and yields a better predictor. Our classification results achieved better performance than random guessing in all cases, with prediction ROCs ranging from 0.74 to almost 0.98 in some cases. In phase 1, the mean ROC scores are 0.64, 0.75 and 0.84 for LR, DT and NNs respectively; for phase 2, the corresponding scores are 0.98, 0.92 and 0.98. Neural Networks perform better in both phases. The differences in accuracy are primarily due to the different approaches we have chosen for this classification task: the aggregated player approach seems to have picked up the key causes of a team win, as it incorporates significantly more information in its features. Although our model parameters were finalized through a series of experiments, we are aware that more specialized models could result in higher accuracy. Some general features, such as the average age of the players, the average age of the team, and the number of new players, could be used as well; we were unable to use these due to a lack of relevant data.
The primary metric used was ROC AUC, as it is not biased towards the size of the test or evaluation data. While accuracy is measured on predicted classes, ROC AUC is measured on predicted scores, which makes ROC and F1 scores better suited to this classification task; a high accuracy could, moreover, be due to over-fitting. In phase 1, the mean ROC is highest for Artificial Neural Networks, as they have the ability to learn and model non-linear and complex relationships between inputs and outputs. We also observe that the Decision Tree performs much better in Split4, when it is supplied with the maximum amount of training data.
Further extensions of this model could use the extensive NCAA data available to make the current model more robust and versatile. They could also incorporate a home and away team feature; in the current work we were unable to do so, because NCAA games are not necessarily conducted at home or away grounds. We believe this is a crucial factor that should be used in sports prediction. We could also experiment with Recurrent Neural Network (RNN) architectures to learn from the temporal nature of the data.

6.1. Usage of priors
An essential point of consideration for any probabilistic model is the inclusion of prior probabilities for all possible outcomes. One way to estimate priors, as mentioned in [24], is to use historical data to arrive at a reasonable value; that paper uses the previous output to determine the new probabilities for the current year. This approach, however, poses several questions for time-series data. How many years of data should be used? What if a new team or player joins the tournament? With only five years of data and a few hundred matches, will the priors be biased towards the training data? These questions require extensive research and are beyond the scope of this work. Moreover, the NCAA data pertains to university/college-level matches, which implies that the players in a particular team may change considerably over time, so priors on teams would be of little use. For players as well, priors would have to take into account their growing experience and skill level, data to which we did not have access. After much discussion and deliberation, we decided to use equal priors, i.e., we initially assume that both teams are equally likely to win and that the only factors affecting the match outcome are the posteriors. An enhancement of the model could take these factors into consideration and estimate the relevant priors to further improve the classification accuracy.

6.2. Generating a network of players
The models that we have trained do not capture the synergistic relations between the different players. Although Artificial Neural Networks might capture these relations implicitly, no inferences about such synergies can be made from the trained network. [25] uses an edge-centric multi-view network analysis to predict the performance of a given basketball lineup in the NBA: the nodes of the networks are the players, and the weights on the edges represent the performance inhibitors/boosters due to the other players in the match. Using this technique could significantly deepen our understanding of the game of volleyball. For instance, if we find that the setter and the outside hitter have a significant impact on each other, the team manager could choose not to replace these players in the current lineup. Similarly, calculating the centralities and eigenvectors of the network could help us gain insights into the impact of individual players on team performance.
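As an illustration of the kind of analysis this future direction points to (not something implemented in this paper), here is a tiny sketch with a made-up player-interaction graph and eigenvector centrality using networkx; the edge weights are purely hypothetical synergy scores:

```python
import networkx as nx

# Hypothetical interaction graph between positions; weights are assumed synergy scores.
G = nx.Graph()
G.add_weighted_edges_from([
    ("setter", "outside_hitter", 0.9),
    ("setter", "middle_blocker", 0.7),
    ("libero", "outside_hitter", 0.5),
    ("libero", "middle_blocker", 0.4),
])
centrality = nx.eigenvector_centrality(G, weight="weight")
print(sorted(centrality.items(), key=lambda kv: -kv[1]))  # most "central" players first
```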
[26], in their paper, propose analysing these player interactions via social network theory. They re-conceptualize the sports team as a social network, so that the relations between the nodes capture the interactions between the players; if the sport concerned were basketball, for example, the network could be a ball-passing network. These network-based approaches move away from conventional machine learning methods and include richer information in the model. Useful features can be generated from the network analysis, which can in turn help build more robust models with the same conventional machine learning algorithms.

6.3. Using K-means for Clustering and Merging Similar Players
While we used an algorithm that sorts players and merges them into groups of 3, 3 and 4, K-means could have been employed to cluster similar players together and replace the vectors of the clustered players with a single vector. [27], in their paper, merge different players into relevant clusters to find a "beta player", which increased their model accuracy. The issue with this method is that it is difficult to predict the number of players in each cluster: clusters with more players may need to be weighted differently from clusters with fewer players to prevent bias. Finding the optimal value of K is another difficult task.

6.4. Decay Factor
We have chosen a decay factor of 0.9 with respect to previous matches, but technically the selection of the decay factor is itself a search problem. Further research can be done to understand how the decay factor should vary between earlier matches of the same season and matches from previous seasons; here we have considered a geometric decay. There are other questions, such as whether the decay factor should differ for past years: for example, should matches from two years back have the same decay factor as matches from the past year? Most teams in our NCAA data set play only about 1-2 matches in a season, so if we simply use a single decay factor without analysing these aspects, the impact of the past 2-3 years remains visible in the data. The decay factor could also be adjusted on the basis of the volatility of players' skill. There are many possible future directions in which this work can be extended, and it will be interesting to see their impact on predictability.

6.5. Concluding remarks
In conclusion, we believe that, taking into account our limiting factors, such as limited data availability, class imbalance, and the inability to use cross-validation or shuffling due to the time-series nature of the data, the aggregated player model implements a sound classification approach for predicting a volleyball win and can be used successfully for the given task.

References
[1] R. P. Bunker, F. Thabtah, A machine learning framework for sport result prediction, Applied Computing and Informatics 15 (2019) 27-33. URL: http://www.sciencedirect.com/science/article/pii/S2210832717301485. doi:10.1016/j.aci.2017.09.005.
[2] S. Wilkens, Sports prediction and betting models in the machine learning age: The case of tennis, SSRN Electronic Journal (2019). doi:10.2139/ssrn.3506302.
[3] K. Odachowski, J. Grekow, Using bookmaker odds to predict the final result of football matches, volume 7828, 2012, pp. 196-205. doi:10.1007/978-3-642-37343-5_20.
[4] D. Delen, D. Cogdell, N. Kasap, A comparative analysis of data mining methods in predicting NCAA bowl outcomes, 2012.
[5] R. Baboota, H. Kaur, Predictive analysis and modelling football results using machine learning approach for English Premier League, 2018. doi:10.1016/j.ijforecast.2018.01.003.
[6] NCAA, Women's volleyball rules of the game, 2020. URL: http://www.ncaa.org/playing-rules/womens-volleyball-rules-game.
[7] D. Zhang, Forecasting point spread for women's volleyball, 2016.
[8] D. Prasetio, D. Harlili, Predicting football match results with logistic regression, 2016 International Conference on Advanced Informatics: Concepts, Theory and Application (ICAICTA) (2016). doi:10.1109/ICAICTA.2016.7803111.
[9] S. M. Arabzad, M. Araghi, S.-N. Soheil, N. Ghofrani, Football match results prediction using artificial neural networks; the case of Iran Pro League, International Journal of Applied Research on Industrial Engineering 1 (2014) 159-179.
[10] N. L. Estabrook, The relationship between NCAA volleyball statistics and team performance in women's intercollegiate volleyball, Kinesiology, Sport Studies, and Physical Education Master's Theses (1996).
[11] R. Bunker, T. Susnjak, The application of machine learning techniques for predicting results in team sport: A review, 2019. arXiv:1912.11762.
[12] K. Apostolou, C. Tjortjis, Sports analytics algorithms for performance prediction, in: 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), 2019, pp. 1-4. doi:10.1109/IISA.2019.8900754.
[13] A. E. Tümer, S. Koçer, Prediction of team league's rankings in volleyball by artificial neural network method, International Journal of Performance Analysis in Sport 17 (2017) 202-211. URL: https://doi.org/10.1080/24748668.2017.1331570. doi:10.1080/24748668.2017.1331570.
[14] A. Gabrio, Bayesian hierarchical models for the prediction of volleyball results, Journal of Applied Statistics (2020) 1-21. URL: https://doi.org/10.1080/02664763.2020.1723506. doi:10.1080/02664763.2020.1723506.
[15] C. Akarçeşme, Is it possible to estimate match result in volleyball: A new prediction model, Central European Journal of Sport Sciences and Medicine 19 (2017). doi:10.18276/cej.2017.3-01.
[16] ncaa.org, Women's volleyball statistics, 2020. URL: http://www.ncaa.org/championships/statistics/womens-volleyball-statistics.
[17] A. Papageorgiou, 6 basic skills in volleyball, 2020. URL: https://www.strength-and-power-for-volleyball.com/basic-volleyball-skills.html.
[18] American Volleyball Coaches Association, B. Johnson, et al., 2020 women's volleyball statisticians' manual, 2020. URL: http://fs.ncaa.org/Docs/stats/Stats_Manuals/VB/2020.pdf.
[19] R. Ferraz, et al., Pacing behaviour of players in team sports: Influence of match status manipulation and task duration knowledge, PLoS One 13 (2018). URL: https://doi.org/10.1371/journal.pone.0192399.
[20] S. Senthilnathan, Usefulness of correlation analysis, SSRN Electronic Journal (2019). doi:10.2139/ssrn.3416918.
[21] A. Gabrio, Bayesian hierarchical models for the prediction of volleyball results, Journal of Applied Statistics (2020). URL: https://doi.org/10.1080/02664763.2020.1723506. doi:10.1080/02664763.2020.1723506.
[22] L. Yu, Feature selection for high-dimensional data: A fast correlation-based filter solution, Proceedings of the 20th International Conference on Machine Learning (2003).
[23] L. Yu, Feature selection for high-dimensional data, 2015.
[24] L. Hervert-Escobar, N. Hernandez-Gress, T. I. Matis, Bayesian based approach learning for outcome prediction of soccer matches, 2018.
[25] M. Ahmadalinezhad, M. Makrehchi, Basketball lineup performance prediction using edge-centric multi-view network analysis, Social Network Analysis and Mining 10 (2020) 72. URL: https://doi.org/10.1007/s13278-020-00677-0. doi:10.1007/s13278-020-00677-0.
[26] J. Ribeiro, P. Silva, R. Duarte, K. Davids, J. Garganta, Team sports performance analysed through the lens of social network theory: Implications for research and practice, Sports Medicine 47 (2017) 1689-1696. URL: https://doi.org/10.1007/s40279-017-0695-1. doi:10.1007/s40279-017-0695-1.
[27] R. Kumar, Z. Liu, W. Zamri, Sports competition stressors based on K-means algorithm, Malaysian Sports Journal (2019) 04-07. doi:10.26480/msj.01.2019.04.07.