=Paper=
{{Paper
|id=Vol-2436/article_1
|storemode=property
|title=Context-Driven Data Mining Through Bias Removal and Incompleteness Mitigation
|pdfUrl=https://ceur-ws.org/Vol-2436/article_1.pdf
|volume=Vol-2436
|authors=Feras Batarseh,Ajay Kulkarni
|dblpUrl=https://dblp.org/rec/conf/sdm/BatarsehK19
}}
==Context-Driven Data Mining Through Bias Removal and Incompleteness Mitigation==
<pdf width="1500px">https://ceur-ws.org/Vol-2436/article_1.pdf</pdf>
<pre>
Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation

Feras A. Batarseh∗ (fbatarse@gmu.edu)          Ajay Kulkarni∗ (akulkar8@gmu.edu)

∗ College of Science, George Mason University, 4400 University Dr., Fairfax, Virginia, USA 22030.

Abstract
The results of data mining endeavors are largely driven by data quality. Throughout these
deployments, serious show-stopper problems are still unresolved, such as: data collection
ambiguities, data imbalance, hidden biases in data, the lack of domain information, and data
incompleteness. This paper is based on the premise that context can aid in mitigating these
issues. In a traditional data science lifecycle, context is not considered. The Context-driven
Data Science Lifecycle (C-DSL), the main contribution of this paper, is developed to address
these challenges. Two case studies (using datasets from sports events) are developed to test
C-DSL. Results from both case studies are evaluated using common data mining metrics such as
the coefficient of determination (R²) and confusion matrices. The work presented in this paper
aims to re-define the lifecycle and introduce tangible improvements to its outcomes.

Keywords – Context, Data Mining, Missing Values, Outliers, Data Imbalance

1   Introduction and Motivation.

Historically, most research in AI has been focused on improving the algorithm. In the last
decade or so, however, the focus has shifted to data, and specifically big data. Ample amounts
of data reshaped AI and renewed its promise and premise. As more machine learning models are
deployed across multiple domains [1] [2], new challenges are rising. For instance, the
relevance, data types, data quality, and completeness of the inputs to a model (independent
variables) affect the significance and ‘goodness’ of the outputs (dependent variables). But how
can that be optimized? In the presented method, context is defined and injected into the
process to obtain insights that are more relevant and domain-specific. However, in most cases,
it is highly challenging to define what context is. Context is infinite [3], and so the data
that could be collected to define a complete context is also potentially infinite. For
instance, in the sports studies presented, there is an infinite amount of information that
could be collected and used for contextual awareness. For example, context can consist of data
about the weather on the day of the competition, or the type of car that the athlete owns, or
their country’s birth rate, or the type of shoes worn by them during the competition, or
whether the athlete had eggs or cereal for breakfast that day! The point is, the amount and
variety of data that could be collected to define the context of the event under study is
infinite, which makes the scope of this challenge very difficult to capture.

In data collection, given that any data could (theoretically) be collected, the four Vs of
big data (velocity, variety, veracity, and volume) are not representative of the real challenge
within the lifecycle of data science. The main (or first) challenge to be addressed is: what
data should be collected for the problem at hand? In the studies presented in this manuscript,
multiple categorical data columns, coefficients, and correlations are evaluated to define a
context, multiple approaches are explored, and the results are evaluated statistically and by
comparing them to actual results.

The major challenge found throughout the process was the quality of the data (outliers, bias,
and incompleteness). As Niels Bohr famously stated: “Prediction is very difficult, especially
if it’s about the future”. The challenge is exacerbated, however, when the event to be
predicted is an outlier. For instance, winning a gold medal, or a medal at all, is an outlier:
very few athletes win medals at the Olympics (one per sport). The same applies to most sports
events: there is only one winner of the Super Bowl and one winner of the World Cup, and that
winner is the outlier. Conversely, if an athlete is historically a winner of medals, then for
that athlete not winning a medal becomes the outlier. Therefore, locating outliers depends on
the scope, and on the subset of the universal dataset that is used. Adding more data to help
define context is also dependent on the scope, goals, and the information available in the
dataset. Even if we are looking at the same problem and the same machine learning model, the
slicing and dicing of data is constantly affecting what context consists of. Therefore, if
context is that dynamic, then how can it be captured in a data science lifecycle? This paper
examines that notion and provides solutions to it using a Context-driven Data Science Lifecycle
(C-DSL). The paper is organized as follows: the next section discusses the literature review
for context, data bias, and data incompleteness. Afterwards, C-DSL is introduced along with the
two experimental studies, and in the final section, conclusions and future research plans are
presented.

2   Related Works in Contextual Management.

As discussed previously, context plays a pivotal role in decision making as it can change the
meaning of concepts present in a dataset. The context within a dataset can be extracted and
represented as features [4]. Features in general fall into three categories: primary features,
irrelevant features, and contextual features. Primary features are the traditional ones which
are pertinent to a particular domain. Irrelevant features are features which are not helpful
and can be safely removed, while contextual features are the ones to pay attention to. That
categorization helps in eliminating irrelevant data but doesn’t help in clearly defining
context. Another promising method that aimed to solve this challenge is called the Recognition
and Exploitation of Contextual Clues via Incremental Meta-Learning [5], which is a two-level
learning model in which a Bayesian classifier is used for context classification, and meta
algorithms are used to detect contextual changes.

Another method, context-sensitive feature selection [6], described a process that outperforms
traditional feature selection such as forward sequential selection and backward sequential
selection. Domingos’s method uses a clustering-based approach to select locally relevant
features. Additionally, Bergadano et al. [7] introduced a two-tier contextual classification
adjustment method called POSEIDON. The first tier captures the basic properties of context,
and the second tier captures property modifications and context dependencies. Context
injections, however, have been more successful when they are applied to specific domains. For
example, adding context to data has significantly improved the accuracy of algorithms for
solving Natural Language Processing (NLP) problems. Dinh et al. [8] added context to correct
wrongly tagged words. In their paper, the authors combined the output from the classifier with
a set of words manually labeled with context. A transformation-based learning algorithm was
used to generate new rules for the classifier. The authors claimed that this method increased
the contextual accuracy of their application by 4.8%.

Another example used context for software testing. Context-Driven Testing (CDT) utilizes
context to reduce the number of test cases and improve on the validation and verification of
software systems. The authors of the paper reported very significant improvements in time and
quality of testing results due to context [9].

The issue of deriving context from data, however, is even more challenging. For instance,
Mary-Anne Williams [10] pointed out that data science algorithms that do not realize their
context could have an opacity problem. This can cause models to be racist or sexist (for
example). It is often observed that Google Translate refers to women as ‘he said’ or ‘he wrote’
when translating from Spanish to English. This finding was also verified by Google Inc. Another
opacity example is a word embedding algorithm which classifies European names as pleasant and
African American names as unpleasant [11]. If a reductionist approach is considered, adding or
removing data can surely redefine context. It is observed, however, that most real-world data
science projects use incomplete data [12] [13]. Data incompleteness falls within one of the
following categorizations: 1) Missing Completely at Random (MCAR), 2) Missing at Random (MAR),
and 3) Missing not at Random (MNAR). MAR depends on the observed data but not on unobserved
data, MCAR depends on neither observed nor unobserved data, and MNAR depends on the unobserved
data itself [14] [15]. There are various methods to handle missing data issues, which include
listwise or pairwise deletion, multiple imputation, mean/median/mode imputation, regression
imputation, as well as learning without handling missing data [12].

All the aforementioned works were challenged by the quality of the data. For example, several
types of bias can occur in any phase of the data science lifecycle or while extracting context.
Bias can begin during data collection, data cleaning, modeling, or any other phase. Biases
which arise in the data are independent of the sample size or statistical significance, and
they can directly affect the context of the results or the model. They also affect the
association between variables, and in extreme cases, they can even reflect the opposite of a
true association or correlation [16].

Based on reviewing multiple works in data science, the most commonly observed bias is class
imbalance due to covariate shifts. Class imbalance is represented by the unequal ratio of
categories, which can occur due to changes in the distribution of data (covariate shifts).
Class imbalance depends on four factors: 1) the degree of class imbalance, 2) the complexity
of the concept represented by the data, 3) the overall size of the training set, and 4) the
type of classifier [17]. Datasets with imbalance create difficulties in information retrieval,
filtering tasks, and knowledge representation [18] [19].

In this paper, context is extracted by deploying a variety of statistical methods: data
imputation, creation of a generic coefficient, adding data columns (such as: host country,
sport, GDP, height, weight, and age), weighted modeling, and mitigation of bias. The details
about the method (main contribution of this paper) and techniques used are presented in the
next section.

3   Context-Driven Data Science Lifecycle.

C-DSL has five main steps (Figure 1). Those five steps are represented in two experiments
(Olympics medal predictions and the UEFA Champions League winners and losers). In the first
step, data cleaning and wrangling are performed. The literature [22], [23], [24] indicates
that data cleaning helps to build more robust and more reliable models. Data wrangling is
considered one of the most expensive phases in the data science lifecycle. During that phase,
multiple decisions are taken, including: eliminating subsets of data, filtering, and
aggregation. In the second step of C-DSL, context is injected. For experiment 1, that is done
by adding details like year, host city, sport, name of athlete, country of the athlete, medal
type (gold, silver, and bronze) and the athlete’s demographic data.

Figure 1: C-DSL

For experiment 2, context is injected by collecting, cleaning and generating sentiment scores
from social media text (tweets). In step 3, data imputation, bias removal, and outlier
detection are performed for the first experiment (explained in detail in the next section). In
the fourth step of C-DSL, prediction models are built for experiment 1, while a coefficient is
created for experiment 2 and used for predictions. In the final step of C-DSL, context is
evaluated using confusion matrices and model quality measures such as R-squared, and the
performance of the models is compared with actual results of the sports events. C-DSL is
concerned with the continuous fine-tuning of data until a certain ‘contextual’ sweet spot is
achieved. The proposed statistical methods are tools that are used in combination to reach
that contextual understanding of the dataset, and then to predict based on it.

In the Olympics experiment, outliers and bias in data lead to results that are barely better
than the conventional process, but in the second experiment (Champions League), after
understanding context through data imputation and inference, a coefficient proves very
successful in predicting the results of a tournament with very high accuracy. In the next
section, an in-depth explanation of the implementation of C-DSL for both experiments is
presented.

4   Experimental Work.

This section aims to test and evaluate the method presented in this paper, and to present the
detailed process followed to define it.

4.1 Experiment #1 (Olympics Predictions): Data Preparation and Statistical Deployments.
In this experiment, an application of sports predictions has been developed using summer
Olympics data between the years 1896 and 2016. Two datasets are pulled from Kaggle.com. The
first dataset has 31,165 observations, and the second dataset consists of more than 200,000
observations. The datasets can be found here:
https://exchangelabsgmu-my.sharepoint.com/:f:/g/personal/akulkar8_masonlive_gmu_edu/EuY3SFjeQl5EpNfK8P4ZUi0BcWFN-pcUBRUpTvwuKgWmMg.

In the conventional data preparation step, winter data is filtered out (the aim is to predict
the next summer Olympics medal counts by country and sport). Summer data is then checked for
missing values. Information on some athletes was missing, such as: Age, Height, and Weight.
The function “md.pattern()” from the R “mice” package is used for getting insights into the
patterns of missing data. Additionally, it is observed, for example, that 1,888,464 athletes
didn’t win any medals; that is represented by nulls in the medals column. Nulls are then
replaced by “No medal”, because some models in R fail when given null values. The missing
values (count: 114,900) are then imputed using the Multivariate Imputation by Chained Equation
(MICE) technique [20]. After that, columns such as Sport, Gender, Age, Height, and Weight are
used as context. This operation is performed by the Predictive Mean Matching (PMM) method in R
using the “mice()” function. Fifty iterations of imputations were required
to create all the missing data, and the entire process took approximately 15 hours to
complete.

Outlier detection is then performed using the Local Outlier Factor (LOF), a density-based
outlier detection technique [21]. The main reason for choosing this method is the type of
variables in the dataset. In outlier detection it is essential to convert categorical
variables into numerical variables. In addition, the numerical variables are scaled using the
“scale()” function. Initially, there are 5 columns (Sport, Gender, Age, Height, and Weight) in
the data, but after scaling and encoding the values in categories, fifty-three representative
columns are created (as iterative combinations of these columns). The function “lofactor()” is
used with “k = 5” for outlier detection. In the function, k denotes the number of nearest
neighbors that represent the locality used for estimating the density.

Afterwards, model selection was performed; regression and random forests are used for this
experiment. In the first part, a simple linear regression model is built in R using the “lm()”
function. Further, predictions per sport per country are developed using multiple linear
regression. For that purpose, six different weight scenarios are used, and the models are
tweaked to place more significance on recent years. For random forests, classification is
based on the type of the medal (gold, silver, bronze, and no medal), Sport, Gender, Age,
Height, and Weight of the athlete. To perform the classification, medals are encoded by
numbers (“Gold = 1”, “Silver = 2”, “Bronze = 3” and “No medal = 4”), and then the model is
trained on the entire dataset from 1896 to 2012 (using the “randomForest” and “ranger”
packages in R). The results of this experiment were not very convincing (presented in
experimental results), although much better than conventional predictions. This experiment
reflected the importance of tuning the value of k, creating a coefficient, and the criticality
of inference, something that is deployed in the second experiment.

4.2 Experiment #2 (Text Mining for Context): Setup and Coefficient Creation.
In this experiment, social media data are collected to be the main driver for context. In
sports, it is safe to assume that the fans of a sports team can reflect or influence the
team’s status, and maybe even help in predicting the outcomes of that team. This study
calculates sentiment scores for text relevant to the Champions League (a European clubs soccer
championship), and uses that as the context of a team to help predict whether the team will
perform well in the next stages or not. The sentiment score for each post or tweet is
normalized on a -7 to +13 scale. The R “tm” package is used to scan through the tweets and
assign scores based on a set of predefined words. Once all the tweets have scores, a
coefficient is created: the Average Team Sentiment Score (ATSS). It is defined as:

    ATSS = (sum of the sentiment scores of all tweets at the team level)
           / (count of tweets at the team level)

Figure 2: Sentiments of tweets and counts of tweets per team

The idea of the coefficient is to represent the team’s popularity and the sentiments of its
fans. This study was deployed for eight teams: Barcelona, Real Madrid, Juventus, Bayern
Munich, Borussia Dortmund, Galatasaray, and Paris Saint Germain. Figure 2 shows a data
visualization that illustrates the sentiments of the tweets. It shows a sample of all tweets
and their sentiment values. Red is a negative sentiment, green is a positive sentiment, and
blue is neutral. The main purpose of Figure 2 is to visualize the distribution of sentiments
from the tweets across all the different teams. It can be observed from the heat map that most
of the sentiments are neutral (blue), while the pie chart indicates that Barcelona F.C. has
the highest number of tweets.

Figure 3: Sentiment score heat map by country
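
The ATSS definition above reduces to a per-team mean of tweet scores. A minimal sketch in Python follows (the study itself used R, and its code is not reproduced here). The function name, the (team, score) tuple layout, and the sample scores are illustrative assumptions, not data from the study.

```python
from collections import defaultdict


def average_team_sentiment_score(scored_tweets):
    """Return {team: ATSS} and {team: tweet count}.

    scored_tweets is an iterable of (team, sentiment_score) pairs, where
    each score is assumed to be already normalized to the -7..+13 scale.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for team, score in scored_tweets:
        totals[team] += score
        counts[team] += 1
    # ATSS = sum of scores for the team / number of tweets for the team
    atss = {team: totals[team] / counts[team] for team in totals}
    return atss, dict(counts)


# Hypothetical scores, for illustration only.
sample_tweets = [
    ("Team A", 3), ("Team A", -1), ("Team A", 4),
    ("Team B", 0), ("Team B", 2),
]
atss, tweet_counts = average_team_sentiment_score(sample_tweets)

# Teams ordered from highest to lowest ATSS.
ranking = sorted(atss, key=atss.get, reverse=True)
```

Keeping the tweet count next to the mean is a design choice made here because the text treats both the sentiment and the volume of tweets as signals for a team.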
Additionally, Figure 3 shows the sentiments when aggregated to the country level. For example,
tweets from China and Russia about the tournament are negative on average, and ones from the
USA and Canada are positive on average, while Europe varies. The results for both experiments
1 and 2 are presented in the following subsections.

4.3 Experimental Results: Olympics Predictions.
After deploying the C-DSL steps, the predictions for the first experiment were acceptable,
certainly better than without deploying context, but not very satisfactory. In the bar plot
in Figure 4, the blue bar (on the left) shows the actual number of medals and the orange bar
(on the right) shows the number of medals predicted through C-DSL.

Country    Sport          Actual   Conventional   C-DSL
USA        Gymnastics       12         18           14
UK         Gymnastics        7         11            7
UK         Kayaking          4          6            5
UK         Athletics         7          8            6
UK         Sailing           3          5            4
UK         Boxing            3          4            3
UK         Taekwondo         3          3            2
UK         Triathlon         3          3            2
UK         Tennis            1          4            2
UK         Shooting          2          5            2
China      Table Tennis      6          5            6
China      Athletics         6          8            7
China      Taekwondo         2          3            3
China      Boxing            4          4            4
Russia     Wrestling         9          9            9
Germany    Kayaking          7          7            7
Germany    Shooting          4          6            5
Germany    Equestrian        6          7            8

Table 1: Selected results for different sports for top 5 countries
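
The agreement figures for Tables 1 and 2 can be recomputed directly from the printed values. The sketch below uses Python rather than the R used in the paper and transcribes both tables. The choice of mean absolute error for Table 1 and of per-class recall for Table 2 is an assumption made here for illustration, not a metric reported by the authors.

```python
# Table 1 rows: (country, sport, actual, conventional, c_dsl).
TABLE_1 = [
    ("USA", "Gymnastics", 12, 18, 14),
    ("UK", "Gymnastics", 7, 11, 7),
    ("UK", "Kayaking", 4, 6, 5),
    ("UK", "Athletics", 7, 8, 6),
    ("UK", "Sailing", 3, 5, 4),
    ("UK", "Boxing", 3, 4, 3),
    ("UK", "Taekwondo", 3, 3, 2),
    ("UK", "Triathlon", 3, 3, 2),
    ("UK", "Tennis", 1, 4, 2),
    ("UK", "Shooting", 2, 5, 2),
    ("China", "Table Tennis", 6, 5, 6),
    ("China", "Athletics", 6, 8, 7),
    ("China", "Taekwondo", 2, 3, 3),
    ("China", "Boxing", 4, 4, 4),
    ("Russia", "Wrestling", 9, 9, 9),
    ("Germany", "Kayaking", 7, 7, 7),
    ("Germany", "Shooting", 4, 6, 5),
    ("Germany", "Equestrian", 6, 7, 8),
]

# Table 2: rows are predicted classes, columns are actual classes
# (1 = Gold, 2 = Silver, 3 = Bronze, 4 = No medal).
TABLE_2 = [
    [13, 6, 6, 73],
    [9, 3, 10, 61],
    [5, 12, 9, 63],
    [638, 634, 678, 11468],
]


def mean_absolute_error(rows, column):
    """Mean |prediction - actual|; column 3 = conventional, 4 = C-DSL."""
    return sum(abs(row[column] - row[2]) for row in rows) / len(rows)


def accuracy(matrix):
    """Share of records on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Diagonal entry divided by the column (actual class) total."""
    size = len(matrix)
    return [
        matrix[j][j] / sum(matrix[i][j] for i in range(size))
        for j in range(size)
    ]


mae_conventional = mean_absolute_error(TABLE_1, 3)
mae_c_dsl = mean_absolute_error(TABLE_1, 4)
overall_accuracy = accuracy(TABLE_2)
recalls = recall_per_class(TABLE_2)
```

Under these assumptions, the mean absolute error over the 18 rows of Table 1 is about 1.61 medals for the conventional process and about 0.72 for C-DSL. For Table 2, the overall accuracy matches the reported 83.96%, while recall for each of the three medal classes stays below 2%, which illustrates the class imbalance discussed in the text.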

                     Reference/Actual
                  1       2       3        4
Prediction   1   13       6       6       73
             2    9       3      10       61
             3    5      12       9       63
             4  638     634     678    11468

Table 2: Confusion matrix for predictions

Figure 4: Actual and predicted number of medals

The observed adjusted R² value for the simple linear regression model is 0.5488. It can be
easily observed that for Japan, Canada, Brazil, New Zealand, and the UK the actual number of
medals and the predicted number of medals are very close, and potentially useful for decision
making. In the second round, after applying weights for predicting the number of medals per
sport for the top 5 countries, it is observed that all the models predict the number of medals
better for the USA, China, Russia, and Germany, and that is reflective of actual results. In
the case of the UK, all the models were close to the actual number of medals (90% accuracy).
In Table 1, the best results from C-DSL are presented. Results from C-DSL are much better than
the conventional regression process. Furthermore, Table 2 shows results compared to actual
events (confusion matrix). The model is able to predict 13 correct records for (1 Gold), 3
correct records for (2 Silver) and 9 correct records for (3 Bronze).

The overall accuracy of the random forests model is 83.96%, which would usually reflect high
accuracy. However, due to data imbalance (which could also be considered an outlier issue),
the results in Table 2 are potentially the result of a model that is underfitting. The claim
made in this scenario is that context can be used as a pointer to such unclear data lifecycle
dilemmas.

4.4 Experimental Results: Text Mining for Context.
After calculating the sentiments and the activities for all tweets, an aggregation of ATSS
(the coefficient) for every team is created. The coefficient reflects the ATSS for every team,
as well as the count of tweets per team (i.e. interest and hype surrounding that team). The
results from this experiment are very successful (more so than Experiment 1). When the
coefficient-by-team is sorted (as Figure 5 shows), the highest two teams are the teams that
reached the final game in that tournament. They are followed by the other two semi-finalists,
and then by the teams in the quarter finals. That result indicates how contextual awareness of
the tournament (through data from fans for instance),
can provide predictions with high statistical confidence.   niques for data imputation, bias, and outlier detection
    The predictions for this study are much more in-        have a significant influence in C-DSL. Two experiments
dicative of actual events than when compared to the         are performed, they utilize C-DSL steps slightly differ-
UEFA ranking of those teams for instance, or expec-         ently, and they have different success rates. However,
tations based on stars playing for them, or any other       both experiments are successful in providing better out-
conventional method. It is important to note however        comes than the conventional data science lifecycle. The
that these results are not tested across multiple types     method presented in this paper is deemed to be very
of tournaments, rather only for one year (2013). That       specific to certain types of data sets, and certain data
is due to the availability of the data, this work however   mining problems. The experiments presented illustrate
is certainly ongoing, and we aim to deploy the same         it as a punctual solution to a broad problem, however,
method for multiple tournaments. In 2013, Bayern Mu-        C-DSL could be generalized to many other types of data
nich won the tournament, and teams such as Barcelona        sets. For future steps, we aim to do the following: 1.
and Paris Saint Germain unexpectedly lost. C-DSL,           Develop a tool that automates the process of C-DSL, 2.
based on contextual understanding of the fans, the hype,    Experiment with more types of sports events, 3. Rede-
social media attention, and collective knowledge is able    fine C-DSL to create a more unified and generic process
to predict the winner. The work presented in both ex-       that applies to all types of datasets, 4. Identify other
periments has potential for improvements, and is still      data sets that have a variety of data types and test
ongoing; conclusions and next steps are presented in        them through C-DSL, 5. Deploy C-DSL for upcoming
the next section.                                           summer sports tournaments and compare the results to
                                                            media and expert predictions.

                                                            References


                                                             [1] F. A. Batarseh, A. J. Gonzalez, and R. Knauf, Context-
                                                                 assisted test cases reduction for cloud validation, Inter-
Figure 5: Team coefficient very indicative of actual             national and Interdisciplinary Conference on Modeling
                                                                 and Using Context, 8175 (2013), pp. 288–301.
results
                                                             [2] F. A. Batarseh, and R. Yang, Federal data science:
                                                                 Transforming government and agricultural policy using
                                                                 artificial intelligence., Elsevier Academic Press, 2017.
                                                             [3] M. Bazire, and P. Brezillon, Understanding Context
                                                                 Before Using It., The 5th International and Interdisci-
                                                                 plinary Conference on Context, 3554 (2005), pp. 29–40.
                                                             [4] P. D. Turney, The management of context-sensitive fea-
                                                                 tures: A review of strategies., The 13th International
                                                                 Conference on Machine Learning, Workshop on Learn-
                                                                 ing in Context-Sensitive Domains, 2002, pp. 60–66.
                                                             [5] G. Widmer, Recognition and exploitation of contex-
                                                                 tual clues via incremental meta-learning (Extended ver-
                                                                 sion), The 13th International Conference on Machine
                                                                 Learning, 1996, pp. 525–533.
                                                             [6] P. Domingos, Context-sensitive feature selection for
                                                                 lazy learners, Lazy learning, 1997, pp. 227–253.
                                                             [7] F. Bergadano, S. Matwin, R. S. Michalski, and J.
Figure 6: Actual results of 2012-13 UEFA Champions               Zhang, Learning two-tiered descriptions of flexible con-
League [25]                                                      cepts: The POSEIDON system, Machine Learning, 8
                                                                 (1992), pp. 5–43.
                                                             [8] P. H. Dinh, N. K. Nguyen, and A. C. Le, Combin-
5   Conclusions and Next Steps.                                  ing statistical machine learning with transformation
                                                                 rule learning for Vietnamese word sense disambigua-
In this paper, a Context-driven Data Science Lifecy-             tion, Computing and Communication Technologies,
cle (C-DSL) is introduced and tested for applications            Research, Innovation, and Vision for the Future, 2012,
of sports prediction. It can be concluded from the re-           pp. 1–6.
sults that context plays a crucial role in prediction.       [9] F. A. Batarseh, Context-driven testing on the cloud,
In addition, based on our experiments, tech-                     Context in Computing, 2014, pp. 25–44.
[10] M.-A. Williams, Risky bias in artificial intelligence,
     The Australian Academy of Technology and Engineering,
     2018, Retrieved from:
     https://www.atse.org.au/content/news/risky-bias-in-artificial-intelligence.aspx
[11] J. Zou, and L. Schiebinger, AI can be sexist and
     racist - it’s time to make it fair, 2018, Retrieved
     from: https://www.nature.com/articles/d41586-018-05707-8
[12] J. Sessa, and D. Syed, Techniques to deal with missing
     data, Electronic Devices, Systems and Applications
     (ICEDSA) 5th International Conference, 2016, pp. 1–4.
[13] H. Kang, The prevention and handling of the missing
     data, Korean journal of anesthesiology, 64 (2013),
     pp. 402–406.
[14] J. L. Schafer, and J. W. Graham, Missing data: our
     view of the state of the art, Psychological methods, 7
     (2002), pp. 147–177.
[15] J. W. Graham, Missing data analysis: Making it work
     in the real world, Annual review of psychology, 60
     (2009), pp. 549–576.
[16] C. J. Pannucci, and E. G. Wilkins, Identifying and
     avoiding bias in research, Plastic and reconstructive
     surgery, 126 (2010), pp. 619–625.
[17] N. Japkowicz, and S. Stephen, The class imbalance
     problem: A systematic study, Intelligent data analysis,
     6 (2002), pp. 429–449.
[18] D. D. Lewis, and M. Ringuette, A comparison of two
     learning algorithms for text categorization, Third an-
     nual symposium on document analysis and information
     retrieval, 33 (1994), pp. 81–93.
[19] D. D. Lewis, and J. Catlett, Heterogeneous uncertainty
     sampling for supervised learning, Machine Learning,
     1994, pp. 148–156.
[20] S. V. Buuren, and K. Groothuis-Oudshoorn, mice:
     Multivariate imputation by chained equations in R,
     Journal of statistical software, 45 (2010), pp. 1–68.
[21] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander,
     LOF: identifying density-based local outliers, ACM
     sigmod record, 29 (2000), pp. 93–104.
[22] T. Dasu, and T. Johnson, Exploratory data mining and
     data cleaning, John Wiley & Sons, 479 (2003).
[23] S. Zhang, C. Zhang, and Q. Yang, Data preparation for
     data mining, Applied artificial intelligence, 17 (2003),
     pp. 375–381.
[24] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang, Data
     cleaning: Overview and emerging challenges, In Pro-
     ceedings of the 2016 International Conference on Man-
     agement of Data, 2016, pp. 2201–2206.
[25] 2012–13 UEFA Champions League image, Retrieved from:
     https://en.wikipedia.org/wiki/2012%E2%80%9313_UEFA_Champions_League

</pre>