<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science, George Mason University</institution>
          ,
          <addr-line>4400 University Drive, Fairfax, Virginia 22030</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The results of data mining endeavors are majorly driven by data quality. Throughout these deployments, serious show-stopper problems are still unresolved, such as: data collection ambiguities, data imbalance, hidden biases in data, the lack of domain information, and data incompleteness. This paper is based on the premise that context can aid in mitigating these issues. In a traditional data science lifecycle, context is not considered. The Context-driven Data Science Lifecycle (C-DSL), the main contribution of this paper, is developed to address these challenges. Two case studies (using datasets from sports events) are developed to test C-DSL. Results from both case studies are evaluated using common data mining metrics such as the coefficient of determination (R2) and confusion matrices. The work presented in this paper aims to re-define the lifecycle and introduce tangible improvements to its outcomes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction.</title>
<p>In the sports studies presented, there is an infinite amount of information that could be collected and used for contextual awareness. For example, context can consist of data about the weather on the day of the competition, or the type of car that the athlete owns, or their country's birth rate, or the type of shoes worn by them during the competition, or whether the athlete had eggs or cereal for breakfast that day! The point is, the amount and variety of data that could be collected to define the context of the event under study is infinite, which makes the scope of this challenge very difficult to capture. In data collection, given that any data could (theoretically) be collected, the four Vs of big data (velocity, variety, veracity, and volume) are not representative of the real challenge within the lifecycle of data science; the main (or first) challenge to be addressed is: what data should be collected for a given data mining model? Moreover, the slicing and dicing of data is constantly affecting what context consists of. Therefore, if context is that dynamic, then how can it be captured in a data science lifecycle? This paper examines that notion and provides solutions to it using a Context-driven Data Science Lifecycle (C-DSL). The paper is organized as follows: the next section discusses the literature review for context, data bias, and data incompleteness. Afterwards, C-DSL is introduced along with the two experimental studies, and in the final section, conclusions and future research plans are presented.</p>
      <p>2 Related Works in Contextual Management. As discussed prior, context plays a pivotal role in decision making, as it can change the meaning of concepts present in a dataset. The context within a dataset can be extracted and represented as features [<xref ref-type="bibr" rid="ref4">4</xref>]. Features in general fall into three categories: primary features, irrelevant features, and contextual features. Primary features are the traditional ones which are pertinent to a particular domain. Irrelevant features are features which are not helpful and can be safely removed, while contextual features are the ones to pay attention to. That categorization helps in eliminating irrelevant data, but it doesn't help in clearly defining context. Another promising method that aimed to solve this challenge is the Recognition and Exploitation of Contextual Clues via Incremental Meta-Learning [<xref ref-type="bibr" rid="ref5">5</xref>], a two-level learning model in which a Bayesian classifier is used for context classification, and meta-algorithms are used to detect contextual changes.</p>
      <p>Another method, context-sensitive feature selection [<xref ref-type="bibr" rid="ref6">6</xref>], described a process that outperforms traditional feature selection methods such as forward sequential selection and backward sequential selection. Domingos's method uses a clustering-based approach to select locally relevant features. Additionally, Bergadano et al. [<xref ref-type="bibr" rid="ref7">7</xref>] introduced a two-tier contextual classification adjustment method called POSEIDON. The first tier captures the basic properties of context, and the second tier captures property modifications and context dependencies. Context injections, however, have been more successful when they are applied to specific domains. For example, adding context to data has significantly improved the accuracy of algorithms for solving Natural Language Processing (NLP) problems. Dinh et al. [<xref ref-type="bibr" rid="ref8">8</xref>] added context to correct wrongly tagged words. In their paper, the authors combined the output from the classifier with a set of words manually labeled with context. A transformation-based learning algorithm was used to generate new rules for the classifier. The authors claimed that this method increased the contextual accuracy of their application by 4.8%. Another example used context for software testing: Context-Driven Testing (CDT) utilizes context to reduce the number of test cases and improve the validation and verification of software systems. The authors of that paper reported very significant improvements in the time and quality of testing results due to context [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <p>The issue of deriving context from data, however, is even more challenging. For instance, Mary-Anne Williams [<xref ref-type="bibr" rid="ref10">10</xref>] pointed out that data science algorithms that do not realize their context can have an opacity problem. This can cause models to be racist or sexist (for example). It is often observed that Google Translate refers to women as 'he said' or 'he wrote' when translating from Spanish to English; this finding was also verified by Google Inc. Another opacity example is a word embedding algorithm which classifies European names as pleasant and African American names as unpleasant [<xref ref-type="bibr" rid="ref11">11</xref>]. If a reductionist approach is considered, adding or removing data can surely redefine context; it is observed, however, that most real-world data science projects use incomplete data [<xref ref-type="bibr" rid="ref12">12</xref>] [<xref ref-type="bibr" rid="ref13">13</xref>]. Data incompleteness occurs within one of the following categorizations: 1) Missing Completely at Random (MCAR), 2) Missing at Random (MAR), and 3) Missing not at Random (MNAR). MAR depends on the observed data but not on unobserved data, while MCAR depends neither on observed data nor on unobserved data [<xref ref-type="bibr" rid="ref14">14</xref>] [<xref ref-type="bibr" rid="ref15">15</xref>]. There are various methods to handle missing data issues, including listwise or pairwise deletion, multiple imputation, mean/median/mode imputation, regression imputation, as well as learning without handling missing data [<xref ref-type="bibr" rid="ref12">12</xref>].</p>
      <p>All the aforementioned works were challenged by the quality of the data. For example, several types of bias can occur in any phase of the data science lifecycle or while extracting context. Bias can begin during data collection, data cleaning, modeling, or any other phase. Biases which arise in the data are independent of the sample size or statistical significance, and they can directly affect the context of the results or the model. They also affect the association between variables, and in extreme cases they can even reflect the opposite of a true association or correlation [<xref ref-type="bibr" rid="ref16">16</xref>]. Based on reviewing multiple works in data science, the most commonly observed bias is class imbalance due to covariate shifts. Class imbalance is represented by the unequal ratio of categories, which can occur due to changes in the distribution of data (covariate shifts). Class imbalance depends on four factors: 1) the degree of class imbalance, 2) the complexity of the concept represented by the data, 3) the overall size of the training set, and 4) the type of classifier [<xref ref-type="bibr" rid="ref17">17</xref>]. Datasets with imbalance create difficulties in information retrieval, filtering tasks, and knowledge representation [<xref ref-type="bibr" rid="ref18">18</xref>] [<xref ref-type="bibr" rid="ref19">19</xref>].</p>
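<p>The simpler missing-data strategies listed above (listwise deletion and mean imputation) can be illustrated with a short sketch. This is Python with made-up records (the paper's own tooling is R); names and values are purely illustrative:</p>

```python
import statistics

# Toy records with missing Age values (None), echoing the athlete data
# discussed later in the paper. All values are invented.
rows = [
    {"name": "A", "age": 24}, {"name": "B", "age": None},
    {"name": "C", "age": 30}, {"name": "D", "age": 27},
    {"name": "E", "age": None},
]

# Listwise deletion: drop any record with a missing value.
complete = [r for r in rows if r["age"] is not None]

# Mean imputation: replace missing values with the mean of observed ones.
mean_age = statistics.mean(r["age"] for r in complete)
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in rows]
```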
<p>In this paper, context is extracted by deploying a variety of statistical methods: data imputation, creation of a generic coefficient, adding data columns (such as: host country, sport, GDP, height, weight, and age), weighted modeling, and mitigation of bias. The details about the method (the main contribution of this paper) and the techniques used are presented in the next section.</p>
      <p>3 Context-Driven Data Science Lifecycle. C-DSL has five main steps (Figure 1). Those five steps are represented in two experiments (Olympics medal predictions, and UEFA Champions League winners and losers). In the first step, data cleaning and wrangling are performed. In the literature [<xref ref-type="bibr" rid="ref22">22</xref>], [<xref ref-type="bibr" rid="ref23">23</xref>], [<xref ref-type="bibr" rid="ref24">24</xref>] it is indicated that data cleaning helps to build robust and more reliable models. Data wrangling is considered one of the most expensive phases in the data science lifecycle. During that phase, multiple decisions are taken, including: eliminating subsets of data, filtering, and aggregation. In the second step of C-DSL, context is injected. For experiment 1, that is done by adding details like year, host city, sport, name of the athlete, country of the athlete, medal type (gold, silver, and bronze), and the athlete's demographic data. For experiment 2, context is injected by collecting, cleaning, and generating sentiment scores from social media text (tweets). In step 3, data imputation, bias removal, and outlier detection are performed for the first experiment (explained in great detail in the next section). In the fourth step of C-DSL, prediction models are built for experiment 1, while a coefficient is created for experiment 2 and used for predictions. In the final step of C-DSL, context is evaluated using confusion matrices and model quality measures such as R-squared, and the performance of the models is compared with the actual results of the sports events. C-DSL is concerned with the continuous fine-tuning of data until a certain 'contextual' sweet spot is achieved. The proposed combination of statistical methods are tools used to reach that contextual understanding of the dataset, and to then predict based on it. In the Olympics experiment, outliers and bias in the data lead to results that are barely better than the conventional process, but in the second experiment (Champions League), after understanding context through data imputation and inference, a coefficient proves very successful in predicting the results of a tournament with very high accuracy. In the next section, an in-depth explanation of the implementation of C-DSL for both experiments is presented.</p>
      <p>4 Experimental Work. This section aims to test and evaluate the method presented in this paper, and to present the detailed process followed to define it.</p>
      <p>4.1 Experiment #1 (Olympics Predictions): Data Preparation and Statistical Deployments. In this experiment, an application of sports predictions has been developed using summer Olympics data between the years 1896 and 2016. Two datasets are pulled from Kaggle.com. The first dataset has 31,165 observations, and the second dataset consists of more than 200,000 observations. The datasets can be found here: https://exchangelabsgmu-my.sharepoint.com/:f:/g/personal/akulkar8_masonlive_gmu_edu/EuY3SFjeQl5EpNfK8P4ZUi0BcWFNpcUBRUpTvwuKgWmMg.</p>
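<p>The five C-DSL steps can be sketched as a minimal pipeline skeleton. This is an illustrative Python outline with placeholder function bodies and hypothetical names, not the authors' implementation:</p>

```python
# A minimal skeleton of the five C-DSL steps: clean/wrangle, inject
# context, impute/debias, model, evaluate. All bodies are placeholders.

def clean_and_wrangle(rows):
    # Step 1: drop empty records (stand-in for filtering/aggregation).
    return [r for r in rows if r]

def inject_context(rows, context):
    # Step 2: attach contextual columns (e.g. host country, sport, GDP).
    return [dict(r, **context) for r in rows]

def impute_and_debias(rows):
    # Step 3: placeholder for imputation, bias removal, outlier detection.
    return rows

def build_model(rows):
    # Step 4: placeholder predictor; here, a constant majority-style guess.
    return lambda r: "No medal"

def evaluate(model, rows):
    # Step 5: fraction of records the model labels correctly.
    hits = sum(model(r) == r.get("medal") for r in rows)
    return hits / len(rows) if rows else 0.0

data = [{"medal": "No medal"}, {"medal": "Gold"}, {}]
prepared = impute_and_debias(
    inject_context(clean_and_wrangle(data), {"host": "Rio"}))
score = evaluate(build_model(prepared), prepared)
```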
<p>Figure 1: C-DSL</p>
      <p>In the conventional data preparation step, winter data is filtered out (the aim is to predict the next summer Olympics medal counts by country and sport). Summer data is then checked for missing values. Information on some athletes was missing, such as: Age, Height, and Weight. The function "md.pattern()" from the R "mice" package is used for getting insights into the patterns of missing data. Additionally, it is observed, for example, that 1,888,464 athletes didn't win any medals; that is represented by nulls in the medals column. Nulls are then replaced by "No medal", because some models in R choke when dealing with null values. The missing values (count: 114,900) are then imputed using the Multivariate Imputation by Chained Equations (MICE) technique [<xref ref-type="bibr" rid="ref20">20</xref>]. After that, columns such as Sport, Gender, Age, Height, and Weight are used as context. This operation is performed by the Predictive Mean Matching (PMM) method in R using the "mice()" function. Fifty iterations of imputations were required to create all the missing data - approximately 15 hours to complete the entire process.</p>
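<p>The donor-matching idea behind predictive mean matching can be shown in a stripped-down form. The following Python sketch handles a single variable with a single predictor and invented numbers; real MICE, as run by "mice()" above, chains many such models over repeated iterations:</p>

```python
# Stripped-down predictive mean matching (PMM): regress the incomplete
# variable on a predictor, then copy the observed value of the donor
# whose predicted value is closest. Data below is invented.

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def pmm_impute(heights, weights):
    # Pairs with an observed weight act as donors.
    donors = [(h, w) for h, w in zip(heights, weights) if w is not None]
    a, b = fit_line([h for h, _ in donors], [w for _, w in donors])
    out = []
    for h, w in zip(heights, weights):
        if w is None:
            pred = a + b * h
            # Donate the observed weight whose prediction is closest.
            w = min(donors, key=lambda d: abs((a + b * d[0]) - pred))[1]
        out.append(w)
    return out

heights = [160, 170, 185, 190]
weights = [60, 70, None, 90]       # one missing weight to fill in
filled = pmm_impute(heights, weights)
```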
<p>Outlier detection is then performed using the Local Outlier Factor (LOF), a density-based outlier detection technique [<xref ref-type="bibr" rid="ref21">21</xref>]. The main reason for choosing this method is the type of variables in the dataset. In outlier detection, it is essential to convert categorical variables into numerical variables. In addition to that, the numerical variables are scaled using the "scale()" function. Initially, there are 5 columns (Sport, Gender, Age, Height, and Weight) in the data, but after performing scaling and encoding of values in categories, fifty-three representative columns are created (as iterative combinations of these columns). The function "lofactor()" is used with "k = 5" for outlier detection. In the function, k denotes the number of nearest neighbors that represent the locality used for estimating the density.</p>
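<p>The LOF computation that "lofactor()" performs can be sketched compactly. The Python version below works on a single scaled numeric column with k = 2 and toy values (the experiment itself uses k = 5 over fifty-three columns); scores near 1 mark inliers, and much larger scores mark outliers:</p>

```python
# Compact Local Outlier Factor (LOF) on a 1-D list of scaled values.
# Illustrative only; distinct values are assumed to avoid zero densities.

def knn(points, i, k):
    # Indices of the k nearest neighbors of points[i].
    dists = sorted((abs(points[j] - points[i]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in dists[:k]]

def lof_scores(points, k):
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    # k-distance: distance to the k-th nearest neighbor.
    kdist = [abs(points[neighbors[i][-1]] - points[i]) for i in range(n)]

    def lrd(i):
        # Local reachability density: inverse mean reachability distance.
        reach = [max(kdist[j], abs(points[j] - points[i]))
                 for j in neighbors[i]]
        return len(reach) / sum(reach)

    dens = [lrd(i) for i in range(n)]
    # LOF: mean neighbor density divided by the point's own density.
    return [sum(dens[j] for j in neighbors[i]) / (k * dens[i])
            for i in range(n)]

pts = [1.0, 1.1, 0.9, 1.2, 8.0]    # 8.0 is the obvious outlier
scores = lof_scores(pts, k=2)
```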
<p>Afterwards, model selection was deployed; regression and random forests are used for this experiment. In the first part, a simple linear regression model is built in R using the "lm()" function. Further, predictions per sport per country are developed using multiple linear regression. For that purpose, six different weight scenarios are used, and the models are tweaked to enforce more significance on recent years. For random forests, classification is based on the type of the medal (gold, silver, bronze, and no medal), and the Sport, Gender, Age, Height, and Weight of the athlete. To perform the classification, medals are encoded by numbers ("Gold = 1", "Silver = 2", "Bronze = 3", and "No medal = 4"), and then the model is trained on the entire dataset from 1896 to 2012 (using the "randomForest" and "ranger" packages in R). The results of this experiment were not very convincing (presented in the experimental results), although much better than conventional predictions. This experiment reflected the importance of tuning the value of k, creating a coefficient, and the criticality of inference, something that is deployed in the second experiment.</p>
      <p>4.2 Experiment #2 (Text Mining for Context): Setup and Coefficient Creation. In this experiment, social media data are collected to be the main driver for context. In sports, it is safe to assume that the fans of a sports team can reflect or influence the team's status, and maybe even help in predicting the outcomes of that team. This study calculates sentiment scores for text relevant to the Champions League (a European clubs' soccer championship), and uses that as the context of a team to help predict whether the team will perform well in the next stages or not. The sentiment score for each post or tweet is normalized on a -7 to +13 scale. The R "tm" package is used to scan through the tweets and assign scores based on a set of predefined words. Once all the tweets have scores, a coefficient is created: the Average Team Sentiment Score (ATSS). It is defined as: (sum of the sentiment scores of all tweets at the team level) / (count of tweets at the team level).</p>
      <p>Figure 2: Sentiments of tweets and counts of tweets per team</p>
      <p>The idea of the coefficient is to represent the team's popularity and the sentiments of its fans. This study was deployed for eight teams: Barcelona, Real Madrid, Juventus, Bayern Munich, Borussia Dortmund, Galatasaray, and Paris Saint-Germain. Figure 2 shows a data visualization that illustrates the results of the sentiment-scored tweets. It shows a sample of all tweets and their sentiment values. Red is a negative sentiment, green is a positive sentiment, and blue is neutral. The main takeaway from Figure 2 is to visualize the distribution of sentiments from the tweets for all the different teams. It can be observed from the heat map that most of the sentiments are neutral (blue), while the pie chart indicates that Barcelona F.C. has the highest number of tweets.</p>
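<p>The word-list scoring and the ATSS definition can be illustrated as follows. This Python sketch uses invented word lists, team names, and tweets, and a plain positive-minus-negative count rather than the paper's normalized scale:</p>

```python
# Word-list sentiment scoring plus the ATSS coefficient:
# (sum of tweet scores at the team level) / (count of tweets per team).
# Word lists and tweets are made-up examples, not the paper's data.

POSITIVE = {"great", "win", "amazing"}
NEGATIVE = {"bad", "lose", "awful"}

def sentiment(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def atss(tweets_by_team):
    # Average Team Sentiment Score per team.
    return {team: sum(map(sentiment, tweets)) / len(tweets)
            for team, tweets in tweets_by_team.items()}

scores = atss({
    "Bayern Munich": ["great win", "amazing game"],
    "Barcelona": ["bad luck", "they will lose"],
})
```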
<p>Additionally, Figure 3 shows the sentiments when aggregated to the country level. For example, tweets from China and Russia about the tournament are negative on average, and ones from the USA and Canada are positive on average, while Europe varies. The results for both experiments 1 and 2 are presented in the following subsections.</p>
      <p>4.3 Experimental Results: Olympics Predictions. After deploying the C-DSL steps, the predictions for the first experiment were acceptable, certainly better than without deploying context, however not very satisfactory. The bar plot in Figure 4 shows the actual number of medals (blue bar, on the left), while the orange bar (on the right) indicates the predicted number of medals through C-DSL.</p>
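<p>The recency weighting applied to the regression models in Section 4.1 can be illustrated with a closed-form weighted least squares sketch. The weight scheme and medal counts below are invented stand-ins, not the paper's six scenarios:</p>

```python
# Weighted least squares on (year, medal count), with weights that favor
# recent years, as in the weighted modeling described in Section 4.1.
# All numbers are illustrative assumptions.

def weighted_fit(xs, ys, ws):
    # Closed-form weighted least squares for y = a + b*x.
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    b = (sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)))
    return my - b * mx, b

years = [2000, 2004, 2008, 2012]
medals = [30, 32, 40, 46]
weights = [1, 2, 4, 8]     # doubling per Games: recent years dominate
a, b = weighted_fit(years, medals, weights)
pred_2016 = a + b * 2016   # extrapolated medal count for the next Games
```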
<p>The observed adjusted R2 value for the simple linear regression model is 0.5488. It can be easily observed that for Japan, Canada, Brazil, New Zealand, and the UK, the actual number of medals and the predicted number of medals are very close, and potentially useful for decision making. In the second round, after applying weights for predicting the number of medals per sport for the top 5 countries, it is observed that all the models are predicting better numbers of medals for the USA, China, Russia, and Germany, and that is reflective of actual results. In the case of the UK, all the models were close to the actual number of medals (90% accuracy). In Table 1, the best results from C-DSL are presented. Results from C-DSL are much better than those of the conventional regression process. Furthermore, Table 2 shows results compared to actual events (confusion matrix). The model is able to predict 13 correct records for (1 Gold), 3 correct records for (2 Silver), and 9 correct records for (3 Bronze).</p>
      <p>Table 2: Confusion matrix for predictions</p>
      <p>The overall accuracy of the random forests model is 83.96%, which usually reflects high accuracy; however, due to data imbalance (which could also be considered as an outlier issue), the results in Table 2 are potentially the result of a model that is underfitting. The claim made in this scenario is that context can be used as a pointer to such unclear data lifecycle dilemmas.</p>
      <p>4.4 Experimental Results: Text Mining for Context. After calculating the sentiments and the activities for all tweets, an aggregation of the ATSS (the coefficient) for every team is created. The coefficient reflects the ATSS for every team, as well as the count of tweets per team (i.e. the interest and hype surrounding that team). The results from this experiment are very successful (more so than Experiment 1). When the coefficient-by-team is sorted (as Figure 5 shows), the highest two teams are the teams that reached the final game in that tournament, followed by the other two semi-finalists, and then followed by the teams in the quarter-finals. That result indicates how contextual awareness of the tournament (through data from fans, for instance) can provide predictions with high statistical confidence.</p>
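<p>The confusion-matrix evaluation used for the medal classifier can be reproduced in a few lines; the label vectors below are toy data following the paper's encoding (1 = Gold, 2 = Silver, 3 = Bronze, 4 = No medal). Note, as discussed for the 83.96% figure, that overall accuracy alone can look high under class imbalance:</p>

```python
from collections import Counter

# Confusion matrix and overall accuracy for a multi-class medal
# classifier. Labels follow the paper's encoding; vectors are toy data.
actual    = [1, 1, 2, 3, 4, 4, 4, 3]
predicted = [1, 2, 2, 3, 4, 4, 1, 3]

# (actual, predicted) -> count; off-diagonal entries are errors.
confusion = Counter(zip(actual, predicted))
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```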
<p>The predictions for this study are much more indicative of actual events than, for instance, the UEFA ranking of those teams, or expectations based on the stars playing for them, or any other conventional method. It is important to note, however, that these results are not tested across multiple types of tournaments, but rather only for one year (2013). That is due to the availability of the data; this work, however, is certainly ongoing, and we aim to deploy the same method for multiple tournaments. In 2013, Bayern Munich won the tournament, and teams such as Barcelona and Paris Saint-Germain unexpectedly lost. C-DSL, based on contextual understanding of the fans, the hype, social media attention, and collective knowledge, is able to predict the winner. The work presented in both experiments has potential for improvements and is still ongoing; conclusions and next steps are presented in the next section.</p>
      <p>5 Conclusions. In this paper, a Context-driven Data Science Lifecycle (C-DSL) is introduced and tested for applications of sports predictions. It can be concluded from the results that context plays a crucial role in prediction.</p>
      <p>In addition to that, based on our experiments, techniques for data imputation, bias removal, and outlier detection have a significant influence in C-DSL. Two experiments are performed; they utilize the C-DSL steps slightly differently, and they have different success rates. However, both experiments are successful in providing better outcomes than the conventional data science lifecycle. The method presented in this paper is deemed to be very specific to certain types of datasets and certain data mining problems. The experiments presented illustrate it as a punctual solution to a broad problem; however, C-DSL could be generalized to many other types of datasets. For future steps, we aim to do the following: 1. Develop a tool that automates the process of C-DSL; 2. Experiment with more types of sports events; 3. Redefine C-DSL to create a more unified and generic process that applies to all types of datasets; 4. Identify other datasets that have a variety of data types and test them through C-DSL; 5. Deploy C-DSL for upcoming summer sports tournaments and compare the results to media and expert predictions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Batarseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Knauf</surname>
          </string-name>
          ,
<article-title>Context-assisted test cases reduction for cloud validation</article-title>
          ,
          <source>International and Interdisciplinary Conference on Modeling and Using Context</source>
          ,
          <volume>8175</volume>
          (
          <year>2013</year>
          ), pp.
          <volume>288</volume>
–
          <fpage>301</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Batarseh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<article-title>Federal data science: Transforming government and agricultural policy using artificial intelligence</article-title>
          , Elsevier Academic Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bazire</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Brezillon</surname>
          </string-name>
          ,
<article-title>Understanding Context Before Using It</article-title>
          .,
          <source>The 5th International and Interdisciplinary Conference on Context</source>
          ,
          <volume>3554</volume>
          (
          <year>2005</year>
          ), pp.
          <volume>29</volume>
–
          <fpage>40</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Turney</surname>
          </string-name>
          ,
          <article-title>The management of context-sensitive features: A review of strategies</article-title>
          .,
          <source>The 13th International Conference on Machine Learning, Workshop on Learning in Context-Sensitive Domains</source>
          ,
          <year>2002</year>
          , pp.
          <volume>60</volume>
–
          <fpage>66</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Widmer</surname>
          </string-name>
          ,
<article-title>Recognition and exploitation of contextual clues via incremental meta-learning (extended version)</article-title>
          ,
          <source>The 13th International Conference on Machine Learning</source>
          ,
          <year>1996</year>
          , pp.
          <volume>525</volume>
–
          <fpage>533</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <article-title>Context-sensitive feature selection for lazy learners</article-title>
          ,
          <source>Lazy learning</source>
          ,
          <year>1997</year>
          , pp.
          <volume>227</volume>
–
          <fpage>253</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bergadano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matwin</surname>
          </string-name>
          , R. S. Michalski, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
<article-title>Learning two-tiered descriptions of flexible concepts: The POSEIDON system</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>8</volume>
          (
          <year>1992</year>
          ), pp.
          <volume>5</volume>
–
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Dinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
and
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
<article-title>Combining statistical machine learning with transformation rule learning for Vietnamese word sense disambiguation</article-title>
          ,
<source>Computing and Communication Technologies, Research, Innovation, and Vision for the Future</source>
          ,
          <year>2012</year>
          , pp.
          <volume>1</volume>
–
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Batarseh</surname>
          </string-name>
          ,
          <article-title>Context-driven testing on the cloud</article-title>
          , Context in Computing,
          <year>2014</year>
          , pp.
          <volume>25</volume>
–
          <fpage>44</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
<string-name>
            <given-names>Mary-Anne</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
<article-title>Risky bias in artificial intelligence</article-title>
          ,
          <source>The Australian Academy of Technology and Engineering</source>
          ,
          <year>2018</year>
, Retrieved from: https://www.atse.org.au/content/news/risky-bias-in-artificial-intelligence.aspx
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          , and L. Schiebinger,
          <article-title>AI can be sexist and racist - it's time to make it fair</article-title>
          .,
          <year>2018</year>
          , Retrieved from: https://www.nature.com/articles/d41586-018- 05707-8.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sessa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Syed</surname>
          </string-name>
          ,
          <article-title>Techniques to deal with missing data</article-title>
          .,
          <source>Electronic Devices, Systems and Applications (ICEDSA) 5th International Conference</source>
          ,
          <year>2016</year>
          , pp.
          <volume>1</volume>
–
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>The prevention and handling of the missing data</article-title>
          .,
          <source>Korean journal of anesthesiology</source>
          ,
          <volume>64</volume>
          (
          <year>2013</year>
          ), pp.
          <fpage>402</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Schafer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <article-title>Missing data: our view of the state of the art</article-title>
          .,
          <source>Psychological methods</source>
          ,
          <volume>7</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>147</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <article-title>Missing data analysis: Making it work in the real world</article-title>
          .,
          <source>Annual review of psychology</source>
          ,
          <volume>60</volume>
          (
          <year>2009</year>
          ), pp.
          <fpage>549</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Pannucci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilkins</surname>
          </string-name>
          ,
          <article-title>Identifying and avoiding bias in research</article-title>
          .,
          <source>Plastic and reconstructive surgery</source>
          ,
          <volume>126</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>619</fpage>
          -
          <lpage>625</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Stephen</surname>
          </string-name>
          ,
          <article-title>The class imbalance problem: A systematic study</article-title>
          .,
          <source>Intelligent data analysis</source>
          ,
          <volume>6</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>429</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ringuette</surname>
          </string-name>
          ,
          <article-title>A comparison of two learning algorithms for text categorization</article-title>
          .,
          <source>Third annual symposium on document analysis and information retrieval</source>
          ,
          <volume>33</volume>
          (
          <year>1994</year>
          ), pp.
          <fpage>81</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Catlett</surname>
          </string-name>
          ,
          <article-title>Heterogeneous uncertainty sampling for supervised learning</article-title>
          .,
          <source>Machine Learning</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Buuren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Groothuis-Oudshoorn</surname>
          </string-name>
          ,
          <article-title>mice: Multivariate imputation by chained equations in R.</article-title>
          ,
          <source>Journal of statistical software</source>
          ,
          <volume>45</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Breunig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <article-title>LOF: identifying density-based local outliers</article-title>
          .,
          <source>ACM sigmod record</source>
          ,
          <volume>29</volume>
          (
          <year>2000</year>
          ), pp.
          <fpage>93</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dasu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>Exploratory data mining and data cleaning</article-title>
          ., John Wiley &amp; Sons,
          <volume>479</volume>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Data preparation for data mining</article-title>
          .,
          <source>Applied artificial intelligence</source>
          ,
          <volume>17</volume>
          (
          <year>2003</year>
          ), pp.
          <fpage>375</fpage>
          -
          <lpage>381</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Data cleaning: Overview and emerging challenges</article-title>
          .,
          <source>In Proceedings of the 2016 International Conference on Management of Data</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2201</fpage>
          -
          <lpage>2206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <article-title>2012-13 UEFA Champions League image</article-title>
          , Retrieved from: https://en.wikipedia.org/wiki/2012%E2%80%9313_UEFA_Champions_League
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>