=Paper=
{{Paper
|id=Vol-2436/article_1
|storemode=property
|title=Context-Driven Data Mining Through Bias Removal and Incompleteness Mitigation
|pdfUrl=https://ceur-ws.org/Vol-2436/article_1.pdf
|volume=Vol-2436
|authors=Feras Batarseh,Ajay Kulkarni
|dblpUrl=https://dblp.org/rec/conf/sdm/BatarsehK19
}}
==Context-Driven Data Mining Through Bias Removal and Incompleteness Mitigation==
Feras A. Batarseh∗ (fbatarse@gmu.edu), Ajay Kulkarni∗ (akulkar8@gmu.edu)

∗ College of Science, George Mason University, 4400 University Dr., Fairfax, Virginia, USA 22030.

Abstract

The results of data mining endeavors are driven largely by data quality. Throughout these deployments, serious show-stopper problems remain unresolved, such as data collection ambiguities, data imbalance, hidden biases in data, the lack of domain information, and data incompleteness. This paper is based on the premise that context can aid in mitigating these issues. In a traditional data science lifecycle, context is not considered. The Context-driven Data Science Lifecycle (C-DSL), the main contribution of this paper, is developed to address these challenges. Two case studies (using datasets from sports events) are developed to test C-DSL. Results from both case studies are evaluated using common data mining metrics such as the coefficient of determination (R²) and confusion matrices. The work presented in this paper aims to redefine the lifecycle and introduce tangible improvements to its outcomes.

Keywords: Context, Data Mining, Missing Values, Outliers, Data Imbalance

1 Introduction and Motivation.

Historically, most research in AI has focused on improving the algorithm. In the last decade or so, however, the focus has shifted to data, and big data in particular. Ample amounts of data reshaped AI and renewed its promise and premise. As more machine learning models are deployed across multiple domains [1] [2], new challenges are arising. For instance, the relevance, data types, data quality, and completeness of the inputs to a model (independent variables) affect the significance and ‘goodness’ of the outputs (dependent variables). But how can that be optimized? In the presented method, context is defined and injected into the process to obtain insights that are more relevant and domain-specific. However, in most cases, it is highly challenging to define what context is. Context is infinite [3], and so the data that could be collected to define a complete context is also potentially infinite. For instance, in the sports studies presented, there is an infinite amount of information that could be collected and used for contextual awareness. For example, context can consist of data about the weather on the day of the competition, or the type of car that the athlete owns, or their country’s birth rate, or the type of shoes worn during the competition, or whether the athlete had eggs or cereal for breakfast that day! The point is that the amount and variety of data that could be collected to define the context of the event under study is infinite, which makes the scope of this challenge very difficult to capture.

In data collection, given that any data could (theoretically) be collected, the four Vs of big data (velocity, variety, veracity, and volume) are not representative of the real challenge within the data science lifecycle. The main (or first) challenge to be addressed is: what data should be collected for the problem at hand? In the studies presented in this manuscript, multiple categorical data columns, coefficients, and correlations are evaluated to define a context, multiple approaches are explored, and the results are evaluated statistically and by comparing them to actual results.

The major challenge found throughout the process was the quality of the data (outliers, bias, and incompleteness). As Niels Bohr famously stated: “Prediction is very difficult, especially if it’s about the future”. The challenge is exacerbated, however, when the predicted outcome is an outlier. For instance, winning a gold medal, or a medal at all, is an outlier: very few athletes win medals at the Olympics (typically one gold per event). The same applies to most sports events: there is only one winner of the Super Bowl and one winner of the World Cup, and that winner is the outlier. Conversely, for an athlete who has historically won medals, not winning a medal becomes the outlier. Therefore, locating outliers depends on the scope and on the subset of the universal dataset that is used. Adding more data to help define context is also dependent on the scope, the goals, and the information available in the dataset. Even when looking at the same problem with the same machine learning model, the slicing and dicing of data constantly affects what context consists of. Therefore, if context is that dynamic, how can it be captured in a data science lifecycle? This paper examines that notion and provides solutions to it using a Context-driven Data Science Lifecycle (C-DSL). The paper is organized as follows: the next section reviews the literature on context, data bias, and data incompleteness. Afterwards, C-DSL is introduced along with the two experimental studies, and in the final section, conclusions and future research plans are presented.

2 Related Works in Contextual Management.

As discussed above, context plays a pivotal role in decision making because it can change the meaning of concepts present in a dataset. The context within a dataset can be extracted and represented as features [4]. Features in general fall into three categories: primary features, irrelevant features, and contextual features. Primary features are the traditional ones that are pertinent to a particular domain. Irrelevant features are not helpful and can be safely removed, while contextual features are the ones to pay attention to. That categorization helps in eliminating irrelevant data but does not help in clearly defining context. Another promising method that aimed to solve this challenge is the Recognition and Exploitation of Contextual Clues via Incremental Meta-Learning [5], a two-level learning model in which a Bayesian classifier is used for context classification, and meta algorithms are used to detect contextual changes.

Another method, context-sensitive feature selection [6], described a process that outperforms traditional feature selection such as forward sequential selection and backward sequential selection. Domingos’s method uses a clustering-based approach to select locally relevant features. Additionally, Bergadano et al. [7] introduced a two-tier contextual classification adjustment method called POSEIDON. The first tier captures the basic properties of context, and the second tier captures property modifications and context dependencies. Context injections, however, have been more successful when applied to specific domains. For example, adding context to data has significantly improved the accuracy of algorithms for solving Natural Language Processing (NLP) problems. Dinh et al. [8] added context to correct wrongly tagged words. In their paper, the authors combined the output from the classifier with a set of words manually labeled with context. A transformation-based learning algorithm was used to generate new rules for the classifier. The authors claimed that this method increased the contextual accuracy of their application by 4.8%.

Another example used context for software testing. Context-Driven Testing (CDT) utilizes context to reduce the number of test cases and improve the validation and verification of software systems. The authors of the paper reported very significant improvements in the time and quality of testing results due to context [9].

The issue of deriving context from data, however, is even more challenging. For instance, Mary-Anne Williams [10] pointed out that data science algorithms that are unaware of their context could have an opacity problem. This can cause models to be racist or sexist (for example). It is often observed that Google Translate refers to women as ‘he said’ or ‘he wrote’ when translating from Spanish to English. This finding was also verified by Google Inc. Another opacity example is a word embedding algorithm that classifies European names as pleasant and African American names as unpleasant [11]. If a reductionist approach is considered, adding or removing data can surely redefine context. It is observed, however, that most real-world data science projects use incomplete data [12] [13]. Data incompleteness falls into one of the following categories: 1) Missing Completely at Random (MCAR), 2) Missing at Random (MAR), and 3) Missing not at Random (MNAR). Under MAR, missingness depends on the observed data but not on the unobserved data; under MCAR, it depends on neither; under MNAR, it depends on the unobserved values themselves [14] [15]. There are various methods to handle missing data, including listwise or pairwise deletion, multiple imputation, mean/median/mode imputation, regression imputation, as well as learning without handling missing data [12].

All the aforementioned works were challenged by the quality of the data. For example, several types of bias can occur in any phase of the data science lifecycle or while extracting context. Bias can begin during data collection, data cleaning, modeling, or any other phase. Biases that arise in the data are independent of the sample size or statistical significance, and they can directly affect the context of the results or the model. They also affect the association between variables, and in extreme cases, they can even reflect the opposite of a true association or correlation [16].

Based on a review of multiple works in data science, the most commonly observed bias is class imbalance due to covariate shifts. Class imbalance is represented by an unequal ratio of categories, which can occur due to changes in the distribution of data (covariate shifts). Class imbalance depends on four factors: 1) the degree of class imbalance, 2) the complexity of the concept represented by the data, 3) the overall size of the training set, and 4) the type of classifier [17]. Datasets with imbalance create difficulties in information retrieval,
filtering tasks, and knowledge representation [18] [19].

In this paper, context is extracted by deploying a variety of statistical methods: data imputation, creation of a generic coefficient, adding data columns (such as host country, sport, GDP, height, weight, and age), weighted modeling, and mitigation of bias. The details of the method (the main contribution of this paper) and the techniques used are presented in the next section.

3 Context-Driven Data Science Lifecycle.

C-DSL has five main steps (Figure 1). Those five steps are represented in two experiments (Olympics medal predictions and the UEFA Champions League winners and losers). In the first step, data cleaning and wrangling are performed. The literature [22], [23], [24] indicates that data cleaning helps to build more robust and reliable models. Data wrangling is considered one of the most expensive phases in the data science lifecycle. During that phase, multiple decisions are taken, including eliminating subsets of data, filtering, and aggregation. In the second step of C-DSL, context is injected. For experiment 1, that is done by adding details such as year, host city, sport, name of athlete, country of the athlete, medal type (gold, silver, and bronze), and the athlete’s demographic data.

Figure 1: C-DSL

For experiment 2, context is injected by collecting, cleaning, and generating sentiment scores from social media text (tweets). In step 3, data imputation, bias removal, and outlier detection are performed for the first experiment (explained in detail in the next section). In the fourth step of C-DSL, prediction models are built for experiment 1, while a coefficient is created for experiment 2 and used for predictions. In the final step of C-DSL, context is evaluated using confusion matrices and model quality measures such as R-squared, and the performance of the models is compared with the actual results of the sports events. C-DSL is meant for the continuous fine-tuning of data until a certain ‘contextual’ sweet spot is achieved. The proposed combination of statistical methods provides the tools used to reach that contextual understanding of the dataset and then to predict based on it.

In the Olympics experiment, outliers and bias in the data lead to results that are barely better than the conventional process. In the second experiment (Champions League), after understanding context through data imputation and inference, a coefficient proves very successful in predicting the results of a tournament with very high accuracy. In the next section, an in-depth explanation of the implementation of C-DSL for both experiments is presented.

4 Experimental Work.

This section aims to test and evaluate the method presented in this paper, and to present the detailed process followed to define it.

4.1 Experiment #1 (Olympics Predictions): Data Preparation and Statistical Deployments. In this experiment, a sports prediction application has been developed using summer Olympics data between the years 1896 and 2016. Two datasets are pulled from Kaggle.com. The first dataset has 31,165 observations, and the second dataset consists of more than 200,000 observations. The datasets can be found here: https://exchangelabsgmu-my.sharepoint.com/:f:/g/personal/akulkar8_masonlive_gmu_edu/EuY3SFjeQl5EpNfK8P4ZUi0BcWFN-pcUBRUpTvwuKgWmMg.

In the conventional data preparation step, winter data is filtered out (the aim is to predict the next summer Olympics medal counts by country and sport). Summer data is then checked for missing values. Information on some athletes was missing, such as Age, Height, and Weight. The “md.pattern()” function from the R “mice” package is used to gain insight into the patterns of missing data. Additionally, it is observed, for example, that 1,888,464 athletes didn’t win any medals; that is represented by nulls in the medals column. Nulls are then replaced by “No medal”, because some models in R fail when dealing with null values. The missing values (count: 114,900) are then imputed using the Multivariate Imputation by Chained Equation (MICE) technique [20]. After that, columns such as Sport, Gender, Age, Height, and Weight are used as context. This operation is performed with the Predictive Mean Matching (PMM) method in R using the “mice()” function. Fifty iterations of imputations were required
to create all the missing data, which took approximately 15 hours to complete.

Outlier detection is then performed using the Local Outlier Factor (LOF), a density-based outlier detection technique [21]. The main reason for choosing this method is the type of variables in the dataset. In outlier detection, it is essential to convert categorical variables into numerical variables. In addition, the numerical variables are scaled using the “scale()” function. Initially, there are 5 columns (Sport, Gender, Age, Height, and Weight) in the data, but after scaling and encoding the categorical values, fifty-three representative columns are created (as iterative combinations of these columns). The function “lofactor()” is used with “k = 5” for outlier detection. In the function, k denotes the number of nearest neighbors that represent the locality used for estimating the density.

Afterwards, model selection is performed; regression and random forests are used for this experiment. In the first part, a simple linear regression model is built in R using the “lm()” function. Further, predictions per sport per country are developed using multiple linear regression. For that purpose, six different weight scenarios are used, and the models are tweaked to place more significance on recent years. For random forests, classification is based on the type of medal (gold, silver, bronze, and no medal), Sport, Gender, Age, Height, and Weight of the athlete. To perform the classification, medals are encoded by numbers (“Gold = 1”, “Silver = 2”, “Bronze = 3”, and “No medal = 4”), and then the model is trained on the entire dataset from 1896 to 2012 (using the “randomForest” and “ranger” packages in R). The results of this experiment were not very convincing (presented in the experimental results), although much better than conventional predictions. This experiment reflected the importance of tuning the value of k, creating a coefficient, and the criticality of inference, something that is deployed in the second experiment.

4.2 Experiment #2 (Text Mining for Context): Setup and Coefficient Creation. In this experiment, social media data are collected to be the main driver for context. In sports, it is safe to assume that the fans of a sports team can reflect or influence the team’s status, and maybe even help in predicting the outcomes of that team. This study calculates sentiment scores for text relevant to the Champions League (a European club soccer championship), and uses them as the context of a team to help predict whether the team will perform well in the next stages or not. The sentiment score for each post or tweet is normalized on a -7 to +13 scale. The R “tm” package is used to scan through the tweets and assign scores based on a set of predefined words. Once all the tweets have scores, a coefficient is created: Average Team Sentiment Score (ATSS). It is defined as: (Sum of sentiment scores of all tweets at the team level) / (Count of tweets at the team level).

Figure 2: Sentiments of tweets and counts of tweets per team

The idea of the coefficient is to represent the team’s popularity and the sentiments of its fans. This study was deployed for eight teams: Barcelona, Real Madrid, Juventus, Bayern Munich, Borussia Dortmund, Galatasaray, and Paris Saint Germain. Figure 2 shows a data visualization that illustrates the results of the tweet sentiments. It shows a sample of all tweets and their sentiment values. Red is a negative sentiment, green is a positive sentiment, and blue is neutral. The main purpose of Figure 2 is to visualize the distribution of sentiments from the tweets across all the different teams. It can be observed from the heat map that most of the sentiments are neutral (blue), while the pie chart indicates that Barcelona F.C. has the highest number of tweets.

Figure 3: Sentiment score heat map by country
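The ATSS definition from Section 4.2 (sum of a team's tweet sentiment scores divided by its tweet count) can be illustrated with a minimal sketch. It is written in Python rather than the R workflow the paper describes, and it assumes the tweets have already been scored; the function names, team labels, and scores below are hypothetical and are not data from the study.

```python
from collections import defaultdict


def average_team_sentiment(scored_tweets):
    """Return {team: (ATSS, tweet_count)} from (team, sentiment_score) pairs.

    ATSS = (sum of sentiment scores of a team's tweets) / (count of its tweets).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for team, score in scored_tweets:
        totals[team] += score
        counts[team] += 1
    return {team: (totals[team] / counts[team], counts[team]) for team in totals}


def rank_teams(atss_by_team):
    """Sort team names by ATSS, highest first."""
    return sorted(atss_by_team, key=lambda team: atss_by_team[team][0], reverse=True)


# Hypothetical scored tweets; scores lie on the -7 to +13 scale used in the paper.
sample = [
    ("Team A", 5), ("Team A", 3), ("Team A", -1),
    ("Team B", 4), ("Team B", 0),
    ("Team C", 2), ("Team C", -3), ("Team C", 1), ("Team C", 0),
]

atss = average_team_sentiment(sample)
ranking = rank_teams(atss)
```

Keeping the tweet count next to the average mirrors the paper's point that the coefficient reflects both fan sentiment and the attention a team receives; how the two are weighted against each other is not specified in the text, so the sketch ranks by the average alone.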
Additionally, Figure 3 shows the sentiments when aggregated to the country level. For example, tweets from China and Russia about the tournament are negative on average, and ones from the USA and Canada are positive on average, while Europe varies. The results for both experiments 1 and 2 are presented in the following subsections.

4.3 Experimental Results: Olympics Predictions. After deploying the C-DSL steps, the predictions for the first experiment were acceptable, and certainly better than without deploying context, but not very satisfactory. The bar plot in Figure 4 shows the actual number of medals (blue bar on the left) and the number of medals predicted through C-DSL (orange bar on the right).

Figure 4: Actual and predicted number of medals

The observed adjusted R² value for the simple linear regression model is 0.5488. It can be observed that for Japan, Canada, Brazil, New Zealand, and the UK, the actual and predicted numbers of medals are very close, and potentially useful for decision making. In the second round, after applying weights for predicting the number of medals per sport for the top 5 countries, it is observed that all the models predict the number of medals better for the USA, China, Russia, and Germany, and that is reflective of actual results. In the case of the UK, all the models were close to the actual number of medals (90% accuracy). In Table 1, the best results from C-DSL are presented. Results from C-DSL are much better than those of the conventional regression process. Furthermore, Table 2 shows results compared to actual events (confusion matrix). The model is able to predict 13 correct records for (1 Gold), 3 correct records for (2 Silver), and 9 correct records for (3 Bronze).

{| class="wikitable"
|+ Table 1: Selected results for different sports for top 5 countries
! Country !! Sport !! Actual medals !! Conventional prediction !! C-DSL prediction
|-
| USA || Gymnastics || 12 || 18 || 14
|-
| UK || Gymnastics || 7 || 11 || 7
|-
| UK || Kayaking || 4 || 6 || 5
|-
| UK || Athletics || 7 || 8 || 6
|-
| UK || Sailing || 3 || 5 || 4
|-
| UK || Boxing || 3 || 4 || 3
|-
| UK || Taekwondo || 3 || 3 || 2
|-
| UK || Triathlon || 3 || 3 || 2
|-
| UK || Tennis || 1 || 4 || 2
|-
| UK || Shooting || 2 || 5 || 2
|-
| China || Table Tennis || 6 || 5 || 6
|-
| China || Athletics || 6 || 8 || 7
|-
| China || Taekwondo || 2 || 3 || 3
|-
| China || Boxing || 4 || 4 || 4
|-
| Russia || Wrestling || 9 || 9 || 9
|-
| Germany || Kayaking || 7 || 7 || 7
|-
| Germany || Shooting || 4 || 6 || 5
|-
| Germany || Equestrian || 6 || 7 || 8
|}

{| class="wikitable"
|+ Table 2: Confusion matrix for predictions (rows: predicted class; columns: reference/actual class; 1 = Gold, 2 = Silver, 3 = Bronze, 4 = No medal)
! Prediction !! Actual 1 !! Actual 2 !! Actual 3 !! Actual 4
|-
| 1 || 13 || 6 || 6 || 73
|-
| 2 || 9 || 3 || 10 || 61
|-
| 3 || 5 || 12 || 9 || 63
|-
| 4 || 638 || 634 || 678 || 11468
|}

The overall accuracy of the random forests model is 83.96%, that is, (13 + 3 + 9 + 11,468) / 13,688 records in Table 2, which would usually reflect high accuracy. However, 11,468 of the 11,493 correct predictions fall in the “No medal” class, so due to data imbalance (which could also be considered an outlier issue), the results in Table 2 are potentially the result of a model that is underfitting. The claim made in this scenario is that context can be used as a pointer to such unclear data lifecycle dilemmas.

4.4 Experimental Results: Text Mining for Context. After calculating the sentiments and the activities for all tweets, an aggregation of ATSS (the coefficient) for every team is created. The coefficient reflects the ATSS for every team, as well as the count of tweets per team (i.e. the interest and hype surrounding that team). The results from this experiment are very successful (more so than Experiment 1). When the coefficient-by-team is sorted (as Figure 5 shows), the highest two teams are the teams that reached the final game in that tournament, followed by the other two semi-finalists, and then by teams in the quarter-finals. That result indicates how contextual awareness of the tournament (through data from fans for instance),
can provide predictions with high statistical confidence.

The predictions for this study are much more indicative of actual events than, for instance, the UEFA ranking of those teams, expectations based on the stars playing for them, or any other conventional method. It is important to note, however, that these results are not tested across multiple types of tournaments, but only for one year (2013). That is due to the availability of the data. This work is ongoing, and we aim to deploy the same method for multiple tournaments. In 2013, Bayern Munich won the tournament, and teams such as Barcelona and Paris Saint Germain unexpectedly lost. C-DSL, based on a contextual understanding of the fans, the hype, social media attention, and collective knowledge, is able to predict the winner. The work presented in both experiments has potential for improvement and is still ongoing; conclusions and next steps are presented in the next section.

Figure 5: Team coefficient very indicative of actual results

Figure 6: Actual results of 2012-13 UEFA Champions League [25]

5 Conclusions and Next Steps.

In this paper, a Context-driven Data Science Lifecycle (C-DSL) is introduced and tested for applications of sports predictions. It can be concluded from the results that context plays a crucial role in prediction. In addition, based on our experiments, techniques for data imputation, bias removal, and outlier detection have a significant influence in C-DSL. Two experiments are performed; they utilize the C-DSL steps slightly differently, and they have different success rates. However, both experiments are successful in providing better outcomes than the conventional data science lifecycle. The method presented in this paper is deemed to be very specific to certain types of data sets and certain data mining problems. The experiments presented illustrate it as a targeted solution to a broad problem; however, C-DSL could be generalized to many other types of data sets. For future steps, we aim to do the following: 1. Develop a tool that automates the process of C-DSL, 2. Experiment with more types of sports events, 3. Redefine C-DSL to create a more unified and generic process that applies to all types of datasets, 4. Identify other data sets that have a variety of data types and test them through C-DSL, 5. Deploy C-DSL for upcoming summer sports tournaments and compare the results to media and expert predictions.

References

[1] F. A. Batarseh, A. J. Gonzalez, and R. Knauf, Context-assisted test cases reduction for cloud validation, International and Interdisciplinary Conference on Modeling and Using Context, 8175 (2013), pp. 288–301.
[2] F. A. Batarseh, and R. Yang, Federal data science: Transforming government and agricultural policy using artificial intelligence, Elsevier Academic Press, 2017.
[3] M. Bazire, and P. Brezillon, Understanding Context Before Using It, The 5th International and Interdisciplinary Conference on Context, 3554 (2005), pp. 29–40.
[4] P. D. Turney, The management of context-sensitive features: A review of strategies, The 13th International Conference on Machine Learning, Workshop on Learning in Context-Sensitive Domains, 2002, pp. 60–66.
[5] G. Widmer, Recognition and exploitation of contextual clues via incremental meta-learning (Extended version), The 13th International Conference on Machine Learning, 1996, pp. 525–533.
[6] P. Domingos, Context-sensitive feature selection for lazy learners, Lazy learning, 1997, pp. 227–253.
[7] F. Bergadano, S. Matwin, R. S. Michalski, and J. Zhang, Learning two-tiered descriptions of flexible concepts: The POSEIDON system, Machine Learning, 8 (1992), pp. 5–43.
[8] P. H. Dinh, N. K. Nguyen, and A. C. Le, Combining statistical machine learning with transformation rule learning for Vietnamese word sense disambiguation, Computing and Communication Technologies, Research, Innovation, and Vision for the Future, 2012, pp. 1–6.
[9] F. A. Batarseh, Context-driven testing on the cloud, Context in Computing, 2014, pp. 25–44.
[10] M.-A. Williams, Risky bias in artificial intelligence, The Australian Academy of Technology and Engineering, 2018, Retrieved from: https://www.atse.org.au/content/news/risky-bias-in-artificial-intelligence.aspx
[11] J. Zou, and L. Schiebinger, AI can be sexist and racist - it’s time to make it fair, 2018, Retrieved from: https://www.nature.com/articles/d41586-018-05707-8
[12] J. Sessa, and D. Syed, Techniques to deal with missing data, Electronic Devices, Systems and Applications (ICEDSA) 5th International Conference, 2016, pp. 1–4.
[13] H. Kang, The prevention and handling of the missing data, Korean journal of anesthesiology, 64 (2013), pp. 402–406.
[14] J. L. Schafer, and J. W. Graham, Missing data: our view of the state of the art, Psychological methods, 7 (2002), pp. 147–177.
[15] J. W. Graham, Missing data analysis: Making it work in the real world, Annual review of psychology, 60 (2009), pp. 549–576.
[16] C. J. Pannucci, and E. G. Wilkins, Identifying and avoiding bias in research, Plastic and reconstructive surgery, 126 (2010), pp. 619–625.
[17] N. Japkowicz, and S. Stephen, The class imbalance problem: A systematic study, Intelligent data analysis, 6 (2002), pp. 429–449.
[18] D. D. Lewis, and M. Ringuette, A comparison of two learning algorithms for text categorization, Third annual symposium on document analysis and information retrieval, 33 (1994), pp. 81–93.
[19] D. D. Lewis, and J. Catlett, Heterogeneous uncertainty sampling for supervised learning, Machine Learning, 1994, pp. 148–156.
[20] S. V. Buuren, and K. Groothuis-Oudshoorn, mice: Multivariate imputation by chained equations in R, Journal of statistical software, 45 (2010), pp. 1–68.
[21] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander, LOF: identifying density-based local outliers, ACM sigmod record, 29 (2000), pp. 93–104.
[22] T. Dasu, and T. Johnson, Exploratory data mining and data cleaning, John Wiley & Sons, 479 (2003).
[23] S. Zhang, C. Zhang, and Q. Yang, Data preparation for data mining, Applied artificial intelligence, 17 (2003), pp. 375–381.
[24] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang, Data cleaning: Overview and emerging challenges, In Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 2201–2206.
[25] 2012–13 UEFA Champions League image retrieved from: https://en.wikipedia.org/wiki/2012%E2%80%9313_UEFA_Champions_League