<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Science, George Mason University</institution>
          ,
          <addr-line>4400 University Drive, Fairfax, Virginia 22030</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The results of data mining endeavors are majorly driven by data quality. Throughout these deployments, serious show-stopper problems are still unresolved, such as: data collection ambiguities, data imbalance, hidden biases in data, the lack of domain information, and data incompleteness. This paper is based on the premise that context can aid in mitigating these issues. In a traditional data science lifecycle, context is not considered. The Context-driven Data Science Lifecycle (C-DSL), the main contribution of this paper, is developed to address these challenges. Two case studies (using datasets from sports events) are developed to test C-DSL. Results from both case studies are evaluated using common data mining metrics such as the coefficient of determination (R2) and confusion matrices. The work presented in this paper aims to re-define the lifecycle and introduce tangible improvements to its outcomes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction.</title>
<p>In the sports studies presented, there is an infinite amount of information that could be collected and used for contextual awareness. For example, context can consist of data about the weather on the day of the competition, or the type of car that the athlete owns, or their country's birth rate, or the type of shoes worn by them during the competition, or whether the athlete had eggs or cereal for breakfast that day! The point is, the amount and variety of data that could be collected to define the context of the event under study is infinite, which makes the scope of this challenge very difficult to capture. In data collection, given that any data could (theoretically) be collected, the four Vs of big data (velocity, variety, veracity, and volume) are not representative of the real challenge within the lifecycle of data science; the main (or first) challenge to be addressed is: what data should be collected for a given data mining model? Moreover, the slicing and dicing of data is constantly affecting what context consists of. Therefore, if context is that dynamic, then how can it be captured in a data science lifecycle? This paper examines that notion and provides solutions to it using a Context-driven Data Science Lifecycle (C-DSL). The paper is organized as follows: the next section discusses the literature review for context, data bias, and data incompleteness. Afterwards, C-DSL is introduced along with the two experimental studies, and in the final section, conclusions and future research plans are presented.</p>
      <p>2 Related Works in Contextual Management. As discussed prior, context plays a pivotal role in decision making, as it can change the meaning of concepts present in a dataset. The context within a dataset can be extracted and represented as features [<xref ref-type="bibr" rid="ref4">4</xref>]. Features in general fall into three categories: primary features, irrelevant features, and contextual features. Primary features are the traditional ones which are pertinent to a particular domain. Irrelevant features are features which are not helpful and can be safely removed, while contextual features are the ones to pay attention to. That categorization helps in eliminating irrelevant data, but it doesn't help in clearly defining context. Another promising method that aimed to solve this challenge is the Recognition and Exploitation of Contextual Clues via Incremental Meta-Learning [<xref ref-type="bibr" rid="ref5">5</xref>], a two-level learning model in which a Bayesian classifier is used for context classification, and meta-algorithms are used to detect contextual changes.</p>
      <p>Another method, context-sensitive feature selection [<xref ref-type="bibr" rid="ref6">6</xref>], described a process that outperforms traditional feature selection methods such as forward sequential selection and backward sequential selection. Domingos's method uses a clustering-based approach to select locally relevant features. Additionally, Bergadano et al. [<xref ref-type="bibr" rid="ref7">7</xref>] introduced a two-tier contextual classification adjustment method called POSEIDON. The first tier captures the basic properties of context, and the second tier captures property modifications and context dependencies. Context injections, however, have been more successful when they are applied to specific domains. For example, adding context to data has significantly improved the accuracy of algorithms for solving Natural Language Processing (NLP) problems. Dinh et al. [<xref ref-type="bibr" rid="ref8">8</xref>] added context to correct wrongly tagged words. In their paper, the authors combined the output from the classifier with a set of words manually labeled with context. A transformation-based learning algorithm was used to generate new rules for the classifier. The authors claimed that this method increased the contextual accuracy of their application by 4.8%. Another example used context for software testing: Context-Driven Testing (CDT) utilizes context to reduce the number of test cases and improve the validation and verification of software systems. The authors of that paper reported very significant improvements in the time and quality of testing results due to context [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <p>The issue of deriving context from data, however, is even more challenging. For instance, Mary-Anne Williams [<xref ref-type="bibr" rid="ref10">10</xref>] pointed out that data science algorithms that do not realize their context can have an opacity problem. This can cause models to be racist or sexist (for example). It is often observed that Google Translate refers to women as 'he said' or 'he wrote' when translating from Spanish to English; this finding was also verified by Google Inc. Another opacity example is a word embedding algorithm which classifies European names as pleasant and African American names as unpleasant [<xref ref-type="bibr" rid="ref11">11</xref>]. If a reductionist approach is considered, adding or removing data can surely redefine context; it is observed, however, that most real-world data science projects use incomplete data [<xref ref-type="bibr" rid="ref12">12</xref>] [<xref ref-type="bibr" rid="ref13">13</xref>]. Data incompleteness occurs within one of the following categorizations: 1) Missing Completely at Random (MCAR), 2) Missing at Random (MAR), and 3) Missing not at Random (MNAR). MAR depends on the observed data but not on unobserved data, while MCAR depends neither on observed data nor on unobserved data [<xref ref-type="bibr" rid="ref14">14</xref>] [<xref ref-type="bibr" rid="ref15">15</xref>]. There are various methods to handle missing data issues, including listwise or pairwise deletion, multiple imputation, mean/median/mode imputation, regression imputation, as well as learning without handling missing data [<xref ref-type="bibr" rid="ref12">12</xref>].</p>
      <p>All the aforementioned works were challenged by the quality of the data. For example, several types of bias can occur in any phase of the data science lifecycle or while extracting context. Bias can begin during data collection, data cleaning, modeling, or any other phase. Biases which arise in the data are independent of the sample size or statistical significance, and they can directly affect the context of the results or the model. They also affect the association between variables, and in extreme cases they can even reflect the opposite of a true association or correlation [<xref ref-type="bibr" rid="ref16">16</xref>]. Based on reviewing multiple works in data science, the most commonly observed bias is class imbalance due to covariate shifts. Class imbalance is represented by the unequal ratio of categories, which can occur due to changes in the distribution of data (covariate shifts). Class imbalance depends on four factors: 1) the degree of class imbalance, 2) the complexity of the concept represented by the data, 3) the overall size of the training set, and 4) the type of classifier [<xref ref-type="bibr" rid="ref17">17</xref>]. Datasets with imbalance create difficulties in information retrieval, filtering tasks, and knowledge representation [<xref ref-type="bibr" rid="ref18">18</xref>] [<xref ref-type="bibr" rid="ref19">19</xref>].</p>
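<p>The simpler missing-data strategies listed above (listwise deletion and mean imputation) can be illustrated with a short sketch. This is Python with made-up records (the paper's own tooling is R); names and values are purely illustrative:</p>

```python
import statistics

# Toy records with missing Age values (None), echoing the athlete data
# discussed later in the paper. All values are invented.
rows = [
    {"name": "A", "age": 24}, {"name": "B", "age": None},
    {"name": "C", "age": 30}, {"name": "D", "age": 27},
    {"name": "E", "age": None},
]

# Listwise deletion: drop any record with a missing value.
complete = [r for r in rows if r["age"] is not None]

# Mean imputation: replace missing values with the mean of observed ones.
mean_age = statistics.mean(r["age"] for r in complete)
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in rows]
```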
<p>In this paper, context is extracted by deploying a variety of statistical methods: data imputation, creation of a generic coefficient, adding data columns (such as: host country, sport, GDP, height, weight, and age), weighted modeling, and mitigation of bias. The details about the method (the main contribution of this paper) and the techniques used are presented in the next section.</p>
      <p>3 Context-Driven Data Science Lifecycle. C-DSL has five main steps (Figure 1). Those five steps are represented in two experiments (Olympics medal predictions, and UEFA Champions League winners and losers). In the first step, data cleaning and wrangling are performed. In the literature [<xref ref-type="bibr" rid="ref22">22</xref>], [<xref ref-type="bibr" rid="ref23">23</xref>], [<xref ref-type="bibr" rid="ref24">24</xref>] it is indicated that data cleaning helps to build robust and more reliable models. Data wrangling is considered one of the most expensive phases in the data science lifecycle. During that phase, multiple decisions are taken, including: eliminating subsets of data, filtering, and aggregation. In the second step of C-DSL, context is injected. For experiment 1, that is done by adding details like year, host city, sport, name of the athlete, country of the athlete, medal type (gold, silver, and bronze), and the athlete's demographic data. For experiment 2, context is injected by collecting, cleaning, and generating sentiment scores from social media text (tweets). In step 3, data imputation, bias removal, and outlier detection are performed for the first experiment (explained in great detail in the next section). In the fourth step of C-DSL, prediction models are built for experiment 1, while a coefficient is created for experiment 2 and used for predictions. In the final step of C-DSL, context is evaluated using confusion matrices and model quality measures such as R-squared, and the performance of the models is compared with the actual results of the sports events. C-DSL is concerned with the continuous fine-tuning of data until a certain 'contextual' sweet spot is achieved. The proposed combination of statistical methods are tools used to reach that contextual understanding of the dataset, and to then predict based on it. In the Olympics experiment, outliers and bias in the data lead to results that are barely better than the conventional process, but in the second experiment (Champions League), after understanding context through data imputation and inference, a coefficient proves very successful in predicting the results of a tournament with very high accuracy. In the next section, an in-depth explanation of the implementation of C-DSL for both experiments is presented.</p>
      <p>4 Experimental Work. This section aims to test and evaluate the method presented in this paper, and to present the detailed process followed to define it.</p>
      <p>4.1 Experiment #1 (Olympics Predictions): Data Preparation and Statistical Deployments. In this experiment, an application of sports predictions has been developed using summer Olympics data between the years 1896 and 2016. Two datasets are pulled from Kaggle.com. The first dataset has 31,165 observations, and the second dataset consists of more than 200,000 observations. The datasets can be found here: https://exchangelabsgmu-my.sharepoint.com/:f:/g/personal/akulkar8_masonlive_gmu_edu/EuY3SFjeQl5EpNfK8P4ZUi0BcWFNpcUBRUpTvwuKgWmMg.</p>
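<p>The five C-DSL steps can be sketched as a minimal pipeline skeleton. This is an illustrative Python outline with placeholder function bodies and hypothetical names, not the authors' implementation:</p>

```python
# A minimal skeleton of the five C-DSL steps: clean/wrangle, inject
# context, impute/debias, model, evaluate. All bodies are placeholders.

def clean_and_wrangle(rows):
    # Step 1: drop empty records (stand-in for filtering/aggregation).
    return [r for r in rows if r]

def inject_context(rows, context):
    # Step 2: attach contextual columns (e.g. host country, sport, GDP).
    return [dict(r, **context) for r in rows]

def impute_and_debias(rows):
    # Step 3: placeholder for imputation, bias removal, outlier detection.
    return rows

def build_model(rows):
    # Step 4: placeholder predictor; here, a constant majority-style guess.
    return lambda r: "No medal"

def evaluate(model, rows):
    # Step 5: fraction of records the model labels correctly.
    hits = sum(model(r) == r.get("medal") for r in rows)
    return hits / len(rows) if rows else 0.0

data = [{"medal": "No medal"}, {"medal": "Gold"}, {}]
prepared = impute_and_debias(
    inject_context(clean_and_wrangle(data), {"host": "Rio"}))
score = evaluate(build_model(prepared), prepared)
```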
<p>Figure 1: C-DSL</p>
      <p>In the conventional data preparation step, winter data is filtered out (the aim is to predict the next summer Olympics medal counts by country and sport). Summer data is then checked for missing values. Information on some athletes was missing, such as: Age, Height, and Weight. The function "md.pattern()" from the R "mice" package is used for getting insights into the patterns of missing data. Additionally, it is observed, for example, that 1,888,464 athletes didn't win any medals; that is represented by nulls in the medals column. Nulls are then replaced by "No medal", because some models in R choke when dealing with null values. The missing values (count: 114,900) are then imputed using the Multivariate Imputation by Chained Equations (MICE) technique [<xref ref-type="bibr" rid="ref20">20</xref>]. After that, columns such as Sport, Gender, Age, Height, and Weight are used as context. This operation is performed by the Predictive Mean Matching (PMM) method in R using the "mice()" function. Fifty iterations of imputations were required to create all the missing data - approximately 15 hours to complete the entire process.</p>
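<p>The donor-matching idea behind predictive mean matching can be shown in a stripped-down form. The following Python sketch handles a single variable with a single predictor and invented numbers; real MICE, as run by "mice()" above, chains many such models over repeated iterations:</p>

```python
# Stripped-down predictive mean matching (PMM): regress the incomplete
# variable on a predictor, then copy the observed value of the donor
# whose predicted value is closest. Data below is invented.

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def pmm_impute(heights, weights):
    # Pairs with an observed weight act as donors.
    donors = [(h, w) for h, w in zip(heights, weights) if w is not None]
    a, b = fit_line([h for h, _ in donors], [w for _, w in donors])
    out = []
    for h, w in zip(heights, weights):
        if w is None:
            pred = a + b * h
            # Donate the observed weight whose prediction is closest.
            w = min(donors, key=lambda d: abs((a + b * d[0]) - pred))[1]
        out.append(w)
    return out

heights = [160, 170, 185, 190]
weights = [60, 70, None, 90]       # one missing weight to fill in
filled = pmm_impute(heights, weights)
```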
<p>Outlier detection is then performed using the Local Outlier Factor (LOF), a density-based outlier detection technique [<xref ref-type="bibr" rid="ref21">21</xref>]. The main reason for choosing this method is the type of variables in the dataset. In outlier detection, it is essential to convert categorical variables into numerical variables. In addition to that, the numerical variables are scaled using the "scale()" function. Initially, there are 5 columns (Sport, Gender, Age, Height, and Weight) in the data, but after performing scaling and encoding of values in categories, fifty-three representative columns are created (as iterative combinations of these columns). The function "lofactor()" is used with "k = 5" for outlier detection. In the function, k denotes the number of nearest neighbors that represent the locality used for estimating the density.</p>
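<p>The LOF computation that "lofactor()" performs can be sketched compactly. The Python version below works on a single scaled numeric column with k = 2 and toy values (the experiment itself uses k = 5 over fifty-three columns); scores near 1 mark inliers, and much larger scores mark outliers:</p>

```python
# Compact Local Outlier Factor (LOF) on a 1-D list of scaled values.
# Illustrative only; distinct values are assumed to avoid zero densities.

def knn(points, i, k):
    # Indices of the k nearest neighbors of points[i].
    dists = sorted((abs(points[j] - points[i]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in dists[:k]]

def lof_scores(points, k):
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    # k-distance: distance to the k-th nearest neighbor.
    kdist = [abs(points[neighbors[i][-1]] - points[i]) for i in range(n)]

    def lrd(i):
        # Local reachability density: inverse mean reachability distance.
        reach = [max(kdist[j], abs(points[j] - points[i]))
                 for j in neighbors[i]]
        return len(reach) / sum(reach)

    dens = [lrd(i) for i in range(n)]
    # LOF: mean neighbor density divided by the point's own density.
    return [sum(dens[j] for j in neighbors[i]) / (k * dens[i])
            for i in range(n)]

pts = [1.0, 1.1, 0.9, 1.2, 8.0]    # 8.0 is the obvious outlier
scores = lof_scores(pts, k=2)
```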
<p>Afterwards, model selection was deployed; regression and random forests are used for this experiment. In the first part, a simple linear regression model is built in R using the "lm()" function. Further, predictions per sport per country are developed using multiple linear regression. For that purpose, six different weight scenarios are used, and the models are tweaked to enforce more significance on recent years. For random forests, classification is based on the type of the medal (gold, silver, bronze, and no medal), and the Sport, Gender, Age, Height, and Weight of the athlete. To perform the classification, medals are encoded by numbers ("Gold = 1", "Silver = 2", "Bronze = 3", and "No medal = 4"), and then the model is trained on the entire dataset from 1896 to 2012 (using the "randomForest" and "ranger" packages in R). The results of this experiment were not very convincing (presented in the experimental results), although much better than conventional predictions. This experiment reflected the importance of tuning the value of k, creating a coefficient, and the criticality of inference, something that is deployed in the second experiment.</p>
      <p>4.2 Experiment #2 (Text Mining for Context): Setup and Coefficient Creation. In this experiment, social media data are collected to be the main driver for context. In sports, it is safe to assume that the fans of a sports team can reflect or influence the team's status, and maybe even help in predicting the outcomes of that team. This study calculates sentiment scores for text relevant to the Champions League (a European clubs' soccer championship), and uses that as the context of a team to help predict whether the team will perform well in the next stages or not. The sentiment score for each post or tweet is normalized on a -7 to +13 scale. The R "tm" package is used to scan through the tweets and assign scores based on a set of predefined words. Once all the tweets have scores, a coefficient is created: the Average Team Sentiment Score (ATSS). It is defined as: (sum of the sentiment scores of all tweets at the team level) / (count of tweets at the team level).</p>
      <p>Figure 2: Sentiments of tweets and counts of tweets per team</p>
      <p>The idea of the coefficient is to represent the team's popularity and the sentiments of its fans. This study was deployed for eight teams: Barcelona, Real Madrid, Juventus, Bayern Munich, Borussia Dortmund, Galatasaray, and Paris Saint-Germain. Figure 2 shows a data visualization that illustrates the results of the sentiment-scored tweets. It shows a sample of all tweets and their sentiment values. Red is a negative sentiment, green is a positive sentiment, and blue is neutral. The main takeaway from Figure 2 is to visualize the distribution of sentiments from the tweets for all the different teams. It can be observed from the heat map that most of the sentiments are neutral (blue), while the pie chart indicates that Barcelona F.C. has the highest number of tweets.</p>
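<p>The word-list scoring and the ATSS definition can be illustrated as follows. This Python sketch uses invented word lists, team names, and tweets, and a plain positive-minus-negative count rather than the paper's normalized scale:</p>

```python
# Word-list sentiment scoring plus the ATSS coefficient:
# (sum of tweet scores at the team level) / (count of tweets per team).
# Word lists and tweets are made-up examples, not the paper's data.

POSITIVE = {"great", "win", "amazing"}
NEGATIVE = {"bad", "lose", "awful"}

def sentiment(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def atss(tweets_by_team):
    # Average Team Sentiment Score per team.
    return {team: sum(map(sentiment, tweets)) / len(tweets)
            for team, tweets in tweets_by_team.items()}

scores = atss({
    "Bayern Munich": ["great win", "amazing game"],
    "Barcelona": ["bad luck", "they will lose"],
})
```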
<p>Additionally, Figure 3 shows the sentiments when aggregated to the country level. For example, tweets from China and Russia about the tournament are negative on average, and ones from the USA and Canada are positive on average, while Europe varies. The results for both experiments 1 and 2 are presented in the following subsections.</p>
      <p>4.3 Experimental Results: Olympics Predictions. After deploying the C-DSL steps, the predictions for the first experiment were acceptable, certainly better than without deploying context, however not very satisfactory. The bar plot in Figure 4 shows the actual number of medals (blue bar, on the left), while the orange bar (on the right) indicates the predicted number of medals through C-DSL.</p>
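<p>The recency weighting applied to the regression models in Section 4.1 can be illustrated with a closed-form weighted least squares sketch. The weight scheme and medal counts below are invented stand-ins, not the paper's six scenarios:</p>

```python
# Weighted least squares on (year, medal count), with weights that favor
# recent years, as in the weighted modeling described in Section 4.1.
# All numbers are illustrative assumptions.

def weighted_fit(xs, ys, ws):
    # Closed-form weighted least squares for y = a + b*x.
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    b = (sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)))
    return my - b * mx, b

years = [2000, 2004, 2008, 2012]
medals = [30, 32, 40, 46]
weights = [1, 2, 4, 8]     # doubling per Games: recent years dominate
a, b = weighted_fit(years, medals, weights)
pred_2016 = a + b * 2016   # extrapolated medal count for the next Games
```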
<p>The observed adjusted R2 value for the simple linear regression model is 0.5488. It can be easily observed that for Japan, Canada, Brazil, New Zealand, and the UK, the actual number of medals and the predicted number of medals are very close, and potentially useful for decision making. In the second round, after applying weights for predicting the number of medals per sport for the top 5 countries, it is observed that all the models are predicting better numbers of medals for the USA, China, Russia, and Germany, and that is reflective of actual results. In the case of the UK, all the models were close to the actual number of medals (90% accuracy). In Table 1, the best results from C-DSL are presented. Results from C-DSL are much better than those of the conventional regression process. Furthermore, Table 2 shows results compared to actual events (confusion matrix). The model is able to predict 13 correct records for (1 Gold), 3 correct records for (2 Silver), and 9 correct records for (3 Bronze).</p>
      <p>Table 2: Confusion matrix for predictions</p>
      <p>The overall accuracy of the random forests model is 83.96%, which usually reflects high accuracy; however, due to data imbalance (which could also be considered as an outlier issue), the results in Table 2 are potentially the result of a model that is underfitting. The claim made in this scenario is that context can be used as a pointer to such unclear data lifecycle dilemmas.</p>
      <p>4.4 Experimental Results: Text Mining for Context. After calculating the sentiments and the activities for all tweets, an aggregation of the ATSS (the coefficient) for every team is created. The coefficient reflects the ATSS for every team, as well as the count of tweets per team (i.e. the interest and hype surrounding that team). The results from this experiment are very successful (more so than Experiment 1). When the coefficient-by-team is sorted (as Figure 5 shows), the highest two teams are the teams that reached the final game in that tournament, followed by the other two semi-finalists, and then followed by the teams in the quarter-finals. That result indicates how contextual awareness of the tournament (through data from fans, for instance) can provide predictions with high statistical confidence.</p>
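<p>The confusion-matrix evaluation used for the medal classifier can be reproduced in a few lines; the label vectors below are toy data following the paper's encoding (1 = Gold, 2 = Silver, 3 = Bronze, 4 = No medal). Note, as discussed for the 83.96% figure, that overall accuracy alone can look high under class imbalance:</p>

```python
from collections import Counter

# Confusion matrix and overall accuracy for a multi-class medal
# classifier. Labels follow the paper's encoding; vectors are toy data.
actual    = [1, 1, 2, 3, 4, 4, 4, 3]
predicted = [1, 2, 2, 3, 4, 4, 1, 3]

# (actual, predicted) -> count; off-diagonal entries are errors.
confusion = Counter(zip(actual, predicted))
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```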
<p>The predictions for this study are much more indicative of actual events than, for instance, the UEFA ranking of those teams, or expectations based on the stars playing for them, or any other conventional method. It is important to note, however, that these results are not tested across multiple types of tournaments, but rather only for one year (2013). That is due to the availability of the data; this work, however, is certainly ongoing, and we aim to deploy the same method for multiple tournaments. In 2013, Bayern Munich won the tournament, and teams such as Barcelona and Paris Saint-Germain unexpectedly lost. C-DSL, based on contextual understanding of the fans, the hype, social media attention, and collective knowledge, is able to predict the winner. The work presented in both experiments has potential for improvements and is still ongoing; conclusions and next steps are presented in the next section.</p>
      <p>5 Conclusions. In this paper, a Context-driven Data Science Lifecycle (C-DSL) is introduced and tested for applications of sports predictions. It can be concluded from the results that context plays a crucial role in prediction.</p>
      <p>In addition to that, based on our experiments, techniques for data imputation, bias removal, and outlier detection have a significant influence in C-DSL. Two experiments are performed; they utilize the C-DSL steps slightly differently, and they have different success rates. However, both experiments are successful in providing better outcomes than the conventional data science lifecycle. The method presented in this paper is deemed to be very specific to certain types of datasets and certain data mining problems. The experiments presented illustrate it as a punctual solution to a broad problem; however, C-DSL could be generalized to many other types of datasets. For future steps, we aim to do the following: 1. Develop a tool that automates the process of C-DSL; 2. Experiment with more types of sports events; 3. Redefine C-DSL to create a more unified and generic process that applies to all types of datasets; 4. Identify other datasets that have a variety of data types and test them through C-DSL; 5. Deploy C-DSL for upcoming summer sports tournaments and compare the results to media and expert predictions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Batarseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Knauf</surname>
          </string-name>
          ,
<article-title>Context-assisted test cases reduction for cloud validation</article-title>
          ,
          <source>International and Interdisciplinary Conference on Modeling and Using Context</source>
          ,
          <volume>8175</volume>
          (
          <year>2013</year>
          ), pp.
          <volume>288</volume>
–
          <fpage>301</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Batarseh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<article-title>Federal data science: Transforming government and agricultural policy using artificial intelligence</article-title>
          , Elsevier Academic Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bazire</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Brezillon</surname>
          </string-name>
          ,
<article-title>Understanding Context Before Using It</article-title>
          .,
          <source>The 5th International and Interdisciplinary Conference on Context</source>
          ,
          <volume>3554</volume>
          (
          <year>2005</year>
          ), pp.
          <volume>29</volume>
–
          <fpage>40</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Turney</surname>
          </string-name>
          ,
          <article-title>The management of context-sensitive features: A review of strategies</article-title>
          .,
          <source>The 13th International Conference on Machine Learning, Workshop on Learning in Context-Sensitive Domains</source>
          ,
          <year>2002</year>
          , pp.
          <volume>60</volume>
–
          <fpage>66</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Widmer</surname>
          </string-name>
          ,
<article-title>Recognition and exploitation of contextual clues via incremental meta-learning (extended version)</article-title>
          ,
          <source>The 13th International Conference on Machine Learning</source>
          ,
          <year>1996</year>
          , pp.
          <volume>525</volume>
–
          <fpage>533</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <article-title>Context-sensitive feature selection for lazy learners</article-title>
          ,
          <source>Lazy learning</source>
          ,
          <year>1997</year>
          , pp.
          <volume>227</volume>
–
          <fpage>253</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bergadano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matwin</surname>
          </string-name>
          , R. S. Michalski, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
<article-title>Learning two-tiered descriptions of flexible concepts: The POSEIDON system</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>8</volume>
          (
          <year>1992</year>
          ), pp.
          <volume>5</volume>
–
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Dinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
and
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
<article-title>Combining statistical machine learning with transformation rule learning for Vietnamese word sense disambiguation</article-title>
          ,
<source>Computing and Communication Technologies, Research, Innovation, and Vision for the Future</source>
          ,
          <year>2012</year>
          , pp.
          <volume>1</volume>
–
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Batarseh</surname>
          </string-name>
          ,
          <article-title>Context-driven testing on the cloud</article-title>
          , Context in Computing,
          <year>2014</year>
          , pp.
          <volume>25</volume>
–
          <fpage>44</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
<string-name>
            <given-names>Mary-Anne</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
<article-title>Risky bias in artificial intelligence</article-title>
          ,
          <source>The Australian Academy of Technology and Engineering</source>
          ,
          <year>2018</year>
, Retrieved from: https://www.atse.org.au/content/news/risky-bias-in-artificial-intelligence.aspx
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          , and L. Schiebinger,
          <article-title>AI can be sexist and racist - it's time to make it fair</article-title>
          .,
          <year>2018</year>
          , Retrieved from: https://www.nature.com/articles/d41586-018- 05707-8.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sessa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Syed</surname>
          </string-name>
          ,
          <article-title>Techniques to deal with missing data</article-title>
          .,
          <source>Electronic Devices, Systems and Applications (ICEDSA) 5th International Conference</source>
          ,
          <year>2016</year>
          , pp.
          <volume>1</volume>
–
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>The prevention and handling of the missing data</article-title>
          .,
          <source>Korean journal of anesthesiology</source>
          ,
          <volume>64</volume>
          (
          <year>2013</year>
          ), pp.
          <fpage>402</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Schafer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <article-title>Missing data: our view of the state of the art</article-title>
          .,
          <source>Psychological methods</source>
          ,
          <volume>7</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>147</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <article-title>Missing data analysis: Making it work in the real world</article-title>
          .,
          <source>Annual review of psychology</source>
          ,
          <volume>60</volume>
          (
          <year>2009</year>
          ), pp.
          <fpage>549</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Pannucci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Wilkins</surname>
          </string-name>
          ,
          <article-title>Identifying and avoiding bias in research</article-title>
          .,
          <source>Plastic and reconstructive surgery</source>
          ,
          <volume>126</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>619</fpage>
          -
          <lpage>625</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Stephen</surname>
          </string-name>
          ,
          <article-title>The class imbalance problem: A systematic study</article-title>
          .,
          <source>Intelligent data analysis</source>
          ,
          <volume>6</volume>
          (
          <year>2002</year>
          ), pp.
          <fpage>429</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ringuette</surname>
          </string-name>
          ,
          <article-title>A comparison of two learning algorithms for text categorization</article-title>
          .,
          <source>Third annual symposium on document analysis and information retrieval</source>
          ,
          <volume>33</volume>
          (
          <year>1994</year>
          ), pp.
          <fpage>81</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Catlett</surname>
          </string-name>
          ,
          <article-title>Heterogeneous uncertainty sampling for supervised learning</article-title>
          .,
          <source>Machine Learning</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Buuren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Groothuis-Oudshoorn</surname>
          </string-name>
          ,
          <article-title>mice: Multivariate imputation by chained equations in R.</article-title>
          ,
          <source>Journal of statistical software</source>
          ,
          <volume>45</volume>
          (
          <year>2010</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Breunig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <article-title>LOF: identifying density-based local outliers</article-title>
          .,
          <source>ACM sigmod record</source>
          ,
          <volume>29</volume>
          (
          <year>2000</year>
          ), pp.
          <fpage>93</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dasu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>Exploratory data mining and data cleaning</article-title>
          ., John Wiley &amp; Sons,
          <volume>479</volume>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Data preparation for data mining</article-title>
          .,
          <source>Applied artificial intelligence</source>
          ,
          <volume>17</volume>
          (
          <year>2003</year>
          ), pp.
          <fpage>375</fpage>
          -
          <lpage>381</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Data cleaning: Overview and emerging challenges</article-title>
          .,
          <source>In Proceedings of the 2016 International Conference on Management of Data</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2201</fpage>
          -
          <lpage>2206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <article-title>2012-13 UEFA Champions League image</article-title>
          , Retrieved from: https://en.wikipedia.org/wiki/2012%E2%80%9313_UEFA_Champions_League
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>