An Approach to Automatically Detect and Visualize Bias in Data Analytics

Ana Lavalle, Alejandro Maté, Juan Trujillo
Lucentia Research (DLSI), University of Alicante, San Vicente del Raspeig, Spain
Lucentia Lab, Alicante, Spain
alavalle@dlsi.ua.es, amate@dlsi.ua.es, jtrujillo@dlsi.ua.es

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Data Analytics and Artificial Intelligence (AI) are increasingly driving key business decisions and business processes. Any flaw in the interpretation of analytic results or AI outputs can lead to significant economic losses and reputation damage. Among existing flaws, one of the most often overlooked is the use of biased data and imbalanced datasets. When it goes unnoticed, data bias warps the meaning of data and has a devastating effect on AI results. Existing approaches deal with data bias by constraining the data model, altering its composition until the data is no longer biased. Unfortunately, studies have shown that crucial information about the nature of the data may be lost during this process. Therefore, in this paper we propose an alternative process, one that detects data biases and presents biased data in a visual way so that users can comprehend how the data is structured and decide whether or not constraining approaches are applicable in their context. Our approach detects the existence of biases in datasets through our proposed algorithm and generates a series of visualizations in a way that is understandable for users, including non-expert ones. In this way, users become aware not only of the existence of biases in the data, but also of how they may impact their analytics and AI algorithms, thus avoiding undesired results.

1 INTRODUCTION
Nowadays, Data Analytics has become a key component of many business processes. Whether driving business decisions or offering new services through Artificial Intelligence (AI) algorithms, data serves as the main resource for improving business performance. Therefore, any flaw within the data or its use will be translated into significant performance and economic losses.

One such flaw is data bias and the use of imbalanced datasets. When it goes unnoticed, data bias can significantly affect the interpretation of data and has a devastating impact on AI results, as recently reported by the Gartner Group [6]. One area where biases lead to life-threatening consequences is healthcare, where identifying as healthy a patient who is incubating a severe illness may delay their treatment [2].

As such, data bias has become an important concern in the community, with big companies like Amazon, Facebook, Microsoft, Google, etc. investing resources and effort to tackle the problem. Amazon Web Services [23] has published information about fairness in their machine-learning services in terms of accuracy, false positive and false negative rates. Facebook [19] has shown one of its internal anti-bias software tools, "Fairness Flow", which measures how a model interacts with specific groups.

Unfortunately, most approaches developed until now are mainly focused on machine learning and on rebalancing the biased datasets. As [7] argues, the fairness of predictions should be evaluated in the context of the data, and unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection rather than by constraining the model. As such, a general approach that automatically warns the user of the existence of biases and lets her analyze the data from different perspectives without altering the dataset is missing.

Therefore, in this paper we focus our work on detecting and presenting in a humanly understandable way the existence of data bias and imbalanced datasets, with a special focus on enabling the analysis through data analytics without altering the dataset. Our approach complements our previous work [15] [14], where we presented an iterative Goal-Based modeling approach based on the i* language for the automatic derivation of data visualizations and aligned it with the Model Driven Architecture (MDA) in order to facilitate the creation of the right visual analytics for non-expert users. Now, we include a Biases Detection Process that automatically detects the existence of biases in the datasets and enables users to measure them and select those which are relevant to them. Our process includes a novel algorithm that takes into account the scope of the analysis, detects biases, and presents them in a way that is understandable for users, including non-expert ones. In this way, users become aware not only of the existence of biases in their datasets, but also of how they may impact their analytics and AI algorithms, thus avoiding unwanted results.

The rest of the paper is structured as follows. Section 2 presents a classification of types of biases. Section 3 summarizes the related work in this area. Section 4 describes our proposed process. Section 5 presents our Biases Detection Approach. Section 6 describes the results of the experiments applying our approach. Finally, Section 7 summarizes the conclusions and our future work.

2 BIASES IN DATA

In order to illustrate the negative impact of data bias, in this section we provide a classification of types of biases. There are different types of biases in datasets, the most common being Class Imbalance and Dataset Shift.

Class Imbalance is the case where classes are not equally represented in the data; this means that one or more categories in the dataset have a higher representation than the rest of the categories. It is usual to find this kind of bias in real-world datasets [12]. This bias causes several problems, especially when people are trying to analyze the data and/or applying AI algorithms.

Dataset Shift refers to the case where the distribution of the data within the training dataset does not match the distribution in the test and real datasets. In real-world datasets, the train and test datasets have often not been generated by the same distribution. Artificial Intelligence algorithms trained on biased training sets tend not to generalize well on test data drawn from the true underlying distribution of the population, which has a negative effect on the quality of a machine learning model. As argued in [18], there are three potential types of dataset shift:

Covariate Shift: It happens when the input attributes have different distributions between the training and test datasets.

Prior Probability Shift: In this case, the class distribution is different between the training and test datasets.

Concept Shift: It happens when the relationship between the input and class variables changes. It usually occurs when training data is collected at a different point in time than testing data.

Biased datasets are very common and they can cause severe problems if biases are not taken into account and treated properly depending on the type of bias we are facing, the context, and the objective the dataset is being used for. Therefore, it is paramount to show users how biased their data are, in order to enable them to take into account those biases which are determinant to them. Otherwise, their decisions will likely have unexpected and negative consequences.
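To make the Class Imbalance case concrete, the following minimal sketch (with purely hypothetical label counts, not taken from the paper) computes the per-category representation and the majority-to-minority ratio that signals an imbalanced dataset:

    # Hypothetical label counts for a medical dataset: most patients are healthy
    counts = {"healthy": 9900, "ill": 100}

    total = sum(counts.values())
    for label, n in counts.items():
        print(f"{label}: {n / total:.1%} of the dataset")

    # Majority-to-minority class ratio (99:1 here), the figure usually quoted
    # when discussing how severe a class imbalance is
    print(f"ratio {max(counts.values()) // min(counts.values())}:1")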
3 RELATED WORK

The class imbalance problem has been encountered in multiple areas, some of them with a serious impact, such as the interpretation of medical data [5]. This problem has also been considered one of the top 10 problems in data mining and pattern recognition [24]. The issue with imbalance in class distribution becomes more pronounced with the application of AI algorithms. Mining and learning classifiers from imbalanced datasets is indeed a very important problem from both the algorithmic and the performance perspective [13]. Not choosing the right distribution can introduce bias towards the most represented class. Since most AI algorithms expect a balanced class distribution [11], an algorithm trained with imbalanced datasets will tend to inadvertently return results of the most populated classes.

Different authors have proposed several techniques to handle these problems. Generally, the approaches to deal with imbalanced data issues fall into three categories [16]:

Data perspective: uses techniques to artificially re-balance the class distribution by sampling the data space to diminish the effect caused by class imbalance. As [10] argues, one intuitive method is undersampling the majority classes by dropping training examples. This approach leads to smaller datasets, but important examples could be dropped during the process. Another method is oversampling the minority classes.

Algorithmic perspective: these solutions try to adapt or modify cost adjustment within the learning algorithm to make it perform better on imbalanced datasets during the training process. For example, [17] proposes an algorithm that is able to deal with the uncertainty introduced in large volumes of data without disregarding the learning of the underrepresented class.

Ensemble approach: this type of solution uses aspects from both perspectives to determine the final prediction. [9] proposes an integrated method for learning from large imbalanced datasets. Their approach examines a combination of metrics across different learning algorithms and balancing techniques. The most accurate method is then selected to be applied on real large, imbalanced, and heterogeneous datasets.

In the case of Dataset Shift (when the training data and test data are distributed differently), a common approach is to reweight the data such that the reweighted distribution matches the target distribution [20]. In [22], the authors analyze the relationship between the class distribution of training data to determine the best class distribution for learning. [10] has recently proposed decision tree learning for finding a model that is able to distinguish between training and test distributions.

On the other hand, some works have focused on the impact of data flaws on the visual features of visualizations. Correll et al. [8] show how it is possible to create visualizations that seem "plausible" (design parameters are within normal bounds and pass the visual sanity check) but hide crucial data flaws. Biases can be considered data flaws if the context determines so, and it is possible to detect biases in datasets when the classification categories are not approximately equally represented.

As we have shown, most approaches developed until now are mainly focused on machine learning and on rebalancing the biased datasets. However, our goal is not to balance the biased datasets. As [7] argues, the fairness of predictions should be evaluated in the context of the data, and unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection rather than by constraining the model. For this reason, we propose an approach that automatically warns the user of the existence of biases and lets her analyze the data from different perspectives without altering the dataset, since one of the core benefits of visualizations is enabling people to discover visual patterns that might otherwise be hidden [8].
4 PROPOSED PROCESS

In this section, we describe our proposed process. Fig. 1 summarizes the process followed in our proposal, representing in a red cloud the new elements introduced in this paper. The rest of the elements were introduced in our previous work [15] [14].

In our process, firstly, a sequence of questions guides users in creating a User Requirements Model [15] that captures their needs and analysis context. Then, this model is complemented by the Data Profiling Model [15], which analyzes the features of the data sources selected to be visualized. The user requirements, together with the data profiling information, are translated into a Visualization Specification that enables users to derive the best visualization types [15] in each context automatically. This transformation generates a Data Visualization Model [14].

The Data Visualization Model enables users to specify visualization details regardless of their implementation technology. This model also enables users to determine whether the proposed visualization is adequate to satisfy the essential requirements for which it was created. If the proposed visualization does not pass the user validation, it points out the existence of missing or wrongly defined requirements. In this case, a new cycle is started by reviewing the existing model to identify which aspects were not taken into account, generating in turn an updated model. Otherwise, a successful validation starts the Biases Detection Process. Once users have validated the visualization, the attributes of the collections that have been selected in the process to be represented in the visualization are analyzed. Our novel algorithm examines the data to automatically detect biases and presents this information to the users. Users may define thresholds to adapt the Biases Detection Process to their specific needs. The definition of thresholds is performed in an easy way, adapted for non-expert users, by defining two variables through the interface. This new functionality makes users aware of biases that could significantly alter the interpretation of their data, as well as of the techniques to be used for the analysis.
[Figure 1: Overall view of the proposed process]

As a result of the process, users obtain a visual representation of the bias and are offered the option to include information in their analytics about each of the attributes detected as biased by the algorithm. If they decide to add information about a biased attribute, they can integrate this information within the visualization that they had created for the initial analysis or, alternatively, in a new visualization that is dynamically connected with the visualization of the process, so that when one of the visualizations is interacted with, the other one is updated.

If users decide to add new information about some biased attribute, a new visualization specification is generated. Therefore, in the Data Visualization Model, users are able to customize the visualization(s) and select how to represent the bias information. Once users validate the new visualizations and do not wish to add further information, the corresponding implementation is generated.

Finally, when the visualization has been implemented and users are working with it, it is possible to program a Periodic Monitoring. The aim of this continuous monitoring is to ensure that, as new data populates the data sources, no new biases are introduced inadvertently. The Periodic Monitoring event triggers an execution of our Biases Detection Algorithm with the aim of automatically detecting whether the data has exceeded the defined thresholds. If a new threshold has been exceeded, an alert is shown to users. This enables them to return to the Biases Detection Process and choose whether they want to edit or add information about this new bias in the visualizations.
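As an illustration only, the following minimal Python sketch shows one way such a Periodic Monitoring loop could be wired up; the detect callable, the alerting via print, and the daily interval are our own assumptions for the example, not part of the proposal itself:

    import time

    def monitor(detect, tables_vis, interval_seconds=24 * 60 * 60):
        """Periodically re-run a bias detection function and alert on new biased attributes.

        `detect` is any callable returning a list of (attribute, bias_ratio) pairs,
        e.g. an implementation of the Biases Detection Algorithm of Section 5.
        """
        known = set()
        while True:
            biased = detect(tables_vis)
            new = {attribute for attribute, ratio in biased} - known
            if new:
                # In the proposed process this would raise an alert in the tool;
                # here we simply print it.
                print("ALERT: new biased attributes detected:", sorted(new))
                known |= new
            time.sleep(interval_seconds)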
By following this process, we facilitate data analysis and bias awareness for non-expert users in data visualization. Furthermore, all users may benefit from the reduction in time involved in using this approach: overlooking existing biases will eventually lead to problems, requiring users to manually identify the biases that originated them and to rebuild all the visualizations or re-train their AI algorithms. Therefore, the process enables users to retain control of how data biases affect their data and makes them aware of the impact on their analytics and AI algorithms.

5 BIASES DETECTION

Our proposal starts from the result of our process for the automatic derivation of visualizations, shown in Fig. 1. In this sense, we assume the user has defined her requirements and the information that she wants to analyze, and that the visualization that best suits her needs has been automatically derived. Once the user has validated the visualization, it is possible that certain elements are changing the interpretation of the data and the user is unaware of them. Therefore, at this point we introduce our novel Biases Detection Process to detect biases in the data, based on the algorithm proposed in this paper, which facilitates this task. It is important to note that, although we assume that the user has followed our previous approach, the proposed process can be applied to visualizations obtained through other tools, as long as the necessary information is provided as input to the algorithm.

The first step in our proposed Biases Detection Process is to automatically analyze, through Algorithm 1, the attributes of the collections used for the visualization defined in the process. This algorithm enables us to automatically detect biases in the data by analyzing the datasets, giving us information as to how biased the data are. Users can alter the limits for bias detection in order to tailor the algorithm to their particular case. It is also important to note that, although we exemplify the implementation of our algorithm assuming an existing relational database, our proposal can be applied to any context where structured or semi-structured data is being analyzed.

Algorithm 1 starts with the input of the data tables (tables_vis) that are used for the visualization. These tables come automatically to the algorithm from the previous step of our process. On the other side, the variables thdCategorical and thdBiases define the thresholds that delimit the biases and attributes. These thresholds do not need to be defined by the user, as they are already assigned default values according to our experience analyzing datasets. To define the default thresholds, we have analyzed different studies. Academic research [11] suggests that there is a situation of class imbalance when the majority-to-minority class ratio is within the range of 100:1 to 10000:1. However, from the viewpoint of effective problem solving, lower class imbalances that make modeling and prediction of the minority class a complex and challenging task (i.e., in the range of 50:1 and lower) are considered high class imbalance by domain experts [21].

In our case, the variable thdCategorical is a number that represents the maximum percentage of the total elements of the table for an attribute to be considered categorical. An attribute is categorical when it can only take a limited number of possible values. The default threshold for this variable has been defined heuristically, setting its value to 5% (0.05). This threshold enables us to discover categorical attributes within the data even when a schema is not available, such as with NoSQL databases or file-based systems.

Moreover, the variable thdBiases is a number between 0 and 10 that establishes the admissible bias ratio of the attributes (0 being equally distributed and 10 very biased). The bias ratio represents the relationship between the values that appear the least and the most in an attribute. Therefore, by adjusting this variable, users may limit when an attribute is considered biased, i.e., when the difference between the most and least common value is decisive for them. We propose 8 as the default value. Therefore, if the most common value has 8 times or more the representation of the least common value, the attribute will be considered as highly biased.
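As a quick numeric illustration of how the two default thresholds behave (the table size and value counts below are hypothetical, not taken from the paper):

    # A table with 1,000 rows and an attribute whose value counts are:
    counts = {"A": 700, "B": 250, "C": 50}
    RN, RND = 1000, len(counts)
    thdCategorical, thdBiases = 0.05, 8

    # Categorical check: 3 distinct values < 1000 * 0.05 = 50, so the attribute
    # is treated as categorical and its bias ratio is computed.
    is_categorical = RND < RN * thdCategorical

    # Bias ratio (see formula (1) below): ((700 - 50) / 700) * 10 = 9.29
    bias = (max(counts.values()) - min(counts.values())) / max(counts.values()) * 10

    # 9.29 > 8, so the attribute would be reported in biasedAtt.
    print(is_categorical, round(bias, 2))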
Finally, the output of this algorithm is biasedAtt, a list with the information about each attribute and its bias ratio.

Algorithm 1: Biases Detection Algorithm

    /* tables_vis comes automatically from the process; thdCategorical and
       thdBiases are defined by default, but users may personalize them */
    Input:  tables_vis[] = list of tables used in the visualization;
            thdCategorical = 0.05: maximum percentage of the total elements of
            the table for an attribute to be considered categorical;
            thdBiases = 8: number between 0 and 10 that establishes the
            admissible bias ratio of the attributes (0 equally distributed,
            10 very biased)
    Output: biasedAtt = list of attributes and their bias

     1  foreach table in tables_vis do
     2      Statement stmt = con.createStatement();
            /* Query 1 */
     3      String rowsQuery = "SELECT COUNT(*) FROM " + table;
            /* Query 2 */
     4      String attributesQuery = "SELECT COLUMN_NAME FROM
                INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = " + table;
            /* number of rows from the table */
     5      ResultSet rsRN = stmt.executeQuery(rowsQuery);
     6      int RN = rsRN.getInt(1);
            /* list of attributes from the table */
     7      ResultSet attributes = stmt.executeQuery(attributesQuery);
            /* for each attribute from the table */
     8      foreach attribute in attributes do
                /* Query 3 */
     9          String groupAttrQuery = "SELECT COUNT(" + attribute + ") FROM "
                    + table + " GROUP BY " + attribute;
                /* number of times that each different value of the attribute appears */
    10          ResultSet rsGroupAttr = stmt.executeQuery(groupAttrQuery);
                /* number of distinct values of the attribute */
    11          rsGroupAttr.last();
    12          int RND = rsGroupAttr.getRow();
                /* if it is a categorical attribute */
    13          if RND < RN * thdCategorical then
                    /* extract the values that are repeated the most and the least times */
    14              int max = max(rsGroupAttr);
    15              int min = min(rsGroupAttr);
                    /* calculate and normalize the bias of the attribute */
    16              float biasAttribute = ((max - min) / max) * 10;
                    /* if the bias is bigger than the threshold defined by the user */
    17              if biasAttribute > thdBiases then
    18                  biasedAtt.append(attribute, biasAttribute);
    19              end
    20          end
    21      end
    22  end
    23  return (biasedAtt);

The algorithm is executed for each table used for the visualization (line 1). For each table, it stores the number of rows of the table in the variable RN (lines 5-6). Then, the attributes of the table are included in the variable attributes (line 7). For each attribute in the list (line 8), a ResultSet rsGroupAttr with the number of repetitions of each different value is stored (line 10). In lines 11-12, the number of distinct values of this attribute is calculated and stored in RND. Afterwards, the algorithm evaluates whether this attribute is categorical or not (line 13). An attribute is considered categorical when the number of distinct values of the attribute (RND) is lower than the number of rows of the table (RN) multiplied by the categorical threshold thdCategorical defined earlier (5%). If this condition is fulfilled, the values that have the highest (max, line 14) and lowest (min, line 15) representation are extracted from the ResultSet rsGroupAttr, which contains the number of times that each different value of the attribute appears. Then, the bias of each attribute is calculated and normalized in biasAttribute (line 16) using the following formula:

    biasAttribute = ((max - min) / max) * 10        (1)

We have used Min-Max normalization because it guarantees that all attributes have the exact same scale and it highlights outliers. This is a desirable characteristic in our case, since detecting the existence of these outlier biases and warning the user is one of our main goals. With this normalization, we obtain for each attribute a ratio in biasAttribute that indicates, in the 0 to 10 range, how biased the attribute is, 0 being equally distributed and 10 very biased.

If biasAttribute is bigger than the threshold thdBiases (line 17), it means that the attribute has a considerable bias that should be analyzed. Then, the name of the attribute and its bias ratio, previously calculated in biasAttribute, are stored in biasedAtt (line 18). Therefore, when the algorithm concludes, the variable biasedAtt contains a list of attributes with their bias ratios.
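For readers who prefer an executable version, the following is a minimal Python sketch of Algorithm 1 written against SQLite (standard sqlite3 module). It is our own illustrative translation, not the authors' implementation: the function name, the connection handling, and the use of PRAGMA table_info in place of the INFORMATION_SCHEMA query of line 4 are assumptions tied to this particular engine.

    import sqlite3

    def detect_biases(con, tables_vis, thd_categorical=0.05, thd_biases=8):
        """Sketch of Algorithm 1: returns a list of (attribute, bias_ratio) pairs."""
        biased_att = []
        cur = con.cursor()
        for table in tables_vis:
            # Query 1: number of rows of the table (RN)
            rn = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            # Query 2: list of attributes (SQLite equivalent of INFORMATION_SCHEMA)
            attributes = [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]
            for attribute in attributes:
                # Query 3: number of times each distinct value appears
                counts = [row[0] for row in cur.execute(
                    f"SELECT COUNT({attribute}) FROM {table} GROUP BY {attribute}")]
                rnd = len(counts)                            # distinct values (RND)
                if counts and rnd < rn * thd_categorical:    # categorical attribute?
                    mx, mn = max(counts), min(counts)
                    bias_attribute = (mx - mn) / mx * 10     # formula (1)
                    if bias_attribute > thd_biases:
                        biased_att.append((attribute, bias_attribute))
        return biased_att

    # Example call (hypothetical database file and table name):
    # con = sqlite3.connect("analytics.db")
    # print(detect_biases(con, ["service_calls"]))

As in the original listing, table and attribute names are interpolated directly into the SQL; production code should validate them before execution.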
6 PERFORMANCE ANALYSIS

In order to implement the experiment, we downloaded the Fire Department Calls for Service dataset from [1], obtaining a 1.75 GB file. We chose Apache Spark [3] to process this file because of its speed, ease of use, and advanced in-memory analytical capabilities. Specifically, we used Apache Zeppelin [4] 0.8 as the development environment, with its default configuration.

We ran the experiment on a single laptop with the following characteristics: Intel Core i5 CPU M 460 @ 2.53GHz × 4, HDD at 7200 rpm, 6 GB of RAM, and Ubuntu 16.04 LTS as the operating system.

Although in the definition of Algorithm 1 we establish connections with a database, this is not necessary when running the algorithm on Spark; the dataset is loaded into the framework with a load instruction instead. We loaded Fire_Department_Calls_for_Service.csv into the variable dfCalls and ran the following queries as part of the algorithm:

(1) Number of rows from the table: dfCalls.count()
(2) List of attributes from the table: dfCalls.columns
(3) Number of distinct values for each attribute: dfCallsG = dfCalls.groupBy(attribute).count(); dfCallsG.count()
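Put together, a minimal PySpark sketch of this experiment could look as follows; the session setup, the CSV reading options, and the timing code are our own assumptions about how the three queries above were issued, since the paper only reports the statements themselves:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("biases-detection-experiment").getOrCreate()

    t0 = time.time()
    # Load the dataset (path illustrative); inferSchema forces a full pass over the file
    dfCalls = spark.read.csv("Fire_Department_Calls_for_Service.csv",
                             header=True, inferSchema=True)
    print("load:", round(time.time() - t0), "s")

    t0 = time.time()
    rn = dfCalls.count()              # Query 1: number of rows
    print("query 1:", round(time.time() - t0), "s")

    attributes = dfCalls.columns      # Query 2: list of attributes

    t0 = time.time()
    for attribute in attributes:      # Query 3: one grouping pass per column
        dfCallsG = dfCalls.groupBy(attribute).count()
        dfCallsG.count()              # number of distinct values of the attribute
    print("query 3, all columns:", round(time.time() - t0), "s")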
The execution times of our approach over 5.1 million rows, including all the passes needed to process the 34 columns of the dataset (1.75 GB), are as follows: loading the data table takes 46 seconds, Query 1 takes 27 seconds, Query 2 executes in under 1 second, and Query 3 takes 993 seconds. Therefore, the total time required to run Algorithm 1 in this experiment is 1066 seconds, i.e., 17 minutes and 46 seconds.

7 CONCLUSIONS AND FUTURE WORK

Data bias is becoming a prominent problem due to its impact on data analytics and AI. Current solutions focus on the problem from an AI-outputs perspective, centering their efforts on constraining the model to re-balance the data at hand. The side effect is that the datasets are altered without understanding whether there is a problem at the data gathering step or the data is representing the actual distribution of the sample. In turn, potentially important information about the nature of the data is lost, which can have implications for interpreting the data and finding the root causes of the original imbalance.

Compared to these solutions, in this paper we have presented a Bias Detection Approach. Our proposal complements our previous works [14, 15] by including a novel algorithm that takes into account the scope of the analysis, detects biases, and presents them in a way that is understandable for users, including non-expert ones. The great advantage of our proposal is that we enable users to understand their data and make decisions considering biases from different perspectives without altering the dataset. Furthermore, all users may benefit from the reduction in time required to inspect and understand existing biases within their datasets, while at the same time avoiding biases going unnoticed, with the problems that this entails.
As part of our future work, we are continuing our work on new techniques to present biased attributes with a high number of categories. We are also applying our approach to unstructured data and including analytic requirements as an input to estimate the impact of data biases for each particular user.

ACKNOWLEDGMENTS

This work has been co-funded by the ECLIPSE-UA (RTI2018-094283-B-C32) project funded by the Spanish Ministry of Science, Innovation, and Universities. Ana Lavalle holds an Industrial PhD Grant (I-PI 03-18) co-funded by the University of Alicante and the Lucentia Lab Spin-off Company.

REFERENCES

[1] 2019. Fire Department Calls for Service dataset. https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3. Accessed: 23/10/2019.
[2] Alaa Althubaiti. 2016. Information bias in health research: definition, pitfalls, and adjustment methods. Journal of Multidisciplinary Healthcare 9 (2016), 211.
[3] Apache. 2019. Apache Spark. https://spark.apache.org/. Accessed: 23/10/2019.
[4] Apache. 2019. Apache Zeppelin. https://zeppelin.apache.org/. Accessed: 23/10/2019.
[5] Colin B. Begg and Jesse A. Berlin. 1988. Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 151, 3 (1988), 419–445.
[6] Kenneth Brant, Moutusi Sau, Anthony Mullen, Magnus Revang, Chirag Dekate, Daryl Plummer, and Whit Andrews. 2017. Predicts 2018: Artificial Intelligence. https://www.gartner.com/en/documents/3827163/predicts-2018-artificial-intelligence. Accessed: 23/10/2019.
[7] Irene Y. Chen, Fredrik D. Johansson, and David Sontag. 2018. Why Is My Classifier Discriminatory? In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada. 3543–3554.
[8] Michael Correll, Mingwei Li, Gordon Kindlmann, and Carlos Scheidegger. 2018. Looks Good To Me: Visualizations As Sanity Checks. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 830–839.
[9] Mojgan Ghanavati, Raymond K. Wong, Fang Chen, Yang Wang, and Chang-Shing Perng. 2014. An effective integrated method for learning big imbalanced data. In 2014 IEEE International Congress on Big Data. IEEE, 691–698.
[10] Patrick O. Glauner, Petko Valtchev, and Radu State. 2018. Impact of Biases in Big Data. CoRR (2018).
[11] Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1263–1284.
[12] Joffrey L. Leevy, Taghi M. Khoshgoftaar, Richard A. Bauder, and Naeem Seliya. 2018. A survey on addressing high-class imbalance in big data. Journal of Big Data 5 (2018), 42.
[13] Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. 2006. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering 30, 1 (2006), 25–36.
[14] Ana Lavalle, Alejandro Maté, and Juan Trujillo. 2019. Requirements-Driven Visualizations for Big Data Analytics: a Model-Driven Approach. In International Conference on Conceptual Modeling, ER 2019, to appear. Springer.
[15] Ana Lavalle, Alejandro Maté, Juan Trujillo, and Stefano Rizzi. 2019. Visualization Requirements for Business Intelligence Analytics: A Goal-Based, Iterative Framework. In 27th IEEE International Requirements Engineering Conference, RE 2019, to appear.
[16] Chaoliang Li and Shigang Liu. 2018. A comparative study of the class imbalance problem in Twitter spam detection. Concurrency and Computation: Practice and Experience 30, 5 (2018).
[17] Victoria López, Sara Del Río, José Manuel Benítez, and Francisco Herrera. 2015. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets and Systems 258 (2015), 5–38.
[18] Victoria López, Alberto Fernández, and Francisco Herrera. 2014. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed. Information Sciences 257 (2014), 1–13.
[19] Jerome Pesenti. 2018. AI at F8 2018: Open frameworks and responsible development. https://engineering.fb.com/ml-applications/ai-at-f8-2018-open-frameworks-and-responsible-development/. Accessed: 23/10/2019.
[20] Sashank Jakkam Reddi, Barnabás Póczos, and Alexander J. Smola. 2015. Doubly Robust Covariate Shift Correction. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[21] Isaac Triguero, Sara del Río, Victoria López, Jaume Bacardit, José Manuel Benítez, and Francisco Herrera. 2015. ROSEFW-RF: The winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems 87 (2015), 69–79.
[22] Gary M. Weiss and Foster Provost. 2003. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19 (2003), 315–354.
[23] Matt Wood. 2018. Thoughts On Machine Learning Accuracy. https://aws.amazon.com/es/blogs/aws/thoughts-on-machine-learning-accuracy/. Accessed: 23/10/2019.
[24] Qiang Yang and Xindong Wu. 2006. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5, 04 (2006), 597–604.