     An Approach to Automatically Detect and Visualize Bias in
                         Data Analytics
                    Ana Lavalle                                             Alejandro Maté                                  Juan Trujillo
           Lucentia Research (DLSI)                                   Lucentia Research (DLSI)                       Lucentia Research (DLSI)
            University of Alicante                                     University of Alicante                         University of Alicante
        San Vicente del Raspeig, Spain                             San Vicente del Raspeig, Spain                 San Vicente del Raspeig, Spain
         Lucentia Lab, Alicante, Spain                              Lucentia Lab, Alicante, Spain                  Lucentia Lab, Alicante, Spain
             alavalle@dlsi.ua.es                                         amate@dlsi.ua.es                               jtrujillo@dlsi.ua.es
ABSTRACT
Data Analytics and Artificial Intelligence (AI) are increasingly driving key business decisions and business processes. Any flaws in the interpretation of analytic results or AI outputs can lead to significant economic losses and reputation damage. Among existing flaws, one of the most often overlooked is the use of biased data and imbalanced datasets. When it goes unnoticed, data bias warps the meaning of data and has a devastating effect on AI results. Existing approaches deal with data bias by constraining the data model, altering its composition until the data is no longer biased. Unfortunately, studies have shown that crucial information about the nature of data may be lost during this process. Therefore, in this paper we propose an alternative process, one that detects data biases and presents biased data in a visual way so that users can comprehend how the data is structured and decide whether or not constraining approaches are applicable in their context. Our approach detects the existence of biases in datasets through our proposed algorithm and generates a series of visualizations in a way that is understandable for users, including non-expert ones. In this way, users become aware not only of the existence of biases in the data, but also of how they may impact their analytics and AI algorithms, thus avoiding undesired results.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Nowadays, Data Analytics has become a key component of many business processes. Whether driving business decisions or offering new services through Artificial Intelligence (AI) algorithms, data serves as the main resource for improving business performance. Therefore, any flaws within the data or its use will be translated into significant performance and economic losses.
   One such flaw is data bias and the use of imbalanced datasets. When it goes unnoticed, data bias can significantly affect the interpretation of data and has a devastating impact on AI results, as recently reported by the Gartner Group [6]. One area where biases lead to life-threatening consequences is Healthcare, where identifying as healthy a patient who is incubating a severe illness may delay their treatment [2].
   As such, data bias has become an important concern in the community, with big companies like Amazon, Facebook, Microsoft, Google, etc. investing resources and effort to tackle the problem. Amazon Web Services [23] has published information about fairness in their machine-learning services in terms of accuracy, false positive and false negative rates. Facebook [19] has shown one of its internal anti-bias software tools, "Fairness Flow", which measures how a model interacts with specific groups.
   Unfortunately, most approaches developed until now are mainly focused on machine learning and on rebalancing the biased datasets. As [7] argues, the fairness of predictions should be evaluated in the context of the data, and unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection rather than by constraining the model. As such, a general approach that automatically warns the user of the existence of biases and lets her analyze the data from different perspectives without altering the dataset is missing.
   Therefore, in this paper we focus our work on detecting and presenting in a humanly understandable way the existence of data bias and imbalanced datasets, with a special focus on enabling the analysis through data analytics without altering the dataset.
   Our approach complements our previous work [15], [14], where we presented an iterative Goal-Based modeling approach based on the i* language for the automatic derivation of data visualizations and aligned it with the Model Driven Architecture (MDA) in order to facilitate the creation of the right visual analytics for non-expert users. Now, we include a Biases Detection Process that automatically detects the existence of biases in the datasets and enables users to measure them and select those which are relevant to them. Our process includes a novel algorithm that takes into account the scope of the analysis, detects biases, and presents them in a way that is understandable for users, including non-expert ones. In this way, users become aware not only of the existence of biases in their datasets, but also of how they may impact their analytics and AI algorithms, thus avoiding unwanted results.
   The rest of the paper is structured as follows. Section 2 presents a classification of types of biases. Section 3 summarizes the related work in this area. Section 4 describes our proposed process. Section 5 presents our Biases Detection Approach. Section 6 describes the results of the experiments applying our approach. Finally, Section 7 summarizes the conclusions and our future work.

2 BIASES IN DATA
In order to illustrate the negative impact of data bias, in this section we provide a classification of types of biases. There are different types of biases in datasets, the most common being Class Imbalance and Dataset Shift.
   Class Imbalance is the case where classes are not equally represented in the data; that is, one or more categories in the dataset have a higher representation than the rest of the categories. It is usual to find this kind of bias in real-world datasets [12]. This bias causes several problems, especially when people are trying to analyze the data and/or applying AI algorithms.
   Dataset Shift refers to the case where the distribution of the data within the training dataset does not match the distribution in the test and real datasets. In real-world datasets, the training and test datasets have often not been generated by the same distribution.
   Artificial Intelligence algorithms trained on biased training sets tend not to generalize well on test data that comes from the true underlying distribution of the population, which has a negative effect on the quality of a machine learning model. As [18] argues, there are three potential types of dataset shift:
   Covariate Shift: It happens when the input attributes have different distributions between the training and test datasets.
   Prior Probability Shift: In this case, it happens when the class distribution is different between the training and test datasets.
   Concept Shift: It happens when the relationship between the input and class variables changes. It usually occurs when training data is collected at a different point in time than testing data.
   Biased datasets are very common and they can cause severe problems if biases are not taken into account and treated properly depending on the type of bias we are facing, the context, and the objective that the dataset is being used for. Therefore, it is paramount to show users how biased their data are, in order to enable them to take into account those biases which are determinant to them. Otherwise, their decisions will likely have unexpected and negative consequences.
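   To make these notions concrete, the following minimal Python sketch (an illustration added for this discussion, not part of our proposal; the data and column name are invented) quantifies class imbalance as a majority-to-minority ratio and exposes a prior probability shift by comparing the class distributions of a training and a test set:

    import pandas as pd

    # Hypothetical training and test sets with a single class label.
    train = pd.DataFrame({"label": ["healthy"] * 950 + ["ill"] * 50})
    test = pd.DataFrame({"label": ["healthy"] * 700 + ["ill"] * 300})

    # Class Imbalance: ratio between the most and least represented class.
    counts = train["label"].value_counts()
    print(f"majority-to-minority ratio: {counts.max() / counts.min():.0f}:1")   # 19:1

    # Prior Probability Shift: the class distribution differs between
    # training and test data even if the input attributes do not.
    train_dist = train["label"].value_counts(normalize=True).rename("train")
    test_dist = test["label"].value_counts(normalize=True).rename("test")
    print(pd.concat([train_dist, test_dist], axis=1))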
3 RELATED WORK
The class imbalance problem has been encountered in multiple areas, some of them with a serious impact, such as in the interpretation of medical data [5]. This problem has also been considered one of the top 10 problems in data mining and pattern recognition [24]. The issue with imbalance in class distribution becomes more pronounced with the application of AI algorithms. Mining and learning classifiers from imbalanced datasets is indeed a very important problem from both the algorithmic and performance perspective [13]. Not choosing the right distribution can introduce bias towards the most represented class. Since most AI algorithms expect a balanced class distribution [11], an algorithm trained with imbalanced datasets will tend to inadvertently return results favoring the most populated classes.
   Different authors have proposed several techniques to handle these problems. Generally, the approaches to deal with Imbalanced Data issues fall into three categories [16]:
   Data perspective: uses techniques to artificially re-balance the class distribution by sampling the data space to diminish the effect caused by class imbalance. As [10] argues, one intuitive method is undersampling the majority classes by dropping training examples. This approach leads to smaller data sets, but important examples could be dropped during the process. Another method is oversampling the minority classes.
   Algorithmic perspective: these solutions try to adapt or modify cost adjustment within the learning algorithm to make it perform better on imbalanced data sets during the training process. For example, [17] proposes an algorithm that is able to deal with the uncertainty that is introduced in large volumes of data without disregarding the learning in the underrepresented class.
   Ensemble approach: this type of solution uses aspects from both perspectives to determine the final prediction. [9] proposes an integrated method for learning large imbalanced datasets. Their approach examines a combination of metrics across different learning algorithms and balancing techniques. The most accurate method is then selected to be applied on real large, imbalanced, and heterogeneous datasets.
   In the case of Dataset Shift (when the training data and test data are distributed differently), a common approach is to reweight data such that the reweighted distribution matches the target distribution [20]. In [22], the authors analyze the class distribution of training data in order to determine the best class distribution for learning. [10] have recently proposed decision tree learning for finding a model that is able to distinguish between training and test distributions.
   On the other hand, some works have focused on the impact of data flaws on the visual features of visualizations. M. Correll et al. in [8] show how it is possible to create visualizations that seem "plausible" (design parameters are within normal bounds and pass the visual sanity check) but hide crucial data flaws. Biases can be considered data flaws if the context so determines. It is possible to detect biases in datasets when the classification categories are not approximately equally represented.
   As we have shown, most approaches developed until now are mainly focused on machine learning and on rebalancing the biased datasets. However, our goal is not to balance the biased datasets. As [7] argues, the fairness of predictions should be evaluated in the context of the data, and unfairness induced by inadequate sample sizes or unmeasured predictive variables should be addressed through data collection rather than by constraining the model. For this reason, we propose an approach that automatically warns the user of the existence of biases and lets her analyze the data from different perspectives without altering the dataset, since one of the core benefits of visualizations is enabling people to discover visual patterns that might otherwise remain hidden [8].

4 PROPOSED PROCESS
In this section, we describe our proposed process. Fig. 1 summarizes the process followed in our proposal, representing in a red cloud the new elements introduced in this paper. The rest of the elements were introduced in our previous work [15], [14].
   In our process, firstly, a sequence of questions guides users in creating a User Requirements Model [15] that captures their needs and analysis context. Then, this model is complemented by the Data Profiling Model [15], which analyzes the features of the data sources selected to be visualized. The user requirements, together with the data profiling information, are translated into a Visualization Specification that enables users to derive the best visualization types [15] in each context automatically. This transformation generates a Data Visualization Model [14].
   The Data Visualization Model enables users to specify visualization details regardless of their implementation technology. This model also enables users to determine whether or not the proposed visualization is adequate to satisfy the essential requirements for which it was created. If the proposed visualization does not pass the user validation, it will point out the existence of missing or wrongly defined requirements. In this case, a new cycle is started by reviewing the existing model to identify which aspects were not taken into account, generating in turn an updated model. Otherwise, a successful validation will start the Biases Detection Process. Once users have validated the visualization, the attributes of the collections that have been selected in the process to be represented in the visualization are analyzed. Our novel algorithm examines the data to automatically detect biases and presents this information to the users. Users may define thresholds to adapt the Biases Detection Process to their specific needs. The definition of thresholds is performed in an easy way, adapted for non-expert users, by defining two variables through the interface. This new functionality will make users aware of biases that could significantly alter the interpretation of their data, as well as of the techniques to be used for the analysis.
[Figure 1 omitted: a diagram connecting the User, Guidelines, Data Source, User Requirements Model, Model Review Process, Data Profiling Model, Visualization Specification, Data Visualization Model, Data Visualization Review, Biases Detection Process (add biases information), Implementation, and Periodic Monitoring.]

Figure 1: Overall view of the proposed process
   As a result of the process, users will obtain a visual representation of the bias and will be offered the option to include information in their analytics about each of the attributes detected as biased by the algorithm. If they decide to add information about a biased attribute, they can integrate this information within the visualization that they had created for the initial analysis or, alternatively, in a new visualization that is dynamically connected with the visualization of the process, so that when one of the visualizations is interacted with, the other one is updated.
   If users decide to add new information about some biased attribute, a new visualization specification will be generated. Therefore, in the Data Visualization Model, users will be able to customize the visualization(s) and select how to represent the biases information. Once users validate the new visualizations and do not wish to add further information, the corresponding implementation will be generated.
   Finally, when the visualization has been implemented and users are working with it, it is possible to program a Periodic Monitoring. The aim of this continuous monitoring is to ensure that, as new data populates the data sources, no new biases are introduced unnoticed. The Periodic Monitoring event will trigger an execution of our Biases Detection Algorithm with the aim of automatically detecting whether the data has exceeded the defined thresholds. If a threshold has been exceeded, an alert will be shown to users. This will enable them to return to the Biases Detection Process and choose whether they want to edit or add information about this new bias in the visualizations.
   By following this process, we facilitate data analysis and bias awareness for non-expert users in data visualization. Furthermore, all users may benefit from the reduction in time involved, since overlooking existing biases will lead to problems, requiring users to manually identify the biases that originated them and to rebuild all the visualizations or re-train their AI algorithms. Therefore, the process enables users to retain control of how data biases affect their data and makes them aware of the impact on their analytics and AI algorithms.
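   As an illustration of how such a Periodic Monitoring could be scheduled (a minimal sketch under our own assumptions, not the implementation used in our tool; detect_biases stands for any implementation of the Biases Detection Algorithm and the monitoring period is hypothetical), the following Python snippet re-runs the detection and alerts the user when a new attribute exceeds the defined threshold:

    import time

    THD_BIASES = 8                   # admissible bias ratio defined by the user
    MONITORING_PERIOD = 24 * 3600    # hypothetical period: once a day

    def detect_biases(tables):
        """Placeholder for the Biases Detection Algorithm (Algorithm 1).
        Returns a dict {attribute: bias_ratio}."""
        raise NotImplementedError

    def periodic_monitoring(tables, known_biases):
        while True:
            current = detect_biases(tables)
            for attribute, ratio in current.items():
                if ratio > THD_BIASES and attribute not in known_biases:
                    print(f"ALERT: new bias detected in '{attribute}' (ratio {ratio:.1f})")
            known_biases = set(current)      # remember already-reported biases
            time.sleep(MONITORING_PERIOD)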
5 BIASES DETECTION
Our proposal starts from the result of our process for the automatic derivation of visualizations, shown in Fig. 1. In this sense, we assume that the user has defined her requirements and the information that she wants to analyze, and that the visualization that best suits her needs has been automatically derived. Once the user has validated the visualization, it is possible that certain elements are changing the interpretation of the data and the user is unaware of them. Therefore, at this point we introduce our novel Biases Detection Process to detect biases in the data, based on the algorithm proposed in this paper that will facilitate this task. It is important to note that, although we assume that the user has followed our previous approach, the proposed process can be applied to visualizations obtained through other tools, as long as the necessary information is provided as input to the algorithm.
   The first step in our proposed Biases Detection Process is to automatically analyze, through Algorithm 1, the attributes of the collections used for the visualization defined in the process. This algorithm enables us to automatically detect biases in the data through an analysis of the datasets, giving us information as to how biased the data are. Users can alter the limits for bias detection in order to tailor the algorithm to their particular case.
   It is important to note that, although we exemplify the implementation of our algorithm assuming an existing relational database, our proposal can be applied to any context where structured or semi-structured data is being analyzed.
Algorithm 1: Biases Detection Algorithm
   /* tables_vis comes automatically from the process; thdCategorical and thdBiases are defined by default, but users may personalize them */
   Input:  tables_vis[] = list of tables used in the visualization;
           thdCategorical = 0.05: number that represents the maximum percentage of the total elements of the table for an attribute to be considered categorical;
           thdBiases = 8: number between 0 and 10 that establishes the admissible bias ratio of the attributes (0 being equally distributed and 10 very biased)
   Output: biasedAtt = list of attributes and their bias

 1  foreach table in tables_vis do
 2      Statement stmt = con.createStatement();
        /* Query 1 */
 3      String rowsQuery = "SELECT COUNT(*) FROM " + table;
        /* Query 2 */
 4      String attributesQuery = "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = " + table;
        /* number of rows in the table */
 5      ResultSet rsRN = stmt.executeQuery(rowsQuery);
 6      int RN = rsRN.getInt(1);
        /* list of attributes of the table */
 7      ResultSet attributes = stmt.executeQuery(attributesQuery);
        /* for each attribute of the table */
 8      foreach attribute in attributes do
            /* Query 3 */
 9          String groupAttrQuery = "SELECT COUNT(" + attribute + ") FROM " + table + " GROUP BY " + attribute;
            /* number of times that each different value of the attribute appears */
10          ResultSet rsGroupAttr = stmt.executeQuery(groupAttrQuery);
            /* number of distinct values of the attribute */
11          rsGroupAttr.last();
12          int RND = rsGroupAttr.getRow();
            /* if it is a categorical attribute */
13          if RND < RN * thdCategorical then
                /* extract the counts of the most and least repeated values */
14              int max = max(rsGroupAttr);
15              int min = min(rsGroupAttr);
                /* calculate and normalize the bias of the attribute */
16              float biasAttribute = ((max - min) / max) * 10;
                /* if the bias is bigger than the threshold defined by the user */
17              if biasAttribute > thdBiases then
18                  biasedAtt.append(attribute, biasAttribute);
19              end
20          end
21      end
22  end
23  return biasedAtt;

   Algorithm 1 starts with the input of the data tables (tables_vis) that are used for the visualization. These tables come automatically to the algorithm from the previous step of our process. In addition, the variables thdCategorical and thdBiases define the thresholds that delimit the biases and the attributes; these thresholds do not need to be defined by the user, as they are already assigned default values according to our experience analyzing datasets.
   To define the thresholds, we have analyzed different studies. Academic research [11] suggests that there is a situation of class imbalance when the majority-to-minority class ratio is within the range of 100:1 to 10000:1. However, from the viewpoint of effective problem solving, lower class imbalances that make modeling and prediction of the minority class a complex and challenging task (i.e. in the range of 50:1 and lower) are considered high class imbalance by domain experts [21].
   In our case, the variable thdCategorical is a number that represents the maximum percentage of the total elements of the table for an attribute to be considered categorical. An attribute is categorical when it can only take a limited number of possible values. The default threshold for this variable has been defined heuristically, setting its value to 5% (0.05). This threshold enables us to discover categorical attributes within the data, even when a schema is not available, such as with NoSQL databases or file-based systems.
   Moreover, the variable thdBiases is a number between 0 and 10 that establishes the admissible bias ratio of the attributes (0 being equally distributed and 10 very biased). The bias ratio represents the relationship between the values that appear the least and the most in an attribute. Therefore, by adjusting this variable, users may limit when an attribute is considered biased, i.e. when the difference between the most and least common value is decisive for them. We propose 8 as the default value. Therefore, if the most common value has 8 times or more the representation of the least common value, the attribute will be considered highly biased.
   Finally, the output of this algorithm will be biasedAtt, a list with the information about each attribute and its bias ratio.
   The algorithm is executed for each table used for the visualization (line 1). For each table, it stores the number of rows of the table in the variable RN (lines 5-6). Then, the attributes of the table are included in the variable attributes (line 7). For each attribute in the list (line 8), a ResultSet rsGroupAttr with the number of repetitions of each different value is stored (line 10). In lines 11-12, the number of distinct values of this attribute is calculated and stored in RND. Afterwards, the algorithm evaluates whether this attribute is categorical or not (line 13). An attribute is considered categorical when the number of distinct values of this attribute (RND) is lower than the number of rows of the table (RN) multiplied by the categorical threshold thdCategorical defined earlier (5%). If this condition is fulfilled, the counts of the values that have the highest (max, line 14) and lowest (min, line 15) representation are extracted from the ResultSet rsGroupAttr, which contains the number of times that each different value of the attribute appears. Then, the bias of each attribute is calculated and normalized in biasAttribute (line 16) using the following formula:

      ((max - min) / max) * 10      (1)

   We have used Min-Max normalization because it guarantees that all attributes will have the exact same scale and highlights outliers. This is a desirable characteristic in our case, since detecting the existence of these outlier biases and warning the user is one of our main goals. With this normalization, we will have a ratio for each attribute in biasAttribute that provides an indication, in the 0 to 10 range, of how biased the attribute is, 0 being equally distributed and 10 very biased.
   If biasAttribute is bigger than the threshold thdBiases (line 17), it means that the attribute has a considerable bias that should be analyzed. Then, the name of the attribute and its bias ratio, previously calculated in biasAttribute, will be stored in biasedAtt (line 18). Therefore, when the algorithm concludes, the variable biasedAtt will contain the list of attributes with their bias ratio.
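   As a worked example of the criteria above (an illustrative Python sketch with invented data, not the reference implementation), the following code applies the categorical check RND < RN * thdCategorical and formula (1) to a small in-memory table:

    import pandas as pd

    thd_categorical = 0.05   # default categorical threshold (5%)
    thd_biases = 8           # default admissible bias ratio

    # Hypothetical table: 1000 rows with one heavily skewed categorical attribute.
    table = pd.DataFrame({"call_type": ["Medical"] * 900 + ["Fire"] * 90 + ["Other"] * 10})

    biased_att = {}
    rn = len(table)                                        # number of rows (RN)
    for attribute in table.columns:
        value_counts = table[attribute].value_counts()     # equivalent of Query 3
        rnd = len(value_counts)                            # number of distinct values (RND)
        if rnd < rn * thd_categorical:                     # categorical attribute?
            max_c, min_c = value_counts.max(), value_counts.min()
            bias = (max_c - min_c) / max_c * 10            # formula (1)
            if bias > thd_biases:
                biased_att[attribute] = bias

    print(biased_att)   # {'call_type': 9.88...}: the attribute is reported as highly biased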
6 PERFORMANCE ANALYSIS
In order to implement the experiment, we have downloaded the Fire Department Calls for Service dataset from [1], obtaining a 1.75 GB file.
   We have chosen Apache Spark [3] to process this file because of its speed, ease of use, and advanced in-memory analytical capabilities. Specifically, we have used Apache Zeppelin [4] 0.8 as the development environment, with its default configuration.
   We have run the experiment on a single laptop with the following characteristics: Intel Core i5 CPU M 460 @ 2.53GHz × 4, HDD at 7200 rpm, 6GB of RAM, and Ubuntu 16.04 LTS as the OS.
   Although the definition of Algorithm 1 establishes connections with a database, this is not necessary when running the algorithm on Spark; the dataset is loaded into the framework using a load instruction instead. We have loaded Fire_Department_Calls_for_Service.csv into the variable dfCalls and we run the following queries as part of the algorithm:
   (1) Number of rows of the table: dfCalls.count()
   (2) List of attributes of the table: dfCalls.columns
   (3) Number of distinct values of each attribute:
       dfCallsG = dfCalls.groupBy(attribute).count()
       dfCallsG.count()
   The execution times of our approach over 5.1 million rows, including all the passes needed to process the 34 columns of the dataset (1.75 GB), are as follows: loading the data table takes 46 seconds, Query 1 takes 27 seconds, Query 2 executes in under 1 second and, finally, Query 3 takes 993 seconds. Therefore, the time required to run Algorithm 1 in this experiment is a total of 1066 seconds, i.e. 17 minutes and 46 seconds.
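   For reference, the following PySpark sketch shows one way in which the three queries above can be combined into the bias computation of Algorithm 1 (our assumption of how the notebook could be organized; the file path is illustrative and the default thresholds from Section 5 are used):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("BiasesDetection").getOrCreate()

    # Load the dataset into dfCalls (illustrative path).
    dfCalls = spark.read.csv("Fire_Department_Calls_for_Service.csv",
                             header=True, inferSchema=True)

    thdCategorical, thdBiases = 0.05, 8
    biasedAtt = {}

    RN = dfCalls.count()                                 # Query 1: number of rows
    for attribute in dfCalls.columns:                    # Query 2: list of attributes
        dfCallsG = dfCalls.groupBy(attribute).count()    # Query 3 (grouped counts)
        RND = dfCallsG.count()                           # number of distinct values
        if RND < RN * thdCategorical:                    # categorical attribute?
            counts = [row["count"] for row in dfCallsG.collect()]
            biasAttribute = (max(counts) - min(counts)) / max(counts) * 10
            if biasAttribute > thdBiases:
                biasedAtt[attribute] = biasAttribute

    print(biasedAtt)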
7 CONCLUSIONS AND FUTURE WORK
Data bias is becoming a prominent problem due to its impact on data analytics and AI. Current solutions focus on the problem from an AI-outputs perspective, centering their efforts on constraining the model to re-balance the data at hand. The side effect is that the datasets are altered without understanding whether there is a problem at the data gathering step or the data is representing the actual distribution of the sample. In turn, potentially important information about the nature of the data is lost, which can have implications for interpreting the data and finding the root causes of the original imbalance.
   Compared to these solutions, in this paper we have presented a Bias Detection Approach. Our proposal complements our previous works [14, 15] by including a novel algorithm that takes into account the scope of the analysis, detects biases, and presents them in a way that is understandable for users, including non-expert ones. The great advantage of our proposal is that we enable users to understand their data and make decisions considering biases from different perspectives without altering the dataset. Furthermore, all users may benefit from the reduction in time required to inspect and understand existing biases within their datasets, while at the same time they avoid biases going unnoticed, with the problems that this entails.
   As part of our future work, we are continuing our work on new techniques to present biased attributes with a high number of categories. We are also applying our approach to unstructured data and including analytic requirements as an input to estimate the impact of data biases for each particular user.

ACKNOWLEDGMENTS
This work has been co-funded by the ECLIPSE-UA (RTI2018-094283-B-C32) project funded by the Spanish Ministry of Science, Innovation, and Universities. Ana Lavalle holds an Industrial PhD Grant (I-PI 03-18) co-funded by the University of Alicante and the Lucentia Lab Spin-off Company.

REFERENCES
 [1] 2019. Fire Department Calls for Service dataset. https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3. Accessed: 23/10/2019.
 [2] Alaa Althubaiti. 2016. Information bias in health research: definition, pitfalls, and adjustment methods. Journal of Multidisciplinary Healthcare 9 (2016), 211.
 [3] Apache. 2019. Apache Spark. https://spark.apache.org/. Accessed: 23/10/2019.
 [4] Apache. 2019. Apache Zeppelin. https://zeppelin.apache.org/. Accessed: 23/10/2019.
 [5] Colin B. Begg and Jesse A. Berlin. 1988. Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 151, 3 (1988), 419–445.
 [6] Kenneth Brant, Moutusi Sau, Anthony Mullen, Magnus Revang, Chirag Dekate, Daryl Plummer, and Whit Andrews. 2017. Predicts 2018: Artificial Intelligence. https://www.gartner.com/en/documents/3827163/predicts-2018-artificial-intelligence. Accessed: 23/10/2019.
 [7] Irene Y. Chen, Fredrik D. Johansson, and David Sontag. 2018. Why Is My Classifier Discriminatory?. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada. 3543–3554.
 [8] Michael Correll, Mingwei Li, Gordon Kindlmann, and Carlos Scheidegger. 2018. Looks Good To Me: Visualizations As Sanity Checks. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 830–839.
 [9] Mojgan Ghanavati, Raymond K. Wong, Fang Chen, Yang Wang, and Chang-Shing Perng. 2014. An effective integrated method for learning big imbalanced data. In 2014 IEEE International Congress on Big Data. IEEE, 691–698.
[10] Patrick O. Glauner, Petko Valtchev, and Radu State. 2018. Impact of Biases in Big Data. CoRR (2018).
[11] Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1263–1284.
[12] Joffrey L. Leevy, Taghi M. Khoshgoftaar, Richard A. Bauder, and Naeem Seliya. 2018. A survey on addressing high-class imbalance in big data. Journal of Big Data 5 (2018), 42.
[13] Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. 2006. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering 30, 1 (2006), 25–36.
[14] Ana Lavalle, Alejandro Maté, and Juan Trujillo. 2019. Requirements-Driven Visualizations for Big Data Analytics: a Model-Driven Approach. In International Conference on Conceptual Modeling, ER 2019, to appear. Springer.
[15] Ana Lavalle, Alejandro Maté, Juan Trujillo, and Stefano Rizzi. 2019. Visualization Requirements for Business Intelligence Analytics: A Goal-Based, Iterative Framework. In 27th IEEE International Requirements Engineering Conference, RE 2019, to appear.
[16] Chaoliang Li and Shigang Liu. 2018. A comparative study of the class imbalance problem in Twitter spam detection. Concurrency and Computation: Practice and Experience 30, 5 (2018).
[17] Victoria López, Sara Del Río, José Manuel Benítez, and Francisco Herrera. 2015. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets and Systems 258 (2015), 5–38.
[18] Victoria López, Alberto Fernández, and Francisco Herrera. 2014. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed. Information Sciences 257 (2014), 1–13.
[19] Jerome Pesenti. 2018. AI at F8 2018: Open frameworks and responsible development. https://engineering.fb.com/ml-applications/ai-at-f8-2018-open-frameworks-and-responsible-development/. Accessed: 23/10/2019.
[20] Sashank Jakkam Reddi, Barnabás Póczos, and Alexander J. Smola. 2015. Doubly Robust Covariate Shift Correction. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[21] Isaac Triguero, Sara del Río, Victoria López, Jaume Bacardit, José Manuel Benítez, and Francisco Herrera. 2015. ROSEFW-RF: The winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems 87 (2015), 69–79.
[22] Gary M. Weiss and Foster Provost. 2003. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19 (2003), 315–354.
[23] Matt Wood. 2018. Thoughts On Machine Learning Accuracy. https://aws.amazon.com/es/blogs/aws/thoughts-on-machine-learning-accuracy/. Accessed: 23/10/2019.
[24] Qiang Yang and Xindong Wu. 2006. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5, 04 (2006), 597–604.