
The use of Maximum Completeness to Estimate Bias in AI-based Recommendation Systems

Alessandro Simonetta

Maria Cristina Paoletti

Alessio Venticinque

Department of Electrical and Information Engineering, University of Naples Federico II, Napoli, Italy
Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy


The use of AI-based recommendation systems, which rely on data analysis through Machine Learning algorithms, is taking away people's full control over decision making. The presence of unbalanced and incomplete data can cause discrimination against religious, ethnic, and political minorities without this phenomenon being easily detectable. In this context, it becomes critically important to understand the potential risks associated with learning from such a dataset and the consequences it may have on the outcome of decision making using Machine Learning algorithms. In this paper, we try to identify how to measure the group fairness of a prediction of a classification algorithm, to identify the quality features of the dataset that influence the learning process, and, finally, to evaluate the relationships between the quality features and the fairness measures.

Keywords: Fairness, Clustering, Machine Learning, Completeness, ISO/IEC 25012, Maximum Completeness, Metrics, Bias, Classification

1. Introduction

In 2019, The Economist [ 1 ] stated that data is an important resource, comparable to oil. Moreover, Forbes [ 2 ] defines data as the fuel of the information age, and Machine Learning (ML) as the engine that uses it. These technologies, together with the evolution of networks [ 3, 4 ], are providing opportunities to develop new applications.

Many companies and organizations are increasingly investing in decision-making processes centered on AI-based recommendation systems to offer a variety of services ranging from marketing [ 5, 6 ] to fault diagnosis [ 7 ]. Tools that make use of these types of algorithms relieve people from making decisions that may be influenced by moods, biases and subjective thoughts; they also ensure fairness and repeatability over time. The reasons why ML algorithms arrive at a certain type of result may not be transparent or easily understood by users. For this reason, techniques have been introduced such as Explainable AI, which allows analysts to understand how a given choice was arrived at, or Reinforcement Learning, which allows the decision-making process to be distributed across different levels in a counterbalanced way [ 8, 9, 10, 11 ].

These decision systems are based on a data-driven approach and their results are strongly influenced by the quality of the information available (balance, completeness, ...). The correctness and authenticity of the data alone [ 12, 13, 14, 15, 16, 17 ] are not enough to guarantee their quality. Thus, poor data quality, or a low level of representativeness, can lead to biased learning.

A very striking example of discrimination is the result of the algorithm used in Florida to predict the risk of re-offense, brought to light by the nonprofit organization ProPublica [ 18 ]. The algorithm, which assigned each person a score indicating the likelihood of reoffending, was trained using an unbalanced dataset and, as a result, black people showed greater recidivism than other ethnicities [ 19 ].

In this paper, we present a methodology that, starting from the training data, allows us to estimate the risk of unfair treatment in the prediction. To do this, it is necessary to identify how to measure the fairness of a prediction of a classification algorithm, to discover the quality characteristics of the dataset that influence the goodness of learning, and, last but not least, to study the relationships that exist between quality characteristics and the fairness measures of the ML algorithm.

The fairness of ML algorithms is a widely debated topic in science; in [ 20 ], [ 21 ] many performance and fairness indices are studied, such as False Positive and Equalized Odds. These metrics can be used to enhance different performance characteristics, which may be in contrast with each other, and each of them best fits a particular class of objectives [ 22 ], [ 23 ]. For example, there are indices that favor accuracy and others that favor precision [ 20 ]; the right trade-off must be found between the two, depending on the problem to be solved. Indeed, in cancer detection we prefer recall to precision: it is better to flag a healthy patient as probably ill than to fail to screen one who actually is.

Other studies [ 22, 24, 25, 26 ] are aimed at identifying the relationship between sensitive attributes and the target one. These show dependencies between the number of incorrect predictions (e.g., the ratio of predicted positives to real positives) and the features of the dataset. For example, if the algorithm errs advantageously with respect to sensitive attributes, that is, by attributing more positive outcomes to them, individuals belonging to this set are considered a privileged group. Conversely, if the algorithm errs negatively, associating more unfavorable outcomes than should normally be indicated, that group is considered unprivileged.

Regarding data quality aspects, we identified the international standards ISO/IEC 25012 [ 27 ] and ISO/IEC 25024 [ 28 ] as the models from which to draw the notion of completeness. This choice was also supported by the presence of studies [ 29 ], [ 30 ] that use these standards for dataset construction and maintenance of its quality over time. In particular, we identified the notion of maximum completeness [31] as satisfying the goal we had set for ourselves.

This paper starts from the state of the art (Section 2), comparing the advantages and disadvantages of different approaches to fairness and showing alternative synthesis solutions that are useful in identifying critical issues in the input data. In Section 3, we present our solution, developed from what has been proposed in the literature, in different versions, defining their pros and cons. In Section 4, we discuss the experimental results. In Section 5, we point out the limitations of the present work and the possible future developments. Finally, in Section 6, we present concluding remarks.

SYSTEM 2022: 8th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Brunek, July 23, 2022
* Corresponding author. All the authors contributed equally.
alessandro.simonetta@gmail.com (A. Simonetta); mariacristina.paoletti@gmail.com (M. C. Paoletti)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. State of Art

In general, a classification model is called fair if the mistakes are equally distributed across the different groups identified within the sensitive attribute. The input features X of the value space are mapped onto a target variable R according to a function f(X). Where the values of R are represented by classes or a score, i.e., in a range of a scale of N different values, they can be mapped back to a binary value by defining a threshold. In order to train these classifiers, example data are used in which the input features X are associated with a truth variable Y holding the real result. The quality of the resulting classifier derives from the goodness of these examples.

From now on, we will refer to A as the sensitive attribute, which can identify a minority. Although these variables are treated individually, an underprivileged group could be identified through a combination of them too. One work including metrics for assessing fairness is [ 22 ], where Disparate Impact and Demographic Parity (Statistical Parity) are introduced. The first one uses the ratio between $P(R = 1 \mid A = a_i)$ and $P(R = 1 \mid A = a_j)$, whereas the second one is the difference between the two probabilities. Demographic Parity is present in [32], too, where it is referred to as Independence, as it indicates the degree of independence of the target variable R from the sensitive attribute. Another measure of fairness reported in [ 22 ] is Equalized Odds, which is satisfied if the prediction is conditionally independent of the sensitive attribute, given the true value; it highlights the differences in true positive rate and false positive rate. In our work we call the latter index Separation. Equal Opportunity [ 22 ] requires the true positive rates to be similar across groups. Other metrics for estimating fairness are Sufficiency [32], similar to Equal Opportunity but focused on true values rather than predicted values, and Overall Accuracy Equality [ 21 ], which tests the average error between predictions across groups.

Determining which fairness metrics are best for finding the right configuration of an algorithm to use in a decision support system depends on the purpose for which the system is built and on the discrimination risks it may be exposed to. Studies [ 23 ], [33] point out that it is not possible to maximize all metrics simultaneously and therefore one must choose among the features that these measures tend to enhance, such as accuracy and recall. In [ 25 ] the authors present a framework for comparing indices and highlighting when the maximization of one conflicts with that of another. Their work goes beyond analyzing individual metrics, and groups them according to their characteristics (fairness of treatment, fairness of opportunity, interest groups, sensitive attributes, etc.) and their usefulness with respect to the target of the system (e.g., support for film discovery on a streaming platform). The metrics are clustered using a hierarchical algorithm applied to the correlation between metrics, in order to identify similar ones. The results are then diagrammed in a simplified manner through the use of Principal Component Analysis (PCA) to reduce the state space to two dimensions. In particular, the authors conclude that through the use of PCA it is possible to explain the relationships among different metrics by reducing the state space to between one and three components.

The idea of using balancing indices to predict the risk of discrimination can be found in [34]. In that work, for the first time, a measure of fairness is applied to the sensitive attribute and not to the comparison among groups. In the next section we will start from this topic and then propose different solutions for calculating a synthetic index related to the sensitive attribute.

the rest of the other ethnic groups (Native-American, Caucasian, Asian, Mexican, Other).

3. Methodology

The first formal criterion introduced in [ 32] requires that the sensitive attribute A be statistically independent of the predicted value R. Assuming we use a dataset with a field A having cardinality m, $A = \{a_1, \ldots, a_m\}$, the random variable A is independent of R if and only if, for each $i, j \in [1, m]$ with $i \neq j$, we have that:

$$P(R = 1 \mid A = a_i) = P(R = 1 \mid A = a_j) \tag{1}$$

To understand how far the two predictions deviate from the ideal case (zero difference), we can calculate the distance between the two probabilities:

$$U(a_i, a_j) = \left| P(R = 1 \mid A = a_i) - P(R = 1 \mid A = a_j) \right| \tag{2}$$

To get a synthesis value of the non-independence between A and R [34], the arithmetic mean of the distances can be considered:

$$U(a_1, \ldots, a_m) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} U(a_i, a_j) \tag{3}$$

Instead of using equation 2, some authors [22] apply a different notion of independence:

$$U(a_i) = \left| P(R = 1 \mid A = a_i) - P(R = 1 \mid A \neq a_i) \right| \tag{4}$$

What we have seen so far fails to explain whether there are groups, within the sensitive attribute, that undergo the same mode of treatment, or whether there is discrimination among groups; the present work was born from this reflection. To explain our idea, we use an example mentioned previously. The sensitive attribute A=Race of the Compas dataset contains six different ethnicities, shown in the first column of table 1.

The synthetic index described by equation 3 can show on average how much a sensitive attribute is at risk of discrimination, but might underestimate the inequity. With reference to the Compas dataset, in table 1 it is observed that the values in the second column ($P(R = 1 \mid A = a_i)$) cluster around 0.26 and 0.65. Figure 1 shows graphically the values of table 1, with evidence of the centroids identified through the K-Means algorithm. The difference between the value calculated using equation 3 (P=0.24) and the value obtained by considering centroids (P=0.39) turns out to be 0.15 points. This demonstrates what was stated earlier with respect to the use of a central tendency index. However, equation 3 can be used in the presence of more than two clusters.
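The gap between the average index of equation 3 and a centroid-based value can be sketched in Python as follows. This is only an illustration under stated assumptions: the four rates are hypothetical (they are not the values of table 1), and `two_means_1d` is a hand-written one-dimensional K-Means with K=2 seeded at the minimum and maximum, used here in place of a library implementation.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise_distance(rates):
    """Equation 3: mean of |p_i - p_j| (equation 2) over all unordered pairs."""
    pairs = list(combinations(rates, 2))
    if not pairs:
        raise ValueError("at least two rates are required")
    return sum(abs(p_i - p_j) for p_i, p_j in pairs) / len(pairs)


def two_means_1d(values, iterations=50):
    """Lloyd iterations for K = 2 on scalars, seeded at the minimum and maximum."""
    low, high = min(values), max(values)
    if low == high:
        return low, high  # all values coincide: a single cluster
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high


# Hypothetical positive-prediction rates P(R = 1 | A = a_i) for four groups.
rates = [0.2, 0.3, 0.6, 0.7]

average_gap = mean_pairwise_distance(rates)
low_centroid, high_centroid = two_means_1d(rates)
centroid_gap = abs(high_centroid - low_centroid)
print(f"equation 3: {average_gap:.2f}, centroid distance: {centroid_gap:.2f}")
```

With these toy rates the mean over all pairs is diluted by the small within-cluster distances, so it is lower than the distance between the two centroids.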

In the following, we will illustrate alternative methods for calculating different synthetic fairness indices that allow for greater sensitivity to discriminatory situations. Next, we will try to identify the relationship that exists between the dataset completeness index and the identified fairness indices. This will allow us to anticipate the risks of bias arising from incomplete data.

3.1. Dataset

This section lists the datasets used for the experimentation:

• COMPAS Recidivism Dataset [ 19 ];
• Recidivism in juvenile justice [35];
• UCI Statlog German Credit [36];
• default of credit card clients Data Set [37];
• Adult Data Set [38];
• Student Performance Data Set [39].

The sensitive attributes are listed in table 2.

Table 2
Dataset: sensitive attributes
Compas: Race, Sex, Age
Juvenile: V3_nacionalitat, V2_estranger, V1_sexe, V5_edat_fet_agrupat, V4_nacionalitat_agrupat, V8_edat_fet
UCI: Sex, Education
Income: Education, Race, Sex, Native country
Statlog: Status, Sex, foreignworker
Student: Sex, Age, Mother job, Father job, Mother Education, Father Education

For each dataset we calculated the predicted value using a classification model, except for the Compas dataset and Recidivism in juvenile justice, which already include this information, so we used the original one. Different models for classification are present in the literature and are implemented in software libraries. We chose logistic regression [40], which offers categorical results and can therefore be trained to predict the membership of an item in a class.

In order to evaluate the completeness of the dataset we examined the Max Completeness, introduced in [41]. Maximum completeness is an index measuring the degree of completeness of the dataset (Incomplete=0, Complete=1) with respect to one or more categorical attributes, when the expected value is that in which the attributes considered have a number of replications equal to that of the predominant one. Assuming we wish to calculate the completeness of a dataset of N records on a categorical attribute A with m distinct values, where $n_i$ is the number of records with value $a_i$:

$$C_{max}(A) = \frac{N}{m \cdot \max_i n_i}$$

This index can be calculated on multiple categorical attributes by taking m as the number of possible combinations of the chosen categorical variables and $\max_i n_i$ as the maximum number of items found in any single combination. For example, considering the Compas dataset and the attributes Race and Sex, to have the maximum completeness $C_{max}(Race, Sex) = 1$, we must have a number of items for all combinations of race and sex equal to 2,626, that is, the number of male African-American items, which is the most numerous category. To do this, the number of records within the dataset must increase to 31,512 from the current N=6,172 ($C_{max}(Race, Sex) = 0.19$).

3.2. Clustering Method

Starting from the consideration that the conditional probabilities of the random variable R with respect to membership in a sensitive attribute ($P(R = 1 \mid A = a_i)$) may determine affinity in treatment equivalence classes, we thought of using unsupervised clustering algorithms to identify possible clusters in the probabilities. Among the ML algorithms that were analyzed, we chose K-Means and DBSCAN [42]. K-Means is a clustering algorithm that tends to separate samples into K groups with equal variance, minimizing the within-cluster sum-of-squares criterion. To use this method, it is necessary to know a priori the number K of clusters into which to divide the samples. Once the centroids have been calculated, it is possible to use equation 2 or equation 3, depending on the value of K, to obtain the synthetic index. Compared with [34], bundling multiple instances of the sensitive attribute into groups results in a lower number m; indeed, each term $a_i$ is composed of all elements treated similarly.

The critical issue of correctly identifying the number K to be used for clustering led us to study other approaches and to subsequent experimentation with DBSCAN. This algorithm sees clusters as areas of high density separated by areas of lower density. Its application needs only two parameters: the one indicating the number of samples found in the area forming a cluster and the one indicating the density required to form a cluster (eps).

One of its limitations is that some elements turn out not to belong to any cluster; it was decided to treat them as unitary clusters. Since the concept of a centroid does not exist for DBSCAN, the value of the fairness metric was calculated as follows:

• clusters: probability as per equation 3 to calculate cluster fairness;
• individual elements: average of the fairness indices of individual instances of the sensitive attribute.

Applying the two clustering methods resulted in values that were higher on average than those calculated using only the arithmetic mean reported in equation 3. Furthermore, this made it possible to identify groups that were similar in treatment type.

3.3. MinMax Method

Although both clustering methodologies gave good results to work on, another approach was explored, starting from the definition found in [ 22 ], to calculate the value of the fairness metric by trying to find the worst case. The process is described below with reference to the Demographic Parity (Independence) metric (equation 4), but it can be extended to all the other metrics without loss of generality. This description places emphasis on the fact that $a_i$ is considered as an unprivileged group and the set of all other elements as a privileged group. The algorithm, for all values of the sensitive attribute A, calculates the result of equation 4 associated with each group, considering in turn the element under observation as discriminated and all others as privileged. Considering, again, the field Race, if the element we are calculating for is Asian, this is $a_i$ and all others constitute the other group in the equation. Once we have iterated this process for all values of the sensitive attribute, we select the highest and lowest result. The former is the group for which the predicted variable R is most dependent on ethnicity, while the latter is the most independent. The difference between these two values indicates how large the inequality of treatment between the privileged and unprivileged group is, relative to the sensitive attribute considered.

Compared with the use of clustering, this methodology places itself in the worst case by considering as the index of treatment inequity the largest difference between the values obtained by applying equation 4 and reported in table 3.
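The MinMax procedure can be sketched in Python as below. It is a minimal illustration, not the authors' implementation: the function names, the group labels and the toy records are hypothetical, and the index is taken to be the difference between the highest and lowest one-versus-rest value of equation 4, as described in the text.

```python
from collections import Counter


def one_vs_rest_gaps(groups, predictions):
    """Equation 4 for every value a of A: |P(R=1 | A=a) - P(R=1 | A!=a)|."""
    totals = Counter(groups)
    positives = Counter(g for g, r in zip(groups, predictions) if r == 1)
    n_all = len(groups)
    pos_all = sum(positives.values())
    gaps = {}
    for group, n_group in totals.items():
        n_rest = n_all - n_group
        if n_rest == 0:
            continue  # a single group has no "rest" to compare against
        p_in = positives[group] / n_group
        p_rest = (pos_all - positives[group]) / n_rest
        gaps[group] = abs(p_in - p_rest)
    return gaps


def min_max_index(gaps):
    """Difference between the highest and the lowest one-versus-rest gap."""
    return max(gaps.values()) - min(gaps.values())


# Hypothetical records: sensitive attribute value and binary prediction R.
groups = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]

gaps = one_vs_rest_gaps(groups, predictions)
index = min_max_index(gaps)
for group, gap in sorted(gaps.items()):
    print(f"{group}: {gap:.3f}")
print(f"MinMax index: {index:.3f}")
```

In this toy example group "c" receives positive predictions at the same rate as everyone else, so its gap is zero and the index coincides with the largest gap.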

4. Discussion

During the experimental phase, the methods presented in the previous paragraphs were applied to sensitive attributes belonging to six known datasets. The fairness metrics used during the testing phase are: Independence, Separation, Sufficiency and Overall Accuracy Equality, as defined in [ 32 ], [ 21 ]. Separation and Sufficiency were calculated by considering positive and negative predicted cases separately. The metrics resulting from this split are: Separation TPR (True Positive Rate), Separation FPR (False Positive Rate), Sufficiency PPV (Positive Predictive Value) and Sufficiency NPV (Negative Predictive Value). Once the results were evaluated for the six chosen fairness metrics, we related them to the maximum completeness balancing index.
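The four rates behind the Separation and Sufficiency metrics can be computed per group as in the following Python sketch. The helper names and the toy labels are assumptions made for illustration; a rate whose denominator is zero is returned as `None` rather than guessed.

```python
def confusion_rates(y_true, y_pred):
    """TPR, FPR, PPV and NPV from binary truth values Y and predictions R."""
    tp = fp = tn = fn = 0
    for y, r in zip(y_true, y_pred):
        if r == 1 and y == 1:
            tp += 1
        elif r == 1:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "TPR": ratio(tp, tp + fn),  # Separation, positive cases
        "FPR": ratio(fp, fp + tn),  # Separation, negative cases
        "PPV": ratio(tp, tp + fp),  # Sufficiency, positive predictions
        "NPV": ratio(tn, tn + fn),  # Sufficiency, negative predictions
    }


def rates_by_group(groups, y_true, y_pred):
    """Apply confusion_rates separately to each value of the sensitive attribute."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = confusion_rates(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return result


# Hypothetical data: two groups of three records each.
groups = ["x", "x", "x", "y", "y", "y"]
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]

rates = rates_by_group(groups, y_true, y_pred)
gaps = {k: abs(rates["x"][k] - rates["y"][k]) for k in rates["x"]}
print(rates)
print(gaps)
```

The per-metric gaps between groups play the same role as the distance of equation 2, applied to a rate other than the positive-prediction rate.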

Each diagram consists of two box plots containing the values of sensitive attributes divided in this way: Maximum Completeness values less than 0.33 (low risk, in yellow) and greater than 0.66 (high risk, in red). Intermediate values are not reported because it is more difficult to determine whether they are fair or not. We remark that the fairness metrics, as defined, take values in the range [ 0,1 ] (Fair=0, Unfair=1).
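The Maximum Completeness values used to split the box plots can be obtained as in the sketch below. It assumes that the index is the ratio between the actual number of records and the number a fully balanced dataset would contain (number of categories times the size of the most numerous one); this assumption is consistent with the Compas figures given in Section 3.1 (12 combinations of Race and Sex, 2,626 items in the largest one, N=6,172). The function names are hypothetical.

```python
from collections import Counter


def max_completeness_from_counts(n_records, n_categories, largest_count):
    """Records present divided by the records needed for perfect balance."""
    target = n_categories * largest_count
    if target == 0:
        raise ValueError("the dataset has no categories or no records")
    return n_records / target


def max_completeness(values):
    """Same index computed from the observed values of one attribute.

    Only categories that actually occur are counted; for combinations of
    attributes, pass the number of possible combinations explicitly to
    max_completeness_from_counts instead.
    """
    counts = Counter(values)
    if not counts:
        raise ValueError("no values provided")
    return max_completeness_from_counts(
        len(values), len(counts), max(counts.values())
    )


# Figures reported in the text for Compas, attributes Race and Sex.
compas_target = 12 * 2626
compas_index = max_completeness_from_counts(6172, 12, 2626)
print(compas_target, round(compas_index, 4))

# Hypothetical single attribute with an unbalanced distribution.
toy_index = max_completeness(["a", "a", "a", "b"])
print(round(toy_index, 4))
```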

Figure 3 shows the case where the six fairness metrics are calculated using the K-Means technique. Note that the tails of all boxplots overlap, while for the bodies a clear separation remains for Separation TPR and Sufficiency PPV. The worst case is the Sufficiency NPV metric, where there is total overlap.

Figure 4 shows the results of applying the DBSCAN technique. In this case the values obtained are similar to those of K-Means, although there are worse results for Overall Accuracy Equality and Sufficiency NPV. These plots were not optimal in figure 3 either.

Finally, in figure 5 the MinMax method as per equation 4 is applied instead of the clustering algorithms. In these diagrams, one can see a lengthening of the boxplots related to the risk cases and a sharper separation between the two boxplots for all diagrams. In the Independence and OAE cases, there is no more overlap between the tails. In the Sufficiency PPV case, the high-risk cases tend to one and almost full separation between values is achieved. For sensitive attributes that have a maximum completeness greater than 0.66, there are values of fairness metrics greater than 0.2. In conclusion, the MinMax method yielded better results compared to clustering but, conversely, using methods such as K-Means and DBSCAN can help to better define treatment similarities among groups.

5. Current Limits and Future Works

The experimentation carried out has yielded encouraging results regarding the separation of high-risk and low-risk fairness forecasts with respect to the maximum completeness. Although the result obtained shows values that are sometimes in a range that is not very wide, improvements have already been identified that can be investigated in future work. The first task is the choice of clustering method and of the parameters that affect the number of clusters. Having few clusters brings us closer to the worst case, and it becomes easier to identify the correct meaning to give to each cluster, depending on their distance and their shape. However, establishing a number of clusters that forces such a division could join groups that are not actually treated equally. Another point we will investigate is the possibility of changing the algorithm to identify the synthetic value of the cluster for the metric under consideration. One solution we will explore is to integrate the calculation of the difference between maximum and minimum in a clustering algorithm that does not have a predefined number K of clusters, such as the already presented DBSCAN. Related to this algorithm, alternatives on how to treat elements that are not associated with any cluster will be explored. The question is: should these cases be discarded because they are outliers, or do they represent borderline cases, e.g., a highly discriminated minority?

6. Conclusion

The spread of ML algorithms for constructing decision systems makes the data used in their construction increasingly important. Imbalances or biases that may be present within the information can affect the results of such systems, causing discrimination toward certain groups. The use of the fairness metrics that have been presented becomes important to predict the impact related to such biases and to act accordingly on both algorithms and input data, in line with the objectives. In this work we tried to provide a methodology to identify similar clusters by treatment type and to calculate a synthetic index that could predict how at risk the system is with respect to sensitive attributes. The experimentation carried out provided good results both in terms of identifying agglomerations that undergo similar treatments and in calculating a parameter that would give a conservative assessment of the metric.

The relationship between the maximum completeness index and the fairness indices calculated by the methods shown provided a guideline for recognizing high-risk and lower-risk sensitive attributes. This will give analysts the information needed to better configure classification algorithms.

The results of this work lay the groundwork for future developments aimed at improving the identification of groups within sensitive attributes and researching alternative synthetic indices that will have a greater precision.

References

[1] The Economist, The world's most valuable resource is no longer oil, but data, The Economist, USA (6th May 2019).

[2] Marr, The 5 biggest data science trends in 2022, Oct 2021. URL: https://www.forbes.com/sites/bernardmarr/2021/10/04/the-5-biggest-data-science-trends-in-2022/?sh=22f5fc1d40d3 (Accessed May, 2022).

[3] Giuliano, The next generation network in 2030: Applications, services, and enabling technologies, in: 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), 2021, pp. 294-298. doi:10.23919/EECSI53397.2021.9624241.

[4] Napoli, Pappalardo, Tramontana, A hybrid neuro-wavelet predictor for qos control and stability, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8249 LNAI (2013) 527-538. doi:10.1007/978-3-319-03524-6_45.

[5] Verma, Sharma, Deb, Maitra, Artificial intelligence in marketing: Systematic review and future research direction, International Journal of Information Management Data Insights 1 (2021) 100002. doi:10.1016/j.jjimei.2020.100002.

[6] Capizzi, G. Lo Sciuto, Napoli, E. Tramontana, An advanced neural network based solution to enforce dispatch continuity in smart grids, Applied Soft Computing Journal 62 (2018) 768-775.

[7] Dhomad, Jaber, Bearing fault diagnosis using motor current signature analysis and the artificial neural network 10 (2020) 70-79.

[8] Matta, G. C. Cardarilli, L. Di Nunzio, Fazzolari, Giardino, Nannarelli, M. Re, Spanò, A reinforcement learning-based qam/psk symbol synchronizer, IEEE Access 7 (2019) 124147-124157. doi:10.1109/ACCESS.2019.2938390.

[9] Brociek, Magistris, Cardia, Coppa, Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 89-94.

[10] Canese, G. C. Cardarilli, L. Di Nunzio, Fazzolari, Giardino, M. Re, S. Spanò, Multi-agent reinforcement learning: A review of challenges and applications, Applied Sciences 11 (2021). URL: https://www.mdpi.com/2076-3417/11/11/4948. doi:10.3390/app11114948.

[11] Brandizzi, Russo, Brociek, Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, volume 3118, CEUR-WS, 2021, pp. 71-76.

[12] Fallucchi, Gerardi, Petito, E. De Luca, Blockchain framework in digital government for the certification of authenticity, timestamping and data property, 2021. doi:10.24251/HICSS.2021.282.

[13] Brandizzi, Bianco, G. Castro, Russo, Wajda, Automatic rgb inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, CEUR-WS, 2021, pp. 66-74.

[14] Nowak, Nowicki, Woźniak, C. Napoli, Multi-class nearest neighbour classifier for incomplete data handling, in: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science), volume 9119, Springer Verlag, 2015, pp. 469-480. doi:10.1007/978-3-319-19324-3_42.

[15] Połap, Woźniak, Napoli, E. Tramontana, Damaševičius, Is the colony of ants able to recognize graphic objects?, Communications in Computer and Information Science 538 (2015) 376-387. doi:10.1007/978-3-319-24770-0_33.

[16] Illari, Russo, Avanzato, Napoli, A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, in: CEUR Workshop Proceedings, volume 2694, CEUR-WS, 2020, pp. 29-35.

[17] Dat, Ponzi, Russo, Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, CEUR-WS, 2021, pp. 51-63.

[18] Angwin, Larson, Mattu, Kirchner, Machine bias: There's software used across the country to predict future criminals. and it's biased against blacks., https://www.propublica.org/ (2016).

[19] Larson, Mattu, Kirchner, J. Angwin, Compas recidivism dataset, 2016. URL: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm/ (Accessed Oct, 2021).

[20] Lee, Analysis of precision and accuracy in a simple model of machine learning, Journal of the Korean Physical Society, 2017, p. 866-870. doi:10.3938/jkps.71.866.

[21] Carey, X. Wu, The statistical fairness field guide: perspectives from social and formal sciences, AI and Ethics (2022) 1-23. doi:10.1007/s43681-022-00183-3.

[22] Pessach, E. Shmueli, Algorithmic fairness, 2020. URL: https://arxiv.org/abs/2001.09784. doi:10.48550/ARXIV.2001.09784.

[23] Prince, Bias and fairness in ai, 2019. URL: https://www.borealisai.com/en/blog/tutorial1-bias-and-fairness-ai/ (Accessed Mar, 2022).

[24] Capizzi, G. Lo Sciuto, Napoli, Woźniak, G. Susi, A spiking neural network-based long-term prediction system for biogas production, Neural Networks 129 (2020) 271-279.

[25] Miron, Tolan, Gómez, Castillo, Addressing multiple metrics of group fairness in data-driven decision making, 2020. URL: https://arxiv.org/abs/2003.04794. doi:10.48550/ARXIV.2003.04794.

[26] Capizzi, G. Lo Sciuto, Napoli, Shikler, M. Wozniak, Optimizing the organic solar cell manufacturing process by means of afm measurements and neural networks, Energies 11 (2018).

[27] International Organization for Standardization, "ISO/IEC 25012:2008 Software engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Data quality model", 2008. URL: https://www.iso.org/standard/35736.html (Accessed Jan, 2021).

[28] International Organization for Standardization, "ISO/IEC 25024:2015 Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Measurement of data quality", 2015. URL: https://www.iso.org/standard/35749.html (Accessed Jan, 2022).

[29] Calabrese, Esponda, P. M. Pesado, Framework for Data Quality Evaluation Based on ISO/IEC 25012 and ISO/IEC 25024, in: VIII Conference on Cloud Computing, Big Data & Emerging Topics, 2020. URL: http://sedici.unlp.edu.ar/handle/10915/104778.

[30] Gualo, Rodriguez, Verdugo, I. Caballero, Piattini, Data quality certification using iso/iec 25012: Industrial experiences, Journal of Systems and Software 176 (2021) 110938. URL: https://www.sciencedirect.com/science/article/pii/S0164121221000352. doi:10.1016/j.jss.2021.110938.

[31] Simonetta, Trenta, M. C. Paoletti, A. Vetrò, Metrics for identifying bias in datasets, SYSTEM (2021).

[32] Barocas, Hardt, Narayanan, Fairness and machine learning, 2020. URL: https://fairmlbook.org/ (Accessed Sept, 2021), chapter: Classification.

[33] Burkholder, Kwock, Xu, Liu, C. Chen, S. Xie, Certification and trade-off of multiple fairness criteria in graph-based spam detection, Association for Computing Machinery, 2021, p. 130-139. doi:10.1145/3459637.3482325.

[34] Vetrò, Torchiano, Mecati, A data quality approach to the identification of discrimination risk in automated decision making systems, Government Information Quarterly 38 (2021) 101619. URL: https://www.sciencedirect.com/science/article/pii/S0740624X21000551. doi:10.1016/j.giq.2021.101619.

[35] Department of Justice, Recidivism in juvenile justice, 2016. URL: https://cejfe.gencat.cat/en/recerca/opendata/jjuvenil/reincidencia-justicia-menors/index.html (Accessed Oct, 2021).

[36] D. H. Hofmann, Uci statlog german credit, 1994. URL: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) (Accessed Oct, 2021).

[37] I.-C. Yeh, default of credit card clients data set, 2016. URL: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients (Accessed Oct, 2021).

[38] Kohavi, Becker, Adult data set, 1996. URL: https://archive.ics.uci.edu/ml/datasets/adult (Accessed Oct, 2021).

[39] Cortez, Student performance data set, 2014. URL: https://archive.ics.uci.edu/ml/datasets/student+performance (Accessed Oct, 2021).

[40] scikit-learn developers, Logistic regression, 2022. URL: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (Accessed May, 2022).

[41] A. Simonetta, A. Vetrò, M. C. Paoletti, M. Torchiano, Integrating square data quality model with iso 31000 risk management to measure and mitigate software bias, CEUR Workshop Proceedings (2021) pp. 17-22.

[42] scikit-learn developers, Algorithmic fairness, 2022. URL: https://scikit-learn.org/stable/modules/clustering.html (Accessed May, 2022).