1. Introduction

Data quality certification using iso/iec Journal on Advanced Science

10.1109/ICCSCE.2014.7072735

Completeness for the Prediction of Discrimination

Alessandro Simonetta

alessandro.simonetta@gmail.com 1

Tsuyoshi Nakajima

Maria Cristina Paoletti

mariacristina.paoletti@gmail.com

Alessio Venticinque

0 Department of Computer Science and Engineering, Shibaura Institute of Technology, Tokyo, Japan
1 Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy
2 Naples, Italy

2016


Data has assumed increasing importance within the global economy, and its use is becoming more pervasive in multiple contexts. However, learning systems are exposed to various critical issues that can be addressed through ISO standards. Indeed, machine learning (ML) models may be exposed to the risk of perpetuating societal prejudice simply because the same bias exists in the data. Based on these notions, we have built a model to identify similar treatment groups based on the type of classification errors made by ML algorithms. A way to calculate fairness indices on the protected attributes of the dataset is illustrated in the article. Finally, we consider the degree of relationship between maximum completeness and the fairness of forecasting algorithms through an inverse procedure of constructing a complete dataset. The use of mutual information provided an alternative method for calculating synthetic fairness indices and a useful basis for future research.

Keywords: fairness, machine learning, maximum completeness, treatment similarity, mutual information, entropy

1. Introduction

Data has become increasingly important within the global economy, and its use is becoming more pervasive in many areas.

The Economist [1] was one of the first to describe data as the oil of the modern age. With the rise of Artificial Intelligence (AI) algorithms in decision support, data quality has become increasingly important, and Forbes [2] points to data as the fuel of ML algorithms. Consequently, a new business has emerged based on the collection and sale of data. Web giants such as Google offer free services and products with the aim of collecting information, and companies often earn considerable sums from selling that information rather than from paid services. This has pushed these companies to use increasingly sophisticated technologies [3] and algorithms to collect information and integrate it with data from other sources to maximize their insights. In addition, as presented in the documentary "The Social Dilemma" [4], information about users, including contacts and interactions on platforms [...] [8]. It is worth mentioning that the General Data Protection Regulation (GDPR) 2016/679 [9], defined to harmonize data privacy laws among European countries, also contains data quality notions such as accuracy, timeliness and security. The same can be found in the European regulation Solvency II [10], which states the need for insurance companies to have internal procedures and processes in place to ensure the appropriateness of the data.

As mentioned in [11], we believe that a good way to ensure the correct use of data and their quality, in accordance with regulations and ethical values, is compliance with ISO standards: ISO/IEC 27000 [12], ISO 31000 [13] and ISO/IEC 25000 [14]. The introduction of maximum completeness as a dataset balance index, and its relation to fairness metrics, derive from the SQuaRE approach to measuring data quality and assessing its implications.

2. The Present Situation

Although it is difficult to estimate the cost of poor data quality, a primary goal for organizations (public and private) that base their business on the digitization of processes and the operation of the organization itself is to have trusted data [15]. Some experiences show how the application of the SQuaRE series is a solution for measuring and monitoring data quality over time. In Italy, the first indication for public administrations managing databases of national interest came in 2013, when the Agency for Digital Italy (AgID) identified the ISO/IEC 25012 standard as the data quality model to be adopted [16]. Since 2013, AgID has identified, among the 15 quality characteristics, those that must be used (accuracy, consistency, completeness, and newness) for databases of national interest. In the three-year plan for public administration information technology 2021-2023 [17], AgID confirms increasing data and metadata quality as a strategic goal (OB2.2).

In [18], three case studies of the data quality evaluation and certification process for repositories are reported. The different visions are analyzed to evaluate the impact of adopting ISO/IEC 25012, ISO/IEC 25024 and ISO/IEC 25040 and the benefits recognized in the three organizations before and after the process. The results show that applying the methodology helps an organization achieve better long-term sustainability, improves its knowledge of the business and drives it toward better data quality initiatives in the future.

Among the environments in which the above ISO standards can be most useful are undoubtedly those where the information contains sensitive or safety data [19], such as the healthcare and legal domains. An example is the OpenEHR standard proposed in [20]. The issues that affect clinical records from the perspective of data quality are presented in [21]. In [22] the authors propose a generalized model for big data: a solution based on the application of ISO/IEC 25012 and ISO/IEC 25024. The study introduces three data quality dimensions: Contextual Consistency, Operational Consistency and Temporal Consistency. In [11] the authors show how using the SQuaRE series can ensure GDPR compliance. In [23] the study examines discrimination against nonwhite teachers who are present on online English language teaching platforms.

One possible solution to the problem that bias in the data can propagate into the inferences of ML algorithms is the dataset labeling mechanism presented in [24]. In [25] the authors present a range of fair access and information presentation (quantity of data presented to the user and order of priority) issues that may affect the fairness of computing systems. Although these issues are related to the biases within the data, characteristics of recommender systems can introduce a greater degree of uncertainty. These are related to the permissions of the users who use them to access the information or the size of the data that can be processed by the algorithms. This makes it even more difficult to find countermeasures to avoid discrimination.

Finally, in [26] the authors show a methodology for identifying critical attributes that can lead to discrimination by classification-based learning systems.

3. Solution Proposed

When using an ML-based recommendation system on a dataset where bias is present, the bias propagates within the model itself, replicating the guesswork and prejudices in the data. So, we run the risk of thinking that we applied an objective and neutral evaluation system, while we are using a biased system within an AI algorithm.

One of the purposes of this research is to verify that the system behaves in a non-discriminatory way toward certain groups. By considering the different fairness measures in [27], it is possible to calculate their value with respect to two groups, identified by a protected attribute, to see if there are any disparities in treatment. For example, the formal criterion of Independence requires that the sensitive attribute A be statistically independent of the predicted value R, and this can be expressed as:

$$P(R = 1 \mid A = a) = P(R = 1 \mid A = b), \quad \forall a, b : a \neq b \qquad (1)$$
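The Independence criterion can be checked empirically by estimating the positive prediction rate for each value of the protected attribute and comparing the rates. The following is a minimal sketch under stated assumptions: the function names `positive_rates` and `independence_gap`, the use of the largest pairwise gap as a summary, and the toy data are all illustrative and are not taken from the paper.

```python
from collections import defaultdict

def positive_rates(groups, predictions):
    """Estimate P(R = 1 | A = a) for every observed value a of the protected attribute."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for a, r in zip(groups, predictions):
        totals[a] += 1
        positives[a] += int(r == 1)
    return {a: positives[a] / totals[a] for a in totals}

def independence_gap(groups, predictions):
    """Largest difference between group-wise positive rates (0 means Independence holds exactly)."""
    rates = positive_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: protected attribute values and binary predictions.
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
toy_predictions = [1, 1, 1, 0, 1, 0, 0, 0]

toy_rates = positive_rates(toy_groups, toy_predictions)
toy_gap = independence_gap(toy_groups, toy_predictions)
print(toy_rates, toy_gap)
```

On real data the rates are estimates, so a small non-zero gap does not by itself indicate a violation; a tolerance has to be chosen for the comparison.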

To understand whether an attribute is a cause of discrimination in prediction outcomes, that is, whether there are homologous treatment groups, it is necessary to know the attribute's level of fairness. Ideally, therefore, it would be better to have a single measure that indicates how likely that attribute is to lead to discrimination.
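One candidate for such a single attribute-level measure, and the one developed in Section 3.3, is the mutual information between the protected attribute A and the prediction R, which is zero exactly when the two are independent. The sketch below estimates it from paired observations through the identity I(A, R) = H(A) + H(R) - H(A, R). The function names, the base-2 logarithm and the toy data are assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a, r):
    """Estimate I(A, R) = H(A) + H(R) - H(A, R) from paired observations."""
    joint = list(zip(a, r))  # each pair is one observation of (A, R)
    return entropy(a) + entropy(r) - entropy(joint)

# Hypothetical toy data.
attr = ["a", "a", "b", "b"]
pred_independent = [1, 0, 1, 0]  # same positive rate in both groups
pred_dependent = [1, 1, 0, 0]    # prediction fully determined by the group

mi_independent = mutual_information(attr, pred_independent)
mi_dependent = mutual_information(attr, pred_dependent)
print(mi_independent, mi_dependent)
```

Unlike a pairwise comparison of rates, this yields one number per protected attribute regardless of how many categories the attribute has.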

In [26], the authors propose a method to compute several synthetic indices related to the fairness of the classification system. Two different methods are described in the article: the first performs clustering with the DBSCAN and K-means methods, while the second, MaxMin, searches for the worst case by dividing the protected attribute instances into privileged and unprivileged. Both methods allow grouping the elements of a protected attribute according to the type of treatment. In this way the calculation of the synthetic index is based on a few influence classes, returning to the definition from which we started [27]. These two approaches were used to test for a link between the notion of maximum completeness and fairness indices. This would allow a priori identification of whether learning on a given dataset can lead to minority discrimination. At this point, alternative methods are proposed to identify homogeneous treatment groups with respect to the result obtained from a classification system. Algorithms may err toward some groups in the same way, e.g. for African-Americans and Native Americans they may give a degree of recidivism in excess of what happens in reality.

3.1. Identification of homogeneous treatment groups

To start, we need to calculate the fairness indices reported in [27] for the protected attributes of the dataset, considering the predictions of the classification algorithm and the actual correct result.

We referred to a classic case study for this kind of problem: the Compas dataset [7], where we observed a similar trend between groups. Table 1 shows the values of the 6 fairness indices for the protected attribute Race: Independence (Ind), Separation True Positive Rate (SepTPR), Separation False Positive Rate (SepFPR), Sufficiency Positive Predictive Value (SufPPV), Sufficiency Negative Predictive Value (SufNPV) and Overall Accuracy Equality (OAE).

Table 2 shows the correlation matrix according to Pearson's coefficient and the existence of correlation between the indices measured for different ethnicities. Considering a correlation threshold of 0.9, it is easy to detect the existence of two treatment groups (Table 3): G0 and G1.

Although this method works well for the case study, the issue becomes more complicated when there are categorical attributes with higher cardinality, as the number of relations increases. With reference to the Juvenile dataset [28], considering the V3_nacionalitat attribute representing the nationality of the students, it is possible to draw the phenomenon through a subway diagram (Fig. 1). In this graph, it is easier to check intersections between sets. For example, Group 0 and Group 1 have the element Colombia in common. The top histogram shows the number of elements participating in the intersection, while the left histogram shows the number of elements in the group.

The result obtained with the Pearson coefficient threshold of 0.9 identifies twelve homogeneous treatment groups. In order to reduce their number, we kept as the representative of a set of groups the one that contained them in the inclusion relation. This reduced the twelve to four completely disjoint groups.

Figure 2: Scatterplot of races in Compas Dataset, mean of fairness metrics vs. maximum completeness
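The grouping step of Section 3.1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that each category of the protected attribute is described by its vector of fairness indices, that a candidate group is formed for each category from all categories whose Pearson correlation with it reaches the threshold, and that a group is discarded when it is strictly contained in another. The index values are made up and are not those of Table 1.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def treatment_groups(indices, threshold=0.9):
    """indices maps each category to its list of fairness index values."""
    names = list(indices)
    # One candidate group per category: everything correlated with it above the threshold.
    candidates = {
        frozenset(m for m in names if pearson(indices[n], indices[m]) >= threshold)
        for n in names
    }
    # Keep only the groups that are not strictly contained in another group.
    return [set(g) for g in candidates if not any(g < other for other in candidates)]

# Hypothetical fairness index vectors for four categories.
toy_indices = {
    "g1": [0.1, 0.2, 0.3, 0.4],
    "g2": [0.2, 0.4, 0.6, 0.8],
    "g3": [0.4, 0.3, 0.2, 0.1],
    "g4": [0.8, 0.6, 0.4, 0.2],
}
found_groups = treatment_groups(toy_indices)
print(found_groups)
```

Groups built this way may still overlap when correlation is not transitive, which is the situation the text describes for attributes with many categories.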

3.2. Relationship between mean of fairness indexes and maximum completeness

At this point, we studied whether there was a relationship between the composition of the groups made using Pearson's coefficient and the maximum completeness, as shown in the studies [29], [30]. For this purpose, we used a scatterplot in which each ethnicity was drawn in relation to the pair of values: mean of the fairness indices, on the abscissa, and maximum completeness, on the ordinate (Fig. 2). Considering the positioning of the different ethnic groups and a scale that reports the highest value as the limit of the diagram, we observe that they tend to cluster on average around the grand mean of the fairness attributes, most noticeably when we look at the privileged group. Items belonging to the same group tend to remain close relative to the fairness index. These considerations are also true for the maximum completeness index.

After extending the analysis to the different attributes of the datasets already present in [26] [31], maximum completeness seems to be a strongly characterizing parameter, more so than the other indices proposed in [32]. In fact, repeating the analysis on other protected attributes, such as V3_nacionalitat of the Juvenile dataset, within the scatterplot the clustering of similarly treated elements was found to be strongly related not only to the average of the fairness indices, but also to the maximum completeness, as shown in Fig. 3, considering the groups with intersections shown in Fig. 1.

3.3. Alternative synthetic indices

The presence of outliers in the values of fairness indices related to a protected attribute could impact the evaluation of these parameters. For this reason, in this paper we propose a different way of calculating fairness indices. In this research, we calculate independence, separation, sufficiency and OAE using the notions of entropy and mutual information. The idea is to find a new representation of the synthetic index that would allow more confident identification of whether a given protected attribute could lead to possible discrimination. Considering the condition of Independence between two groups $A = a$ and $A \neq a$:

$$|P(R = 1 \mid A = a) - P(R = 1 \mid A \neq a)| < \varepsilon \qquad (2)$$

This condition can be extended to all categories of the protected attribute and also expressed by orthogonality between the predicted value R and the group A through mutual information. Given two variables, they are independent if their mutual information is zero:

$$I(A, R) = H(A) + H(R) - H(A, R) = 0 \qquad (3)$$

The terms are defined as follows. H(A) is the entropy associated with A and is calculated as:

$$H(A) = -\sum_{a} P(A = a) \log P(A = a) \qquad (4)$$

H(R) is defined analogously for R. Finally, the third term H(A,R) is computed by:

$$H(A, R) = -\sum_{a, r} P(A = a \cap R = r) \log(P(A = a \cap R = r)) \qquad (5)$$

The other indices can also be expressed by mutual information, in particular referring to [33] and [26]. With Y denoting the actual outcome, Separation is calculated by:

$$I(R, A \mid Y) = H(R, Y) + H(A, Y) - H(R, A, Y) - H(Y) \qquad (6)$$

Sufficiency is expressed by the following equation:

$$I(Y, A \mid R) = H(Y, R) + H(A, R) - H(Y, A, R) - H(R) \qquad (7)$$

Finally, the OAE is computed by:

$$I(R, A \mid Y = R) = H(R, Y = R) + \dots \qquad (8)$$

In this study, we compared the three methodologies. With regard to the fairness measures, without loss of generality, we report only the relationship between the Independence measure and maximum completeness. In Fig. 4, the dependence curve for the MaxMin methodology is shown in red, that for DBSCAN in blue, and that for mutual information in black. The graph in Fig. 4 shows the trend of independence as maximum completeness varies. The dataset construction process initially selects a few tuples of the original dataset (maximum completeness = 0.324) and then inserts new tuples until the dataset reaches overall completeness (maximum completeness = 1), which corresponds to maximum independence.

The curve related to the MaxMin method initially has greater values than the other two methods, while the phenomenon decreases as the number of records entered increases. Thus, we can conclude from the present research that the independence measure is more sensitive to varying maximum completeness if the MaxMin method is used.

3.4. Limits and Future Work

This work identified homologous treatment groups using Pearson's coefficient, which detects the correlation between fairness characteristics associated with different groups.

In the future, further research should investigate new similarity mechanisms based on ML and Deep Learning algorithms, considering other clustering methodologies that can avoid overlapping between groups.

A second line of research will aim to identify discrimination caused by belonging to more than one protected attribute, such as gender and race simultaneously.

Since we did not consider explainable AI algorithms, future work could be extended to frameworks that analyze how AI models make decisions (e.g. Watson OpenScale [34]).

4. Conclusion

The use of AI and ML in the decision-making process of many recommendation systems makes it possible to mitigate the risk of subjective classifications.

While these systems are reliable forecasting tools, they do not always allow for an explanation of why such conclusions were reached. Thus, the presence of incomplete or unbalanced data, which can be measured through the SQuaRE series (completeness measures), can lead to biased results.

This work made it possible to identify similar groups in terms of equivalence of treatment through the application of Pearson's coefficient to synthetic indices related to protected attributes. In this way, we identified similarities that previously remained hidden in the search for possible discrimination.

The other achievement was that we were able to associate fairness measures with protected attributes, independently of those of individual values, using the concepts of mutual information and entropy. This approach laid the foundation for new experimentation to relate the response of these measures to changes in maximum completeness.

Finally, we compared the classical approaches [31] with the method using mutual information and entropy. In this way, we tested the response of fairness measures against maximum completeness and found confirmation of the premises of the work, namely, that poor data quality leads to unfair treatment if AI and ML are used in the decision-making process of recommender systems.

[...] software bias, CEUR Workshop Proceedings (2021) pp. 17–22.

[31] A. Vetrò, M. Torchiano, M. Mecati, A data quality approach to the identification of discrimination risk in automated decision making systems, Government Information Quarterly 38 (2021) 101619. doi:https://doi.org/10.1016/j.giq.2021.101619.

[32] A. Simonetta, M. C. Paoletti, M. Muratore, A new approach for designing of computer architectures using multi-value logic, International Journal on Advanced Science, Engineering and Information Technology 11 (2021) 1440–1446. doi:10.18517/ijaseit.11.4.15778.

[33] D. Steinberg, A. Reid, S. O'Callaghan, F. Lattimore, L. McCalman, T. S. Caetano, Fast fair regression via efficient approximations of mutual information, CoRR abs/2002.06200 (2020). URL: https://arxiv.org/abs/2002.06200.

[34] IBM, Watson openscale, 2022. URL: https://www.ibm.com/it-it/cloud/watson-openscale/drift (Access 10-22).

[1] The Economist, The world's most valuable resource is no longer oil, but data, The Economist, USA (6th May 2017).

[2] Marr, The 5 biggest data science trends in 2022, Oct 2021. URL: https://www.forbes.com/sites/bernardmarr/2021/10/04/the-5-biggest-data-science-trends-in-2022/?sh=22f5fc1d40d3.

[3] Giuliano, The next generation network in 2030: Applications, services, and enabling technologies, in: 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), 2021, pp. 294–298. doi:10.23919/EECSI53397.2021.9624241.

[4] Orlowski, The social dilemma, Sep. 2020. URL: https://www.netflix.com/it/title/81254224.

[5] G. C. Cardarilli, L. Di Nunzio, Fazzolari, Giardino, Re, Ricci, Spanò, An fpga-based multi-agent reinforcement learning timing synchronizer, Computers and Electrical Engineering 99 (2022) 107749. doi:https://doi.org/10.1016/j.compeleceng.2022.107749.

[6] G. C. Cardarilli, Re, L. Di Nunzio, A pseudosoftmax function for hardware-based high speed image classification, Scientific Reports (2021). doi:10.1038/s41598-021-94691-7.

[7] Larson, Mattu, Kirchner, Angwin, Compas recidivism dataset, 2016. URL: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.

[8] Council of Europe, Recommendation CM/Rec(2020)1 of the Committee of Minis-