Data Quality Dimensions for Fair AI ⋆

Camilla Quaresmini, camilla.quaresmini@polimi.it
Department of Electronics, Information and Bioengineering, Politecnico di Milano
Piazza Leonardo da Vinci 32, 20133 Milan, Italy

Giuseppe Primiero, giuseppe.primiero@unimi.it
Department of Philosophy, LUCI Lab and PhilTech Research Center, Università degli Studi di Milano
Via Festa del Perdono 7, 20122 Milan, Italy

MIRAI Srl

ISSN 1613-0073

Keywords: Bias mitigation, Fairness, Information Quality, Mislabeling, Timeliness

Artificial Intelligence (AI) systems are not intrinsically neutral, and biases trickle into any type of technological tool. In particular, when dealing with people, the impact of AI algorithms' technical errors originating with mislabeled data is undeniable. As they feed wrong and discriminatory classifications, these systems are not systematically guarded against bias. In this article we consider the problem of bias in AI systems from the point of view of data quality dimensions. We highlight the limited model construction of bias mitigation tools based on an accuracy strategy, illustrating potential improvements of a specific tool in gender classification errors occurring in two typically difficult contexts: the classification of non-binary individuals, for which the label set becomes incomplete with respect to the dataset; and the classification of transgender individuals, for which the dataset becomes inconsistent with respect to the label set. Using formal methods for reasoning about the behavior of the classification system in the presence of a changing world, we propose to reconsider the fairness of the classification task in terms of completeness, consistency, timeliness and reliability, and offer some theoretical results.

Introduction

Machine Learning (ML) models trained on huge amounts of data are intrinsically biased when dealing with people. Common face recognition systems used in surveillance tasks generate false positives, labeling innocent people as suspects. Social credit systems link individuals to the state of their social credit, making decisions based on that score. In all of those cases, subjects suffer a credibility deficit due to prejudices related to their social identity [1]: a dark-skinned man could be characterized by a higher risk of recidivism after being arrested; a short-haired skinny young woman (or a long-haired boy with feminine traits) might be the target of transphobic attacks following misgendering. Through the deployment of these technologies, society makes the gap separating rich from poor, and cisnormative from non-cisnormative individuals, more constitutive, as it becomes automatized and standardized.

Already before the explosion of ML algorithms, [2] offered a framework for understanding three categories of bias in computer systems, assuming the absence of bias as necessary to define their quality. Later on, the emergence of contemporary, data-driven AI systems based on learning has significantly worsened the situation, see e.g. [3,4]. On this basis, the development and deployment of fairer Artificial Intelligence (AI) systems has been increasingly demanded. Such a request appears especially relevant in certain application contexts. For example, as examined in [5], the face is commonly used as a legitimate means of gender classification, and this is operationalized and automatized in technologies such as Automatic Gender Recognition (AGR), which algorithmically derives gender from the physical traits of faces to perform classification [6,7]. This technique relies on the assumption that gender identity can be computationally derived from facial traits. However, a recent study [8] shows that the most famous AGR systems are not able to classify non-binary genders, and also perform poorly on transgender individuals. This is due to the fact that AGR encapsulates a binary, cisnormative conception of gender, adopting a male/female scheme which invalidates non-binary identities.

We declare ourselves against the use of gender classification, as considering the face as a proxy for detecting gender identity seems to resonate with phrenology and physiognomy, and we believe that the process of automatic gender recognition can easily lead to mismatches between the theoretical understanding of constructs underlying identity and their operationalization [9], especially when it comes to the classification of individuals who recognise themselves outside of binarism. However, we note that this kind of classification is already happening [10], spreading with commercial systems offering gender classification as a standard feature, causing a huge impact on the lives of misgendered individuals. Therefore there are contexts in which it is potentially inevitable that classification exists, and in these contexts it must be fairer. This translates into asking whether there is a strategy to ensure that the labels assigned during classification are as little stereotypical and archetypal as possible. While this paper does not investigate the ethical aspects of AGR, we aim at addressing the issues related to the classification strategies to make them fairer, as an initial study to prepare for implementing mitigation strategies.

An important task, common to technology and philosophy, is therefore the identification and verification of criteria that may help develop fairness conditions for AI systems. While a number of techniques are available to mitigate bias, their primary focus on purely statistical analysis to control accuracy across sensitive classes is clearly insufficient to control social discrimination. A different approach is represented by the explicit formulation of ethical principles to be verified across protected attributes, combining statistical measures with logical reasoning, as formally defined in [11,12,13,14,15] and implemented by the BRIO tool in [16,17]. In this latter context, an important direction to explore for a refined definition of ethically-laden verification criteria is the study of quality dimensions and associated biases. In the remainder of this paper, we offer a theoretical contribution in this direction, preparing the ground for a future implementation. We argue that, even if maximizing data quality and fairness simultaneously can be hard, as improving one can deteriorate the other [18], the task of bias mitigation tools can be supported by reasoning on quality dimensions that so far have been ignored. In particular, we offer examples to show how the dimensions of consistency, completeness, timeliness and reliability can be used to establish fairer AI classification systems. This research is in line with the quest for integrating useful empirical metrics on fairness in AI with asking key (conceptual) questions, see [19].

The paper is structured as follows. In Section 2 we offer an overview of fairness definitions and bias types relevant for this work. In Section 3 we briefly overview the technical details of a particular bias mitigation tool to illustrate what we consider essential limitations of purely statistical analyses. In Section 4 we introduce data quality dimensions arguing for reconsidering their relevance in the task of evaluating the fairness of classification systems, presenting two examples to justify this requirement. In Section 5 we propose a definition of fair AI classification that includes such dimensions and formulate some theoretical results. Section 6 concludes the work illustrating future research lines.

Table 1: Bias types.

Bias type | Definition | Literature
Data Bias
Behavioral bias | User's behavior can be different across contexts | [38]
Exclusion bias | Systematic exclusion of some data | [39]
Historical bias | Cultural prejudices are included into systematic processes | [40]
Time interval bias | Data collection in a too limited time range | [41]
Label Bias
Chronological bias | Distortion due to temporal changes in the world which data are supposed to represent | [39]
Historical bias | Cultural prejudices are included into systematic processes | [40]
Misclassification bias | Data points are assigned to incorrect categories | [42]

Fairness and Bias in ML

Although a unique definition is missing in the literature [2,3,20,21,22,23,24,25], fairness is often presented as corresponding to the avoidance of bias [26]. This can be formulated at two distinct levels: first, identifying and correcting problems in datasets [27,28,29,30,31,32], as a model trained with a mislabeled dataset will provide biased outputs; second, correcting the algorithms [21,33], as biases can emerge even in the design of algorithms [34]. In the present section we are interested in considering datasets and their labels. Indeed, bias may also affect the label set [35,36]. Accordingly, we talk about label quality bias when errors affect the quality of labels. As shown in [37], the most well-known AI datasets are full of labeling errors. A crucial task is therefore the development of conceptual strategies and technical tools to mitigate bias emergence in both data and label sets.

A variety of approaches and contributions is available in the literature focusing on identifying bias in datasets and labels. Here we list the types of bias which are relevant to the present work, see Table 1. Albeit not exhaustive, these lists of biases represent a good starting point to investigate the quality dimensions required to address them. We now analyze a common mitigation strategy used by existing tools addressing the issue of bias in data, showing their limitations. We then study the bias in the classification algorithm (i.e., bias in labels) of the mitigation tool.

Table 2 (excerpt): Symbols.

Symbol | Meaning
[…] | A correct label for the datapoint 𝑑
𝜋 | Threshold variable

Mitigating Bias

A bias mitigation algorithm is a procedure for reducing unwanted bias in training datasets or models, with the aim of improving the fairness metrics. Those algorithms can be classified into three categories [43]: pre-processing, when the training data is modified; in-processing, when the learning algorithm is modified; post-processing, when the predictions are modified. Several tools are available to audit and mitigate biases in datasets, thereby attempting to implement diversity and to reach fairness. Among the most common are AIF360 [22], Aequitas [44] and Cleanlab [45]. Recently a post-hoc evaluation model for bias mitigation has been proposed by the tool BRIO [16,17]. In this article, we consider Cleanlab as a testbed, illustrating below in Section 4 its limitations in view of data quality dimensions. We then propose a theoretical frame for the resolution of such limitations in Section 5, further illustrating the possibility of implementing the present analysis in the tool BRIO. For an overview of the symbols used from now on, see Table 2.

Cleanlab is a framework to find label errors in datasets. It uses Confident Learning (CL), an approach which focuses on label quality with the aim of addressing uncertainty in dataset labels using three principles: counting examples that are likely to belong to another class, using the confident joint and probabilistic thresholds to find label errors and to estimate noise; pruning noisy data; and ranking examples to train with confidence on clean data. The three approaches are combined by an initial assumption of a class-conditional noise process, to directly estimate the joint distribution between noisy given labels and uncorrupted unknown ones. For every class, the algorithm learns the probability of it being mislabeled as any other class. This assumption may have exceptions but it is considered reasonable. For example, a "cat" is more likely to be mislabeled as "tiger" than as "airplane". This assumption is provided by the classification noise process (CNP, [46]), which leads to the conclusion that the label noise only depends on the latent true class, not on the data.

CL [45] exactly finds label errors in datasets by estimating the joint distribution of noisy and true labels. The idea is that when the predicted probability of an example is greater than a per-class threshold, we confidently consider that example as actually belonging to the class of that threshold, where the threshold for each class is the average predicted probability of examples in that class. Let $\tilde{y} \in [m]$ denote an observed, noisy label (potentially flipped to an incorrect class), and let $y^* \in [m]$ denote the unknown (latent), true, uncorrupted label. CL assumes that for every example there exists a correct label $y^*$ and defines a class-conditional noise process mapping $y^* \to \tilde{y}$, such that every label in class $j \in [m]$ may be independently mislabeled as class $i \in [m]$, with probability $p(\tilde{y} = i \mid y^* = j)$. So, maps are associations of data to wrong labels.

Then CL estimates $p(\tilde{y} \mid y^*)$ and $p(y^*)$ jointly, evaluating the joint distribution of label noise $p(\tilde{y}, y^*)$ between noisy given labels and uncorrupted unknown labels. CL aims to estimate every $p(\tilde{y}, y^*)$ as a matrix $Q_{\tilde{y},y^*}$ to find all mislabeled examples $x$ in dataset $X$, where $y^* \neq \tilde{y}$. Given as inputs the out-of-sample predicted probabilities $\hat{P}_{k,i}$ and the vector of noisy labels $\tilde{y}_k$, the procedure is divided into three steps: estimation of $\hat{Q}_{\tilde{y},y^*}$ to characterize class-conditional label noise, filtering of noisy examples, and training with the errors found.
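As a minimal sketch of the thresholding and counting idea described above, the following pure-Python fragment computes per-class thresholds as average predicted probabilities and counts, for each given label, the class an example confidently appears to belong to. It is a simplified illustration under stated assumptions, not Cleanlab's implementation: the function names, the tie-breaking by highest probability, and the toy inputs are all hypothetical.

```python
def per_class_thresholds(noisy_labels, pred_probs, num_classes):
    """t_j: average predicted probability of class j over examples whose given label is j."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for label, probs in zip(noisy_labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    # A class with no examples gets an unreachable threshold (assumption of this sketch).
    return [sums[j] / counts[j] if counts[j] else float("inf")
            for j in range(num_classes)]


def confident_joint(noisy_labels, pred_probs, num_classes):
    """joint[i][j]: number of examples with given label i confidently counted in class j."""
    thresholds = per_class_thresholds(noisy_labels, pred_probs, num_classes)
    joint = [[0] * num_classes for _ in range(num_classes)]
    for label, probs in zip(noisy_labels, pred_probs):
        confident = [j for j in range(num_classes) if probs[j] >= thresholds[j]]
        if not confident:
            continue  # the example is not confidently in any class
        # If several classes pass their threshold, keep the most probable one.
        best = max(confident, key=lambda j: probs[j])
        joint[label][best] += 1
    return joint


# Hypothetical toy inputs: six examples, two classes.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.2, 0.8], [0.3, 0.7], [0.75, 0.25]]
toy_joint = confident_joint(toy_labels, toy_probs, 2)
```

Off-diagonal entries of `toy_joint` count examples whose given label disagrees with the class they confidently fall in, which is what makes them candidate label errors.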

To estimate $\hat{Q}_{\tilde{y},y^*}$, i.e. the joint distribution of noisy labels $\tilde{y}$ and true labels $y^*$, CL counts examples that may belong to another class using a statistical data structure named confident joint $C_{\tilde{y},y^*}$, formally defined as follows:

$$C_{\tilde{y},y^*}[i][j] := |\hat{X}_{\tilde{y}=i,\, y^*=j}| \qquad (1)$$

In other words, the confident joint estimates the set $X_{\tilde{y}=i,\, y^*=j}$ of examples with noisy label $i$ which actually have true label $j$ by partitioning the dataset $X$ into bins $\hat{X}_{\tilde{y}=i,\, y^*=j}$, namely the sets of examples labeled $\tilde{y} = i$ with large enough expected probability $\hat{p}(\tilde{y} = j; x, \theta)$ to belong to class $y^* = j$, determined by a per-class threshold $t_j$, where $\theta$ is the model.

Tools of this kind are extremely useful in estimating label error probabilities. However, they have some limitations, and it is easy to formulate examples for which their strategy seems unsound. A first problem arises from the initial assumption of the categoricity of data. Take for example the case of gender labeling of facial images, which is typically binary (i.e. with values male, female). For each datapoint, a classification algorithm calculates the projected probability that an image is assigned to the respective label. Consider though two very noisy cases: images of non-binary individuals, and images of transgender individuals. In the former case, the label set becomes incomplete with respect to the dataset; in the latter, the dataset is inconsistent with respect to the label set. Hence, there can be datapoints that either 1) have none of the available labels as the correct one, or 2) fall under different labels at different times. By definition, if we have disjoint labels there can be high accuracy, but only on those datapoints which identify themselves in the disjoint categories. In situations like these, the dimension of accuracy alone no longer guarantees the correctness of the classification algorithm. In terms of quality dimensions, the possibility of an uncategorical datapoint or that of a moving datapoint is no longer only an accuracy problem. Hence, the identification of other data quality dimensions to be implemented in tools for bias mitigation may help achieve more fairness in the classification task.

In the next section we suggest an improvement of the classification strategy by adding dimensions that should be considered when evaluating the fairness of the classification itself.

Extending Data Dimensions for Fair AI

In the literature, data quality dimensions are defined both informally and qualitatively. Metrics can be associated as indicators of the dimension's quality. However, there is no single and objective vision of data quality dimensions, nor a universal definition for each dimension. This is because dimensions often escape or exceed a formal definition. The large number of dimensions [47,48] is also due to the fact that data aim to represent all spatial, temporal and social phenomena of the real world [49]. Furthermore, they are constantly evolving in response to the continuous development of new data-driven technologies.

For the purposes of our analysis, we focus on the following basic set of data quality dimensions which is the focus of the majority of authors in the literature [50,51]:

• Accuracy, i.e. the closeness between a value $v$ and a value $v'$, where the latter is the correct representation of the real-life phenomenon that $v$ aims to represent [47];
• Completeness, i.e. the level at which data have the sufficient breadth, depth, and scope for their task [48,52,47];
• Consistency, i.e. the coherence dimension: it amounts to checking whether or not the semantic rules defined on a set of data elements have been respected [47];
• Timeliness, i.e. the data freshness over time for a specific task [53,54].

We thus indicate them as potential candidates to be implemented in the context of bias mitigation strategies. In particular, we argue that, as data are characterized by evolution over time, the timeliness dimension [47] can be taken as a basis for other categories of data quality. We aim at suggesting improvements on error identification in the classification of datapoints, using the gender attribute as an illustrative case. We thus suggest the extension of classification with the dimensions of completeness, consistency and timeliness, and then return to Cleanlab to illustrate how this extension could be practically implemented.

Incomplete Label Set and Inconsistent Labeling

Consider the first example of a datapoint which represents a non-binary individual. This kind of identity is rarely considered in technology [55]. Non-binary identities do not recognize themselves within the binary approach characteristic of classification systems. As such, individual identity is not correctly recognized by the classification system, highlighting the insufficiency of the model which flattens the gender identity umbrella on the two options of male/female.

The conceptual solution would be to simply assume the label set as incomplete. This means that the bias origin is in the pre-processing phase, and a possible strategy is to extend the partition of the labels adding categories as appropriate, e.g. "non binary". The problem is here reduced to the consideration of the completeness of the label set. [8] can be considered a first attempt in this direction.

Consider now a transgender datapoint whose identity shifts over time, being a fluid datapoint by definition. Currently AI systems operationalize gender in a way which is completely trans-exclusive, see e.g. [7,6]. However, identity is not static: it may move with respect to the labels we have, leading the datapoint to fall under one label or a different one during a selected time range. In this case, any extension of the label set is misleading, or at least insufficient.

Here we cannot just add more categories, but we have to find a logical solution to changing the label of the same datapoint at different timepoints.

Enter Time

The two problems above can be formulated adding to completeness and consistency the dimension of temporality. Thus, an important starting point is represented by adding the dimension of timeliness, which concerns the degree to which data represent reality within a certain defined time range for a given population.

We suggest here considering the labeling task within a given time frame, whose length depends on the dataset and the classification task over the pairing of datapoints to labels, to measure the probability of a label change over time. Intuitively, if the analysis is performed less than a certain number of timestamps away from the last data labeling, then we consider the labeling still valid. Otherwise, a new analysis with respect to the completeness of both the dataset and the label set must be performed. Technically, this means associating temporal parameters to labels and computing the probability that a given label might change over the given time frame. The probability of a label being correct (its accuracy) decreases within the respective temporal window. In particular, reasoning on the temporal evolution of the dataset could allow us to model the evolution of the label partitions. Two fundamental theses are suggested for evaluation: the correctness of the task no longer assumes static completeness of the label set, i.e. given that the label set is complete at time $t_n$, it can be incomplete at time $t_{n+m}$; the labeling no longer assumes static perseverance of the labels, that is, given a label $i$ that is correct at a time $t_n$ for a datapoint $d$, it could be incorrect at a later time, and conversely if it is incorrect it could become correct.
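The intuition of a validity window can be sketched as follows. This is only an illustration under stated assumptions: timestamps are taken to be integers, and the names `labeling_is_current`, `datapoints_to_recheck` and the window length are hypothetical, since the appropriate length depends on the dataset and the classification task.

```python
def labeling_is_current(last_labeled_at, analysis_time, window):
    """True if the analysis happens fewer than `window` timestamps after the last labeling."""
    return (analysis_time - last_labeled_at) < window


def datapoints_to_recheck(last_labeled, analysis_time, window):
    """Datapoints whose labeling is outside the window and needs a new analysis."""
    return sorted(
        datapoint
        for datapoint, labeled_at in last_labeled.items()
        if not labeling_is_current(labeled_at, analysis_time, window)
    )


# Hypothetical labeling times for three datapoints.
last_labeled = {"d1": 0, "d2": 3, "d3": 5}
stale = datapoints_to_recheck(last_labeled, analysis_time=6, window=4)
```

With these toy values only `d1` falls outside the window, so only its labeling would be re-examined for completeness and consistency.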

Back to Cleanlab

Considering a possible implementation in Cleanlab able to account for such differences implies renouncing the starting assumption on the categoricity of the data. Instead, assume that the probability of assigning a label may change over time. This can be formulated in two distinct ways. First, the probability value of a given label $i$ being wrong, given that a label $j$ is correct (their distance), may change over time. The task is now to give a mapping of all the label-variable pairs, i.e. given a mapping $y^* \to \tilde{y}$ between variables, where $y^*$ is the correct label and $\tilde{y}$ the wrong one, compute the probability over the time frame $\mathcal{T} := \{t_1, \dots, t_n\}$

$$p_{\mathcal{T}}[(\tilde{y} = i)_{t_n} \mid (y^* = j)_{t_{n-m}}] \qquad (2)$$

such that label $i$ is wrong at time $t_n$, given that label $j$ was correct at time $t_{n-m}$. This probability can increase or decrease, depending on the dataset and on the label set. For the definition of the confident joint, this means taking the evaluation of all the elements that have an incorrect label $i$ when their correct label is $j$, and then associating the wrong label to a time $t_n$ and the correct label to a previous time. This estimate must be made on all time points, so for every $m < n$. Given a timepoint $n$ at which the label is wrong, the estimate on all pairs of probabilities for that point with a previous point in which another label can be correct has to be computed:

$$C_{\tilde{y},y^*}[i, j, \mathcal{T}] := \sum_{1 \le m < n \in \mathcal{T}}^{n \in \mathcal{T}} |\hat{X}_{\tilde{y}=i_{t_n},\, y^*=j_{t_{n-m}}}| \qquad (3)$$

Second, given a mapping $y^* \to \tilde{y}$ between variables, where $y^*$ is the correct label and $\tilde{y}$ the wrong one, what is the probability

$$p_{\mathcal{T}}[(\tilde{y} = i)_{t_n} \mid (y^* = i)_{t_{n-m}}] \qquad (4)$$

such that label $i$ is wrong at time $t_n$, given that the same label $i$ was correct at time $t_{n-m}$? In this case, the same label is fixed and the probability that it becomes incorrect can be calculated. The definition of the confident joint thus becomes

$$C_{\tilde{y},y^*}[i, \mathcal{T}] := \sum_{1 \le m < n \in \mathcal{T}}^{n \in \mathcal{T}} |\hat{X}_{\tilde{y}=i_{t_n},\, y^*=i_{t_{n-m}}}| \qquad (5)$$
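The counting behind the two time-indexed confident joints can be sketched in a few lines. This is a simplified illustration, not an extension of Cleanlab: it assumes that the correct label of each datapoint at each time is known (whereas CL estimates the bins through thresholds), it reads "label $i$ is wrong at $t_n$" as "the given label differs from the correct one at $t_n$", and the function name and toy histories are hypothetical.

```python
def temporal_confident_joint(given, correct, i, j):
    """Count pairs (t_n, t_{n-m}), 1 <= m < n, per datapoint, such that the given
    label i is wrong at t_n and label j was correct at the earlier time t_{n-m}.

    given[d] and correct[d] are lists of labels indexed by time (same length).
    Setting j == i yields the case where the same label used to be correct.
    """
    count = 0
    for datapoint in given:
        history = given[datapoint]
        truth = correct[datapoint]
        for later in range(1, len(history)):
            wrong_now = history[later] == i and truth[later] != i
            if not wrong_now:
                continue
            for earlier in range(later):
                if truth[earlier] == j:
                    count += 1
    return count


# Hypothetical label histories over three time points.
given = {
    "a": ["male", "male", "male"],
    "b": ["female", "female", "female"],
    "c": ["female", "male", "male"],
}
correct = {
    "a": ["male", "male", "female"],
    "b": ["female", "female", "female"],
    "c": ["female", "female", "female"],
}
same_label_count = temporal_confident_joint(given, correct, "male", "male")
other_label_count = temporal_confident_joint(given, correct, "male", "female")
```

Here `same_label_count` corresponds to the fixed-label case (a label that was correct and becomes wrong), while `other_label_count` corresponds to the case where a different label was correct at the earlier time.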

To illustrate the point we consider a toy example. Compute $p(\tilde{y} = i \mid y^* = j)$ as given by Equation 6, i.e. the error rate of $y^* = male$ has to be determined. First, a confusion matrix is constructed to analyze errors. Suppose we have a dataset of 10 datapoints, see Figure 1. From the matrix, $p(y^* = j) = 5/10$ and $p(\tilde{y} = i) = 4/10$. So there are 5 women, of whom 2 are incorrectly labeled "male" and 3 are correctly labeled "female", and 5 men, of whom 1 is incorrectly labeled "female" and 4 are correctly labeled "male". Replacing the values in Equation 6, $p(\tilde{y} = i \mid y^* = j) = 0.2$. The obtained value represents the error rate of the "male" label, i.e. the probability of a male datapoint being labeled "female". Looking at the diagonals, the rate of correctly labeled datapoints is TPR = 70% and the rate of mislabeled datapoints is FPR = 30%.

Consider now the same dataset at a later time $t_{n+m}$, see Figure 2. The labels might have changed. From the matrix, $p(y^* = j) = 5/10$ and $p(\tilde{y} = i) = 5/10$. Now there are 5 women, of whom 3 are incorrectly labeled "male" and 2 are correctly labeled "female", and 5 men, of whom 3 are incorrectly labeled "female" and 2 are correctly labeled "male". Replacing again the values in Equation 6, $p'(\tilde{y} = i \mid y^* = j) = 0.6$. In this case the rate of correctly labeled datapoints is TPR = 40% and the rate of mislabeled datapoints is FPR = 60%. To understand how the error rate changes, the difference between the two matrices has to be considered. Thus, the change rate can be computed as 𝜀 = p

$$p_{\mathcal{T}}[(\tilde{y} = i)_{t_{n+m}} \mid (y^* = j)_{t_n}] = \frac{p[(y^* = j) \mid (\tilde{y} = i)]_{t_{n+m}} \cdot [p(\tilde{y} = i)_{t_n} \pm \varepsilon]}{p(y^* = j)_{t_n}} = 0.288 \qquad (7)$$

This value represents the (highest) probability that a given label is wrong at a given time, provided it was correct at some previous time. Indirectly, this also expresses the probability that the labeling set is applied to a dataset containing a point for which the labeling becomes inconsistent over time.
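The figures reported for the two confusion matrices of the toy example can be checked with a short sketch. The counts are those stated in the text; the nested-dictionary layout and the function names are assumptions made for illustration, and the sketch does not attempt to compute the change rate or Equation 7.

```python
def error_rate(matrix, true_label, given_label):
    """p(given | true) from counts stored as matrix[true][given]."""
    row = matrix[true_label]
    return row[given_label] / sum(row.values())


def error_rate_via_bayes(matrix, true_label, given_label):
    """Same quantity through Bayes' rule: p(true | given) * p(given) / p(true)."""
    total = sum(sum(row.values()) for row in matrix.values())
    given_total = sum(row[given_label] for row in matrix.values())
    true_total = sum(matrix[true_label].values())
    p_true_given = matrix[true_label][given_label] / given_total
    return p_true_given * (given_total / total) / (true_total / total)


def diagonal_share(matrix):
    """Fraction of datapoints whose given label matches the true label."""
    total = sum(sum(row.values()) for row in matrix.values())
    return sum(matrix[label][label] for label in matrix) / total


# Counts as described in the text: matrix[true label][given label].
first = {"female": {"female": 3, "male": 2}, "male": {"female": 1, "male": 4}}
later = {"female": {"female": 2, "male": 3}, "male": {"female": 3, "male": 2}}

rate_first = error_rate(first, "male", "female")
rate_later = error_rate(later, "male", "female")
```

Both routes give the same error rate for the "male" label at each time, and `diagonal_share` reproduces the 70% and 40% figures read off the diagonals.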

Temporal-based Fairness in AI

We have argued that a more general discussion on the data dimensions to be adopted in bias mitigation tools is needed, and in particular that the dimension of timeliness is crucial. In this section we summarise our proposal and offer non-exhaustive criteria for fairness in AI based on such temporal approach along with some basic theoretical results.

The first metric that has been addressed in this work is completeness as applied to the label set. In a world where gender classification is actually changing, the present strategy includes the completeness dimension in the quality assessment, verifying that the label set is complete with respect to the ontology of the world at the time this assessment is made. The solution here is to extend the label set as desired, adding new labels for the classification task, as already suggested in [8]. Additionally, we suggest an explicit temporal parametrization: completeness can be considered as a relationship between a label set and an individual $p$ belonging to a certain population $P$, where $p$ is any domain item that enters $P$ at a time $t$. We must ensure that a correct label $l$ exists for each datapoint in the dataset at each time. In other words, the completeness of a dataset over a time frame is granted if for every datapoint representing an element in the population of interest there exists at any two possibly consecutive points in time a correct label for it.
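A time-parametrized completeness check of this kind can be sketched as below. The sketch assumes that the correct label of each datapoint at each time is supplied by the user, and that a datapoint may enter the population only at some later time; the function name and the toy data are hypothetical.

```python
def label_set_is_complete(label_set_at, correct_label_at, time_frame):
    """True if, at every time in the frame, each datapoint present at that time
    has its correct label available in the label set valid at that time.

    label_set_at[t]: set of labels available at time t.
    correct_label_at[d][t]: correct label of datapoint d at time t
    (a missing t means d has not entered the population yet).
    """
    return all(
        correct_label_at[d][t] in label_set_at[t]
        for d in correct_label_at
        for t in time_frame
        if t in correct_label_at[d]
    )


time_frame = [1, 2]
correct_label_at = {
    "d1": {1: "female", 2: "female"},
    "d2": {2: "non-binary"},  # enters the population at time 2
}
binary_only = {1: {"male", "female"}, 2: {"male", "female"}}
extended = {1: {"male", "female"}, 2: {"male", "female", "non-binary"}}

complete_before = label_set_is_complete(binary_only, correct_label_at, time_frame)
complete_after = label_set_is_complete(extended, correct_label_at, time_frame)
```

In the toy data the binary label set stops being complete once `d2` enters at time 2, and extending the label set at that time restores completeness.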

Next, we considered the consistency of the label set with respect to datapoints possibly shifting in categorization. The method here again is to reduce consistency to timeliness. We suggest computing the probability of an inconsistency arising from a correct label change. Accuracy, albeit the most used metric for evaluating the performance of classification models due to its easy calculability and interpretation, is reductive, trivial and incorrect in some contexts. For example, if the distribution of the class is distorted, accuracy is no longer a useful or relevant metric. Even worse, sometimes greater accuracy leads to greater unfairness [56]: some labels like race or gender may allow models to be more predictive, although it is often controversial to use such categories to increase predictive performance. We have suggested considering temporal accuracy [57] as a function of the error rate over time.

The ability to compute the variance in the error rate across time is functional to determining the reliability of AI systems. This metric is linked to the notion of accuracy, as it is considered a measure of data correctness, see [47]. In [48] and [57] reliability is even contained in the definition of accuracy itself: data must be reliable to satisfy the accuracy dimension. Overall, it seems that reliability is not actually controlled beyond physical reliability, as in the literature on data quality there is no formal definition to compute it. However, following [58], the previously provided temporal approach is again useful: evaluating reliability is based on the revisions which show how close the initial estimate of accuracy is to the following ones. In this sense, reliability can be reduced to accuracy over time in terms of a threshold on the error rate:

Definition 2 (Reliability of a classification algorithm). A classification algorithm in an AI system $X$ is considered reliable over a time frame $\mathcal{T} := \{t_1, \dots, t_n\}$, denoted as $Rel_{\mathcal{T}}(X)$, iff $\varepsilon_{\mathcal{T}}(X) < \pi$, for some safe value $\pi$.

The change rate $\varepsilon$ we have computed shows how much the system's accuracy deteriorates. If it exceeds a fixed safe value $\pi$, the system is no longer accurate. Plain accuracy is the numerical measure at some time $t \in \mathcal{T} := \{t_1, \dots, t_n\}$. If this value does not deteriorate beyond a certain fixed threshold, the system is considered reliable, and therefore accurate with respect to time.

The two previous definitions offer non-exhaustive criteria for the identification of fair AI systems:

Definition 3 (Fairness for AI classification systems). $Fair_{\mathcal{T}}(X)$ only if $Rel_{\mathcal{T}}(X)$ and $Compl_{\mathcal{T}}(L(X))$.
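The way reliability and completeness combine into a necessary condition for fairness can be sketched as follows. The change rate and the completeness verdict are taken as inputs rather than computed, and the function names and numeric values are hypothetical; the check can only rule fairness out, since the two properties are not claimed to be sufficient.

```python
def is_reliable(change_rate, safe_value):
    """Reliability over a time frame: the change rate stays strictly below the safe value."""
    return change_rate < safe_value


def passes_necessary_conditions(change_rate, safe_value, label_set_complete):
    """False means the system cannot be fair over the time frame;
    True only means that fairness is not ruled out by these two criteria."""
    return is_reliable(change_rate, safe_value) and label_set_complete


# Hypothetical values for the change rate and the safe value.
within_threshold = passes_necessary_conditions(0.1, 0.2, label_set_complete=True)
too_much_drift = passes_necessary_conditions(0.3, 0.2, label_set_complete=True)
incomplete_labels = passes_necessary_conditions(0.1, 0.2, label_set_complete=False)
```

Keeping the two inputs separate mirrors the fact that the safe value and the desirable labels are choices left to the user of the evaluation.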

Hence we claim that fairness requires the system's ability to give reliable and correct outcomes over time. While we do not consider these properties sufficient, we believe they are necessary. On this basis, we can formulate two immediate theoretical results:

Theorem 1. Given a label set $L$ complete at time $t$, a classification algorithm guarantees a fair classification at time $t' > t$ if and only if the change rate determined with respect to $L$ is $\varepsilon < \pi$.

Proof. Assume $Compl_t(L(X))$, then for $Fair_{t'}(X)$ we need to show $Rel_{t'}(X)$ for $t' > t \in \mathcal{T}$. Assume $\varepsilon > \pi$, then by Definition 2 reliability is not satisfied; hence, if $Rel_{\mathcal{T}}(X)$, it must be the case that $\varepsilon < \pi$.

Theorem 2. Given a fixed change rate $\varepsilon < \pi$, a classification algorithm with fair behaviour at time $t$ remains fair at time $t' > t$ if and only if the change to make the label set complete at time $t'$ does not exceed an $\varepsilon'$ such that $\varepsilon + \varepsilon' > \pi$.

Proof. Consider $Fair_t(X)$ with change rate $0 < \pi$ as a base case, then by Definition 3 $Rel_t(X)$ and $Compl_t(L(X))$. Now consider $t' > t$ and a required change $\varepsilon'$ in $Compl_t(L(X))$ such that $Rel_{t'}(X)$ holds. This obviously holds only if $0 + \varepsilon' < \pi$. Generalize for any $\varepsilon > 0$.

Note that in these results the value of $\varepsilon$, respectively $\varepsilon'$, is a proxy for how much the world has changed at $t'$ with respect to $Compl_t(L(X))$.

In the context of an incomplete label set, a detected label bias can originate from an exclusion bias in data, which can also result from a time interval bias. In the case of label-changing datapoints a chronological bias occurs. Then, misclassification bias can be reduced to the two previous types. In the context of use, emergent bias can arise as a result of changes in societies and cultures. It might appear in data as chronological, historical or behavioral bias. Here, a different value bias occurs, for example, when the users are different from the ones assumed during the system's development. This is the case of ontology switching, to which a label set must adapt. These types of bias can be mitigated by implementing the proposed framework.

The tool BRIO [16,17] works as a post-hoc model evaluation, taking as input the test dataset of the model under investigation and its output. The tool makes it possible to investigate behavioural differences of the model both with respect to an internal analysis on the classes of interest, and externally with respect to chosen reference metrics. Moreover, it makes it possible to measure bias amplification by comparing the bias present in the dataset with how it manifests itself in the output. While the present work does not aim at offering a full implementation of our theoretical analysis for the BRIO tool, some remarks are appropriate. The time-based analysis of completeness and reliability offered in Definitions 1 and 2, in turn grounding a notion of fairness in Definition 3, is easily implementable in BRIO: both completeness and reliability require the definition of a time frame to check, respectively, that any given datapoint of interest is matched against a desirable label and that the overall change rate of error for one or more classes of interest does not surpass a certain threshold. Both features rely on the user for the identification of the desirable label for any datapoint and for the admissible distance.

Conclusion

We presented some recommendations for AI systems design, focusing on timeliness as a founding dimension for developing fairer and more inclusive classification tools. Despite the crucial importance of accuracy as shown by significant works such as [4] and [59], the problem of unfairness in AI systems is much broader and more foundational. This can be expressed in terms of data quality: AI systems are limited in that they maximize accuracy, and even if systems become statistically accurate some problems remain unsolved. This is exemplified by the case of binary gender labeling, which leads to inaccurate simplistic classifications [60]. Furthermore, as the work of classification is always a reflection of culture, the completeness of the label set and the (constrained) consistency of labeling have an epistemological value: constructing AIs requires us to understand society, and society reflects an ontology of individuals. For this reason, misgendering is first of all an ontological error [6].

We suggested that timeliness is a crucial dimension for the definition of gender identity. If we are ready to consider gender as a property that shifts over time [61], and which can also be declined in the plural, as an individual may identify under more than one (not mutually exclusive) label, then a change of paradigm is required. Design limitations such as binarism and staticity invalidate identities which do not fit into this paradigm. They must be addressed if fairer classifications and more inclusive models of gender are to be designed.

Further work in this direction includes: an implementation and empirical validation of the proposed model through the BRIO tool; and the design of an extension to compute the probability of incorrect labels becoming correct over time, i.e. the dual of the case presently addressed.

\[
p(\tilde{y} = i \mid y^* = j) = \frac{p(y^* = j \mid \tilde{y} = i) \cdot p(\tilde{y} = i)}{p(y^* = j)} = \frac{p(y^* = j \wedge \tilde{y} = i)}{p(\tilde{y} = i)} \cdot \frac{p(\tilde{y} = i)}{p(y^* = j)}
\]
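The identity above can be checked numerically on a small table of joint counts. The sketch below is only an illustration: the function name, the layout of the count table (rows indexed by the observed label ỹ, columns by the correct label y*) and the numbers are assumptions, not data from this work. It assumes that both marginals involved are non-zero.

```python
from fractions import Fraction


def conditional_noise(joint_counts, i, j):
    """Return p(y~ = i | y* = j) via Bayes' rule from a table of joint counts.

    joint_counts[a][b] is the (hypothetical) number of datapoints whose
    observed label is a and whose correct label is b.
    """
    total = sum(sum(row) for row in joint_counts)
    p_joint = Fraction(joint_counts[i][j], total)                    # p(y* = j and y~ = i)
    p_obs_i = Fraction(sum(joint_counts[i]), total)                  # p(y~ = i)
    p_true_j = Fraction(sum(row[j] for row in joint_counts), total)  # p(y* = j)
    p_true_given_obs = p_joint / p_obs_i                             # p(y* = j | y~ = i)
    return p_true_given_obs * p_obs_i / p_true_j                     # Bayes' rule


# Toy counts: rows are observed labels, columns are correct labels.
counts = [[40, 10],
          [5, 45]]
print(conditional_noise(counts, 0, 1))  # 2/11
```

The result coincides with the direct ratio of the joint count to the column total (10 out of 55), which is what the last equality in the formula states.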

Figure 1: Confusion matrix at time 𝑛.

Figure 2: Confusion matrix at time 𝑛 + 𝑚.

$p'(\tilde{y}; x_i; \theta) - p(\tilde{y}; x_i; \theta) = 0.4$. Now $p_{\mathcal{T}}[(\tilde{y} = i)_{t_n} \mid (y^* = j)_{t_{n-m}}]$ can be written as $p_{\mathcal{T}}[(\tilde{y} = i)_{t_{n+m}} \mid (y^* = j)_{t_n}]$. Thus, at a time $t_n$ we have $p_{t_n}(y^* = j) = 1 - p(\tilde{y} = i)_{t_n}$. At a subsequent time $t_{n+m}$ we have $p_{t_{n+m}}(y^* = j) = 1 - p(\tilde{y} = i)_{t_{n+m}}$. Equation 6 can be computed with respect to time as

Table 1: Data and Label Bias.

Table 2: Symbols used in the present work.

| Symbol | Meaning |
|---|---|
| $t_n$ | Time index |
| $\mathcal{T} := \{t_1, \dots, t_n\}$ | Time frame |
| $d$ | Generic datapoint |
| $i, j, l$ | Data labels |
| $y^*$ | Discrete random variable correctly labeled |
| $\tilde{y}$ | Discrete random variable wrongly labeled |
| $[m]$ | The set of unique class labels |
| $y^* \to \tilde{y}$ | A mapping between variables |
| $p_{\mathcal{T}}[(\tilde{y} = i)_{t_n} \mid (y^* = j)_{t_{n-m}}]$ | The probability of label $i$ being wrong at time $t_n$, given that label $j$ was correct at time $t_{n-m}$ |
| $C_{\tilde{y},y^*}[i, j, \mathcal{T}]$ | Temporal confident joint, where the correct label can change from $i$ to $j$ in time frame $\mathcal{T}$ |
| $C_{\tilde{y},y^*}[i, \mathcal{T}]$ | Temporal confident joint, where the correctness of the same fixed label $i$ can change in time frame $\mathcal{T}$ |
| $\varepsilon$ | Change rate |
| $p'(\tilde{y}; x_i; \theta)$ | Predicted probability of label $\tilde{y}$ for variable $x_i$ and model parameters $\theta$ |
| $L$ | Label set |
| $X$ | AI system |
| $L_{t_1} := \{l_1, \dots, l_n\}$ | Partition of the label set |
| $P$ | Population of interest |
| $p$ | An element from $P$ |
| $d(X)_{\mathcal{T}}$ | A datapoint in system $X$ over time frame $\mathcal{T}$ |
| $y^*(d)$ | |

Definition 1 (Completeness of a label set). A label set $L$ for a classification algorithm in an AI system $X$ is considered complete over a time frame $\mathcal{T} := \{t_1, \dots, t_n\}$, denoted as $\mathit{Compl}_{\mathcal{T}}(L(X))$, iff, given two partitions $L_{t_1} := \{l_1, \dots, l_n\}$ and $L_{t_n} := \{l'_1, \dots, l'_n\}$, where possibly $L_{t_1} \cap L_{t_n} \neq \emptyset$, for all $(p \in P)_{\mathcal{T}}$ s.t. $p \in d(X)_{\mathcal{T}}$ there is $l \in L_{t_1} \cup L_{t_n}$ s.t. $y^*(d) = l$.
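Definition 1 can be read operationally as a membership test: every correct label observed for a datapoint over the time frame must belong to the union of the two partitions. The Python sketch below illustrates this reading under simplifying assumptions. Labels are plain strings, the correct labels are supplied by the user as a list, and the function names and example values are hypothetical.

```python
def missing_labels(correct_labels, labels_t1, labels_tn):
    """Correct labels y*(d) that are in neither partition of the label set."""
    admissible = set(labels_t1) | set(labels_tn)
    return {y for y in correct_labels if y not in admissible}


def is_complete(correct_labels, labels_t1, labels_tn):
    """True when every correct label over the time frame is available in L."""
    return not missing_labels(correct_labels, labels_t1, labels_tn)


# Correct labels of the datapoints observed over the time frame (toy data).
correct = ["female", "non-binary", "male", "female"]

binary = {"female", "male"}
extended = {"female", "male", "non-binary"}

# The label set never changes: the non-binary datapoint has no matching label.
print(is_complete(correct, binary, binary), missing_labels(correct, binary, binary))

# The label set is extended at t_n: every datapoint is matched.
print(is_complete(correct, binary, extended))
```

The first call reports the label set as incomplete because "non-binary" is missing; the second reports it as complete once the later partition includes that label.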

Acknowledgments

This research has been partially funded by the Projects: PRIN2020 BRIO (2020SSKZ7R), PRIN2022 SMARTEST (20223E8Y4X), "Departments of Excellence 2023-2027" of the Department of Philosophy "Piero Martinetti" of the University of Milan, all awarded by the Italian Ministry of University and Research (MUR); and MUSA - Multilayered Urban Sustainability Action, funded by the European Union - NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5: Strengthening of research structures and creation of R&D "innovation ecosystems", set up of "territorial leaders in R&D".

References

[1] M. Fricker, Epistemic Injustice: Power and the Ethics of Knowing, Oxford University Press, New York, 2007.

[2] B. Friedman, H. Nissenbaum, Bias in computer systems, ACM Trans. Inf. Syst. 14 (1996).

[3] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, CoRR abs/1908.09635 (2019).

[4] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, volume 81 of PMLR, 2018.

[5] A. Hanna, M. Pape, M. K. Scheuerman, Auto-essentialization: Gender in automated facial analysis as extended colonial project, Big Data and Society 8 (2021). doi:10.1177/20539517211053712.

[6] O. Keyes, The misgendering machines: Trans/HCI implications of automatic gender recognition, Proc. ACM Hum.-Comput. Interact. 2 (2018). doi:10.1145/3274357.

[7] F. Hamidi, M. K. Scheuerman, S. M. Branham, Gender recognition or gender reductionism? The social implications of embedded gender recognition systems, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, Association for Computing Machinery, New York, NY, USA, 2018. doi:10.1145/3173574.3173582.

[8] M. K. Scheuerman, J. Paul, J. Brubaker, How computers see gender: An evaluation of gender classification in commercial facial analysis services, Proceedings of the ACM on Human-Computer Interaction 3 (2019). doi:10.1145/3359246.

[9] A. Z. Jacobs, H. M. Wallach, Measurement and fairness, CoRR abs/1912.05511 (2019).

[10] A. Ramon, G. Olaoye, A. Luz, Machine learning algorithms for gender prediction, 2024.

[11] F. A. D'Asaro, G. Primiero, Probabilistic typed natural deduction for trustworthy computations, in: D. Wang, R. Falcone, J. Zhang (Eds.), Proceedings of the 22nd International Workshop on Trust in Agent Societies (TRUST 2021) Co-located with the 20th International Conferences on Autonomous Agents and Multiagent Systems (AAMAS 2021), London, UK, May 3-7, 2021, volume 3022 of CEUR Workshop Proceedings, 2021.

[12] G. Primiero, F. A. D'Asaro, Proof-checking bias in labeling methods, in: G. Boella, F. A. D'Asaro, A. Dyoub, G. Primiero (Eds.), Proceedings of 1st Workshop on Bias, Ethical AI, Explainability and the Role of Logic and Logic Programming (BEWARE 2022) co-located with the 21th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), Udine, Italy, December 2, 2022, volume 3319 of CEUR Workshop Proceedings, 2022.

[13] A. Termine, G. Primiero, F. A. D'Asaro, Modelling accuracy and trustworthiness of explaining agents, in: S. Ghosh, T. Icard (Eds.), Logic, Rationality, and Interaction - 8th International Workshop, LORI 2021, Xi'an, China, October 16-18, 2021, Proceedings, volume 13039 of Lecture Notes in Computer Science, Springer, 2021. doi:10.1007/978-3-030-88708-7_19.

[14] F. A. D'Asaro, F. Genco, G. Primiero, Checking trustworthiness of probabilistic computations in a typed natural deduction system, arXiv:2206.12934, 2024.

[15] E. Kubyshkina, G. Primiero, A possible worlds semantics for trustworthy non-deterministic computations, International Journal of Approximate Reasoning (2024) 109212. doi:10.1016/j.ijar.2024.109212.

[16] G. Coraglia, F. A. D'Asaro, F. A. Genco, D. Giannuzzi, D. Posillipo, G. Primiero, C. Quaggio, BRIOxAlkemy: a bias detecting tool, in: G. Boella, F. A. D'Asaro, A. Dyoub, L. Gorrieri, F. A. Lisi, C. Manganini, G. Primiero (Eds.), Proceedings of the 2nd Workshop on Bias, Ethical AI, Explainability and the role of Logic and Logic Programming co-located with the 22nd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023), Rome, Italy, November 6, 2023, volume 3615 of CEUR Workshop Proceedings, 2023.

[17] G. Coraglia, F. A. Genco, P. Piantadosi, E. Bagli, P. Giuffrida, D. Posillipo, G. Primiero, Evaluating AI fairness in credit scoring with the BRIO tool, arXiv:2406.03292, 2024.

[18] F. Azzalini, C. Cappiello, C. Criscuolo, S. Cuzzucoli, A. Dangelo, C. Sancricca, L. Tanca, Data quality and fairness: Rivals or friends?, in: G. Silvello, L. Tanca, et al. (Eds.), Proceedings of the 31st Symposium of Advanced Database Systems, Galzignano Terme, Italy, July 2nd to 5th, 2023, volume 3478 of CEUR Workshop Proceedings, 2023.

[19] T. Scantamburlo, Non-empirical problems in fair machine learning, Ethics Inf. Technol. 23 (2021). doi:10.1007/s10676-021-09608-9.

[20] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. S. Zemel, Fairness through awareness, CoRR abs/1104.3913 (2011).

[21] N. Grgic-Hlaca, M. B. Zafar, K. P. Gummadi, A. Weller, The case for process fairness in learning: Feature selection for fair decision making, 2016.

[22] R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, S. Martino, J. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, Y. Zhang, AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, 2018.

[23] T. H. Aasheim, K. Hufthammer, S. Ånneland, H. Brynjulfsen, M. Slavkovik, Bias mitigation with AIF360: A comparative study, in: Proceedings of the NIK-2020 Conference, 2020.

[24] M. J. Kusner, J. R. Loftus, C. Russell, R. Silva, Counterfactual fairness, arXiv:1703.06856, 2018.

[25] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, CoRR abs/1610.02413 (2016).

[26] S. Feuerriegel, M. Dolata, G. Schwabe, Fair AI: Challenges and opportunities, Business Information Systems Engineering 62 (2020). doi:10.1007/s12599-020-00650-3.

[27] F. Kamiran, T. Calders, Data pre-processing techniques for classification without discrimination, Knowledge and Information Systems 33 (2011). doi:10.1007/s10115-011-0463-8.

[28] F. Azzalini, C. Criscuolo, L. Tanca, E-FAIR-DB: Functional dependencies to discover data bias and enhance data equity, J. Data and Information Quality 14 (2022). doi:10.1145/3552433.

[29] H. Weerts, M. Dudík, R. Edgar, A. Jalali, R. Lutz, M. Madaio, Fairlearn: Assessing and improving fairness of AI systems, arXiv:2303.16626, 2023.

[30] F. P. Calmon, D. Wei, K. N. Ramamurthy, K. R. Varshney, Optimized data pre-processing for discrimination prevention, arXiv:1704.03354, 2017.

[31] M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, S. Venkatasubramanian, Certifying and removing disparate impact, arXiv:1412.3756, 2015.

[32] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, C. Dwork, Learning fair representations, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, PMLR, Atlanta, Georgia, USA, 2013.

[33] T. Kamishima, S. Akaho, H. Asoh, J. Sakuma, Fairness-aware classifier with prejudice remover regularizer, 2012. doi:10.1007/978-3-642-33486-3_3.

[34] S. Hooker, Moving beyond "algorithmic bias is a data problem", Patterns 2 (2021).

[35] S. H. Sengamedu, H. Pham, Fairlabel: Correcting bias in labels, arXiv:2311.00638, 2023.

[36] H. Jiang, O. Nachum, Identifying and correcting label bias in machine learning, 2019.

[37] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive label errors in test sets destabilize machine learning benchmarks, 2021. Preprint.

[38] A. Olteanu, C. Castillo, F. Diaz, E. Kiciman, Social data: Biases, methodological pitfalls, and ethical boundaries, Frontiers in Big Data 2 (2019).

[39] S. Fabbrizzi, S. Papadopoulos, E. Ntoutsi, I. Kompatsiaris, A survey on bias in visual datasets, CoRR abs/2107.07919 (2021).

[40] H. Suresh, J. Guttag, A framework for understanding sources of harm throughout the machine learning life cycle, in: Equity and Access in Algorithms, Mechanisms, and Optimization, 2021.

[41] CertNexus, Promote the ethical use of data-driven technologies, 2021.

[42] Centre for Evidence-Based, Catalogue of bias, 2022.

[43] B. D'Alessandro, C. O'Neil, T. LaGatta, Conscientious classification: A data scientist's guide to discrimination-aware classification, Big Data 5 (2017).

[44] P. Saleiro, B. Kuester, L. Hinkson, J. London, A. Stevens, A. Anisfeld, K. T. Rodolfa, R. Ghani, Aequitas: A bias and fairness audit toolkit, arXiv preprint arXiv:1811.05577, 2018.

[45] C. G. Northcutt, L. Jiang, I. L. Chuang, Confident learning: Estimating uncertainty in dataset labels, Journal of Artificial Intelligence Research (JAIR) 70 (2021).

[46] D. Angluin, P. D. Laird, Learning from noisy examples, Mach. Learn. 2 (1987).

[47] C. Batini, M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques, Springer, 2006.

[48] R. Y. Wang, D. M. Strong, Beyond accuracy: What data quality means to data consumers, J. Manag. Inf. Syst. 12 (1996).

[49] S. Canali, Towards a contextual approach to data quality, Data 5 (2020) 90. doi:10.3390/data5040090.

[50] C. Batini, C. Cappiello, C. Francalanci, A. Maurino, Methodologies for data quality assessment and improvement, ACM Computing Surveys (CSUR) 41 (2009).

[51] M. Scannapieco, T. Catarci, Data quality under a computer science perspective, Journal of the ACM (JACM) 2 (2002).

[52] L. Pipino, Y. W. Lee, R. Y. Wang, Data quality assessment, Commun. ACM 45 (2002).

[53] A. Rula, Time-related quality dimensions in linked data, 2014.

[54] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, Quality assessment for linked data: A survey, Semantic Web 7 (2016).

[55] K. Spiel, O. Keyes, P. Barlas, Patching gender: Non-binary utopias in HCI, 2019. doi:10.1145/3290607.3310425.

[56] A. Nielsen, Practical Fairness, O'Reilly Media, Inc., 2020.

[57] C. Batini, A. Rula, M. Scannapieco, G. Viscusi, From data quality to big data quality, Journal of Database Management 26 (2015). doi:10.4018/JDM.2015010103.

[58] A. Black, P. van Nederpelt, Dimensions of Data Quality (DDQ), DAMA NL Foundation, 2020.

[59] J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine bias, ProPublica, 2016.

[60] E. Edenberg, A. Wood, An epistemic lens on algorithmic fairness, in: EAAMO '23: Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 2023.

[61] B. Ruberg, S. Ruelos, Data for queer lives: How LGBTQ gender and sexuality identities challenge norms of demographics, Big Data and Society 7 (2020). doi:10.1177/2053951720933286.