Assessing Demographic Bias Transfer from Dataset to Model: A Case Study in Facial Expression Recognition

Iris Dominguez-Catena*, Daniel Paternain and Mikel Galar
Institute of Smart Cities (ISC), Department of Statistics, Computer Science and Mathematics, Public University of Navarre (UPNA), Arrosadia Campus, 31006, Pamplona, Spain

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria
* Corresponding author.
Emails: iris.dominguez@unavarra.es (I. Dominguez-Catena); daniel.paternain@unavarra.es (D. Paternain); mikel.galar@unavarra.es (M. Galar)

Abstract

The increasing number of applications of Artificial Intelligence (AI) has led researchers to study the social impact of these technologies and evaluate their fairness. Unfortunately, current fairness metrics are hard to apply in multi-class, multi-demographic classification problems such as Facial Expression Recognition (FER). We propose a new set of metrics to approach these problems. Of the three metrics proposed, two focus on the representational and stereotypical bias of the dataset, and the third one on the residual bias of the trained model. These metrics combined can potentially be used to study and compare diverse bias mitigation methods. We demonstrate the usefulness of the metrics by applying them to a FER problem based on the popular Affectnet dataset. Like many other datasets for FER, Affectnet is a large Internet-sourced dataset with 291,651 labeled images. Obtaining images from the Internet raises some concerns over the fairness of any system trained on this data and its ability to generalize properly to diverse populations. We first analyze the dataset and some variants, finding substantial racial bias and gender stereotypes. We then extract several subsets with different demographic properties and train a model on each one, observing the amount of residual bias in the different setups. We also provide a second analysis on a different dataset, FER+.

1. Introduction

When algorithms and automated systems interact with users, they can often cause harm in many unintentional ways. This effect is multiplied when the system is a machine learning system trained to imitate human behavior, which is inherently conditioned by prejudice and cognitive biases. In machine learning, the biases gathered in the training information can leak into the trained models, which are otherwise expected to be fair. When these systems are deployed to the real world, they have been shown to exhibit gender, racial and other demographic biases [1, 2, 3].

Current state-of-the-art systems employ very large datasets, the most renowned example being ImageNet, with over 14 million images. Further analysis of this dataset has shown critical biases and problematic data [2, 4]. The number of models trained on this dataset [5], which is already a standard for model pretraining, means that any transference of bias from the dataset to the trained models could impact very large populations in unpredictable ways.

This work explores a new metric-based methodology for the analysis of bias in machine learning problems, focusing on the measurement of bias transfer between the dataset and the trained model. Despite the definition of multiple fairness metrics [6], to the best of our knowledge there is no bias metric supporting multi-class classification problems studied for multiple potentially protected groups that is capable of isolating dataset bias from model bias. In particular, our methodology is oriented to the bias transfer from the dataset to the model, where most mitigation systems can be implemented (given a fixed dataset with its own inherent bias).

Three metrics are proposed in this work. The first two metrics measure representational bias in the dataset. One of them is dedicated to quantifying the representational imbalance, where some demographic groups are over- or under-represented in the dataset. The other one is a novel usage of the Normalized Mutual Information (NMI) and Normalized Pointwise Mutual Information (NPMI) metrics [7]. We propose using these metrics for stereotype measurement, a type of bias where demographic group representations differ among classes. These two metrics serve as a baseline for measuring the bias present in the input data. Over that baseline, we can then employ a third proposed metric to measure the bias present in the trained model. This metric measures the variation of recall per class among multiple simultaneous demographic groups, giving a single output value quantifying the amount of bias with respect to the different demographic groups. The three metrics combined enable the study of the bias transfer from the dataset to the trained model.

In particular, we apply these metrics to analyze the bias transfer in a Facial Expression Recognition (FER) problem. The objective of FER systems is to identify the facial expression of people in either video or images, in an attempt to detect the underlying emotion. The nature of the data (human faces) in these problems makes them prone to diverse biases and misrepresentations. Human biases in this context have already been studied [8].

Although the application of these metrics requires demographic information about the subjects in the dataset, the usage of existing demographic models makes recovering some of this information possible for unlabeled datasets. In the context of FER problems we employ a pretrained model, Fairface [9], to obtain some demographic descriptors, namely apparent race and gender, from the face images. Although these demographic predictions are only an approximation of the real attributes, in the absence of more accurate data they can be used to perform a general analysis.
This work is aimed at helping reduce the racial, ethnic and gender inequalities that can arise in the development and deployment of AI systems. Specifically, the proposed study of demographic bias transfer can be used both to detect demographic bias in machine learning systems and to assess the impact of mitigation methods, guiding the efforts in the implementation of fairer systems in general. Code to replicate the reported results is available at GitHub (https://github.com/irisdominguez/Dataset-Bias-Metrics). An additional Appendix to this work with the results for a second dataset, FER+ [10], is also available at the same repository.

2. Related Work

2.1. Algorithmic Fairness

The concept of algorithmic fairness is usually built around the absence of bias or harm. In particular, several independent bias taxonomies have been proposed, focused on different definitions and aspects of bias. Taxonomies like that of [11] consider up to 7 sources of bias in a machine learning pipeline, split into two subgroups:

• Biases in the data generation, comprising historical bias, representational bias and measurement bias.
• Biases in the building and implementation of the system, comprising learning bias, aggregation bias, evaluation bias and deployment bias.

This work focuses on the measurement of bias in both subgroups independently, and more specifically, in the source dataset and the trained model. The measurement of bias in the source dataset aggregates historical and representational biases, and is common to any model and deployment that employs the same dataset. The measurement of model bias includes learning and aggregation biases. Jointly measuring both biases will give us an estimation of the amount of bias that leaks from the dataset to the model. This can help guide our efforts to the most critical parts of the system, easing the study of mitigation strategies. The remaining forms of bias are not directly applicable to our analysis.

2.1.1. Bias metrics

There are multiple fairness and bias metrics defined throughout the literature [6]. They are commonly defined for binary classification problems and two populations, a general population and a protected group, with one or several protected attributes indicating membership to these populations.

However, in many applications there are multiple demographic groups that could require protection at the same time, and more than two target classes, with none of them being clearly advantageous. To the best of our knowledge, there is no bias metric covering this situation that is also able to isolate true model bias from validation dataset imbalance. Therefore, we focus on multi-class systems, studying bias across multiple demographic groups. Our bias definition considers any differential treatment suffered by any of the demographic groups in any of the problem classes. To design our metrics, we take the following as reference.

Disparate impact [12] asserts that the proportion of predictions of the positive class is similar across groups:

\[ \frac{P(\hat{y} = 1 \mid s \ne 1)}{P(\hat{y} = 1 \mid s = 1)} \ge 1 - \varepsilon , \tag{1} \]

where s is the protected attribute (1 for the privileged group and ≠ 1 otherwise) and ŷ the class prediction (1 for the positive class and ≠ 1 otherwise).

Disparate Impact is arguably the most common fairness metric. However, it requires a defined positive outcome and privileged and protected groups, making it unfit for our application.

Overall accuracy equality [13] requires a similar accuracy across groups:

\[ \left| P(y = \hat{y} \mid s = 1) - P(y = \hat{y} \mid s \ne 1) \right| \le \varepsilon , \tag{2} \]

where s is the protected attribute (1 for the privileged group and ≠ 1 otherwise), ŷ the predicted class, and y the real class.

As it focuses on the accuracy, the Overall accuracy equality metric is applicable in both binary and multi-class problems. It also treats all target classes equally, without requiring one of them to be defined as positive or advantageous. Unfortunately, it still requires defined privileged and protected groups, making it suitable only for the analysis of individual demographic groups.
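For concreteness, the two reference metrics above can be computed as follows. This is our own minimal sketch, not part of the paper's released code; it uses NumPy, made-up prediction arrays, and the convention of Eqs. (1) and (2) that the positive class and the privileged group are encoded as 1.

```python
import numpy as np

def disparate_impact(y_pred, s, positive=1, privileged=1):
    """Ratio of positive-prediction rates, Eq. (1):
    P(y_hat=positive | s!=privileged) / P(y_hat=positive | s=privileged)."""
    rate_protected = np.mean(y_pred[s != privileged] == positive)
    rate_privileged = np.mean(y_pred[s == privileged] == positive)
    return rate_protected / rate_privileged

def overall_accuracy_difference(y_true, y_pred, s, privileged=1):
    """Absolute accuracy gap between groups, left-hand side of Eq. (2)."""
    acc_privileged = np.mean(y_true[s == privileged] == y_pred[s == privileged])
    acc_protected = np.mean(y_true[s != privileged] == y_pred[s != privileged])
    return abs(acc_privileged - acc_protected)

# Toy usage with hypothetical labels, predictions and a binary protected attribute.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
s      = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(disparate_impact(y_pred, s))                      # >= 1 - eps satisfies Eq. (1)
print(overall_accuracy_difference(y_true, y_pred, s))   # <= eps satisfies Eq. (2)
```

Both functions illustrate why these metrics are hard to use directly in our setting: they presuppose a single binary protected attribute and, for disparate impact, a designated positive class.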
Mutual Information [14] measures the statistical dependence between two attributes:

\[ \sum_{\hat{y} \in \hat{Y}} \sum_{s \in S} P(\hat{y}, s) \log \frac{P(\hat{y}, s)}{P(\hat{y}) P(s)} \le \varepsilon , \tag{3} \]

where s denotes the protected attribute (from a set S) and ŷ denotes the prediction (from a set Ŷ).

Mutual Information is one of the few fairness metrics that can be directly applied to multi-class classification problems, even when considering several potentially protected groups. Despite this, the metric only measures the dependence between the protected attributes and either the predicted or the real class. In problems where the validation partition is not balanced (the real class and the protected attributes are dependent), the Mutual Information measured over the trained model predictions becomes unable to disentangle that imbalance from the potential model bias.

Nonetheless, for our analysis we propose the employment of two variants of the Mutual Information metric to measure the bias not in the final model, but in the source dataset, where they only consider the real class and the protected attributes. These variants are the Normalized Mutual Information (NMI) and the Normalized Pointwise Mutual Information (NPMI) proposed by [7] in the context of collocation extraction. They are both bounded and easier to interpret than the classical Mutual Information, and in the case of the NPMI, it can also detect specific stereotypes on top of the general dataset bias. The mathematical definitions are given in Section 3.2.

2.2. Facial Expression Recognition

The problem of automatic FER is commonly used as a proxy to the more general emotion recognition. Although many works raise questions about the universality of facial expressions in conveying emotions across cultures [8, 15], and despite developments in other emotion measurement modalities [16], FER is still one of the most widely used methods. The applications of these systems are multiple, ranging from robotics [17] to assistive technology [18].

Most works deal with either a continuous emotion codification, such as the Pleasure-Arousal-Dominance model [19], or a discrete codification, such as the six/seven basic emotions proposed by [20]. This work focuses on the second approach, the most used in modern machine learning.

3. Proposal

3.1. Representational Bias Metric

The most basic level of demographic analysis that can be performed on a dataset is that of representational bias, where there is unequal representation of the different demographic groups in the overall dataset. A clear predominance of a demographic group in the dataset can hint at a potential bias in favor of that group, generating a differentiated and potentially harmful behavior of the resulting model. Previous bias metrics (presented in Section 2.1.1) only measure the bias of the final model, so they cannot detect this specific bias. As a way to measure the representational bias of the dataset, we propose using a metric based on the standard deviation of the normalized demographic distribution. This Normalized Standard Deviation (NSD) adjusts the standard deviation of a normalized vector:

\[ \mathrm{NSD}(x) = \sqrt{\frac{n}{n-1}} \, \sqrt{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2} , \tag{4} \]

where n is the number of elements of the vector x and x̄ stands for its arithmetic mean.

The NSD calculated over a normalized demographic distribution is bounded in the interval [0, 1], where 0 is no bias (uniform distribution) and 1 is total bias (the entire population belongs to a single group).
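As a minimal illustration of Eq. (4) (our own sketch, not the authors' implementation), the NSD can be computed directly from a vector of group proportions. The example proportions below are illustrative and only roughly approximate the Affectnet apparent race profile discussed in Section 5.1.

```python
import numpy as np

def nsd(proportions):
    """Normalized Standard Deviation (Eq. 4): 0 for a uniform distribution,
    1 when the whole population falls into a single group."""
    x = np.asarray(proportions, dtype=float)
    x = x / x.sum()                      # ensure the vector is a normalized distribution
    n = len(x)
    return float(np.sqrt(n / (n - 1) * np.sum((x - x.mean()) ** 2)))

print(nsd([0.25, 0.25, 0.25, 0.25]))     # 0.0, perfectly balanced
print(nsd([1.0, 0.0, 0.0, 0.0]))         # 1.0, maximal representational bias
# Rough approximation of the Affectnet race profile (64.4% White, rest split evenly):
print(nsd([0.644, 0.059, 0.059, 0.059, 0.059, 0.059, 0.061]))  # close to the ~0.59 of Table 1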
3.2. Stereotypical Bias Metric

A second level of analysis of the dataset, not covered by previous metrics, is the presence of stereotypes, understood as a variation in the demographic profile of each target class in the dataset. In the context of FER, this can result in a different prior for an emotion label depending on the perceived demographic group of the sample.

For most FER datasets, and in general for most multi-class, multi-demographic datasets, the analysis and quantification of stereotypical bias is complex due to the double class and demographic imbalance usually accepted in these datasets. To decouple this secondary bias from the main representational bias, we employ the NPMI metric proposed by [7].

The Normalized Pointwise Mutual Information (NPMI) measures the statistical dependence between two attributes for a specific pair of values:

\[ \mathrm{NPMI}(s, y) = - \frac{\ln \frac{P(s, y)}{P(s) P(y)}}{\ln P(s, y)} , \tag{5} \]

where s denotes the protected attribute and y denotes the class. NPMI values lie in the range [−1, 1], with 1 being total correlation (overrepresentation), 0 being no correlation, and −1 being inverse correlation (underrepresentation).

For an aggregated value representing the summary of the NPMI biases, we employ the NMI metric, also proposed by [7]. The Normalized Mutual Information (NMI) measures the statistical dependence between two attributes:

\[ \mathrm{NMI}(S, Y) = - \frac{\sum_{y \in Y} \sum_{s \in S} P(s, y) \ln \frac{P(s, y)}{P(s) P(y)}}{\sum_{y \in Y} \sum_{s \in S} P(s, y) \ln P(s, y)} , \tag{6} \]

where S denotes all the demographic groups studied and Y the set of classes of the problem. The NMI value lies in the range [0, 1], with 0 being no bias and 1 being total bias.
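The following sketch (ours, not the paper's code) computes Eqs. (5) and (6) from a contingency table of (demographic group, class) counts; the function names and the toy counts are illustrative assumptions.

```python
import numpy as np

def npmi_matrix(counts):
    """NPMI (Eq. 5) for every (group, class) cell of a counts table
    with shape (n_groups, n_classes). Cells with zero count yield NaN."""
    p = counts / counts.sum()                 # joint distribution P(s, y)
    ps = p.sum(axis=1, keepdims=True)         # marginal P(s)
    py = p.sum(axis=0, keepdims=True)         # marginal P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.log(p / (ps * py)) / np.log(p)

def nmi(counts):
    """NMI (Eq. 6): mutual information normalized by the joint entropy."""
    p = counts / counts.sum()
    ps = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0                                # skip empty cells, since 0 * ln 0 = 0
    mutual_information = np.sum(p[nz] * np.log(p[nz] / (ps * py)[nz]))
    joint_entropy = -np.sum(p[nz] * np.log(p[nz]))
    return mutual_information / joint_entropy

# Toy table: rows = apparent gender (F, M), columns = two emotion labels.
counts = np.array([[300.0, 700.0],
                   [700.0, 300.0]])
print(npmi_matrix(counts))   # positive cells: overrepresentation, negative: underrepresentation
print(nmi(counts))           # single aggregated stereotypical bias value in [0, 1]
```

Positive NPMI cells flag group-label combinations that appear more often than independence would predict, which is exactly the kind of stereotype discussed in Section 5.1.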
3.3. Model Bias Metric

Based on the metrics presented in Section 2.1.1, we require a new metric to measure the model bias in problems like FER, where we operate over multiple demographic groups and multiple classes, with an unbalanced test dataset. For the calculation of the model bias metric, we expect the model to perform with a similar recall for each demographic group and for each target class in order to be considered fair. The Recall R(y, s) for a class y and a demographic group s is defined as

\[ R(y, s) = P(\hat{y} = y \mid y, s) . \tag{7} \]

It is important to note that each of the classes in the problem can have a different inherent difficulty. Therefore, we first calculate the intraclass disparity for each class by aggregating the recalls of that class for all demographic groups. Later, we aggregate those intraclass disparities into a final metric. The Intraclass Disparity (ID) for each class y, aggregated over the S demographic groups, is defined as:

\[ \mathrm{ID}(y) = \frac{1}{n - 1} \sum_{s \in S} \left( 1 - \frac{R(y, s)}{\max_{s' \in S} R(y, s')} \right) , \tag{8} \]

where n = |S| is the number of demographic groups, and with the convention that ID(y) = 0 if max_{s'∈S} R(y, s') = 0.

This metric uses the maximum recall R(y, s) of any demographic group s for the class y as the baseline to obtain a relative value between 0 (same performance as the maximum recall group) and 1 (recall 0 relative to the maximum recall group) for each demographic group. These values are then aggregated, obtaining the final metric. This measure considers all the groups, maximizing the ID metric to 1 when all groups except the one with the highest recall have a recall of 0 (situation of maximum privilege or bias).

Finally, we can study the class disparities by themselves or aggregate them with a simple mean over the set of classes C, giving us an Overall Disparity (OD) for the model, defined as:

\[ \mathrm{OD} = \frac{1}{|C|} \sum_{c \in C} \mathrm{ID}(c) . \tag{9} \]

The OD value has a lower bound of 0 (no bias) and an upper bound of 1 (maximum bias). The same bounds apply to the individual ID scores.
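Eqs. (7)-(9) reduce to a few lines once the per-class, per-group recalls are available. The sketch below is our own illustration built around an assumed recall table (rows = classes, columns = demographic groups); it is not the authors' implementation.

```python
import numpy as np

def intraclass_disparity(recalls):
    """ID (Eq. 8) for one class, given the per-group recalls R(y, s)."""
    r = np.asarray(recalls, dtype=float)
    n = len(r)
    best = r.max()
    if best == 0:                      # convention: ID(y) = 0 if every group has recall 0
        return 0.0
    return float(np.sum(1.0 - r / best) / (n - 1))

def overall_disparity(recall_table):
    """OD (Eq. 9): mean ID over all classes; rows = classes, columns = groups."""
    return float(np.mean([intraclass_disparity(row) for row in recall_table]))

# Toy recall table for 3 classes and 2 demographic groups.
recalls = np.array([[0.80, 0.60],     # one group clearly behind, ID = 0.25
                    [0.50, 0.50],     # parity, ID = 0
                    [0.30, 0.00]])    # extreme privilege, ID = 1
print(overall_disparity(recalls))     # (0.25 + 0.0 + 1.0) / 3 ≈ 0.417
```

Because each class is normalized by its own best-performing group before averaging, classes that are simply harder for everyone do not inflate the final OD value.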
4. Case Study

4.1. Dataset

Affectnet [21] is a large FER dataset, composed of 420,299 images of facial expressions. 291,651 of these images are classified into basic emotions, namely neutral, happy, sad, surprise, fear, disgust, angry and contempt. The dataset is divided into two partitions, one for training and one for validation, with 287,651 and 4,000 images respectively.

The dataset was created from Internet image searches, based on queries related to both the emotional labels considered and gender, age, and ethnicity descriptors. The search query strings were also translated into 6 different languages. The final images have a resolution of 425 by 425 pixels.

An additional Appendix to this work with the results for a second dataset, FER+ [10], is available at the same repository (https://github.com/irisdominguez/Dataset-Bias-Metrics).

4.2. Determination of Demographic Labels

Most large FER datasets [22, 21] are gathered from regular Internet image searches, with relatively low data curation. Outside FER problems, many datasets have already undergone heavy bias analysis [4], but the lack of meta-information and demographic labels on subjects for in the wild (ITW) FER datasets has hindered bias analysis in this context. Affectnet is not an exception and, despite the diversity considerations during its generation, demographic labels are not available.

In recent years, the development of new datasets for demographic annotation, such as Fairface [9], has enabled the demographic relabeling of existing datasets. It is important to note that this kind of annotation is highly subjective and imperfect, and any demographic label obtained constitutes only a proxy measure for the real demographic characteristics of the subjects, which is already a subjective and complex concept. For example, the gender classification of Fairface is binary (Male, Female), which already constitutes a bias against nonbinary people [1]. The race classification only uses the seven most common descriptors (White, Latino/Hispanic, East Asian, Black, Middle Eastern, Indian, and Southeast Asian). Furthermore, both categories are treated as single-label classifications. Despite these issues, the Fairface model gives us an approximation of the demographic profile of the dataset, enough for a general bias analysis in the absence of more accurate data.

4.3. Experimental Setup

We employ a simple VGG11 [23] network with no pretraining as the base test model. This is a classical convolutional architecture that is often used as a baseline for machine learning applications.

The experiments are developed in PyTorch 1.8.0 and Fastai 2.3.1. The hardware used is a machine equipped with a GeForce RTX 2080 Ti GPU, 128 GB of RAM, and an Intel Xeon Silver 4214 CPU, running CentOS Linux 7.7.

All the models are trained under the same conditions and hyperparameters, namely, a maximum learning rate of 1e-2 with a 1cycle policy (as described in [24] and implemented in Fastai) for 100 iterations. This parameter was decided using the lr_finder tool in Fastai. The batch size is set to 256, the maximum allowed by the hardware setup. For each dataset, we train the model 10 times and average the results over them.

We have also applied the basic data augmentation provided by Fastai through the aug_transforms method, including left-right flipping, warping, rotation, zoom, brightness, and contrast alterations.

4.4. Experiments

4.4.1. Dataset bias

To analyze the dataset, we first process it using the Fairface model to obtain a demographic estimation for each image. We then proceed to calculate the demographic representation profile of the dataset and compute the NSD bias metric presented in Section 3.1. After that, we can also use the NPMI metric to highlight any stereotypical bias inherent in the distribution of labels for each demographic group, as explained in Section 3.2.

The demographic information added to the dataset through this relabeling process also enables the creation of derived datasets that can be used to simulate different bias situations (see the sketch after this section). In particular, we generate:

• Two balanced subsets for the racial and gender demographics. These datasets have the same representation of each demographic group considered for each of the target labels. This balancing removes both representational and stereotypical biases.
• Two artificially biased gender subsets, which contain examples of only one gender.

The two gender biased and the gender balanced subsets are generated with exactly the same number of examples for each class to enable the comparison between the three of them.

4.4.2. Model bias

All trained models are evaluated on the whole Affectnet validation partition, and the OD metric described in Section 3.3 is used as a measure of bias.

In addition to the subsets proposed in the previous experiment, a series of stratified subsets is generated to evaluate the influence of the size of the dataset on the residual bias of the model. Each subset contains a fixed percentage of the original train partition, maintaining the same demographic distribution. Hence, they preserve the dataset biases of Affectnet.

In summary, we want to study the behavior of the model bias metric when increasing the number of training examples and when training on balanced subsets. If balancing the dataset is an efficient mitigation strategy, we expect a lower metric with little to no impact on general accuracy, for the same training data size.
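The derived subsets described above can be approximated with simple group-wise sampling. The sketch below is ours and assumes a hypothetical pandas DataFrame df with one row per training image and 'label' and 'gender' columns (for example, produced by the Fairface relabeling); it is not the exact procedure used in the paper.

```python
import pandas as pd

# df is assumed to hold one row per training image, with at least an emotion
# 'label' column and a Fairface-derived demographic column such as 'gender'.

def balanced_subset(df, label_col="label", group_col="gender", seed=0):
    """Same number of images per (label, group) cell: removes both
    representational and stereotypical bias for that attribute."""
    cell_size = df.groupby([label_col, group_col]).size().min()
    return (df.groupby([label_col, group_col], group_keys=False)
              .apply(lambda g: g.sample(cell_size, random_state=seed)))

def single_group_subset(df, group, n_per_label,
                        label_col="label", group_col="gender", seed=0):
    """Artificially biased subset: only one demographic group, fixed size per label."""
    subset = df[df[group_col] == group]
    return (subset.groupby(label_col, group_keys=False)
                  .apply(lambda g: g.sample(n_per_label, random_state=seed)))

def stratified_subset(df, fraction, label_col="label", group_col="gender", seed=0):
    """Fixed percentage of the train split, keeping the demographic distribution
    (and therefore the original dataset biases)."""
    return (df.groupby([label_col, group_col], group_keys=False)
              .apply(lambda g: g.sample(frac=fraction, random_state=seed)))
```

Matching the per-class sizes across the gender balanced and the two gender biased variants, as the paper does, then amounts to choosing a common n_per_label for the three subsets.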
5. Results and Discussion

5.1. Dataset bias

Figure 1 shows the apparent race distribution of Affectnet. We can observe a strong imbalance in favor of the white race, which comprises 64.4% of the training data. The apparent gender distribution (not shown) is much more balanced, with 49.7% of the training data classified as male and the rest as female. In the case of gender, imbalances arise in the analysis per label, summarized in Figure 2, reaching a 72%-28% imbalance in the case of the angry class.

Figure 1: Apparent race distribution of Affectnet (W: White, LH: Latino / Hispanic, ME: Middle Eastern, B: Black, EA: East Asian, I: Indian, SA: Southeast Asian).

Figure 2: Apparent gender distribution of Affectnet for each label.

These intuitive indicators of bias are also reflected in the NSD and NMI results presented in Table 1, where both the original dataset and some variations are analyzed. Representational bias is measured with the NSD metric and stereotypical bias with the NMI. Note that the NSD and NMI metrics do not share the same scale and are not comparable to each other. The number in parentheses denotes the number of demographic groups considered, 2 for gender and 7 for race. These results show zero representational and stereotypical bias for the artificially balanced datasets in their respective categories, as expected. Note that the gender biased datasets are excluded from the gender bias calculations, as they only include examples of one gender.

Table 1: Representational (NSD) and stereotypical (NMI) bias metrics for the original dataset and the considered subsets.

                       Representational bias (NSD)   Stereotypical bias (NMI)
Dataset                Race (7)     Gender (2)       Race (7)     Gender (2)
Original               0.5902       0.0057           0.0021       0.0089
Balanced (Race)        0.0000       0.0159           0.0000       0.0091
Balanced (Gender)      0.5929       0.0000           0.0017       0.0000
Gender biased (M)      0.5702       -                0.0020       -
Gender biased (F)      0.6129       -                0.0020       -

For the original dataset, we can observe a relatively low representational bias in the gender category (NSD = 0.0057) compared to the race category (NSD = 0.5902). On the contrary, the stereotypical bias is higher in the gender category (NMI = 0.0089) than in the race category (NMI = 0.0021). This agrees with the imbalance perceived in Figure 2.

The stereotypical bias detected through the NMI can be further analyzed with the NPMI. Figure 3 shows the NPMI matrices for both the apparent gender and race analysis. In the race category, the most prominent value indicates an underrepresentation (-0.13) of the East Asian group in the angry class. In the gender category two stereotypical biases stand out: for the angry class, an underrepresentation of people recognized as female (and a corresponding overrepresentation of people recognized as male), and the opposite for the happy class, with an overrepresentation of people recognized as female and an underrepresentation of people recognized as male. The results for the gender category are consistent with the angry-men-happy-women social bias already known in the literature [25].

Figure 3: NPMI analysis of Affectnet (F: Female, M: Male, W: White, LH: Latino / Hispanic, ME: Middle Eastern, B: Black, EA: East Asian, I: Indian, SA: Southeast Asian). NPMI values per emotion label (rows) and demographic group (columns):

             F      M      W      LH     ME     B      EA     I      SA
angry      -0.16   0.13  -0.00   0.01   0.07   0.00  -0.13   0.04  -0.03
contempt   -0.03   0.03   0.01  -0.01   0.03  -0.07  -0.03  -0.06  -0.06
disgust    -0.02   0.02  -0.00  -0.01   0.05  -0.00  -0.03  -0.02  -0.01
fear       -0.00   0.00   0.02  -0.06   0.00  -0.04  -0.03  -0.00  -0.05
happy       0.13  -0.12   0.03   0.01  -0.08  -0.02   0.01  -0.02   0.00
neutral    -0.05   0.05  -0.04   0.00   0.06   0.04   0.02   0.03  -0.00
sad        -0.04   0.04  -0.02  -0.01   0.03   0.01   0.04   0.02   0.06
surprise   -0.03   0.03   0.02  -0.04  -0.00  -0.02   0.02  -0.05  -0.04

5.2. Model bias

The results regarding the trained model bias metric OD and accuracy are shown in Table 2, and graphically in Figure 4.

Regarding racial bias, across the stratified subsets we can observe consistently high model bias values, clearly decreasing from the highest bias in the smallest dataset (0.422 ± 0.043) to the lowest in the largest dataset (0.268 ± 0.023). The race balanced dataset, when compared to an imbalanced one of similar size (Stratified / 13%), does not appear to improve the bias metric (0.297 ± 0.016 vs. 0.284 ± 0.017), while decreasing accuracy (45.7 ± 0.3 vs. 48.4 ± 0.5). In this case, increasing the dataset size (Stratified / 22% and higher) improves the accuracy without impacting the racial bias metric (stable around 0.29 and decreasing for larger sizes). Note that the accuracy is calculated over the whole test partition of Affectnet, which is not balanced in terms of labels and the demographic groups studied.

In the gender category, we observe comparatively lower bias values overall, from 0.264 ± 0.053 to 0.133 ± 0.015, consistent with the lower gender bias of the original dataset (NSD = 0.0057, NMI = 0.0089). Balancing the gender does not significantly improve the accuracy for a similarly sized dataset (Stratified / 36%, accuracy 51.8 ± 0.4 vs. 51.1 ± 0.6), but in this case the gender bias value improves significantly (0.091 ± 0.014, lower than for all other models). Additionally, although the artificially gender biased datasets have a similar accuracy to the balanced one (50.7 ± 0.5 and 50.2 ± 0.4 vs. 51.8 ± 0.4), their bias metric is substantially higher (0.185 ± 0.017 and 0.242 ± 0.018 vs. 0.091 ± 0.014).

Table 2: Bias metric summary for the model when trained on dataset variations.

Train data                        Size      Accuracy (%)   Race OD          Gender OD
Original (stratified)    1%       2,839     33.5 ± 0.9     0.422 ± 0.043    0.264 ± 0.053
                         2%       5,678     39.2 ± 0.9     0.362 ± 0.027    0.214 ± 0.013
                         3%       8,517     41.5 ± 0.8     0.347 ± 0.019    0.186 ± 0.033
                         5%       14,195    44.4 ± 0.4     0.315 ± 0.033    0.192 ± 0.022
                         8%       22,712    46.4 ± 0.5     0.288 ± 0.017    0.174 ± 0.029
                         13%      36,907    48.4 ± 0.5     0.284 ± 0.028    0.144 ± 0.026
                         22%      62,458    49.6 ± 0.6     0.292 ± 0.024    0.166 ± 0.018
                         36%      102,204   51.1 ± 0.6     0.289 ± 0.014    0.157 ± 0.025
                         60%      170,340   53.6 ± 0.5     0.279 ± 0.030    0.149 ± 0.018
                         100%     283,901   55.8 ± 0.2     0.268 ± 0.023    0.133 ± 0.015
Balanced                 Race     32,452    45.7 ± 0.3     0.297 ± 0.016    0.177 ± 0.017
                         Gender   117,790   51.8 ± 0.4     0.273 ± 0.026    0.091 ± 0.014
Gender biased            M        117,790   50.7 ± 0.5     0.277 ± 0.015    0.185 ± 0.017
                         F        117,790   50.2 ± 0.4     0.315 ± 0.022    0.242 ± 0.018

Figure 4: Accuracy, model racial bias, and model gender bias as a function of training size, for the stratified subsets and the race balanced, gender balanced, and gender biased (M/F) subsets.
5.3. Bias transfer

Regarding the apparent race analysis of the system, the proposed metrics reveal an important representational imbalance of the dataset coupled with some stereotypical bias. Despite this, when comparing a stratified subset to a dataset of the same size but balanced by race, we observe no improvement in the trained model bias. This suggests that either the source of the model bias is not the dataset bias (inherent differences between racial expressions, for example) or that directly balancing the dataset is not a successful mitigation approach in this context (because of quality differences between the images associated with each race, for example).

Regarding the gender bias, the metrics reveal a comparatively lower representational bias, but a much more pronounced stereotypical bias. In this case, balancing the dataset seems to substantially improve the trained model bias, mitigating the bias transfer. This suggests that the stereotypical bias detected in the dataset has a large impact on the trained model, but that the balancing of the dataset corrects it properly.

Additionally, we observe a strong tendency of the bias scores to decrease as the size of the training dataset increases, even when the datasets have the same representational and stereotypical biases.

Although the metrics have unveiled both gender and racial bias in the source dataset, these bias transference results suggest that dataset gender bias has a greater impact on the final model. Thus, dataset gender bias seems more susceptible to mitigation measures in the early stages of the AI life-cycle, whereas racial bias may require different mitigation measures in later stages. Further studies would be required to evaluate the impact of different mitigation techniques in this case, but they are out of the scope of this paper.

6. Conclusion

The metrics presented have been shown to be useful, reflecting some of the biases present in both real and manipulated datasets through easily interpretable values. The analysis of these metrics allows the study of the bias transfer from dataset to trained model, which can be useful for understanding the bias in different stages of a machine learning pipeline, and consequently in the study of mitigation strategies.

In our case study, we have revealed the heavy racial representational bias of a popular FER dataset, Affectnet, and the presence of stereotypical gender biases. The experiments also show how the resulting model seems almost invariant to the removal of the racial bias, while being severely impacted by any gender bias, either induced or corrected, in the source dataset. In the Appendix to this document (https://github.com/irisdominguez/Dataset-Bias-Metrics) the results for a second dataset, FER+, are provided, showing similar tendencies to the ones found for Affectnet. These results, while specific to this model and training setup, expose the complexity of the bias analysis and its impact on real world problems.

As future work lines, the same analysis could be performed for more datasets, models, and training setups. Further work on the development of new demographic datasets and models could also improve the accuracy and detail of this bias analysis and extend it to other problems. Different mitigation techniques, especially in the dataset preprocessing stage, could differ wildly in their impact. Finally, the metrics still require further analysis of their properties and potential application to other contexts. For example, although our proposed OD reflects both representational and stereotypical biases, having metrics capable of decoupling them could enable a more in-depth bias analysis.

Furthermore, new application areas could be researched in other multi-class, multi-demographic AI systems, such as age, gender and race recognition, AI-based medical diagnosis, and sign language gesture recognition, to name a few.

Acknowledgments

This work was funded by a predoctoral fellowship from the Research Service of the Universidad Publica de Navarra, the Spanish MICIN (PID2019-108392GB-I00 and PID2020-118014RB-I00 / AEI / 10.13039/501100011033), and the Government of Navarre (0011-1411-2020-000079 - Emotional Films).
References

[1] O. Keyes, The Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition, Proceedings of the ACM on Human-Computer Interaction 2 (2018) 1-22. doi:10.1145/3274357.
[2] V. U. Prabhu, A. Birhane, Large image datasets: A pyrrhic win for computer vision?, arXiv:2006.16923 [cs, stat] (2020).
[3] J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification, in: S. A. Friedler, C. Wilson (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 77-91.
[4] C. Dulhanty, A. Wong, Auditing ImageNet: Towards a Model-driven Framework for Annotating Demographic Attributes of Large-Scale Image Datasets, arXiv:1905.01347 [cs] (2019).
[5] E. Denton, A. Hanna, R. Amironesei, A. Smart, H. Nicole, On the genealogy of machine learning datasets: A critical history of ImageNet, Big Data & Society 8 (2021) 205395172110359. doi:10.1177/20539517211035955.
[6] D. Pessach, E. Shmueli, Algorithmic Fairness, arXiv:2001.09784 [cs, stat] (2020).
[7] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, in: Proceedings of GSCL, volume 30, GSCL, 2009, pp. 31-40.
[8] H. A. Elfenbein, N. Ambady, On the universality and cultural specificity of emotion recognition: A meta-analysis, Psychological Bulletin 128 (2002) 203-235. doi:10.1037/0033-2909.128.2.203.
[9] K. Karkkainen, J. Joo, FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation, in: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, Waikoloa, HI, USA, 2021, pp. 1547-1557. doi:10.1109/WACV48630.2021.00159.
[10] E. Barsoum, C. Zhang, C. C. Ferrer, Z. Zhang, Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution, arXiv:1608.01041 [cs] (2016).
[11] H. Suresh, J. V. Guttag, A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle, arXiv:1901.10002 [cs, stat] (2021).
[12] M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, S. Venkatasubramanian, Certifying and removing disparate impact, arXiv:1412.3756 [cs, stat] (2015).
[13] R. Berk, H. Heidari, S. Jabbari, M. Kearns, A. Roth, Fairness in Criminal Justice Risk Assessments: The State of the Art, Sociological Methods & Research 50 (2018) 3-44. doi:10.1177/0049124118782533.
[14] T. Kamishima, S. Akaho, H. Asoh, J. Sakuma, Enhancement of the Neutrality in Recommendation, Proc. of the 2nd Workshop on Human Decision Making in Recommender Systems 893 (2012) 8-14.
[15] R. E. Jack, C. Blais, C. Scheepers, P. G. Schyns, R. Caldara, Cultural Confusions Show that Facial Expressions Are Not Universal, Current Biology 19 (2009) 1543-1548. doi:10.1016/j.cub.2009.07.051.
[16] J. Gonzalez-Sanchez, M. Baydogan, M. E. Chavez-Echeagaray, R. K. Atkinson, W. Burleson, Chapter 11 - Affect Measurement: A Roadmap Through Approaches, Technologies, and Data Analysis, in: M. Jeon (Ed.), Emotions and Affect in Human Factors and Human-Computer Interaction, Academic Press, San Diego, 2017, pp. 255-288. doi:10.1016/B978-0-12-801851-4.00011-2.
[17] E. B. Sonmez, H. Han, O. Karadeniz, T. Dalyan, B. Sarioglu, EMRES: A new EMotional RESpondent robot, IEEE Transactions on Cognitive and Developmental Systems (2021) 1-1. doi:10.1109/TCDS.2021.3120562.
[18] J. L. Joseph, S. P. Mathew, Facial Expression Recognition for the Blind Using Deep Learning, in: 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), IEEE, Kuala Lumpur, Malaysia, 2021, pp. 1-5. doi:10.1109/GUCON50781.2021.9574035.
[19] A. Mehrabian, Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in Temperament, Current Psychology 14 (1996) 261-292. doi:10.1007/BF02686918.
[20] P. Ekman, W. V. Friesen, Constants across cultures in the face and emotion, Journal of Personality and Social Psychology 17 (1971) 124-129. doi:10.1037/h0030377.
[21] A. Mollahosseini, B. Hasani, M. H. Mahoor, AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild, IEEE Transactions on Affective Computing 10 (2019) 18-31. doi:10.1109/TAFFC.2017.2740923. arXiv:1708.03985.
[22] C. F. Benitez-Quiroz, R. Srinivasan, A. M. Martinez, EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, NV, USA, 2016, pp. 5562-5570. doi:10.1109/CVPR.2016.600.
[23] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015. arXiv:1409.1556.
[24] L. N. Smith, A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay, arXiv:1803.09820 [cs, stat] (2018).
[25] A. P. Atkinson, J. Tipples, D. M. Burt, A. W. Young, Asymmetric interference between sex and emotion in face perception, Perception & Psychophysics 67 (2005) 1199-1213. doi:10.3758/BF03193553.