Investigating Bias in Affective State Detection Using Eye Biometrics

Yuxin Zhi*, Bilal Taha and Dimitrios Hatzinakos
University of Toronto, ON, Canada

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
*Corresponding author: yuxin.zhi@mail.utoronto.ca (Y. Zhi); bilal.taha@mail.utoronto.ca (B. Taha); dimitris@comm.utoronto.ca (D. Hatzinakos)

Abstract
This study explores pupillometry as a modality for affect state recognition. It examines the propensity for bias in both feature-based and learning-based machine learning models that interpret affect through pupil responses. Our research lies at the intersection of affective computing and mental health, recognizing the paramount importance of accurately identifying affect states for effective mental health interventions. We rigorously evaluate the performance of these pupillometry-based models across diverse demographic groups, including variables such as ethnicity, gender, age, vision problems, and iris color. Our findings reveal notable disparities, particularly in gender and ethnicity: bias levels are pronounced in both feature-based and learning-based models, with F1 score differentials reaching up to 36.28%. Our analysis also uncovers a slight bias related to iris color. Together, these disparities impact the efficacy of affect state recognition models that rely on pupil responses and underscore the critical need for fairness and accuracy in developing machine learning models within affective computing. By highlighting these areas of potential bias, our study contributes to the broader discourse on creating equitable AI systems and advancing mental health care, education, and social robotics, and it emphasizes the ethical imperative of developing unbiased, inclusive technologies in healthcare systems.

Keywords
Pupillometry, Affect state recognition, Mental health interventions, Bias, Fairness

1. Introduction

In the emerging field of affective computing and mental health, the intricate relationship between affect state recognition and cognitive and mental health outcomes presents a domain of significant research interest [1, 2]. Affect state recognition, central to understanding and managing various mental health disorders, encompasses the complex process of identifying and interpreting emotional states [3]. This process is crucial in disorders such as depression and anxiety, where impairments in emotional awareness and regulation are prevalent [4]. The advancement of cognitive and mental health therapies, including cognitive-behavioral therapy (CBT) and mindfulness-based strategies, hinges on the nuanced understanding and regulation of affect states [5, 6]. These emotional states profoundly influence core cognitive processes, including attention, memory, and decision-making [7]. This underscores the importance of affect state recognition in therapeutic interventions. Furthermore, the predictive nature of affect state recognition in mental health conditions paves the way for early and more effective intervention strategies [8].

The advent of technological solutions, such as recognition software and mood-predicting algorithms, has opened new avenues in the monitoring and treatment of mental health conditions [9]. However, this brings forth the challenge of bias in machine learning models [10]. The accuracy and reliability of these models in affect state recognition are paramount, as biases can lead to misinterpretations, potentially worsening mental health conditions or leading to inappropriate treatment methodologies.

Pupil response has been employed in diverse studies within psychiatry and psychology, particularly in assessing cognitive load for memory-based tasks [11]. It has also been utilized in analyzing the emotional impact of stimuli on individuals [12]. One investigation focused on the confounding effects of eye blinking in pupillometry and proposed remedies [13]. Additionally, the utility of pupillometry in psychiatry was reviewed, highlighting its role in understanding patients' information processing styles, predicting treatment outcomes, and examining cognitive functions [14]. A separate study employed pupillometry to assess atypical pupillary light reflexes and the LC-NE system in Autism Spectrum Disorder (ASD) [15]. The potential clinical use of pupillometry in diagnosing nonconvulsive status epilepticus (NCSE) has also been explored [16]. Although physiological responses such as pupillometry are generally considered less biased than other modalities, hidden biases can emerge from factors like stimuli selection and demographic influences [17, 18]. For instance, responses to visual stimuli may vary significantly across different cultural backgrounds, orientations, and age groups.

This work aims to investigate the bias that exists in affect state recognition models based on physiological signals, specifically pupillometry, which plays a significant role in cognitive and mental health applications. The main goal of this study is to shed light on the potential bias that may exist in common learning methods. The structure of the paper is as follows: first, the methodology, which includes preprocessing and the learning models, is explained. Then, the experiments and results are presented and validated using a dataset collected for this work. Finally, we discuss the findings and conclude.

2. Methodology

The framework focuses on the use of pupillary responses and approaches the task of affect state recognition as a binary classification problem based on the targeted group. The first step is preprocessing the pupillometry data to mitigate the effect of noisy samples. The data is then used to develop the classification model, either from handcrafted features specific to the pupillometry data or using a learned model. Finally, model training and testing are described to investigate the different cases. Illustrative sketches of the feature-based and learned-based pipelines are given at the end of this section.

2.1. Preprocessing

The initial processing of the pupillometry data is paramount to remove any irrelevant and noisy samples that may impact pupil size analysis. The raw data can be contaminated with various outliers, such as system errors, blinks, eye-tracker glitches, and eyelid occlusion, which can be identified and eliminated during this stage. Previous studies [19, 20] have proposed a robust method for detecting such invalid samples, which we have adopted in our study. The method uses dilation speed as a metric to determine whether a data point is an outlier: if a sample exhibits a dilation speed greater than a pre-defined threshold, it is removed as an anomaly. After that, to ensure the continuity of the data, the filtered data is modeled using a Gaussian process.

2.2. Feature-Based Models

The feature-based method is a common approach in machine learning where specific features are extracted from the data and used to train the algorithm. In this study, the pupil responses for each participant were divided into 150 sequences, with each sequence corresponding to the pupil response for one image. Each sequence has a length of 300 samples, which were used to extract the features.

Several features can be extracted from the pupil response, including the mean and variance of the pupil response, maximum dilation, minimum contraction, dilation speed, dilation duration, contraction duration, and the difference between dilation and contraction. In total, 30 features were manually extracted and used to train a kernel SVM classifier. Different kernels were tested, and the Gaussian kernel generally showed the best performance.

2.3. Learned-Based Model

The long short-term memory (LSTM) [21] model is commonly used in machine learning for modeling sequential data. In this approach, the LSTM model has been implemented as shown in Figure 1, with 128 LSTM units, a dropout rate of 0.5, a 128-unit dense layer, and a rectified linear unit (ReLU) activation function. A final dense layer with a SoftMax function produces the classification output. The cross-entropy loss function and the RMSprop optimizer are used for training the model.

The use of deep learning methods such as LSTM for feature learning and affect state recognition is effective in various machine learning tasks. This approach can improve the performance of the model, as it can capture temporal dependencies and relationships in the data that might be missed by manual feature extraction.

Figure 1: The LSTM structure used for modeling the learned-based approach.

2.4. Model Training

In both approaches, feature-based and learned-based, we divided the data into training and testing datasets, allocating 80% to training and 20% to testing, respectively. While constructing the model, we utilized data from all demographic groups with the intention of creating a model that captures feature representations from all these groups. To assess the model's fairness and prevent bias towards any particular group, we further divided the testing data into subgroups during the evaluation phase and assessed the model's performance for each subgroup.

Due to the limited number of samples, we introduced augmentation to enhance the training data. This augmentation was applied later in the evaluation, allowing us to assess its impact on the results. The pupil data sequences were augmented using noise injection and time-shifting methods [22]. Specifically, we added white noise to the original pupil data and performed 50-sample shifts. Importantly, the augmentation was applied to samples from the non-dominant group to ensure that our findings were not influenced by this imbalance.
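To make Sections 2.1 and 2.2 concrete, the following is a minimal sketch, not the code used in this study, of a dilation-speed outlier filter in the spirit of Kret and Sjak-Shie [19] and of a handcrafted-feature SVM. The threshold constant, the noise-free feature subset, and the helper names (dilation_speed_filter, extract_features, train_svm) are illustrative assumptions, and the Gaussian-process smoothing step is omitted for brevity.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def dilation_speed_filter(pupil, times, n=16.0):
    """Flag invalid samples via a dilation-speed criterion (cf. [19]).

    A sample is rejected when its absolute dilation speed, relative to its
    temporal neighbours, exceeds a MAD-based threshold.  The constant `n`
    is a tunable assumption, not a value reported in the paper.
    """
    d = np.abs(np.diff(pupil)) / np.diff(times)
    speed = np.concatenate([[d[0]], np.maximum(d[:-1], d[1:]), [d[-1]]])
    mad = np.median(np.abs(speed - np.median(speed)))
    threshold = np.median(speed) + n * mad
    return speed < threshold  # boolean mask of valid samples

def extract_features(seq):
    """A small illustrative subset of the 30 handcrafted descriptors of Section 2.2."""
    diff = np.diff(seq)
    return np.array([
        seq.mean(),             # mean pupil size
        seq.var(),              # variance
        seq.max(),              # maximum dilation
        seq.min(),              # minimum contraction
        diff.max(),             # peak dilation speed
        (diff > 0).sum(),       # dilation duration (samples)
        (diff < 0).sum(),       # contraction duration (samples)
        seq.max() - seq.min(),  # dilation-contraction difference
    ])

def train_svm(sequences, labels):
    """Train a Gaussian-kernel SVM on the handcrafted features (80/20 split)."""
    X = np.stack([extract_features(s) for s in sequences])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf")  # Gaussian kernel, reported to perform best
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```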
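Similarly, for the learned-based model of Section 2.3 and the augmentation of Section 2.4, the sketch below shows one plausible Keras realization of the described architecture (128 LSTM units, dropout of 0.5, a 128-unit ReLU dense layer, a SoftMax output, cross-entropy loss, and the RMSprop optimizer) together with the noise-injection and time-shift augmentation. The layer ordering, the white-noise standard deviation, and the helper names are our assumptions; the paper specifies only the hyperparameters listed above.

```python
import numpy as np
from tensorflow.keras import layers, models

SEQ_LEN = 300  # samples per pupil-response sequence (Section 2.2)

def build_lstm_classifier(n_classes=2):
    """LSTM classifier following the description in Section 2.3."""
    model = models.Sequential([
        layers.LSTM(128, input_shape=(SEQ_LEN, 1)),    # 128 LSTM units
        layers.Dropout(0.5),                           # dropout rate of 0.5
        layers.Dense(128, activation="relu"),          # 128-unit dense layer with ReLU
        layers.Dense(n_classes, activation="softmax"), # SoftMax classification output
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def augment(seq, shift=50, noise_std=0.02):
    """Noise injection and 50-sample time shifting (Section 2.4).

    The 50-sample shift is taken from the paper; the noise standard
    deviation is an assumed value.
    """
    noisy = seq + np.random.normal(0.0, noise_std, size=seq.shape)
    shifted = np.roll(seq, shift, axis=0)
    return noisy, shifted

# Usage (X: (n_sequences, 300, 1) preprocessed pupil traces, y: binary affect labels):
# model = build_lstm_classifier()
# model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32)
```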
3. Experiments and Results

Bias can be seen as the disparity in performance metrics across different groups for a given task. Let G = {g_1, g_2, ..., g_n} be the set of groups under investigation. For each group g_i, we compute the performance metric of a recognition model, M(g_i). The bias B for a pair of groups (g_i, g_j) is then the absolute difference in their metrics:

B(g_i, g_j) = |M(g_i) − M(g_j)|

3.1. Data

To conduct a thorough assessment of bias in pupillometry affect state recognition, we collected a dataset that encompasses pupillometry data in response to visual stimuli, taking into account a diverse range of demographics. The study involved 35 university students aged between 18 and 40 years, with a mean age of 24.6 and a standard deviation of 5.17. Participants were required to have no history of vision disorders, and they were also asked about any medications they might be taking that could affect their responses, such as depression medication. The data collected from the participants is categorized into different cases based on various demographic factors:

• Gender: This case examines the algorithm's ability to fairly recognize emotional states in females versus males.
• Ethnic Group: This case assesses the model's ability to impartially detect emotional states based on participants' ethnic groups, including Asian (Chinese), White (North American or European), Black (African American or Caribbean), and South Asian (Pakistani or Indian) [6].
• Age: This case explores the impact of age on the model's ability to detect emotional states, considering the age groups 17-24 versus 25-55.
• Iris Color: Eye color is a factor unique to models that use pupillometry data, since it affects the precision of detecting pupils and measuring their dilation and contraction. We therefore categorize the data into light (light brown, green, blue, hazel) versus dark (black, brown, dark brown) iris colors.
• Vision: This case evaluates the model's effectiveness in capturing emotional states in data from individuals wearing glasses versus those not wearing glasses.

3.2. Experimental Protocol

The proposed system was evaluated using a dataset collected at the University of Toronto. In the experiment, participants viewed a series of visual stimuli intended to elicit emotions spanning different valence and arousal values. The visual stimuli were selected from the International Affective Picture System (IAPS) dataset [23]. The IAPS database provides normative ratings of emotional valence and arousal for a large set of images. The rating scales are based on the Self-Assessment Manikin (SAM), a 9-point scale where a score of 9 represents a high rating (i.e., high pleasure, high arousal), a score of 5 indicates a neutral rating, and a score of 1 represents a low rating (i.e., low pleasure, low arousal).

The selected visual stimuli elicit the emotions of interest, covering high and low arousal (HA, LA) and high and low valence (HV, LV) in the valence-arousal (VA) dimensional model. Each of these emotional states is targeted by displaying 30 images of the same emotional target for 5 seconds each. The images were selected to statistically produce the same response for different groups of people. All images were presented on a screen with a resolution of 1920 by 1080 pixels. Following the manufacturer's recommendations, the Gazepoint eye-tracking system was placed approximately 45 cm in front of the participant at an angle of around 30 degrees. The total number of participants was 35.

The data collection process was approved by the research ethics committee at the University of Toronto. All participants signed a consent form that clearly explained the data collection procedure and the privacy of their data. Furthermore, all participants received compensation in the form of a gift card.

3.3. Metrics

In the evaluation process, two common metrics were employed: accuracy and F1 score. Accuracy gauges the proportion of correct predictions made by the algorithm. The F1 score, on the other hand, assesses the balance between precision and recall; it offers a more nuanced evaluation of the algorithm's performance, especially when dealing with imbalanced datasets.
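To make the subgroup-wise evaluation concrete, the following is a minimal sketch, under our own assumptions about variable names, of how per-group accuracy and F1 can be computed with scikit-learn and how the bias B(g_i, g_j) is obtained as the absolute gap between group metrics.

```python
from itertools import combinations
from sklearn.metrics import accuracy_score, f1_score

def group_metrics(y_true, y_pred, groups):
    """Compute accuracy and F1 separately for each demographic subgroup.

    groups: list of group labels (e.g., "Asian", "White") aligned with y_true.
    """
    metrics = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        metrics[g] = {"accuracy": accuracy_score(yt, yp),
                      "f1": f1_score(yt, yp)}
    return metrics

def bias_gaps(metrics, key="accuracy"):
    """Bias B(g_i, g_j) = |M(g_i) - M(g_j)| for every pair of groups."""
    return {(gi, gj): abs(metrics[gi][key] - metrics[gj][key])
            for gi, gj in combinations(sorted(metrics), 2)}
```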
3.4. Results from the Feature-Based Model

We employed a feature-based algorithm for emotion recognition and assessed the presence of bias among different demographic groups, focusing on valence-based and arousal-based classifications. Our evaluation yielded the results presented in Tables 1 and 2, along with Figures 2 and 3.

Notably, our findings reveal significant performance differences between males and females in both arousal and valence. Specifically, our analysis indicated that males scored 20.28% higher in arousal and 17.46% higher in valence compared to females. The F1 score exhibited a similar gender-based pattern of differences.

Further examination of the model based on ethnicity factors showed significant variations in accuracy and F1 scores across different groups. Notably, the Asian group, despite having the highest number of samples, displayed the lowest accuracy and F1 scores in terms of arousal classification. In contrast, the South Asian group, with the second-lowest number of samples, demonstrated the highest performance. The percentage difference between the highest-performing group (South Asian) and the lowest-performing group (Asian) was 28.93% in accuracy and 21% in F1 score for arousal classification. These findings suggest that obtaining accurate feature representations for the Asian group in terms of arousal classification may be more challenging based on the provided stimuli.

Regarding valence classification, our analysis revealed similar performance among the Asian, White, and South Asian groups, while the Black group exhibited significantly lower accuracy and F1 scores. Specifically, the percentage difference between the Black group and the group with the highest performance was 26.99% in accuracy and 46.71% in F1 score, respectively.

Table 1: Arousal result of SVM for the different ethnic groups.
Ethnic Group    Testing %    Accuracy    F1 Score
Asian           44.24%       51.48%      0.667
White           35.86%       67.88%      0.790
South Asian     11.78%       68.89%      0.816
Black           8.12%        64.5%       0.784

Table 2: Valence result of SVM for the different ethnic groups.
Ethnic Group    Testing %    Accuracy    F1 Score
Asian           45.28%       52.4%       0.615
White           37.47%       51.1%       0.600
South Asian     9.43%        54.3%       0.667
Black           7.82%        41.4%       0.414

Figure 2: SVM F1 results for the remaining groups.
Figure 3: SVM accuracy results for the remaining groups.

3.5. Results from the LSTM Model

We employed an LSTM-based approach to investigate bias across different demographic groups. The results of our analysis are presented in Tables 3 and 4, and in Figures 4 and 5.

Consistent with the findings of the feature-based model, we observed significant performance differences between genders and ethnic groups. Specifically, our results revealed a significant 24.12% bias toward females in arousal accuracy and a 10.15% bias toward males in valence accuracy. Concerning ethnic groups, accuracy exhibited substantial variations across different ethnicities, as depicted in Tables 3 and 4. In terms of arousal, the Asian group had the lowest performance, while the White group achieved the highest accuracy, resulting in a significant 20.90% advantage favoring the White group. The other ethnic groups showed similar performance.
In terms of valence, the Black group displayed the lowest performance, while the South Asian group achieved the highest, with a difference of 36.28%. In the remaining cases, there were no significant differences between individual groups, suggesting that these factors share common representations that can be captured by the algorithms.

Table 3: LSTM arousal result for the different ethnic groups.
Ethnic Group    Accuracy    F1 Score
Asian           52.7%       0.413
White           65.0%       0.581
South Asian     62.2%       0.546
Black           64.5%       0.506

Table 4: LSTM valence result for the different ethnic groups.
Ethnic Group    Accuracy    F1 Score
Asian           54.7%       0.404
White           50.4%       0.345
South Asian     54.3%       0.382
Black           37.9%       0.209

Figure 4: F1 results for the LSTM model for the remaining demographic groups.
Figure 5: LSTM accuracy results for the remaining demographic groups.

3.6. Bias and Fairness

Based on the results presented above, it is evident that both models exhibit significant differences in accuracy and F1 scores concerning ethnic groups and gender. This indicates that these two factors play a pivotal role in the development of affect recognition from pupillometry data, as the models struggled to find effective representations for them. In contrast, the other four cases displayed minor differences in accuracy and F1 scores, suggesting that these factors share common representations across all groups and do not adversely affect the data's quality. For example, iris color had a limited impact on recognition performance, albeit not as pronounced as gender and ethnic group.

Despite the dataset including diverse groups during model training, the quality of the representations failed to adequately capture the diverse group responses within the studied population. We acknowledge that the unbalanced number of samples in each group might contribute to the bias observed in the results. To address this potential issue, we implemented data augmentation techniques (see Section 2.4) for the non-dominant groups (groups with fewer samples) to increase their sample size; a sketch of this rebalancing step is shown below. Subsequently, we followed the same procedure as in the original case. However, our results demonstrated that even with the implementation of data augmentation, the performance did not change significantly. The bias in performance persisted in both the ethnic-group and gender-based cases, while the remaining cases exhibited similar performance.
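As a rough illustration of the rebalancing experiment described above, the sketch below oversamples the non-dominant subgroups by generating augmented copies (noise injection and time shifting, as in Section 2.4) until each subgroup approaches the size of the largest one. The augment helper (restated here for self-containment), the noise level, and the target-count heuristic are our assumptions rather than the exact procedure used in this study.

```python
import numpy as np
from collections import Counter

def augment(seq, shift=50, noise_std=0.02):
    """Noise injection and a 50-sample time shift (Section 2.4); noise level assumed."""
    return [seq + np.random.normal(0.0, noise_std, size=seq.shape),
            np.roll(seq, shift, axis=0)]

def oversample_minority_groups(X, y, groups):
    """Augment sequences of non-dominant groups until every group
    roughly matches the size of the largest group."""
    counts = Counter(groups)
    target = max(counts.values())
    X_out, y_out, g_out = list(X), list(y), list(groups)
    for g, n in counts.items():
        idx = [i for i, grp in enumerate(groups) if grp == g]
        i = 0
        while n < target and idx:
            src = idx[i % len(idx)]
            for aug in augment(X[src]):
                X_out.append(aug)
                y_out.append(y[src])
                g_out.append(g)
                n += 1
                if n >= target:
                    break
            i += 1
    return np.array(X_out), np.array(y_out), g_out
```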
4. Conclusion

In this study, we investigated the performance of feature-based and learned-based affect recognition models across various group factors, including ethnicity, gender, vision, iris color, and age, focusing on pupillometry as the modality. Our research, involving a dataset from 35 diverse participants, revealed significant gender and ethnic biases in standard affect recognition algorithms, impacting both arousal- and valence-based classifications. We also identified minor biases related to other factors, such as iris color.

These findings emphasize the potential for bias in affect recognition systems, highlighting the need for more inclusive and representative training data, rigorous fairness evaluation, and enhanced transparency in model development. Our study not only sheds light on the inherent biases in affective computing but also underscores the importance of considering demographic factors in the development of more equitable and effective affect recognition technologies, particularly given their direct relation to cognitive and mental health.

References

[1] R. Assabumrungrat, S. Sangnark, T. Charoenpattarawut, W. Polpakdee, T. Sudhawiyangkul, E. Boonchieng, T. Wilaiprasitporn, Ubiquitous affective computing: A review, IEEE Sensors Journal 22 (2021) 1867–1881.
[2] S. Greene, H. Thapliyal, A. Caban-Holt, A survey of affective computing for stress detection: Evaluating technologies in stress detection for better health, IEEE Consumer Electronics Magazine 5 (2016) 44–56.
[3] R. A. Calvo, K. Dinakar, R. Picard, P. Maes, Computing in mental health, in: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2016, pp. 3438–3445.
[4] T. Nguyen, D. Phung, B. Dao, S. Venkatesh, M. Berk, Affective and content analysis of online depression communities, IEEE Transactions on Affective Computing 5 (2014) 217–226.
[5] C. Zucco, B. Calabrese, M. Cannataro, Sentiment analysis and affective computing for depression monitoring, in: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2017, pp. 1988–1995.
[6] M. A. Kirk, B. Taha, K. Dang, H. McCague, D. Hatzinakos, J. Katz, P. Ritvo, A web-based cognitive behavioral therapy, mindfulness meditation, and yoga intervention for posttraumatic stress disorder: Single-arm experimental clinical trial, JMIR Mental Health 9 (2022) e26479.
[7] J. S. Lerner, D. Keltner, Beyond valence: Toward a model of emotion-specific influences on judgement and choice, Cognition & Emotion 14 (2000) 473–493.
[8] R. el Kaliouby, R. Picard, S. Baron-Cohen, Affective computing and autism, Annals of the New York Academy of Sciences 1093 (2006) 228–248.
[9] M. Nouman, S. Y. Khoo, M. P. Mahmud, A. Z. Kouzani, Recent advances in contactless sensing technologies for mental health monitoring, IEEE Internet of Things Journal 9 (2021) 274–297.
[10] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Computing Surveys (CSUR) 54 (2021) 1–35.
[11] E. Granholm, R. F. Asarnow, A. J. Sarkin, K. L. Dykes, Pupillary responses index cognitive resource limitations, Psychophysiology 33 (1996) 457–461.
[12] M. M. Bradley, L. Miccoli, M. A. Escrig, P. J. Lang, The pupil as a measure of emotional arousal and autonomic activation, Psychophysiology 45 (2008) 602–607.
[13] K. Yoo, J. Ahn, S.-H. Lee, The confounding effects of eye blinking on pupillometry, and their remedy, PLoS ONE 16 (2021) e0261463.
[14] S. Graur, G. Siegle, Pupillary motility: bringing neuroscience to the psychiatry clinic of the future, Current Neurology and Neuroscience Reports 13 (2013) 1–9.
[15] G. Lynch, Using pupillometry to assess the atypical pupillary light reflex and LC-NE system in ASD, Behavioral Sciences 8 (2018) 108.
[16] S. Hocker, Pupillometry for diagnosing nonconvulsive status epilepticus and assessing treatment response?, Neurocritical Care 35 (2021) 304–305.
[17] K. Yang, C. Wang, Y. Gu, Z. Sarsenbayeva, B. Tag, T. Dingler, G. Wadley, J. Goncalves, Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition, IEEE Transactions on Affective Computing (2021).
[18] H.-C. Yang, C.-C. Lee, Annotation matters: A comprehensive study on recognizing intended, self-reported, and observed emotion labels using physiology, in: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2019, pp. 1–7.
[19] M. E. Kret, E. E. Sjak-Shie, Preprocessing pupil size data: Guidelines and code, Behavior Research Methods 51 (2019) 1336–1342.
[20] B. Taha, M. Kirk, P. Ritvo, D. Hatzinakos, Detection of post-traumatic stress disorder using learned time-frequency representations from pupillometry, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3950–3954.
[21] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[22] T. Ko, V. Peddinti, D. Povey, S. Khudanpur, Audio augmentation for speech recognition, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[23] P. J. Lang, International Affective Picture System (IAPS): Affective ratings of pictures and instruction manual, Technical report, 2005.