Investigating Bias in Affective State Detection Using Eye Biometrics

Yuxin Zhi*, Bilal Taha and Dimitrios Hatzinakos
University of Toronto, ON, Canada

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
*Corresponding author: yuxin.zhi@mail.utoronto.ca (Y. Zhi); bilal.taha@mail.utoronto.ca (B. Taha); dimitris@comm.utoronto.ca (D. Hatzinakos)

Abstract
This study explores pupillometry as a modality for affect state recognition. It examines the propensity for bias in both feature-based and learning-based machine learning models that interpret affect through pupil responses. Our research lies at the intersection of affective computing and mental health, recognizing the paramount importance of accurately identifying affect states for effective mental health interventions. We rigorously evaluate the performance of these pupillometry-based models across diverse demographic groups, including variables such as ethnicity, gender, age, vision problems, and iris color. Our findings reveal notable disparities, particularly in gender and ethnicity: bias levels are pronounced in both feature-based and learning-based models, with F1 score differentials reaching up to 36.28%. Our analysis also uncovers a slight bias related to iris color. Together, these disparities impact the efficacy of affect state recognition models that rely on pupil responses and underscore the critical need for fairness and accuracy in developing machine learning models within affective computing. By highlighting these areas of potential bias, our study contributes to the broader discourse on creating equitable AI systems and advancing mental health care, education, and social robotics, and it emphasizes the ethical imperative of developing unbiased, inclusive technologies in healthcare systems.

Keywords
Pupillometry, Affect state recognition, Mental health interventions, Bias, Fairness

1. Introduction

In the emerging field of affective computing and mental health, the intricate relationship between affect state recognition and cognitive and mental health outcomes presents a domain of significant research interest [1, 2]. Affect state recognition, central to understanding and managing various mental health disorders, encompasses the complex process of identifying and interpreting emotional states [3]. This process is crucial in disorders such as depression and anxiety, where impairments in emotional awareness and regulation are prevalent [4]. The advancement of cognitive and mental health therapies, including cognitive-behavioral therapy (CBT) and mindfulness-based strategies, hinges on the nuanced understanding and regulation of affect states [5, 6]. These emotional states profoundly influence core cognitive processes, including attention, memory, and decision-making [7]. This underscores the importance of affect state recognition in therapeutic interventions. Furthermore, the predictive nature of affect state recognition in mental health conditions paves the way for early and more effective intervention strategies [8].

The advent of technological solutions, such as recognition software and mood-predicting algorithms, has opened new avenues in the monitoring and treatment of mental health conditions [9]. However, this brings forth the challenge of bias in machine learning models [10]. The accuracy and reliability of these models in affect state recognition are paramount, as biases can lead to misinterpretations, potentially worsening mental health conditions or leading to inappropriate treatment methodologies.

Pupil response has been employed in diverse studies within psychiatry and psychology, particularly in assessing cognitive load for memory-based tasks [11]. It has also been utilized in analyzing the emotional impact of stimuli on individuals [12]. One investigation focused on the confounding effects of eye blinking in pupillometry and proposed remedies [13]. Additionally, the utility of pupillometry in psychiatry was reviewed, highlighting its role in understanding patients' information processing styles, predicting treatment outcomes, and examining cognitive functions [14]. A separate study employed pupillometry to assess atypical pupillary light reflexes and the LC-NE system in Autism Spectrum Disorder (ASD) [15]. The potential clinical use of pupillometry in diagnosing nonconvulsive status epilepticus (NCSE) has also been explored [16]. Although physiological responses such as pupillometry are generally considered less biased than other modalities, hidden biases can emerge from factors like stimuli selection and demographic influences [17, 18]. For instance, responses to visual stimuli may vary significantly across different cultural backgrounds, orientations, and age groups.

This work aims to investigate the bias that exists in affect state recognition models based on physiological signals, specifically pupillometry, which plays a significant role in cognitive and mental health applications. The main goal of this study is to shed light on the potential bias that may exist in common learning methods. The structure of the paper is as follows: first, the methodology, which includes preprocessing and the learning models, is explained. Then, the experiments and results are presented and validated using a dataset collected for this work. Finally, we discuss the findings and conclude.

2. Methodology

The framework focuses on the use of pupillary responses and approaches the task of affect state recognition as a binary classification problem based on the targeted group. The first step is preprocessing the pupillometry data to mitigate the effect of noisy samples. The data is then used to develop the classification model, either from handcrafted features specific to the pupillometry data or using a learned model. Finally, model training and testing are described to investigate the different cases. Illustrative sketches of the feature-based and learned-based pipelines are given at the end of this section.

2.1. Preprocessing

The initial processing of the pupillometry data is paramount to remove any irrelevant and noisy samples that may impact pupil size analysis. The raw data can be contaminated with various outliers, such as system errors, blinks, eye-tracker glitches, and eyelid occlusion, which can be identified and eliminated during this stage. Previous studies [19, 20] have proposed a robust method for detecting such invalid samples, which we have adopted in our study. The method uses dilation speed as a metric to determine whether a data point is an outlier: if a sample exhibits a dilation speed greater than a pre-defined threshold, it is removed as an anomaly. After that, to ensure the continuity of the data, the filtered data is modeled using a Gaussian process.

2.2. Feature-Based Models

The feature-based method is a common approach in machine learning where specific features are extracted from the data and used to train the algorithm. In this study, the pupil responses for each participant were divided into 150 sequences, with each sequence corresponding to the pupil response for one image. Each sequence has a length of 300 samples, which were used to extract the features.

Several features can be extracted from the pupil response, including the mean and variance of the pupil response, maximum dilation, minimum contraction, dilation speed, dilation duration, contraction duration, and the difference between dilation and contraction. In total, 30 features were manually extracted and used to train a kernel SVM classifier. Different kernels were tested, and the Gaussian kernel generally showed the best performance.

2.3. Learned-Based Model

The long short-term memory (LSTM) [21] model is commonly used in machine learning for modeling sequential data. In this approach, the LSTM model has been implemented as shown in Figure 1, with 128 LSTM units, a dropout rate of 0.5, a 128-unit dense layer, and a rectified linear unit (ReLU) activation function. A final dense layer with a SoftMax function produces the classification output. The cross-entropy loss function and the RMSprop optimizer are used for training the model.

The use of deep learning methods such as LSTM for feature learning and affect state recognition is effective in various machine learning tasks. This approach can improve the performance of the model, as it can capture temporal dependencies and relationships in the data that might be missed by manual feature extraction.

Figure 1: The LSTM structure used for modeling the learned-based approach.

2.4. Model Training

In both approaches, feature-based and learned-based, we divided the data into training and testing datasets, allocating 80% to training and 20% to testing, respectively. While constructing the model, we utilized data from all demographic groups with the intention of creating a model that captures feature representations from all these groups. To assess the model's fairness and prevent bias towards any particular group, we further divided the testing data into subgroups during the evaluation phase and assessed the model's performance for each subgroup.

Due to the limited number of samples, we introduced augmentation to enhance the training data. This augmentation was applied later in the evaluation, allowing us to assess its impact on the results. The pupil data sequences were augmented using noise injection and time-shifting methods [22]. Specifically, we added white noise to the original pupil data and performed 50-sample shifts. Importantly, the augmentation was applied to samples from the non-dominant group to ensure that our findings were not influenced by this imbalance.
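To make Sections 2.1 and 2.2 concrete, the following is a minimal sketch, not the code used in this study, of a dilation-speed outlier filter in the spirit of Kret and Sjak-Shie [19] and of a handcrafted-feature SVM. The threshold constant, the noise-free feature subset, and the helper names (dilation_speed_filter, extract_features, train_svm) are illustrative assumptions, and the Gaussian-process smoothing step is omitted for brevity.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def dilation_speed_filter(pupil, times, n=16.0):
    """Flag invalid samples via a dilation-speed criterion (cf. [19]).

    A sample is rejected when its absolute dilation speed, relative to its
    temporal neighbours, exceeds a MAD-based threshold.  The constant `n`
    is a tunable assumption, not a value reported in the paper.
    """
    d = np.abs(np.diff(pupil)) / np.diff(times)
    speed = np.concatenate([[d[0]], np.maximum(d[:-1], d[1:]), [d[-1]]])
    mad = np.median(np.abs(speed - np.median(speed)))
    threshold = np.median(speed) + n * mad
    return speed < threshold  # boolean mask of valid samples

def extract_features(seq):
    """A small illustrative subset of the 30 handcrafted descriptors of Section 2.2."""
    diff = np.diff(seq)
    return np.array([
        seq.mean(),             # mean pupil size
        seq.var(),              # variance
        seq.max(),              # maximum dilation
        seq.min(),              # minimum contraction
        diff.max(),             # peak dilation speed
        (diff > 0).sum(),       # dilation duration (samples)
        (diff < 0).sum(),       # contraction duration (samples)
        seq.max() - seq.min(),  # dilation-contraction difference
    ])

def train_svm(sequences, labels):
    """Train a Gaussian-kernel SVM on the handcrafted features (80/20 split)."""
    X = np.stack([extract_features(s) for s in sequences])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf")  # Gaussian kernel, reported to perform best
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```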
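Similarly, for the learned-based model of Section 2.3 and the augmentation of Section 2.4, the sketch below shows one plausible Keras realization of the described architecture (128 LSTM units, dropout of 0.5, a 128-unit ReLU dense layer, a SoftMax output, cross-entropy loss, and the RMSprop optimizer) together with the noise-injection and time-shift augmentation. The layer ordering, the white-noise standard deviation, and the helper names are our assumptions; the paper specifies only the hyperparameters listed above.

```python
import numpy as np
from tensorflow.keras import layers, models

SEQ_LEN = 300  # samples per pupil-response sequence (Section 2.2)

def build_lstm_classifier(n_classes=2):
    """LSTM classifier following the description in Section 2.3."""
    model = models.Sequential([
        layers.LSTM(128, input_shape=(SEQ_LEN, 1)),    # 128 LSTM units
        layers.Dropout(0.5),                           # dropout rate of 0.5
        layers.Dense(128, activation="relu"),          # 128-unit dense layer with ReLU
        layers.Dense(n_classes, activation="softmax"), # SoftMax classification output
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def augment(seq, shift=50, noise_std=0.02):
    """Noise injection and 50-sample time shifting (Section 2.4).

    The 50-sample shift is taken from the paper; the noise standard
    deviation is an assumed value.
    """
    noisy = seq + np.random.normal(0.0, noise_std, size=seq.shape)
    shifted = np.roll(seq, shift, axis=0)
    return noisy, shifted

# Usage (X: (n_sequences, 300, 1) preprocessed pupil traces, y: binary affect labels):
# model = build_lstm_classifier()
# model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32)
```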
3. Experiments and Results

Bias can be seen as the disparity in performance metrics across different groups for a given task. Let G = {g_1, g_2, ..., g_n} be the set of groups under investigation. For each group g_i, we compute the performance metric of a recognition model, M(g_i). The bias B for a pair of groups (g_i, g_j) is then the absolute difference in their metrics:

B(g_i, g_j) = |M(g_i) − M(g_j)|

3.1. Data

To conduct a thorough assessment of bias in pupillometry affect state recognition, we collected a dataset that encompasses pupillometry data in response to visual stimuli, taking into account a diverse range of demographics. The study involved 35 university students aged between 18 and 40 years, with a mean age of 24.6 and a standard deviation of 5.17. Participants were required to have no history of vision disorders, and they were also asked about any medications they might be taking that could affect their responses, such as depression medication. The data collected from the participants is categorized into different cases based on various demographic factors:

• Gender: This case examines the algorithm's ability to fairly recognize emotional states in females versus males.
• Ethnic Group: This case assesses the model's ability to impartially detect emotional states based on participants' ethnic groups, including Asian (Chinese), White (North American or European), Black (African American or Caribbean), and South Asian (Pakistani or Indian) [6].
• Age: This case explores the impact of age on the model's ability to detect emotional states, considering the age groups 17-24 versus 25-55.
• Iris Color: Eye color is a factor unique to models that use pupillometry data, since it affects the precision of detecting pupils and measuring their dilation and contraction. We therefore categorize the data into light (light brown, green, blue, hazel) versus dark (black, brown, dark brown) iris colors.
• Vision: This case evaluates the model's effectiveness in capturing emotional states in data from individuals wearing glasses versus those not wearing glasses.

3.2. Experimental Protocol

The proposed system was evaluated using a dataset collected at the University of Toronto. In the experiment, participants viewed a series of visual stimuli intended to elicit emotions spanning different valence and arousal values. The visual stimuli were selected from the International Affective Picture System (IAPS) dataset [23]. The IAPS database provides normative ratings of emotional valence and arousal for a large set of images. The rating scales are based on the Self-Assessment Manikin (SAM), a 9-point scale where a score of 9 represents a high rating (i.e., high pleasure, high arousal), a score of 5 indicates a neutral rating, and a score of 1 represents a low rating (i.e., low pleasure, low arousal).

The selected visual stimuli elicit the emotions of interest, covering high and low arousal (HA, LA) and high and low valence (HV, LV) in the valence-arousal (VA) dimensional model. Each of these emotional states is targeted by displaying 30 images of the same emotional target for 5 seconds each. The images were selected to statistically produce the same response for different groups of people. All images were presented on a screen with a resolution of 1920 by 1080 pixels. Following the manufacturer's recommendations, the Gazepoint eye-tracking system was placed approximately 45 cm in front of the participant at an angle of around 30 degrees. The total number of participants was 35.

The data collection process was approved by the research ethics committee at the University of Toronto. All participants signed a consent form that clearly explained the data collection procedure and the privacy of their data. Furthermore, all participants received compensation in the form of a gift card.

3.3. Metrics

In the evaluation process, two common metrics were employed: accuracy and F1 score. Accuracy gauges the proportion of correct predictions made by the algorithm. The F1 score, on the other hand, assesses the balance between precision and recall; it offers a more nuanced evaluation of the algorithm's performance, especially when dealing with imbalanced datasets.
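To make the subgroup-wise evaluation concrete, the following is a minimal sketch, under our own assumptions about variable names, of how per-group accuracy and F1 can be computed with scikit-learn and how the bias B(g_i, g_j) is obtained as the absolute gap between group metrics.

```python
from itertools import combinations
from sklearn.metrics import accuracy_score, f1_score

def group_metrics(y_true, y_pred, groups):
    """Compute accuracy and F1 separately for each demographic subgroup.

    groups: list of group labels (e.g., "Asian", "White") aligned with y_true.
    """
    metrics = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        metrics[g] = {"accuracy": accuracy_score(yt, yp),
                      "f1": f1_score(yt, yp)}
    return metrics

def bias_gaps(metrics, key="accuracy"):
    """Bias B(g_i, g_j) = |M(g_i) - M(g_j)| for every pair of groups."""
    return {(gi, gj): abs(metrics[gi][key] - metrics[gj][key])
            for gi, gj in combinations(sorted(metrics), 2)}
```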
3.4. Results from the Feature-Based Model

We employed a feature-based algorithm for emotion recognition and assessed the presence of bias among different demographic groups, focusing on valence-based and arousal-based classifications. Our evaluation yielded the results presented in Tables 1 and 2, along with Figures 2 and 3.

Notably, our findings reveal significant performance differences between males and females in both arousal and valence. Specifically, our analysis indicated that males scored 20.28% higher in arousal and 17.46% higher in valence compared to females. The F1 score exhibited a similar gender-based pattern of differences.

Further examination of the model based on ethnicity factors showed significant variations in accuracy and F1 scores across different groups. Notably, the Asian group, despite having the highest number of samples, displayed the lowest accuracy and F1 scores in terms of arousal classification. In contrast, the South Asian group, with the second-lowest number of samples, demonstrated the highest performance. The percentage difference between the highest-performing group (South Asian) and the lowest-performing group (Asian) was 28.93% in accuracy and 21% in F1 score for arousal classification. These findings suggest that obtaining accurate feature representations for the Asian group in terms of arousal classification may be more challenging based on the provided stimuli.

Regarding valence classification, our analysis revealed similar performance among the Asian, White, and South Asian groups, while the Black group exhibited significantly lower accuracy and F1 scores. Specifically, the percentage difference between the Black group and the group with the highest performance was 26.99% in accuracy and 46.71% in F1 score, respectively.

Table 1: Arousal result of SVM for the different ethnic groups.
Ethnic Group    Testing %    Accuracy    F1 Score
Asian           44.24%       51.48%      0.667
White           35.86%       67.88%      0.790
South Asian     11.78%       68.89%      0.816
Black           8.12%        64.5%       0.784

Table 2: Valence result of SVM for the different ethnic groups.
Ethnic Group    Testing %    Accuracy    F1 Score
Asian           45.28%       52.4%       0.615
White           37.47%       51.1%       0.600
South Asian     9.43%        54.3%       0.667
Black           7.82%        41.4%       0.414

Figure 2: SVM F1 results for the remaining groups.
Figure 3: SVM accuracy results for the remaining groups.

3.5. Results from the LSTM Model

We employed an LSTM-based approach to investigate bias across different demographic groups. The results of our analysis are presented in Tables 3 and 4, and in Figures 4 and 5.

Consistent with the findings of the feature-based model, we observed significant performance differences between genders and ethnic groups. Specifically, our results revealed a significant 24.12% bias toward females in arousal accuracy and a 10.15% bias toward males in valence accuracy. Concerning ethnic groups, accuracy exhibited substantial variations across different ethnicities, as depicted in Tables 3 and 4. In terms of arousal, the Asian group had the lowest performance, while the White group achieved the highest accuracy, resulting in a significant 20.90% advantage favoring the White group. The other ethnic groups showed similar performance.
In terms of valence, the Black group displayed the lowest performance, while the South Asian group achieved the highest, with a difference of 36.28%. In the remaining cases, there were no significant differences between individual groups, suggesting that these factors share common representations that can be captured by the algorithms.

Table 3: LSTM arousal result for the different ethnic groups.
Ethnic Group    Accuracy    F1 Score
Asian           52.7%       0.413
White           65.0%       0.581
South Asian     62.2%       0.546
Black           64.5%       0.506

Table 4: LSTM valence result for the different ethnic groups.
Ethnic Group    Accuracy    F1 Score
Asian           54.7%       0.404
White           50.4%       0.345
South Asian     54.3%       0.382
Black           37.9%       0.209

Figure 4: F1 results for the LSTM model for the remaining demographic groups.
Figure 5: LSTM accuracy results for the remaining demographic groups.

3.6. Bias and Fairness

Based on the results presented above, it is evident that both models exhibit significant differences in accuracy and F1 scores concerning ethnic groups and gender. This indicates that these two factors play a pivotal role in the development of affect recognition from pupillometry data, as the models struggled to find effective representations for them. In contrast, the other four cases displayed minor differences in accuracy and F1 scores, suggesting that these factors share common representations across all groups and do not adversely affect the data's quality. For example, iris color had a limited impact on recognition performance, albeit not as pronounced as gender and ethnic group.

Despite the dataset including diverse groups during model training, the quality of the representations failed to adequately capture the diverse group responses within the studied population. We acknowledge that the unbalanced number of samples in each group might contribute to the bias observed in the results. To address this potential issue, we implemented data augmentation techniques (see Section 2.4) for the non-dominant groups (groups with fewer samples) to increase their sample size; a sketch of this rebalancing step is shown below. Subsequently, we followed the same procedure as in the original case. However, our results demonstrated that even with the implementation of data augmentation, the performance did not change significantly. The bias in performance persisted in both the ethnic-group and gender-based cases, while the remaining cases exhibited similar performance.
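As a rough illustration of the rebalancing experiment described above, the sketch below oversamples the non-dominant subgroups by generating augmented copies (noise injection and time shifting, as in Section 2.4) until each subgroup approaches the size of the largest one. The augment helper (restated here for self-containment), the noise level, and the target-count heuristic are our assumptions rather than the exact procedure used in this study.

```python
import numpy as np
from collections import Counter

def augment(seq, shift=50, noise_std=0.02):
    """Noise injection and a 50-sample time shift (Section 2.4); noise level assumed."""
    return [seq + np.random.normal(0.0, noise_std, size=seq.shape),
            np.roll(seq, shift, axis=0)]

def oversample_minority_groups(X, y, groups):
    """Augment sequences of non-dominant groups until every group
    roughly matches the size of the largest group."""
    counts = Counter(groups)
    target = max(counts.values())
    X_out, y_out, g_out = list(X), list(y), list(groups)
    for g, n in counts.items():
        idx = [i for i, grp in enumerate(groups) if grp == g]
        i = 0
        while n < target and idx:
            src = idx[i % len(idx)]
            for aug in augment(X[src]):
                X_out.append(aug)
                y_out.append(y[src])
                g_out.append(g)
                n += 1
                if n >= target:
                    break
            i += 1
    return np.array(X_out), np.array(y_out), g_out
```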
4. Conclusion

In this study, we investigated the performance of feature-based and learned-based affect recognition models across various group factors, including ethnicity, gender, vision, iris color, and age, focusing on pupillometry as the modality. Our research, involving a dataset from 35 diverse participants, revealed significant gender and ethnic biases in standard affect recognition algorithms, impacting both arousal- and valence-based classifications. We also identified minor biases related to other factors, such as iris color.

These findings emphasize the potential for bias in affect recognition systems, highlighting the need for more inclusive and representative training data, rigorous fairness evaluation, and enhanced transparency in model development. Our study not only sheds light on the inherent biases in affective computing but also underscores the importance of considering demographic factors in the development of more equitable and effective affect recognition technologies, particularly given their direct relation to cognitive and mental health.

References

[1] R. Assabumrungrat, S. Sangnark, T. Charoenpattarawut, W. Polpakdee, T. Sudhawiyangkul, E. Boonchieng, T. Wilaiprasitporn, Ubiquitous affective computing: A review, IEEE Sensors Journal 22 (2021) 1867–1881.
[2] S. Greene, H. Thapliyal, A. Caban-Holt, A survey of affective computing for stress detection: Evaluating technologies in stress detection for better health, IEEE Consumer Electronics Magazine 5 (2016) 44–56.
[3] R. A. Calvo, K. Dinakar, R. Picard, P. Maes, Computing in mental health, in: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2016, pp. 3438–3445.
[4] T. Nguyen, D. Phung, B. Dao, S. Venkatesh, M. Berk, Affective and content analysis of online depression communities, IEEE Transactions on Affective Computing 5 (2014) 217–226.
[5] C. Zucco, B. Calabrese, M. Cannataro, Sentiment analysis and affective computing for depression monitoring, in: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2017, pp. 1988–1995.
[6] M. A. Kirk, B. Taha, K. Dang, H. McCague, D. Hatzinakos, J. Katz, P. Ritvo, A web-based cognitive behavioral therapy, mindfulness meditation, and yoga intervention for posttraumatic stress disorder: Single-arm experimental clinical trial, JMIR Mental Health 9 (2022) e26479.
[7] J. S. Lerner, D. Keltner, Beyond valence: Toward a model of emotion-specific influences on judgement and choice, Cognition & Emotion 14 (2000) 473–493.
[8] R. el Kaliouby, R. Picard, S. Baron-Cohen, Affective computing and autism, Annals of the New York Academy of Sciences 1093 (2006) 228–248.
[9] M. Nouman, S. Y. Khoo, M. P. Mahmud, A. Z. Kouzani, Recent advances in contactless sensing technologies for mental health monitoring, IEEE Internet of Things Journal 9 (2021) 274–297.
[10] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, ACM Computing Surveys (CSUR) 54 (2021) 1–35.
[11] E. Granholm, R. F. Asarnow, A. J. Sarkin, K. L. Dykes, Pupillary responses index cognitive resource limitations, Psychophysiology 33 (1996) 457–461.
[12] M. M. Bradley, L. Miccoli, M. A. Escrig, P. J. Lang, The pupil as a measure of emotional arousal and autonomic activation, Psychophysiology 45 (2008) 602–607.
[13] K. Yoo, J. Ahn, S.-H. Lee, The confounding effects of eye blinking on pupillometry, and their remedy, PLoS ONE 16 (2021) e0261463.
[14] S. Graur, G. Siegle, Pupillary motility: bringing neuroscience to the psychiatry clinic of the future, Current Neurology and Neuroscience Reports 13 (2013) 1–9.
[15] G. Lynch, Using pupillometry to assess the atypical pupillary light reflex and LC-NE system in ASD, Behavioral Sciences 8 (2018) 108.
[16] S. Hocker, Pupillometry for diagnosing nonconvulsive status epilepticus and assessing treatment response?, Neurocritical Care 35 (2021) 304–305.
[17] K. Yang, C. Wang, Y. Gu, Z. Sarsenbayeva, B. Tag, T. Dingler, G. Wadley, J. Goncalves, Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition, IEEE Transactions on Affective Computing (2021).
[18] H.-C. Yang, C.-C. Lee, Annotation matters: A comprehensive study on recognizing intended, self-reported, and observed emotion labels using physiology, in: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2019, pp. 1–7.
[19] M. E. Kret, E. E. Sjak-Shie, Preprocessing pupil size data: Guidelines and code, Behavior Research Methods 51 (2019) 1336–1342.
[20] B. Taha, M. Kirk, P. Ritvo, D. Hatzinakos, Detection of post-traumatic stress disorder using learned time-frequency representations from pupillometry, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3950–3954.
[21] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[22] T. Ko, V. Peddinti, D. Povey, S. Khudanpur, Audio augmentation for speech recognition, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[23] P. J. Lang, International Affective Picture System (IAPS): Affective ratings of pictures and instruction manual, Technical report, 2005.