Exploring Fusion Techniques in Multimodal AI-Based Recruitment: Insights from FairCVdb

Swati Swati 1,*, Arjun Roy 1,2 and Eirini Ntoutsi 1
1 Research Institute CODE, University of the Bundeswehr Munich, Germany
2 Institute of Computer Science, Free University Berlin, Germany

Abstract

Despite the large body of work on fairness-aware learning for individual modalities like tabular data, images, and text, less work has been done on multimodal data, which fuses various modalities for a comprehensive analysis. In this work, we investigate the fairness and bias implications of multimodal fusion techniques in the context of multimodal AI-based recruitment systems using the FairCVdb dataset. Our results show that early-fusion closely matches the ground truth for both demographics, achieving the lowest MAEs by integrating each modality's unique characteristics. In contrast, late-fusion leads to highly generalized mean scores and higher MAEs. Our findings emphasise the significant potential of early-fusion for accurate and fair applications, even in the presence of demographic biases, compared to late-fusion. Future research could explore alternative fusion strategies and incorporate modality-related fairness constraints to improve fairness. For code and additional insights, visit: https://github.com/Swati17293/Multimodal-AI-Based-Recruitment-FairCVdb

Keywords

Multimodal bias, Multimodal fairness, Algorithmic Fairness, Fairness, Early Fusion, Late Fusion

1. Introduction

The increasing popularity of decision-making algorithms has raised concerns about bias in decision-making, especially towards specific social groups defined by protected attributes such as gender and ethnicity [1]. Research on fairness-aware learning primarily focuses on individual modalities, such as tabular data [2], text [3], images [4], and graphs [5].
However, there has been less focus on bias in multimodal systems [6], which can result from integration complexity, unbalanced representation, alignment, and the compounding effect of biases present in each modality. To this end, in this work, we investigate the bias and fairness implications of multimodal AI in automated recruitment systems using the FairCVdb [7] dataset. We use it as a testbed, as it offers diverse data, including images, text, and structured data with intentionally designed gender and ethnicity biases. We focus on fusion techniques for integrating information from different modalities [8], specifically analysing early- and late-fusion techniques known for their straightforward interpretability and widespread usage in multimodal AI systems [9, 10].

EWAF'24: European Workshop on Algorithmic Fairness, July 01–03, 2024, Mainz, Germany
* Corresponding author.
Email: swati.swati@unibw.de (S. Swati); arjun.roy@unibw.de (A. Roy); eirini.ntoutsi@unibw.de (E. Ntoutsi)
ORCID: 0000-0002-7637-6640 (S. Swati); 0000-0002-4279-9442 (A. Roy); 0000-0001-5729-1003 (E. Ntoutsi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Early-fusion typically concatenates features from different modalities early on, creating a unified representation of the data [10], which simplifies training and effectively captures interactions between modalities [11]. Late-fusion, on the other hand, processes each modality individually before combining their outputs at a later stage, offering flexibility by allowing different processing pathways for individual modalities [12]. While late-fusion captures modality-specific patterns more accurately, it may overlook lower-level interactions between modalities [13].
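As a concrete illustration, the two strategies can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the feature dimensions and the random linear "predictors" are illustrative stand-ins, not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors for one resume profile
# (dimensions are illustrative, not those of the FairCVdb testbed).
tabular = rng.normal(size=7)    # seven resume attributes
textual = rng.normal(size=32)   # bio embedding
visual = rng.normal(size=64)    # face-image embedding

def linear_head(x, out_dim=1, seed=1):
    """Stand-in for any learned predictor: a fixed random linear map."""
    w = np.random.default_rng(seed).normal(size=(x.shape[-1], out_dim))
    return x @ w

# Early fusion: concatenate all features first, then apply one shared
# predictor, which can model interactions across modalities.
early_score = linear_head(np.concatenate([tabular, textual, visual]))

# Late fusion: score each modality independently, then aggregate the
# per-modality outputs (here by simple averaging).
late_score = np.mean([linear_head(tabular, seed=2),
                      linear_head(textual, seed=3),
                      linear_head(visual, seed=4)])
```

Note the structural difference: the early-fusion predictor sees all raw features jointly, while late-fusion can only mix already-made per-modality decisions, which is why an extreme modality can dominate its average.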
By investigating these two fusion strategies, we aim to gain insight into how they impact bias and fairness in automated recruitment processes.

2. Experimental Setup

Dataset: The FairCVdb dataset [7] comprises 24,000 synthetic resume profiles, each featuring demographic characteristics (gender and ethnicity), textual data (a short biography), visual data (a facial image), and tabular data (seven common resume attributes). The resume attributes include occupation, suitability, education, previous experience, recommendation, availability, and language proficiency. Each profile has been generated based on two gender categories and three ethnic categories. The profiles in the dataset are scored based on the likelihood of a candidate being invited to an interview, yielding a numerical score. These scores are assigned either blindly (i.e., without any bias), leading to bias-neutral scores, or with a penalty factor applied to specific individuals within a demographic group, resulting in biased scores. See [14] for more details. This setup simulates scenarios where cognitive biases, introduced by humans, protocols, or automated systems, influence the decision-making process.

Evaluation Metrics: Following [14], we use Mean Absolute Error (MAE) to measure prediction error and Kullback-Leibler divergence (KL) to assess demographic bias. For gender, we compare score distributions for males and females; for ethnicity, we perform pairwise comparisons and report the average divergence.

Models: We extend the testbed [14] to facilitate multimodal recruitment learning by including early-fusion and late-fusion techniques for all three modalities (textual, visual, tabular).

Simulated setups: We investigated both i) an unbiased ideal-world setup and ii) real-world setups (gender- and ethnicity-biased).

3. Evaluation Results

Figure 1 depicts the KL-divergence (KL), Mean Absolute Error (MAE), and score distributions for gender and ethnicity across different modalities and bias setups.
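The evaluation metrics can be sketched as follows; the histogram-based KL estimate, bin count, and smoothing constant are our own assumptions rather than the testbed's exact implementation.

```python
import numpy as np
from itertools import combinations

def mae(y_true, y_pred):
    """Mean Absolute Error between ground-truth and predicted scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def kl_divergence(scores_a, scores_b, bins=50, eps=1e-10):
    """KL(a || b) between two empirical score distributions, estimated
    from histograms over a shared range (binning is our assumption)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = p / p.sum() + eps  # eps avoids log(0) and division by zero
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def avg_pairwise_kl(group_scores):
    """Average pairwise divergence, as reported for the three ethnic groups."""
    return float(np.mean([kl_divergence(a, b)
                          for a, b in combinations(group_scores, 2)]))
```

For gender, `kl_divergence` is applied to the male and female score distributions directly; for ethnicity, `avg_pairwise_kl` averages over the three group pairs.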
A smaller KL-divergence indicates better alignment between distributions, implying less bias, while a lower MAE indicates a smaller margin of error. Analysis of the score distributions offers additional insights into the models' predictive performance.

Figure 1: KL-divergence between score distributions across Gender and Ethnicity demographics for different modalities and bias setups. Lower KL and MAE scores are better. [The original figure shows hiring score distributions by Gender (panels 1–2) and by Ethnicity (panels 3–4); the per-panel KL and MAE values are reproduced below.]

                  (1) Neutral           (2) Gender-Biased     (3) Neutral           (4) Ethnicity-Biased
(a) Ground-Truth  KL=0.005              KL=0.262              KL=0.007              KL=0.526
(b) Tabular       KL=0.015, MAE=0.060   KL=0.011, MAE=0.073   KL=0.014, MAE=0.060   KL=0.048, MAE=0.082
(c) Textual       KL=0.008, MAE=0.092   KL=0.829, MAE=0.079   KL=0.014, MAE=0.092   KL=0.017, MAE=0.115
(d) Visual        KL=0.012, MAE=0.100   KL=1.842, MAE=0.104   KL=0.032, MAE=0.100   KL=0.973, MAE=0.111
(e) Early-Fusion  KL=0.017, MAE=0.026   KL=0.261, MAE=0.032   KL=0.019, MAE=0.026   KL=0.174, MAE=0.059
(f) Late-Fusion   KL=0.004, MAE=0.084   KL=0.481, MAE=0.090   KL=0.011, MAE=0.084   KL=0.206, MAE=0.103

In unbiased ideal-world (Neutral): We note that the ground-truth distributions are closely aligned for both demographics (cf. Figure 1a). W.r.t. individual modalities (cf. Figures 1b–1d), we can see that the tabular modality exhibits a lower score distribution centred around a mean of 0.4 with a negatively-skewed distribution, indicating that it tends to underestimate the ground-truth. The presence of a bimodal distribution in the textual modality is especially intriguing, demonstrating its ability to differentiate between instances with high and low scores.
The visual modality, on the other hand, exhibits extreme behaviour by concentrating the distribution of nearly the entire population within a very narrow range [0.39–0.44] (cf. Figure 1d), pointing to an over-generalization of the mean score across all instances. Interestingly, late-fusion produces the least biased results for both demographics. However, while aggregating the decisions from different modalities, its average decision is affected by the extremity of the visual modality, leading to over-generalization of the mean score and, consequently, higher MAEs (cf. Figure 1f). In contrast, early-fusion delivers the most accurate predictions with the lowest MAEs (cf. Figure 1e) by effectively learning and resolving the unique peculiarities of each modality, such as underestimation, over-generalization, and bimodal distribution, resulting in a shape that resembles the ground-truth (cf. Figures 1a, 1e).

In biased real-world setups (Gender/Ethnicity-Biased): We observe that the ground-truth distributions are not aligned for either demographic (cf. Figure 1a). W.r.t. individual modalities (cf. Figures 1b–1d), we see that the tabular modality continues to exhibit underestimation across all demographics, which leads to close alignment of the demographic-specific distributions (cf. Figures 1b(2) and 1b(4)). With the textual modality, we notice a misalignment of distributions w.r.t. gender demographics, with a favourable skewness towards males. However, no such bias is observed w.r.t. ethnicity, indicating the possibility that gender-skewness is much higher than ethnicity-skewness for the job-related words in the embedding space. Conversely, the visual modality demonstrates the most extreme bias for both demographics. Regarding gender, it shows a positive bias towards males, while for ethnicity, it overgeneralizes Asians, discriminates against Blacks, and favours Caucasians.
Continuing the trend established in the neutral setup, early-fusion consistently mimics the ground-truth for both demographics, yielding the lowest MAEs while maintaining fairness. Late-fusion, also following its earlier trend, tends to over-generalize the mean score, resulting in higher MAEs and, in these setups, higher KL scores as well.

In general, leveraging multimodal data can enhance performance and mitigate bias compared to relying on a single modality. However, blindly fusing all modalities may not always yield the best results. For instance, the tabular modality in the gender-biased setup (cf. Figure 1b(2)) and the textual modality in the ethnicity-biased setup (cf. Figure 1c(4)) outperformed both fusion strategies. We hypothesise that late-fusion exacerbates biases by independently learning biased models for each modality, cumulatively impacting decision fairness, while early-fusion offers greater flexibility and generally yields fairer outcomes with lower prediction error. Dataset diversity and biases may have influenced these findings, highlighting the need to assess robustness across multiple datasets, domains, and fusion strategies. We contemplate that, in the future, exploring mid-fusion strategies could enhance fairness and accuracy in decision-making through the strategic selection and combination of modalities.

4. Conclusions

In our study, we used the FairCVdb dataset to investigate the bias implications of early- and late-fusion strategies in multimodal AI-based recruitment. We assessed biases in gender and ethnicity demographics across both unbiased (neutral) and real-world (gender/ethnicity-biased) setups. Our findings reveal that early-fusion closely mimics the ground truth for both demographics, achieving the lowest MAEs by effectively incorporating the unique characteristics of each modality. In contrast, late-fusion leads to highly over-generalized mean scores, resulting in higher MAEs.
Our evaluation underscores the significant potential of early-fusion for applications requiring both accuracy and fairness, providing robust solutions even in the presence of demographic biases. Based on the results, we speculate that mid-fusion strategies may enhance fairness and accuracy by strategically selecting and combining modalities. Exploring these findings across diverse datasets and domains beyond hiring could further broaden the study's impact and relevance.

Ethics statement: Understanding the risks of using simulated or synthetic data is crucial for fairness, transparency, and effectiveness in automated hiring processes.

Acknowledgments

This research work is funded by the European Union under the Horizon Europe MAMMOth project, Grant Agreement ID: 101070285. The UK participant in Horizon Europe Project MAMMOth is supported by UKRI grant number 10041914 (Trilateral Research LTD). The research is also supported by the EU Horizon Europe project STELAR, Grant Agreement ID: 101070122.

References

[1] M. Raghavan, S. Barocas, J. Kleinberg, K. Levy, Mitigating bias in algorithmic hiring: Evaluating claims and practices, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (ACM FAT*), 2020, pp. 469–481.
[2] T. Le Quy, A. Roy, V. Iosifidis, W. Zhang, E. Ntoutsi, A survey on datasets for fairness-aware machine learning, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12 (2022) e1452.
[3] Y. Cai, A. Zimek, G. Wunder, E. Ntoutsi, Power of explanations: Towards automatic debiasing in hate speech detection, in: 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2022, pp. 1–10.
[4] S. Fabbrizzi, S. Papadopoulos, E. Ntoutsi, I. Kompatsiaris, A survey on bias in visual datasets, Computer Vision and Image Understanding 223 (2022) 103552.
[5] S. Ghodsi, S. A. Seyedi, E. Ntoutsi, Towards cohesion-fairness harmony: Contrastive regularization in individual fair graph clustering, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Springer, 2024, pp. 284–296.
[6] B. M. Booth, L. Hickman, S. K. Subburaj, L. Tay, S. E. Woo, S. K. D'Mello, Bias and fairness in multimodal machine learning: A case study of automated video interviews, in: Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI), 2021, pp. 268–277.
[7] A. Pena, I. Serna, A. Morales, J. Fierrez, Bias in multimodal AI: Testbed for fair automatic recruitment, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (Workshop@CVPR), 2020, pp. 28–29.
[8] Z. Xue, R. Marculescu, Dynamic multimodal fusion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2574–2583.
[9] K. Gadzicki, R. Khamsehashari, C. Zetzsche, Early vs late fusion in multimodal convolutional neural networks, in: 2020 IEEE 23rd International Conference on Information Fusion (FUSION), IEEE, 2020, pp. 1–6.
[10] L. M. Pereira, A. Salazar, L. Vergara, A comparative analysis of early and late fusion for the multimodal two-class problem, IEEE Access (2023).
[11] G. Barnum, S. Talukder, Y. Yue, On the benefits of early fusion in multimodal representation learning, arXiv preprint arXiv:2011.07191 (2020).
[12] L. M. Pereira, A. Salazar, L. Vergara, On comparing early and late fusion methods, in: International Work-Conference on Artificial Neural Networks (IWANN), Springer, 2023, pp. 365–378.
[13] K. Bayoudh, R. Knani, F. Hamdaoui, A. Mtibaa, A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets, The Visual Computer 38 (2022) 2939–2970.
[14] A. Peña, I. Serna, A. Morales, J. Fierrez, A. Ortega, A. Herrarte, M. Alcantara, J. Ortega-Garcia, Human-centric multimodal machine learning: Recent advances and testbed on AI-based recruitment, SN Computer Science 4 (2023) 434.