<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>European Workshop on Algorithmic Fairness, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring Fusion Techniques in Multimodal AI-Based Recruitment: Insights from FairCVdb</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Swati Swati</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjun Roy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eirini Ntoutsi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Free University Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Institute CODE, University of the Bundeswehr Munich</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>Despite the large body of work on fairness-aware learning for individual modalities like tabular data, images, and text, less work has been done on multimodal data, which fuses various modalities for a comprehensive analysis. In this work, we investigate the fairness and bias implications of multimodal fusion techniques in the context of multimodal AI-based recruitment systems using the FairCVdb dataset. Our results show that early-fusion closely matches the ground truth for both demographics, achieving the lowest MAEs by integrating each modality's unique characteristics. In contrast, late-fusion leads to highly generalized mean scores and higher MAEs. Our findings emphasise the significant potential of early-fusion for accurate and fair applications, even in the presence of demographic biases, compared to late-fusion. Future research could explore alternative fusion strategies and incorporate modality-related fairness constraints to improve fairness. For code and additional insights, visit: https://github.com/Swati17293/Multimodal-AI-Based-Recruitment-FairCVdb</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal bias</kwd>
        <kwd>Multimodal fairness</kwd>
        <kwd>Algorithmic Fairness</kwd>
        <kwd>Fairness</kwd>
        <kwd>Early Fusion</kwd>
        <kwd>Late Fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing popularity of decision-making algorithms has raised concerns about bias in
decision-making, especially towards specific social groups defined by protected attributes
such as gender and ethnicity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Research on fairness-aware learning primarily focuses on
individual modalities, such as tabular data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], text [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], images [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and graphs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However,
there has been less focus on bias in multimodal systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which can result from integration
complexity, unbalanced representation, alignment, and the compounding effect of biases present
in each modality. To this end, we investigate the bias and fairness implications of
multimodal AI in automated recruitment systems using the FairCVdb [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] dataset. We use it as a
testbed, as it offers diverse data, including images, text, and structured data with intentionally
designed gender and ethnicity biases. We focus on fusion techniques for integrating information
from different modalities [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], specifically analysing early- and late-fusion techniques known for
their straightforward interpretability and widespread usage in multimodal AI systems [
        <xref ref-type="bibr" rid="ref9">9, 10</xref>
        ].
      </p>
      <p>Early-fusion typically concatenates features from different modalities early on, creating
a unified representation of the data [10], which simplifies training and effectively captures
interactions between modalities [11]. Late-fusion, on the other hand, processes each modality
individually before combining their outputs at a later stage, offering flexibility by allowing different
processing pathways for individual modalities [12]. While late-fusion captures modality-specific
patterns more accurately, it may overlook lower-level interactions between modalities [13]. By
investigating these two fusion strategies, we aim to gain insight into how they impact bias and
fairness in automated recruitment processes.</p>
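      <p>The contrast between the two strategies described above can be sketched in a few lines. This is a minimal illustration, not the models used in this study: the least-squares regressor, the helper names (fit_linear, predict_linear, early_fusion_predict, late_fusion_predict), the feature dimensions, and the random data are all assumptions made for the example, and late-fusion is assumed to combine outputs by an unweighted average.</p>
      <preformat>
```python
import numpy as np


def fit_linear(features, targets):
    """Fit a least-squares linear regressor with a bias term (hypothetical helper)."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def predict_linear(weights, features):
    """Apply a regressor fitted by fit_linear to new feature rows."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    return design @ weights


def early_fusion_predict(train_modalities, train_scores, test_modalities):
    """Early fusion: concatenate all modality features, then fit one model."""
    weights = fit_linear(np.hstack(train_modalities), train_scores)
    return predict_linear(weights, np.hstack(test_modalities))


def late_fusion_predict(train_modalities, train_scores, test_modalities):
    """Late fusion: fit one model per modality, then average the predictions."""
    per_modality = [
        predict_linear(fit_linear(train, train_scores), test)
        for train, test in zip(train_modalities, test_modalities)
    ]
    return np.mean(per_modality, axis=0)


# Random stand-ins for tabular, textual and visual feature matrices (assumed sizes).
rng = np.random.default_rng(0)
n_train, n_test = 200, 50
dims = (7, 16, 8)
train_modalities = [rng.normal(size=(n_train, d)) for d in dims]
test_modalities = [rng.normal(size=(n_test, d)) for d in dims]
train_scores = rng.uniform(size=n_train)

early = early_fusion_predict(train_modalities, train_scores, test_modalities)
late = late_fusion_predict(train_modalities, train_scores, test_modalities)
print(early.shape, late.shape)
```
      </preformat>
      <p>In this sketch the early-fusion model sees all features jointly and can weight them against each other, whereas the late-fusion model only combines per-modality outputs after each has been fitted in isolation.</p>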
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <p>
        Dataset: The FairCVdb dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] comprises 24,000 synthetic resume profiles, each featuring
demographic characteristics (gender and ethnicity), textual data (a short biography), visual data
(a facial image), and tabular data (seven common resume attributes). The resume attributes
include occupation, suitability, education, previous experience, recommendation, availability,
and language proficiency. Each profile has been generated based on two gender categories
and three ethnic categories. The profiles in the dataset are scored based on the likelihood of a
candidate being invited to an interview, yielding a numerical score. These scores are assigned
either blindly (i.e., without any bias), leading to bias-neutral scores, or with a penalty factor
applied to specific individuals within a demographic group, resulting in biased scores. See [14]
for more details. This setup simulates scenarios where cognitive biases, introduced by humans,
protocols, or automated systems, influence the decision-making process.
      </p>
      <p>Evaluation Metrics: Following [14], we use Mean Absolute Error (MAE) to measure prediction
error and Kullback-Leibler divergence (KL) to assess demographic bias. For gender, we compare
score distributions for males and females; for ethnicity, we perform pairwise comparisons and
report the average divergence.</p>
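      <p>As a rough illustration of the two metrics, the sketch below computes MAE and a histogram-based KL divergence between the score distributions of demographic groups. The binning (20 bins over [0, 1]), the smoothing constant, the averaging over ordered group pairs, and all function names are assumptions made for this example; the testbed in [14] may estimate the divergence differently.</p>
      <preformat>
```python
from itertools import permutations

import numpy as np


def mean_absolute_error(true_scores, predicted_scores):
    """Average absolute difference between true and predicted scores."""
    diff = np.asarray(true_scores, dtype=float) - np.asarray(predicted_scores, dtype=float)
    return float(np.mean(np.abs(diff)))


def score_histogram(scores, bins=20, eps=1e-8):
    """Normalised histogram over [0, 1]; eps keeps every bin strictly positive."""
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    probs = counts.astype(float) + eps
    return probs / probs.sum()


def kl_divergence(scores_p, scores_q, bins=20):
    """KL(P || Q) between the binned score distributions of two groups."""
    p = score_histogram(scores_p, bins)
    q = score_histogram(scores_q, bins)
    return float(np.sum(p * np.log(p / q)))


def mean_pairwise_kl(scores_by_group, bins=20):
    """Average KL over all ordered pairs of groups (assumed aggregation)."""
    values = [
        kl_divergence(scores_by_group[a], scores_by_group[b], bins)
        for a, b in permutations(scores_by_group, 2)
    ]
    return float(np.mean(values))


# Synthetic scores for illustration only; group names are placeholders.
rng = np.random.default_rng(0)
gender_scores = {
    "group_a": rng.uniform(0.2, 0.8, size=500),
    "group_b": rng.uniform(0.3, 0.9, size=500),
}
print(kl_divergence(gender_scores["group_a"], gender_scores["group_b"]))
print(mean_pairwise_kl(gender_scores))
```
      </preformat>
      <p>Because KL divergence is asymmetric, averaging over ordered pairs is only one possible convention; a single direction per pair would give different values.</p>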
      <p>Models: We extend the testbed [14] to facilitate multimodal recruitment learning by including
early-fusion and late-fusion techniques for all three modalities (textual, visual, tabular).</p>
      <p>Simulated setups: We investigate both i) an unbiased ideal-world setup and ii) biased real-world setups
(gender- and ethnicity-biased).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation Results</title>
      <p>In the unbiased ideal-world setup (Neutral): We note that the ground-truth distributions are closely
aligned for both demographics (c.f., Figure 1a). W.r.t. individual modalities (c.f., Figure 1b–1d),
we can see that the tabular modality exhibits a lower score distribution centred around a mean of
0.4 with a negatively skewed distribution, indicating that it tends to underestimate the
ground truth. The presence of a bimodal distribution in the textual modality is especially intriguing,
demonstrating its ability to differentiate between instances with high and low scores. The visual
modality, on the other hand, exhibits extreme behaviour by concentrating the distribution of
nearly the entire population within a very narrow range [0.39–0.44] (c.f., Figure 1d), pointing
to an over-generalization of the mean score to all instances. Interestingly, late-fusion produces
the least biased results for both demographics. However, while aggregating the decisions from
different modalities, its average decision is affected by the extremity of the visual modality,
leading to over-generalization of the mean score and consequently to higher MAEs (c.f.,
Figure 1f). In contrast, early-fusion delivers the most accurate predictions with the lowest MAEs
(c.f., Figure 1e) by effectively learning and resolving the unique peculiarities of each modality,
such as underestimation, over-generalization, and bimodal distribution, resulting in a shape
that resembles the ground truth (c.f., Figure 1a, 1e).</p>
      <p>[Figure 1. Panels: (a) Ground Truth, (b) Tabular, (c) Textual, (d) Visual, (e) Early Fusion, (f) Late Fusion.]</p>
      <sec id="sec-3-1">
        <title>Hiring score distribution by Gender</title>
      </sec>
      <sec id="sec-3-2">
        <title>Hiring score distribution by Ethnicity</title>
        <p>In the biased real-world setups (Gender/Ethnicity-Biased): We observe that the ground-truth
distributions are misaligned for both demographics (c.f., Figure 1a). W.r.t. individual modalities
(c.f., Figure 1b–1d), we see that the tabular modality continues to exhibit underestimation across
all demographics, which leads to close alignment of the demographic-specific distributions
(c.f., Figure 1b(2) and b(4)). With the textual modality, we notice a misalignment of the distributions
w.r.t. gender, with a favourable skew for males. However, no such bias
is observed w.r.t. ethnicity, indicating that gender skew may be much stronger
than ethnicity skew for job-related words in the embedding space. Conversely,
the visual modality demonstrates the most extreme bias for both demographics. Regarding
gender, it shows a positive bias towards males, while for ethnicity, it overgeneralizes Asians,
discriminates against Blacks, and favours Caucasians. Continuing the trend established in the
neutral setup, early-fusion consistently mimics the ground truth for both demographics, yielding
the lowest MAEs while maintaining fairness. Late-fusion, while also following its trend, tends
to over-generalize the mean score, resulting in both higher MAEs and higher KL scores.</p>
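        <p>The over-generalization attributed to late-fusion can be reproduced qualitatively with a toy calculation. The sketch below is not based on FairCVdb data: the per-modality predictions are invented stand-ins (an underestimating tabular model, a textual model that tracks the ground truth, and a visual model confined to the 0.39–0.44 range reported above), and late-fusion is assumed to be an unweighted average of the three outputs.</p>
        <preformat>
```python
import numpy as np


def mae(true_scores, predicted_scores):
    """Mean absolute error between two score arrays."""
    return float(np.mean(np.abs(true_scores - predicted_scores)))


rng = np.random.default_rng(0)
ground_truth = rng.uniform(0.0, 1.0, size=1000)  # synthetic scores, not FairCVdb

# Hypothetical per-modality predictions, chosen only to mimic the described behaviour.
tabular = np.clip(ground_truth - 0.1, 0.0, 1.0)   # systematic underestimation
textual = ground_truth.copy()                      # follows the ground truth
visual = rng.uniform(0.39, 0.44, size=1000)        # nearly constant output

# Late fusion as an unweighted average of the three decisions (assumption).
late = np.mean([tabular, textual, visual], axis=0)

print("spread of ground truth:", np.std(ground_truth))
print("spread of late fusion: ", np.std(late))
print("MAE of textual alone:  ", mae(ground_truth, textual))
print("MAE of late fusion:    ", mae(ground_truth, late))
```
        </preformat>
        <p>In this toy setting the near-constant visual output pulls every averaged score towards the same narrow band, so the fused distribution is narrower than the ground truth and its error exceeds that of the best single modality.</p>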
        <p>In general, leveraging multimodal data can enhance performance and mitigate bias compared
to relying on a single modality. However, blindly fusing all modalities may not always yield
the best results. For instance, the tabular modality in the gender-biased setup (c.f., Figure 1b(2)) and the
textual modality in the ethnicity-biased setup (c.f., Figure 1c(4)) outperformed both fusion strategies. We
hypothesise that late-fusion exacerbates biases by independently learning biased models for each
modality, cumulatively impacting decision fairness, while early-fusion offers greater flexibility
and generally yields fairer outcomes with lower prediction error. Dataset diversity and biases
may have influenced these findings, highlighting the need to assess robustness across multiple
datasets, domains, and fusion strategies. We expect that exploring mid-fusion
strategies in future work could enhance fairness and accuracy in decision-making through strategic selection
and combination of modalities.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In our study, we used the FairCVdb dataset to investigate the bias implications of early- and
late-fusion strategies in multimodal AI-based recruitment. We assessed biases in gender and
ethnicity demographics across both unbiased (neutral) and real-world (gender/ethnicity-biased)
setups. Our findings reveal that early-fusion closely mimics the ground truth for both
demographics, achieving the lowest MAEs by effectively incorporating the unique characteristics of
each modality. In contrast, late-fusion leads to highly over-generalized mean scores, resulting
in higher MAEs. Our evaluation underscores the significant potential of early-fusion for
applications requiring both accuracy and fairness, providing robust solutions even in the presence
of demographic biases. Based on the results, we speculate that mid-fusion strategies may
enhance fairness and accuracy by strategically selecting and combining modalities. Exploring
these findings across diverse datasets and domains beyond hiring could further broaden the
study’s impact and relevance.</p>
      <p>Ethics statement: Understanding the risks of using simulated
or synthetic data is crucial for fairness, transparency, and effectiveness in automated hiring
processes.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research work is funded by the European Union under the Horizon Europe MAMMOth
project, Grant Agreement ID: 101070285. UK participant in Horizon Europe Project MAMMOth
is supported by UKRI grant number 10041914 (Trilateral Research LTD). The research is also
supported by the EU Horizon Europe project STELAR, Grant Agreement ID: 101070122.</p>
      <p>
tional neural networks, in: 2020 IEEE 23rd international conference on information fusion
(FUSION), IEEE, 2020, pp. 1–6.
</p>
      <p>[10] L. M. Pereira, A. Salazar, L. Vergara, A comparative analysis of early and late fusion for the multimodal two-class problem, IEEE Access (2023).</p>
      <p>[11] G. Barnum, S. Talukder, Y. Yue, On the benefits of early fusion in multimodal representation learning, arXiv preprint arXiv:2011.07191 (2020).</p>
      <p>[12] L. M. Pereira, A. Salazar, L. Vergara, On comparing early and late fusion methods, in: International Work-Conference on Artificial Neural Networks (IWANN), Springer, 2023, pp. 365–378.</p>
      <p>[13] K. Bayoudh, R. Knani, F. Hamdaoui, A. Mtibaa, A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets, The Visual Computer 38 (2022) 2939–2970.</p>
      <p>[14] A. Peña, I. Serna, A. Morales, J. Fierrez, A. Ortega, A. Herrarte, M. Alcantara, J. Ortega-Garcia, Human-centric multimodal machine learning: Recent advances and testbed on AI-based recruitment, SN Computer Science 4 (2023) 434.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>Mitigating bias in algorithmic hiring: Evaluating claims and practices</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on fairness, accountability, and transparency (ACM FAT*)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Le Quy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Iosifidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          ,
          <article-title>A survey on datasets for fairness-aware machine learning</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>e1452</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wunder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          ,
          <article-title>Power of explanations: Towards automatic debiasing in hate speech detection</article-title>
          ,
          <source>in: 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fabbrizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>A survey on bias in visual datasets</article-title>
          ,
          <source>Computer Vision and Image Understanding</source>
          <volume>223</volume>
          (
          <year>2022</year>
          )
          <fpage>103552</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghodsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Seyedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          ,
          <article-title>Towards cohesion-fairness harmony: Contrastive regularization in individual fair graph clustering</article-title>
          ,
          <source>in: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>284</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Booth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hickman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Subburaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>D'Mello</surname>
          </string-name>
          ,
          <article-title>Bias and fairness in multimodal machine learning: A case study of automated video interviews</article-title>
          ,
          <source>in: Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>268</fpage>
          -
          <lpage>277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Peña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Serna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fierrez</surname>
          </string-name>
          ,
          <article-title>Bias in multimodal ai: Testbed for fair automatic recruitment</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (Workshop@CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marculescu</surname>
          </string-name>
          ,
          <article-title>Dynamic multimodal fusion</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>2574</fpage>
          -
          <lpage>2583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gadzicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khamsehashari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zetzsche</surname>
          </string-name>
          ,
          <article-title>Early vs late fusion in multimodal convolu-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>