<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Bologna, Italy</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>VB-Mitigator: An Open-source Framework for Evaluating and Advancing Visual Bias Mitigation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioannis Sarridis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christos Koutlis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Symeon Papadopoulos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christos Diou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics and Telematics, Harokopio University of Athens</institution>
          ,
          <addr-line>Omirou 9, Tavros, 17778, Attika</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Technologies Institute, Centre for Research and Technology Hellas</institution>
          ,
          <addr-line>6th km Charilaou-Thermi Rd, Thessaloniki, 57001</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Bias in computer vision models remains a significant challenge, often resulting in unfair, unreliable, and non-generalizable AI systems. Although research into bias mitigation has intensified, progress continues to be hindered by fragmented implementations and inconsistent evaluation practices. Disparate datasets and metrics used across studies complicate reproducibility, making it difficult to fairly assess and compare the effectiveness of various approaches. To overcome these limitations, we introduce the Visual Bias Mitigator (VB-Mitigator), an open-source framework designed to streamline the development, evaluation, and comparative analysis of visual bias mitigation techniques. VB-Mitigator offers a unified research environment encompassing 12 established mitigation methods and 7 diverse benchmark datasets. A key strength of VB-Mitigator is its extensibility, allowing for seamless integration of additional methods, datasets, metrics, and models. VB-Mitigator aims to accelerate research toward bias-aware computer vision models by serving as a foundational library for the research community to develop and assess their approaches. To this end, we also recommend best evaluation practices and provide a comprehensive performance comparison among state-of-the-art methodologies.</p>
      </abstract>
      <kwd-group>
        <kwd>AI fairness</kwd>
        <kwd>AI bias</kwd>
        <kwd>bias mitigation</kwd>
        <kwd>computer vision</kwd>
        <kwd>spurious correlations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Computer vision (CV) systems have experienced significant growth and adoption across various fields
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. CV advancements have substantially improved automation, efficiency, and accuracy in
numerous applications. However, the widespread presence of biases in CV models remains a critical and
open challenge [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8">4, 5, 6, 7, 8</xref>
        ]. These biases, often arising from imbalanced training datasets, cause models
to learn spurious correlations rather than meaningful, generalizable patterns [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. Consequently,
such models often lead to unreliable predictions, reduced generalizability, and outcomes that perpetuate
data bias and stereotypes. For instance, facial recognition systems trained on skewed demographic
distributions have exhibited racial biases, leading to harmful real-world consequences [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>Although researchers have increasingly recognized these issues and dedicated significant effort
toward bias mitigation strategies, the field suffers from fragmentation in implementation and
evaluation practices, making it challenging to fairly assess and compare the efficacy of different mitigation
approaches. This complicates reproducibility and slows the development of robust solutions.</p>
      <p>To address these challenges, we introduce the Visual Bias Mitigator (VB-Mitigator), the first
open-source library specifically created to facilitate standardized development, evaluation, and comparative
analysis of visual bias mitigation methods. VB-Mitigator provides an integrated benchmarking
environment currently supporting 12 established bias mitigation approaches, 7 commonly used datasets
— including synthetic datasets, datasets involving standard protected attributes (such as gender, age,
or race), background-related biases, and general-purpose CV datasets — and evaluation metrics to
comprehensively assess bias mitigation performance. A key advantage of VB-Mitigator is extensibility.
Its modular architecture facilitates effortless integration of new methods, datasets, evaluation metrics,
and models. Furthermore, in the context of this work, we provide an extensive comparative evaluation
of the bias mitigation methods included in VB-Mitigator.</p>
      <p>The main contributions of this work are the following:
• An open-source library designed to standardize and simplify the development, evaluation, and
comparison of visual bias mitigation methods.
• A modular and extensible library architecture that allows for the easy integration of new bias
mitigation methods, datasets, evaluation metrics, and models.</p>
      <p>• An extensive comparative evaluation of 12 established bias mitigation methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The study of bias in machine learning and computer vision has been the subject of extensive research.
Surveys such as those by Mehrabi et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Ntoutsi et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provide broad overviews of fairness
challenges across AI systems, while Fabbrizzi et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] analyze bias in visual datasets, and Ye et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
focus their survey specifically on spurious correlations. In this work, we also focus on spurious correlations,
which can either be associated with societal bias when they involve protected attributes (e.g., gender,
race, or age) or represent more generic dataset biases (e.g., the tendency of waterbird images to co-occur
with aquatic backgrounds [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]).
      </p>
      <p>
        Bias mitigation methods can be broadly categorized based on their reliance on explicit bias annotations.
Specifically, we distinguish between bias label unaware (BLU) methods, which operate without access to
information about attributes inducing spurious correlations, and bias label aware (BLA) methods, which
leverage such annotations [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. While BLA techniques often demonstrate superior performance in
controlled settings, BLU approaches exhibit broader applicability and are therefore more suitable for
real-world scenarios where bias annotations may be unavailable or unreliable [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Methods such as Group Distributionally Robust Optimization (GroupDro) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Domain Independent
(DI) [17], Entangling and Disentangling (EnD) [18], Bias Balance (BB) [19], and Bias Addition (BAdd)
[20] leverage bias annotations to guide the learning process. In particular, GroupDRO minimizes
the worst-case loss across predefined groups, ensuring robust performance on minority groups. DI
employs domain-specific classification heads to encourage domain-invariant representations. EnD
explicitly disentangles bias representations from class representations, while BB balances bias in the
logit space using prior bias knowledge. Finally, BAdd introduces bias-capturing features during training
to discourage reliance on them.
      </p>
      <p>
        A complementary line of work does not require bias labels and instead relies on bias-capturing
models or other mechanisms to identify spurious correlations. This set of methods includes Learning
from Failure (LfF) [21], Just Train Twice (JTT) [22], Soft Contrastive (SoftCon) [19], Debiasing Alternate
Networks (Debian) [23], Fairness Aware Representation Learning (FLAC) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and Mitigate Any Visual
Bias (MAVias) [16]. Specifically, LfF reweights samples based on bias-conflicting predictions, whereas
JTT focuses on misclassified samples under the assumption that they are bias-conflicting. SoftCon
leverages a weighted contrastive loss derived from bias-capturing features. Debian alternates the
training of main and auxiliary models to progressively reduce bias influence. FLAC minimizes the
mutual information between representations and biased attributes, and MAVias uses foundation models
combined with a regularization loss to infer and mitigate visual biases automatically. Some methods,
such as Spectral Decouple (SD) [24], neither rely on bias labels nor auxiliary bias models. In particular,
SD achieves bias robustness by regularizing the network logits, thereby reducing overfitting to spurious
correlations.
      </p>
      <p>Despite these advances, reproducibility and comparison across methods remain challenging.
Evaluation practices are not standardized, with each study adopting its own choice of datasets and metrics. This
fragmentation hinders systematic assessment. To address this gap, our work contributes an open-source
library that integrates a diverse set of bias mitigation methods, datasets, and evaluation metrics, under
a standardized environment for comparative evaluation and future extensions.</p>
      <p>[Figure 1: Overview of the VB-Mitigator architecture, comprising the Configuration, Launcher, Trainer, Metric, and Logger components. The BaseTrainer class exposes hooks such as _setup_optimizer(), _setup_criterion(), _setup_scheduler(), _setup_models(), _setup_device(), _setup_logger(), _save_checkpoint(), _load_checkpoint(), _method_specific_setups(), _metric_specific_setups(), _set_train(), _set_eval(), _train_epoch(), _train_iter(), _val_epoch(), _val_iter(), _log_epoch(), _update_best(), _optimizer_step(), _loss_backward(), train(), and eval().]</p>
    </sec>
    <sec id="sec-3">
      <title>3. VB-Mitigator Architecture</title>
      <p>Building an integrated environment for visual bias mitigation presents significant challenges. Existing
techniques vary widely, intervening at diverse stages of the training pipeline—from data manipulation
within data loaders and adjustments to loss functions, to the integration of external bias detection
models and complex multi-stage training protocols. Compounding this, metric implementations are
often dataset-specific, and limited to single or dual bias scenarios, hindering the creation of a general
evaluation library. VB-Mitigator directly tackles these obstacles through an abstract and modular
architecture built on PyTorch, chosen for its flexibility and robust ecosystem. By providing standardized
interfaces for datasets, models, mitigation strategies, and evaluation metrics, the library enables seamless
integration and comparison of diverse approaches. Overall, the design of VB-Mitigator incorporates the
following key properties:
• Modularity: VB-Mitigator adopts a modular architecture, where each component (i.e., datasets,
models, mitigators, evaluation metrics, and logging) is encapsulated in separate modules, which
allows for experimentation with different configurations without altering the core functionality
of the system.
• Extensibility: VB-Mitigator is designed to facilitate seamless integration of new bias mitigation
techniques, datasets, and evaluation metrics. As previously discussed, the abstract class definitions
allow for introducing new methods with minimal effort, as they only need to define the pipeline
components where their approach intervenes.
• Reproducibility: Reproducibility in experiments is one of the objectives of VB-Mitigator. To
achieve this, all operations involving stochasticity are explicitly seeded, while CUDA algorithms
are configured to operate in a deterministic mode. However, complete determinism cannot be
guaranteed, as it may be affected by differences in hardware and CUDA versions, as well as by
certain CUDA operations that inherently lack deterministic support.</p>
      <p>Datasets. The dataset component (datasets/) encapsulates PyTorch Dataset classes, engineered to
return dictionaries containing input images, targets, biases (or protected attributes), and sample indices
via the __getitem__ method. Furthermore, a builder.py module facilitates dataset construction,
generating a comprehensive dictionary that includes critical metadata such as the number of classes, a
list of protected attributes, the number of subgroups, class names, data subsets, and initial data loaders.
This metadata is essential for model initialization, metric computation, training orchestration, and
dynamic data loader updates.</p>
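To illustrate the expected interface, the toy class below is a hypothetical example (not part of VB-Mitigator, and using plain Python lists in place of image tensors) that returns the described dictionary from __getitem__:

```python
# Hypothetical toy dataset following the dictionary convention described
# above; VB-Mitigator's actual classes wrap torch.utils.data.Dataset and
# return image tensors rather than raw lists.
class ToyBiasedDataset:
    def __init__(self, images, targets, biases):
        assert len(images) == len(targets) == len(biases)
        self.images, self.targets, self.biases = images, targets, biases

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Keys mirror the described interface: inputs, targets,
        # biases (or protected attributes), and the sample index.
        return {
            "inputs": self.images[index],
            "targets": self.targets[index],
            "biases": self.biases[index],
            "index": index,
        }
```

Any class exposing these keys can then be consumed uniformly by the metric and trainer components.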
      <p>Mitigators. The mitigator component (mitigators/) in VB-Mitigator acts as the core algorithmic
engine, ofering a flexible and standardized platform for bias mitigation algorithms. The BaseTrainer
class establishes a comprehensive foundation for method implementation, defining functions for every
stage of the training pipeline, including dataset handling, model training, metric computation, and
logging (Figure 1). Additionally, the library supports method-specific configurations, facilitating the
inclusion of pre-processing steps such as bias pseudo-label generation. This modular architecture allows
each bias mitigation strategy to implement only the pipeline components where it actively intervenes,
ensuring ease of integration and maintainability.</p>
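The hook-based design can be sketched as follows; the base class here is a deliberately simplified stand-in for BaseTrainer, and the reweighting subclass is hypothetical, included only to show that a method overrides just the stage where it intervenes:

```python
# Simplified stand-in for the BaseTrainer hook pattern (not the actual
# VB-Mitigator code): the training loop calls overridable hooks, and a
# mitigation method overrides only the hooks it needs.
class SketchBaseTrainer:
    def train(self, batches):
        # Orchestration stays in the base class.
        return [self._train_iter(batch) for batch in batches]

    def _train_iter(self, batch):
        # Default behaviour: unweighted mean of per-sample losses.
        return sum(batch["losses"]) / len(batch["losses"])

class ReweightingTrainer(SketchBaseTrainer):
    def _train_iter(self, batch):
        # A reweighting method only overrides the per-iteration step,
        # inheriting everything else from the base trainer.
        total = sum(w * l for w, l in zip(batch["weights"], batch["losses"]))
        return total / sum(batch["weights"])
```

The same pattern extends to the setup and logging hooks listed in Figure 1.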
      <p>
        Models. The models component (models/) contains a diverse collection of neural network
architectures commonly used in visual bias mitigation research. It supports a range of architectures, including
lightweight Convolutional Neural Networks (CNNs) designed for small-scale datasets such as
Biased-MNIST [25], widely adopted CNN architectures like ResNets [26] and EfficientNets [27], and modern
vision transformers [
        <xref ref-type="bibr" rid="ref1">1, 28</xref>
        ] for more complex tasks. Additionally, it accommodates custom architectures
tailored to specific bias mitigation methods, ensuring flexibility for diverse experimental setups.
Metrics. The metric component (metrics/) provides a comprehensive suite of evaluation metrics,
tailored to bias assessment. Recognizing the common practice of employing multiple metrics for fairness
evaluation (e.g., worst-group accuracy alongside average group accuracy), our implementation supports
metric classes that encompass multiple measurements. Each metric class is further configured with two
attributes: (i) an indicator specifying whether the metric is error-based or higher-is-better, and (ii) a
designation of the primary evaluation metric that should be used for checkpoint selection.
Tools and utilities. This component (tools/) encompasses essential utilities for experiment
management, ensuring a streamlined workflow within the framework. It includes launcher scripts for
experiment execution and critical functionalities such as logging and model checkpointing. The logging
system supports both Wandb (https://wandb.ai/site) and TensorBoard (https://www.tensorflow.org/tensorboard)
for real-time monitoring while also generating detailed logs in human-readable format and structured
CSV files that can be used for visualization purposes. Additionally, it manages checkpointing, storing
model states at various stages, including the latest epoch, the best-performing model, and intermediate
checkpoints, enabling experiment resumption.
Configuration and execution scripts. The configuration files act as a central control mechanism,
allowing researchers to seamlessly switch between datasets, bias mitigation methods, and
hyperparameters. Given the extensive number of configurable variables, we utilize YACS
(https://github.com/rbgirshick/yacs), which provides a structured and efficient library to manage
configurations. This enhances flexibility while maintaining clarity in experimental setups. Finally, the
scripts/ directory contains a collection of shell scripts designed to simplify and automate the
execution of experiments across different bias mitigation methods and datasets.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodologies</title>
      <sec id="sec-4-1">
        <title>4.1. Preliminary Notation</title>
        <p>Here, we introduce the notation used throughout the paper to ensure consistency across all methods. Let
f(x; θ) denote a neural network parameterized by θ, which maps an input sample x ∈ 𝒳 to an output
prediction ŷ ∈ 𝒴. The dataset consists of N training samples, where each sample is associated with a
target label y ∈ 𝒴 and may also include a tuple of bias attribute labels, e.g., b = (female, black, 30) ∈
ℬ for the attributes gender, race, and age. The learned feature representation of an input is denoted
as z = h(x; θ), where h represents the feature extractor component of the model. The objective of
a vanilla model is to learn the model parameters θ that minimize the average loss over the training
samples, formed as:
min_θ (1/N) ∑_{i=1}^{N} ℒ(f(x_i; θ), y_i).</p>
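As a numerical illustration of this vanilla objective, the helper below (a hypothetical name, not a VB-Mitigator function) averages per-sample cross-entropy losses, where each entry is the model's predicted probability for the true class:

```python
import math

def vanilla_objective(correct_class_probs):
    """Average cross-entropy over N training samples, i.e. the vanilla
    objective (1/N) * sum_i L(f(x_i; theta), y_i). Each input entry is
    the predicted probability assigned to the true class of sample i."""
    n = len(correct_class_probs)
    return sum(-math.log(p) for p in correct_class_probs) / n
```

A perfectly confident, correct model yields a loss of 0; the value grows as probability mass moves away from the true classes.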
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Method Descriptions</title>
        <p>
          VB-Mitigator encompasses several approaches of both categories, featuring the following BLA methods:
GroupDro [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], DI [17], EnD [18], BB [19], and BAdd [20], as well as the following BLU methods: LfF
[21], SD [24], JTT [22], SoftCon [19], Debian [23], FLAC (or FLAC-B) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and MAVias [16]. The key
characteristics of these methods are reported in Table 1. Below, we briefly present the considered
approaches.
         </p>
         <p>Table 1: Summary of the integrated bias mitigation methods.</p>
         <p>Bias Labels / Bias-Capturing Model / Summary:
GroupDro: ✓ / ×. Minimizes worst-case loss across predefined groups.
DI: ✓ / ×. Uses domain-specific classification heads for domain invariance.
EnD: ✓ / ×. Disentangles bias representations and entangles class representations.
BB: ✓ / ×. Balances bias in the logit space using prior bias information.
BAdd: ✓ / ✓. Adds bias-capturing features to training to discourage their use.
LfF: × / ✓. Reweights samples based on bias-conflicting predictions from an auxiliary model.
SD: × / ×. Regularizes network logits for spectral decoupling and bias robustness.
JTT: × / ✓. Reweights misclassified samples, assuming they are bias-conflicting.
SoftCon: × / ✓. Uses a weighted contrastive loss based on bias-capturing features.
Debian: × / ✓. Alternates training of main and auxiliary models for debiasing.
FLAC: × / ✓. Minimizes mutual information between representations and bias-capturing features.
MAVias: × / ✓. Infers and mitigates visual biases using foundation models and regularization.</p>
        <sec id="sec-4-2-1">
          <title>Group Distributionally Robust Optimization (GroupDro).</title>
          <p>GroupDro addresses bias by
minimizing the worst-case loss across predefined groups, ensuring robustness to group-level biases. Groups are
defined as combinations of y and b. The objective is to minimize the maximum loss across groups:
min_θ max_{g∈𝒢} (1/N_g) ∑_{i: g_i=g} ℒ(f(x_i; θ), y_i),
where 𝒢 is the set of groups, N_g is the number of samples in group g, and g_i is the group assignment of
sample i.</p>
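A minimal sketch of this worst-group objective, assuming per-sample losses and group assignments are already available (the published method additionally maintains an exponential-weights distribution over groups rather than taking a hard maximum):

```python
from collections import defaultdict

def worst_group_loss(per_sample_losses, groups):
    """Average the loss within each group, then take the maximum over
    groups: a simplified sketch of the GroupDRO objective, not
    VB-Mitigator's actual implementation."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, g in zip(per_sample_losses, groups):
        sums[g] += loss
        counts[g] += 1
    # The worst (highest) per-group average loss drives the update.
    return max(sums[g] / counts[g] for g in sums)
```

Minimizing this quantity pushes the optimizer to improve whichever group currently performs worst.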
        </sec>
        <sec id="sec-4-2-2">
          <title>Domain Independent (DI).</title>
          <p>Domain Independent (DI) aims to mitigate bias by learning
representations that are invariant across different domains. Unlike traditional methods, DI employs multiple
classification heads, each corresponding to a distinct domain. For each input sample x, belonging to
domain d, the model selects and utilizes only the logits produced by the classification head associated
with domain d. Let f_d(x; θ) denote the output logits from the classification head corresponding to
domain d, where θ represents the model parameters. The objective is to minimize the domain-specific
loss while encouraging domain-invariant representations. This can be expressed as:
min_θ (1/N) ∑_{i=1}^{N} ℒ(f_{d_i}(x_i; θ), y_i).</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>Entangling and Disentangling (EnD).</title>
          <p>EnD explicitly disentangles representations of samples
sharing the same bias label and entangles representations of samples under the same class with different
bias labels. The loss function is:
min_θ ℒ(f(x; θ), y) + λ_EnD,1 R_dis + λ_EnD,2 R_ent,
where R_dis is the disentanglement loss, R_ent is the entanglement loss, and the regularization weights are
denoted as λ_EnD,1 and λ_EnD,2.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>Bias Balance (BB).</title>
          <p>BB infers the unbiased distribution from the skewed one and uses a loss
function that balances the impact of biases in the logit space, expressed as:
min_θ ℒ(f(x; θ) + p, y),
where p denotes the prior related to bias.</p>
          <p>Bias Addition (BAdd). BAdd explicitly adds bias to the training procedure and encourages the model
not to learn features related to that bias. The objective has the following form:
min_θ ℒ(f(x; θ), b, y),
where b denotes the bias features derived by a bias-capturing model.</p>
          <p>Learning from Failure (LfF). LfF employs a dual-model training paradigm, simultaneously training
a main classification model and an auxiliary bias prediction model. The auxiliary model is trained to
predict the biased attribute b. By comparing the losses of both models within each mini-batch, LfF
identifies bias-conflicting samples. These samples are subsequently re-weighted, adjusting their impact
on the primary model’s training. The optimization objective is defined as:
min_θ (1/N) ∑_{i=1}^{N} w_i ℒ(f(x_i; θ), y_i),
where w_i denotes the weight assigned to sample i.</p>
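The per-sample weight in LfF is a relative difficulty score computed from the losses of the biased auxiliary model and the main model; the sketch below assumes the published formulation w = L_biased / (L_biased + L_main):

```python
def lff_weight(loss_biased, loss_main, eps=1e-8):
    """Relative difficulty score used to reweight samples in LfF-style
    training (sketch of the published formulation, not VB-Mitigator's
    code). Bias-conflicting samples, on which the biased auxiliary
    model fails (high loss_biased), receive weights close to 1."""
    return loss_biased / (loss_biased + loss_main + eps)
```

Bias-aligned samples, which the auxiliary model fits easily, are correspondingly down-weighted.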
          <p>Spectral Decouple (SD). SD shows that adding a regularization term to the network logits leads to
a spectral decoupling that enhances the network robustness to spurious correlations. The objective can
be defined as:</p>
          <p>min_θ ℒ(f(x; θ), y) + λ_SD ‖ŷ‖²,
where λ_SD is the regularization weight. In contrast to other BLU methods, SD employs a generic
approach to bias mitigation, operating without aiming at any bias-specific inference.
Just Train Twice (JTT). JTT focuses on learning from misclassified examples, as they tend to be
bias-conflicting samples. It uses a re-weighting scheme to prioritize these examples during training
by modifying the data loaders accordingly. If we denote the re-weighted samples as x′ and the
corresponding targets as y′ with i ∈ [0, N′], then the loss is calculated as follows:
min_θ (1/N′) ∑_{i=1}^{N′} ℒ(f(x′_i; θ), y′_i).</p>
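JTT's re-weighting is typically realized by up-sampling the error set of the first training run when rebuilding the data loader; the helper below is a hypothetical sketch, with the up-sampling factor as a method hyperparameter:

```python
def jtt_upsample(indices, error_set, lambda_up=6):
    """Build the second-stage training index list for JTT-style
    training (sketch, not VB-Mitigator's implementation): samples the
    first model misclassified are repeated lambda_up times, all
    others appear once."""
    resampled = []
    for i in indices:
        repeats = lambda_up if i in error_set else 1
        resampled.extend([i] * repeats)
    return resampled
```

The resulting index list is then fed to a standard data loader, so the training loop itself stays unchanged.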
        </sec>
        <sec id="sec-4-2-5">
          <title>Soft Contrastive (SoftCon).</title>
          <p>SoftCon is a weighted SupCon [29] loss that encourages feature
similarity between samples with the same target label, weighted by the cosine distance between their features
derived by a bias-capturing model and representing the attribute introducing the spurious correlation.
The weights can be expressed as w_{i,j} = 1 − (b_i · b_j)/(‖b_i‖‖b_j‖),</p>
          <p>and the optimization objective is defined as:
min_θ ℒ(f(x; θ), w, y).</p>
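The pairwise weight is one minus the cosine similarity of the bias-capturing features; a direct stdlib sketch (hypothetical helper, not VB-Mitigator's code):

```python
import math

def softcon_weight(b_i, b_j):
    """Pairwise SoftCon weight w_ij = 1 - (b_i . b_j)/(||b_i|| ||b_j||):
    pairs with dissimilar bias features get larger weights, so the
    contrastive pull is strongest across bias groups."""
    dot = sum(p * q for p, q in zip(b_i, b_j))
    norm_i = math.sqrt(sum(p * p for p in b_i))
    norm_j = math.sqrt(sum(q * q for q in b_j))
    return 1.0 - dot / (norm_i * norm_j)
```

Identical bias features yield a weight of 0 (no pull), while orthogonal ones yield 1.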
        </sec>
        <sec id="sec-4-2-6">
          <title>Debiasing Alternate Networks (Debian).</title>
          <p>Similarly to LfF, Debian uses an auxiliary model to
encapsulate bias-related information. In particular, it employs a scheme that alternates between training
a main network and an auxiliary network to mitigate bias. The predictions of the auxiliary model are
used to assign weights to the main network’s loss. Then, similar to LfF, the objective is:
min_θ (1/N) ∑_{i=1}^{N} w_i ℒ(f(x_i; θ), y_i).</p>
          <p>Fairness Aware Representation Learning (FLAC). FLAC focuses on learning fair representations
by minimizing the dependence between features and sensitive attributes. The objective is to minimize
the mutual information between representations and sensitive attributes, defined as:
min_θ ℒ(f(x; θ), y) + λ_FLAC MI(z, a),
where λ_FLAC is a hyperparameter and MI(z, a) is the mutual information between representations and
sensitive attributes, computed without accessing the a labels. To this end, FLAC employs a bias-capturing model
trained to derive features b. In scenarios where training a dedicated bias-capturing model is impractical
(e.g., unknown biases), the biased vanilla model can be employed, resulting in the FLAC-B variant. θ
represents the learnable parameters of the main model, and z and a are vector representations of the
features and sensitive attributes, respectively.</p>
        </sec>
        <sec id="sec-4-2-7">
          <title>Mitigate Any Visual Bias (MAVias).</title>
          <p>MAVias employs a two-stage approach. First, it infers potential
visual biases by leveraging foundation models to generate descriptive tags for input images and
assess their relevance to the target class. Subsequently, these potential biases are encoded using a
vision-language model and incorporated into the training procedure as regularization, discouraging the
model from learning spurious correlations. The minimization objective is defined as:
min_{θ,ϕ} ℒ(f(x; θ), y) + λ_MAVias,1 R(x, b, λ_MAVias,2),
where ϕ represents the parameters of a projection layer that maps bias embeddings to the vision
space, R is a regularization term, and λ_MAVias,1 and λ_MAVias,2 are hyperparameters. θ represents the
learnable parameters of the main model.</p>
          <p>Note that, although different bias mitigation methods rely on distinct loss formulations, their ultimate
goal is aligned, allowing for a direct comparison of their performance on standard benchmarks in this
domain.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Datasets</title>
      <p>VB-Mitigator supports diverse datasets to evaluate bias mitigation techniques across a spectrum of
scenarios. These datasets encompass synthetic data with controlled biases, manually injected biases
in established benchmarks, and general-purpose images, facilitating a comprehensive assessment of
the compared methods. The bias types considered for each dataset are those defined by the dataset
providers and consistently adopted in the related literature.
Biased-MNIST. Biased-MNIST [25], a modified version of the MNIST dataset, introduces background
color correlations with digit labels. This dataset offers varying bias strengths, with 99%, 99.5%, 99.7%,
and 99.9% co-occurrence levels commonly used to assess bias mitigation performance.
FB-Biased MNIST. Building upon Biased-MNIST, FB-Biased MNIST [20] introduces multiple biases
by correlating both foreground and background colors with digit labels. This dataset, with suggested
co-occurrence levels of 90%, 95%, and 99%, provides a more challenging benchmark.
Biased-UTKFace. The UTKFace dataset [31], comprising facial images with age, gender, and ethnicity
annotations, serves as a foundation for Biased-UTKFace [19]. In this biased variant, gender is designated
as the target variable, while race or age act as biasing attributes, exhibiting a 90% co-occurrence with the
target. This dataset is instrumental in examining demographic biases in facial attribute classification.
Biased-CelebA. Leveraging the large-scale CelebA dataset [32], which provides annotations for
diverse facial attributes, Biased-CelebA [19] focuses on gender-related biases. Here, blonde hair or
heavy makeup are the target attributes, with gender demonstrating a 90% co-occurrence, enabling the
study of attribute-specific gender biases.</p>
      <p>
        Waterbirds. The Waterbirds dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] facilitates the investigation of spurious correlations between
bird species and their background environment. Featuring images of landbirds and waterbirds against
corresponding terrestrial or aquatic backgrounds, it presents a 95% co-occurrence between bird species
and background, serving as a benchmark for evaluating context-dependent bias mitigation.
UrbanCars. UrbanCars [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a dataset of car images, is designed to explore biases in object recognition.
It examines biases in car type classification, where background (rural or urban) and co-occurring objects
introduce biases with a 95% co-occurrence rate with the target.
      </p>
      <p>ImageNet-9. ImageNet-9 [30], a subset of ImageNet [33], focuses on nine object categories. It is used
to investigate background-related biases in large-scale object recognition, offering a more complex and
realistic evaluation scenario where biases are unknown.</p>
      <p>The progression of datasets used in visual bias mitigation reflects an increasing complexity, mirroring
the evolution of research in this domain. Early studies predominantly focused on single-attribute biases,
often utilizing synthetic or simplified datasets like Biased-MNIST. Recent research has shifted towards
exploring multi-attribute biases, as exemplified by FB-Biased MNIST and UrbanCars, where many
existing methods struggle to achieve high performance. Furthermore, the inclusion of general CV
datasets like ImageNet-9 signifies a move towards addressing the challenges of real-world scenarios,
where intersectional biases may occur. Table 2 provides an overview of the considered datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation Metrics</title>
      <p>Evaluating the efficacy of visual bias mitigation techniques necessitates careful consideration of
appropriate performance metrics. While standard accuracy, defined as:</p>
      <p>Acc = |{x ∈ 𝒯 : ŷ = y}| / |𝒯|,
remains a foundational measure, its utility is context-dependent. Even when bias is uniformly distributed
within a dataset, models may exhibit varying sensitivities to different bias attributes, rendering accuracy
on a balanced test set (such as in Biased-MNIST and FB-Biased-MNIST) insufficient to capture the
nuanced behavior of mitigation methods. On the other hand, in scenarios where biases are unknown
and test sets are debiased through augmentations, such as by removing confounding background
information in ImageNet-9, accuracy constitutes a suitable measure.</p>
      <p>Furthermore, Bias-Conflict Accuracy (BCA) is typically employed for the Biased-CelebA and
Biased-UTKFace datasets. BCA, defined as BCA = Acc(D_bc), where D_bc = {x ∈ D : the bias attribute b conflicts with the target y},
focuses on the underrepresented groups in the data. While this metric provides insights into
performance on bias-conflicting samples, it does not generalize to multiple, intersectional biases.</p>
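A minimal sketch of these two metrics in plain Python; the array names and the binary-attribute reading of "conflicts with" (bias value differs from the target label) are illustrative assumptions, not VB-Mitigator code.

```python
def accuracy(y_true, y_pred):
    # Acc = |{x in D : y_hat = y}| / |D|
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def bias_conflict_accuracy(y_true, y_pred, bias):
    # BCA: accuracy restricted to samples whose bias attribute
    # conflicts with (here: differs from) the target label.
    kept = [(t, p) for t, p, b in zip(y_true, y_pred, bias) if b != t]
    return accuracy([t for t, _ in kept], [p for _, p in kept])
```

For instance, with targets [0, 0, 1, 1], predictions [0, 1, 1, 1], and bias values [0, 1, 1, 0], overall accuracy is 0.75, while BCA, computed on the two bias-conflicting samples only, is 0.5.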
      <p>Recognizing the limitations of these metrics, we advocate for the adoption of Worst Group Accuracy
(WGA) and Average Accuracy (AvgAcc). WGA, defined as WGA = min_{g ∈ G} Acc(g), and AvgAcc,
defined as AvgAcc = (1/|G|) ∑_{g ∈ G} Acc(g), where G is the set of subgroups formed by the target-bias
attribute combinations, are more robust evaluation metrics, as they effectively capture
performance disparities across subgroups, and are therefore more suitable for datasets with complex
bias distributions and multiple bias attributes.</p>
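The two advocated metrics can be sketched as follows; a subgroup here is any observed target-bias combination, and all names are illustrative rather than VB-Mitigator's API.

```python
from collections import defaultdict

def group_accuracies(y_true, y_pred, groups):
    # Per-subgroup accuracy; groups[i] identifies the
    # (target, bias) combination of sample i.
    stats = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for t, p, g in zip(y_true, y_pred, groups):
        stats[g][0] += int(t == p)
        stats[g][1] += 1
    return {g: c / n for g, (c, n) in stats.items()}

def wga(y_true, y_pred, groups):
    # Worst Group Accuracy: the minimum subgroup accuracy.
    return min(group_accuracies(y_true, y_pred, groups).values())

def avg_acc(y_true, y_pred, groups):
    # Average Accuracy: unweighted mean over subgroups, so small
    # (bias-conflicting) groups weigh as much as large ones.
    accs = group_accuracies(y_true, y_pred, groups).values()
    return sum(accs) / len(accs)
```

Because AvgAcc averages subgroup accuracies rather than samples, it differs from plain accuracy whenever subgroup sizes are imbalanced.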
    </sec>
    <sec id="sec-7">
      <title>7. Experiments</title>
      <p>This section presents a comparative evaluation of the methods implemented in VB-Mitigator on
BiasedCelebA, Waterbirds, UrbanCars, and ImageNet9. These datasets were selected to provide a representative
evaluation across diverse bias scenarios, encompassing demographic, background scene, multi-attribute,
and unknown biases, respectively.</p>
      <sec id="sec-7-1">
        <title>7.1. Evaluation Protocol</title>
        <p>Given the suitability of WGA and AvgAcc for datasets with explicitly defined biases, as outlined in
Section 6, these have been employed for the evaluation of models on Biased-CelebA, Waterbirds, and
UrbanCars. For ImageNet9, we utilized accuracy across its seven official test set variations, which
facilitate a more thorough assessment of a model’s dependence on non-target object features. These
variations are:
• ORIGINAL: The standard ImageNet9 test set, serving as the baseline.
• ONLY-BG-B: Images where only the background is visible, with the foreground object replaced
by a black bounding box.
• ONLY-BG-T: Images where only the background is visible, with the foreground object replaced
by an inpainted bounding box.
• NO-FG: Images where the foreground object has been segmented and removed.
• ONLY-FG: Images where only the foreground object is visible, with a black background.
• MIXED-RAND: Images with the foreground object placed on a random background from a
random class.
• MIXED-NEXT: Images with the foreground object placed on a random background from the
next class in the dataset.</p>
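Under this protocol, a single trained model is simply scored on each variant in turn; the sketch below is illustrative, with `predict` and the `splits` mapping (variant name to a list of (input, label) pairs) standing in for a real model and data loaders.

```python
VARIANTS = ["ORIGINAL", "ONLY-BG-B", "ONLY-BG-T", "NO-FG",
            "ONLY-FG", "MIXED-RAND", "MIXED-NEXT"]

def evaluate_variants(predict, splits):
    # Accuracy of one model on every ImageNet-9 test variant.
    results = {}
    for name in VARIANTS:
        pairs = splits[name]
        correct = sum(1 for x, y in pairs if predict(x) == y)
        results[name] = correct / len(pairs)
    return results
```

A large gap between the ORIGINAL and MIXED-RAND/MIXED-NEXT accuracies then signals reliance on background rather than foreground features.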
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Implementation Details</title>
        <p>Below, we outline the data pre-processing, model architectures, and hyperparameters common to all
methods, unless stated otherwise.</p>
        <p>Biased-CelebA. We utilized the Biased-CelebA dataset with “blond hair” as the target attribute and
“gender” as the spuriously correlated attribute. Images were resized to 224 × 224 pixels and normalized using
ImageNet statistics. A ResNet18 architecture was employed, trained for 10 epochs with a batch size of
128. The Adam optimizer was used with an initial learning rate of 0.001, which was reduced by a factor
of 0.1 at epochs 3 and 6. A weight decay of 0.0001 was applied.</p>
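The step-decay schedule above (initial learning rate 0.001, reduced by a factor of 0.1 at epochs 3 and 6) can be written as a small helper; this is an illustrative re-implementation, not VB-Mitigator code:

```python
def lr_at_epoch(epoch, base_lr=0.001, milestones=(3, 6), gamma=0.1):
    # Multiply the learning rate by `gamma` at each passed milestone.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

Epochs 0-2 thus train at 0.001, epochs 3-5 at 0.0001, and epochs 6-9 at 0.00001.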
        <p>Waterbirds. Images were resized to 256 × 256 pixels, center-cropped to 224 × 224 pixels, and
normalized using ImageNet statistics. A ResNet50 model was trained for 100 epochs with a batch size
of 64. The Stochastic Gradient Descent (SGD) optimizer was used with a learning rate of 0.001 and a
weight decay of 0.0001.</p>
        <p>UrbanCars. Images were resized to 256 × 256 pixels, center-cropped to 224 × 224 pixels, and
subjected to random rotations (up to 45 degrees) and horizontal flips. Normalization was performed
using ImageNet statistics. A ResNet50 architecture was trained using SGD for 150 epochs with a batch
size of 64 and a weight decay of 0.0001.</p>
        <p>ImageNet9. Images were resized to 256 × 256 pixels, center-cropped to 224 × 224 pixels, and
normalized using ImageNet statistics. A ResNet50 model was trained for 30 epochs with a batch size of
64. The SGD optimizer was used with an initial learning rate of 0.001, which was reduced by a factor
of 0.5 at epoch 25.</p>
        <p>Method-specific hyperparameters were configured following the values recommended in their
respective original publications. For GroupDRO, a robust step size of 0.01 was used. In SoftCon, the
Cross-Entropy loss was weighted by 0.01. The FLAC loss-weighting parameter was set to 30,000, 10,000, 10,000, and
100 for Biased-CelebA, Waterbirds, UrbanCars, and ImageNet9, respectively. For JTT, bias-conflicting
samples were upweighted by a factor of 100, and a learning rate of 0.00001 with a weight decay of 1 was
employed. For MAVias, the two method coefficients were set to (0.01, 0.5), (0.05, 0.6), (0.01, 0.4),
and (0.001, 0.7) for Biased-CelebA, Waterbirds, UrbanCars, and ImageNet9, respectively. For SD, the
regularization coefficient was set to 0.1. For EnD, both of its two coefficients were set to 1. Experiments
were repeated for 5 random seeds on an NVIDIA A100 GPU.</p>
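For quick reference, the method-specific values listed above can be collected in one mapping; the key names are illustrative and do not reflect VB-Mitigator's actual configuration schema:

```python
METHOD_HPARAMS = {
    "GroupDRO": {"robust_step_size": 0.01},
    "SoftCon":  {"ce_weight": 0.01},
    # FLAC / MAVias strengths vary per dataset.
    "FLAC": {"Biased-CelebA": 30_000, "Waterbirds": 10_000,
             "UrbanCars": 10_000, "ImageNet9": 100},
    "JTT": {"upweight_factor": 100, "lr": 1e-5, "weight_decay": 1},
    "MAVias": {"Biased-CelebA": (0.01, 0.5), "Waterbirds": (0.05, 0.6),
               "UrbanCars": (0.01, 0.4), "ImageNet9": (0.001, 0.7)},
    "SD":  {"coefficient": 0.1},
    "EnD": {"coefficients": (1, 1)},
}
```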
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Results</title>
        <p>This section presents a comparative evaluation of bias mitigation methods across diverse datasets,
encompassing varying bias scenarios. As shown in Table 3, BLA methods, such as DI and BAdd,
generally exhibit superior WGA across all datasets, demonstrating the efficacy of leveraging explicit
bias information. However, GroupDRO, EnD, and BB display performance variability, particularly on
the UrbanCars dataset, where they all underperform. This stems from UrbanCars’ multi-attribute bias
structure, which contrasts with these methods’ design for single-attribute biases. BLU methods, while
typically achieving lower performance than BLA methods, reveal similar trends. Notably, MAVias
demonstrates consistent performance across datasets, whereas SoftCon training is unstable, likely due
to its strong dependence on the auxiliary model.</p>
        <p>For ImageNet9, where biases are inherently unknown, only BLU methods are applicable. Note
that here, we use the BLU variant of FLAC, denoted as FLAC-B, employing the vanilla model as the
bias-capturing component. As shown in Table 4, SD and MAVias showcase robust performance across
the various test set configurations of ImageNet9. Conversely, SoftCon again fails to converge.</p>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Limitations &amp; Future Work</title>
        <p>While VB-Mitigator’s architecture is designed to facilitate the seamless integration of existing visual bias
mitigation approaches, it is important to acknowledge that future methods may introduce unforeseen
integration challenges. Furthermore, it should be stressed that bias mitigation is an open research
problem. The methods currently supported by VB-Mitigator, while effective across the employed
datasets, do not guarantee the mitigation of bias in every possible data setup. It is worth noting that
at present, VB-Mitigator focuses on in-processing methods, which attract greater research interest
due to their effectiveness and broad applicability. As future work, we aim to extend its scope by also
incorporating pre-processing and post-processing methodologies. A further key direction is integrating
methods that leverage foundation models for deriving potential biases [16, 34]. Such methods will allow
for evaluating fairness in a wide range of general-purpose computer vision datasets, where bias has
not yet been explored, thereby enhancing the library’s impact towards addressing bias in
real-world applications.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>This paper introduced VB-Mitigator, an open-source library designed to support research in the emerging
area of bias in computer vision models. The fragmentation of implementation and evaluation practices
hinders progress in bias mitigation; VB-Mitigator addresses this by providing a standardized
environment that promotes reproducibility and fair comparisons. The library serves as a foundation for the
research community, enabling efficient development and assessment of bias-aware approaches.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This research was supported by the EU Horizon Europe projects MAMMOth (Grant Agreement
101070285) and ELIAS (Grant Agreement 101120237).</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini in order to improve writing style. After
using this service, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
      <p>
[16] I. Sarridis, C. Koutlis, S. Papadopoulos, C. Diou, Mavias: Mitigate any visual bias, in: Proceedings
of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[17] Z. Wang, K. Qinami, I. C. Karakozis, K. Genova, P. Nair, K. Hata, O. Russakovsky, Towards fairness
in visual recognition: Effective strategies for bias mitigation, in: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, 2020, pp. 8919–8928.
[18] E. Tartaglione, C. A. Barbano, M. Grangetto, End: Entangling and disentangling deep
representations for bias correction, in: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, 2021, pp. 13508–13517.
[19] Y. Hong, E. Yang, Unbiased classification through bias-contrastive and bias-balanced learning,</p>
      <p>Advances in Neural Information Processing Systems 34 (2021) 26449–26461.
[20] I. Sarridis, C. Koutlis, S. Papadopoulos, C. Diou, Badd: Bias mitigation through bias addition, in:
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops,
2025.
[21] J. Nam, H. Cha, S. Ahn, J. Lee, J. Shin, Learning from failure: De-biasing classifier from biased
classifier, Advances in Neural Information Processing Systems 33 (2020) 20673–20684.
[22] E. Z. Liu, B. Haghgoo, A. S. Chen, A. Raghunathan, P. W. Koh, S. Sagawa, P. Liang, C. Finn, Just
train twice: Improving group robustness without training group information, in: International
Conference on Machine Learning, PMLR, 2021, pp. 6781–6792.
[23] Z. Li, A. Hoogs, C. Xu, Discover and mitigate unknown biases with debiasing alternate networks,
in: European Conference on Computer Vision, Springer, 2022, pp. 270–288.
[24] M. Pezeshki, O. Kaba, Y. Bengio, A. C. Courville, D. Precup, G. Lajoie, Gradient starvation: A
learning proclivity in neural networks, Advances in Neural Information Processing Systems 34
(2021) 1256–1272.
[25] H. Bahng, S. Chun, S. Yun, J. Choo, S. J. Oh, Learning de-biased representations with biased
representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 528–539.
[26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[27] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in:</p>
      <p>International conference on machine learning, PMLR, 2019, pp. 6105–6114.
[28] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision
transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2021, pp. 10012–10022.
[29] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan,
Supervised contrastive learning, Advances in neural information processing systems 33 (2020)
18661–18673.
[30] K. Y. Xiao, L. Engstrom, A. Ilyas, A. Madry, Noise or signal: The role of image backgrounds in
object recognition, in: International Conference on Learning Representations, 2021.
[31] Z. Zhang, Y. Song, H. Qi, Age progression/regression by conditional adversarial autoencoder,
in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[32] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of</p>
      <p>International Conference on Computer Vision (ICCV), 2015.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image
database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp.
248–255.
[34] M. Ciranni, L. Molinaro, C. A. Barbano, A. Fiandrotti, V. Murino, V. P. Pastore, E. Tartaglione, Say
my name: a model’s bias discovery framework, arXiv preprint arXiv:2408.09570 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>A simple framework for contrastive learning of visual representations</article-title>
          ,
          <source>in: International conference on machine learning</source>
          ,
          <source>PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1597</fpage>
          -
          <lpage>1607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <article-title>Generative adversarial networks in computer vision: A survey and taxonomy</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehrabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galstyan</surname>
          </string-name>
          ,
          <article-title>A survey on bias and fairness in machine learning</article-title>
          ,
          <source>ACM computing surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fabbrizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          ,
          <article-title>A survey on bias in visual datasets</article-title>
          ,
          <source>Computer Vision and Image Understanding</source>
          <volume>223</volume>
          (
          <year>2022</year>
          )
          <fpage>103552</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>DeAndres-Tame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tolosana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Melzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vera-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rathgeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ortega-Garcia</surname>
          </string-name>
          , et al.,
          <article-title>FRCSyn challenge at CVPR 2024: Face recognition challenge in the era of synthetic data</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>3173</fpage>
          -
          <lpage>3183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fafalios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gadiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Iosifidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krasanakis</surname>
          </string-name>
          , et al.,
          <article-title>Bias in data-driven artificial intelligence systems-an introductory survey</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>e1356</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sarridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutlis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Diou</surname>
          </string-name>
          ,
          <article-title>Facex: Understanding face attribute classifiers through summary model explanations</article-title>
          ,
          <source>in: Proceedings of the 2024 International Conference on Multimedia Retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>758</fpage>
          -
          <lpage>766</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Spurious correlations in machine learning: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2402.12715</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Barbano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dufumier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tartaglione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grangetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <article-title>Unbiased supervised contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2211.05568</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Evtimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gordo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hazirbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <article-title>A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>20071</fpage>
          -
          <lpage>20082</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tolosana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vera-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rathgeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>DeAndres-Tame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ortega-Garcia</surname>
          </string-name>
          , et al.,
          <article-title>FRCSyn-onGoing: Benchmarking and comprehensive evaluation of real and synthetic data to improve face recognition systems</article-title>
          ,
          <source>Information Fusion</source>
          <volume>107</volume>
          (
          <year>2024</year>
          )
          <fpage>102322</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sarridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutlis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Diou</surname>
          </string-name>
          ,
          <article-title>Towards fair face verification: An in-depth analysis of demographic biases</article-title>
          ,
          <source>in: Proceedings of the International Workshops of ECML PKDD</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization</article-title>
          ,
          <source>arXiv preprint arXiv:1911.08731</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sarridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutlis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Diou</surname>
          </string-name>
          ,
          <article-title>Flac: Fairness-aware representation learning by suppressing attribute-class associations</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>