<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fairness, Debiasing and Privacy in Computer Vision and Medical Imaging</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlo Alberto Barbano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edouard Duchesnay</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benoit Dufumier</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pietro Gori</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Grangetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LTCI</institution>
          ,
          <addr-line>Télécom Paris, IP Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NeuroSpin, CEA, Université Paris-Saclay</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>3</volume>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Deep Learning (DL) has become one of the predominant tools for solving a variety of issues, often with superior performance compared to previous state-of-the-art methods. DL models are often able to learn meaningful and abstract representations of the underlying data; however, they have also been shown to often learn additional features in the data, which are not necessarily relevant or required for the desired task. This can pose a number of issues, as the additional features can contain bias, sensitive or private information that should not be taken into account by the model (e.g. gender, race, age, etc.). We refer to this information as collateral. The presence of collateral information translates into practical issues when deploying DL models, especially when they involve users' data. Learning robust representations which are free of biased, private, and collateral information can be very relevant for a variety of fields and applications, for example medical applications and decision support systems. In this work we present our group's activities aimed at devising methods to ensure that the representations learned by DL models are robust to collateral features and biases, and privacy-preserving with respect to sensitive information.</p>
      </abstract>
      <kwd-group>
        <kwd>Fairness</kwd>
        <kwd>Debiasing</kwd>
        <kwd>Privacy</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Representation Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        associated with the information they learn and how it is handled. Referring to all the above cases, we define as collateral any information that is not necessarily required for the desired task, but that is picked up by the model. This concept, which was conceptualized by John Dewey as Collateral Learning, describes the accidental learning that occurs in and outside the classroom [<xref ref-type="bibr" rid="ref3">3</xref>]. Based on this definition, and extending it to the deep learning context, we say that collateral learning occurs when a model learns more information than intended. In order to be robust, DL models should not be affected by the collateral learning problem.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Representation Learning</title>
        <p>
          A more thorough understanding of how deep models can learn powerful representations can certainly be helpful in all the above cases. Learning fair and robust representations of the underlying samples, especially when dealing with biased data or sensitive information, is the main objective of the activities described in this work. In recent years, the topic of representation learning has increasingly gained traction in the deep learning community. Contrastive learning has become the most widespread approach for this purpose, and many losses and frameworks have been proposed [<xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>]. Contrastive learning approaches aim at pulling representations of positive samples (e.g. of the same class) closer together, while repelling representations of negative ones (e.g. of different classes) away from each other. It has also been shown that, in a supervised setting, this kind of optimization can sometimes yield better results than standard cross-entropy [<xref ref-type="bibr" rid="ref5">5</xref>], and is also more robust against label corruption [<xref ref-type="bibr" rid="ref8">8</xref>], which can be seen as an instance of collateral features. However, a lot remains to be done on this matter, and research should focus on how to provide reliable guarantees for avoiding the learning of collateral features. Furthermore, another relevant line of research is addressing this issue from an unsupervised perspective (i.e. automatically recognizing and excluding all bias and collateral information without any prior knowledge).
        </p>
        <p>
          In summary, there is a need for a reliable way to learn robust representations which are free of biased, private and collateral information.
        </p>
      </sec>
    </sec>
    <sec id="sec-1-2">
      <title>2. Metric framework for contrastive learning</title>
      <p>
        In our research activities, we explore representation learning from a theoretical perspective. We propose a metric-learning based framework for supervised representation learning, which allows us to derive and formalize a more robust set of debiasing constraints, along with novel contrastive losses that show increased robustness compared to the current literature [<xref ref-type="bibr" rid="ref9">9</xref>]. We provide a unified framework to analyze and compare existing formulations of contrastive losses, such as the InfoNCE loss [<xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>], the InfoL1O loss [<xref ref-type="bibr" rid="ref7">7</xref>], and the SupCon loss [<xref ref-type="bibr" rid="ref5">5</xref>]. Using our proposed metric learning approach, we can reformulate each loss as a set of highly explainable metric conditions. Our analysis provides a comprehensive understanding of the different loss functions, explaining their behavior from a metric point of view. Furthermore, leveraging our metric learning approach, we investigate the issue of biased learning. We point out the limitations of the studied contrastive loss functions when dealing with biased data, especially when the loss on the training set is apparently minimized. By analyzing such cases, we provide a more formal characterization of bias, which eventually allows us to derive a new set of general regularization constraints for debiasing that can be added to any contrastive or non-contrastive loss.
      </p>
      <sec id="sec-1-3">
        <title>Fundamentals</title>
        <p>
          Let x ∈ X be an original sample (i.e., the anchor), x^+ a similar (positive) sample, x^- a dissimilar (negative) sample, and P and N the number of positive and negative samples respectively. Contrastive learning methods look for a parametric mapping function f : X → S^{d−1} that maps “semantically” similar samples close together in the representation space, a (d−1)-sphere, and dissimilar samples far away from each other. Once pre-trained, f is fixed and its representation is evaluated on a downstream task, such as classification, through linear evaluation on a test set. In general, positive samples x^+ can be defined in different ways depending on the problem: using transformations of x (unsupervised setting), samples belonging to the same class as x (supervised), or samples with image attributes similar to those of x (weakly-supervised). The definition of negative samples x^- varies accordingly. Here, we focus on the supervised case, thus samples belonging to the same/different class, but the proposed framework could easily be applied to the other cases. We define s(f(x), f(y)) as a similarity measure (e.g., cosine similarity) between the representations of two samples x and y. Please note that since ||f(x)||_2 = ||f(y)||_2 = 1, using a cosine similarity is equivalent to using an L2 distance (d(f(x), f(y)) = ||f(x) − f(y)||_2^2). Similarly to [<xref ref-type="bibr" rid="ref10 ref11">10, 11, 12, 13, 14</xref>], we propose to use a metric learning approach which allows us to better formalize recent contrastive losses, such as InfoNCE [<xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>], InfoL1O [<xref ref-type="bibr" rid="ref7">7</xref>] and SupCon [<xref ref-type="bibr" rid="ref5">5</xref>], and to derive new losses that better approximate the mutual information and can take data biases into account.
        </p>
      </sec>
      <sec id="sec-1-4">
        <title>Derivation of ε-SupInfoNCE</title>
        <p>
          Using an ε-margin metric learning point of view, probably the simplest contrastive learning formulation is looking for a mapping function f such that the following ε-condition is always satisfied:
          s(f(x), f(x^-_i)) − s(f(x), f(x^+_j)) ≤ −ε   ∀i, j   (1)
          where we denote s^-_i = s(f(x), f(x^-_i)) and s^+_j = s(f(x), f(x^+_j)), and ε ≥ 0 represents a margin between positive and negative samples, as shown in Fig. 1. The constraint of Eq. 1 can be transformed into an optimization problem using, as is common in contrastive learning, the max operator and its smooth approximation LogSumExp. This can lead to the derivation of different loss functions; some of them can be found in [9]. We propose to use the following one, which we call ε-SupInfoNCE:
          Σ_j max(−ε, {s^-_i − s^+_j}_{i=1,…,N}) ≈ Σ_j log( exp(−ε) + Σ_i exp(s^-_i − s^+_j) )
        </p>
      </sec>
      <sec id="sec-1-5">
        <title>Experiments and Results</title>
        <p>
          Results on general computer vision datasets are presented in Tab. 1, in terms of top-1 accuracy. We report the performance for the best value of ε; the complete results can be found in [<xref ref-type="bibr" rid="ref9">9</xref>]. The results are averaged across 3 trials for every configuration, and we also report the standard deviation. We obtain significant improvements with respect to all baselines and, most importantly, SupCon, on all benchmarks: on CIFAR-10 (+0.5%), on CIFAR-100 (+0.63%), and on ImageNet-100 (+1.31%). For the experiments, we use the original setup from SupCon [<xref ref-type="bibr" rid="ref5">5</xref>], employing a ResNet-50. The complete experimental setup is provided in [9].
        </p>
      </sec>
    </sec>
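    <sec id="sec-ex-1">
      <title>Worked example: computing ε-SupInfoNCE</title>
      <p>To make the objective above concrete, the following NumPy sketch evaluates the ε-SupInfoNCE value for a single anchor. It is an illustrative example, not the implementation of [9]: the function names, embedding dimension, margin and the randomly generated embeddings are all arbitrary choices for this sketch.</p>
      <preformat>
```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit sphere S^(d-1).
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def eps_sup_infonce(anchor, positives, negatives, eps=0.25):
    """epsilon-SupInfoNCE for a single anchor (illustrative sketch).

    anchor: (d,) embedding; positives: (P, d); negatives: (N, d).
    All embeddings are assumed L2-normalized, so the cosine similarity
    s(x, y) reduces to a dot product.
    """
    s_pos = positives @ anchor  # s+_j, shape (P,)
    s_neg = negatives @ anchor  # s-_i, shape (N,)
    # loss_j = -log( exp(s+_j) / (exp(s+_j - eps) + sum_i exp(s-_i)) ),
    # computed stably via logaddexp, then summed over the positives j.
    log_denom = np.logaddexp(s_pos - eps, np.log(np.sum(np.exp(s_neg))))
    return float(np.sum(-(s_pos - log_denom)))

rng = np.random.default_rng(0)
anchor = normalize(rng.normal(size=8))
pos = normalize(rng.normal(size=(3, 8)))
neg = normalize(rng.normal(size=(5, 8)))
print(eps_sup_infonce(anchor, pos, neg))
```
      </preformat>
      <p>On the unit sphere the dot product is exactly the cosine similarity, so the sketch needs no separate similarity function; note also that the loss value decreases as the margin ε grows, consistently with the shrinking term exp(s^+_j − ε) in the denominator.</p>
    </sec>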
    <sec id="sec-2">
      <title>3. Debiasing with FairKL</title>
      <p>Satisfying the ε-condition (1) can generally guarantee good downstream performance; however, it does not take into account the presence of biases (e.g. selection biases). To tackle this issue, we propose FairKL, a set of debiasing constraints that prevent the use of the bias features within the proposed metric learning approach. In order to give a more in-depth explanation of the ε-SupInfoNCE failure case, we employ the notion of bias-aligned and bias-conflicting samples as in Nam et al. [15]. In our context, a bias-aligned sample shares the same bias attribute of the anchor, while a bias-conflicting sample does not. In this work, we assume that the bias attributes are either known a priori or that they can be estimated using a bias-capturing model, such as in [16].</p>
      <p>[Fig. 1: The minimal margin ε between the distance of a positive sample x^+ (+ symbol inside) from an anchor x and the distance of the closest negative sample x^- (− symbol inside). By increasing the margin, we can achieve a better separation between positive and negative samples.]</p>
      <p>Rearranging the LogSumExp approximation, the ε-SupInfoNCE loss can be written as:
      ℒ_{ε-SupInfoNCE} = − Σ_j log [ exp(s^+_j) / (exp(s^+_j − ε) + Σ_i exp(s^-_i)) ]   (2)
      Here, we can notice that when ε = 0 we retrieve a generalization of the InfoNCE loss, whereas when ε → ∞ we obtain a generalization of the InfoL1O loss. It has been shown in [<xref ref-type="bibr" rid="ref7">7</xref>] that these two losses are the lower and upper bound of the Mutual Information I(x^+, x), respectively:
      ℒ_InfoNCE ≤ I(x^+, x) ≤ ℒ_InfoL1O   (3)
      By using a value of ε ∈ [0, ∞), one might find a tighter approximation of I(x^+, x), since the exponential function at the denominator, exp(−ε), monotonically decreases as ε increases.</p>
      <sec id="sec-2-1">
        <title>Characterization of bias</title>
        <p>We denote bias-aligned samples with ·_{,b} and bias-conflicting samples with ·_{,b'}. Given an anchor x, if the bias is “strong” and easy to learn, a positive bias-aligned sample x^+_{k,b} will probably be closer to the anchor x in the representation space than a positive bias-conflicting sample (of course, the same reasoning can be applied to the negative samples). This is why, even in the case in which the ε-condition is satisfied and the ε-SupInfoNCE loss (2) is minimized, we could still be able to distinguish between bias-aligned and bias-conflicting samples. Hence, we say that there is a bias if we can identify an ordering on the learned representations, e.g.:
        s^-_i + ε ≤ s^+_{j,b'} &lt; s^+_{k,b}   ∀i, j, k   (4)</p>
      </sec>
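      <sec id="sec-ex-2">
        <title>Worked example: the ε-condition under bias</title>
        <p>The ordering of Eq. 4 is easy to verify numerically. In the following sketch (toy similarity values chosen by hand, not learned representations), the ε-condition holds for every positive–negative pair, yet bias-aligned positives are consistently more similar to the anchor than bias-conflicting ones, so the bias remains identifiable even though the contrastive objective is satisfied:</p>
        <preformat>
```python
import numpy as np

# Toy cosine similarities in [-1, 1], chosen by hand to reproduce the
# failure case of Eq. (4).
eps = 0.1
s_neg = np.array([-0.60, -0.45, -0.52])        # s-_i
s_pos_conflicting = np.array([0.35, 0.42])     # s+_{j,b'}
s_pos_aligned = np.array([0.80, 0.88, 0.85])   # s+_{k,b}

# epsilon-condition (1), rearranged: s+_j - s-_i ≥ eps for every pair (i, j).
all_pos = np.concatenate([s_pos_conflicting, s_pos_aligned])
eps_condition = bool(np.all(all_pos[None, :] - s_neg[:, None] >= eps))

# Ordering of Eq. (4): s-_i + eps ≤ s+_{j,b'}, and every bias-conflicting
# positive stays strictly below every bias-aligned one.
ordering = bool(np.min(s_pos_conflicting) - (np.max(s_neg) + eps) >= 0.0
                and np.min(s_pos_aligned) - np.max(s_pos_conflicting) > 0.0)

print(eps_condition, ordering)  # → True True
```
        </preformat>
        <p>Both checks pass: the margin constraint is met, yet the representations still expose the bias attribute, which is precisely what the FairKL constraints below are designed to suppress.</p>
      </sec>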
      <sec id="sec-2-3">
        <title>FairKL regularization for debiasing</title>
        <p>The ordering of Eq. 4 represents the worst-case scenario, where the ordering is total (i.e., it holds ∀i, j, k). Of course, there can also be cases in which the bias is not as strong, and the ordering may only be partial.</p>
        <p>Ideally, we would enforce the condition d^+_{j,b'} − d^+_{k,b} = 0 ∀j, k, meaning that every positive bias-conflicting sample should have the same distance from the anchor as any other positive bias-aligned sample. However, in practice, this condition is very strict, as it would enforce a uniform distance among all positive samples. A more relaxed condition would instead force the distributions of distances, {d_{·,b'}} and {d_{·,b}}, to be similar. Here, we propose two new debiasing constraints, using either the first moment (mean) of the distributions or the first two moments (mean and variance). Using only the average of the distributions, we obtain:
        (1/P_{b'}) Σ_j d^+_{j,b'} − (1/P_b) Σ_k d^+_{k,b} = 0   (5)
        where P_b and P_{b'} are the number of positive bias-aligned and bias-conflicting samples, respectively (the same reasoning can be applied to the negative samples, omitted for brevity). Coincidentally, this constraint is also known as EnD [17], which we proposed in 2021. Denoting the first moments of the distance distributions with μ^+_b = (1/P_b) Σ_k d^+_{k,b} and μ^+_{b'} = (1/P_{b'}) Σ_j d^+_{j,b'}, and the second moments with σ^{+2}_b = (1/P_b) Σ_k (d^+_{k,b} − μ^+_b)^2 and σ^{+2}_{b'} = (1/P_{b'}) Σ_j (d^+_{j,b'} − μ^+_{b'})^2, and making the hypothesis that the distance distributions follow a normal distribution, we can define a new debiasing constraint ℛ_KL using, for example, the Kullback–Leibler divergence:
        (1/2) [ (σ^{+2}_b + (μ^+_b − μ^+_{b'})^2) / σ^{+2}_{b'} − log(σ^{+2}_b / σ^{+2}_{b'}) − 1 ] = 0   (6)
        The proposed debiasing constraint can easily be added to any contrastive loss using the method of Lagrange multipliers, as a regularization term. Thus, our final loss function is:
        ℒ = α ℒ_{ε-SupInfoNCE} + λ ℛ_KL   (7)
        where α and λ are positive hyperparameters.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Experiments and results</title>
        <p>We perform experiments with our proposed loss on five biased datasets: Biased-MNIST, Corrupted-CIFAR10, bFFHQ, and 9-Class ImageNet along with ImageNet-A. For brevity, in this presentation we report Biased-MNIST only; the results are reported in Tab. 2.</p>
      </sec>
      <sec id="sec-2-5">
        <title>4. Multi-site acquisition noise in brain age prediction</title>
        <p>In this section, we present our recent work in the field of MRI, focusing on brain age prediction. This is a challenging task, as it requires models able to generalize across different imaging sites. Dealing with multi-site datasets is a delicate matter in biomedical imaging in general, as the collateral noise related to the different acquisition sites often limits the generalization capability of DL models. In this context, together with our partners at Télécom Paris (IP Paris) and NeuroSpin (CEA), we have developed a novel contrastive learning loss for the regression of brain age from MRI [19], which is based on our metric learning framework. We validated it on the OpenBHB challenge [20], a recently released³ public challenge, which provides one of the largest datasets of healthy brain MRIs. Based on the framework presented in Sec. 2, we propose a novel contrastive learning regression loss for brain age prediction, achieving state-of-the-art performance on the OpenBHB challenge.</p>
      </sec>
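      <sec id="sec-ex-3">
        <title>Worked example: the FairKL constraint</title>
        <p>The constraint of Eq. 6 amounts to a closed-form KL divergence between two Gaussians fitted to the distance distributions of bias-aligned and bias-conflicting positives. A minimal NumPy sketch (function name and sample values are ours, for illustration only, not the training code of [9]):</p>
        <preformat>
```python
import numpy as np

def fairkl_penalty(d_aligned, d_conflicting, tiny=1e-8):
    """KL( N(mu_b, sigma_b^2) || N(mu_b', sigma_b'^2) ) between the
    distance distributions of bias-aligned and bias-conflicting
    positives, as in Eq. (6). `tiny` avoids division by zero."""
    mu_b, var_b = np.mean(d_aligned), np.var(d_aligned) + tiny
    mu_c, var_c = np.mean(d_conflicting), np.var(d_conflicting) + tiny
    return 0.5 * ((var_b + (mu_b - mu_c) ** 2) / var_c
                  - np.log(var_b / var_c) - 1.0)

rng = np.random.default_rng(1)
# Distances of positive samples from the anchor: bias-aligned ones end up
# much closer than bias-conflicting ones, i.e. a biased representation.
d_aligned = rng.normal(loc=0.2, scale=0.05, size=64)
d_conflicting = rng.normal(loc=0.8, scale=0.05, size=64)
print(fairkl_penalty(d_aligned, d_conflicting))  # large value: biased
print(fairkl_penalty(d_aligned, d_aligned))      # → 0.0 (identical)
```
        </preformat>
        <p>The penalty is zero when the two distributions coincide and grows with the gap between their moments, which is exactly what ℛ_KL penalizes when added to the contrastive loss in Eq. 7.</p>
      </sec>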
      <sec id="sec-2-9">
        <title>Contrastive Learning Regression Loss</title>
        <p>The notion of negative and positive samples is rooted in the contrastive learning framework. The loss formulation of Sec. 2 is thus not adapted to regression (i.e. continuous labels), as it is not possible to determine a hard boundary between positive and negative samples: all samples are somehow positive and negative at the same time. Given the continuous label y for the anchor and y_k for a sample k, one could threshold the difference Δ between y and y_k at a certain value τ in order to create positive and negative samples (i.e. k is positive if Δ(y, y_k) &lt; τ). The problem would then be how to choose τ. Differently, we propose to define a degree of “positiveness” between samples using a kernel function w_k = K(y − y_k), where 0 ≤ w_k ≤ 1. Our goal is thus to learn a parametric function f : X → S^{d−1} that maps samples with a high degree of positiveness (w_k ∼ 1) close in the latent space, and samples with a low degree (w_k ∼ 0) far away from each other. To adapt such a framework to continuous labels, we propose to use a kernel function w, and we develop multiple formulations. A first approach is to consider as “positive” only the samples that have a degree of positiveness greater than 0, and to align them with a strength proportional to the degree:
        w_k (s_t − s_k) ≤ 0   ∀t, k, t ≠ k ∈ A(i)   (8)
        From Eq. 8 we can derive the following loss, where we normalize the kernel so that the sum of the weights over all samples is equal to 1 and we denote with A(i) the indices of the samples in the minibatch distinct from i:
        ℒ = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_t exp(s_t) ]   (9)
        Interestingly, this is exactly the y-aware loss proposed in [21] for classification with weak continuous attributes.</p>
        <p>Due to the non-hard boundary between positive and negative samples, both w and s are defined over the entire minibatch. The kernel w is used to avoid aligning samples that are not similar to the anchor (i.e. w ≈ 0). It can be noted that, while the numerator aligns the sample k, in the denominator the uniformity term (as defined in [22]) focuses more on the closest samples in the representation space: this could be undesirable, as these samples might have a greater degree of positiveness than the considered k. To avoid that, we formulate a first extension (ℒ^threshold) of (8), which limits the uniformity term (i.e., the denominator) to the samples that are at least more distant from the anchor than the considered k in the kernel space (omitting the normalization in the starting condition):
        w_k (s_t − s_k) ≤ 0  if  w_t − w_k ≤ 0   ∀t, k ≠ t ∈ A(i)
        ℒ^threshold = − Σ_k (w_k / Σ_t w_t) log [ exp(s_k) / Σ_{t ≠ k, w_t &lt; w_k} exp(s_t) ]   (10)</p>
        <p>However, ℒ^threshold still focuses more on the closest sample “less positive” than k, i.e. the sample t with w_t ≤ w_k that has the highest similarity s_t. As noted in [<xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>], increasing the margin with respect to the closest “negative” sample works well for classification; however, we argue it might not be best suited for regression. For this reason, we propose a second formulation (ℒ^exp) that takes the opposite approach: instead of focusing on repelling the closest “less positive” sample, we increase the repulsion strength for samples proportionally to their distance from the anchor in the kernel space:
        w_k [(1 − w_t) s_t − s_k] ≤ 0   ∀t, k ≠ t ∈ A(i)
        ℒ^exp = − (1 / Σ_{t ∈ A(i)} w_t) Σ_k w_k log [ exp(s_k) / Σ_{t ≠ k} exp((1 − w_t) s_t) ]   (11)
        In the resulting ℒ^exp formulation, the weighting factor (1 − w_t) acts like a temperature value, giving more weight to the samples which are farther away from the anchor in the kernel space. Also, for a proper kernel choice, samples close to the anchor in the kernel space will be repelled with very low strength (∼ 0). We argue that this approach is more suited for continuous attributes (i.e., regression tasks), as it enforces that samples close in the kernel space will be close in the representation space.</p>
      </sec>
      <sec id="sec-2-12">
        <title>Results</title>
        <p>With our proposed loss, we achieve the best results (at this time) [<xref ref-type="bibr" rid="ref9">9</xref>] on the OpenBHB leaderboard, as shown in Tab. 3 (ℒ^exp). Compared to the L1 and ComBat baselines [19], we achieve a lower generalization error on unseen sites (Ext. MAE), meaning that our method is more robust to the collateral information related to the site noise. We are currently carrying out further research to gain more insight into the reasons for this behavior.</p>
        <p>³ https://baobablab.github.io/bhb/</p>
      </sec>
    </sec>
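    <sec id="sec-ex-4">
      <title>Worked example: kernel-based regression loss</title>
      <p>The kernel-based formulation can be sketched in a few lines of NumPy. The snippet below assumes a Gaussian kernel for the degree of positiveness (the kernel choice, bandwidth, and all toy values are our illustrative assumptions, not those of [19]) and evaluates an ℒ^exp-style loss (Eq. 11) for one anchor over a toy minibatch of ages:</p>
      <preformat>
```python
import numpy as np

def degree_of_positiveness(y_anchor, y, sigma=2.0):
    """Gaussian kernel w_k = K(y_anchor - y_k), bounded in [0, 1].
    The Gaussian form and bandwidth `sigma` are illustrative choices."""
    return np.exp(-((y_anchor - y) ** 2) / (2.0 * sigma ** 2))

def l_exp(sim, w):
    """Sketch of the exp-weighted regression loss (Eq. 11) for one anchor:
    repulsion strength grows with the kernel-space distance (1 - w_t)."""
    total = 0.0
    for k in range(len(sim)):
        others = np.delete(np.arange(len(sim)), k)
        denom = np.sum(np.exp((1.0 - w[others]) * sim[others]))
        total += w[k] * (sim[k] - np.log(denom))
    return -total / np.sum(w)

rng = np.random.default_rng(2)
y_anchor = 30.0
y = rng.uniform(20.0, 80.0, size=16)            # ages in the minibatch
w = degree_of_positiveness(y_anchor, y)         # w_k ~ 1 for similar ages
sim = rng.uniform(-1.0, 1.0, size=16)           # s_k = s(f(x), f(x_k))
print(l_exp(sim, w))
```
      </preformat>
      <p>For samples with w_t ∼ 1 the repulsion term exp((1 − w_t) s_t) is flat in s_t, so they are barely repelled, while samples far away in label space (w_t ∼ 0) are repelled at full strength, which is the behavior argued for above.</p>
    </sec>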
    <sec id="sec-3">
      <title>5. Privacy in deep learning</title>
      <p>We investigated the possibility of utilizing debiasing techniques also to prevent privacy leakage. In this context, we are interested in recovering some private attributes of the data, starting from the model outputs or embeddings. Such private attributes can be, in the example of natural or facial images, age, gender, race, etc. We observed that, under certain conditions, some of the debiasing approaches are also suitable for privacy preservation. We found the determining condition to be the capability of effectively suppressing the bias-related information inside the model, rather than simply re-weighting it. We show in [23] that debiasing techniques can be used for privacy preservation purposes when they allow retaining a high accuracy on the target class, while making it harder to determine the private attributes. In our work, we successfully remove collateral private information, e.g. gender or age, from the latent representations of DL models on a variety of datasets, including medical images, thus ensuring that it cannot leak from the model outputs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] EidosLab, Image processing, computer vision and virtual reality, https://eidos.di.unito.it, 2021.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] CVPL, Italian Association for Computer Vision, Pattern Recognition and Machine Learning, http://www.cvpl.it, 2021.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Dewey, Experience And Education, Free Press, 1997. URL: https://books.google.fr/books?id=UWbuAAAAMAAJ.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A Simple Framework for Contrastive Learning of Visual Representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607. URL: http://proceedings.mlr.press/v119/chen20j.html. ISSN: 2640-3498.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] P. Khosla, et al., Supervised contrastive learning, in: NeurIPS, 2020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. v. d. Oord, Y. Li, O. Vinyals, Representation Learning with Contrastive Predictive Coding, arXiv:1807.03748 [cs, stat] (2019). URL: http://arxiv.org/abs/1807.03748.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, G. Tucker, On Variational Bounds of Mutual Information, in: ICML, 2019.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] F. Graf, et al., Dissecting supervised contrastive learning, in: ICML, 2021. URL: https://proceedings.mlr.press/v139/graf21a.html.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] C. A. Barbano, B. Dufumier, E. Tartaglione, M. Grangetto, P. Gori, Unbiased supervised contrastive learning, in: The Eleventh International Conference on Learning Representations (ICLR), 2023. URL: https://openreview.net/forum?id=Ph5cJSfD2XN.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Chopra, R. Hadsell, Y. LeCun, Learning a Similarity Metric Discriminatively, with Application to Face Verification, in: CVPR, volume 1, IEEE, 2005, pp. 539–546. URL: http://ieeexplore.ieee.org/document/1467314/. doi:10.1109/CVPR.2005.202.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, in: Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://papers.nips.cc/paper/2016/hash/6b180037abbebea991d8b1232f8a8ca9-Abstract.html.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning Fine-grained Image Similarity with Deep Ranking, in: CVPR, 2014.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] X. Wang, Y. Hua, E. Kodirov, N. M. Robertson, Ranked List Loss for Deep Metric Learning, in: CVPR, 2019.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] B. Yu, D. Tao, Deep Metric Learning With Tuplet Margin Loss, in: IEEE ICCV, 2019, pp. 6489–6498.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Nam, H. Cha, S. Ahn, J. Lee, J. Shin, Learning from failure: Training debiased classifier from biased classifier, in: Advances in Neural Information Processing Systems, 2020.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Hong, E. Yang, Unbiased classification through bias-contrastive and bias-balanced learning, in: Thirty-Fifth Conference on Neural Information Processing Systems, 2021. URL: https://openreview.net/forum?id=2OqZZAqxnn.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] E. Tartaglione, C. A. Barbano, M. Grangetto, EnD: Entangling and disentangling deep representations for bias correction, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. Kim, H. Kim, K. Kim, S. Kim, J. Kim, Learning not to learn: Training deep neural networks with biased data, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] C. A. Barbano, B. Dufumier, E. Duchesnay, M. Grangetto, P. Gori, Contrastive learning for regression in multi-site brain age prediction, in: International Symposium on Biomedical Imaging (ISBI), 2023.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] B. Dufumier, et al., OpenBHB: a large-scale multi-site brain MRI data-set for age prediction and debiasing, NeuroImage (2022).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] B. Dufumier, et al., Contrastive learning with continuous proxy meta-data for 3D MRI classification, in: MICCAI, Springer, 2021.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. Wang, P. Isola, Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML (2020). URL: http://arxiv.org/abs/2005.10242.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] C. A. Barbano, E. Tartaglione, M. Grangetto, Bridging the gap between debiasing and privacy for deep learning, in: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), IEEE, 2021, pp. 3799–3808.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>