Stochastic Adversarial Gradient Embedding for Active Domain Adaptation

Victor Bouvier (1,2), Philippe Very (3), Clément Chastagnol (4), Myriam Tami (1), and Céline Hudelot (1)

1 Université Paris-Saclay, CentraleSupélec, firstname.name@centralesupelec.fr
2 Sidetrade, vbouvier@sidetrade.com
3 Talan, philippe.very@talan.com
4 IQVIA, clement.chastagnol@iqvia.com

Abstract. Unsupervised Domain Adaptation (UDA) bridges the gap between a labelled source domain and an unlabelled target domain. In this paper, we improve adaptation by guiding the model with actively annotated target data. This problem, named Active Domain Adaptation (ADA), is of practical interest since in many applications it is possible to annotate a small budget of target data. We introduce Stochastic adversarial gradient embedding (Sage), an embedding for estimating the impact of annotating a target sample on adaptation. Sage measures the variation of the transferability loss gradient before and after annotation. Additionally, we investigate various procedures for incorporating a small subset of labelled target samples when learning domain invariant representations. Our experiments on challenging benchmarks demonstrate that a small effort of active annotation with Sage improves adaptation substantially. Importantly, with a comparable labelling budget, Sage performs better than its semi-supervised counterpart while relying on more realistic assumptions.

Keywords: Domain Adaptation · Active Learning · Invariant Representations.

© 2021 for this paper by its authors. Use permitted under CC BY 4.0.

1 Introduction

When provided with a large amount of labelled data, deep neural networks have dramatically improved the state-of-the-art in various applications [19]. However, when deployed in the real world, where data may slightly differ from the training data, deep models often fail to generalize out of the training distribution [5]. Nevertheless, deep nets can learn data representations that transfer to new tasks or new domains if some labelled data from the new distribution are available [36]. Acquiring a sufficient amount of labelled data is often impossible, and large-scale annotation is cost-prohibitive. In contrast, unlabelled data are much more convenient to obtain. This observation has motivated the field of Unsupervised Domain Adaptation (UDA) [23,26], which bridges the gap between a labelled source domain and an unlabelled target domain.

[Figure 1: two panels, before and after annotation, showing labelled source samples, unlabelled target samples, poorly aligned target samples 1-3, the per-class gradients induced by the annotation, and the resulting decision boundary update.]
Fig. 1. Effect of annotating a target sample selected by Sage (best viewed in colors). Binary classification problem (• vs. ⋆) where source samples are blue and target samples are orange. Before annotation, the class-level alignment is not satisfactory, leading to potential negative transfer (poorly aligned target samples tagged as 1, 2 and 3). We estimate which sample should be annotated first by measuring the variation of the representations' transferability gradient before and after annotation. The highest variation is obtained for target sample 3, which is sent to an oracle.
The oracle returns class ⋆, validating the suspicion of negative transfer. This leads to an update of the decision boundary, which pushes samples 1, 2 and 3 into class ⋆, resulting in a better class-level alignment of representations.

Learning domain invariant representations has led to significant progress in UDA [13,21,22]. By fooling a discriminator trained to separate the source from the target domain, the feature extractor removes domain-specific information from the representations. A classifier trained on those representations with source labelled data is then expected to perform reasonably well in the target domain [6].

However, those methods perform significantly worse than their fully supervised counterparts. To address this gap, Semi-Supervised Domain Adaptation (SSDA) has been studied in [31] through a Mini-Max Entropy objective (MME). Nevertheless, by assuming that each class is represented by at least one labelled target sample, and thus involving information about target labels, SSDA is built on assumptions that are unlikely to be met in practice. A more realistic scenario is to guide adaptation by selecting, from a pool of unlabelled target instances, which samples to annotate. This paradigm, referred to as Active Domain Adaptation (ADA), is often encountered in real-world applications. To our knowledge, only a few prior works address ADA [9,27,30,34]. In particular, the recent work of Su et al. [34] is the first that uses domain adversarial learning for Active Learning (AL).

In this paper, we address ADA by reserving the annotation budget for target samples whose annotations are likely to guide adaptation. In contrast to [34], which selects a diverse set of poorly adapted target samples based on a classical criterion of uncertainty, we estimate the impact of annotation on the representations' transferability. To this purpose, we introduce Stochastic adversarial gradient embedding (Sage), an embedding of target samples whose norm estimates precisely this impact. Our approach also promotes diversity in the annotation: following [3], we select target samples for which Sage spans diverse directions, using the k-means++ initialization [1]. Since access to some labelled data from the target domain brings us back to SSDA, we also investigate the role of MME in this context.

We organize the rest of the paper as follows. First, we provide a brief overview of Domain Adversarial Learning for UDA; importantly, we expose a soft-class conditioning adversarial loss, which reflects the transferability error of domain invariant representations [7]. Second, we present the details of Sage. Third, we provide theoretical insights showing that AL can improve representations' transferability. Finally, we conduct an empirical study on several benchmarks that supports our ADA approach.

2 Background

Notations. Let us consider three random variables: the input data X, the representations Z and the labels Y, defined on the spaces 𝒳, 𝒵 ⊂ R^m (where m is the dimension of the representations) and 𝒴 with |𝒴| = C for some integer C, respectively. We note realizations with lower cases x, z and y. Those random variables may be sampled from two different distributions: the source distribution p_S, i.e., the data on which the model is trained, and the target distribution p_T, i.e., the data on which the model is evaluated. Labels are one-hot encoded, i.e., y ∈ [0,1]^C with Σ_c y_c = 1, where C is the number of classes.
We use the index notation S and T to differentiate source and target quantities. We define the hypothesis class H as a subset of functions from 𝒳 to 𝒴 which are the composition of a representation class Φ (mappings from 𝒳 to 𝒵) and a classifier class F (mappings from 𝒵 to 𝒴), i.e., h := fϕ := f ∘ ϕ ∈ H where f ∈ F and ϕ ∈ Φ. For D ∈ {S, T} and a hypothesis h ∈ H, we introduce the error in domain D, ε_D(h) := E_D[ℓ(h(X), Y)], where ℓ is the L2 loss ℓ(y, y') = ||y − y'||_2 and h(x)_c is the probability of x belonging to class c. We note the source domain data (x_i^S, y_i^S)_{1≤i≤n_S} and the target domain data (x_j^T)_{1≤j≤n_T}.

Domain Adversarial Learning. The seminal works from [13,21], and their theoretical ground [6], have led to a wide variety of methods based on domain invariant representations [22,20,10]. A representation ϕ and a classifier f are learned by achieving a trade-off between the source classification error and the domain invariance of representations, obtained by fooling a discriminator trained to separate the source from the target domain:

    L(ϕ, f) := L_S(ϕ, f) − λ · inf_{d∈D} L_INV(ϕ, d)    (1)

where L_S(ϕ, f) := E_S[−Y · log(fϕ(X))] is the cross-entropy loss in the source domain, L_INV(ϕ, d) := E_S[log(1 − d(ϕ(X)))] + E_T[log(d(ϕ(X)))] is the adversarial loss, and D is the set of discriminators, i.e., mappings from 𝒵 to [0, 1]. In practice, inf_{d∈D} is approximated using a Gradient Reversal Layer [13], as sketched below.

Transferability loss for class-level invariance. Promoting class-level domain invariance improves the transferability of representations [22]. Recently, the work [7] introduced the transferability loss, noted L_TSF, which adds class-conditioning to the adversarial loss by computing a scalar product between the labels y and a class-level discriminator d defined as a mapping from 𝒵 to [0, 1]^C. Since labels are not available in the target domain at train time, predicted labels ŷ := fϕ(x) are used. This approach is referred to as soft-class conditioning:

    L(ϕ, f) := L_S(ϕ, f) − λ · inf_{d∈D} L_TSF(ϕ, ŷ, d)    (2)

where L_TSF(ϕ, ŷ, d) := E_S[Y · log(1 − d(ϕ(X)))] + E_T[Ŷ · log(d(ϕ(X)))] is the transferability loss and D is the set of class-level discriminators, i.e., mappings from 𝒵 to [0, 1]^C. In this work, we explore the role of active annotation of a small subset of the target domain in improving the transferability of representations. Methods that use L_TSF as adaptation loss are flagged as TSF.
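To make Eq. (2) concrete, here is a minimal PyTorch sketch of the soft-class-conditioned transferability loss, with the inf over d approximated by a Gradient Reversal Layer as mentioned above. This is our illustration, not the authors' released code: the class-level discriminator module and all names are assumptions, and the sign conventions follow standard domain-adversarial practice (the discriminator is trained to separate the domains class-wise, while the reversed gradients push ϕ towards class-level invariance).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient Reversal Layer [13]: identity in the forward pass,
    gradient multiplied by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def transferability_loss(z_s, y_s, z_t, y_hat_t, discriminator, lambd=1.0):
    """Batch estimate of L_TSF in Eq. (2).

    z_s, z_t      : source / target representations, shape (batch, m)
    y_s           : one-hot source labels, shape (batch, C)
    y_hat_t       : predicted target labels f(phi(x)), shape (batch, C)
    discriminator : class-level discriminator, maps (batch, m) to [0, 1]^C
    """
    eps = 1e-7  # numerical stabilizer (ours)
    # The reversal layer lets a single backward pass train d to separate
    # the domains class-wise while updating phi adversarially, which
    # approximates the inf over d in Eq. (2).
    d_s = discriminator(GradReverse.apply(z_s, lambd))
    d_t = discriminator(GradReverse.apply(z_t, lambd))
    loss_s = -(y_s * torch.log(1.0 - d_s + eps)).sum(dim=1).mean()
    loss_t = -(y_hat_t * torch.log(d_t + eps)).sum(dim=1).mean()
    return loss_s + loss_t
```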
3 Proposed Method

3.1 Motivations

Gradient-based selection, as shown in Badge [3], is promising in AL. In contrast to Badge, which focuses on the network's predictions, we focus on the representations' transferability. To this purpose, we introduce, in the following, the adversarial gradient, which reflects the lack of transferability of a target sample. From this gradient, we derive a query that efficiently incorporates the domain shift problem into ADA. Let x ∼ p_T be a target sample with representation z := ϕ(x) ∈ R^m; we start by describing the effect of annotating x on the gradient descent update of (2). We define the adversarial gradient g_x of x as the gradient of the discriminator loss w.r.t. the representation z:

    g_x := −∂ log(d(z)) / ∂z ∈ R^{C×m},  where d(z) ∈ [0, 1]^C    (3)

Following the expression of the transferability loss L_TSF, the contribution of a sample x to the gradient update of (2), before and after its annotation, is:

    Before annotation:  θ ← θ − α · (∂z/∂θ)(ŷ · g_x)
    After annotation:   θ ← θ − α · (∂z/∂θ)(y · g_x),  y ∼ Oracle(x)

where ∂z/∂θ is the Jacobian of the representations with respect to the deep network parameters θ (i.e., z := ϕ_θ(x)), ŷ := fϕ_θ(x) is the current label estimate and α is a scaling parameter. Before annotation, the gradient vector is a weighted sum of the rows of g_x, i.e., ŷ · g_x ∈ R^m, reflecting the class probabilities of x. Annotating the sample x has the effect of setting, once and for all, the direction of the gradient (y · g_x). Based on this observation, we can measure the annotation procedure's ability to learn more transferable representations by its tendency to change the path of gradient descent, i.e., how y · g_x may differ from ŷ · g_x.

3.2 Positive Orthogonal Projection (POP)

In the rest of the paper, we consider g_x ∈ R^{C×m} as a stochastic vector of R^m with realizations lying in G_x := {g_x^1, ..., g_x^C}, where g_x^c = (−∂ log(d(z))/∂z)_c. When the label is provided by an oracle, i.e., y ∼ Oracle(x), we obtain g_x^y ∈ G_x, a realization of g_x. Before annotation, the direction of the gradient is the mean of g_x, where G_x is provided with the class probabilities given by the classifier's output h(x). More precisely, the probability of observing g_x^c is h(x)_c; the mean of g_x, noted E_h[g_x], is then defined as:

    E_h[g_x] := E_{y∼h(x)}[g_x^y] = h(x) · g_x ∈ R^m    (4)

The tendency to modify the direction of the gradient is therefore reflected by a high discrepancy between E_h[g_x] and g_x^y for y ∼ Oracle(x). To quantify this discrepancy, we consider variations in both direction and magnitude. To find a good trade-off between these two requirements, we remove the mean gradient direction E_h[g_x] from g_x by computing a Positive Orthogonal Projection (POP), noting λ := |g_x · E_h[g_x]| / ||E_h[g_x]||²:

    g̃_x := g_x − λ E_h[g_x]    (5)

We motivate the use of |g_x · E_h[g_x]| rather than g_x · E_h[g_x], which would yield the standard orthogonal projection. On the one hand, if the annotation provides a gradient with the same direction as the expected gradient, i.e., the annotation reinforces the prediction, g̃_x is null. On the other hand, if the annotation provides a gradient with a direction opposite to the expected gradient, i.e., the annotation contradicts the prediction, the norm of g̃_x increases. Therefore, the target samples x for which we expect the highest impact on transferability are those with the highest norm of g̃_x. Since λ involves an absolute value, we refer to this as a positive orthogonal projection. An illustration is provided in Figure 2. Since g̃_x is stochastic, we need additional tools to properly define a norm operator on it.
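A minimal sketch (ours, not the authors' code) of Eqs. (3)-(5): the adversarial gradient is obtained row by row with autograd, then the positive orthogonal projection removes the expected direction from each realization. It assumes a discriminator module mapping a single representation z ∈ R^m to [0, 1]^C; the small stabilizing constants are ours.

```python
import torch

def adversarial_gradient(z, discriminator):
    """g_x of Eq. (3): the C x m Jacobian of -log d(z) w.r.t. z, one row per class."""
    z = z.detach().requires_grad_(True)
    log_d = torch.log(discriminator(z) + 1e-7)          # shape (C,)
    rows = []
    for c in range(log_d.shape[0]):
        (grad_c,) = torch.autograd.grad(log_d[c], z, retain_graph=True)
        rows.append(-grad_c)
    return torch.stack(rows)                            # shape (C, m)

def positive_orthogonal_projection(g, probs):
    """POP of Eqs. (4)-(5), applied to each realization g_x^c.

    g     : (C, m) stochastic adversarial gradient (one row per possible label)
    probs : (C,)  class probabilities h(x)
    """
    mean_g = probs @ g                                  # E_h[g_x] of Eq. (4), shape (m,)
    # lambda_c = |g_x^c . E_h[g_x]| / ||E_h[g_x]||^2, as in Eq. (5)
    lam = (g @ mean_g).abs() / (mean_g @ mean_g + 1e-12)
    return g - lam[:, None] * mean_g[None, :]           # g_tilde, shape (C, m)
```

One can check the two limit cases discussed above: if a row g_x^c equals c·E_h[g_x] with c > 0, the projection returns zero; if it equals −c·E_h[g_x], the returned row has twice its norm.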
3.3 Stochastic Adversarial Gradient Embedding

It seems natural to quantify the norm of the stochastic vector g̃_x as the square root of the mean of g̃_x's squared norm: ||g̃_x||_h := (E_{y∼h(x)} ||g̃_x^y||²)^{1/2}. However, given x_1 and x_2, how can we quantify the discrepancy between g_{x_1} and g_{x_2}? The difficulty results from the fact that h(x_1) ≠ h(x_2) in general. Simply using (E_{y_1∼h(x_1), y_2∼h(x_2)} ||g_{x_1}^{y_1} − g_{x_2}^{y_2}||²)^{1/2} leads to an operator that returns a non-null discrepancy between x and itself whenever h(x) is not a one-hot vector. To address this issue, we suggest embedding x through a mapping S named Stochastic adversarial gradient embedding (Sage):

    S(x) := (√h(x)_1 g̃_x^1, ..., √h(x)_C g̃_x^C) ∈ R^{C×m}    (6)

By choosing the √h scaling, we guarantee that ||S(x)|| = ||g̃_x||_h while obtaining a proper discrepancy ||S(x_1) − S(x_2)|| between g_{x_1} and g_{x_2}. Crucially, neither the norm nor the distance computed on Sage involves the target labels, making them suitable for UDA, where target labels are unknown. An illustration of Sage is provided in Figure 2.

[Figure 2: two panels illustrating (a) the Sage embedding and (b) a poor local minimum of the transferability loss.]
Fig. 2. (a) Visualisation of S(x) = (√p_1 g̃_x^1, √p_2 g̃_x^2) for ŷ = (p_1, p_2). Here g̃_x^2 ⊥ (ŷ · g_x) since ŷ · g_x and g_x^2 have a similar direction, while |g̃_x^1 · g_x| ≥ |g_x^1 · g_x| since g_x^1 has a component in the direction opposite to ŷ · g_x. (b) Illustration of a case (ŷ = (0.25, 0.25, 0.5)) where the transferability loss is close to a local minimum (ŷ · g_x ≈ 0) but the stochastic gradients g_x^y, y ∈ {1, 2, 3}, have a high norm. Here, the annotation chooses one of the gradients, resulting in a strong update of the model.

3.4 Increasing Diversity of Sage (k-means++)

As aforementioned, the higher the norm ||S(x)||, the greater the expected impact of annotating sample x on the transferability of representations. A naive annotation strategy would be to rank target samples by their Sage norm ||S(x)||. Its drawback is to acquire labels for a non-IID batch from the target distribution, a problem referred to as the challenge of diversity in AL [33]. In certain pathological cases (e.g., the selection of very similar samples or of samples from the same class), the IID violation may degrade the performance in the target domain. To label useful target samples (i.e., with high ||S(x)||) while acquiring a representative batch of the target distribution, we follow [3] and select samples with high ||S(x)|| that span diverse directions. This is performed using the k-means++ initialization [1]. The Sage procedure is detailed in Algorithm 1 for a given annotation budget b (a sketch in code follows the algorithm). Importantly, sampling diverse target samples with a high impact on transferability results from the construction of an embedding (Sage) suitable for k-means++.

Algorithm 1 Sage(U_T, b, f, ϕ, d): Sage with diversity (k-means++)
Input: unlabelled target data U_T, budget b, representation ϕ, classifier f, discriminator d
1: Compute S(x_u) for x_u ∈ U_T                ▷ Depends on both f and ϕ.
2: A ← {argmax_{x_u ∈ U_T} ||S(x_u)||}         ▷ Select the sample with the highest Sage norm.
3: while |A| < b do                            ▷ Apply k-means++ on the Sage embedding.
4:   A ← A ∪ {argmax_{x_u ∈ U_T} min_{x_a ∈ A} ||S(x_u) − S(x_a)||}
5: end while
6: Return A
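Below is a NumPy sketch (ours) of the Sage embedding of Eq. (6) and of the greedy selection of Algorithm 1, assuming the POP-ed gradients g̃_x and class probabilities h(x) have already been computed for every unlabelled target sample. The minimum distance to the selected set is updated incrementally, so the loop costs O(nb) distance evaluations.

```python
import numpy as np

def sage_embedding(g_tilde, probs):
    """S(x) of Eq. (6): POP-ed gradient rows scaled by sqrt class probabilities,
    flattened to a single (C*m,) vector so that Euclidean norms and distances
    match ||S(x)|| and ||S(x1) - S(x2)||."""
    return (np.sqrt(probs)[:, None] * g_tilde).ravel()

def sage_select(embeddings, budget):
    """Algorithm 1: greedy selection on Sage embeddings (k-means++-style init).

    embeddings : (n, C*m) array, one Sage embedding per unlabelled target sample
    budget     : number of samples b to send to the oracle
    Returns the indices of the selected samples.
    """
    norms = np.linalg.norm(embeddings, axis=1)
    selected = [int(np.argmax(norms))]          # highest Sage norm first (line 2)
    # Distance of every point to its closest already-selected point.
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:               # lines 3-5
        nxt = int(np.argmax(dist))              # farthest from the selected set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected
```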
3.5 Semi-Supervised Domain Adaptation (SSDA)

When acquiring labels in the target domain, we are in the Semi-Supervised Domain Adaptation (SSDA) setting. To this purpose, we note L_S and L_T the sets of labelled samples from the source and the target domain, respectively. We study three strategies, referred to as S∪T, S+T and MME [31]. They incorporate labelled samples into adaptation through an additional loss Ω, called an SSDA regularizer:

    Ω_{S∪T}(f, ϕ) := L_{L_S ∪ L_T}(f, ϕ)    (7)

    Ω_{S+T}(f, ϕ) := L_{L_S}(f, ϕ) + L_{L_T}(f, ϕ)    (8)

noting L_L(f, ϕ) the empirical cross-entropy of fϕ computed on a labelled dataset L. Note that Ω_{S+T} gives more importance to target labelled samples than Ω_{S∪T}, especially in the small budget regime (i.e., when the budget b is such that b ≪ |L_S|). As a strong baseline exists in SSDA, we also design an Ω following the minimax entropy objective (MME) [31]. Noting H_{U_T}(h) := −(1/|U_T|) Σ_{x∈U_T} h(x) · log h(x) the entropy on the unlabelled samples U_T, the MME objective is:

    Ω_MME(f) := Ω_{S+T}(f, ϕ) − λ H_{U_T}(fϕ)
    Ω_MME(ϕ) := Ω_{S+T}(f, ϕ) + λ H_{U_T}(fϕ)    (9)

where the classifier takes the form f := σ((1/T) W ∘ ℓ_2), with ℓ_2(z) := z/||z||_2 the L2 normalization of the features, W ∈ R^{C×m} a linear layer and σ the softmax layer; λ = 0.1 and T = 0.05.
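One convenient way to realize the two opposite-sign objectives of Eq. (9) in a single backward pass is a gradient reversal layer placed on the entropy term between ϕ and f: descending on −λH updates the classifier to maximize the entropy, while the reversed gradient reaching ϕ minimizes it. The sketch below is ours and only illustrates this mechanism under these assumptions; it reuses GradReverse from the earlier sketch and the paper's T = 0.05.

```python
import torch
import torch.nn.functional as F

class CosineClassifier(torch.nn.Module):
    """f = softmax(W l2(z) / T): a linear layer on l2-normalised features, Eq. (9)."""
    def __init__(self, dim, num_classes, temperature=0.05):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(num_classes, dim))
        self.T = temperature

    def forward(self, z):
        return F.linear(F.normalize(z, dim=1), self.weight) / self.T  # logits

def mme_entropy_term(z_unlabelled, classifier, lambd=0.1):
    """Returns -lambda * H(f phi) on unlabelled target features, with a reversal
    layer so that one backward pass realises both lines of Eq. (9):
    the classifier maximises the entropy, the feature extractor minimises it."""
    logits = classifier(GradReverse.apply(z_unlabelled, 1.0))
    p = F.softmax(logits, dim=1)
    neg_entropy = lambd * (p * torch.log(p + 1e-7)).sum(dim=1).mean()
    return neg_entropy  # add to Omega_{S+T} before calling .backward()
```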
3.6 Training procedure

The training procedure is described in Algorithm 2. First, we train the model by UDA following the training procedure from [7]. Second, at each round, we select with Sage (see Algorithm 1) b samples to send to the Oracle. Then, we perform UDA provided with the knowledge of the newly labelled samples, that is, using an SSDA regularizer Ω (see Section 3.5) combined with the soft-class conditioning loss L_TSF. The gradient descent step is defined as follows, for some α > 0:

    (f, ϕ, d) ← (f, ϕ, d) − α ∇_{(f,ϕ,d)} [Ω̂(f, ϕ) + λ L̂_TSF(f, ϕ)]    (10)

where, for a given loss L, we note L̂ its batch-wise estimate computed from a batch of source labelled samples B_S^ℓ from L_S, a batch of target labelled samples B_T^ℓ from L_T, and a batch of target unlabelled samples B_T^u from U_T. Notably, B_S^ℓ and B_T^ℓ are involved in computing Ω̂ (as well as B_T^u for Ω̂_MME), while B_S^ℓ and B_T^u are involved in computing L̂_TSF.

Algorithm 2 Training procedure
Input: labelled source samples L_S, unlabelled target samples U_T, budget b, annotation rounds r, iterations n_it, SSDA regularizer Ω
1: L_T ← {}, U_T' ← U_T                        ▷ Initialize the labelled target samples.
2: f, ϕ, d ← UDA as described in [7]           ▷ Pretraining before Active Learning.
3: for r rounds of annotation do
4:   A ← Sage(U_T', b, f, ϕ, d)                ▷ Select samples for annotation.
5:   L ← Oracle(A)                             ▷ Send samples to an Oracle.
6:   L_T ← L_T ∪ L                             ▷ Add newly labelled samples.
7:   U_T' ← U_T' \ A                           ▷ Remove newly labelled samples.
8:   for n_it iterations do
9:     Sample a source labelled batch B_S^ℓ from L_S
10:    Sample a target labelled batch B_T^ℓ from L_T
11:    Sample a target unlabelled batch B_T^u from U_T   ▷ (Not from U_T'.)
12:    f, ϕ, d ← gradient descent update of Equation (10)
13:   end for
14: end for
15: Return f, ϕ

4 Theoretical Analysis

4.1 General bound

We provide a theoretical analysis of guiding adaptation with AL, leveraging recent results from [7]. Our insight is that some labelled data from the target domain, when combined with source labelled data, are likely to improve the target error. For instance, minimizing Ω_{S+T} may result in a better-performing classifier than simply minimizing the source cross-entropy loss L_{L_S}. The theoretical framework of Bouvier et al. [7] allows us to quantify precisely how this impacts representations' transferability. Noting h_S := argmin_{h∈H} ε_S(h) and h_Ω := argmin_{h∈H} Ω(h), such that ε_T(h_Ω) ≤ β ε_T(h_S) for some β < 1, i.e., h_Ω improves the target error compared to h_S, the work [7] bounds the target error:

    ε_T(h_Ω) ≤ ρ (ε_S(h_S) + 8τ + η),  where ρ := β / (1 − β)    (11)

where τ := sup_{f∈F} {E_T[h_Ω(X) · f(ϕ(X))] − E_S[Y · f(ϕ(X))]} is the transferability error, F is the set of continuous functions from 𝒵 to [−1, 1]^C, and η := inf_{f∈F} ε_T(fϕ). Thus, to guarantee a small target error, the following conditions have to be met: a small source error of h_S (small ε_S(h_S)), a small transferability error of h_Ω (small τ) and a strong inductive bias (small β), while η is assumed small [7]. ADA incorporates a small set of target labelled samples into Ω to strengthen the inductive bias while enforcing a small transferability error of h_Ω. More details on the choice of Ω are given in Section 3.5.

4.2 A particular case with closed form

Setup and additional notations. In this section, we provide a simple example where the bound presented above has a closed form. To conduct the analysis, we consider 𝒳 as a measurable set provided with a probability measure p_T, and we extend annotation selection to this measurable setting. Selecting samples for annotation with budget b consists in determining some measurable subset B such that p_T(X ∈ B) = b. In the particular case where p_T := Σ_{x∈D_T} δ_x (δ_x is the Dirac distribution at x) is an empirical distribution, determining such a subset B consists in selecting a subset of b samples of D_T.

Naive Active Classifier. Given a classifier h and an annotated subset B (of probability b), we suggest a slight modification of h based on the annotations of B provided by the Oracle. To this purpose, we introduce the naive active classifier, noted h_B, and defined as follows:

    h_B(x) = Oracle(x) if x ∈ B,  h(x) otherwise.    (12)

Thus, h_B returns the classifier's output h(x) if x is not annotated and the oracle's output Oracle(x) if x is annotated.

A closed bound. We want to exhibit a closed form of ρ for the active classifier. To this purpose, we introduce the purity π of B, π := p_T(h_S(X) ≠ Oracle(X) | X ∈ B). It reflects our capacity to identify misclassified target samples. With this notion, we observe that the naive classifier improves the target error: ε_T(h_B) ≤ ε_T(h_S) − bπ. Put simply, the error is reduced by bπ, corresponding to the annotated samples for which the prediction differs from the Oracle output. The higher the annotation budget b and the higher the purity π, the lower the target error of the naive classifier. It corresponds to ε_T(h_S) − bπ = (1 − bπ/ε_T(h_S)) ε_T(h_S) ≤ (1 − bπ) ε_T(h_S), resulting in β = 1 − bπ, and finally:

    ε_T(h_B) ≤ (1/(bπ) − 1)(ε_S(h_S) + 8τ + η)    (13)

The target error of the active classifier is a decreasing function of both the purity and the annotation budget, and an increasing function of the transferability error. The budget b, the purity π and the transferability of representations τ are levers to improve the naive classifier's target error. The budget b must be considered as a cost constraint rather than a parameter to optimize. The purity π is not tractable since it involves target labels. Proxy measures, such as the entropy of predictions [14], can provide a fair estimation of purity; however, deep nets are known to be overconfident on misclassified samples [12]. We therefore focus our efforts on understanding the role of active annotation in improving the transferability error τ.
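As a quick sanity check of Eq. (13), take illustrative numbers (ours, not reported in the paper): annotating b = 5% of the target pool with purity π = 0.6 gives bπ = 0.03, i.e., β = 0.97, whereas a perfect selector (π = 1) at the same budget gives β = 0.95:

```latex
\[
\varepsilon_T(h_B)\;\le\;\Big(\frac{1}{b\pi}-1\Big)\big(\varepsilon_S(h_S)+8\tau+\eta\big),
\qquad
\frac{1}{0.05\times 0.6}-1\approx 32.3
\quad\text{vs.}\quad
\frac{1}{0.05\times 1.0}-1=19.
\]
```

At a fixed budget, better purity shrinks the prefactor from roughly 32.3 to 19, which is precisely the lever Sage targets by querying samples likely to contradict the current prediction.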
[Figure 3: target accuracy per annotation round on (a) A→W, (b) W→A, (c) A→D, (d) D→A, (e) VisDA (b = 16) and (f) VisDA (b = 128).]
Fig. 3. Annotation of target samples drastically improves adaptation on the considered tasks. TSF+Sage (in blue) improves upon the state-of-the-art of ADA (AADA, in green), except on task A→D. Pure AL (Badge, in red) performs poorly in this context (Badge without adaptation does not appear on the VisDA tasks since it performs poorly: 47.0% and 63.4% after 10 rounds of annotation for b = 16 and b = 128, respectively), showing the importance of addressing adaptation for AL under distribution shift. Naively combining Badge with TSF (TSF+Badge, in orange) performs worse than Sage: Sage takes the domain shift into account when querying samples.

5 Experiments

5.1 Setup

Tasks. We evaluate our approach on Office-31 [29], VisDA-2017 [25] and DomainNet [24]. Office-31 contains 4,652 images classified into 31 categories across three domains: Amazon (A), Webcam (W) and DSLR (D). We explore the tasks A→W, W→A, A→D and D→A. We do not report results for D→W and W→D since UDA already achieves nearly perfect results on these tasks [22]. For VisDA, the domains are Synthetic (3D models rendered under different lighting conditions and angles) and Real (real-world images); we explore the Synthetic→Real task. DomainNet [24] is a large-scale dataset with six domains and 345 classes: Clipart (C), Infograph (I), Painting (P), Quickdraw (Q), Real (R) and Sketch (S). As DomainNet suffers from noisy labels, thus violating the assumption of a perfect Oracle, we focus on the subset of 126 classes and the 7 tasks R→C, R→P, P→C, C→S, S→P, R→S and P→R [31].

Protocol. The standard protocol in UDA uses the same target samples during the train and test phases. In the context of AL, this induces an undesirable effect where sample annotation mechanically increases accuracy: at train time, the model has access to the input and label of annotated samples that are also present at test time. We suggest instead splitting the target domain, with a ratio of 1/2, into a train target domain (samples used for adaptation and as the annotation pool) and a test target domain (samples used for evaluating the model). Samples from the test target domain are thus never seen at train time; as a result, our protocol evaluates model generalization in an inductive scenario. Reported results are based on 8 seeds for each method.

Budget, rounds and backbone. As the selected datasets differ in volumetry and difficulty, we use different budgets b: b = 8 for A→W and A→D (referred to as easy tasks), b = 16 for W→A and D→A (medium tasks), and both b = 16 and b = 128 for VisDA (hard tasks). This allows us to appreciate the versatility of the methods in small (b = 8), medium (b = 16) and high (b = 128) budget regimes. We conduct 10 rounds of annotation for these tasks. Additional details for the DomainNet experiments are given with the comparison to SSDA in Section 5.2. Our backbone is a ResNet50 [16], trained by 10k steps of SGD by UDA before annotation. We use DANN [13] for AADA, MME [31] for MME-based methods and TSF [7] for TSF-based methods.

Baselines. AADA [34] is the closest algorithm to Sage. AADA adapts representations by fooling a domain discriminator d trained to output 1 for source data and 0 for target data [13], and scores target samples x with s(x) := H(ŷ) w(z), where H(ŷ) is the entropy of the prediction ŷ and w(z) = (1 − d(z))/d(z). H(ŷ) brings information about uncertainty, while w(z) brings diversity to the score. We have reproduced the implementation of AADA.
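For reference, the AADA acquisition score described above can be sketched in a few lines (our rendering, not the official implementation; the stabilizing epsilon is ours):

```python
import torch
import torch.nn.functional as F

def aada_scores(logits, d_out):
    """AADA [34] acquisition score s(x) = H(y_hat) * w(z), batched.

    logits : classifier logits on target samples, shape (n, C)
    d_out  : domain discriminator outputs d(z) in (0, 1), shape (n,),
             trained to output 1 on source and 0 on target representations
    """
    p = F.softmax(logits, dim=1)
    entropy = -(p * torch.log(p + 1e-7)).sum(dim=1)   # uncertainty term H(y_hat)
    w = (1.0 - d_out) / (d_out + 1e-7)                # "targetness" weight w(z)
    return entropy * w                                # annotate the top-b scores
```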
To demonstrate the effectiveness of Sage for Active DA, we report TSF with the Badge query [3] (TSF+Badge), the state-of-the-art query in AL. For these methods, we use Ω = Ω_{S+T}. To compare Sage with an AL method that ignores the domain shift between labelled and queried samples, we report Badge with Ω_{S∪T}. Finally, to compare with SSDA approaches, we build two methods upon MME [31]: one with the Entropy query (selecting the samples with highest prediction entropy [35]), noted MME+Entropy, which is the most natural query for MME since it relies on max/min entropy, and one with a Random query, noted MME+Random. We have reproduced the implementation of MME.

5.2 Results

Comparison with SOTA. Results are reported in Figure 3. First, active annotation brings substantial improvements over UDA (round 0 of annotation). This validates, in our opinion, the effort and focus that should be put on ADA. Sage outperforms the current state-of-the-art (AADA) by a comfortable margin on tasks of medium or hard difficulty, except on A→D after the 5th round. Importantly, Sage performs similarly to or better than naively combining TSF with a state-of-the-art AL query (Badge), demonstrating that Sage takes the domain shift into account in the query process. Finally, a pure AL method (Badge) fails in the context of domain shift.

Ablation of Sage. We ablate the core components of Sage, i.e., POP and k-means++, in Figures 4(a) and 4(b). Interestingly, Sage without POP fails to improve performance in the target domain, demonstrating that POP brings information about uncertainty into the embedding. Sage without diversity performs poorly on VisDA (b = 128), demonstrating that the k-means++-based sampling brings diversity; diversity has only a small effect on W→A.

Ablation of queries. We ablate more AL strategies in Figures 4(c) and 4(d): Random, where target samples are selected at random; Clusters, which selects the samples closest to b clusters of representations obtained with k-means; Entropy, which selects the samples with the highest entropy −h(x) · log h(x) over U_T [35]; and Confidence, which selects the samples with the smallest confidence max_c h(x)_c over U_T [35]. Sage is thus compared with a wide spectrum of AL queries based on representativeness (Random), diversity (Clusters) and uncertainty sampling (Entropy, Confidence); the two uncertainty queries are sketched below. Sage outperforms them substantially on both tasks, demonstrating that it is well-suited for ADA.
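For completeness, here is a sketch (ours) of the two uncertainty baselines above, operating on the (n, C) matrix of predicted class probabilities:

```python
import numpy as np

def entropy_query(probs, budget):
    """Entropy query [35]: pick the samples with the highest prediction entropy."""
    h = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-h)[:budget]

def confidence_query(probs, budget):
    """Confidence query [35]: pick the samples with the smallest top-class probability."""
    return np.argsort(probs.max(axis=1))[:budget]
```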
Ablation of Ω. We report TSF+Ω_{S∪T} and TSF+Ω_MME, which consist in using Ω_{S∪T} and in adding MME as a regularization of TSF (Ω = Ω_MME), respectively. Results are reported in Figures 4(e) and 4(f). Using Ω_{S+T} and Ω_MME improves consistently over Ω_{S∪T} on VisDA (b = 128), while performing similarly on W→A. Furthermore, adding MME to TSF+Sage achieves the best performance on VisDA (b = 128). Importantly, MME+Entropy is already strong on VisDA (b = 128), which explains the substantial improvement when adding MME to TSF on this task.

[Figure 4: ablation curves on (a, c, e) W→A and (b, d, f) VisDA (b = 128).]
Fig. 4. (a) and (b): both POP and k-means++ are crucial components for the empirical success of Sage. (c) and (d): Sage outperforms AL queries based on representativeness, diversity and uncertainty sampling. (e) and (f): effect of adding MME to TSF+Sage.

ADA vs SSDA: ADA is a more realistic setting. We compare SSDA (a fixed number of labelled target samples per class are available) with ADA (an Oracle provides ground-truth labels for queried target samples) for an equal number of target labelled samples. Crucially, enforcing a fixed number of labelled samples per class is unrealistic in practice. We report performances on DomainNet of MME (1-shot and 3-shot) [31] and of Sage (here, TSF + Sage + Ω_MME). AL is performed during 6 rounds with b = 21 and b = 63 for 1-shot and 3-shot respectively, leading to the same number of target labelled samples (|L_T| = 21 × 6 = 126 for 1-shot and |L_T| = 63 × 6 = 3 × 126 for 3-shot). Results are presented in Table 1. In the 3-shot scenario, Sage improves upon MME on all tasks except P→R. In the 1-shot scenario, Sage and MME perform similarly. This demonstrates that active annotation with Sage performs equally well or better than MME while relying on more realistic assumptions.

Table 1. SSDA (MME) vs ADA (AADA and Sage) on DomainNet. MME's results deviate from [31] due to the train/test split, the ResNet50 backbone and minor implementation changes.

              1-shot                 3-shot
Tasks    MME   AADA   Sage      MME   AADA   Sage
R→C      67.5  64.4   69.3      70.1  68.8   73.9
R→P      69.6  65.5   69.4      70.8  67.0   71.4
P→C      69.0  63.2   69.9      71.4  67.3   74.1
C→S      62.2  57.4   61.5      64.7  60.1   65.4
S→P      67.9  62.6   67.9      69.6  64.9   69.8
R→S      61.2  57.0   62.1      63.6  59.9   65.8
P→R      79.3  74.9   79.0      80.9  76.9   81.2
Mean     68.1  63.6   68.5      70.2  66.3   71.7

6 Related Works

Transferability of Invariant Representations. Recent works warn that domain invariance may deteriorate the transferability of invariant representations [18,38]. Prior works enhance transferability with multi-linear conditioning of representations on predictions [22], by introducing weights [8,37,11], by penalizing the high singular values of representation batches [10], or by hallucinating consistent target samples to bridge the domain gap [20].

Active Learning. There is an extensive literature on Active Learning [33], which can be divided into two schools: uncertainty and diversity. The first aims to annotate samples for which the model has uncertain predictions, e.g., samples selected according to their entropy [35] or prediction margin [28], with some theoretical guarantees [15,4]. The second focuses on annotating a representative sample of the data distribution; e.g., the Core-Set approach [32] selects samples that geometrically cover the distribution. Several approaches also propose a trade-off between uncertainty and diversity, e.g., [17], formulated as a bandit problem. Recently, the work [2] introduced Badge, a gradient embedding which, like Sage, takes the best of uncertainty and diversity. Our work is inspired by Badge and adapts its core ideas to the context of ADA.

Active Domain Adaptation. Despite its great practical interest, only a few previous works address the problem of Active Domain Adaptation. [9] annotates target samples by importance sampling, while [27,30] annotate samples with a high discrepancy with source samples based on the prediction of a domain discriminator.
However, those strategies do not fit modern adaptation with deep nets. To our knowledge, AADA [34] is the only prior work that actively learns domain invariant representations, and it achieves the state-of-the-art for Active Domain Adaptation. AADA is therefore the most relevant work to compare with Sage.

7 Conclusion

We have introduced Sage, an efficient method for ADA which identifies the target samples that are most likely to improve representations' transferability when annotated. It relies on two core components: a stochastic embedding of the gradient of the transferability loss, and a k-means++ initialization which guarantees that each round annotates a diverse set of target samples. Through various experiments, we have demonstrated the effectiveness of Sage and its capacity to take the best of uncertainty, representativeness and diversity sampling. Designing new SSDA strategies for use with Sage is an interesting direction for future work.

References

1. Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. Tech. rep., Stanford (2006)
2. Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671 (2019)
3. Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch active learning by diverse, uncertain gradient lower bounds. In: 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net (2020)
4. Balcan, M.F., Beygelzimer, A., Langford, J.: Agnostic active learning. Journal of Computer and System Sciences 75(1), 78–89 (2009)
5. Beery, S., Van Horn, G., Perona, P.: Recognition in terra incognita. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 456–473 (2018)
6. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations for domain adaptation. In: Advances in Neural Information Processing Systems. pp. 137–144 (2007)
7. Bouvier, V., Very, P., Chastagnol, C., Tami, M., Hudelot, C.: Robust domain adaptation: Representations, weights and inductive bias. In: ECML-PKDD (2020)
8. Cao, Z., Ma, L., Long, M., Wang, J.: Partial adversarial domain adaptation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 135–150 (2018)
9. Chattopadhyay, R., Fan, W., Davidson, I., Panchanathan, S., Ye, J.: Joint transfer and batch-mode active learning. In: International Conference on Machine Learning. pp. 253–261 (2013)
10. Chen, X., Wang, S., Long, M., Wang, J.: Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In: International Conference on Machine Learning. pp. 1081–1090 (2019)
11. Tachet des Combes, R., Zhao, H., Wang, Y.X., Gordon, G.J.: Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems 33 (2020)
12. Corbière, C., Thome, N., Bar-Hen, A., Cord, M., Pérez, P.: Addressing failure prediction by learning model confidence. In: Advances in Neural Information Processing Systems 32. pp. 2902–2913. Curran Associates, Inc. (2019)
13. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning. pp. 1180–1189 (2015)
14. Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in Neural Information Processing Systems. pp. 529–536 (2005)
15. Hanneke, S.: Theory of disagreement-based active learning. Foundations and Trends in Machine Learning 7(2-3), 131–309 (2014)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
17. Hsu, W.N., Lin, H.T.: Active learning by learning. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
18. Johansson, F., Sontag, D., Ranganath, R.: Support and invertibility in domain-invariant representations. In: The 22nd International Conference on Artificial Intelligence and Statistics. pp. 527–536 (2019)
19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
20. Liu, H., Long, M., Wang, J., Jordan, M.: Transferable adversarial training: A general approach to adapting deep classifiers. In: International Conference on Machine Learning. pp. 4013–4022 (2019)
21. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd International Conference on Machine Learning. pp. 97–105. JMLR.org (2015)
22. Long, M., Cao, Z., Wang, J., Jordan, M.I.: Conditional adversarial domain adaptation. In: Advances in Neural Information Processing Systems. pp. 1640–1650 (2018)
23. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (2009)
24. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1406–1415 (2019)
25. Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., Saenko, K.: VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924 (2017)
26. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press (2009)
27. Rai, P., Saha, A., Daumé III, H., Venkatasubramanian, S.: Domain adaptation meets active learning. In: Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing. pp. 27–32. Association for Computational Linguistics (2010)
28. Roth, D., Small, K.: Margin-based active learning for structured output spaces. In: European Conference on Machine Learning. pp. 413–424. Springer (2006)
29. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: European Conference on Computer Vision. pp. 213–226. Springer (2010)
30. Saha, A., Rai, P., Daumé, H., Venkatasubramanian, S., DuVall, S.L.: Active supervised domain adaptation. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 97–112. Springer (2011)
31. Saito, K., Kim, D., Sclaroff, S., Darrell, T., Saenko, K.: Semi-supervised domain adaptation via minimax entropy. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8050–8058 (2019)
32. Sener, O., Savarese, S.: Active learning for convolutional neural networks: A core-set approach. In: ICLR (2018)
33. Settles, B.: Active learning literature survey. Tech. rep., University of Wisconsin-Madison, Department of Computer Sciences (2009)
34. Su, J.C., Tsai, Y.H., Sohn, K., Liu, B., Maji, S., Chandraker, M.: Active adversarial domain adaptation. In: The IEEE Winter Conference on Applications of Computer Vision. pp. 739–748 (2020)
35. Wang, D., Shang, Y.: A new active labeling method for deep learning. In: 2014 International Joint Conference on Neural Networks (IJCNN). pp. 112–119. IEEE (2014)
36. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems. pp. 3320–3328 (2014)
37. You, K., Long, M., Cao, Z., Wang, J., Jordan, M.I.: Universal domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2720–2729 (2019)
38. Zhao, H., Des Combes, R.T., Zhang, K., Gordon, G.: On learning invariant representations for domain adaptation. In: International Conference on Machine Learning. pp. 7523–7532 (2019)