Latent Weights Generating for Few Shot Learning Using Information Theory

Yiwei Zhang and Zongyang Li

King's College London, United Kingdom, email: yiwei.1.zhang@kcl.ac.uk
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Few shot image classification aims at learning a classifier from limited labeled data. Generating the classification weights has been applied in many meta-learning approaches for few shot image classification due to its simplicity and effectiveness. However, fixed classification weights for different query samples within one task might be sub-optimal: because of the few shot challenge, it is difficult to generate exact and universal classification weights for all the diverse query samples from very few training samples. In this work, we introduce latent weights generating using information theory (LWGIT) for few shot learning, which addresses this issue by generating different classification weights for different query samples, letting each query sample attend to the whole support set. The experimental results demonstrate the effectiveness of LWGIT, which exceeds the performance of existing state-of-the-art models.

1 Introduction

While deep learning methods achieve great success in domains such as computer vision [1], natural language processing [2], and reinforcement learning [3], their hunger for large amounts of labeled data limits their application in scenarios where only a few examples are available for training. Humans, in contrast, are able to learn from limited data, which is a desirable property for deep learning methods. Few shot learning is thus proposed to enable deep models to learn from very few samples.

Meta learning is by far the most popular and promising approach to few shot problems [4]. In meta learning approaches, the model extracts high-level knowledge across different tasks so that it can adapt itself quickly to a newly arriving task [5]. There are several kinds of meta learning methods for few shot learning, such as gradient-based [4] and metric-based [6] methods. Among them, weights generation has shown effectiveness with a simple formulation [7]. In general, weights generation methods learn to generate the classification weights for different tasks conditioned on the limited labeled data.

However, fixed classification weights for different query samples within one task might be sub-optimal: because of the few shot challenge, it is difficult to generate exact and universal classification weights for all the diverse query samples from very few training samples.

To address these issues, we propose latent weights generating using information theory (LWGIT) for few shot learning. The contributions are as follows:

• To overcome the issues mentioned above, we propose LWGIT, which generates different classification weights for different query samples by letting each query sample attend to the whole support set.
• To guarantee that the generated weights adapt to different query samples, we reformulate the problem to maximize the lower bound of the mutual information between the generated weights and the query as well as the support data.
• The experimental results demonstrate the effectiveness of LWGIT, which exceeds the performance of existing state-of-the-art models.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces our proposed latent weights generating using information theory method. In Section 4, we evaluate the proposed models and report experimental results on two widely used real-world datasets. Section 5 concludes this work.

2 Related Work

2.1 Few Shot Learning

Learning from few labeled training data has received growing attention recently. Most successful existing methods apply meta learning to solve this problem and can be divided into several categories. In the gradient-based approaches, an optimal initialization for all tasks is learned [4]. Ravi and Larochelle [8] learned an LSTM meta-learner directly to optimize the given few-shot classification task. Sun et al. [9] learned gradient-driven transformations of the activations of each layer to better suit the current task.

In the metric-based methods, a similarity metric between query and support samples is learned [10]. Spatial information or local image descriptors are also considered in some works to compute richer similarities [11].
Generating the classification weights directly has been explored by several works. Gidaris and Komodakis [12] generated classification weights as linear combinations of weights for base and novel classes. Similarly, Qiao et al. [13] generated the classification weights from the activations of a trained feature extractor. Graph neural network denoising autoencoders are used in [7]. Munkhdalai and Yu [14] proposed to generate fast weights from the loss gradient for each task. None of these methods considers generating different weights for different query examples, nor maximizing the mutual information.

There are other methods for few-shot classification. Generative models are used to generate or hallucinate more data [15], and closed-form solutions have also been applied directly to few shot classification. Liu et al. [16] integrated label propagation on a transductive graph to predict the query class labels.

2.2 Attention Mechanism

The attention mechanism shows great success in computer vision [17] and natural language processing [18]. It is effective in modeling the interaction between queries and key-value pairs from a certain context. Depending on whether keys and queries point to the same entities, attention is referred to as self attention or cross attention. In this work, we use both types of attention to encode the task and query-task information.

3 Latent Weights Generating Using Information Theory

3.1 Background

Suppose that a sequence of tasks {T_1, ..., T_{N_t}} is sampled from an environment, which is a probability distribution E over tasks. In each task T_i ~ E, we have a few examples {x_{i,j}, y_{i,j}}_{j=1}^{n_{tr}} constituting the training set D^{tr}_{T_i}, while the rest form the test set D^{te}_{T_i}. Given a base learner f with parameters θ, the optimal parameters θ_{T_i} are learned to make accurate predictions, i.e., f_{θ_{T_i}}(x_{i,j}) → y_{i,j}. The effectiveness of such a base learner on D^{tr}_{T_i} is evaluated by the loss function L(f_{θ_{T_i}}, D^{tr}_{T_i}), which equals the mean squared error for regression problems:

\sum_{(x_{i,j},\, y_{i,j}) \in D^{tr}_{T_i}} \left\| f_{\theta_{T_i}}(x_{i,j}) - y_{i,j} \right\|_2^2    (1)

or the cross entropy loss

- \sum_{(x_{i,j},\, y_{i,j}) \in D^{tr}_{T_i}} \log p\left(y_{i,j} \mid x_{i,j}, f_{\theta_{T_i}}\right)    (2)

for classification problems.

The goal of meta-learning is to learn from previous tasks a well-generalized meta-learner M(·) which can facilitate the training of the base learner in a future task with a few examples. To this end, meta-learning involves two stages, i.e., meta-training and meta-testing. During meta-training, the parameters of the base learners for all tasks, i.e., {θ_{T_i}}_{i=1}^{N_t}, and the meta-learner M(·) are optimized alternatingly. With the help of M, the parameters {θ_{T_i}}_{i=1}^{N_t} are learned to minimize the expected empirical loss over the training sets of all N_t historical tasks:

\min_{\{\theta_{T_i}\}_{i=1}^{N_t}} \sum_{i=1}^{N_t} L\left(M(f_{\theta_{T_i}}), D^{tr}_{T_i}\right)    (3)

In turn, a well-generalized M can be obtained by minimizing the expected empirical loss over the test sets:

\min_{M} \sum_{i=1}^{N_t} L\left(M(f_{\theta_{T_i}}), D^{te}_{T_i}\right)    (4)

In the meta-testing phase, provided with a future task T_t, the learning effectiveness and efficiency are improved by applying the meta-learner M and solving

\min_{\theta_{T_t}} L\left(M(f_{\theta_{T_t}}), D^{tr}_{T_t}\right)    (5)

3.2 Problem Formulation

Following many popular meta-learning methods for few shot classification, we formulate the problem under the episodic training paradigm [4]. One N-way K-shot task sampled from an unknown task distribution P(T) consists of a support set and a query set:

T = (S, Q), \quad S = \left\{ (x_{c_n;k}, y_{c_n;k}) \mid k = 1, \ldots, K;\ n = 1, \ldots, N \right\}, \quad Q = \left\{ \hat{x}_1, \ldots, \hat{x}_{|Q|} \right\}    (6)

The support set S contains NK labeled samples. The query set Q contains the samples x̂ whose labels ŷ we need to predict based on S. During meta-training, the meta-loss is estimated on Q to optimize the model. During meta-testing, the performance of the meta-learning method is evaluated on Q, given the labeled S. The classes used in meta-training and meta-testing are disjoint, so the meta-learned model needs to learn knowledge that is transferable across tasks and to adapt itself quickly to novel tasks.

Our proposed approach follows the general framework of classification weights generation [13]. In this framework, a feature extractor outputs image feature embeddings, and the meta-learner generates the classification weights for different tasks.
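To make the episodic formulation concrete, the following sketch samples one N-way K-shot task T = (S, Q) from a labeled feature pool. This is an illustration of the setup above rather than the authors' code; the pool layout (a dict mapping class id to a list of d-dimensional feature vectors) and the 15 query samples per class are assumptions taken from the experimental protocol described later.

```python
import numpy as np

def sample_episode(pool, n_way=5, k_shot=1, n_query=15, rng=np.random):
    """Sample one N-way K-shot task T = (S, Q) from pool: {class_id: [d-dim feature, ...]}."""
    classes = rng.choice(list(pool.keys()), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for label, c in enumerate(classes):                 # relabel classes 0..N-1 inside the episode
        idx = rng.permutation(len(pool[c]))[: k_shot + n_query]
        feats = np.asarray(pool[c])[idx]
        support_x.append(feats[:k_shot]); support_y += [label] * k_shot
        query_x.append(feats[k_shot:]);   query_y += [label] * n_query
    S = (np.concatenate(support_x), np.array(support_y))   # NK labeled samples
    Q = (np.concatenate(query_x), np.array(query_y))       # query labels used only for the meta-loss / evaluation
    return S, Q
```

During meta-training the query labels are used to compute the meta-loss; during meta-testing they are used only to evaluate accuracy.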
3.3 Latent Embedding Optimization

Latent Embedding Optimization (LEO) [19] is the weights generation method most closely related to our work. In LEO, a latent code z is generated by an encoder h conditioned on the support set S, i.e., z = h(S), where h is instantiated as a relation network [20]. Classification weights w are decoded from z by a decoder l, w = l(z). In the inner loop, w is used to compute the loss (usually cross entropy) on the support set and z is then updated:

z' = z - \eta \nabla_z L_S(w)    (7)

where L_S indicates that the loss is evaluated on S only. The updated latent code z' is used to decode new classification weights w' with the generating function l. w' is adopted in the outer loop for the query set Q, and the objective function of LEO can then be written as

\min_{\theta} L_Q(w')    (8)

Here θ stands for the parameters of h and l, and we omit the regularization terms for clarity. LEO avoids updating the high-dimensional w in the inner loop by learning a lower-dimensional latent space, from which a sampled z can be used to generate w. The most significant difference between LEO and LWGIT is that we do not need inner updates to adapt the model. Instead, LWGIT is a feedforward network trained to maximize the mutual information so that it fits different tasks well. Moreover, LWGIT learns to generate optimal classification weights for each query sample, while LEO generates fixed weights conditioned on the support set within one task.
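The inner and outer loops of Equations 7 and 8 can be sketched as follows. This is a minimal PyTorch illustration under our own assumptions: `h` and `l` are placeholder encoder/decoder modules, the learning rate `eta` is arbitrary, and LEO's Gaussian sampling of z and its regularization terms are omitted.

```python
import torch
import torch.nn.functional as F

def leo_step(h, l, support_x, support_y, query_x, query_y, eta=1.0, inner_steps=1):
    """One LEO-style episode: adapt the latent code z on S, evaluate decoded weights on Q."""
    z = h(support_x)                                   # z = h(S), shape (N, latent_dim)
    for _ in range(inner_steps):
        w = l(z)                                       # classification weights, shape (N, d)
        loss_s = F.cross_entropy(support_x @ w.t(), support_y)   # L_S(w) on the support set
        (grad_z,) = torch.autograd.grad(loss_s, z, create_graph=True)
        z = z - eta * grad_z                           # Eq. (7): z' = z - eta * grad_z L_S(w)
    w_prime = l(z)                                     # w' decoded from the adapted latent code
    return F.cross_entropy(query_x @ w_prime.t(), query_y)       # Eq. (8): outer objective L_Q(w')
```

The outer loss is backpropagated through the inner gradient step (hence `create_graph=True`) to update the parameters of h and l.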
3.4 Weights Generation Using Information Theory

The framework of our proposed method is shown in Figure 1. Assume that we have a feature extractor, which can be a simple 4-layer ConvNet or a deeper ResNet. All the images in the sampled task T are processed by this feature extractor and represented as d-dimensional vectors, i.e., x_{c_n;k}, x̂ ∈ R^d. There are two paths to encode the task context and the individual query sample respectively, called the contextual path and the attentive path. The outputs of both paths are concatenated as input to the generator of classification weights. The generated classification weights are used not only to predict the label of x̂, but also to maximize the lower bound of the mutual information between the weights and the other variables, which is discussed in the following section.

Figure 1. The structure of the proposed LWGIT.

3.4.1 Attention Network

The encoding process includes two paths, namely the contextual path and the attentive path. The contextual path learns representations of the support set only, using a multi-head self-attention network f^{cp}_{sa} [18]. The outputs of the contextual path, X_{cp} ∈ R^{NK×d_h}, thus contain rich information about the task and are used later for weights generation.

Existing weights generation methods generate the classification weights conditioned on the support set only, which is equivalent to using only the contextual path. However, the classification weights generated in this way might be sub-optimal, because estimating exact and universal classification weights from the very few labeled samples in the support set is difficult and sometimes impossible. The generated weights usually lack adaptation to different query samples. We address this issue by introducing the attentive path, where each individual query example attends to the task context and is then used to generate the classification weights. Therefore, the classification weights are adaptive to different query samples and aware of the task context as well.

In the attentive path, a new multi-head self-attention network f^{ap}_{sa} is employed on the support set to encode the global task information. f^{ap}_{sa} is different from f^{cp}_{sa} in the contextual path: the self-attention network in the contextual path focuses on generating the classification weights, whereas the outputs of the self-attention here provide the value context that different query samples attend to in the following cross attention. Sharing the same self-attention network might limit the expressiveness of the learned representations in both paths. A cross attention network f^{ap}_{ca}, applied to each query sample and the task-aware support set, then produces X̂_{ap} ∈ R^{|Q|×d_h}.

We use multi-head attention with h heads in both paths. In one attention block, we produce h different sets of queries, keys and values. Multi-head attention is claimed to learn more comprehensive and expressive representations from h different subspaces [18].

3.4.2 Weights Generator

We replicate X_{cp} ∈ R^{NK×d_h} and X̂_{ap} ∈ R^{|Q|×d_h} |Q| and NK times respectively and reshape them, yielding X_{cp} ∈ R^{|Q|×NK×d_h} and X̂_{ap} ∈ R^{|Q|×NK×d_h}. These two tensors are concatenated to form X_{cp⊕ap} ∈ R^{|Q|×NK×2d_h}. X_{cp⊕ap} can be interpreted as each query sample having its own latent representation of the support set, used to generate specific classification weights that are both task-context aware and adaptive to the individual query sample.

X_{cp⊕ap} is decoded by the weights generator g: R^{2d_h} → R^{2d}. We assume that the classification weights follow a Gaussian distribution with diagonal covariance. g outputs the distribution parameters, and we sample the weights from the learned distribution during meta-training. The sampled classification weights are represented as W ∈ R^{|Q|×NK×d}. To reduce complexity, we average the K classification weights of each class to obtain W^{final} ∈ R^{|Q|×N×d}. Therefore, the i-th query sample has its specific classification weight matrix W^{final}_i ∈ R^{N×d}. The prediction for the query data is computed as X̂ W^{final⊤}. The support data X is replicated |Q| times and reshaped as X_s ∈ R^{|Q|×NK×d}, so the prediction for the support data can also be computed as X_s W^{final⊤}.

Besides the weights generator g, we have two additional decoders r_1: R^d → R^{d_h} and r_2: R^d → R^{d_h}. They both take the generated weights W as input and learn to reconstruct X_{cp} and X̂_{ap} respectively. The outputs of r_1 and r_2 are denoted as X^{cp}_{re}, X̂^{ap}_{re} ∈ R^{|Q|×NK×d_h}.
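The tensor bookkeeping in the weights generator is easier to follow in code. The sketch below traces the shapes from the two encoding paths to the per-query classification weights and logits; it is an illustration under our assumptions, with `g` a placeholder module that outputs the mean and log-variance of the diagonal Gaussian, and the support samples assumed to be ordered class-major.

```python
import torch

def generate_weights_and_logits(x_cp, x_ap_hat, g, support_x, query_x, n_way, k_shot):
    """x_cp: (NK, d_h) contextual path; x_ap_hat: (|Q|, d_h) attentive path;
    support_x: (NK, d); query_x: (|Q|, d)."""
    nk, d_h = x_cp.shape
    n_q = x_ap_hat.shape[0]
    cp = x_cp.unsqueeze(0).expand(n_q, nk, d_h)             # (|Q|, NK, d_h)
    ap = x_ap_hat.unsqueeze(1).expand(n_q, nk, d_h)         # (|Q|, NK, d_h)
    cp_ap = torch.cat([cp, ap], dim=-1)                     # X_{cp⊕ap}: (|Q|, NK, 2*d_h)

    mu, log_var = g(cp_ap).chunk(2, dim=-1)                 # g: R^{2 d_h} -> R^{2 d}
    w = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # sampled W: (|Q|, NK, d)

    # assumes support samples are ordered class-major: [class 0 x K, class 1 x K, ...]
    w_final = w.view(n_q, n_way, k_shot, -1).mean(dim=2)    # average over K shots: (|Q|, N, d)
    query_logits = torch.einsum('qd,qnd->qn', query_x, w_final)        # (|Q|, N)
    support_logits = torch.einsum('sd,qnd->qsn', support_x, w_final)   # (|Q|, NK, N)
    return w_final, query_logits, support_logits
```

Each query sample thus carries its own N×d weight matrix, which is exactly what the shuffling experiments in Section 4.3 probe.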
3.5 Information Theory

In this section, we perform the analysis for one query sample without loss of generality; the subscripts of the classification weights are omitted for clarity. In general, we use (x, y) and (x̂, ŷ) to denote support and query samples respectively.

Since the classification weights w generated by g are encoded from both the attentive path and the contextual path, one would expect them to be query-specific by construction. However, we show in the experiments that this alone does not significantly outperform a weights generator conditioned only on S, which implies that the classification weights generated from the two paths are not sensitive to different query samples. In other words, the information from the attentive path is not preserved well during weights generation.

To address this limitation, we propose to maximize the mutual information between the generated weights w and the support as well as the query data. The objective function can be described as

\max\ I((\hat{x}, \hat{y}); w) + \sum_{(x,y) \in S} I((x, y); w)    (9)

According to the chain rule of mutual information, we have

I((\hat{x}, \hat{y}); w) = I(\hat{x}; w) + I(\hat{y}; w \mid \hat{x})    (10)

Equation 10 applies to both kinds of terms in Equation 9, so the objective function can be written as

\max\ I(\hat{x}; w) + I(\hat{y}; w \mid \hat{x}) + \sum_{(x,y) \in S} \left[ I(x; w) + I(y; w \mid x) \right]    (11)
Directly computing the mutual information in Equation 11 is intractable, since the true posterior distributions such as p(ŷ|x̂, w) and p(x̂|w) are unknown. Therefore, we use Variational Information Maximization [21] to compute a lower bound of Equation 11. We use p_θ(x̂|w) to approximate the true posterior distribution, where θ represents the model parameters. As a result, we have

I(\hat{x}; w) = H(\hat{x}) - H(\hat{x} \mid w)
= H(\hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[\mathbb{E}_{\hat{x} \sim p(\hat{x} \mid w)}[\log p(\hat{x} \mid w)]\big]
= H(\hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[D_{KL}\big(p(\hat{x} \mid w) \,\|\, p_\theta(\hat{x} \mid w)\big) + \mathbb{E}_{\hat{x} \sim p(\hat{x} \mid w)}[\log p_\theta(\hat{x} \mid w)]\big]
\geq H(\hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[\mathbb{E}_{\hat{x} \sim p(\hat{x} \mid w)}[\log p_\theta(\hat{x} \mid w)]\big]    (12)

H(·) is the entropy of a random variable, and H(x̂) is a constant for given data. We can therefore maximize this lower bound as a proxy for the true mutual information. Similarly to I(x̂; w),

I(\hat{y}; w \mid \hat{x}) \geq H(\hat{y} \mid \hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[\mathbb{E}_{\hat{y} \sim p(\hat{y} \mid \hat{x}, w)}[\log p_\theta(\hat{y} \mid \hat{x}, w)]\big]    (13)

\sum_{(x,y) \in S} I((x, y); w) \geq \sum_{(x,y) \in S} H((x, y)) + \mathbb{E}_{(x,y) \sim p((x,y) \mid w)}\big[\log p_\theta(x \mid w) + \log p_\theta(y \mid x, w)\big]    (14)

where p_θ(x̂|w) and p_θ(x, y|w) approximate the true posterior distributions p(x̂|w) and p(x, y|w).

Substituting these lower bounds back into Equation 11 and omitting the constant entropy terms and the expectation subscripts for clarity, we obtain the new objective function

\max\ \mathbb{E}[\log p_\theta(\hat{y} \mid \hat{x}, w)] + \mathbb{E}\big[\log p_\theta(y \mid x, w) + \log p_\theta(x \mid w) + \log p_\theta(\hat{x} \mid w)\big]    (15)

The first two terms maximize the log likelihood of the labels of both the query and the support data with respect to the network parameters, given the generated classification weights. This is equivalent to minimizing the cross entropy between the predictions and the ground truth. We assume that p_θ(x̂|w) and p_θ(x|w) are Gaussian distributions, and r_1 and r_2 approximate the means of these two Gaussians. Maximizing the log likelihood of the last two terms is therefore equivalent to reconstructing X_{cp} and X̂_{ap} with an L2 loss. Thus the loss function used to train the network can be written as

L = CE(\hat{y}_{pred}, \hat{y}) + \lambda_1 \sum_{y \in S} CE(y_{pred}, y) + \lambda_2 \sum_{x^{cp} \in S} \left\| x^{cp} - x^{cp}_{re} \right\|_2 + \lambda_3 \left\| \hat{x}^{ap} - \hat{x}^{ap}_{re} \right\|_2    (16)

CE stands for cross entropy. x^{cp} and x̂^{ap} are the inputs to the weights generator g, and x^{cp}_{re} ~ p_θ(x|w) and x̂^{ap}_{re} ~ p_θ(x̂|w) are the reconstructions of x^{cp} and x̂^{ap}. Since we convert the log likelihoods in Equation 15 into mean squared error or cross entropy terms in Equation 16, the value of each term in Equation 16 is not equal to the true log likelihood, and we have to decide the weight of each one. λ_1, λ_2, λ_3 are therefore hyper-parameters trading off the different terms. With the help of the last three terms, the generated classification weights are forced to carry information about the support data and the specific query sample.

In LEO [19], the inner-update loss is the cross entropy on the support data. If we merge the inner update into the outer loop, the loss becomes the sum of the first two terms in Equation 16. However, the weights generation in LEO does not involve specific query samples, which makes reconstructing X̂_{ap} impossible. In this sense, LEO can be regarded as a special case of our proposed method in which (1) only the contextual path exists and (2) λ_2 = λ_3 = 0.
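A compact sketch of the overall training loss in Equation 16, assuming the per-query logits and reconstructions have already been produced (for example by the generator sketch in Section 3.4.2 above). The reduction choices, the broadcasting of support targets over queries, and the fact that λ_1, λ_2, λ_3 are left as free parameters are our assumptions.

```python
import torch
import torch.nn.functional as F

def lwgit_loss(query_logits, query_y, support_logits, support_y,
               x_cp, x_cp_re, x_ap_hat, x_ap_re, lam1, lam2, lam3):
    """Eq. (16): query CE + lam1 * support CE + lam2 * contextual reconstruction
    + lam3 * attentive reconstruction."""
    n_q = query_logits.shape[0]
    ce_query = F.cross_entropy(query_logits, query_y)                       # CE(y_hat_pred, y_hat)
    ce_support = F.cross_entropy(                                           # support CE, one copy per query
        support_logits.reshape(-1, support_logits.shape[-1]),
        support_y.repeat(n_q))
    rec_cp = F.mse_loss(x_cp_re, x_cp.unsqueeze(0).expand_as(x_cp_re))      # ||x_cp - x_cp_re||
    rec_ap = F.mse_loss(x_ap_re, x_ap_hat.unsqueeze(1).expand_as(x_ap_re))  # ||x_ap - x_ap_re||
    return ce_query + lam1 * ce_support + lam2 * rec_cp + lam3 * rec_ap
```

Setting λ_1 = λ_2 = λ_3 = 0 recovers plain query cross entropy, which is the configuration ablated in Section 4.3.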
4 Experiments

4.1 Datasets and Protocols

We conduct experiments on miniImageNet [24] and tieredImageNet [26], two commonly used benchmark datasets, to compare with other methods and to analyze our model. Both datasets are subsets of the ILSVRC-12 dataset. miniImageNet contains 100 randomly sampled classes with 600 images per class. We follow the train/test split in [8], where 64 classes are used for meta-training, 16 for meta-validation, and 20 for meta-testing. tieredImageNet is a larger dataset, with 608 classes and 779,165 images in total, selected from 34 higher-level nodes in the ImageNet [27] hierarchy. 351 classes from 20 high-level nodes are used for meta-training, 97 classes from 6 nodes for meta-validation, and 160 classes from 8 nodes for meta-testing.

We use image features similar to LEO [19], whose authors trained a 28-layer Wide Residual Network [28] on the meta-training set. Each image is then represented by a 640-dimensional vector, which is used as the input to our model. For N-way K-shot experiments, we randomly sample N classes from the meta-training set, with K samples per class as the support set and 15 per class as the query set. As in other works, we train 5-way 1-shot and 5-way 5-shot models on both datasets. During meta-testing, 600 N-way K-shot tasks are sampled from the meta-testing set, and the average query-set accuracy is reported with a 95% confidence interval, as done in recent works [4, 19].
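The evaluation protocol above (mean accuracy over 600 sampled tasks with a 95% confidence interval) can be reproduced in a few lines. The normal-approximation interval used here is a common convention that we assume matches the reported numbers.

```python
import numpy as np

def evaluate(task_accuracies):
    """task_accuracies: per-task query accuracies over the 600 sampled N-way K-shot tasks."""
    acc = np.asarray(task_accuracies, dtype=np.float64)
    mean = acc.mean()
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))   # 95% CI of the mean (normal approximation)
    return mean, ci95   # e.g. print(f"{100*mean:.2f} ± {100*ci95:.2f}")
```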
4.2 Few Shot Image Classification

We compare the performance of our approach, LWGIT, with several state-of-the-art methods proposed in recent years on the two datasets. The results of MAML, Prototypical Nets, and Relation Nets on tieredImageNet are those evaluated in [16]. The result of Dynamic on miniImageNet with WRN-28-10 as the feature extractor is reported in [7]. The other results are taken from the corresponding original papers. We also list the backbone network of the feature extractor used by each method for reference. The results on tieredImageNet and miniImageNet are shown in Table 1 and Table 3 respectively.

Table 1. Accuracy comparison with other approaches on tieredImageNet.

Model                     Feature Extractor   5-way 1-shot   5-way 5-shot
MAML [4]                  Conv-4              51.67%         70.30%
Prototypical Nets [22]    Conv-4              53.31%         72.69%
Relation Nets [6]         Conv-4              54.48%         71.32%
TPN [16]                  Conv-4              59.91%         72.85%
MetaOptNet [23]           ResNet-12           65.81%         81.75%
LEO [19]                  WRN-28-10           66.33%         81.44%
LWGIT (ours)              WRN-28-10           67.46%         82.57%

Table 3. Accuracy comparison with other approaches on miniImageNet.

Model                     Feature Extractor   5-way 1-shot   5-way 5-shot
Matching Networks [24]    Conv-4              46.60%         60.00%
MAML [4]                  Conv-4              48.70%         63.11%
Meta LSTM [8]             Conv-4              43.44%         60.60%
Prototypical Nets [22]    Conv-4              49.42%         68.20%
Relation Nets [6]         Conv-4              50.44%         65.32%
SNAIL [25]                ResNet-12           55.71%         68.88%
TPN [16]                  ResNet-12           59.46%         –
MTL [9]                   ResNet-12           61.20%         75.50%
Dynamic [12]              WRN-28-10           60.06%         76.39%
Prediction [13]           WRN-28-10           59.60%         73.74%
DAE-GNN [7]               WRN-28-10           62.96%         78.85%
LEO [19]                  WRN-28-10           61.76%         77.59%
LWGIT (ours)              WRN-28-10           62.27%         78.74%

The top parts of Tables 1 and 3 list methods from different meta-learning categories, such as metric-based (Matching Networks, Prototypical Nets), gradient-based (MAML, MTL), and graph-based (TPN) approaches. The bottom parts list the classification weights generation approaches, including Dynamic, Prediction, DAE-GNN, LEO, and our proposed LWGIT.

LWGIT outperforms all the methods in the top parts of the two tables. Compared with the other classification weights generation methods in the bottom parts, LWGIT still shows very competitive performance, namely the best on tieredImageNet and close to the state of the art on miniImageNet. We note that all the classification weights generation methods use WRN-28-10 as the backbone network, which makes the comparison fair. In particular, LWGIT outperforms LEO in all settings.

4.3 Analysis

We perform a detailed analysis of LWGIT, shown in Table 2, and include the results of LEO [19] for reference. Generator in LEO denotes LEO without the inner update.

Table 2. Analysis of our proposed LWGIT. In the top half, the attentive path is removed to compare with LEO. In the bottom part, an ablation analysis with respect to the different components is provided.

                                       miniImageNet                  tieredImageNet
Model                                  5-way 1-shot   5-way 5-shot   5-way 1-shot   5-way 5-shot
LEO                                    61.76%         77.59%         66.33%         81.44%
Generator in LEO                       60.33%         74.53%         65.17%         78.77%
Generator conditioned on S only        61.02%         74.33%         66.22%         79.66%
Generator conditioned on S with IM     62.04%         77.54%         66.43%         81.73%
MLP encoding, λ1 = λ2 = λ3 = 0         58.95%         71.68%         63.92%         75.80%
MLP encoding                           62.26%         76.91%         65.84%         79.24%
λ1 = λ2 = λ3 = 0                       61.61%         74.14%         65.65%         79.93%
λ1 = λ2 = 0                            62.06%         74.18%         65.85%         80.42%
λ3 = 0                                 62.91%         77.88%         67.27%         81.67%
λ1 = 0                                 62.19%         74.21%         66.82%         80.61%
λ2 = λ3 = 0                            62.12%         77.65%         66.86%         81.03%
random shuffle in class                62.87%         77.48%         67.52%         82.55%
random shuffle between classes         61.20%         77.48%         66.55%         82.53%
LWGIT (ours)                           63.42%         78.44%         67.58%         82.04%

In the upper part of the table, we study the effect of the attentive path by implementing two generators that include only the contextual path during encoding. Generator conditioned on S only is trained with cross entropy on the query set, which is similar to Generator in LEO without the inner update; Generator conditioned on S with IM additionally adds the cross entropy loss and the reconstruction loss on the support set. Generator conditioned on S only achieves similar or slightly better results than Generator in LEO, which implies that self-attention is no worse than the relation networks used in LEO for modeling the task context. With information maximization, our generator obtains slightly better performance than LEO.

The effect of attention is investigated by replacing the attention modules with 2-layer MLPs, shown as MLP encoding: one MLP in the contextual path for the support set and another MLP in the attentive path for the query samples. Even without attention to encode the task context, MLP encoding achieves accuracy close to LEO, thanks to information maximization. However, if we set λ1 = λ2 = λ3 = 0 for MLP encoding, the performance drops significantly, which demonstrates the importance of maximizing the information.

We also conduct an ablation analysis with respect to λ1, λ2, λ3 to investigate the effect of information maximization. First, λ1, λ2, λ3 are all set to 0. In this case the accuracy is similar to Generator conditioned on S only, showing that the generated classification weights are not fitted to different query samples even with the attentive path. It can also be observed that maximizing the mutual information between the weights and the support data is more crucial, since λ1 = λ2 = 0 degrades accuracy significantly compared with λ3 = 0. We further investigate the relative importance of the classification on the support set versus its reconstruction: λ1 = 0 affects the performance noticeably, and we conjecture that support label prediction is more critical for information maximization.

The classification weights are generated specifically for each query sample in LWGIT. To verify this, we shuffle the classification weights between query samples within the same class and between different classes, to study whether the classification weights are adapted to different query samples. Assume there are T query samples per class in one task. W^{final} ∈ R^{|Q|×N×d} can be reshaped into R^{N×T×N×d}, and we shuffle this weight tensor along the first and second axes randomly, as sketched below.
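A minimal sketch of the two shuffling probes, assuming the query samples are ordered class-major so that the first axis of the reshaped weight tensor indexes the query class and the second indexes the query sample within that class.

```python
import torch

def shuffle_weights(w_final, n_way, t_query, between_classes):
    """w_final: (|Q|, N, d) per-query classification weights, |Q| = N * T (class-major order)."""
    w = w_final.view(n_way, t_query, n_way, -1).clone()
    if between_classes:
        w = w[torch.randperm(n_way)]                 # permute along the query-class axis
    else:
        for c in range(n_way):                       # permute query samples within each class
            w[c] = w[c, torch.randperm(t_query)]
    return w.view(n_way * t_query, n_way, -1)
```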
The results are shown as random shuffle between classes and random shuffle in class in Table 2. For the 5-way 1-shot experiments, the random shuffle between classes degrades the accuracy noticeably, while the random shuffle in class does not affect it much. This indicates that when the support data are very limited, the generated weights for query samples from the same class are very similar to each other, while distinct between classes. When there are more labeled data in the support set, the two kinds of random shuffle show very close or even identical results in the 5-way 5-shot experiments, both worse than the original ones. This implies that the generated classification weights are more diverse and specific to each query sample in the 5-way 5-shot setting. A possible reason is that a larger support set provides more knowledge to estimate the optimal classification weights for each query example.

5 Conclusion

In this work, we introduce latent weights generating using information theory (LWGIT) for few shot learning. LWGIT learns to generate optimal classification weights for each query sample within a task through two encoding paths. To guarantee this, the lower bound of the mutual information between the generated weights and the query as well as the support data is maximized. The effectiveness of LWGIT is demonstrated by state-of-the-art performance on two benchmark datasets.
REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[4] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1126–1135.
[5] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Advances in Neural Information Processing Systems, 2016, pp. 3981–3989.
[6] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[7] S. Gidaris and N. Komodakis, "Generating classification weights with GNN denoising autoencoders for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 21–30.
[8] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[9] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele, "Meta-transfer learning for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 403–412.
[10] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang, "Finding task-relevant features for few-shot learning by category traversal," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1–10.
[11] Y. Lifchitz, Y. Avrithis, S. Picard, and A. Bursuc, "Dense classification and implanting for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9258–9267.
[12] S. Gidaris and N. Komodakis, "Dynamic few-shot visual learning without forgetting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4367–4375.
[13] S. Qiao, C. Liu, W. Shen, and A. L. Yuille, "Few-shot image recognition by predicting parameters from activations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7229–7238.
[14] T. Munkhdalai and H. Yu, "Meta networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2554–2563.
[15] Z. Chen, Y. Fu, Y.-X. Wang, L. Ma, W. Liu, and M. Hebert, "Image deformation meta-networks for one-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8680–8689.
[16] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang, "Learning to propagate labels: Transductive propagation network for few-shot learning," arXiv preprint arXiv:1805.10002, 2018.
[17] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran, "Image transformer," arXiv preprint arXiv:1802.05751, 2018.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[19] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, "Meta-learning with latent embedding optimization," arXiv preprint arXiv:1807.05960, 2018.
[20] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, "A simple neural network module for relational reasoning," in Advances in Neural Information Processing Systems, 2017, pp. 4967–4976.
[21] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[22] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[23] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, "Meta-learning with differentiable convex optimization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10657–10665.
[24] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., "Matching networks for one shot learning," in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[25] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, "A simple neural attentive meta-learner," arXiv preprint arXiv:1707.03141, 2017.
[26] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, "Meta-learning for semi-supervised few-shot classification," arXiv preprint arXiv:1803.00676, 2018.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[28] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.