Latent Weights Generating for Few Shot Learning Using Information Theory

Yiwei Zhang and Zongyang Li

King's College London, United Kingdom, email: yiwei.1.zhang@kcl.ac.uk
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Few shot image classification aims at learning a classifier from limited labeled data. Generating the classification weights has been applied in many meta-learning approaches for few shot image classification due to its simplicity and effectiveness. However, fixed classification weights for different query samples within one task might be sub-optimal: because of the few shot challenge, it is difficult to generate exact and universal classification weights for all the diverse query samples from very few training samples. In this work, we introduce latent weights generating using information theory (LWGIT) for few shot learning, which addresses this issue by generating different classification weights for different query samples, letting each query sample attend to the whole support set. The experimental results demonstrate the effectiveness of LWGIT, which exceeds the performance of existing state-of-the-art models.

1 Introduction

While deep learning methods achieve great success in domains such as computer vision [1], natural language processing [2], and reinforcement learning [3], their hunger for large amounts of labeled data limits their application in scenarios where only a few examples are available for training. Humans, in contrast, are able to learn from limited data, which is a desirable property for deep learning methods. Few shot learning is thus proposed to enable deep models to learn from very few samples.

Meta learning is by far the most popular and promising approach to few shot problems [4]. In meta learning approaches, the model extracts high-level knowledge across different tasks so that it can adapt itself quickly to a newly arriving task [5]. There are several kinds of meta learning methods for few shot learning, such as gradient-based [4] and metric-based [6] methods. Among them, weights generation has shown effectiveness with a simple formulation [7]. In general, weights generation methods learn to generate the classification weights for different tasks conditioned on the limited labeled data.

However, fixed classification weights for different query samples within one task might be sub-optimal: because of the few shot challenge, it is difficult to generate exact and universal classification weights for all the diverse query samples from very few training samples.

To address these issues, we propose latent weights generating using information theory (LWGIT) for few shot learning. The contributions are as follows:

• To overcome the issues mentioned above, we propose LWGIT, which generates different classification weights for different query samples by letting each query sample attend to the whole support set.
• To guarantee that the generated weights adapt to different query samples, we reformulate the problem to maximize the lower bound of the mutual information between the generated weights and the query as well as the support data.
• The experimental results demonstrate the effectiveness of LWGIT, which exceeds the performance of existing state-of-the-art models.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces our proposed latent weights generating using information theory method. In Section 4, we evaluate the proposed models and report experimental results on two widely used real-world datasets. Section 5 concludes this work.

2 Related Work

2.1 Few Shot Learning

Learning from few labeled training data has received growing attention recently. Most successful existing methods apply meta learning to solve this problem and can be divided into several categories. In the gradient-based approaches, an optimal initialization for all tasks is learned [4]. Ravi and Larochelle [8] learned an LSTM meta-learner directly to optimize the given few-shot classification task. Sun et al. [9] learned gradient-driven transformations of the activations of each layer to better suit the current task.

In the metric-based methods, a similarity metric between query and support samples is learned [10]. Spatial information or local image descriptors are also considered in some works to compute richer similarities [11].
Generating the classification weights directly has been explored by several works. Gidaris and Komodakis [12] generated classification weights as linear combinations of weights for base and novel classes. Similarly, Qiao et al. [13] generated the classification weights from the activations of a trained feature extractor. Graph neural network denoising autoencoders are used in [7]. Munkhdalai and Yu [14] proposed to generate fast weights from the loss gradient for each task. None of these methods considers generating different weights for different query examples, nor maximizing the mutual information.

There are other methods for few-shot classification. Generative models are used to generate or hallucinate more data [15], and closed-form solutions have also been applied directly to few shot classification. Liu et al. [16] integrated label propagation on a transductive graph to predict the query class labels.

2.2 Attention Mechanism

The attention mechanism shows great success in computer vision [17] and natural language processing [18]. It is effective in modeling the interaction between queries and key-value pairs from a certain context. Depending on whether keys and queries point to the same entities, attention is referred to as self attention or cross attention. In this work, we use both types of attention to encode the task and query-task information.

3 Latent Weights Generating Using Information Theory

3.1 Background

Suppose that a sequence of tasks {T_1, ..., T_{N_t}} is sampled from an environment, which is a probability distribution E over tasks. In each task T_i ~ E, we have a few examples {x_{i,j}, y_{i,j}}_{j=1}^{n_{tr}} constituting the training set D^{tr}_{T_i}, while the rest form the test set D^{te}_{T_i}. Given a base learner f with parameters θ, the optimal parameters θ_{T_i} are learned to make accurate predictions, i.e., f_{θ_{T_i}}(x_{i,j}) → y_{i,j}. The effectiveness of such a base learner on D^{tr}_{T_i} is evaluated by the loss function L(f_{θ_{T_i}}, D^{tr}_{T_i}), which equals the mean squared error for regression problems:

\sum_{(x_{i,j},\, y_{i,j}) \in D^{tr}_{T_i}} \left\| f_{\theta_{T_i}}(x_{i,j}) - y_{i,j} \right\|_2^2    (1)

or the cross entropy loss

- \sum_{(x_{i,j},\, y_{i,j}) \in D^{tr}_{T_i}} \log p\left(y_{i,j} \mid x_{i,j}, f_{\theta_{T_i}}\right)    (2)

for classification problems.

The goal of meta-learning is to learn from previous tasks a well-generalized meta-learner M(·) which can facilitate the training of the base learner in a future task with a few examples. To this end, meta-learning involves two stages, i.e., meta-training and meta-testing. During meta-training, the parameters of the base learners for all tasks, i.e., {θ_{T_i}}_{i=1}^{N_t}, and the meta-learner M(·) are optimized alternatingly. With the help of M, the parameters {θ_{T_i}}_{i=1}^{N_t} are learned to minimize the expected empirical loss over the training sets of all N_t historical tasks:

\min_{\{\theta_{T_i}\}_{i=1}^{N_t}} \sum_{i=1}^{N_t} L\left(M(f_{\theta_{T_i}}), D^{tr}_{T_i}\right)    (3)

In turn, a well-generalized M can be obtained by minimizing the expected empirical loss over the test sets:

\min_{M} \sum_{i=1}^{N_t} L\left(M(f_{\theta_{T_i}}), D^{te}_{T_i}\right)    (4)

In the meta-testing phase, provided with a future task T_t, the learning effectiveness and efficiency are improved by applying the meta-learner M and solving

\min_{\theta_{T_t}} L\left(M(f_{\theta_{T_t}}), D^{tr}_{T_t}\right)    (5)

3.2 Problem Formulation

Following many popular meta-learning methods for few shot classification, we formulate the problem under the episodic training paradigm [4]. One N-way K-shot task sampled from an unknown task distribution P(T) consists of a support set and a query set:

T = (S, Q), \quad S = \left\{ (x_{c_n;k}, y_{c_n;k}) \mid k = 1, \ldots, K;\ n = 1, \ldots, N \right\}, \quad Q = \left\{ \hat{x}_1, \ldots, \hat{x}_{|Q|} \right\}    (6)

The support set S contains NK labeled samples. The query set Q contains the samples x̂ whose labels ŷ we need to predict based on S. During meta-training, the meta-loss is estimated on Q to optimize the model. During meta-testing, the performance of the meta-learning method is evaluated on Q, given the labeled S. The classes used in meta-training and meta-testing are disjoint, so the meta-learned model needs to learn knowledge that is transferable across tasks and to adapt itself quickly to novel tasks.

Our proposed approach follows the general framework of classification weights generation [13]. In this framework, a feature extractor outputs image feature embeddings, and the meta-learner generates the classification weights for different tasks.
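To make the episodic formulation concrete, the following sketch samples one N-way K-shot task T = (S, Q) from a labeled feature pool. This is an illustration of the setup above rather than the authors' code; the pool layout (a dict mapping class id to a list of d-dimensional feature vectors) and the 15 query samples per class are assumptions taken from the experimental protocol described later.

```python
import numpy as np

def sample_episode(pool, n_way=5, k_shot=1, n_query=15, rng=np.random):
    """Sample one N-way K-shot task T = (S, Q) from pool: {class_id: [d-dim feature, ...]}."""
    classes = rng.choice(list(pool.keys()), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for label, c in enumerate(classes):                 # relabel classes 0..N-1 inside the episode
        idx = rng.permutation(len(pool[c]))[: k_shot + n_query]
        feats = np.asarray(pool[c])[idx]
        support_x.append(feats[:k_shot]); support_y += [label] * k_shot
        query_x.append(feats[k_shot:]);   query_y += [label] * n_query
    S = (np.concatenate(support_x), np.array(support_y))   # NK labeled samples
    Q = (np.concatenate(query_x), np.array(query_y))       # query labels used only for the meta-loss / evaluation
    return S, Q
```

During meta-training the query labels are used to compute the meta-loss; during meta-testing they are used only to evaluate accuracy.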
3.3 Latent Embedding Optimization

Latent Embedding Optimization (LEO) [19] is the weights generation method most closely related to our work. In LEO, a latent code z is generated by an encoder h conditioned on the support set S, i.e., z = h(S), where h is instantiated as a relation network [20]. Classification weights w are decoded from z by a decoder l, w = l(z). In the inner loop, w is used to compute the loss (usually cross entropy) on the support set and z is then updated:

z' = z - \eta \nabla_z L_S(w)    (7)

where L_S indicates that the loss is evaluated on S only. The updated latent code z' is used to decode new classification weights w' with the generating function l. w' is adopted in the outer loop for the query set Q, and the objective function of LEO can then be written as

\min_{\theta} L_Q(w')    (8)

Here θ stands for the parameters of h and l, and we omit the regularization terms for clarity. LEO avoids updating the high-dimensional w in the inner loop by learning a lower-dimensional latent space, from which a sampled z can be used to generate w. The most significant difference between LEO and LWGIT is that we do not need inner updates to adapt the model. Instead, LWGIT is a feedforward network trained to maximize the mutual information so that it fits different tasks well. Moreover, LWGIT learns to generate optimal classification weights for each query sample, while LEO generates fixed weights conditioned on the support set within one task.
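The inner and outer loops of Equations 7 and 8 can be sketched as follows. This is a minimal PyTorch illustration under our own assumptions: `h` and `l` are placeholder encoder/decoder modules, the learning rate `eta` is arbitrary, and LEO's Gaussian sampling of z and its regularization terms are omitted.

```python
import torch
import torch.nn.functional as F

def leo_step(h, l, support_x, support_y, query_x, query_y, eta=1.0, inner_steps=1):
    """One LEO-style episode: adapt the latent code z on S, evaluate decoded weights on Q."""
    z = h(support_x)                                   # z = h(S), shape (N, latent_dim)
    for _ in range(inner_steps):
        w = l(z)                                       # classification weights, shape (N, d)
        loss_s = F.cross_entropy(support_x @ w.t(), support_y)   # L_S(w) on the support set
        (grad_z,) = torch.autograd.grad(loss_s, z, create_graph=True)
        z = z - eta * grad_z                           # Eq. (7): z' = z - eta * grad_z L_S(w)
    w_prime = l(z)                                     # w' decoded from the adapted latent code
    return F.cross_entropy(query_x @ w_prime.t(), query_y)       # Eq. (8): outer objective L_Q(w')
```

The outer loss is backpropagated through the inner gradient step (hence `create_graph=True`) to update the parameters of h and l.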
3.4 Weights Generation Using Information Theory

The framework of our proposed method is shown in Figure 1. Assume that we have a feature extractor, which can be a simple 4-layer ConvNet or a deeper ResNet. All the images in the sampled task T are processed by this feature extractor and represented as d-dimensional vectors, i.e., x_{c_n;k}, x̂ ∈ R^d. There are two paths to encode the task context and the individual query sample respectively, called the contextual path and the attentive path. The outputs of both paths are concatenated as input to the generator of classification weights. The generated classification weights are used not only to predict the label of x̂, but also to maximize the lower bound of the mutual information between the weights and the other variables, which is discussed in the following section.

Figure 1. The structure of the proposed LWGIT.

3.4.1 Attention Network

The encoding process includes two paths, namely the contextual path and the attentive path. The contextual path learns representations of the support set only, using a multi-head self-attention network f^{cp}_{sa} [18]. The outputs of the contextual path, X_{cp} ∈ R^{NK×d_h}, thus contain rich information about the task and are used later for weights generation.

Existing weights generation methods generate the classification weights conditioned on the support set only, which is equivalent to using only the contextual path. However, the classification weights generated in this way might be sub-optimal, because estimating exact and universal classification weights from the very few labeled samples in the support set is difficult and sometimes impossible. The generated weights usually lack adaptation to different query samples. We address this issue by introducing the attentive path, where each individual query example attends to the task context and is then used to generate the classification weights. Therefore, the classification weights are adaptive to different query samples and aware of the task context as well.

In the attentive path, a new multi-head self-attention network f^{ap}_{sa} is employed on the support set to encode the global task information. f^{ap}_{sa} is different from f^{cp}_{sa} in the contextual path: the self-attention network in the contextual path focuses on generating the classification weights, whereas the outputs of the self-attention here provide the value context that different query samples attend to in the following cross attention. Sharing the same self-attention network might limit the expressiveness of the learned representations in both paths. A cross attention network f^{ap}_{ca}, applied to each query sample and the task-aware support set, then produces X̂_{ap} ∈ R^{|Q|×d_h}.

We use multi-head attention with h heads in both paths. In one attention block, we produce h different sets of queries, keys and values. Multi-head attention is claimed to learn more comprehensive and expressive representations from h different subspaces [18].

3.4.2 Weights Generator

We replicate X_{cp} ∈ R^{NK×d_h} and X̂_{ap} ∈ R^{|Q|×d_h} |Q| and NK times respectively and reshape them, yielding X_{cp} ∈ R^{|Q|×NK×d_h} and X̂_{ap} ∈ R^{|Q|×NK×d_h}. These two tensors are concatenated to form X_{cp⊕ap} ∈ R^{|Q|×NK×2d_h}. X_{cp⊕ap} can be interpreted as each query sample having its own latent representation of the support set, used to generate specific classification weights that are both task-context aware and adaptive to the individual query sample.

X_{cp⊕ap} is decoded by the weights generator g: R^{2d_h} → R^{2d}. We assume that the classification weights follow a Gaussian distribution with diagonal covariance. g outputs the distribution parameters, and we sample the weights from the learned distribution during meta-training. The sampled classification weights are represented as W ∈ R^{|Q|×NK×d}. To reduce complexity, we average the K classification weights of each class to obtain W^{final} ∈ R^{|Q|×N×d}. Therefore, the i-th query sample has its specific classification weight matrix W^{final}_i ∈ R^{N×d}. The prediction for the query data is computed as X̂ W^{final⊤}. The support data X is replicated |Q| times and reshaped as X_s ∈ R^{|Q|×NK×d}, so the prediction for the support data can also be computed as X_s W^{final⊤}.

Besides the weights generator g, we have two additional decoders r_1: R^d → R^{d_h} and r_2: R^d → R^{d_h}. They both take the generated weights W as input and learn to reconstruct X_{cp} and X̂_{ap} respectively. The outputs of r_1 and r_2 are denoted as X^{cp}_{re}, X̂^{ap}_{re} ∈ R^{|Q|×NK×d_h}.
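The tensor bookkeeping in the weights generator is easier to follow in code. The sketch below traces the shapes from the two encoding paths to the per-query classification weights and logits; it is an illustration under our assumptions, with `g` a placeholder module that outputs the mean and log-variance of the diagonal Gaussian, and the support samples assumed to be ordered class-major.

```python
import torch

def generate_weights_and_logits(x_cp, x_ap_hat, g, support_x, query_x, n_way, k_shot):
    """x_cp: (NK, d_h) contextual path; x_ap_hat: (|Q|, d_h) attentive path;
    support_x: (NK, d); query_x: (|Q|, d)."""
    nk, d_h = x_cp.shape
    n_q = x_ap_hat.shape[0]
    cp = x_cp.unsqueeze(0).expand(n_q, nk, d_h)             # (|Q|, NK, d_h)
    ap = x_ap_hat.unsqueeze(1).expand(n_q, nk, d_h)         # (|Q|, NK, d_h)
    cp_ap = torch.cat([cp, ap], dim=-1)                     # X_{cp⊕ap}: (|Q|, NK, 2*d_h)

    mu, log_var = g(cp_ap).chunk(2, dim=-1)                 # g: R^{2 d_h} -> R^{2 d}
    w = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # sampled W: (|Q|, NK, d)

    # assumes support samples are ordered class-major: [class 0 x K, class 1 x K, ...]
    w_final = w.view(n_q, n_way, k_shot, -1).mean(dim=2)    # average over K shots: (|Q|, N, d)
    query_logits = torch.einsum('qd,qnd->qn', query_x, w_final)        # (|Q|, N)
    support_logits = torch.einsum('sd,qnd->qsn', support_x, w_final)   # (|Q|, NK, N)
    return w_final, query_logits, support_logits
```

Each query sample thus carries its own N×d weight matrix, which is exactly what the shuffling experiments in Section 4.3 probe.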
3.5 Information Theory

In this section, we perform the analysis for one query sample without loss of generality; the subscripts of the classification weights are omitted for clarity. In general, we use (x, y) and (x̂, ŷ) to denote support and query samples respectively.

Since the classification weights w generated by g are encoded from both the attentive path and the contextual path, one would expect them to be query-specific by construction. However, we show in the experiments that this alone does not significantly outperform a weights generator conditioned only on S, which implies that the classification weights generated from the two paths are not sensitive to different query samples. In other words, the information from the attentive path is not preserved well during weights generation.

To address this limitation, we propose to maximize the mutual information between the generated weights w and the support as well as the query data. The objective function can be described as

\max\ I((\hat{x}, \hat{y}); w) + \sum_{(x,y) \in S} I((x, y); w)    (9)

According to the chain rule of mutual information, we have

I((\hat{x}, \hat{y}); w) = I(\hat{x}; w) + I(\hat{y}; w \mid \hat{x})    (10)

Equation 10 applies to both kinds of terms in Equation 9, so the objective function can be written as

\max\ I(\hat{x}; w) + I(\hat{y}; w \mid \hat{x}) + \sum_{(x,y) \in S} \left[ I(x; w) + I(y; w \mid x) \right]    (11)
Directly computing the mutual information in Equation 11 is intractable, since the true posterior distributions such as p(ŷ|x̂, w) and p(x̂|w) are unknown. Therefore, we use Variational Information Maximization [21] to compute a lower bound of Equation 11. We use p_θ(x̂|w) to approximate the true posterior distribution, where θ represents the model parameters. As a result, we have

I(\hat{x}; w) = H(\hat{x}) - H(\hat{x} \mid w)
= H(\hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[\mathbb{E}_{\hat{x} \sim p(\hat{x} \mid w)}[\log p(\hat{x} \mid w)]\big]
= H(\hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[D_{KL}\big(p(\hat{x} \mid w) \,\|\, p_\theta(\hat{x} \mid w)\big) + \mathbb{E}_{\hat{x} \sim p(\hat{x} \mid w)}[\log p_\theta(\hat{x} \mid w)]\big]
\geq H(\hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[\mathbb{E}_{\hat{x} \sim p(\hat{x} \mid w)}[\log p_\theta(\hat{x} \mid w)]\big]    (12)

H(·) is the entropy of a random variable, and H(x̂) is a constant for given data. We can therefore maximize this lower bound as a proxy for the true mutual information. Similarly to I(x̂; w),

I(\hat{y}; w \mid \hat{x}) \geq H(\hat{y} \mid \hat{x}) + \mathbb{E}_{w \sim p(w \mid \hat{x}, S)}\big[\mathbb{E}_{\hat{y} \sim p(\hat{y} \mid \hat{x}, w)}[\log p_\theta(\hat{y} \mid \hat{x}, w)]\big]    (13)

\sum_{(x,y) \in S} I((x, y); w) \geq \sum_{(x,y) \in S} H((x, y)) + \mathbb{E}_{(x,y) \sim p((x,y) \mid w)}\big[\log p_\theta(x \mid w) + \log p_\theta(y \mid x, w)\big]    (14)

where p_θ(x̂|w) and p_θ(x, y|w) approximate the true posterior distributions p(x̂|w) and p(x, y|w).

Substituting these lower bounds back into Equation 11 and omitting the constant entropy terms and the expectation subscripts for clarity, we obtain the new objective function

\max\ \mathbb{E}[\log p_\theta(\hat{y} \mid \hat{x}, w)] + \mathbb{E}\big[\log p_\theta(y \mid x, w) + \log p_\theta(x \mid w) + \log p_\theta(\hat{x} \mid w)\big]    (15)

The first two terms maximize the log likelihood of the labels of both the query and the support data with respect to the network parameters, given the generated classification weights. This is equivalent to minimizing the cross entropy between the predictions and the ground truth. We assume that p_θ(x̂|w) and p_θ(x|w) are Gaussian distributions, and r_1 and r_2 approximate the means of these two Gaussians. Maximizing the log likelihood of the last two terms is therefore equivalent to reconstructing X_{cp} and X̂_{ap} with an L2 loss. Thus the loss function used to train the network can be written as

L = CE(\hat{y}_{pred}, \hat{y}) + \lambda_1 \sum_{y \in S} CE(y_{pred}, y) + \lambda_2 \sum_{x^{cp} \in S} \left\| x^{cp} - x^{cp}_{re} \right\|_2 + \lambda_3 \left\| \hat{x}^{ap} - \hat{x}^{ap}_{re} \right\|_2    (16)

CE stands for cross entropy. x^{cp} and x̂^{ap} are the inputs to the weights generator g, and x^{cp}_{re} ~ p_θ(x|w) and x̂^{ap}_{re} ~ p_θ(x̂|w) are the reconstructions of x^{cp} and x̂^{ap}. Since we convert the log likelihoods in Equation 15 into mean squared error or cross entropy terms in Equation 16, the value of each term in Equation 16 is not equal to the true log likelihood, and we have to decide the weight of each one. λ_1, λ_2, λ_3 are therefore hyper-parameters trading off the different terms. With the help of the last three terms, the generated classification weights are forced to carry information about the support data and the specific query sample.

In LEO [19], the inner-update loss is the cross entropy on the support data. If we merge the inner update into the outer loop, the loss becomes the sum of the first two terms in Equation 16. However, the weights generation in LEO does not involve specific query samples, which makes reconstructing X̂_{ap} impossible. In this sense, LEO can be regarded as a special case of our proposed method in which (1) only the contextual path exists and (2) λ_2 = λ_3 = 0.
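A compact sketch of the overall training loss in Equation 16, assuming the per-query logits and reconstructions have already been produced (for example by the generator sketch in Section 3.4.2 above). The reduction choices, the broadcasting of support targets over queries, and the fact that λ_1, λ_2, λ_3 are left as free parameters are our assumptions.

```python
import torch
import torch.nn.functional as F

def lwgit_loss(query_logits, query_y, support_logits, support_y,
               x_cp, x_cp_re, x_ap_hat, x_ap_re, lam1, lam2, lam3):
    """Eq. (16): query CE + lam1 * support CE + lam2 * contextual reconstruction
    + lam3 * attentive reconstruction."""
    n_q = query_logits.shape[0]
    ce_query = F.cross_entropy(query_logits, query_y)                       # CE(y_hat_pred, y_hat)
    ce_support = F.cross_entropy(                                           # support CE, one copy per query
        support_logits.reshape(-1, support_logits.shape[-1]),
        support_y.repeat(n_q))
    rec_cp = F.mse_loss(x_cp_re, x_cp.unsqueeze(0).expand_as(x_cp_re))      # ||x_cp - x_cp_re||
    rec_ap = F.mse_loss(x_ap_re, x_ap_hat.unsqueeze(1).expand_as(x_ap_re))  # ||x_ap - x_ap_re||
    return ce_query + lam1 * ce_support + lam2 * rec_cp + lam3 * rec_ap
```

Setting λ_1 = λ_2 = λ_3 = 0 recovers plain query cross entropy, which is the configuration ablated in Section 4.3.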
4 Experiments

4.1 Datasets and Protocols

We conduct experiments on miniImageNet [24] and tieredImageNet [26], two commonly used benchmark datasets, to compare with other methods and to analyze our model. Both datasets are subsets of the ILSVRC-12 dataset. miniImageNet contains 100 randomly sampled classes with 600 images per class. We follow the train/test split in [8], where 64 classes are used for meta-training, 16 for meta-validation, and 20 for meta-testing. tieredImageNet is a larger dataset, with 608 classes and 779,165 images in total, selected from 34 higher-level nodes in the ImageNet [27] hierarchy. 351 classes from 20 high-level nodes are used for meta-training, 97 classes from 6 nodes for meta-validation, and 160 classes from 8 nodes for meta-testing.

We use image features similar to LEO [19], whose authors trained a 28-layer Wide Residual Network [28] on the meta-training set. Each image is then represented by a 640-dimensional vector, which is used as the input to our model. For N-way K-shot experiments, we randomly sample N classes from the meta-training set, with K samples per class as the support set and 15 per class as the query set. As in other works, we train 5-way 1-shot and 5-way 5-shot models on both datasets. During meta-testing, 600 N-way K-shot tasks are sampled from the meta-testing set, and the average query-set accuracy is reported with a 95% confidence interval, as done in recent works [4, 19].
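The evaluation protocol above (mean accuracy over 600 sampled tasks with a 95% confidence interval) can be reproduced in a few lines. The normal-approximation interval used here is a common convention that we assume matches the reported numbers.

```python
import numpy as np

def evaluate(task_accuracies):
    """task_accuracies: per-task query accuracies over the 600 sampled N-way K-shot tasks."""
    acc = np.asarray(task_accuracies, dtype=np.float64)
    mean = acc.mean()
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))   # 95% CI of the mean (normal approximation)
    return mean, ci95   # e.g. print(f"{100*mean:.2f} ± {100*ci95:.2f}")
```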
4.2 Few Shot Image Classification

We compare the performance of our approach, LWGIT, with several state-of-the-art methods proposed in recent years on the two datasets. The results of MAML, Prototypical Nets, and Relation Nets on tieredImageNet are those evaluated in [16]. The result of Dynamic on miniImageNet with WRN-28-10 as the feature extractor is reported in [7]. The other results are taken from the corresponding original papers. We also list the backbone network of the feature extractor used by each method for reference. The results on tieredImageNet and miniImageNet are shown in Table 1 and Table 3 respectively.

Table 1. Accuracy comparison with other approaches on tieredImageNet.

Model                     Feature Extractor   5-way 1-shot   5-way 5-shot
MAML [4]                  Conv-4              51.67%         70.30%
Prototypical Nets [22]    Conv-4              53.31%         72.69%
Relation Nets [6]         Conv-4              54.48%         71.32%
TPN [16]                  Conv-4              59.91%         72.85%
MetaOptNet [23]           ResNet-12           65.81%         81.75%
LEO [19]                  WRN-28-10           66.33%         81.44%
LWGIT (ours)              WRN-28-10           67.46%         82.57%

Table 3. Accuracy comparison with other approaches on miniImageNet.

Model                     Feature Extractor   5-way 1-shot   5-way 5-shot
Matching Networks [24]    Conv-4              46.60%         60.00%
MAML [4]                  Conv-4              48.70%         63.11%
Meta LSTM [8]             Conv-4              43.44%         60.60%
Prototypical Nets [22]    Conv-4              49.42%         68.20%
Relation Nets [6]         Conv-4              50.44%         65.32%
SNAIL [25]                ResNet-12           55.71%         68.88%
TPN [16]                  ResNet-12           59.46%         –
MTL [9]                   ResNet-12           61.20%         75.50%
Dynamic [12]              WRN-28-10           60.06%         76.39%
Prediction [13]           WRN-28-10           59.60%         73.74%
DAE-GNN [7]               WRN-28-10           62.96%         78.85%
LEO [19]                  WRN-28-10           61.76%         77.59%
LWGIT (ours)              WRN-28-10           62.27%         78.74%

The top parts of Tables 1 and 3 list methods from different meta-learning categories, such as metric-based (Matching Networks, Prototypical Nets), gradient-based (MAML, MTL), and graph-based (TPN) approaches. The bottom parts list the classification weights generation approaches, including Dynamic, Prediction, DAE-GNN, LEO, and our proposed LWGIT.

LWGIT outperforms all the methods in the top parts of the two tables. Compared with the other classification weights generation methods in the bottom parts, LWGIT still shows very competitive performance, namely the best on tieredImageNet and close to the state of the art on miniImageNet. We note that all the classification weights generation methods use WRN-28-10 as the backbone network, which makes the comparison fair. In particular, LWGIT outperforms LEO in all settings.

4.3 Analysis

We perform a detailed analysis of LWGIT, shown in Table 2, and include the results of LEO [19] for reference. Generator in LEO denotes LEO without the inner update.

Table 2. Analysis of our proposed LWGIT. In the top half, the attentive path is removed to compare with LEO. In the bottom part, an ablation analysis with respect to the different components is provided.

                                       miniImageNet                  tieredImageNet
Model                                  5-way 1-shot   5-way 5-shot   5-way 1-shot   5-way 5-shot
LEO                                    61.76%         77.59%         66.33%         81.44%
Generator in LEO                       60.33%         74.53%         65.17%         78.77%
Generator conditioned on S only        61.02%         74.33%         66.22%         79.66%
Generator conditioned on S with IM     62.04%         77.54%         66.43%         81.73%
MLP encoding, λ1 = λ2 = λ3 = 0         58.95%         71.68%         63.92%         75.80%
MLP encoding                           62.26%         76.91%         65.84%         79.24%
λ1 = λ2 = λ3 = 0                       61.61%         74.14%         65.65%         79.93%
λ1 = λ2 = 0                            62.06%         74.18%         65.85%         80.42%
λ3 = 0                                 62.91%         77.88%         67.27%         81.67%
λ1 = 0                                 62.19%         74.21%         66.82%         80.61%
λ2 = λ3 = 0                            62.12%         77.65%         66.86%         81.03%
random shuffle in class                62.87%         77.48%         67.52%         82.55%
random shuffle between classes         61.20%         77.48%         66.55%         82.53%
LWGIT (ours)                           63.42%         78.44%         67.58%         82.04%

In the upper part of the table, we study the effect of the attentive path by implementing two generators that include only the contextual path during encoding. Generator conditioned on S only is trained with cross entropy on the query set, which is similar to Generator in LEO without the inner update; Generator conditioned on S with IM additionally adds the cross entropy loss and the reconstruction loss on the support set. Generator conditioned on S only achieves similar or slightly better results than Generator in LEO, which implies that self-attention is no worse than the relation networks used in LEO for modeling the task context. With information maximization, our generator obtains slightly better performance than LEO.

The effect of attention is investigated by replacing the attention modules with 2-layer MLPs, shown as MLP encoding: one MLP in the contextual path for the support set and another MLP in the attentive path for the query samples. Even without attention to encode the task context, MLP encoding achieves accuracy close to LEO, thanks to information maximization. However, if we set λ1 = λ2 = λ3 = 0 for MLP encoding, the performance drops significantly, which demonstrates the importance of maximizing the information.

We also conduct an ablation analysis with respect to λ1, λ2, λ3 to investigate the effect of information maximization. First, λ1, λ2, λ3 are all set to 0. In this case the accuracy is similar to Generator conditioned on S only, showing that the generated classification weights are not fitted to different query samples even with the attentive path. It can also be observed that maximizing the mutual information between the weights and the support data is more crucial, since λ1 = λ2 = 0 degrades accuracy significantly compared with λ3 = 0. We further investigate the relative importance of the classification on the support set versus its reconstruction: λ1 = 0 affects the performance noticeably, and we conjecture that support label prediction is more critical for information maximization.

The classification weights are generated specifically for each query sample in LWGIT. To verify this, we shuffle the classification weights between query samples within the same class and between different classes, to study whether the classification weights are adapted to different query samples. Assume there are T query samples per class in one task. W^{final} ∈ R^{|Q|×N×d} can be reshaped into R^{N×T×N×d}, and we shuffle this weight tensor along the first and second axes randomly, as sketched below.
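A minimal sketch of the two shuffling probes, assuming the query samples are ordered class-major so that the first axis of the reshaped weight tensor indexes the query class and the second indexes the query sample within that class.

```python
import torch

def shuffle_weights(w_final, n_way, t_query, between_classes):
    """w_final: (|Q|, N, d) per-query classification weights, |Q| = N * T (class-major order)."""
    w = w_final.view(n_way, t_query, n_way, -1).clone()
    if between_classes:
        w = w[torch.randperm(n_way)]                 # permute along the query-class axis
    else:
        for c in range(n_way):                       # permute query samples within each class
            w[c] = w[c, torch.randperm(t_query)]
    return w.view(n_way * t_query, n_way, -1)
```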
The results are shown as random shuffle between classes and random shuffle in class in Table 2. For the 5-way 1-shot experiments, the random shuffle between classes degrades the accuracy noticeably, while the random shuffle in class does not affect it much. This indicates that when the support data are very limited, the generated weights for query samples from the same class are very similar to each other, while distinct between classes. When there are more labeled data in the support set, the two kinds of random shuffle show very close or even identical results in the 5-way 5-shot experiments, both worse than the original ones. This implies that the generated classification weights are more diverse and specific to each query sample in the 5-way 5-shot setting. A possible reason is that a larger support set provides more knowledge to estimate the optimal classification weights for each query example.

5 Conclusion

In this work, we introduce latent weights generating using information theory (LWGIT) for few shot learning. LWGIT learns to generate optimal classification weights for each query sample within a task through two encoding paths. To guarantee this, the lower bound of the mutual information between the generated weights and the query as well as the support data is maximized. The effectiveness of LWGIT is demonstrated by state-of-the-art performance on two benchmark datasets.
REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[4] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1126–1135.
[5] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Advances in Neural Information Processing Systems, 2016, pp. 3981–3989.
[6] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, "Learning to compare: Relation network for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[7] S. Gidaris and N. Komodakis, "Generating classification weights with GNN denoising autoencoders for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 21–30.
[8] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[9] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele, "Meta-transfer learning for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 403–412.
[10] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang, "Finding task-relevant features for few-shot learning by category traversal," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1–10.
[11] Y. Lifchitz, Y. Avrithis, S. Picard, and A. Bursuc, "Dense classification and implanting for few-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9258–9267.
[12] S. Gidaris and N. Komodakis, "Dynamic few-shot visual learning without forgetting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4367–4375.
[13] S. Qiao, C. Liu, W. Shen, and A. L. Yuille, "Few-shot image recognition by predicting parameters from activations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7229–7238.
[14] T. Munkhdalai and H. Yu, "Meta networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2554–2563.
[15] Z. Chen, Y. Fu, Y.-X. Wang, L. Ma, W. Liu, and M. Hebert, "Image deformation meta-networks for one-shot learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8680–8689.
[16] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang, "Learning to propagate labels: Transductive propagation network for few-shot learning," arXiv preprint arXiv:1805.10002, 2018.
[17] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran, "Image transformer," arXiv preprint arXiv:1802.05751, 2018.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[19] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell, "Meta-learning with latent embedding optimization," arXiv preprint arXiv:1807.05960, 2018.
[20] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, "A simple neural network module for relational reasoning," in Advances in Neural Information Processing Systems, 2017, pp. 4967–4976.
[21] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[22] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[23] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, "Meta-learning with differentiable convex optimization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10657–10665.
[24] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., "Matching networks for one shot learning," in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[25] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, "A simple neural attentive meta-learner," arXiv preprint arXiv:1707.03141, 2017.
[26] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, "Meta-learning for semi-supervised few-shot classification," arXiv preprint arXiv:1803.00676, 2018.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[28] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.