=Paper=
{{Paper
|id=Vol-2744/paper30
|storemode=property
|title=Pairwise Ranking Distillation for Deep Face Recognition
|pdfUrl=https://ceur-ws.org/Vol-2744/paper30.pdf
|volume=Vol-2744
|authors=Mikhail Nikitin,Vadim Konushin,Anton Konushin
}}
==Pairwise Ranking Distillation for Deep Face Recognition==
Mikhail Nikitin 1,2, Vadim Konushin 1, and Anton Konushin 2 [0000-0002-6152-0021]

1 Video Analysis Technologies LLC, Moscow, Russia, {mikhail.nikitin,vadim}@tevian.ru
2 M.V. Lomonosov Moscow State University, Moscow, Russia, ktosh@graphics.cs.msu.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This work addresses the problem of knowledge distillation for the deep face recognition task. Knowledge distillation is known to be an effective way of model compression, which transfers knowledge from a high-capacity teacher to a lightweight student. The knowledge, and the way it is distilled, can be defined in different ways depending on the problem at hand. Since face recognition is a typical metric learning task, we propose to perform knowledge distillation at the score level. Specifically, for any pair of matching scores computed by the teacher, our method forces the student to preserve the order of the corresponding matching scores. We evaluate the proposed pairwise ranking distillation (PWR) approach on several face recognition benchmarks, covering both the face verification and the face identification scenario. Experimental results show that PWR not only improves over the baseline method by a large margin, but also outperforms other score-level distillation approaches.

Keywords: Knowledge Distillation, Model Compression, Face Recognition, Deep Learning, Metric Learning.

1 Introduction

Face recognition systems are widely used today, and their quality keeps improving in order to meet increasing security requirements. Nowadays the majority of computer vision tasks, including facial recognition, are solved with deep neural networks, and there is a clear pattern: for a fixed training dataset, a network with more layers and parameters outperforms its lightweight counterpart. As a result, the most powerful models consume a large amount of memory and computational resources, which makes their deployment challenging. Indeed, switching to a model of higher capacity usually reduces inference speed, which matters in many real-life scenarios. For example, if the model is supposed to run on a resource-limited embedded device, or to be used in a video surveillance system with thousands of queries per second, it is often necessary to replace a large network with a smaller one in order to satisfy the limitations of the available computational resources. This creates a strong demand for methods that reduce model complexity while preserving performance as much as possible.

In general, there are two main strategies to reduce the complexity of a deep neural network: one is to develop a new lightweight architecture [1–3], and another is to compress an already trained model. Network compression can be done in many different ways, including parameter quantization [4, 5], weight pruning [6, 7], low-rank factorization [8, 9], and knowledge distillation. All of these compression methods, except for knowledge distillation, focus on reducing model size in terms of parameters while keeping the network architecture roughly the same.
On the contrary, knowledge distillation, whose main idea is to transfer the knowledge encoded in one network to another, is considered a more general approach, since it does not impose any restrictions on the architecture of the output network.

Therefore, in this paper we propose a new knowledge distillation technique for efficient computation of face recognition embeddings. Our method builds on the pairwise learning-to-rank approach and applies it on top of the matching scores between face embeddings. Specifically, we treat the ranking of the scores produced by a teacher network as a ground-truth label, and use it to detect and penalize mistakes in the pairwise ranking of the student's matching scores. Using the LFW [32], CPLFW [33], AgeDB [34], and MegaFace [35] datasets, we show that the proposed distillation method can significantly improve face recognition quality compared to the conventional way of training the student network. Moreover, we found that our pairwise ranking distillation technique outperforms other score-based distillation approaches by a large margin.

2 Related Work

In [14] a dichotomy of distillation approaches was proposed. It is based on how the knowledge is defined, and the authors distinguish individual and relational knowledge distillation methods.

2.1 Individual knowledge distillation

Individual knowledge distillation (IKD) methods consider each input object independently and force the student network to mimic the teacher's representation of that object. Let F_T(x) and F_S(x) denote the feature representations of the teacher and the student for input x, respectively. Then, for a training dataset χ = {x_i}_{i=1}^M, the IKD objective function can be formulated as follows:

L_{IKD} = \sum_{x_i \in \chi} l(F_T(x_i), F_S(x_i)),     (1)

where l is a loss function that penalizes the difference between the teacher and the student.

The knowledge in IKD methods is determined by the function F(x), which can be defined in different ways; some examples are given below. The authors of [10] and [11] describe the knowledge in terms of a label distribution, so that the student uses the output of the teacher's classifier as a ground-truth soft label vector. The motivation behind this approach lies in the observation that an input image sometimes contains several objects and can be better described by a mixture of labels. Another approach was presented in [12], where the authors propose hint connections, which go from teacher to student and transfer hidden-layer activations. Depending on the depth and the spatial resolution of the features where such distillation is applied, it makes the student mimic the teacher at different levels of abstraction. However, over-regularization of hidden layers can lead to poor quality, so hints are usually applied only to the embedding (pre-classification) layer [16, 21]. In order to successfully guide the student even at early layers, a modification of the hints idea was proposed in [13]: the transfer of activations is replaced with the transfer of spatial attention maps, i.e. instead of trying to reproduce the teacher's feature representation as is, the student only learns to attend to the same areas of the input image.

Individual knowledge distillation methods rely on the clear idea of imitating the teacher's output. However, due to the gap in model capacity between teacher and student, it may be difficult for the student to learn a mapping function that is similar or even identical to the teacher's. Relational knowledge distillation addresses this problem and considers the knowledge from another point of view.
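To make the generic IKD objective (1) concrete, the following is a minimal NumPy sketch, assuming that the representation F is an embedding vector and that the per-sample loss l is the squared Euclidean distance; both choices are illustrative rather than prescribed by any particular IKD method.

<pre>
import numpy as np

def ikd_loss(teacher_feats: np.ndarray, student_feats: np.ndarray) -> float:
    """Individual knowledge distillation objective (1): a per-sample penalty
    between teacher and student representations, summed over the dataset.

    teacher_feats, student_feats: arrays of shape (N, D), one row per input x_i.
    Here l(t, s) = ||t - s||^2 serves only as an example of the per-sample loss.
    """
    diff = teacher_feats - student_feats
    return float(np.sum(diff ** 2))
</pre>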
2.2 Relational knowledge distillation

Relational knowledge distillation (RKD) methods define the knowledge using a group of objects rather than a single object. Each group of objects forms a structure in the representational space, which can be used as a unit of knowledge. In other words, in RKD methods the student learns to reproduce the structure of the teacher's latent space, instead of the exact feature representations of individual objects. To describe the relative structure of n input examples, a relational function ψ, which maps an n-tuple of embeddings to a scalar value, is used. Putting t_i = F_T(x_i) and s_i = F_S(x_i), the RKD objective function is defined as

L_{RKD} = \sum_{(x_1, x_2, \ldots, x_n) \in \chi^n} l(\psi(t_1, t_2, \ldots, t_n), \psi(s_1, s_2, \ldots, s_n)).     (2)

According to the above equation, the choice of the relational function ψ defines a particular RKD method. The simplest and most obvious approach considers pairs of objects and encodes the structure of the space in terms of the Euclidean distance between two feature embeddings. Such an approach, with minor modifications, is used in [14] and [16]. A similar idea was recently adopted in [18], where the authors use the correlation between the teacher's and the student's outputs as the pairwise relational function. A triplet-based RKD approach was also proposed in [14]: three points in the representational space form an angle, and its value can be used to describe the structure of the triplet. Another approach, which can also be considered relational knowledge distillation, although it does not precisely follow the RKD loss (2), was presented in [15]. Its main idea is to reformulate knowledge distillation as a list-wise learning-to-rank problem, where the teacher's list of matching scores is used as the ranking to be learned by the student.
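As an illustration of the relational objective (2) with the simplest pairwise relational function, the sketch below uses the Euclidean distance between two embeddings as ψ and a squared difference as l. This is a simplified rendering of the distance-based idea of [14, 16]; details such as distance normalization and the Huber penalty used there are omitted.

<pre>
import numpy as np

def pairwise_distances(feats: np.ndarray) -> np.ndarray:
    """All pairwise Euclidean distances for a batch of embeddings of shape (N, D)."""
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    return np.sqrt(np.maximum(d2, 0.0))

def rkd_distance_loss(teacher_feats: np.ndarray, student_feats: np.ndarray) -> float:
    """Distance-based RKD (objective (2) with psi = pairwise distance):
    the student is penalized for deviating from the teacher's pairwise structure."""
    dist_t = pairwise_distances(teacher_feats)
    dist_s = pairwise_distances(student_feats)
    off_diag = ~np.eye(len(dist_t), dtype=bool)   # skip the trivial i == i pairs
    return float(np.mean((dist_t[off_diag] - dist_s[off_diag]) ** 2))
</pre>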
2.3 Knowledge distillation for Face Recognition

During the first several years of the development of knowledge distillation methods, experiments were carried out mostly on small classification problems. That is why the application of such techniques to the face recognition problem has not been fully investigated yet, and only a few studies have been published in this area.

Some recent works [21, 22] follow the idea of hint connections and impose constraints on the discrepancy between the teacher's and the student's embeddings. However, in order to better fit the angular nature of the conventional losses used to train face recognition networks [28, 29, 31], the authors put the penalty on cosine similarity instead of Euclidean distance. A more specific approach, oriented towards metric learning tasks, was proposed in [19]. It utilizes the idea that a high-capacity teacher network can better capture subtle differences between images, and uses this observation to adaptively choose the margin value in the triplet loss function. In [17] the authors study knowledge distillation in the context of fully convolutional networks (FCN). They notice that inference efficiency can be boosted not only by lowering model complexity, but also by decreasing the size of the input image. Following this idea, they keep the same FCN architecture and train the student on a downsampled version of the original dataset with the help of distillation guidance from the teacher's embeddings computed on high-resolution input.

As can be seen, the majority of existing distillation methods for face recognition follow the IKD approach, while the effect of RKD has not been investigated yet. In this paper, we propose a new relational knowledge distillation technique for face recognition. Our method is inspired by the works [14], [16] and [15], and its main idea is to relax the objective function (2) so that the loss is computed only for those pairs of relational function values that violate the teacher's ranking.

3 Pairwise ranking distillation

Facial recognition systems usually include a gallery of target face images, and each incoming image is compared against it. The gallery image with the maximum matching score is then considered a candidate for a correct match. This leads to the idea that only the relative ordering of matching scores is important, rather than their absolute values. In this paper we propose an approach that adapts pairwise ranking techniques to the knowledge distillation problem. More specifically, our method considers pairs of relational function values, and its goal is to minimize the number of their inversions.

Let X^T = {t_i}_{i=1}^N and X^S = {s_i}_{i=1}^N be the feature representations computed by the teacher and student networks for an input batch X = {x_i}_{i=1}^N, respectively. For both teacher and student we compute the values Ψ^T = {ψ_i^T}_{i=1}^M and Ψ^S = {ψ_i^S}_{i=1}^M of the relational function ψ over all possible n-tuples of feature embeddings. Then the pairwise ranking (PWR) distillation loss is given by:

L_{PWR}(X^S, X^T) = \sum_{i,j} \mathbf{1}[\psi_i^T > \psi_j^T] \, l_{inv}(\psi_i^S, \psi_j^S),     (3)

where l_inv is the function that penalizes pairwise ranking inversions. As can be seen from the above equation, pairwise ranking knowledge distillation is fully defined by the relational function ψ and the inversion loss function l_inv.

3.1 Relational function

In this work, we fix the relational function ψ to be a function of two inputs and choose it so that the value ψ(x, y) characterizes the similarity between objects x and y. To be precise, we examined Euclidean distance and cosine similarity as the relational function and found that cosine similarity performs slightly better. (This could be explained by the fact that we use an angular margin loss as the base loss to train our face recognition models; however, the other RKD methods we compare with do not gain any advantage from the cosine similarity relational function.) It is worth noting that one can choose any function that describes the relationship of a set of points in the embedding space. For example, the RKD-A [14] function, which measures the angle formed by three objects, is also a valid choice.
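Putting the pieces together, the following sketch computes the PWR loss (3) for a batch of embeddings, with cosine similarity as the relational function ψ and the inversion loss l_inv passed in as a parameter (the vanilla difference loss (4) from the next subsection is used as the default). This is an illustrative NumPy rendition, not the MXNet code used in our experiments.

<pre>
import numpy as np

def cosine_similarities(feats: np.ndarray) -> np.ndarray:
    """Relational function psi: cosine similarity for every pair of embeddings."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def pwr_loss(teacher_feats: np.ndarray, student_feats: np.ndarray, l_inv=None) -> float:
    """Pairwise ranking distillation loss (3).

    For every pair (i, j) of relational values that the teacher orders as
    psi_i^T > psi_j^T, the student incurs the penalty l_inv(psi_i^S, psi_j^S).
    """
    if l_inv is None:
        # vanilla difference loss (4)
        l_inv = lambda s_i, s_j: np.maximum(s_j - s_i, 0.0)
    iu = np.triu_indices(len(teacher_feats), k=1)      # unique embedding pairs
    psi_t = cosine_similarities(teacher_feats)[iu]     # teacher relational values
    psi_s = cosine_similarities(student_feats)[iu]     # student relational values
    order = psi_t[:, None] > psi_t[None, :]            # teacher ranking: psi_i > psi_j
    penalties = l_inv(psi_s[:, None], psi_s[None, :])  # l_inv for every pair (i, j)
    return float(np.sum(penalties[order]))
</pre>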
3.2 Pairwise inversion loss function

Difference loss. The most obvious way to keep the desired ranking of a pair of items is to penalize it as soon as the correct order is violated. For a pair of scalar values (x, y) with ground-truth ranking x > y, the wrong order can be detected by analyzing the difference of the elements: if y − x is greater than zero, the elements are misordered. Based on this observation we propose the difference loss as the simplest option for a pairwise inversion loss function:

l_{inv}(\psi_i^S, \psi_j^S) = \max(\psi_j^S - \psi_i^S, 0).     (4)

In order to make the difference loss more flexible, we add a non-linearity in the region of values where misranking happens (ψ_j^S > ψ_i^S). This lets us change the behaviour of the loss function and choose where to put more attention: on small or on large mistakes. One easy way to add non-linearity to a function is to raise it to a power, which results in the power difference loss:

l_{inv}(\psi_i^S, \psi_j^S) = \max(\psi_j^S - \psi_i^S, 0)^p.     (5)

Setting p > 1 lowers the penalty for marginal mistakes and increases the penalty for large ones, while setting p < 1 results in the opposite behaviour (see Figure 1). Note that the vanilla difference loss (4) is a special case of the power difference loss (p = 1.0). Another option to make the difference loss non-linear is to put it inside the exponential function. We define the exponential difference loss as:

l_{inv}(\psi_i^S, \psi_j^S) = \max(\exp[\beta(\psi_j^S - \psi_i^S)] - 1, 0).     (6)

It is similar to the power difference loss with p > 1, but its β parameter can be chosen so that the loss curve is more flat (see Figure 2).

Fig. 1. Power difference loss as a function of ψ_i − ψ_j for p = 0.25, 0.5, 1.0, 1.5, 2.0.
Fig. 2. Exponential difference loss as a function of ψ_i − ψ_j for β = 0.5, 1.0.

Margin. The next modification of the difference loss we propose is a margin term, which is quite common in metric learning tasks [23, 24]. Introducing a positive margin not only makes the student learn the same ranking for pairs of objects as the teacher, but also forces the distance between the objects to be no less than the margin value. This modification can be applied to any of the losses discussed above, but for simplicity we consider only the case of the vanilla difference loss (4). The most straightforward approach is to choose the margin value manually and use it throughout the whole training process:

\alpha = \mathrm{const}, \quad l_{inv}(\psi_i^S, \psi_j^S) = \max(\psi_j^S - \psi_i^S + \alpha, 0).     (7)

Figure 3 shows how the loss curve looks for different values of the margin α. However, in most cases it is difficult to set the margin so that it does not over-regularize training. To cope with this problem we propose to choose the margin dynamically, taking into account the scale of the objects to be ranked. Specifically, for each batch of objects X we estimate the standard deviation of the teacher's relational function values and use it as the margin:

\alpha_X = \mathrm{std}(\Psi^T), \quad l_{inv}(\psi_i^S, \psi_j^S) = \max(\psi_j^S - \psi_i^S + \alpha_X, 0).     (8)

One more option we investigated is also adaptive, but here the margin is selected individually for each pair of objects. It is again based on the values of the teacher's relational function and is computed as their difference:

\alpha_{ij} = \psi_i^T - \psi_j^T, \quad l_{inv}(\psi_i^S, \psi_j^S) = \max(\psi_j^S - \psi_i^S + \alpha_{ij}, 0).     (9)

The idea behind this approach is the following: the student learns to preserve the order of the objects, while keeping the distance between them at least as large as the teacher's. In some respects it is similar to the RKD-D approach [14], but here we optimize a lower bound on the teacher-student difference, instead of forcing the student to completely replicate the teacher's output.
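For reference, the inversion losses (4)-(6) and the margin options (7)-(9) can be written compactly as follows; this is an illustrative NumPy sketch, with the teacher's relational values needed only for the adaptive margins.

<pre>
import numpy as np

def diff_loss(psi_s_i, psi_s_j, margin=0.0):
    """Difference loss, optionally with a margin (Eqs. (4) and (7))."""
    return np.maximum(psi_s_j - psi_s_i + margin, 0.0)

def power_diff_loss(psi_s_i, psi_s_j, p=2.0):
    """Power difference loss (5); p = 1 recovers the vanilla difference loss."""
    return np.maximum(psi_s_j - psi_s_i, 0.0) ** p

def exp_diff_loss(psi_s_i, psi_s_j, beta=1.0):
    """Exponential difference loss (6)."""
    return np.maximum(np.exp(beta * (psi_s_j - psi_s_i)) - 1.0, 0.0)

# Adaptive margins, computed from the teacher's relational values.
def teacher_std_margin(psi_t_batch):
    """'teacher-std' (8): one margin per batch, the std of the teacher's values."""
    return float(np.std(psi_t_batch))

def teacher_diff_margin(psi_t_i, psi_t_j):
    """'teacher-diff' (9): a per-pair margin equal to the teacher's own gap."""
    return psi_t_i - psi_t_j
</pre>

Either adaptive margin is then passed to diff_loss through its margin argument, e.g. diff_loss(psi_s_i, psi_s_j, margin=teacher_diff_margin(psi_t_i, psi_t_j)).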
RankNet for knowledge distillation. RankNet [25] is a classical learning-to-rank approach. It formulates ranking as a pairwise classification problem, where each pair is considered independently, and the goal of the method is to minimize the number of inversions. This perfectly fits our formulation of pairwise ranking distillation, so we adapt RankNet to solve it. For each pair of objects, RankNet defines the probability of correct ranking and uses cross-entropy as the loss function:

P(\psi_i^S > \psi_j^S) = \frac{1}{1 + \exp(-\beta(\psi_i^S - \psi_j^S))},     (10)

l_{inv}(\psi_i^S, \psi_j^S) = -\log P(\psi_i^S > \psi_j^S) = \log(1 + \exp(-\beta(\psi_i^S - \psi_j^S))).     (11)

As can be seen from Figure 4, the RankNet loss function looks like a smooth version of the difference loss with margin. The parameter β controls how sharp the probability function is, and increasing it results in paying more attention to the region of values that corresponds to ranking mistakes.

Fig. 3. Difference loss with margin as a function of ψ_i − ψ_j for α = 0.0, 0.2, 0.4.
Fig. 4. RankNet loss as a function of ψ_i − ψ_j for β = 1.0, 2.5, 5.0.
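A minimal sketch of the RankNet-style inversion loss (10)-(11), again as illustrative NumPy code: the probability that the student ranks the pair correctly is modelled with a sigmoid, and its negative log-likelihood is the penalty. The logaddexp form is used only for numerical stability and is mathematically identical to (11).

<pre>
import numpy as np

def ranknet_prob_correct(psi_s_i, psi_s_j, beta=1.0):
    """Probability of correct ranking, Eq. (10)."""
    return 1.0 / (1.0 + np.exp(-beta * (psi_s_i - psi_s_j)))

def ranknet_inv_loss(psi_s_i, psi_s_j, beta=1.0):
    """RankNet inversion loss, Eq. (11): -log P(psi_i^S > psi_j^S).
    np.logaddexp(0, x) computes log(1 + exp(x)) without overflow."""
    return np.logaddexp(0.0, -beta * (psi_s_i - psi_s_j))
</pre>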
4 Experiments

We evaluate the proposed PWR distillation approach on the face recognition task. Throughout this section we refer to PWR with the vanilla difference loss (4) as PWR-Diff, PWR with the exponential difference loss (6) as PWR-Exp, and PWR based on RankNet (11) as PWR-RankNet. If a margin is used, it is specified in parentheses; for example, pairwise ranking distillation based on the exponential difference loss with the adaptive per-pair margin is denoted PWR-Exp (teacher-diff).

To demonstrate the robustness of the proposed approach, we compare it with other relational knowledge distillation methods, namely DarkRank [15] and both RKD [14] variants: distance-based (RKD-D) and angle-based (RKD-A). Note that knowledge distillation based on the equality of the corresponding matching scores of teacher and student was also investigated in [16]; for the sake of simplicity, we refer to this approach as RKD-D in this section. Regarding the DarkRank method, it was noticed in [16] that the soft version of DarkRank has numerical stability issues, which severely limit the batch size that can be used during training. At the same time, the authors report that DarkRank-hard demonstrates similar results on a range of metric learning problems, while it can be easily computed for any batch size. That is why we use the hard version of DarkRank in our experiments.

4.1 Datasets

MS-Celeb-1M [30] is used to train all our models. Originally it contains 10 million face images of nearly 100,000 identities. However, since the dataset was collected in a semi-automatic manner, a significant portion of it consists of noisy images or incorrect identity labels. That is why we use the cleaned version of MS-Celeb-1M provided by [31], which consists of 5.8 million photos of 85,000 subjects.

We evaluate the trained models on LFW [32], CPLFW [33], AgeDB [34], and MegaFace [35]. The first three datasets employ the face verification scenario, while MegaFace also provides an evaluation protocol for face identification.

Labeled Faces in the Wild (LFW) consists of 13,233 in-the-wild face images of 5,749 identities. Besides the images, a list of 6,000 matching pairs (3,000 positive and 3,000 negative) is provided, together with their 10-fold split for cross-validation.

Cross-Pose LFW (CPLFW) uses an evaluation protocol similar to LFW with the same total number of comparisons. However, its matching pairs are much more difficult: faces in positive pairs show substantial pose variations, while negative pairs are constructed from identities of the same race and gender.

AgeDB contains 16,488 face images of 568 subjects and also adopts a 10-fold cross-validation protocol. The dataset is designed for age-invariant face verification, so all photos have not only identity but also age labels. In our experiments we follow the AgeDB-30 protocol, where the faces in matching pairs have an age difference of 30 years. Besides the age factor, other facial variations (pose, illumination, expression) are also included.

MegaFace is the most challenging benchmark in the area to date. It evaluates face recognition algorithms in the presence of large-scale distractors: the gallery set includes 1 million images of 690,000 identities, while the probe set consists of 100,000 photos of 530 unique identities from the FaceScrub [36] dataset. Results for both face identification and face verification are reported.

All faces are aligned by five facial landmarks detected using MTCNN [37] and then cropped to the size of 112 × 112.

4.2 Experimental setup

In all experiments we use ResNet18 [26] as the student model and ResNet50 [26] as the teacher model. To obtain face embeddings we append a fully-connected layer on top of the last convolutional layer; both teacher and student models have an embedding size of 512.

We conduct our experiments using the MXNet [27] deep learning framework on a machine with 6 NVIDIA GeForce GTX 1080 Ti GPUs. The batch size is fixed to 552 (92 × 6) both for the reference models and for the student models during knowledge distillation. The stochastic gradient descent (SGD) optimizer is used in all experiments. The learning rate is initially set to 0.1 and is divided by 10 every 2 epochs; the total number of epochs is 13. Baseline models are trained from scratch, while student models in all distillation experiments are initialized with the pretrained weights of the baseline model.

The teacher model and the baseline student model are trained using the CosFace [28, 29] loss. CosFace is an angular margin classification loss, which is widely used in face recognition and other metric learning problems. We compared it with ArcFace [31], another popular angular loss function, and found that CosFace provides slightly better baseline results. Following [29], we set its parameters to margin = 0.35 and scale = 64.0.

We found that the investigated distillation losses have different convergence abilities on the face recognition task. Specifically, some of them can be used alone to successfully train the student network, while others demonstrate sufficient performance only when combined with a base classification loss. In addition, we examined whether student performance can be further boosted with the help of the HKD (Hinton's Knowledge Distillation) [11] loss. As a result, the overall objective function is defined as

L = \alpha L_{KD} + \beta L_{CosFace} + \gamma L_{HKD},     (12)

where L_KD stands for the relational knowledge distillation loss (RKD-D, RKD-A, DarkRank, PWR), and α, β and γ are the coefficients of the corresponding loss terms. When the CosFace loss is used to stabilize the distillation training process, its weight β is always set to 1.0, and the distillation weight α is chosen depending on the type of L_KD used. When HKD is used, its softmax temperature is set to 4.0, and the combination with CosFace is done with β = 0.7 and γ = 0.3. The weight α of the relational knowledge distillation term was chosen empirically, following the recommendations of the original papers: we set α = 100 for RKD-D, α = 200 for RKD-A, and α = 1 for DarkRank. Concerning our PWR distillation losses, we found α = 100 to be a good option for PWR-Diff and PWR-Exp, while for PWR-RankNet it should be smaller (we use α = 15).
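For clarity, the overall objective (12) with the weights reported above amounts to the following weighted sum; the individual loss values are placeholders for whatever implementations are used.

<pre>
def total_loss(l_kd, l_cosface=0.0, l_hkd=0.0, alpha=100.0, beta=0.0, gamma=0.0):
    """Overall training objective (12): weighted sum of the relational
    distillation loss, the CosFace loss, and the HKD loss.

    Example settings from this section: PWR-Diff / PWR-Exp used alone ->
    alpha=100, beta=gamma=0; RKD-D combined with CosFace and HKD ->
    alpha=100, beta=0.7, gamma=0.3.
    """
    return alpha * l_kd + beta * l_cosface + gamma * l_hkd
</pre>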
4.3 Evaluation results

We follow the standard protocols for all testing datasets. For LFW, CPLFW and AgeDB-30 we report verification accuracy estimated with 10-fold cross-validation. Evaluation on MegaFace includes two protocols, verification and identification: we report TPR@FPR = 1e-6 for verification, and accuracy at rank-1 and rank-10 for identification. The evaluation results are presented in Table 1.

In our experiments we found that the RKD-D, RKD-A and DarkRank methods fail to achieve even the baseline quality when used alone. That is why for these methods we only report results for the experiments where they were trained together with the base classification loss. On the contrary, the proposed PWR approach demonstrates a quality improvement when used alone, while adding the CosFace and HKD loss terms slightly degrades recognition quality. Therefore, the effect of the inversion loss function used in PWR distillation was explored only in the stand-alone setting.

Table 1. Experimental results.

Model | LFW | CPLFW | AgeDB-30 | MegaFace Ver | MegaFace Id(1) | MegaFace Id(10)
Teacher, CosFace | 0.9962 | 0.9205 | 0.9775 | 0.9745 | 0.9698 | 0.9821
Student, CosFace | 0.9943 | 0.8928 | 0.9665 | 0.9497 | 0.9380 | 0.9663
CosFace + RKD-D | 0.9947 | 0.8895 | 0.9633 | 0.9430 | 0.9281 | 0.9632
CosFace + RKD-A | 0.9945 | 0.8867 | 0.9652 | 0.9417 | 0.9287 | 0.9619
CosFace + RKD-DA | 0.9940 | 0.8870 | 0.9670 | 0.9421 | 0.9280 | 0.9629
CosFace + DarkRank | 0.9947 | 0.8738 | 0.9672 | 0.9421 | 0.9321 | 0.9631
CosFace + HKD | 0.9953 | 0.8810 | 0.9635 | 0.9366 | 0.9239 | 0.9604
CosFace + HKD + RKD-D | 0.9953 | 0.8868 | 0.9675 | 0.9486 | 0.9352 | 0.9648
CosFace + HKD + RKD-A | 0.9955 | 0.8920 | 0.9715 | 0.9526 | 0.9415 | 0.9676
CosFace + HKD + RKD-DA | 0.9958 | 0.8928 | 0.9688 | 0.9528 | 0.9404 | 0.9680
CosFace + HKD + DarkRank | 0.9953 | 0.8737 | 0.9690 | 0.9400 | 0.9294 | 0.9618
PWR-Diff (0.1) | 0.9948 | 0.8865 | 0.9715 | 0.9502 | 0.9393 | 0.9675
PWR-Diff (teacher-std) | 0.9945 | 0.8885 | 0.9727 | 0.9540 | 0.9433 | 0.9695
PWR-Diff (teacher-diff) | 0.9958 | 0.8898 | 0.9742 | 0.9533 | 0.9433 | 0.9690
PWR-Exp (0.1) | 0.9955 | 0.8907 | 0.9710 | 0.9497 | 0.9400 | 0.9681
PWR-Exp (teacher-std) | 0.9958 | 0.8807 | 0.9710 | 0.9554 | 0.9439 | 0.9693
PWR-Exp (teacher-diff) | 0.9963 | 0.8942 | 0.9728 | 0.9538 | 0.9433 | 0.9695
PWR-RankNet | 0.9953 | 0.8870 | 0.9715 | 0.9508 | 0.9402 | 0.9674

As can be seen from Table 1, most of the methods increase student accuracy on LFW and AgeDB, but the magnitude of the improvement differs, especially on AgeDB, where the proposed PWR approach beats all other distillation methods by a large margin. At the same time, only one distillation method, PWR-Exp (teacher-diff), managed to boost accuracy on the CPLFW dataset. This can possibly be explained by the fact that CPLFW contains images with large pose variations, while the faces in the training dataset are mostly frontal, and even the teacher's accuracy on CPLFW is relatively low.

Considering the MegaFace benchmark results, it is clear that the RKD-D and DarkRank methods cannot provide any improvement in recognition quality, even when used together with auxiliary losses. Among the compared methods other than PWR, only the combination of RKD-A with HKD provides better results than the baseline model. At the same time, all investigated pairwise ranking distillation approaches substantially improve MegaFace recognition quality.

The evaluation results demonstrate that the proposed family of PWR distillation techniques provides methods that outperform other relational knowledge distillation approaches on the face recognition task.
However, some modifications of PWR are better than others. For example, considering the non-linearity of the loss function, one can see that in most cases PWR-Exp shows slightly better results than PWR-Diff. As for the margin value, the experiments where it was fixed (PWR-Diff (0.1), PWR-Exp (0.1), PWR-RankNet) perform worse than those where an adaptive margin was used. As a result, we can conclude that the best relational distillation method for face recognition at the moment is PWR-Exp with an adaptively chosen margin (teacher-std or teacher-diff).

5 Conclusion

We propose a new relational knowledge distillation technique for deep face recognition, which is based on pairwise ranking of matching scores. During training of a student network, our PWR approach considers pairs of relational function values and penalizes those whose order is incorrect compared to the teacher's ranking. Experiments show that the proposed method significantly outperforms other relational distillation approaches on a range of facial recognition benchmarks.

References

1. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
2. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
3. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)
4. Han, S., Mao, H., Dally, W. J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)
5. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18(1), 6869–6898 (2017)
6. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)
7. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440 (2016)
8. Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems, pp. 1269–1277 (2014)
9. Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014)
10. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)
11. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
12. Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014)
13. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)
14. Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976 (2019)
15. Chen, Y., Wang, N., Zhang, Z.: DarkRank: Accelerating deep metric learning via cross sample similarities transfer. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
16. Yu, L., Yazici, V. O., Liu, X., Weijer, J. V. D., Cheng, Y., Ramisa, A.: Learning metrics from teachers: Compact networks for image embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2907–2916 (2019)
17. Karlekar, J., Feng, J., Wong, Z. S., Pranata, S.: Deep face recognition model compression via knowledge transfer and distillation. arXiv preprint arXiv:1906.00619 (2019)
18. Peng, B., Jin, X., Liu, J., Li, D., Wu, Y., Liu, Y., Zhou, S., Zhang, Z.: Correlation congruence for knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5007–5016 (2019)
19. Feng, Y., Wang, H., Yi, D. T., Hu, R.: Triplet distillation for deep face recognition. arXiv preprint arXiv:1905.04457 (2019)
20. Wang, M., Liu, R., Hajime, N., Narishige, A., Uchida, H., Matsunami, T.: Improved knowledge distillation for training fast low resolution face recognition model. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
21. Yan, M., Zhao, M., Xu, Z., Zhang, Q., Wang, G., Su, Z.: VarGFaceNet: An efficient variable group convolutional neural network for lightweight face recognition. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
22. Duong, C. N., Luu, K., Quach, K. G., Le, N.: ShrinkTeaNet: Million-scale lightweight face recognition via shrinking teacher-student networks. arXiv preprint arXiv:1905.10620 (2019)
23. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 539–546 (2005)
24. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
25. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96 (2005)
26. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
27. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)
28. Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Signal Processing Letters 25(7), 926–930 (2018)
29. Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W.: CosFace: Large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274 (2018)
30. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision, pp. 87–102. Springer, Cham (2016)
31. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2019)
32. Huang, G. B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In: Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition (2008)
33. Zheng, T., Deng, W.: Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep. 5 (2018)
34. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.: AgeDB: The first manually collected, in-the-wild age database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 51–59 (2017)
35. Kemelmacher-Shlizerman, I., Seitz, S. M., Miller, D., Brossard, E.: The MegaFace benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882 (2016)
36. Ng, H. W., Winkler, S.: A data-driven approach to cleaning large face datasets. In: IEEE International Conference on Image Processing, pp. 343–347 (2014)
37. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)