<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pairwise Ranking Distillation for Deep Face Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikhail Nikitin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadim Konushin</string-name>
          <email>vadimg@tevian.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anton Konushin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>M.V. Lomonosov Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Video Analysis Technologies LLC</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work addresses the problem of knowledge distillation for the deep face recognition task. Knowledge distillation is known to be an effective model compression technique, in which knowledge is transferred from a high-capacity teacher to a lightweight student. The knowledge and the way it is distilled can be defined in different ways depending on the problem where the technique is applied. Considering the fact that face recognition is a typical metric learning task, we propose to perform knowledge distillation at the score level. Specifically, for any pair of matching scores computed by the teacher, our method forces the student to preserve the order of the corresponding matching scores. We evaluate the proposed pairwise ranking distillation (PWR) approach on several face recognition benchmarks for both face verification and face identification scenarios. Experimental results show that PWR not only improves over the baseline method by a large margin, but also outperforms other score-level distillation approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Distillation</kwd>
        <kwd>Model Compression</kwd>
        <kwd>Face Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Metric Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Face recognition systems are widely used today, and their quality keeps improving in
order to meet increasing security requirements. Nowadays, the majority of computer
vision tasks, including face recognition, are solved with the help of deep neural
networks, and there is a clear pattern: given a fixed training dataset, a
network with more layers and parameters outperforms its lightweight version. As a
result, the most powerful models use a large amount of memory and computational
resources, and therefore their deployment is quite challenging. Indeed, switching to a
model of higher capacity usually reduces inference speed, which is
very important in some real-life scenarios. For example, if the model is supposed to run
on a resource-limited embedded device or to be used in a video surveillance system with
thousands of queries per second, it is often necessary to replace a large network with
a smaller one in order to satisfy the limitations of the available computational
resources. This creates a strong demand for methods that reduce model complexity while
trying to preserve performance as much as possible.</p>
      <p>
        In general, there are two main strategies to reduce deep neural network complexity:
one is to develop a new lightweight architecture [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
        ], and the other is to
compress an already trained model. Network compression can be done in many different ways,
including parameter quantization [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], weight pruning [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], low-rank
factorization [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], and knowledge distillation. All these compression methods, except for
knowledge distillation, focus on reducing the model size in terms of parameters while
keeping the network architecture roughly the same. On the contrary, knowledge distillation, the
main idea of which is to transfer the knowledge encoded in one network to another, is
considered a more general approach, since it does not impose any restrictions on the
architecture of the output network.
      </p>
      <p>
        In this paper, we propose a new knowledge distillation technique for
efficient computation of face recognition embeddings. Our method utilizes the idea of
the pairwise learning-to-rank approach and applies it on top of the matching scores between
face embeddings. Specifically, we consider the score ranking produced by a teacher
network as a ground truth label, and use it to detect and penalize mistakes in the pairwise
ranking of the student's matching scores. Using LFW [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], CPLFW [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], AgeDB [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ],
and MegaFace [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] datasets, we show that the proposed distillation method can
significantly improve face recognition quality compared to the conventional way of training
the student network. Moreover, we found that our pairwise ranking distillation
technique outperforms other score-based distillation approaches by a large margin.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] a dichotomy of distillation approaches was proposed. It is based on
how the knowledge is defined, and the authors distinguish between individual and relational
knowledge distillation methods.
      </p>
      <sec id="sec-2-1">
        <title>Individual knowledge distillation</title>
        <p>Individual knowledge distillation (IKD) methods consider each input object
independently and force the student network to mimic the teacher's representation of that object. Let
F_T(x) and F_S(x) denote the feature representations of the teacher and the student for input
x, respectively. Then, for a training dataset \chi = \{x_i\}_{i=1}^{M}, the IKD objective function can
be formulated as follows:</p>
        <p>L_{IKD} = \sum_{x_i \in \chi} l(F_T(x_i), F_S(x_i)),    (1)
where l is some loss function that penalizes the difference between the teacher and the
student. The knowledge in IKD methods is determined by the function F(x), which can
be defined in different ways; a minimal sketch of objective (1) is given below, followed by some concrete examples.</p>
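        <p>As an illustration, here is a minimal PyTorch-style sketch of objective (1) with an L2 penalty between embeddings (our experiments used MXNet; the function and tensor names are illustrative):</p>
        <preformat>
import torch

def ikd_loss(teacher_emb, student_emb):
    """IKD objective (1) with l(t, s) = ||t - s||^2: the student is pushed
    to reproduce the teacher's embedding of every input independently.
    Both arguments are (N, d) tensors of F_T(x_i) and F_S(x_i)."""
    return ((teacher_emb - student_emb) ** 2).sum(dim=1).mean()
        </preformat>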
        <p>
          Authors of [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] describe the knowledge in terms of the label distribution, so
that the student uses the output of the teacher's classifier as a ground-truth soft label vector. The
motivation of such an approach lies in the observation that an input image sometimes contains
several objects and can be better described using a mixture of labels. Another
approach was presented in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], where the authors propose to use hint connections, which go
from the teacher to the student and transfer hidden-layer activations. Depending on the depth of the
network and the spatial resolution of the features where such distillation is applied, it makes the
student mimic the teacher at different levels of abstraction. However, over-regularization
of hidden layers can lead to poor quality, so usually hints are only used for the embedding
(pre-classification) layer [
          <xref ref-type="bibr" rid="ref16 ref21">16, 21</xref>
          ]. In order to successfully guide the student even at the initial
layers, a modification of the hints idea was proposed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Activation transfer is replaced there with the transfer of spatial attention maps, i.e. instead of trying to
reproduce the teacher's feature representation as-is, the student only learns to attend to the same areas
of the input image.
        </p>
        <p>Individual knowledge distillation methods utilize the clear idea of imitating the teacher's
output. However, due to the gap in model capacity between teacher and student, it may
be difficult for the student to learn a mapping function that is similar or even identical
to the teacher's. The relational knowledge distillation approach addresses this problem
and considers knowledge from another point of view.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Relational knowledge distillation</title>
        <p>Relational knowledge distillation (RKD) methods define the knowledge using a group
of objects rather than a single object. Each group of objects forms a structure in the
representational space, which can be used as a unit of knowledge. In other words, the student in
RKD methods learns to reproduce the structure of the teacher's latent space, instead of the precise
feature representations of objects. To describe the relative structure of n input examples,
a relational function \psi, which maps an n-tuple of embeddings to a scalar value, is used.
Putting t_i = F_T(x_i) and s_i = F_S(x_i), the objective function for RKD is defined as</p>
        <p>L_{RKD} = \sum_{(x_1, x_2, \ldots, x_n) \in \chi^n} l(\psi(t_1, t_2, \ldots, t_n), \psi(s_1, s_2, \ldots, s_n)).    (2)</p>
        <p>
          According to the above equation, the choice of the relational function \psi defines a
certain RKD method. The easiest and most obvious approach considers pairs of objects
and encodes the space structure in terms of the Euclidean distance between two feature
embeddings. Such an approach, with minor modifications, is used in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. A similar idea was recently adopted in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], where the authors use the correlation between the teacher's and the
student's outputs as the pairwise relational function. A triplet-based RKD approach was
proposed in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Three points in the representational space form an angle, and its value
can be used to describe the structure of the triplet. Another approach, which can also be
considered relational knowledge distillation, although it does not precisely follow the
RKD loss equation (2), was presented in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Its main idea is to reformulate the
knowledge distillation problem as a list-wise learning-to-rank problem, where the teacher's list
of matching scores is used as the ranking to be learned by the student. A simplified sketch of a
distance-based relational loss is given below.
        </p>
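        <p>For illustration, a minimal sketch of a distance-based relational function and the resulting RKD-D-style loss, assuming (N, d) embedding tensors; the distance normalization and the Huber penalty used in [14] are omitted, so this is a simplified sketch rather than the exact published method:</p>
        <preformat>
import torch

def pairwise_distances(emb):
    """psi for pairs: Euclidean distance between every two embeddings."""
    return torch.cdist(emb, emb, p=2)

def rkd_distance_loss(teacher_emb, student_emb):
    """Simplified RKD-D: match the student's pairwise-distance structure
    to the teacher's with a plain L2 penalty."""
    return ((pairwise_distances(teacher_emb)
             - pairwise_distances(student_emb)) ** 2).mean()
        </preformat>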
      </sec>
      <sec id="sec-2-3">
        <title>Knowledge distillation for Face Recognition</title>
        <p>During the first several years of the development of knowledge distillation methods,
experiments were carried out mostly on small classification problems. That is why the
application of such techniques to the face recognition problem has not been fully
investigated yet, and only a few studies have been published in this area.</p>
        <p>
          Some recent works [
          <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
          ] follow the idea of hint connections and impose
constraints on the discrepancy between the teacher's and the student's embeddings. But in order
to better fit the angular nature of the conventional losses used to train face recognition
networks [
          <xref ref-type="bibr" rid="ref28 ref29 ref31">28, 29, 31</xref>
          ], the authors put a penalty on cosine similarity instead of Euclidean
distance. A more specific approach, oriented towards metric learning tasks,
was proposed in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. This approach utilizes the idea that a high-capacity teacher network
can better understand subtle differences between images, and uses this observation to
adaptively choose the margin value in the triplet loss function. In [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] the authors study
knowledge distillation techniques in the context of fully convolutional networks (FCNs). They
notice that inference efficiency can be boosted not only by lowering model
complexity, but also by decreasing the size of the input image. Following this idea, the
authors propose to keep the same FCN architecture and train the student on a
downsampled version of the original dataset, with distillation guidance from the teacher's
embeddings computed on the high-resolution input.
        </p>
        <p>
          As can be seen, the majority of existing distillation methods for the face recognition
problem utilize the IKD approach, while the effect of RKD has not yet been investigated. In this
paper, we propose a new relational knowledge distillation technique for face
recognition. Our method is inspired by the works [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and its main idea is to relax
objective function (2) so that the loss is computed only for those pairs of relational
function values which violate the teacher's ranking.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Pairwise ranking distillation</title>
      <p>Facial recognition systems usually include a gallery of target face images as a component,
and each incoming image is compared against it. The gallery image with the maximum matching
score is then considered a candidate for the correct match. This leads to the idea
that only the relative positioning of matching scores is important, rather than their absolute
values. In this paper we propose an approach that adapts pairwise ranking techniques
to the knowledge distillation problem. More specifically, our method considers pairs of
relational function values, and its goal is to minimize the number of their inversions.</p>
      <p>Let X_T = \{t_i\}_{i=1}^{N} and X_S = \{s_i\}_{i=1}^{N} be the feature representations computed
by the teacher and student networks for the input batch X = \{x_i\}_{i=1}^{N}, respectively. For both
teacher and student we compute the values Y_T = \{y_i^T\}_{i=1}^{M} and Y_S = \{y_i^S\}_{i=1}^{M} of the
relational function \psi for all possible input n-tuples of feature embeddings. Then the
pairwise ranking (PWR) distillation loss is given by:</p>
      <p>L_{PWR}(X_S, X_T) = \sum_{i,j} 1[y_i^T &gt; y_j^T] l_{inv}(y_i^S, y_j^S),    (3)
where l_{inv} is the function that penalizes pairwise ranking inversions.</p>
      <p>As can be seen from the above equation, pairwise ranking knowledge distillation is
fully defined by the relational function \psi and the inversion loss function l_{inv}; a minimal sketch of loss (3) is given below.</p>
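      <p>The following illustrative PyTorch-style sketch assumes 1-D tensors of precomputed relational-function values for the teacher and the student, and a pluggable inversion loss (the names are ours, not from the paper's code):</p>
      <preformat>
import torch

def pwr_loss(y_teacher, y_student, l_inv):
    """Pairwise ranking distillation loss (3): for every pair (i, j) that
    the teacher ranks with y_i above y_j, penalize the student's ordering
    of the corresponding values with the inversion loss l_inv."""
    yt_i, yt_j = y_teacher.unsqueeze(1), y_teacher.unsqueeze(0)
    ys_i, ys_j = y_student.unsqueeze(1), y_student.unsqueeze(0)
    mask = torch.gt(yt_i, yt_j).float()   # indicator from equation (3)
    return (mask * l_inv(ys_i, ys_j)).sum()
      </preformat>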
      <sec id="sec-3-1">
        <title>Relational function</title>
        <p>
In this work, we fix the relational function \psi to be a function of two inputs, chosen so
that the value \psi(x, y) characterizes the similarity between objects x and y. To be
precise, we examined Euclidean distance and cosine similarity as the relational function,
and found that cosine similarity performs slightly better. (This could be explained by the fact
that we use an angular margin loss function as the base loss for training our face recognition
models; however, the other RKD methods we compare with do not gain any advantage from the
cosine similarity relational function.) It is worth noting that one
can choose any function which describes the relationship of a set of points in the embedding
space. For example, the RKD-A [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] function, which measures the angle formed by
three objects, is also a valid choice.
        </p>
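      <p>A minimal sketch of the cosine similarity relational function computed for all pairs of embeddings in a batch (illustrative code):</p>
      <preformat>
import torch.nn.functional as F

def cosine_similarity_scores(emb):
    """psi(x, y) = cosine similarity; emb is an (N, d) batch of embeddings,
    the result is an (N, N) matrix of matching scores."""
    emb = F.normalize(emb, dim=1)   # unit-normalize each embedding
    return emb @ emb.t()
      </preformat>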
      </sec>
      <sec id="sec-3-2">
        <title>Pairwise inversion loss function</title>
        <p>Difference loss. The most obvious way to keep the desired ranking of a pair of items
is to penalize it as soon as the correct order is violated. For a pair of scalar values
(x, y) with ground truth ranking x &gt; y, the wrong order can be detected by analyzing the
difference of the elements: if y - x is greater than zero, the elements are misordered. Based
on this observation, we propose the difference loss as the simplest option of a pairwise
inversion loss function:
l_{inv}(y_i^S, y_j^S) = \max(y_j^S - y_i^S, 0).    (4)</p>
        <p>In order to make the difference loss more flexible, we add non-linearity in the area
of values where misranking happens (y_j^S &gt; y_i^S). This lets us change the behaviour of the
loss function and choose whether to put more attention on small or on big mistakes. One
easy way to add non-linearity to some function is to exponentiate it. This idea results in the
power difference loss:
l_{inv}(y_i^S, y_j^S) = \max(y_j^S - y_i^S, 0)^p.    (5)</p>
        <p>Setting p &gt; 1 lowers the penalty for marginal mistakes and increases the penalty for large
ones, while setting p &lt; 1 results in the opposite behaviour (see Figure 1). Note
that the vanilla difference loss (4) is a special case of the power difference loss (p = 1.0).</p>
        <p>Another option to make the difference loss non-linear is to put it inside the exponential
function. We define the exponential difference loss as:
l_{inv}(y_i^S, y_j^S) = \max(\exp[\beta(y_j^S - y_i^S)] - 1, 0).    (6)
It is similar to the power difference loss with p &gt; 1, but its \beta parameter can be chosen so
that the loss curve is more flat (see Figure 2).</p>
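        <p>A sketch of the three inversion losses (4)-(6); ys_i and ys_j are broadcastable tensors of student relational values, and p and beta correspond to the parameters in the text (the default values are illustrative):</p>
        <preformat>
import torch

def diff_loss(ys_i, ys_j):
    """Vanilla difference loss (4)."""
    return torch.clamp(ys_j - ys_i, min=0.0)

def power_diff_loss(ys_i, ys_j, p=2.0):
    """Power difference loss (5): p above 1 emphasizes large mistakes."""
    return torch.clamp(ys_j - ys_i, min=0.0) ** p

def exp_diff_loss(ys_i, ys_j, beta=1.0):
    """Exponential difference loss (6)."""
    return torch.clamp(torch.exp(beta * (ys_j - ys_i)) - 1.0, min=0.0)
        </preformat>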
        <p>
          Margin. The next modification to the difference loss we propose is to use a margin term,
which is quite common in metric learning tasks [
          <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
          ]. Introducing a positive margin \alpha not only makes the student learn the same ranking
for pairs of objects as the teacher has, but also forces the difference between the relational
values to be no less than the margin, i.e. l_{inv}(y_i^S, y_j^S) = \max(y_j^S - y_i^S + \alpha, 0). Such a
modification can be applied to any of the losses discussed above, but for simplicity we
consider only the case of the vanilla difference loss (4).
        </p>
        <p>The most straightforward approach is to manually choose the margin value and use
it throughout the whole training process:
\alpha = Const.    (7)</p>
        <p>Another option we investigated is to choose the margin adaptively, based on the values of the
teacher's relational function computed for the current batch, namely their standard deviation:
\alpha_X = std(Y_T).    (8)</p>
        <p>One more option to choose the margin we investigated is also adaptive, but now it is
selected individually for each pair of objects. It is also based on the values of the teacher's
relational function, and computed as their difference:
\alpha_{ij} = y_i^T - y_j^T.    (9)</p>
        <p>
          The idea behind this approach is the following: the student learns to preserve the order of
objects, while keeping the distance between them at least as large as the teacher's. From
some perspectives, it is similar to the RKD-D approach [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], but now we optimize a
lower bound on the teacher-student difference, instead of forcing the student to completely
replicate the teacher's output. A sketch of the margin options is given below.
        </p>
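        <p>A sketch of the three margin options applied to the difference loss, following the naming used in the experiments (constant, teacher-std, teacher-diff); the code is illustrative:</p>
        <preformat>
import torch

def diff_loss_const(ys_i, ys_j, alpha=0.1):
    """Difference loss with a constant margin (7)."""
    return torch.clamp(ys_j - ys_i + alpha, min=0.0)

def diff_loss_teacher_std(ys_i, ys_j, y_teacher):
    """teacher-std (8): the margin is the standard deviation of the
    teacher's relational values over the batch."""
    return torch.clamp(ys_j - ys_i + y_teacher.std(), min=0.0)

def diff_loss_teacher_diff(ys_i, ys_j, yt_i, yt_j):
    """teacher-diff (9): per-pair margin alpha_ij = y_i^T - y_j^T, so the
    student keeps at least the teacher's separation between the scores."""
    return torch.clamp(ys_j - ys_i + (yt_i - yt_j), min=0.0)
        </preformat>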
        <p>
          RankNet for knowledge distillation. RankNet [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] is a classical learning-to-rank
approach. It formulates ranking as a pairwise classification problem, where each pair is
considered independently, and the goal of the method is to minimize the number of
inversions. That perfectly fits our formulation of pairwise ranking distillation, so we
adapt RankNet to solve it. For each pair of objects, RankNet defines the probability of
correct ranking and uses cross-entropy as the loss function:
        </p>
        <p>P(y_i^S &gt; y_j^S) = \frac{1}{1 + \exp(-\beta(y_i^S - y_j^S))},    (10)
l_{inv}(y_i^S, y_j^S) = -\log P(y_i^S &gt; y_j^S) = \log(1 + \exp(-\beta(y_i^S - y_j^S))).    (11)</p>
        <p>As can be seen from Figure 4, the RankNet loss function looks like a smooth version of the
difference loss with margin. The parameter \beta controls how sharp the probability function is, and
increasing it results in paying more attention to the area of values which corresponds
to ranking mistakes.</p>
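        <p>A minimal sketch of the RankNet inversion loss (11), computed in a numerically stable way via softplus (illustrative code):</p>
        <preformat>
import torch.nn.functional as F

def ranknet_loss(ys_i, ys_j, beta=1.0):
    """RankNet loss (11): cross-entropy of the pairwise probability (10),
    i.e. log(1 + exp(-beta * (ys_i - ys_j))),
    which equals softplus(-beta * (ys_i - ys_j))."""
    return F.softplus(-beta * (ys_i - ys_j))
        </preformat>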
        <p>We evaluate the proposed PWR distillation approach on the face recognition task.
Throughout this section, we refer to PWR with the vanilla difference loss (4) as PWR-Diff, PWR
with the exponential difference loss (6) as PWR-Exp, and PWR based on RankNet (11) as
PWR-RankNet. If a margin is used, information about it is specified in parentheses. For
example, pairwise ranking distillation based on the exponential difference loss with an
adaptive margin computed for each pair of objects would be named PWR-Exp (teacher-diff).</p>
        <p>
          To demonstrate the robustness of the proposed approach, we compare it with other
relational knowledge distillation methods. Namely, we consider DarkRank [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and both
RKD [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] approaches: distance-based (RKD-D) and angle-based (RKD-A). Note that
knowledge distillation based on the equality of corresponding matching scores between the
teacher and the student was also investigated in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], but for the sake of simplicity we refer
to this approach as RKD-D in this section. Regarding the DarkRank method, it was
noticed in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] that the soft version of DarkRank has numerical stability issues, which lead
to severe limitations on the batch size that can be used during training. At the same time,
the authors report that DarkRank-hard demonstrates similar results on a range of metric
learning problems, while being easily computable for any batch size. That is why in
our experiments we use the hard version of the DarkRank method.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Datasets</title>
        <p>
          MS-Celeb-1M [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] is used to train all our models. Originally, it contains 10 million
face images of nearly 100,000 identities. However, since the dataset was
collected in a semi-automatic manner, a significant portion of it includes noisy images or
incorrect identity labels. That is why we use the cleaned version of MS-Celeb-1M provided by
[
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. It consists of 5.8 million photos of 85,000 subjects.
        </p>
        <p>
          We evaluate trained models on LFW [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], CPLFW [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], AgeDB [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], and MegaFace
[
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. The first three datasets employ the face verification scenario, while MegaFace
also provides an evaluation protocol for face identification.
        </p>
        <p>Labeled Faces in the Wild (LFW) consists of 13,233 in-the-wild face images of 5,749
identities. Besides the images, a list of 6,000 matching pairs (3,000 positive and 3,000
negative) is provided, together with their 10-fold split for cross-validation.
Cross-Pose LFW (CPLFW) uses an evaluation protocol similar to LFW, with the same
total number of comparisons. However, its matching pairs are much more difficult: faces
in positive pairs show substantial pose variations, while negative pairs are constructed
using identities of the same race and gender.</p>
        <p>The AgeDB dataset contains 16,488 face images of 568 subjects and also adopts the 10-fold
cross-validation protocol. This dataset was developed for age-invariant face verification,
so all photos have not only identity but also age labels. In our experiments we follow the
AgeDB-30 protocol, where faces in matching pairs have an age difference of 30 years.
Besides the age factor, other facial variations (i.e. pose, illumination, expression) are also
included.</p>
        <p>
          MegaFace is the most challenging benchmark in the area to date. It evaluates
face recognition algorithms in the presence of large-scale distractors. The gallery set of MegaFace
includes 1 million images of 690,000 identities, while the probe set consists of 100,000
photos of 530 unique identities from the FaceScrub [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] dataset. Results for both face
identification and face verification are reported.
        </p>
        <p>
          All faces are aligned by five facial landmarks detected using MTCNN [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] and then
cropped to the size of 112 × 112 pixels.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Experimental setup</title>
        <p>
          In all experiments we use ResNet18 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] as a student model, and ResNet50 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] as a
teacher model. To obtain face embeddings we append a fully-connected layer on the top
of the last convolutional layer. Both teacher and student models have embedding size
of 512.
        </p>
        <p>
          We conduct our experiments using MXNet [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] deep learning framework on a
machine with 6 NVIDIA GeForce GTX 1080 Ti GPUs. The batch size is fixed to 552 (92 × 6)
for both the reference models and the student models during knowledge distillation. The stochastic
gradient descent (SGD) optimizer is used in all experiments. The learning rate is initially
set to 0.1, and during training it is divided by 10 every 2 epochs. The total number of
epochs is 13. Baseline models are trained from scratch, while student models in all
distillation experiments are initialized with the pretrained weights of the baseline model.</p>
        <p>
          The teacher model and the baseline student model are trained using the CosFace [
          <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
          ] loss.
CosFace is an angular margin classification loss, which is widely used in face
recognition and other metric learning problems. We compared it with ArcFace [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], another
popular angular loss function, and found that CosFace provides slightly better baseline
results. Following [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], we set its parameters to margin = 0.35 and scale = 64.0.
        </p>
        <p>
          We found that the investigated distillation losses have different convergence abilities
for the face recognition task. Specifically, some of them can be used alone to
successfully train the student network, while others demonstrate sufficient performance only when
combined with a base classification loss. In addition, we examined whether student
performance can be further boosted with the help of HKD (Hinton's Knowledge
Distillation) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] loss. As a result, the overall objective function is defined as
        </p>
      <p>L = \alpha L_{KD} + \beta L_{CosFace} + \gamma L_{HKD},    (12)
where L_{KD} stands for the relational knowledge distillation loss (RKD-D, RKD-A,
DarkRank, PWR), and \alpha, \beta and \gamma are the coefficients of the corresponding loss terms.</p>
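      <p>A sketch of the overall objective (12) as a weighted sum of precomputed loss terms (illustrative code; the weights follow the description below):</p>
      <preformat>
def total_loss(loss_kd, loss_cosface, loss_hkd, alpha, beta, gamma):
    """Overall objective (12): relational distillation term L_KD plus the
    CosFace and HKD terms, weighted by alpha, beta and gamma."""
    return alpha * loss_kd + beta * loss_cosface + gamma * loss_hkd
      </preformat>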
      <p>When the CosFace loss is used to stabilize the distillation training process, its weight \beta
is always set to 1.0, and the distillation weight \alpha is chosen depending on the type of L_{KD}
used. If HKD is used, its softmax temperature is set to 4.0, and the combination
with CosFace is done with \beta = 0.7 and \gamma = 0.3. The weight \alpha of the relational knowledge
distillation term was chosen empirically, following the recommendations of the original
papers. Namely, we set \alpha = 100 for RKD-D, \alpha = 200 for RKD-A, and \alpha = 1 for
DarkRank. Concerning our PWR distillation losses, we found \alpha = 100 to be a good
option for PWR-Diff and PWR-Exp, while for PWR-RankNet it should be smaller (we
use \alpha = 15).</p>
      </sec>
      <sec id="sec-3-5">
        <title>Evaluation results</title>
        <p>We follow the standard protocols for all testing datasets. For LFW, CPLFW and
AgeDB-30, verification accuracy estimated with 10-fold cross-validation is reported. Evaluation
on the MegaFace dataset includes two protocols, verification and identification. We report
TPR@FPR = 1e-6 for verification, and rank-1 and rank-10 accuracy for
identification. Evaluation results are presented in Table 1.</p>
        <p>In our experiments we found that the RKD-D, RKD-A and DarkRank methods fail
to achieve even the baseline quality when used alone. That is why for these methods we
only report results for experiments where they are trained together with the base classification
loss. On the contrary, the proposed PWR approach demonstrates a quality improvement when
used alone, while adding the CosFace and HKD loss terms slightly degrades recognition
quality. Therefore, the effect of the inversion loss function used in PWR distillation was
explored only in this setting.</p>
        <p>As can be seen from Table 1, most of the methods demonstrate an increase in student
accuracy on the LFW and AgeDB datasets; however, its magnitude differs, especially
on AgeDB, where the proposed PWR approach beats all other distillation methods by a
large margin. At the same time, only one distillation method, PWR-Exp (teacher-diff),
managed to boost accuracy on the CPLFW dataset. This can possibly be explained by the fact
that CPLFW contains images with large pose variations, while faces in the training
dataset are mostly frontal, and even the teacher's accuracy is relatively low on CPLFW.</p>
        <p>Considering the MegaFace benchmark results, it is clear that the RKD-D and DarkRank
methods cannot provide any recognition quality improvement, even when used
together with auxiliary losses. Among the compared methods besides PWR, only the
combination of RKD-A with HKD provides better results than the baseline model. At the
same time, all investigated pairwise ranking distillation approaches substantially
improve MegaFace recognition quality.</p>
        <p>The evaluation results demonstrate that the proposed family of PWR distillation
techniques provides methods which outperform other relational knowledge distillation
approaches on the face recognition task. However, some modifications of PWR are better
than others. For example, considering the loss function non-linearity, one can see that in
most cases PWR-Exp shows slightly better results than PWR-Diff. As for the margin
value, experiments where it was fixed (PWR-Diff (0.1), PWR-Exp (0.1), PWR-RankNet)
perform worse than those where an adaptive margin was used. As a result, we can
conclude that the best relational distillation method for face recognition at the moment is
PWR-Exp with an adaptively chosen margin (teacher-std or teacher-diff).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We propose a new relational knowledge distillation technique for deep face recognition,
which is based on the pairwise ranking of matching scores. During the training of a student
network, our PWR approach considers pairs of relational function values and fixes those
where the values are ordered incorrectly compared to the teacher's ranking.
Experiments show that the proposed method significantly outperforms other relational
distillation approaches on a range of face recognition benchmarks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Iandola</surname>
            ,
            <given-names>F. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moskewicz</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashraf</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dally</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keutzer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and &lt; 0.5 MB model size</article-title>
          .
          <source>arXiv preprint arXiv:1602.07360</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalenichenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weyand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andreetto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adam</surname>
          </string-name>
          , H.:
          <article-title>MobileNets: Efficient convolutional neural networks for mobile vision applications</article-title>
          .
          <source>arXiv preprint arXiv:1704.04861</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>ShuffleNet: An extremely efficient convolutional neural network for mobile devices</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>6848</fpage>
          -
          <lpage>6856</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Han,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Dally</surname>
          </string-name>
          , W. J.:
          <article-title>Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding</article-title>
          .
          <source>arXiv preprint arXiv:1510.00149</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hubara</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courbariaux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soudry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Yaniv</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Quantized neural networks: Training neural networks with low precision weights and activations</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>6869</fpage>
          -
          <lpage>6898</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Han,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Pool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            , and
            <surname>Dally</surname>
          </string-name>
          , W.:
          <article-title>Learning both weights and connections for efficient neural network</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1143</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Molchanov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tyree</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karras</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aila</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kautz</surname>
          </string-name>
          , J.:
          <article-title>Pruning convolutional neural networks for resource efficient inference</article-title>
          .
          <source>arXiv preprint arXiv:1611.06440</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Denton</surname>
            ,
            <given-names>E. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruna</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , LeCun,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Fergus</surname>
          </string-name>
          , R.:
          <article-title>Exploiting linear structure within convolutional networks for efficient evaluation</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          , pp.
          <fpage>1269</fpage>
          -
          <lpage>1277</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jaderberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vedaldi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Speeding up convolutional neural networks with low rank expansions</article-title>
          .
          <source>arXiv preprint arXiv:1405.3866</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caruana</surname>
          </string-name>
          , R.:
          <article-title>Do deep nets really need to be deep?</article-title>
          <source>In: Advances in Neural Information Processing Systems</source>
          , pp.
          <fpage>2654</fpage>
          -
          <lpage>2662</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distilling the knowledge in a neural network</article-title>
          .
          <source>arXiv preprint arXiv:1503.02531</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Romero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kahou</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chassang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatta</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Fitnets: Hints for thin deep nets</article-title>
          .
          <source>arXiv preprint arXiv:1412.6550</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zagoruyko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Komodakis</surname>
          </string-name>
          , N.:
          <article-title>Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer</article-title>
          .
          <source>arXiv preprint arXiv:1612.03928</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Relational knowledge distillation</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>3967</fpage>
          -
          <lpage>3976</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DarkRank: Accelerating deep metric learning via cross sample similarities transfer</article-title>
          .
          <source>In: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yazici</surname>
            ,
            <given-names>V. O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weijer</surname>
          </string-name>
          , J. V. D., Cheng, Y.,
          <string-name>
            <surname>Ramisa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning Metrics from Teachers: Compact Networks for Image Embedding</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2907</fpage>
          -
          <lpage>2916</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Karlekar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>Z. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pranata</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Deep Face Recognition Model Compression via Knowledge Transfer and Distillation</article-title>
          . arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>00619</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Correlation congruence for knowledge distillation</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          , pp.
          <fpage>5007</fpage>
          -
          <lpage>5016</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>D. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Triplet distillation for deep face recognition</article-title>
          . arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>04457</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajime</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narishige</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uchida</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsunami</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improved Knowledge Distillation for Training Fast Low Resolution Face Recognition Model</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          Workshops (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>VarGFaceNet: An efficient variable group convolutional neural network for lightweight face recognition</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          Workshops (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Duong</surname>
            ,
            <given-names>C. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quach</surname>
            ,
            <given-names>K. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
          </string-name>
          , N.:
          <article-title>ShrinkTeaNet: Million-scale lightweight face recognition via shrinking teacher-student networks</article-title>
          .
          <source>arXiv preprint arXiv:1905</source>
          .
          <volume>10620</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , LeCun, Y.:
          <article-title>Learning a similarity metric discriminatively, with application to face verification</article-title>
          .
          <source>In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>546</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Schroff</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalenichenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philbin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>FaceNet: A unified embedding for face recognition and clustering</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaked</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renshaw</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deeds</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamilton</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hullender</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Learning to rank using gradient descent</article-title>
          .
          <source>In: Proceedings of the 22nd International Conference on Machine Learning</source>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems</article-title>
          .
          <source>arXiv preprint arXiv:1512.01274</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Additive margin softmax for face verification</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          <volume>25</volume>
          (
          <issue>7</issue>
          ),
          <fpage>926</fpage>
          -
          <lpage>930</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>CosFace: Large margin cosine loss for deep face recognition</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>5265</fpage>
          -
          <lpage>5274</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>MS-Celeb-1M: A dataset and benchmark for large-scale face recognition</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>102</lpage>
          . Springer, Cham (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zafeiriou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>ArcFace: Additive angular margin loss for deep face recognition</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>4690</fpage>
          -
          <lpage>4699</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mattar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Learned-Miller</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Labeled faces in the wild: A database for studying face recognition in unconstrained environments</article-title>
          .
          <source>In: Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Cross-pose LFW: A database for studying cross-pose face recognition in unconstrained environments</article-title>
          .
          <source>Beijing University of Posts and Telecommunications, Tech. Rep. 18-01</source>
          , vol.
          <volume>5</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Moschoglou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papaioannou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagonas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotsia</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zafeiriou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>AgeDB: the first manually collected, in-the-wild age database</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>59</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Kemelmacher-Shlizerman</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seitz</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brossard</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The MegaFace benchmark: 1 million faces for recognition at scale</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>4873</fpage>
          -
          <lpage>4882</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>H. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A data-driven approach to cleaning large face datasets</article-title>
          .
          <source>In: IEEE International Conference on Image Processing</source>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>347</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Joint face detection and alignment using multitask cascaded convolutional networks</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          <volume>23</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1499</fpage>
          -
          <lpage>1503</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>