Interpretable and Robust Face Verification

Preetam Prabhu Srikar Dammu*1, Srinivasa Rao Chalamala*2, Ajeet Kumar Singh1 and Yegnanarayana Bayya2
1 TCS Research, Tata Consultancy Services Ltd., India
2 International Institute of Information Technology, Hyderabad, India

Abstract
Advances in deep learning have been instrumental in enhancing the performance of face verification systems. Despite their ability to attain high accuracy, most of these systems fail to provide interpretations of their decisions. With the increased demand for making deep learning models more interpretable, numerous post-hoc methods have been proposed to probe the workings of these systems. Yet, the quest for face verification systems that inherently provide interpretations remains largely unexplored. Additionally, most existing face recognition models are highly susceptible to adversarial attacks. In this work, we propose a face verification system that addresses the issue of interpretability by employing modular neural networks, in which representations for individual facial parts such as the nose, mouth, and eyes are learned separately. We also show that our method is significantly more resistant to adversarial attacks, thereby addressing another crucial weakness of deep learning models.

Keywords
Face Verification, Interpretability, Adversarial Robustness

1. Introduction

Over the last decade, many deep learning methods for face verification have been proposed, a few of which have even surpassed human performance [1, 2, 3, 4]. These deep learning methods, while enabling exceptional performance, do not provide reasoning for their predictions. Blindly relying on the results of these black boxes without interpreting the reasons for their decisions could be detrimental, especially in critical applications in the medical, financial, and security domains.

In the context of image recognition, various methods have been proposed to tackle interpretability by attempting to explain why an object has been recognized in a particular way. LRP [5], Grad-CAM [6], and LIME [7] have been widely used to highlight the regions of the image that the models look at when arriving at the final prediction. Despite the existence of several post-hoc interpretability methods, it is desirable to have a system that is inherently capable of producing interpretations of its decisions. When the latent features generated by the system represent a logical part of an object, it is convenient to infer the contributions of these features to the final prediction. Though most interpretability methods produce heatmaps highlighting the regions that contribute to the decision process of the models, in some applications it is still difficult to understand these heatmaps because they are generated at the pixel level. If these heatmaps could instead highlight logical visual concepts in the images, they would be more convenient to interpret (please refer to Figure 7 and Section 5.2).

Another significant drawback of deep learning models is their susceptibility to adversarial attacks: seemingly insignificant noise that is imperceptible to the human eye can fool deep learning models. Numerous black-box and white-box adversarial attack methods have been proposed in the literature [8, 9, 10]. The problem of detecting and defending against adversarial attacks on deep learning models is still largely unsolved. As such attacks on face verification systems pose a serious security threat, it is imperative to develop trustworthy systems. Our motivation behind this work is to integrate both robustness to attacks and interpretability into face verification systems.

Hence, in this work, we propose a face verification system that addresses the aforementioned issues by learning independent latent representations of high-level facial features. The proposed method generates intuitive and easily understood heatmaps on the fly, and is also shown to be much more robust against adversarial examples.
*Equal Contribution
3rd International Workshop on Privacy, Security, and Trust in Computational Intelligence (PSTCI2021)
d.preetam@tcs.com (P. P. S. Dammu); srinivas.chalamala@research.iiit.ac.in (S. R. Chalamala); ajeetk.singh1@tcs.com (A. K. Singh); yegna@iiit.ac.in (Y. Bayya)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

Face recognition is a non-invasive biometric authentication mechanism and has been in commercial use for several years. It has become one of the preferred choices of authentication for mobile device users, as it is easy to use and avoids the need to remember passwords. Though people have some reservations about using face recognition in large-scale systems due to privacy concerns, it continues to be one of the most widely used technologies for identification.

Deep learning based face recognition has surpassed hand-crafted feature-based systems and shallow learning systems in performance. In [2], the authors proposed a deep learning architecture called VGGFace for generating facial feature representations, or face embeddings. These face embeddings can then be used to identify a person using a similarity measure or a classifier. DeepID2 [11] uses a Bayesian learning framework for learning metrics for face recognition. In FaceNet [12], the authors proposed a compact embedding learned directly from images using a triplet loss for face verification. Different loss functions that maximize intra-class similarity and improve discriminability for faces have been proposed: ArcFace [13], CosFace [14], SphereFace [15], and CoCo Loss [16].

Existing face recognition models are extremely vulnerable to adversarial attacks even in the black-box setting, which raises security concerns and the need for developing more robust face recognition models. Adversarial attacks [17, 18, 19] add small, imperceptible, and carefully crafted perturbations to the input with the aim of fooling machine learning models. They allow an attacker to evade detection or recognition, or to impersonate another person. [20] described a method to realize adversarial attacks by introducing a pair of eyeglasses; these glasses could be used to evade detection or to impersonate others. Another approach for fooling ArcFace using adversarial patches has been proposed in [21]. In [22], the authors proposed an approach for detecting adversarial attacks on faces.

Understanding and interpreting the decisions of machine learning systems is of high importance in many applications, as it allows verifying the reasoning of the system and provides information to the human expert or end-user. Early works include direct visualization of the filters [23] and deconvolutional networks that reconstruct inputs from different layers [24]. Numerous interpretability methods have been proposed in the literature; some of the widely known ones are Layer-wise Relevance Propagation (LRP) [5], Gradient-weighted Class Activation Mapping (Grad-CAM) [25], Grad-CAM++ [26], SHapley Additive exPlanations (SHAP) values [27], and Local Interpretable Model-Agnostic Explanations (LIME) [7]. Most of these techniques attempt to provide pixel-level explanations indicating the contribution of each pixel to the classification decision. However, these methods are mostly suitable for tasks such as object recognition, where the deep learning models take only a single input image.

Recently, a few methods that attempt to explain the behavior and decisions of face recognition systems have emerged [28, 29, 30, 31, 32]. In [28], the authors rely on controlled degradations using inpainting to generate explanations. In [29], visual psychophysics was used to probe and study the behavior of face recognition systems. In [30], the authors propose a loss function that introduces interpretability to the face verification model through training. In [31], the authors use 3D modeling to visualize and understand how the model represents the information of face images. Fooling techniques [32] have also been used to gain insights into the facial regions that contribute most to the decision.

The recently developed explainability methods for face recognition differ considerably from one another in their approach and in the form of their explanations, unlike saliency methods for object recognition, which generate similar forms of explanations. Each of these methods has its own pros and cons and is suitable for different purposes. We believe our method has certain characteristics that are well suited for real-world applications: easily interpretable feature-level explanations, on-the-fly explanations for every prediction, a structurally interpretable model architecture, real-time feedback, and, most importantly, robustness towards adversarial attacks.

3. Interpretable and Robust Face Verification System

Modular neural networks (MNN) [33] are a class of composite neural networks inspired by the biological modularity of the human brain. MNNs are composed of independent neural networks that serve as modules, each of them specializing in a specific task. MNNs are inherently more interpretable than monolithic neural networks due to their architecture and divide-and-conquer methodology, and their modular structure intrinsically introduces structural interpretability. Studies have shown that MNNs are better at handling noise than monolithic networks [33]. Several defense mechanisms against adversarial attacks have been proposed in the literature, some of which employ deep generative models [34, 35]. One of the main motivations for using generative models is their capability of representing information in a lower-dimensional latent space that retains only the most salient features [36].
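The idea that a constrained low-dimensional latent space retains only the most salient structure of the data can be illustrated with a linear bottleneck. The sketch below uses a rank-k SVD projection as a stand-in for a learned undercomplete encoder; all dimensions and data are synthetic and illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative stand-in for an undercomplete encoder: a rank-k linear
# projection (via SVD) maps d-dimensional inputs to a k < d latent space
# and back. The bottleneck keeps only the directions that explain most
# of the data's variance, discarding the rest.
rng = np.random.default_rng(0)

# Synthetic data: 500 samples in d=32 dimensions, with only 4 dominant directions.
d, k, n = 32, 4, 500
basis = rng.normal(size=(4, d))
data = rng.normal(size=(n, 4)) @ basis + 0.05 * rng.normal(size=(n, d))

mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)

encode = lambda x: (x - mean) @ vt[:k].T   # d -> k "encoder"
decode = lambda z: z @ vt[:k] + mean       # k -> d "decoder"

recon = decode(encode(data))
rel_err = np.linalg.norm(data - recon) / np.linalg.norm(data)
print(f"latent dim {k}/{d}, relative reconstruction error: {rel_err:.3f}")
```

Despite compressing 32 dimensions to 4, the reconstruction error stays small, because the bottleneck preserves the few directions that carry almost all of the signal.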
3.1. Model

3.1.1. Model Composition Overview

In the proposed MNN architecture, we allocate dedicated modules for the eyes, nose, and mouth, and one for the rest of the facial features. We employ autoencoders to learn separate and distinct latent representations for the different facial features. To achieve this, we mask the input image to retain only the region of interest of that specific module and present it as the target image (see Fig. 1). After the autoencoders have been trained, we retain the encoder and substitute the decoder with Siamese networks in all of the modules, resulting in the Modular Siamese Network (MSN) (see Fig. 2).

In the task of face verification, a pair of images is given as input, which could be either a valid pair or an impostor pair. In the proposed MSN architecture, disentangled embeddings of facial features are generated for both input images by the feature-extracting encoders present in each feature-specific module. These feature embedding pairs are then fed to the Siamese networks present in each module, which compute the L1 distance vectors for each of the twin feature embedding pairs, similar to the method followed in [37]. The distance vectors from all of the modules are then concatenated and fed to a common decision network, which makes the final prediction.

Figure 1: Proposed feature-specific latent representation encoding. Images are encoded to feature-specific latent representations using feature-extracting autoencoders. Reconstructions and corresponding target images are displayed on the right.

3.1.2. Feature-extracting Autoencoders

In this work, we employ undercomplete autoencoders [36], a type of autoencoder whose latent dimension is lower than the input dimension. Undercomplete autoencoders are trained to reconstruct the original image as accurately as possible while constricting the latent space to a sufficiently small dimension, ensuring that only the most salient features are retained in the encoded latent vectors. To achieve our task of extracting feature-specific latent vectors, we use a novel technique: instead of giving the full image as the target, we mask the input image to retain only the part containing the feature of interest and produce that as the target image. Consequently, the autoencoder learns a latent representation containing the important information about the feature and restores only the required part of the image (see Fig. 1; examples in Section 3.2).

3.1.3. Siamese Networks

Siamese networks have achieved great results in image verification [37, 38]. The two Siamese twin networks share the same weights and parameters. The hypothesis behind this architecture is that if the inputs x1 and x2 are similar, then the distance between the output vectors h1 and h2 will be small. The network is trained such that it maximizes the distance between mismatched pairs and minimizes the distance between matched pairs. Loss functions like the contrastive loss [39] and the triplet loss [40] can be used to achieve this, and a few improved versions of these loss functions have also been proposed in the literature [41, 42].

In our model, we employ Siamese networks for discriminating between the feature-specific latent vectors of impostor and valid pairs. The latent vectors x1 and x2 are obtained from the feature-extracting autoencoders described in 3.1.2. L1 distance vectors are computed from the output vectors h1 and h2 obtained from the Siamese twins for each module. The distance vectors of all of the modules are then concatenated and given as input to the decision network (see Fig. 2).

Figure 2: Proposed Modular Siamese Network. The image is initially disentangled by feature-specific encoders to obtain feature-wise embedding pairs; these embedding pairs are fed to Siamese networks which compute the distance vectors. All of the distance vectors are then concatenated and fed to the decision network for the final verification decision.

3.1.4. Decision Network

The decision network is a feed-forward fully connected network that takes the concatenated input from all of the modules. This network enables us to incorporate information from all of the modules to predict the final decision.

3.1.5. Model Architectural Details

The model architecture and training settings described in [43] were used for training the feature-extracting autoencoders. The Siamese networks consist of four fully connected layers with ELU activation functions. The final decision network, which takes the concatenated distance vectors from the modules, has two fully connected layers with ReLU activation functions.

Figure 3: Reconstruction of eyes. (a) input image, (b) masked target image, (c) reconstructed image

3.2. Training details

The training of the proposed MSN is carried out in three phases. In the first phase, the feature-extracting autoencoders are trained with the perceptual loss [43]. In the next phase, the decoder in each of the modules is replaced with the Siamese network and trained using the triplet loss, freezing the layers trained in the previous phase. Finally, the decision network is trained using binary cross-entropy (BCE). The Adam optimizer [44] was used in all three training phases.

Figure 4: Reconstruction of nose. (a) input image, (b) masked target image, (c) reconstructed image
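The forward pass described in Sections 3.1.1–3.1.4 (per-feature encoders, weight-sharing Siamese twins, L1 distance vectors, and a common decision network) can be sketched as follows. This is a minimal illustration with randomly initialized weights and made-up dimensions, not the trained architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
MODULES = ["eyes", "nose", "mouth", "rest"]
LATENT, HIDDEN = 64, 32  # illustrative dimensions, not the paper's

def twin_forward(w, x):
    """Two-layer net with ELU, standing in for a module's Siamese twin."""
    h = w[0] @ x
    h = np.where(h > 0, h, np.exp(h) - 1)  # ELU activation
    return w[1] @ h

# One shared-weight Siamese twin per module (the same weights process both
# inputs), plus a stand-in linear "encoder" mapping an image to a latent vector.
twin_w = {m: (rng.normal(size=(HIDDEN, LATENT)) * 0.1,
              rng.normal(size=(HIDDEN, HIDDEN)) * 0.1) for m in MODULES}
encoders = {m: rng.normal(size=(LATENT, 128)) * 0.1 for m in MODULES}
decision_w = rng.normal(size=len(MODULES) * HIDDEN) * 0.1

def verify(img_a, img_b, threshold=0.5):
    distance_vectors = []
    for m in MODULES:
        x1, x2 = encoders[m] @ img_a, encoders[m] @ img_b   # feature latents
        h1 = twin_forward(twin_w[m], x1)                    # same weights
        h2 = twin_forward(twin_w[m], x2)                    # for both inputs
        distance_vectors.append(np.abs(h1 - h2))            # L1 distance vector
    joint = np.concatenate(distance_vectors)                # fed to decision net
    score = 1.0 / (1.0 + np.exp(-(decision_w @ joint)))     # sigmoid output
    return score > threshold, score

img_a, img_b = rng.normal(size=128), rng.normal(size=128)
same, score = verify(img_a, img_b)
print(f"match={same}, score={score:.3f}")
```

Note that for an identical pair every module's L1 distance vector is zero, so the decision network sees the all-zero vector regardless of which module it came from; the learned decision layer is what turns these per-feature distances into a verification decision.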
From Figs. 3, 4, 5 and 6, we observe that the feature-extracting autoencoders are able to generate high-quality reconstructions of the intended facial feature. Once training is complete, the autoencoders take unmasked full images as input and reconstruct only the required facial region, incorporating the relevant information of that facial feature into the latent feature vector. The facial landmarks used for masking were generated using MTCNN [45]. The subnetworks can be trained in parallel as they are independent of each other. Once training is complete, we obtain a complete end-to-end face verification system.

Figure 5: Reconstruction of mouth. (a) input image, (b) masked target image, (c) reconstructed image

Figure 6: Reconstruction of remaining facial region. (a) input image, (b) masked target image, (c) reconstructed image

4. Interpretability in Modular Siamese Networks

The proposed system inherently generates feature-level heatmaps that are intuitive and easily interpreted, as humans naturally observe the similarity of high-level visual concepts instead of pixels. Each subnetwork of the MSN generates a distance measure that reflects the visual similarity of the features. This is achieved by computing the Euclidean distance between the twin output vectors produced by the Siamese networks of each module, each representing a certain feature. Using these distance measures, a pairwise heatmap incorporating the similarity or dissimilarity of the features is generated and overlaid on both of the images. As can be seen in Fig. 7, the proposed system is able to effectively localize the similarities and dissimilarities of features in a pair of images. These heatmaps can be used as a tool for understanding the decisions taken by the verification system (refer to Section 5.2).

5. Experimental Results

The face verification system was trained on the VGGFace2 dataset [46] and evaluated on the Labeled Faces in the Wild (LFW) dataset [47]. For reporting performance, we use 10-fold cross-validation with the splits defined by the LFW protocol, which serves as a benchmark for comparison [47].

5.1. Verification

The accuracies of the individual modules and the proposed MSN model are presented in Table 1. The accuracies for the individual modules were calculated by finding the optimum distance threshold that maximizes accuracy.

Table 1: Accuracies of the Modular Siamese Network and its sub-modules.

No.  Model                      Accuracy
1.   Module 1 - Eyes            80.8%
2.   Module 2 - Nose            73.2%
3.   Module 3 - Mouth           74.5%
4.   Module 4 - Rest            78.3%
5.   Modular Siamese Network    98.5%

We observe that the eyes module outperforms the other modules, indicating that the eyes could be the most discriminating feature. The accuracy of the MSN is 98.5%, which is comparable to the state-of-the-art accuracies reported in the literature, which are greater than 99%.

5.2. Feature-level Heatmaps

Feature-level heatmaps are intuitive and easily interpretable because humans, unlike computers, look at features as a whole and not at individual pixels. The pairwise heatmaps that are inherently generated by the proposed method incorporate relative information, taking both input images into consideration. The feature-wise Euclidean distances computed by the individual modules of the MSN are used to generate the heatmaps. As can be seen in Figure 7, features that look visually similar are colored blue, and dissimilar features are colored red, in all of the images. For true positives, the heatmaps indicate high similarity for features that are visually close, as expected. The system shows high dissimilarity between the nose regions of the first impostor pair in Fig. 7(b), which is in line with human perception, as their shapes are significantly different.

Studying when the system fails can be helpful, since these visual cues may help rectify the workings of the system. In the first pair of Fig. 7(c), we observe that both persons wearing eyeglasses caused the eyes module to assign a low distance score, which, combined with another similar-looking feature, resulted in a misclassification. The heatmap of the second pair of Fig. 7(c) demonstrates how spectacles and similar-looking facial hair fooled the system. The heatmaps in Fig. 7(d) illustrate how closed eyes and a significant difference in pose can affect verification. In the first pair, the same person closing their eyes in one of the images made the eyes module compute a high distance score. In the second, a significantly different pose, which resulted in partial visibility of facial features in one of the images, led the system to predict a high dissimilarity score.

Since these feature-level computations are carried out live, the system can instantly generate meaningful messages that help the user correct any issues in case of a failure, such as removing eyeglasses or changing pose for better lighting.

Figure 7: Demonstration of facial feature explanations: each facial factor and its relevance to face verification. Green indicates similarity while red indicates dissimilarity. (a) True Positives (b) True Negatives (c) False Positives (d) False Negatives (e) Color map indicating dissimilarity. Best viewed in color. (Refer to Section 5.2)

Figure 8: Robustness of the proposed approach against the FGSM attack. (IFV: Interpretable and Robust Face Verification system (proposed method))

Figure 9: Robustness of the proposed approach against the DeepFool attack. (IFV: Interpretable and Robust Face Verification system (proposed method))
5.3. Performance under adversarial attacks

We tested the robustness of the proposed method against widely known adversarial attacks: the Fast Gradient Sign Method (FGSM) [8], DeepFool [48], and FGSM as used in fast adversarial training (FFGSM) [49]. Taking the first image in each image pair to be the test image and the other to be the anchor image, we attack only the test image, similar to the experiments conducted in [50, 51]. For comparison, we consider the well-known FaceNet model, which has previously reported state-of-the-art performance. The results are plotted in Figures 8, 9 and 10.

Figure 10: Robustness of the proposed approach against the FFGSM attack. (IFV: Interpretable and Robust Face Verification system (proposed method))

The proposed method shows significantly higher robustness than FaceNet against all three adversarial attacks. For FGSM, the accuracy of FaceNet falls below 20% when ε is 0.05, while the MSN is still close to 60% accurate (see Figure 8). In the case of the DeepFool attack, we notice a sharp drop in accuracy to below 10% at step 2 for FaceNet, while the MSN shows far more resilience, remaining more than 70% accurate. Similarly, for FFGSM, the accuracy of FaceNet drops to just above 30% while the MSN remains above 60% at ε = 0.03. In all of these attacks, we notice that the individual modules are noticeably more resistant. Since the MSN makes its final prediction based on these functionally independent modules, it consequently inherits their robustness. The enhanced robustness can be attributed to the fault-tolerant nature of MNNs [52, 33]. Additionally, the encoders used for extracting feature-specific latent representations are trained to retain only the most salient features because of the bottleneck latent layer, and as a result they may provide some immunity against noise or perturbations.
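FGSM perturbs an input by ε times the sign of the loss gradient with respect to that input. The sketch below applies the attack to a toy logistic-regression "classifier" with a hand-derived gradient on synthetic data, purely to illustrate the form of the attack; it is not the models or the evaluation protocol used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: logistic regression p = sigmoid(w.x) on synthetic two-class data.
d, n = 20, 400
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)

# Fit with a few steps of full-batch gradient descent on cross-entropy loss.
w = np.zeros(d)
for _ in range(200):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

def accuracy(X_eval):
    return np.mean(((X_eval @ w) > 0) == y)

def fgsm(X_in, eps):
    """x_adv = x + eps * sign(dL/dx); for cross-entropy on this model,
    dL/dx = (sigmoid(w.x) - y) * w."""
    p = 1 / (1 + np.exp(-(X_in @ w)))
    grad = (p - y)[:, None] * w[None, :]   # loss gradient w.r.t. each input
    return X_in + eps * np.sign(grad)

clean_acc = accuracy(X)
adv_acc = accuracy(fgsm(X, eps=0.5))
print(f"clean accuracy: {clean_acc:.2f}, under FGSM (eps=0.5): {adv_acc:.2f}")
```

Even this tiny model collapses under a modest ε, which is the behavior Figure 8 reports for FaceNet; the claim of Section 5.3 is that the modular design degrades far more gracefully under the same budget.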
features because of the bottleneck latent layer and as a re- bmva.org/bmvc/2015/papers/paper041/index.html. sult, they may be able to provide some immunity against doi:10.5244/C.29.41. noise or perturbations. [3] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clus- tering, in: Proceedings of the IEEE conference on 6. Conclusion and Future Work computer vision and pattern recognition, 2015, pp. 815–823. Numerous face verification methods have been proposed [4] F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin in the literature, most of which focus solely on improving softmax for face verification, IEEE Signal Process- the performance. Consequently, super-human accuracy ing Letters 25 (2018) 926–930. has already been achieved in face verification. The real [5] S. Bach, A. Binder, G. Montavon, F. Klauschen, need for improvement in this domain is in the areas of K. R. MΓΌller, W. Samek, On pixel-wise explana- robustness, explainability and fairness. The most impor- tions for non-linear classifier decisions by layer- tant attribute of the proposed method is that it is both wise relevance propagation, in: PLoS ONE, vol- robust to adversarial attacks and inherently interpretable. ume 10, Public Library of Science, 2015, p. e0130140. To the best of our knowledge, there is no other published doi:10.1371/journal.pone.0130140. method for face verification that provides both of these [6] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedan- qualities at the same time. We believe that pursuing this tam, D. Parikh, D. Batra, Grad-CAM: Visual Expla- direction is essential for developing more trustworthy nations from Deep Networks via Gradient-based systems. Localization, in: ICCV, 2017, pp. 618–626. URL: Having the interpretations of predictions or decisions http://gradcam.cloudcv.org. while they are being taken by deep learning models could [7] M. T. Ribeiro, S. Singh, C. 
Guestrin, " why should i prove to be paramount in many applications. While post- trust you?" explaining the predictions of any clas- hoc interpretations might help in understanding the be- sifier, in: Proceedings of the 22nd ACM SIGKDD havior of the model, they may not be of much help in international conference on knowledge discovery generating real-time explanations. Incorporating inter- and data mining, 2016, pp. 1135–1144. pretability to the system itself could allow us to handle [8] I. J. Goodfellow, J. Shlens, C. Szegedy, Explain- human errors by enabling communication with the user, ing and harnessing adversarial examples, arXiv informing them of what went wrong and suggesting rec- preprint arXiv:1412.6572 (2014). tifications. [9] J. Su, D. V. Vargas, K. Sakurai, One pixel attack for In this paper, we have presented a new technique to fooling deep neural networks, IEEE Transactions learn latent representations of high-level facial features. on Evolutionary Computation 23 (2019) 828–841. We proposed a modular face verification system that in- [10] N. Carlini, D. Wagner, Towards evaluating the ro- herently generates interpretations of its decisions with bustness of neural networks, in: 2017 ieee sympo- the help of the learned feature-specific latent representa- sium on security and privacy (sp), IEEE, 2017, pp. tions. The need and importance of having such a readily 39–57. interpretable systems were discussed. Further, we have [11] Y. Sun, X. Wang, X. Tang, Deep Learning Face demonstrated that the proposed system a has higher re- Representation by Joint Identification-Verification, sistance to adversarial examples. undefined (2014). arXiv:1406.4773v1. In summary, we have introduced and validated a face [12] F. Schroff, D. Kalenichenko, J. 
Philbin, FaceNet: A verification system that: provides on-the-fly and easily unified embedding for face recognition and cluster- interpretable feature level explanations, has structurally ing, in: Proceedings of the IEEE Computer Society interpretable model architecture, is able to provide feed- Conference on Computer Vision and Pattern Recog- back in real time, and has increased robustness towards nition, volume 07-12-June, IEEE Computer Soci- adversarial attacks. ety, 2015, pp. 815–823. doi:10.1109/CVPR.2015. 7298682. arXiv:1503.03832. Multimedia Signal Processing (MMSP), IEEE, 2018, [13] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Addi- pp. 1–6. tive angular margin loss for deep face recognition, [23] M. D. Zeiler, R. Fergus, Visualizing and under- in: Proceedings of the IEEE/CVF Conference on standing convolutional networks, in: Lecture Computer Vision and Pattern Recognition, 2019, Notes in Computer Science (including subseries pp. 4690–4699. Lecture Notes in Artificial Intelligence and Lec- [14] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, ture Notes in Bioinformatics), volume 8689 LNCS, Z. Li, W. Liu, Cosface: Large margin cosine loss Springer Verlag, 2014, pp. 818–833. doi:10.1007/ for deep face recognition, in: Proceedings of the 978-3-319-10590-1_53. arXiv:1311.2901. IEEE Conference on Computer Vision and Pattern [24] M. D. Zeiler, G. W. Taylor, R. Fergus, Adap- Recognition, 2018, pp. 5265–5274. tive deconvolutional networks for mid and high [15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, level feature learning, in: Proceedings of the Sphereface: Deep hypersphere embedding for face IEEE International Conference on Computer Vision, recognition, in: Proceedings of the IEEE conference IEEE, 2011, pp. 2018–2025. URL: http://ieeexplore. on computer vision and pattern recognition, 2017, ieee.org/document/6126474/. doi:10.1109/ICCV. pp. 212–220. 2011.6126474. [16] Y. Liu, H. Li, X. Wang, Rethinking Feature Discrim- [25] R. R. Selvaraju, M. 
Cogswell, A. Das, R. Vedantam, ination and Polymerization for Large-scale Recog- D. Parikh, D. Batra, Grad-cam: Visual explanations nition, in: undefined, 2017. URL: http://arxiv.org/ from deep networks via gradient-based localization, abs/1710.00870. arXiv:1710.00870. in: Proceedings of the IEEE International Confer- [17] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, ence on Computer Vision, 2017, pp. 618–626. P. Frossard, Universal adversarial perturba- [26] A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Bal- tions (2016). URL: http://arxiv.org/abs/1610.08401. asubramanian, Grad-cam++: Generalized gradient- arXiv:1610.08401. based visual explanations for deep convolutional [18] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining networks, in: 2018 IEEE Winter Conference on Ap- and harnessing adversarial examples, in: 3rd Inter- plications of Computer Vision (WACV), IEEE, 2018, national Conference on Learning Representations, pp. 839–847. ICLR 2015 - Conference Track Proceedings, Inter- [27] S. M. Lundberg, S.-I. Lee, A unified approach to national Conference on Learning Representations, interpreting model predictions, in: Advances in ICLR, 2015. arXiv:1412.6572. neural information processing systems, 2017, pp. [19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Er- 4765–4774. han, I. Goodfellow, R. Fergus, Intriguing properties [28] J. R. Williford, B. B. May, J. Byrne, Explainable face of neural networks, in: 2nd International Con- recognition, in: A. Vedaldi, H. Bischof, T. Brox, ference on Learning Representations, ICLR 2014 - J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Conference Track Proceedings, International Con- Springer International Publishing, Cham, 2020, pp. ference on Learning Representations, ICLR, 2014. 248–263. arXiv:1312.6199. [29] B. RichardWebster, S. Y. Kwon, C. Clarizio, S. E. An- [20] M. Sharif, S. Bhagavatula, L. Bauer, M. K. Re- thony, W. J. 
Scheirer, Visual psychophysics for mak- iter, Accessorize to a crime: Real and stealthy ing face recognition algorithms more explainable, attacks on state-of-the-art face recognition, in: in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss Proceedings of the ACM Conference on Com- (Eds.), Computer Vision – ECCV 2018, Springer puter and Communications Security, volume 24-28- International Publishing, Cham, 2018, pp. 263–281. October-2016, Association for Computing Machin- [30] B. Yin, L. Tran, H. Li, X. Shen, X. Liu, Towards ery, New York, New York, USA, 2016, pp. 1528–1540. interpretable face recognition, in: Proceedings of URL: http://dl.acm.org/citation.cfm?doid=2976749. the IEEE International Conference on Computer 2978392. doi:10.1145/2976749.2978392. Vision, 2019, pp. 9348–9357. [21] M. Pautov, G. Melnikov, E. Kaziakhmedov, K. Kireev, [31] T. Xu, J. Zhan, O. G. Garrod, P. H. Torr, S.-C. Zhu, A. Petiushko, On adversarial patches: real-world R. A. Ince, P. G. Schyns, Deeper interpretability attack on arcface-100 face recognition system, in: of deep networks, arXiv preprint arXiv:1811.07807 2019 International Multi-Conference on Engineer- (2018). ing, Computer and Information Sciences (SIBIR- [32] T. Zee, G. Gali, I. Nwogu, Enhancing human face CON), IEEE, 2019, pp. 0391–0396. recognition with an interpretable neural network, [22] A. J. Bose, P. Aarabi, Adversarial attacks on face de- in: Proceedings of the IEEE/CVF International Con- tectors using neural net based constrained optimiza- ference on Computer Vision (ICCV) Workshops, tion, in: 2018 IEEE 20th International Workshop on 2019. [33] A. Schmidt, Z. Bandar, Modularity - a concept for [47] G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, new neural network architectures, Proc. IASTED In- Labeled Faces in the Wild: A Database for Study- ternational Conference on Computer Systems and ing Face Recognition in Unconstrained Environ- Applications, Irbid, Jordan, 1998 (1998). 
ments, Technical Report 07-49, University of Mas- [34] U. Hwang, J. Park, H. Jang, S. Yoon, N. I. Cho, Pu- sachusetts, Amherst, 2007. vae: A variational autoencoder to purify adversarial [48] S. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deep- examples, IEEE Access 7 (2019) 126582–126593. fool: A simple and accurate method to fool deep [35] P. Samangouei, M. Kabkab, R. Chellappa, Defense- neural networks, in: 2016 IEEE Conference on Com- gan: Protecting classifiers against adversarial at- puter Vision and Pattern Recognition (CVPR), 2016, tacks using generative models, arXiv preprint pp. 2574–2582. doi:10.1109/CVPR.2016.282. arXiv:1805.06605 (2018). [49] E. Wong, L. Rice, J. Z. Kolter, Fast is better than [36] I. Goodfellow, Y. Bengio, A. Courville, Deep learn- free: Revisiting adversarial training, arXiv preprint ing, MIT press, 2016. arXiv:2001.03994 (2020). [37] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural [50] F. Zuo, B. Yang, X. Li, Q. Zeng, Exploiting the inher- networks for one-shot image recognition, in: ICML ent limitation of l0 adversarial examples, in: 22nd deep learning workshop, volume 2, Lille, 2015. International Symposium on Research in Attacks, [38] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face Intrusions and Defenses ({RAID} 2019), 2019, pp. recognition (2015). 293–307. [39] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality [51] M. Kulkarni, A. Abubakar, Siamese networks for reduction by learning an invariant mapping, in: generating adversarial examples, arXiv preprint 2006 IEEE Computer Society Conference on Com- arXiv:1805.01431 (2018). puter Vision and Pattern Recognition (CVPR’06), [52] G. Auda, M. Kamel, Modular neural networks: a volume 2, IEEE, 2006, pp. 1735–1742. survey, International Journal of Neural Systems 9 [40] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large (1999) 129–151. scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010) 1109–1135. [41] W. Chen, X. Chen, J. 
Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, in: Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recogni- tion, 2017, pp. 403–412. [42] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Per- son re-identification by multi-channel parts-based cnn with improved triplet loss function, in: Pro- ceedings of the iEEE conference on computer vision and pattern recognition, 2016, pp. 1335–1344. [43] X. Hou, L. Shen, K. Sun, G. Qiu, Deep feature consis- tent variational autoencoder, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp. 1133–1141. doi:10.1109/WACV. 2017.131. [44] D. P. Kingma, J. Ba, Adam: A method for stochas- tic optimization, arXiv preprint arXiv:1412.6980 (2014). [45] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detec- tion and alignment using multitask cascaded con- volutional networks, IEEE Signal Processing Let- ters 23 (2016) 1499–1503. doi:10.1109/LSP.2016. 2603342. [46] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, Vggface2: A dataset for recognising faces across pose and age, in: 2018 13th IEEE International Con- ference on Automatic Face & Gesture Recognition (FG 2018), IEEE, 2018, pp. 67–74.