<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interpretable and Robust Face Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Preetam Prabhu Srikar Dammu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Srinivasa Rao Chalamala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ajeet Kumar Singh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yegnanarayana Bayya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TCS Research, Tata Consultancy Services Ltd.</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Advances in deep learning have been instrumental in enhancing the performance of face verification systems. Despite their ability to attain high accuracy, most of these systems fail to provide interpretations of their decisions. With the increasing demand for making deep learning models more interpretable, numerous post-hoc methods have been proposed to probe the workings of these systems. Yet, the quest for face verification systems that inherently provide interpretations remains largely unexplored. Additionally, most existing face recognition models are highly susceptible to adversarial attacks. In this work, we propose a face verification system that addresses the issue of interpretability by employing modular neural networks, in which representations for each individual facial part, such as the nose, mouth, and eyes, are learned separately. We also show that our method is significantly more resistant to adversarial attacks, thereby addressing another crucial weakness of deep learning models.</p>
      </abstract>
      <kwd-group>
<kwd>Face Verification</kwd>
        <kwd>Interpretability</kwd>
        <kwd>Adversarial Robustness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Over the last decade, many deep learning methods for face verification have been proposed, and a few of them have even surpassed human performance [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>]. These deep learning methods, while enabling exceptional performance, do not provide reasoning for their predictions. Blindly relying on the results of these black boxes without interpreting the reasons for their decisions could be detrimental, especially in critical applications in the medical, financial, and security domains.</p>
      <p>In the context of image recognition, various methods have been proposed to tackle interpretability by attempting to reason about why an object has been recognized in a particular way. LRP [<xref ref-type="bibr" rid="ref5">5</xref>], Grad-CAM [<xref ref-type="bibr" rid="ref6">6</xref>], and LIME [<xref ref-type="bibr" rid="ref7">7</xref>] have been widely used to highlight the regions of the image that the models look at when arriving at the final prediction. Despite the existence of several post-hoc interpretability methods, it is desirable to have a system that is inherently capable of producing interpretations of its decisions. When the latent features generated by the system represent a logical part of an object, it is convenient to infer the contributions of these features to the final prediction.</p>
      <p>Though most interpretability methods produce heatmaps highlighting the regions that contribute to the decision process of the models, in some applications it is still difficult to understand these heatmaps because they are generated at the pixel level. If these heatmaps could highlight logical visual concepts in the images, they would be more convenient to interpret (please refer to Figure 7 and Section 5.2).</p>
      <p>Another significant drawback of deep learning models is their susceptibility to adversarial attacks. Seemingly insignificant noise that is imperceptible to the human eye can fool deep learning models. Numerous black-box and white-box adversarial attack methods have been proposed in the literature [<xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>]. The problem of detecting and defending against adversarial attacks on deep learning models is still largely unsolved. As these attacks pose a serious security threat to face verification systems, it is imperative to develop trustworthy systems. Our motivation behind this work is to integrate both robustness to attacks and interpretability into face verification systems.</p>
      <p>Hence, in this work, we propose a face verification system that addresses the aforementioned issues by learning independent latent representations of high-level facial features. The proposed method generates intuitive and easily understood heatmaps on the fly, and is also shown to be much more robust against adversarial examples.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
<p>Face recognition is a non-invasive biometric authentication mechanism and has been in commercial use for several years. It has become one of the preferred choices of authentication for mobile device users, as it is easy to use and avoids the need to remember passwords. Though people have some reservations about using face recognition in large-scale systems due to privacy issues, it continues to be one of the most widely used technologies for identification.</p>
      <p>Deep learning based face recognition has surpassed hand-crafted feature-based systems and shallow learning systems in performance. In [<xref ref-type="bibr" rid="ref2">2</xref>], the authors proposed a deep learning architecture called VGGFace for generating facial feature representations, or face embeddings. These face embeddings can be further used for identifying a person using a similarity measure or a classifier. DeepID2 [<xref ref-type="bibr" rid="ref11">11</xref>] uses a Bayesian learning framework for learning metrics for face recognition. In FaceNet [<xref ref-type="bibr" rid="ref12">12</xref>], the authors proposed a compact embedding learned directly from images using triplet loss for face verification. Different loss functions that maximize intra-class similarity and improve discriminability for faces have been proposed: ArcFace [13], CosFace [14], SphereFace [15], and CoCo Loss [16].</p>
      <p>Existing face recognition models are extremely vulnerable to adversarial attacks even in the black-box setting, which raises security concerns and the need for developing more robust face recognition models. Adversarial attacks [17, 18, 19] add small, imperceptible, and carefully crafted perturbations to the input with the aim of fooling machine learning models. They allow an attacker to evade detection or recognition, or to impersonate another person. [20] described a method to realize adversarial attacks by introducing a pair of eyeglasses; these glasses could be used to evade detection or to impersonate others. Another approach, fooling ArcFace using adversarial patches, has been proposed in [21]. In [22], the authors proposed an approach for detecting adversarial attacks on faces.</p>
      <p>Understanding and interpreting the decisions of machine learning systems is of high importance in many applications, as it allows verifying the reasoning of the system and provides information to the human expert or end-user. Early works include direct visualization of the filters [23] and deconvolutional networks that reconstruct inputs from different layers [24].</p>
      <p>Numerous interpretability methods have been proposed in the literature; some of the widely known ones are Layer-wise Relevance Propagation (LRP) [<xref ref-type="bibr" rid="ref5">5</xref>], Gradient-weighted Class Activation Mapping (Grad-CAM) [25], Grad-CAM++ [26], SHapley Additive exPlanations (SHAP) values [27], and Local Interpretable Model-Agnostic Explanations (LIME) [<xref ref-type="bibr" rid="ref7">7</xref>]. Most of these techniques attempt to provide pixel-level explanations that indicate the contribution of each pixel to the classification decision. However, such methods are mostly suitable for tasks such as object recognition, where the deep learning model takes only a single input image.</p>
      <p>Recently, a few methods that attempt to explain the behavior and decisions of face recognition systems have emerged [28, 29, 30, 31, 32]. In [28], the authors rely on controlled degradations using inpainting to generate explanations. In [29], visual psychophysics was used to probe and study the behavior of face recognition systems. In [30], the authors propose a loss function that introduces interpretability to the face verification model through training. In [31], the authors use 3D modeling to visualize and understand how the model represents the information of face images. Fooling techniques [32] have also been used for gaining insights on the facial regions that contribute most to the decision.</p>
      <p>The recently developed explainability methods for face recognition are considerably different from one another in their approach and form of explanations, unlike saliency methods for object recognition, which generate similar forms of explanations. Each of these methods has its own pros and cons and is suitable for different purposes. We believe our method has characteristics that are well-suited for real-world applications: easily interpretable feature-level explanations, on-the-fly explanations for every prediction, a structurally interpretable model architecture, real-time feedback, and, most importantly, robustness towards adversarial attacks.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Interpretable and Robust Face Verification System</title>
      <p>Modular neural networks (MNN) [33] are a class of composite neural networks that were inspired by the biological modularity of the human brain. MNNs are composed of independent neural networks that serve as modules, each of them specializing in a specific task. MNNs are inherently more interpretable than monolithic neural networks due to their architecture and divide-and-conquer methodology, and they intrinsically introduce structural interpretability due to their modular structure. Studies have shown that MNNs are better at handling noise than monolithic networks [33]. Several defense mechanisms against adversarial attacks have been proposed in the literature, some of which have employed deep generative models [34, 35]. One of the main motivations for using generative models is their capability of representing information in a lower-dimensional latent space retaining only the most salient features [36].</p>
      <sec id="sec-3-1">
        <title>3.1. Model</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Model Composition Overview</title>
          <p>In the proposed MNN architecture, we allocate dedicated modules for the eyes, nose, and mouth, and one for the rest of the features. We employ autoencoders to learn separate and distinct latent representations for the different facial features. To achieve this, we mask the input image to retain only the region of interest of that specific module and present it as the target image (See Fig. 1). After the autoencoders have been trained, we retain the encoder and substitute the decoder with Siamese networks in all of the modules, resulting in Modular Siamese Networks (MSN) (See Fig. 2).</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Feature-extracting Autoencoders</title>
          <p>In this work, we employ undercomplete autoencoders [36], a type of autoencoder whose latent dimension is lower than the input dimension. Undercomplete autoencoders are trained to reconstruct the original image as accurately as possible while constricting the latent space to a sufficiently small dimension, ensuring that only the most salient features are retained in the encoded latent vectors. To extract feature-specific latent vectors, we use a novel technique: instead of giving the full image as the target, we mask the input image, retain only the part containing the feature of interest, and produce it as the target image. Consequently, the autoencoder learns a latent representation containing the important information about that feature and restores only the required part of the image (See Fig. 1, examples in 3.2).</p>
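          <p>To make the masked-target setup concrete, below is a minimal PyTorch sketch of one feature module, assuming precomputed binary feature masks. The layer sizes and the MSE stand-in for the perceptual loss of [43] are illustrative assumptions, not the exact configuration used in this work.</p>
          <preformat>import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    """Undercomplete autoencoder for one facial-feature module (illustrative sizes)."""
    def __init__(self, in_dim=64 * 64 * 3, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim))            # bottleneck smaller than the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, in_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)                          # feature-specific latent vector
        return self.decoder(z).view_as(x), z

def train_step(model, optimizer, image, feature_mask, loss_fn=nn.MSELoss()):
    """One update: full image as input, masked feature region as the target."""
    target = image * feature_mask                    # e.g. a binary mask around the nose
    recon, _ = model(image)
    loss = loss_fn(recon, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()</preformat>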
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Siamese Networks</title>
          <p>In the task of face verification, a pair of images is given as input, which could be either a valid pair or an impostor pair. In the proposed MSN architecture, disentangled embeddings of facial features are generated for both input images by the feature-extracting encoders present in each feature-specific module. These feature embedding pairs are then fed to the Siamese networks present in each module, which compute the L1 distance vectors for each of the twin feature latent embedding pairs, similar to the method followed in [37]. The distance vectors from all of the modules are then concatenated and fed to a common decision network which makes the final prediction.</p>
          <p>Siamese networks have achieved great results in image verification [37, 38]. The two Siamese twin networks share the same weights and parameters. The hypothesis behind this architecture is that if the inputs x1 and x2 are similar, then the distance between the output vectors h1 and h2 will be small. The network is trained in such a way that it maximizes the distance between mismatched pairs and minimizes the distance between matched pairs. Loss functions like contrastive loss [39] and triplet loss [40] can be used to achieve this; a few improved versions of these loss functions have also been proposed in the literature [41, 42].</p>
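          <p>For concreteness, a standard formulation of these two losses is sketched below; the margin values are arbitrary placeholders rather than the ones used in our experiments.</p>
          <preformat>import torch.nn.functional as F

def contrastive_loss(h1, h2, y, margin=1.0):
    """Contrastive loss [39]: y = 1 for a matched pair, 0 for a mismatched pair."""
    d = F.pairwise_distance(h1, h2)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss [40]: pull anchor-positive together, push anchor-negative apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()</preformat>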
          <p>In our model, we employ Siamese networks for discriminating between the feature-specific latent vectors of impostor and valid pairs. The latent vectors z1 and z2 are obtained from the feature-extracting autoencoders described in 3.1.2. L1 distance vectors are computed from the output vectors h1 and h2 obtained from the Siamese twins of each module. The distance vectors of all of the modules are then concatenated and given as input to the decision network (See Fig. 2).</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Decision Network</title>
          <p>The decision network is a feed-forward fully connected network that takes the concatenated input from all of the modules. This network enables us to incorporate information from all of the modules to predict the final decision.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>3.1.5. Model Architectural Details</title>
          <p>The model architecture and training setting described in [43] were used for training the feature-extracting autoencoders. The Siamese networks consist of four fully connected layers with ELU activation functions. The final decision network that takes the concatenated distance vectors from the modules has two fully connected layers with ReLU activation functions.</p>
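          <p>Putting 3.1.3 to 3.1.5 together, a sketch of the MSN forward pass is shown below; the hidden-layer widths are our own assumptions, while the layer counts and activations follow the description above.</p>
          <preformat>import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    """Four fully connected layers with ELU, shared by both twins of a module."""
    def __init__(self, latent_dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, out_dim), nn.ELU())

    def forward(self, z):
        return self.net(z)

class ModularSiameseNetwork(nn.Module):
    def __init__(self, encoders, out_dim=64):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)      # feature-extracting encoders (frozen)
        self.heads = nn.ModuleList([SiameseHead(out_dim=out_dim) for _ in encoders])
        self.decision = nn.Sequential(               # two fully connected layers with ReLU
            nn.Linear(out_dim * len(encoders), 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, x1, x2):
        distances = []
        for enc, head in zip(self.encoders, self.heads):
            h1, h2 = head(enc(x1)), head(enc(x2))    # twins share weights
            distances.append(torch.abs(h1 - h2))     # L1 distance vector per module
        return torch.sigmoid(self.decision(torch.cat(distances, dim=1)))</preformat>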
      </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Training details</title>
        <p>The training of the proposed MSN is carried out in three phases. In the first phase, the feature-extracting autoencoders are trained with perceptual loss [43]. In the next phase, the decoder parts in each of the modules are replaced with the Siamese network and trained using the triplet loss, freezing the layers trained in the previous phase. Finally, the decision network is trained using Binary Cross-Entropy (BCE). The Adam optimization technique [44] was used for training the network in all three training phases.</p>
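        <p>A compact sketch of this three-phase schedule is given below; the loader structure, loss callables, and freezing granularity are hypothetical stand-ins for our actual pipeline.</p>
        <preformat>import torch

def train_msn(autoencoders, module_loaders, siamese_heads, triplet_loaders,
              msn, pair_loader, perceptual_loss, triplet_loss):
    # Phase 1: feature-extracting autoencoders with perceptual loss [43].
    for ae, loader in zip(autoencoders, module_loaders):
        opt = torch.optim.Adam(ae.parameters())
        for image, feature_mask in loader:
            recon, _ = ae(image)
            loss = perceptual_loss(recon, image * feature_mask)
            opt.zero_grad(); loss.backward(); opt.step()

    # Phase 2: freeze the phase-1 encoders, train Siamese heads with triplet loss.
    for ae, head, loader in zip(autoencoders, siamese_heads, triplet_loaders):
        ae.encoder.requires_grad_(False)
        opt = torch.optim.Adam(head.parameters())
        for anchor, positive, negative in loader:
            loss = triplet_loss(head(ae.encoder(anchor)),
                                head(ae.encoder(positive)),
                                head(ae.encoder(negative)))
            opt.zero_grad(); loss.backward(); opt.step()

    # Phase 3: decision network with binary cross-entropy on pair labels.
    bce = torch.nn.BCELoss()
    opt = torch.optim.Adam(msn.decision.parameters())
    for (x1, x2), label in pair_loader:
        loss = bce(msn(x1, x2).squeeze(1), label.float())
        opt.zero_grad(); loss.backward(); opt.step()</preformat>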
        <p>From Figs. 3, 4, 5 and 6, we observe that the feature-extracting autoencoders are able to generate high-quality reconstructions of the intended facial feature. Once training is complete, the autoencoders take unmasked full images as input and reconstruct only the required facial region by incorporating the relevant information of that facial feature into the latent feature vector.</p>
        <p>The subnetworks can be trained in parallel as they are
independent of each other. Once the training is complete,
we obtain a complete end-to-end face verification system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Interpretability in Modular Siamese Networks</title>
      <p>The proposed system inherently generates feature-level heatmaps that are intuitive and easily interpreted, as humans naturally observe the similarity of high-level visual concepts instead of pixels. Each subnetwork of the MSN generates a distance measure that reflects the visual similarity of the features. This is achieved by computing the Euclidean distance between the twin output vectors produced by the Siamese networks of each module, each representing a certain feature. Using these distance measures, a pairwise heatmap incorporating the similarity or dissimilarity of the features is generated and overlaid on both of the images. As can be seen in Fig. 7, the proposed system is able to effectively localize the similarities and dissimilarities of features in a pair of images. These heatmaps could be used as a tool for understanding the decisions taken by the verification system (Refer Section 5.2).</p>
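      <p>A simplified version of this overlay step is sketched below; the region masks, the normalization constant, and the blending weight are hypothetical choices for illustration.</p>
      <preformat>import numpy as np

def pairwise_heatmap(image, region_masks, distances, d_max=2.0):
    """Overlay per-feature similarity on one image of the pair.

    region_masks: {feature: HxW boolean mask}
    distances:    {feature: Euclidean distance between the twin outputs}
    Blue marks similar features, red marks dissimilar ones.
    """
    overlay = image.astype(np.float32).copy()
    for feature, mask in region_masks.items():
        score = min(distances[feature] / d_max, 1.0)             # normalize to [0, 1]
        color = np.array([255 * score, 0.0, 255 * (1 - score)])  # red vs. blue (RGB)
        overlay[mask] = 0.5 * overlay[mask] + 0.5 * color        # alpha blend
    return overlay.astype(np.uint8)</preformat>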
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>The face verification system was trained on the VGGFace2 dataset [46] and evaluated on the Labeled Faces in the Wild (LFW) dataset [47].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Interpreting the Heatmaps</title>
        <p>Feature-level heatmaps are intuitive and easily interpretable, as humans, unlike computers, look at features as a whole and not at pixels individually. The pairwise heatmaps that are inherently generated by the proposed method incorporate relative information, taking both of the input images into consideration. The feature-wise Euclidean distances computed by the individual modules of the MSN are used to generate the heatmaps. As can be seen in Figure 7, features that look visually similar are colored blue, and dissimilar ones are colored red, in all of the images. For true positives, the heatmaps indicate high similarity for features that are visually close, as expected. The system shows high dissimilarity between the nose regions of the first impostor pair in 5.b, which is in line with human perception, as their shapes are significantly different. Studying when the system fails could be helpful, since these visual cues may help rectify the workings of the system. In the first pair of 5.c, we observe that both persons wearing eyeglasses caused the eyes module to assign a low distance score, which, accompanied by another similar-looking feature, resulted in a misclassification. The heatmap of the second pair of 5.c demonstrates how spectacles and similar-looking facial hair fooled the system. The heatmaps in 5.d illustrate how closed eyes and a significant difference in pose can affect the verification. In the first pair, the same person closing their eyes in one of the images made the eyes module compute a high distance score. In the second, a significantly different pose, which resulted in partial visibility of facial features in one of the images, led the system to predict a high dissimilarity score.</p>
        <p>Since these feature-level computations are carried out live, the system can instantly generate meaningful messages that help the user correct any issues in case of a failure, such as removing eyeglasses or adjusting pose or lighting.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Performance under adversarial attacks</title>
        <p>We tested the robustness and resistance of the proposed method against widely known adversarial attacks: the Fast Gradient Sign Method (FGSM) [<xref ref-type="bibr" rid="ref8">8</xref>], DeepFool [48], and FGSM with fast adversarial training (FFGSM) [49].</p>
        <p>Taking the first image of each pair to be the test image and the other to be the anchor image, we attack only the test image, similar to the experiments conducted in [50, 51]. For comparison, we considered the well-known FaceNet model, which has previously reported state-of-the-art performance. The results are plotted in Figures 8, 9 and 10.</p>
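        <p>A minimal sketch of this protocol for FGSM [8] is given below; the generic verifier interface, decision threshold, and pair format are placeholder assumptions rather than our exact evaluation harness.</p>
        <preformat>import torch
import torch.nn.functional as F

def fgsm_test_image(verifier, test_img, anchor_img, label, epsilon=0.05):
    """FGSM: perturb only the test image, keep the anchor image clean."""
    test_img = test_img.clone().requires_grad_(True)
    score = verifier(test_img, anchor_img)            # predicted P(valid pair)
    loss = F.binary_cross_entropy(score, label)
    loss.backward()
    adv = test_img + epsilon * test_img.grad.sign()   # one signed-gradient step
    return adv.clamp(0, 1).detach()

def attacked_accuracy(verifier, pairs, epsilon, threshold=0.5):
    """pairs: iterable of (test_img, anchor_img, label); label is 1.0 or 0.0."""
    correct = 0
    for test_img, anchor_img, label in pairs:
        adv = fgsm_test_image(verifier, test_img, anchor_img, label, epsilon)
        with torch.no_grad():
            pred = (verifier(adv, anchor_img) > threshold).float()
        correct += int((pred == label).all())
    return correct / len(pairs)</preformat>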
        <p>The proposed method has shown significantly higher
robustness than FaceNet against all three adversarial
attacks.</p>
        <p>For FGSM, the accuracy of FaceNet falls below 20% when ε is 0.05, while MSN is still close to 60% accurate (See Figure 8). In the case of the DeepFool attack, we notice a sharp drop in the accuracy of FaceNet to below 10% at step 2, while MSN shows far more resilience by remaining more than 70% accurate. Similarly, for FFGSM, the accuracy of FaceNet drops to just above 30%, while MSN still has an accuracy above 60% at ε = 0.03. Under all of these attacks, we notice that the individual modules are noticeably more resistant. Since MSN makes the final prediction based on these functionally independent modules, it consequently inherits its robustness from them.</p>
        <p>The enhanced robustness could be attributed to the fault-tolerant nature of MNNs [52, 33]. Additionally, the encoders used for extracting the feature-specific latent representations are trained to retain only the most salient features because of the bottleneck latent layer; as a result, they may provide some immunity against noise or perturbations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>Numerous face verification methods have been proposed
in the literature, most of which focus solely on improving
the performance. Consequently, super-human accuracy
has already been achieved in face verification. The real
need for improvement in this domain is in the areas of
robustness, explainability and fairness. The most
important attribute of the proposed method is that it is both
robust to adversarial attacks and inherently interpretable.</p>
      <p>To the best of our knowledge, there is no other published
method for face verification that provides both of these
qualities at the same time. We believe that pursuing this
direction is essential for developing more trustworthy
systems.</p>
      <p>Having interpretations of predictions or decisions while they are being made by deep learning models could prove to be paramount in many applications. While post-hoc interpretations might help in understanding the behavior of the model, they may not be of much help in generating real-time explanations. Incorporating interpretability into the system itself could allow us to handle human errors by enabling communication with the user, informing them of what went wrong and suggesting rectifications.</p>
      <p>In this paper, we have presented a new technique to
learn latent representations of high-level facial features.</p>
      <p>We proposed a modular face verification system that inherently generates interpretations of its decisions with the help of the learned feature-specific latent representations. The need for and importance of having such readily interpretable systems were discussed. Further, we have demonstrated that the proposed system has a higher resistance to adversarial examples.</p>
      <p>In summary, we have introduced and validated a face verification system that provides on-the-fly, easily interpretable feature-level explanations, has a structurally interpretable model architecture, is able to provide feedback in real time, and has increased robustness towards adversarial attacks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] N. Kumar, A. C. Berg, P. N. Belhumeur, S. K. Nayar, Attribute and simile classifiers for face verification, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 365–372.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: Proceedings of the British Machine Vision Conference 2015, British Machine Vision Association, 2015, pp. 41.1–41.12. URL: http://www.bmva.org/bmvc/2015/papers/paper041/index.html. doi:10.5244/C.29.41.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification, IEEE Signal Processing Letters 25 (2018) 926–930.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE 10 (2015) e0130140. doi:10.1371/journal.pone.0130140.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: ICCV, 2017, pp. 618–626. URL: http://gradcam.cloudcv.org.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] J. Su, D. V. Vargas, K. Sakurai, One pixel attack for fooling deep neural networks, IEEE Transactions on Evolutionary Computation 23 (2019) 828–841.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] N. Carlini, D. Wagner, Towards evaluating the robustness of neural networks, in: 2017 IEEE Symposium on Security and Privacy (SP), IEEE, 2017, pp. 39–57.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] Y. Sun, X. Wang, X. Tang, Deep learning face representation by joint identification-verification (2014). arXiv:1406.4773v1.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, IEEE Computer Society, 2015, pp. 815–823. doi:10.1109/CVPR.2015.7298682. arXiv:1503.03832.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] J. Deng, J. Guo, N. Xue, S. Zafeiriou, Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, W. Liu, Cosface: Large margin cosine loss for deep face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5265–5274.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, Sphereface: Deep hypersphere embedding for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] Y. Liu, H. Li, X. Wang, Rethinking feature discrimination and polymerization for large-scale recognition, 2017. URL: http://arxiv.org/abs/1710.00870. arXiv:1710.00870.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, Universal adversarial perturbations (2016). URL: http://arxiv.org/abs/1610.08401. arXiv:1610.08401.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, in: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, ICLR, 2015. arXiv:1412.6572.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, in: 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, ICLR, 2014. arXiv:1312.6199.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] M. Sharif, S. Bhagavatula, L. Bauer, M. K. Reiter, Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition, in: Proceedings of the ACM Conference on Computer and Communications Security, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1528–1540. URL: http://dl.acm.org/citation.cfm?doid=2976749.2978392. doi:10.1145/2976749.2978392.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] M. Pautov, G. Melnikov, E. Kaziakhmedov, K. Kireev, A. Petiushko, On adversarial patches: real-world attack on arcface-100 face recognition system, in: 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), IEEE, 2019, pp. 0391–0396.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] A. J. Bose, P. Aarabi, Adversarial attacks on face detectors using neural net based constrained optimization, in: 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), IEEE, 2018, pp. 1–6.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Lecture Notes in Computer Science, volume 8689 LNCS, Springer Verlag, 2014, pp. 818–833. doi:10.1007/978-3-319-10590-1_53. arXiv:1311.2901.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] M. D. Zeiler, G. W. Taylor, R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2011, pp. 2018–2025. URL: http://ieeexplore.ieee.org/document/6126474/. doi:10.1109/ICCV.2011.6126474.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 839–847.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] J. R. Williford, B. B. May, J. Byrne, Explainable face recognition, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 248–263.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] B. RichardWebster, S. Y. Kwon, C. Clarizio, S. E. Anthony, W. J. Scheirer, Visual psychophysics for making face recognition algorithms more explainable, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV 2018, Springer International Publishing, Cham, 2018, pp. 263–281.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] B. Yin, L. Tran, H. Li, X. Shen, X. Liu, Towards interpretable face recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9348–9357.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] T. Xu, J. Zhan, O. G. Garrod, P. H. Torr, S.-C. Zhu, R. A. Ince, P. G. Schyns, Deeper interpretability of deep networks, arXiv preprint arXiv:1811.07807 (2018).</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] T. Zee, G. Gali, I. Nwogu, Enhancing human face recognition with an interpretable neural network, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] A. Schmidt, Z. Bandar, Modularity - a concept for new neural network architectures, in: Proc. IASTED International Conference on Computer Systems and Applications, Irbid, Jordan, 1998.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] U. Hwang, J. Park, H. Jang, S. Yoon, N. I. Cho, Puvae: A variational autoencoder to purify adversarial examples, IEEE Access 7 (2019) 126582–126593.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] P. Samangouei, M. Kabkab, R. Chellappa, Defense-gan: Protecting classifiers against adversarial attacks using generative models, arXiv preprint arXiv:1805.06605 (2018).</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, volume 2, Lille, 2015.</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition (2015).</mixed-citation></ref>
      <ref id="ref39"><mixed-citation>[39] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, IEEE, 2006, pp. 1735–1742.</mixed-citation></ref>
      <ref id="ref40"><mixed-citation>[40] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010) 1109–1135.</mixed-citation></ref>
      <ref id="ref41"><mixed-citation>[41] W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 403–412.</mixed-citation></ref>
      <ref id="ref42"><mixed-citation>[42] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Person re-identification by multi-channel parts-based cnn with improved triplet loss function, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1335–1344.</mixed-citation></ref>
      <ref id="ref43"><mixed-citation>[43] X. Hou, L. Shen, K. Sun, G. Qiu, Deep feature consistent variational autoencoder, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp. 1133–1141. doi:10.1109/WACV.2017.131.</mixed-citation></ref>
      <ref id="ref44"><mixed-citation>[44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation></ref>
      <ref id="ref45"><mixed-citation>[45] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Processing Letters 23 (2016) 1499–1503. doi:10.1109/LSP.2016.2603342.</mixed-citation></ref>
      <ref id="ref46"><mixed-citation>[46] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, A. Zisserman, Vggface2: A dataset for recognising faces across pose and age, in: 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2018), IEEE, 2018, pp. 67–74.</mixed-citation></ref>
      <ref id="ref47"><mixed-citation>[47] G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007.</mixed-citation></ref>
      <ref id="ref48"><mixed-citation>[48] S. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deepfool: A simple and accurate method to fool deep neural networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2574–2582. doi:10.1109/CVPR.2016.282.</mixed-citation></ref>
      <ref id="ref49"><mixed-citation>[49] E. Wong, L. Rice, J. Z. Kolter, Fast is better than free: Revisiting adversarial training, arXiv preprint arXiv:2001.03994 (2020).</mixed-citation></ref>
      <ref id="ref50"><mixed-citation>[50] F. Zuo, B. Yang, X. Li, Q. Zeng, Exploiting the inherent limitation of l0 adversarial examples, in: 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), 2019, pp. 293–307.</mixed-citation></ref>
      <ref id="ref51"><mixed-citation>[51] M. Kulkarni, A. Abubakar, Siamese networks for generating adversarial examples, arXiv preprint arXiv:1805.01431 (2018).</mixed-citation></ref>
      <ref id="ref52"><mixed-citation>[52] G. Auda, M. Kamel, Modular neural networks: a survey, International Journal of Neural Systems 9 (1999) 129–151.</mixed-citation></ref>
    </ref-list>
  </back>
</article>