Bridging Human Concepts and Computer Vision for Explainable Face Verification

Miriam Doh1,2,†, Caroline Mazini Rodrigues3,4, Nicolas Boutry3, Laurent Najman4, Matei Mancas1 and Hugues Bersini2
1 ISIA lab - Université de Mons (UMONS), Bd Dolez 31, 7000, Mons, Belgium
2 IRIDIA lab - Université Libre de Bruxelles (ULB), Av. Adolphe Buyl 87, 1060, Ixelles, Belgium
3 LRE - Laboratoire de Recherche de l'EPITA, 14-16 Rue Voltaire, 94270, Kremlin-Bicêtre, France
4 LIGM - Laboratoire d'Informatique Gaspard-Monge, Université Gustave-Eiffel, 77454, Marne-la-Vallée, France

Abstract
With Artificial Intelligence (AI) influencing the decision-making process of sensitive applications such as face verification, it is fundamental to ensure the transparency, fairness, and accountability of decisions. Although Explainable Artificial Intelligence (XAI) techniques exist to clarify AI decisions, it is equally important that these decisions remain interpretable to humans. In this paper, we present an approach that combines computer and human vision to increase the interpretability of the explanations produced for a face verification algorithm. In particular, we take inspiration from the human perceptual process to understand how machines perceive human-semantic areas of the face during face comparison tasks. We use Mediapipe, which provides a segmentation technique that identifies distinct human-semantic facial regions, enabling the analysis of the machine's perception. Additionally, we adapt two model-agnostic algorithms to provide human-interpretable insights into the decision-making process.

Keywords: Face verification, Explainable AI (XAI), Interpretability

BEWARE-23 Joint Workshop @ AIxIA 2023, November 6th, 2023, Rome, Italy
† Work supported by the ARIAC project (No. 2010235), funded by the Service Public de Wallonie (SPW Recherche).
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Face verification [1] aims to confirm an individual's identity based on facial features, with applications in law enforcement [2], border control [3], and smartphone security [4]. As AI becomes prevalent in decision-making [5], ensuring model fairness, accountability, confidentiality, and transparency is crucial [6]. However, complex ML models are often seen as "black boxes" [7]. Explainable AI (XAI) [8] addresses this challenge by enhancing interpretability to make AI systems transparent and understandable to humans, thereby increasing trust in their decisions. Saliency maps have become the most popular XAI solution in computer vision, offering insights into the critical features considered in the decision-making process. However, in face verification, decisions often rely on adjustable thresholds chosen for the specific application rather than on understandable semantic classes. This raises questions about the adequacy of identifying the most important features in an image as the only explanation [9]. Taking inspiration from the human perceptual process, we propose a model-agnostic approach capable of determining how the machine perceives similar semantic areas of the face when comparing two faces. Our primary objective is to translate the XAI solution into human decision-making meaningfully.
However, incorporating human-based semantics in the models' explanation process can also introduce human bias into these same explanations. To increase human interpretability, we must also ensure the faithfulness of the explanations to the model's reasoning. Faithfulness refers to whether a feature considered important for the model actually changes the model's decision [10]. For face verification, the model extracts, for each face, the features that will be compared. Modifications of the features will also impact how similar the two faces are. Therefore, it is essential to understand how face parts, such as an eye, impact the final features. To translate the model's knowledge to human knowledge as smoothly as possible, we first segment the face parts based on human semantics. By considering the impact of those face parts on a set of face images, we can obtain a global view of the model's knowledge. Following feature extraction (through the model), we verify whether two images depict the same person by comparing their facial features. To understand the contribution of the chosen concepts to the relation between two compared faces, we introduce an algorithm grounded in the perturbation of the facial regions linked to the extracted concepts, mirroring the human perceptual process of face recognition. It evaluates corresponding semantic areas along a spectrum of similarities, providing interpretation and contextualisation.

We structured this paper as follows: in Section 2, we present state-of-the-art methods for explaining the face verification task; in Section 3, we describe our framework, including the extraction of the model's concepts and the perturbation methods for face comparison; in Section 4, we report the experimental results and limitations; in Section 5, we conclude the work.

2. Related work

Saliency maps, such as CAMs [11, 12] and RISE [13], are crucial for interpreting deep-learning models, revealing their inner workings. However, their development has centered primarily on object recognition, leaving the field of face analysis relatively unexplored. Despite its critical applications, research in face analysis has been limited. Works such as [14, 15, 16] mainly focus on the significance of individual pixels or low-level features, which can be challenging for human analysts and may not align with intuition. Conversely, LIME [17] employs superpixels within the image, providing a user-friendly, concept-driven explanation. However, this technique relies on a new model approximating the original one, potentially obscuring the actual reasons for the original model's behavior [18]. Alternative approaches, such as TCAV [19] and knowledge graphs [20], aim to represent the model's knowledge through concepts rather than low-level pixel importance. TCAV employs semantic concepts defined by users or discovered through image segment activations (with the ACE method [21]), while knowledge graphs identify repeating patterns across network layers. Additionally, Tan et al. [22] introduced the Locality Guided Neural Network (LGNN), designed to induce a filter topology that enhances the visualization of concepts. Inspired by these methods, our approach combines human and model perspectives to identify essential concepts for face verification. We acknowledge that relying solely on human concepts can introduce bias, while relying solely on the model can complicate interpretation.
3. Proposed Method

Figure 1: Face verification adaptation of the XAI perceptual-processing framework proposed by [23], inspired by how humans process stimuli (select, organize, interpret, and compare).

To help humans understand how AI systems make decisions, it is essential to present the information in a way that aligns with human cognitive processes. Cognitive psychology provides valuable insights into how people perceive and process information when identifying faces. Taking inspiration from the flowchart proposed by [23], we apply a similar scheme to face verification (see Figure 1). The human perceptual process consists of three key phases: selection, organization, and interpretation [24]. Cognitive psychology has shown that, when recognizing faces, our attention is particularly drawn to specific facial areas, such as the eyes and nose [25, 26, 27]. Subsequently, these facial stimuli are organized into meaningful concepts, adding semantics to the process. Our brains compare these higher-level concepts to assess the similarity between items, facilitating face categorization. This comparative analysis may involve matching a face to a remembered image or to another face in front of us. In this context, we question the adequacy of saliency maps used in computer vision as explanations and their alignment with our human reasoning processes.

Based on cognitive psychology, we developed the general flowchart shown in Figure 2. Face verification systems rely primarily on a matching score between two face images $A$ and $B$. This score, $S_B^A$, is the cosine similarity between the feature vectors $\mathbf{f}_A$ and $\mathbf{f}_B$ extracted from each image:

$$S_B^A = \frac{\mathbf{f}_A \cdot \mathbf{f}_B}{\|\mathbf{f}_A\|\,\|\mathbf{f}_B\|} \qquad (1)$$

The resulting score ranges from 0 to 1, with a score of 1 indicating identical images ($A = B$). As our approach is model-agnostic, we aim to explain the algorithm by perturbing the inputs and studying the system's decision behaviour through the input-output relationship. Inspired by the work of [16], our desired output is a similarity map indicating which face areas are considered similar or dissimilar across both images, using an AI model as a feature extractor. To achieve this, we perform a semantic perturbation on images $A$ and $B$, resulting in new images denoted as $A^{(n)}$ and $B^{(n)}$, where the $n$-th section is removed in both images. We obtain a new score $S_{B^{(n)}}^{A^{(n)}}$ from these perturbed images.

Figure 2: Proposed flowchart. We extract concepts from the face verification model (using KernelSHAP) and input them into a semantic face perturbation phase. In this phase, the two images are perturbed in the same regions to evaluate similarities and dissimilarities. We propose three algorithms for the perturbations: single removal, greedy removal, and average similarity map.

By masking both images in the same regions, we can assess whether the system perceives semantic areas, such as the eyes, as similar or dissimilar. Let $\Delta S$ be the difference between the original and new scores, as given by Equation 2. If $S_{B^{(n)}}^{A^{(n)}}$ decreases compared to $S_B^A$, the removed parts contribute positively to the similarity ($\Delta S \ge 0$); conversely, an increase indicates a negative contribution ($\Delta S < 0$).

$$\Delta S = S_B^A - S_{B^{(n)}}^{A^{(n)}} \qquad (2)$$

Compared to [16], our objective is to incorporate semantic masking in the perturbation process, increasing interpretability by providing not only a map but also a chart related to the semantic areas.
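The score computation and the perturbation test can be summarised in a few lines. The following is a minimal sketch rather than the paper's implementation: `extract_features` is a placeholder for the face model's embedding function (e.g., a FaceNet forward pass), the section masks are assumed to be binary arrays equal to 1 inside region $n$, and removal is implemented by multiplying with the complement of the mask.

```python
import numpy as np

def cosine_similarity(f_a: np.ndarray, f_b: np.ndarray) -> float:
    """Matching score between two face embeddings (Equation 1)."""
    return float(np.dot(f_a, f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b)))

def delta_s(extract_features, img_a, img_b, region_a, region_b):
    """Score difference Delta S (Equation 2) after removing the same semantic
    section n from both images. `region_a`/`region_b` are binary masks equal
    to 1 inside section n and 0 elsewhere; removal blacks out that section."""
    s_original = cosine_similarity(extract_features(img_a), extract_features(img_b))
    s_perturbed = cosine_similarity(
        extract_features(img_a * (1 - region_a)[..., None]),
        extract_features(img_b * (1 - region_b)[..., None]))
    return s_original - s_perturbed  # > 0: the section contributed positively to similarity
```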
We apply two types of perturbation algorithms inspired by [15], which allow us to study the single or collaborative contributions of face sections and then incorporate this information into an average similarity map. This single/collaborative approach aligns with the notion that humans perceive and interpret faces in a relational/configural way [28] (see Figure 3). First-order features concern individual components that can be processed independently (e.g., eyes, nose). Second-order features involve information acquired when two or more parts are processed simultaneously (e.g., the spacing between the eyes). Furthermore, higher-order features emerge from combinations of multiple first-order and/or second-order features. In our case, the single removal procedure models the information associated with first-order (single) features, while the greedy removal procedure addresses second-order features, wherein multiple parts are processed collectively.

3.1. Semantic Extraction

Figure 3: An interpretation of a relational/configural model of face perception.

Figure 4: In image (a), Mediapipe landmarks are plotted on the sample image. In image (b), the 13 semantic sections defined through the landmarks.

To incorporate semantics, we employ Mediapipe Face Mesh, a versatile open-source framework by Google, widely recognized for its face detection and landmark estimation capabilities. By extracting landmarks from Mediapipe, we defined 13 polygons corresponding to distinct semantic areas of the face (see Figure 4.a). The landmark estimation provided by Mediapipe is limited to specific facial regions, so hair and ears were not included in the facial subdivision. Nevertheless, this decision is consistent with previous research [29], which demonstrated that some areas of the face are more influential than others; for example, removing the ears has less impact on the final score than removing the eye area. Hence, we assumed these areas were not primarily influential and did not include them in our face classes. Additionally, face verification algorithms typically apply a preprocessing step to extract the face area. Therefore, we reduce the area outside the face by applying MTCNN [30], a deep-learning-based face detection algorithm. Overall, our subdivision of the face yields 13 distinct semantic classes, including the background (Figure 4.b).

With this approach, the semantic areas vary in size, so larger masks would have a greater influence on the score than smaller ones. To mitigate this undesired effect, we introduce a weight $w_{A,n}$ associated with section $n \in [1, m]$, with $m = 13$. The weight $w_{A,n}$ is defined as the ratio between the total area of image $A$ ($\mathrm{Area}_A$) and the area of the mask $M^{(A,n)}$, $\mathrm{Area}_{M^{(A,n)}}$, indicating region $n$ (the white pixels in the mask). This weight serves to counterbalance the differences in magnitude. Moreover, despite the precise face positioning achieved by Mediapipe, the masks obtained on images $A$ and $B$ may only partially match, because the depicted faces may not have the same pose and expressiveness. For this reason, we define two weights, $w_{A,n} = W(A, M^{(A,n)}) = \frac{\mathrm{Area}_A}{\mathrm{Area}_{M^{(A,n)}}}$ associated with $A$, and $w_{B,n} = W(B, M^{(B,n)}) = \frac{\mathrm{Area}_B}{\mathrm{Area}_{M^{(B,n)}}}$ associated with $B$.

$$\hat{W}^{(A,B)}_n = \frac{w_{A,n} \cdot w_{B,n}}{\sum_{i=1}^{m} w_{A,i} \cdot w_{B,i}} \qquad (3)$$

$$C_n = \Delta S \cdot \hat{W}^{(A,B)}_n \qquad (4)$$

In this manner, the contribution of the mask, defined as $C_n$, is weighted by $w_{A,n}$ and $w_{B,n}$, the relative weights associated with the masks $M^{(A,n)}$ and $M^{(B,n)}$, respectively.
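A possible implementation of this weighting is sketched below under the same assumptions as before (binary region masks that are 1 inside the section; function names are illustrative and not taken from the paper).

```python
import numpy as np

def area_weight(image: np.ndarray, region: np.ndarray) -> float:
    """w_{.,n} = Area(image) / Area(mask): smaller sections receive larger weights."""
    return image.shape[0] * image.shape[1] / max(float(region.sum()), 1.0)

def weighted_contributions(delta_scores, regions_a, regions_b, img_a, img_b):
    """Contributions C_n of the m semantic sections (Equations 3 and 4).
    `delta_scores[n]` is Delta S after removing section n from both images;
    `regions_a[n]`/`regions_b[n]` are the binary masks of section n."""
    w_a = np.array([area_weight(img_a, r) for r in regions_a])
    w_b = np.array([area_weight(img_b, r) for r in regions_b])
    w_hat = (w_a * w_b) / np.sum(w_a * w_b)   # Equation 3: normalized joint weights
    return np.array(delta_scores) * w_hat     # Equation 4: C_n
```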
3.1.1. Concepts Extraction

Using Mediapipe for face-part extraction provides a human-based semantic segmentation, yet it may not align with how models perceive faces. To bridge this gap, we introduce a concept extraction process for the model, which filters machine-important parts based on human semantics. To evaluate the importance of facial parts, we employ KernelSHAP [31], which combines LIME [17]'s interpretable components with Shapley values [32] from game theory, measuring each feature's contribution to the final result. We extract model importance scores for each of the semantic parts. For final face representations with 512 features, for example, we obtain 512 importance scores per human-semantic part. In face verification, every feature change, negative or positive, is significant for determining the faces' similarity; the emphasis lies on the magnitude of the change rather than on its sign. If one feature of a human-semantic part obtains a negative Shap value, the lack of this part reduced the feature value, and vice versa. Therefore, negative and positive Shap values are equally important in our context. For this reason, we sum the absolute Shap values over all the representation features to obtain a single importance value per part.

Figure 5: Examples of human-semantic part importance scores for two images using KernelSHAP [31]. We analyse two models: CasiaNet [33] in (a) and (c), and VGGFace2 [34] in (b) and (d). Green parts are more important according to Shap scores. Important parts differ between images, especially for VGGFace2, which is why we aggregate ranked importances over 200 images.

Ultimately, we obtain one importance value per semantic part (see Figure 5). However, this remains a local importance, i.e., an importance score according to a single image of the dataset. To make the extracted concepts more global, we need to include information from multiple images. Our solution is to combine the importance levels from a set of images through a ranking combination strategy. Each image yields 13 importance scores (one per human-semantic part) that can be ordered: larger scores sit at the top of the ranking, as they are considered more important for the model. We use 200 images from the CelebA [35] dataset to obtain 200 rankings. From the combination of these rankings, using a vote-based technique based on the BORDA count [36], we obtain a final ranking with the more globally important concepts at the top, as sketched below. The experiments focus on the model's top eight concepts determined by this process.
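The ranking aggregation can be sketched as follows; the function name and array layout are illustrative, and `importances` would hold the summed absolute SHAP values per part for each of the 200 CelebA images.

```python
import numpy as np

def borda_ranking(importances: np.ndarray) -> np.ndarray:
    """Aggregate per-image part importances into a global ranking by Borda count.
    `importances` has shape (n_images, n_parts); entry [i, p] is the summed
    absolute SHAP value of part p for image i. Returns part indices ordered
    from most to least globally important."""
    n_images, n_parts = importances.shape
    points = np.zeros(n_parts)
    for scores in importances:
        order = np.argsort(-scores)              # most important part first
        for rank, part in enumerate(order):
            points[part] += n_parts - 1 - rank   # Borda points for this image's ranking
    return np.argsort(-points)

# Example: global_top8 = borda_ranking(shap_importances)[:8]
```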
3.1.2. Proposed similarity maps

The algorithm used to generate the similarity maps draws inspiration from the work of [15], where six algorithms were presented to create saliency maps. Specifically, we employ the single removal approach (S0) and the greedy removal approach (S1), with the possibility of creating an average map of these two approaches (SAVG). Our approach incorporates significant changes compared to previous research. First and foremost, we utilize semantically meaningful masks to perturb the images, diverging from conventional circular or square masks with a fixed shape. Moreover, since our objective is to generate joint similarity maps for the two images, both images undergo perturbation, contrary to previous approaches that typically perturb only one of the images; this aligns more closely with the strategy proposed by [16].

3.1.3. Single Removal - S0

Figure 6: Samples of single removal, where $S_{B^{(n)}}^{A^{(n)}}$ is the cosine similarity between image $A$ and image $B$ with the $n$-th semantic part removed.

We define the two perturbed images as the pixel-wise multiplication of the images with the corresponding semantic section masks of the same size, with values between 0 and 1:

$$A' = A \cdot M^{(A,n)} \quad \text{and} \quad B' = B \cdot M^{(B,n)} \qquad (5)$$

The single removal operation is computed for all the semantic areas. For each mask, the value of the contribution map $H0$ is initialized with the contribution $C_n$ associated with the mask:

$$H0_{(A,n)} = C_n \cdot M^{(A,n)} \qquad (6)$$

The similarity map $S0_A$ is defined as the sum of the negative and positive contributions, normalized by Equation 7 for all $n \in [1, m]$:

$$H0^{\pm}_{(A,n)} = \begin{cases} \dfrac{H0_{(A,n)}}{\sum_{H0_{(A,m)} \ge 0} |H0_{(A,m)}|} & \text{if } H0_{(A,n)} \ge 0, \\[1ex] \dfrac{H0_{(A,n)}}{\sum_{H0_{(A,m)} < 0} |H0_{(A,m)}|} & \text{otherwise,} \end{cases} \qquad (7)$$

$$S0_A = S0_A + \left(H0^{+}_{(A,n)} + H0^{-}_{(A,n)}\right) \cdot M^{(A,n)}$$

We use the same Equations 6 and 7 to obtain $H0_{(B,n)}$, $H0^{+}_{(B,n)}$, $H0^{-}_{(B,n)}$ and $S0_B$. Negative contributions are rendered as dissimilar areas in the face image, while positive ones are rendered as similar. Algorithm 1 yields the similarity maps $S0_A$ and $S0_B$ as the result of single removal.

3.1.4. Greedy Removal - S1

Figure 7: Greedy removal for image $A$ and image $B$ in $n$ steps ($t = n - 1$), where $S_{B^{(n)}}^{A^{(n)}}$ is the cosine similarity between the two images and $n$ is the best part removed ($BestM_A$ and $BestM_B$) at step $t$.

The greedy algorithm iteratively repeats the single removal procedure. In each iteration, the section of the face with the greatest impact is removed from images $A$ and $B$. In particular, the initial images are $A_0 = A$ and $B_0 = B$, and at each iteration, $A_t$ and $B_t$ are obtained by removing the principal parts of $A_{t-1}$ and $B_{t-1}$, respectively. This means that, at each iteration, the removed mask is defined as the current section mask summed with the previously best removed mask:

$$M^{(A_t)} = M^{(A_t,n)} + BestM_A \quad \text{and} \quad M^{(B_t)} = M^{(B_t,n)} + BestM_B \qquad (8)$$

In greedy removal, the positive and negative contribution maps are calculated through distinct procedures. We again use Equations 6 and 7 to obtain $H1_{(\cdot,n)}$, $H1^{+}_{(\cdot,n)}$, $H1^{-}_{(\cdot,n)}$, $S1_A$ and $S1_B$. To be concise, Algorithm 1 primarily presents the calculation of the negative contribution map H1-: in each iteration, $s_t$ is set to 1, so the removed areas are those exhibiting a negative contribution, as dictated by the condition $s' < s_t$. Conversely, H1+ is computed by setting $s_t$ to 0 at each iteration, with the condition $s' > s_t$. The iteration stops when the maximum number of iterations is reached or when the score difference becomes small enough; in our example, that occurs at $t = 7$, where the score difference is only 0.009. After obtaining H1+ and H1-, the similarity map S1 is obtained following Equation 7.
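To make the two procedures concrete, the sketches below approximate the single and greedy removals. They reuse the `cosine_similarity` and `weighted_contributions` helpers sketched earlier, treat section masks as binary arrays (1 inside the region), and omit some details of Algorithm 1 (for instance, the area weighting of the greedy contributions), so they should be read as simplified illustrations rather than the authors' implementation.

```python
import numpy as np

def single_removal_maps(img_a, img_b, regions_a, regions_b, extract_features):
    """Single removal (S0), simplified sketch of Equations 5-7: remove each semantic
    section from both images, measure the score change, normalize positive and
    negative contributions separately, and paint them back into one map per image."""
    base = cosine_similarity(extract_features(img_a), extract_features(img_b))
    deltas = []
    for r_a, r_b in zip(regions_a, regions_b):
        s = cosine_similarity(
            extract_features(img_a * (1 - r_a)[..., None]),   # section n removed from A
            extract_features(img_b * (1 - r_b)[..., None]))   # section n removed from B
        deltas.append(base - s)                               # Equation 2
    c = weighted_contributions(deltas, regions_a, regions_b, img_a, img_b)  # Equation 4
    pos = max(np.sum(np.abs(c[c >= 0])), 1e-12)
    neg = max(np.sum(np.abs(c[c < 0])), 1e-12)
    s0_a, s0_b = np.zeros(img_a.shape[:2]), np.zeros(img_b.shape[:2])
    for c_n, r_a, r_b in zip(c, regions_a, regions_b):
        value = c_n / (pos if c_n >= 0 else neg)              # Equation 7 normalization
        s0_a[r_a == 1] = value
        s0_b[r_b == 1] = value
    return s0_a, s0_b
```

The greedy counterpart accumulates the most score-reducing sections step by step, mirroring the negative branch (H1-) of Algorithm 1.

```python
def greedy_negative_map(img_a, img_b, regions_a, regions_b, extract_features,
                        t_max=8, theta=0.01):
    """Greedy removal sketch for H1-: at every step, grow the removed region by the
    single section that lowers the similarity score the most, stopping when the
    drop falls below `theta` or after `t_max` steps."""
    h1_a, h1_b = np.zeros(img_a.shape[:2]), np.zeros(img_b.shape[:2])
    best_a, best_b = np.zeros(img_a.shape[:2]), np.zeros(img_b.shape[:2])
    prev_score = cosine_similarity(extract_features(img_a), extract_features(img_b))
    for _ in range(t_max):
        best_score, best_n = 1.0, None
        for n, (r_a, r_b) in enumerate(zip(regions_a, regions_b)):
            trial_a = np.clip(best_a + r_a, 0, 1)             # Equation 8: accumulate masks
            trial_b = np.clip(best_b + r_b, 0, 1)
            s = cosine_similarity(
                extract_features(img_a * (1 - trial_a)[..., None]),
                extract_features(img_b * (1 - trial_b)[..., None]))
            if s < best_score:
                best_score, best_n = s, n
        if best_n is None or prev_score - best_score < theta:
            break                                             # score no longer drops enough
        best_a = np.clip(best_a + regions_a[best_n], 0, 1)
        best_b = np.clip(best_b + regions_b[best_n], 0, 1)
        h1_a[best_a == 1] = prev_score - best_score           # contribution of the accumulated region
        h1_b[best_b == 1] = prev_score - best_score
        prev_score = best_score
    return h1_a, h1_b
```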
3.1.5. Average Similarity map SAVG

Subsections 3.1.3 and 3.1.4 outline the processes for determining the similarity maps. Using the single and greedy removal techniques makes it possible to assess the significance of each facial feature individually or in conjunction with others. Considering this, an average map can provide valuable insight into the significance of each facial feature. Incorporating this information into an average similarity map of S0 (single removal) and S1 (greedy removal), called SAVG, aligns with the notion that humans perceive and interpret faces in a relational/configural manner [28] (see Figure 3).

Algorithm 1: Calculate H0 and H1-
Input: A (face image A), B (face image B), S_B^A (initial score), θ (minimal increment allowed), t_max (maximum number of iterations)
  N, M ← size(A)                                  ▷ height and width of the face images
  A_0 ← A;  B_0 ← B                               ▷ initialization of the images
  H0_A ← zeros(N, M);  H0_B ← zeros(N, M)         ▷ initialization of the single-removal maps
  H1-_A ← zeros(N, M);  H1-_B ← zeros(N, M)       ▷ initialization of the greedy-removal maps
  BestM_A ← zeros(N, M);  BestM_B ← zeros(N, M)   ▷ initialization of the best masks
  t ← 0                                           ▷ initialization of the iteration counter
  s_{t−1} ← S_B^A                                 ▷ initial matching score
  Δs ← 1                                          ▷ initialization of the score difference
  while t < t_max and Δs > θ do
      s_t ← 1
      t ← t + 1
      for n in FaceSections do
          M^(A_t) ← M^(A_t,n) + BestM_A
          M^(B_t) ← M^(B_t,n) + BestM_B
          A′ ← A_{t−1} · M^(A_t)
          B′ ← B_{t−1} · M^(B_t)
          s′ ← S_{B′}^{A′}
          if s′ < s_t then
              s_t ← s′
              BestM_A ← M^(A_t);  BestM_B ← M^(B_t)
              Best w_A ← W(A, BestM_A);  Best w_B ← W(B, BestM_B)
              A_t ← A′;  B_t ← B′
      Δs ← s_{t−1} − s_t
      if t = 1 then                               ▷ first iteration: fill the single-removal maps
          for n in FaceSections do
              C_n ← Δs · Ŵ^(A,B)_n
              H0_A[M^(A,n) = 1] ← C_n
              H0_B[M^(B,n) = 1] ← C_n
      C_Best ← Δs · Ŵ^(A,B)_Best
      H1-_A[BestM_A = 1] ← C_Best
      H1-_B[BestM_B = 1] ← C_Best
  Output: H0_A, H1-_A, H0_B, H1-_B

4. Experimental results

Figure 8: Similarity maps for each proposed algorithm in the case of VGGFace2. S0 is the output of single removal, S1 of greedy removal, and SAVG the average map generated from S0 and S1. The bar chart reports the contribution values (C_n) of each section used in the perturbation.

This section shows the experimental results for a selected number of samples extracted from the CelebA dataset [35] and tested with the FaceNet [37] model trained on Casia-WebFace [33] and VGGFace2 [34]. In Figure 8, we show the output generated by the proposed method. It comprises three maps: the initial single removal map S0, the greedy removal map S1, and the final average map SAVG. The visualization uses orange to represent semantic areas that are similar, while purple indicates differences in facial features. After the concept extraction, a group of semantic areas is selected based on their importance. Table 1 displays the n selected semantic areas ranked by their importance; in our study n = 8, but this number can be changed as needed.
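The paper does not name the implementation of its feature extractor; one common option compatible with this setup is the facenet-pytorch package, which provides an MTCNN detector and InceptionResnetV1 backbones pretrained on both CASIA-WebFace and VGGFace2, producing the 512-dimensional embeddings mentioned in Section 3.1.1. The snippet below is an assumed usage sketch, not the authors' code.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

# Face detection and cropping (the preprocessing step mentioned in Section 3.1).
mtcnn = MTCNN(image_size=160, margin=0)
# Feature extractor: 'vggface2' or 'casia-webface' pretrained weights.
resnet = InceptionResnetV1(pretrained='vggface2').eval()

def extract_features(path: str) -> torch.Tensor:
    """Crop the face and return its 512-dimensional embedding."""
    face = mtcnn(Image.open(path).convert('RGB'))   # 3x160x160 tensor (None if no face found)
    with torch.no_grad():
        return resnet(face.unsqueeze(0)).squeeze(0)

f_a, f_b = extract_features('face_a.jpg'), extract_features('face_b.jpg')
score = torch.nn.functional.cosine_similarity(f_a, f_b, dim=0).item()  # Equation 1
```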
Table 1: Concept extraction output with each model's top n semantic areas (n = 8). Area names are abbreviated with the initial or the first two letters (e.g., E = eye); R and L denote the right and left sides.

VGGFace2: B, CHER, MOL, ER, MOR, NR, FR, EL
CasiaNet: B, ER, MR, ML, EL, CHER, FR, CHEL

Figure 8 also features a table showing the sections of the face categorized as similar (orange) and dissimilar (purple), along with their respective contribution values to the final similarity map. We focus on analyzing the mean map SAVG, which uses the same color scale. Regarding the type of masking applied during the perturbation, we investigated how it impacts the algorithm's output. In Figure 9, we present two distinct case studies for both models. The examined masking types are black masking, random-noise masking, and white masking. In general, the output shows minimal sensitivity to the type of masking, especially between black and white masking. The most notable deviation is associated with random-noise masking, although this divergence remains relatively modest. The maps reported in this study exclusively employ black masking.

Figure 9: The SAVG map for pair examples from the CelebA dataset, generated using different patch colorings for the VGGFace2 and CasiaNet models. Orange hues denote similar facial regions, while purple highlights dissimilar ones.

Figure 10 presents several instances of the algorithm's output for both tested models. Specifically, sections (a) and (c) show comparisons between samples of the same individual, while sections (b) and (d) involve comparisons with impostors. Even when comparing faces of the same individual, certain areas are assessed as dissimilar; conversely, when confronting impostors, not all areas are consistently regarded as dissimilar. The final score can offer additional insights by contextualizing which facial regions could be modified to influence the outcome.

Figure 10: The SAVG map for pair examples. In sections (a) and (c), the similarity maps are generated for genuine cases, while in sections (b) and (d) for impostor ones. The examples are generated with VGGFace2 (a, d) and CasiaNet (b, c).

4.1. Experiments with Cut-and-Paste Patches

To validate this outcome, we conducted a "Cut-and-Paste Patches" test, previously introduced by [16]. This experiment assesses whether replacing specific facial regions in one image with the corresponding region from the other is effectively detected by our algorithm and described as highly similar in the similarity maps. We present the results in Figure 11. Specifically, in column (a), we display the average similarity map of the two original images. In column (b), one of the two images has been altered with a patch from the other (highlighted in green-yellow). Finally, in column (c), we present the resulting output. Overall, we observe that regions previously deemed dissimilar are now perceived as similar in the modified area, accompanied by an increase in the final score. Additionally, we notice instances where semantic areas change in their contribution despite not being included in the modification patch. A possible explanation is that the patches do not match the exact dimensions of the semantic areas, and in some cases a rectangular patch, mainly centered on one point, may intersect multiple semantic areas that are then affected.

Figure 11: Cut-and-Paste Patches test inspired by [38] for two sample outputs. In (a) the original SAVG maps, in (b) the pairs after applying the Cut-and-Paste Patches test, and in (c) the new average similarity maps.
This observation underscores the sensitivity of the proposed method, particularly the segmentation carried out by Mediapipe, to facial regions. It is also noteworthy that certain areas change in color even when they have not been directly modified: for instance, the right eye in Case I, the left cheek in Case II, the left eye in Case III, and in Case IV the patch is not entirely recognized as similar. This discrepancy can be attributed to the fact that, while the test follows a part-based approach, network models tend to perceive faces holistically, implying that altering a specific patch may change the perception of the entire face and not just the modified area. This explanation aligns with the study of Jacob et al. [38], which demonstrated, using the Thatcher effect [39], that models trained on various datasets internalize a holistic perception of faces. Moreover, it is essential to note that the maps under consideration focus solely on the most influential areas, even though their influence on the final score is limited.

4.2. Method limitations

While Mediapipe offers valuable tools for semantically segmenting facial features, it displays a notable sensitivity to variations in facial orientation. Substantial deviations in facial pose result in increasingly dissimilar masks, leading to proportionally divergent contributions. When the masks exhibit high similarity, the simultaneous occlusion method gains coherence, as it conceals identical portions of the two images. Another limitation arises when comparing a profile face with a frontal one. In such instances, Mediapipe can still identify facial features; however, applying occlusion to both images loses its contextual relevance, rendering the affected areas visually less comprehensible. Consequently, the method is best suited to front-facing subjects with poses as closely aligned as possible.

5. Conclusion and Future directions

In this paper, we have initiated an effort to bridge the gap between computer and human vision, with the primary goal of improving the interpretability of face verification algorithms. We sought to gain insight into how machines perceive the semantic aspects of human faces during verification, ultimately aligning the system's output score more closely with human reasoning. To achieve this, we employed the Mediapipe tool to identify distinct semantic regions of the human face. These regions, representing human conceptual knowledge, provided a comprehensive view of the concepts that are critical for our models. Leveraging this knowledge, we selected a subset of the most significant semantic areas for the models. We also introduced a perturbation algorithm that generates similarity maps, revealing how the examined models perceive these concepts as either similar or dissimilar. By contextualizing the system's output score, we can align it more closely with human reasoning. However, our work is currently limited to experimentation. Future research directions include exploring different segmentation methods, conducting experiments across diverse models, and comparing various methods to ours or adapting them to our approach. Additionally, a user evaluation component could further validate and enhance the effectiveness of our work.

References

[1] G. Alfarsi, J. Jabbar, R. M. Tawafak, A. Alsidiri, M.
Alsinani, Techniques for face verification: Literature review, in: 2019 International Arab Conference on Information Technology (ACIT), IEEE, 2019, pp. 107–112.
[2] J. Lynch, Face off: Law enforcement use of face recognition technology, Available at SSRN 3909038 (2020).
[3] J. S. del Rio, D. Moctezuma, C. Conde, I. M. de Diego, E. Cabello, Automated border control e-gates and facial recognition systems, Computers & Security 62 (2016) 49–72.
[4] D. J. Robertson, R. S. Kramer, A. M. Burton, Face averages enhance user recognition for smartphone security, PLoS ONE 10 (2015) e0119460.
[5] D. Zhang, N. Maslej, E. Brynjolfsson, J. Etchemendy, T. Lyons, J. Manyika, H. Ngo, J. C. Niebles, M. Sellitto, E. Sakhaee, et al., The AI Index 2022 annual report, AI Index Steering Committee, Stanford Institute for Human-Centered AI, Stanford University (2022) 123.
[6] A. Olteanu, J. Garcia-Gathright, M. de Rijke, M. D. Ekstrand, A. Roegiest, A. Lipani, A. Beutel, A. Olteanu, A. Lucic, A.-A. Stoica, et al., FACTS-IR: fairness, accountability, confidentiality, transparency, and safety in information retrieval, in: ACM SIGIR Forum, volume 53, ACM New York, NY, USA, 2021, pp. 20–43.
[7] D. Castelvecchi, Can we open the black box of AI?, Nature News 538 (2016) 20.
[8] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (2018) 1–42.
[9] S. S. Kim, E. A. Watkins, O. Russakovsky, R. Fong, A. Monroy-Hernández, "Help me help the AI": Understanding how explainability can support human-AI interaction, in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023, pp. 1–17.
[10] P. Bommer, M. Kretschmer, A. Hedström, D. Bareeva, M. M. Höhne, Finding the right XAI method - A guide for the evaluation and ranking of explainable AI methods in climate science, CoRR abs/2303.00652 (2023).
[11] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[12] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[13] V. Petsiuk, A. Das, K. Saenko, RISE: Randomized input sampling for explanation of black-box models, arXiv preprint arXiv:1806.07421 (2018).
[14] B. Yin, L. Tran, H. Li, X. Shen, X. Liu, Towards interpretable face recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9348–9357.
[15] D. Mery, B. Morris, On black-box explanation for face verification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3418–3427.
[16] M. Knoche, T. Teepe, S. Hörmann, G. Rigoll, Explainable model-agnostic similarity and confidence in face verification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 711–718.
[17] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[18] M. Ribera, À. Lapedriza, Can we do better explanations? A proposal of user-centered explainable AI, in: IUI Workshops, 2019. URL: https://api.semanticscholar.org/CorpusID:84832474.
[19] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability beyond feature attribution: Quantitative testing with Concept Activation Vectors (TCAV), in: 35th International Conference on Machine Learning (ICML), 2018, pp. 2668–2677.
[20] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, S.-C. Zhu, Interpreting CNN knowledge via an explanatory graph, in: 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 4454–4463.
[21] A. Ghorbani, J. Wexler, J. Y. Zou, B. Kim, Towards automatic concept-based explanations, in: Advances in Neural Information Processing Systems, volume 32, 2019.
[22] R. Tan, L. Gao, N. Khan, L. Guan, Interpretable artificial intelligence through locality guided neural networks, Neural Networks 155 (2022) 58–73.
[23] W. Zhang, B. Y. Lim, Towards relatable explainable AI with the perceptual process, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022.
[24] E. C. Carterette, M. P. Friedman (Eds.), Perceptual processing, Handbook of Perception, vol. 9, Academic Press, New York, 1978.
[25] M. L. Matthews, Discrimination of identikit constructions of faces: Evidence for a dual processing strategy, Perception & Psychophysics 23 (1978) 153–161.
[26] G. Davies, H. Ellis, J. Shepherd, Cue saliency in faces as assessed by the 'photofit' technique, Perception 6 (1977) 263–269.
[27] A. Iskra, H. Gabrijelčič Tomc, Eye-tracking analysis of face observing and face recognition, Journal of Graphic Engineering and Design 7 (2016) 5–11.
[28] G. Rhodes, Configural coding, expertise, and the right hemisphere advantage for face recognition, Brain and Cognition 22 (1993) 19–41.
[29] P. Karczmarek, W. Pedrycz, A. Kiersztyn, P. Rutka, A study in facial features saliency in face recognition: An analytic hierarchy process approach, Soft Computing 21 (2017) 7503–7517. doi:10.1007/s00500-016-2305-9.
[30] X. Li, Z. Yang, H. Wu, Face detection based on receptive field enhanced multi-task cascaded convolutional neural networks, IEEE Access 8 (2020) 174922–174930.
[31] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 4768–4777.
[32] J. Castro, D. Gómez, J. Tejada, Polynomial calculation of the Shapley value based on sampling, Computers & Operations Research 36 (2009) 1726–1730.
[33] D. Yi, Z. Lei, S. Liao, S. Z. Li, Learning face representation from scratch, arXiv preprint (2014).
[34] F. V. Massoli, G. Amato, F. Falchi, Cross-resolution learning for face recognition, Image and Vision Computing 99 (2020) 103927.
[35] H. Zhang, W. Chen, J. Tian, Y. Wang, Y. Jin, Show, attend and translate: Unpaired multi-domain image-to-image translation with visual attention (2018) 1–11.
[36] J. de Borda, Mémoire sur les élections au scrutin, Histoire de L'Académie Royale des Sciences 102 (1781) 657–665.
[37] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[38] G. Jacob, P. Rt, H. Katti, S. Arun, Qualitative similarities and differences in visual object representations between brains and deep networks, Nature Communications 12 (2021). doi:10.1038/s41467-021-22078-3.
[39] P. Thompson, Margaret Thatcher: A new illusion, Perception 9 (1980) 483–484.