<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging Human Concepts and Computer Vision for Explainable Face Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miriam Doh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caroline Mazini Rodrigues</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Boutry</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Najman</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matei Mancas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hugues Bersini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIDIA lab - Université Libre de Bruxelles (ULB)</institution>
          ,
          <addr-line>Av. Adolphe Buyl 87,1060, Ixelles</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ISIA lab - Université de Mons (UMONS)</institution>
          ,
          <addr-line>Bd Dolez 31, 7000, Mons</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LIGM - Laboratoire d'Informatique Gaspard-Monge, Université Gustave-Eiffel</institution>
          ,
          <addr-line>77454, Marne-la-Vallée</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>LRE - Laboratoire de Recherche de l'EPITA</institution>
          ,
          <addr-line>14-16 Rue Voltaire, 94270, Kremlin-Bicêtre</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <issue>2010235</issue>
      <abstract>
        <p>With Artificial Intelligence (AI) influencing the decision-making process of sensitive applications such as face verification, it is fundamental to ensure the transparency, fairness, and accountability of decisions. Although Explainable Artificial Intelligence (XAI) techniques exist to clarify AI decisions, it is equally important to make these decisions interpretable to humans. In this paper, we present an approach that combines computer and human vision to increase the interpretability of the explanations of a face verification algorithm. In particular, we take inspiration from the human perceptual process to understand how machines perceive the human-semantic areas of the face during face comparison tasks. We use Mediapipe, which provides a segmentation technique that identifies distinct human-semantic facial regions, enabling the analysis of the machine’s perception. Additionally, we adapt two model-agnostic algorithms to provide human-interpretable insights into the decision-making process.</p>
      </abstract>
      <kwd-group>
        <kwd>Face verification</kwd>
        <kwd>Explainable AI (XAI)</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Face verification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims to confirm an individual’s identity based on facial features, with
applications in law enforcement [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], border control [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or smartphone security [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As AI becomes
prevalent in decision-making [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], ensuring model fairness, accountability, confidentiality, and
transparency is crucial [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, complex ML models are often seen as ’black boxes’ [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Explainable AI (XAI) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] addresses this challenge by enhancing AI interpretability to make AI
systems transparent and understandable to humans, thereby increasing trust in their decisions.
      </p>
      <p>
        Saliency maps have become the most popular XAI solution in computer vision, offering
insights into the critical features considered in the decision-making process. However, in face
verification, decisions often rely on adjustable thresholds based on the specific application rather
than understandable semantic classes. This raises questions about the adequacy of identifying
the most important features in an image as the only explanation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Taking inspiration from
the human perceptual process, we propose a model-agnostic approach capable of determining
how the machine perceives similar semantic areas of the face when comparing two faces. Our
primary objective is to translate the XAI solution into human decision-making meaningfully.
However, incorporating human-based semantics in the models’ explanation process can also
introduce human bias into these same explanations. To increase human interpretability, we must
also ensure the faithfulness of explanations to the model’s reasoning. Faithfulness refers to
whether a feature considered important for the model actually changes the model’s decision [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>For face verification, the model extracts features for each face that will be compared.
Modifications in the features will also impact how similar the two faces are. Therefore, it is essential
to understand how face parts, such as an eye, would impact the final features.</p>
      <p>To translate the model’s knowledge to human knowledge as smoothly as possible, we first
perform the segmentation of face parts based on human semantics. By considering the impact
of those face parts on a set of face images, we can have a global view of the model’s knowledge.
Following the features’ extraction (through the model), we verify if two people are the same by
comparing their facial features. To understand the contribution of the chosen concepts to the
relation between two compared faces, we introduce an algorithm grounded in the perturbation
of facial regions linked to the extracted concepts, mirroring the human perceptual process of
face recognition. It encompasses evaluating corresponding semantic areas along a spectrum of
similarities, providing interpretation and contextualisation.</p>
      <p>We structured this paper as follows: in Section 2, we present state-of-the-art methods for
explaining the face verification task; in Section 3, we describe our framework, including the
model concept’s extraction and the perturbation methods for face comparison; in Section 4 we
include the experimental results and limitations; in Section 5 we conclude the work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Saliency maps, such as CAMs [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] and RISE [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], are crucial for interpreting deep-learning
models, revealing their inner workings. However, their primary development has centered on
object recognition, leaving the field of face analysis relatively unexplored.
      </p>
      <p>
        Despite its critical applications, research in face analysis has been limited. Works by
[
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ] mainly focus on individual pixel or low-level feature significance, which can be
challenging for human analysts and may not align with intuition. Conversely, LIME [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] employs
superpixels within the image, providing a user-friendly, concept-driven explanation. However,
this technique relies on a new model approximating the original, potentially obscuring the
actual reasons for the original model’s behavior [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Alternative approaches, such as TCAV [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and knowledge graphs [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], instead of prioritizing low-level
importance from pixels, aim to represent the model’s knowledge through concepts. TCAV
employs semantic concepts defined by users or discovered through image segment activations
(with method ACE [21]), while knowledge graphs identify repeating patterns across network
layers. Additionally, Tan et al. [22] introduced the Locality Guided Neural Network (LGNN),
designed to induce filter topology that enhances the visualization of concepts.
      </p>
      <p>Inspired by these methods, our approach combines human and model perspectives to identify
essential concepts for face verification. We acknowledge that relying solely on human concepts
can introduce bias while relying solely on the model can complicate interpretation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>To help humans understand how AI systems make decisions, it is essential to present the
information in a way that aligns with human cognitive processes. Cognitive psychology provides
valuable insights into how people perceive and process information when identifying faces.
Taking inspiration from the flowchart proposed by [23], we aim to apply a similar method to
face verification (see Figure 1). The human perceptual process consists of three key phases:
selection, organization, and interpretation [24]. Cognitive psychology has shown that when
recognizing faces, our attention is particularly drawn to particular facial areas, such as the
eyes and nose [25, 26, 27]. Subsequently, in the perceptual process, these facial stimuli are
organized into meaningful concepts, adding semantics to the process. Our brains compare these
higher-level concepts to assess the similarity between items, facilitating face categorization.
This comparative analysis may involve matching a face to a remembered image or with another
face in front of us. In this context, we question the adequacy of salience maps used in computer
vision as an explanation and their alignment with our human reasoning processes.</p>
      <p>Based on cognitive psychology, we have developed a general flowchart shown in Figure 2.
Generally, face verification systems rely primarily on a matching score between two face images
A and B. This score, S_c, is computed using cosine similarity, which compares the feature vectors
f_A and f_B extracted from each image as follows:</p>
      <p>S_c = (f_A ⋅ f_B) / (‖f_A‖ ‖f_B‖) (1)</p>
      <p>
        The resulting score ranges from 0 to 1, with a score of 1 indicating identical images (A = B).
As our approach is model-agnostic, we aim to explain the algorithm by perturbing the inputs to
study the system’s decision behaviour in terms of the input-output relationship. Inspired by the
work of [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our desired output is a similarity map indicating which face areas are considered
similar or dissimilar for both images, using an AI model as a feature extractor. To achieve this,
we perform semantic perturbation on images A and B, resulting in new images denoted as p_k(A)
and p_k(B), where the k-th section is removed in both images. From these perturbed images we obtain
a new score, S_c(p(k)). Considering Δ_k the difference between the original
and new scores, as expressed in Equation 2: if S_c(p(k)) decreases compared to S_c, it suggests
that the removed parts positively contribute to the similarity (Δ_k ≥ 0). Conversely, an increase
indicates a negative contribution (Δ_k &lt; 0).
      </p>
      <p>Δ_k = S_c − S_c(p(k)) (2)</p>
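      <p>A minimal sketch of this perturbation step, assuming a generic embed(image) feature extractor and binary section masks that are 1 inside semantic section k and 0 elsewhere; the names are illustrative rather than the authors' released code.</p>
      <preformat>import numpy as np

def delta_for_section(embed, img_a, img_b, mask_a, mask_b, s_c):
    """Equation 2: change in the matching score when section k is removed from both images.

    embed : callable mapping an image array to a 1-D feature vector (illustrative name)
    mask_*: binary arrays, 1 inside semantic section k and 0 elsewhere
    s_c   : original matching score between img_a and img_b
    """
    pert_a = img_a * (1.0 - mask_a)[..., None]   # black out the section in image A
    pert_b = img_b * (1.0 - mask_b)[..., None]   # black out the section in image B
    f_a, f_b = embed(pert_a), embed(pert_b)
    s_c_pert = np.dot(f_a, f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b))
    return s_c - s_c_pert   # positive delta: the removed section supported the similarity</preformat>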
      <p>
        Compared to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our objective is to incorporate semantic masking in the perturbation
process to increase interpretability by providing not only a map but also a chart related to
the semantic areas. We apply two types of perturbation algorithms inspired by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], allowing
us to study the face section area’s single or collaborative contributions and then incorporate
this information within an average similarity map. This single/collaborative approach aligns
with the notion that humans perceive and interpret faces in a relational/configurational way
[28] (see Figure 3). First-order features concern individual components that can be processed
independently (e.g., eyes, nose). Second-order features involve information acquired when
simultaneously processing two or more parts together (e.g., spacing between eyes). Furthermore,
higher-order features emerge from combinations of multiple first-order and/or second-order
features. In our case, the single removal procedure models the information associated with
first-order or single features, and the greedy removal procedure addresses the second-order
features, wherein multiple parts are processed collectively.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Semantic Extraction</title>
        <p>To incorporate semantics, we employ Mediapipe Face masks, a versatile open-source framework
by Google, widely recognized for its face detection and landmark estimation capabilities. By
extracting landmarks from Mediapipe, we defined 13 polygons corresponding to distinct semantic
areas of the face (see Figure 4.a). The landmark estimation provided by Mediapipe is limited
to specific facial regions, and hair or ears were not included in the earlier facial subdivisions.
Nevertheless, this decision is consistent with previous research [29], which demonstrated that
some areas of the face are more influential than others. For example, removing the ears has
less impact on the final score than the eye area. Hence, we assumed these areas were not
primarily influential and did not include them in our face classes. Additionally, face verification
algorithms typically apply a preprocessing step for extracting the face area. Therefore, we
reduce the area outside the face by applying MTCNN [30], a deep learning-based face detection
algorithm. Overall, our subdivision of the face detected 13 distinct semantic classes, including
the background (Figure 4.b). With this approach, the semantic areas vary in size, resulting in
larger masks having a more significant influence on the score than smaller ones. To mitigate
this undesired effect, we introduce a weight w_k associated with each section k ∈ [1, K],
with K = 13. For an image I, the weight w_{I,k} is defined as the ratio between the total area of the
image, Area_I, and the area of the mask M(I,k), Area_{M(I,k)}, whose white pixels indicate
region k. This weight serves to counterbalance the differences in magnitude between sections.
Moreover, due to the precise face positioning achieved by Mediapipe, the masks obtained on
images A and B may only partially match. This discrepancy arises because the depicted faces may
not have the same position and expressiveness. For this reason, we define two weights,</p>
        <p>w_{A,k} = W(A, M(A,k)) = Area_A / Area_{M(A,k)} and w_{B,k} = W(B, M(B,k)) = Area_B / Area_{M(B,k)},</p>
        <p>representing the relative weights associated with the M(A,k) and M(B,k) masks, respectively.
The normalized weight of section k is then</p>
        <p>ŵ_{(A,B),k} = (w_{A,k} ⋅ w_{B,k}) / Σ_{j=1..K} (w_{A,j} ⋅ w_{B,j}) (3)</p>
        <p>In this manner, the contribution of the mask, defined as C_k, is weighted by w_{A,k} and w_{B,k}:</p>
        <p>C_k = Δ_k ⋅ ŵ_{(A,B),k} (4)</p>
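        <p>To make the weighting concrete, the sketch below builds a binary mask for one semantic polygon from Mediapipe Face Mesh landmarks and derives the area-ratio weight and contribution of Equations 3 and 4; the landmark index list and helper names are placeholders, not the exact 13-region definition used in the paper.</p>
        <preformat>import numpy as np
import cv2
import mediapipe as mp

def region_mask(image_bgr, landmark_ids):
    """Binary mask (1 inside the polygon) for one semantic face region."""
    h, w = image_bgr.shape[:2]
    mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True)
    result = mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    lms = result.multi_face_landmarks[0].landmark          # assumes one detected face
    pts = np.array([[int(lms[i].x * w), int(lms[i].y * h)] for i in landmark_ids],
                   dtype=np.int32)
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillPoly(mask, [pts], 1)
    return mask

def area_weight(mask):
    """w_{I,k} = Area_I / Area_{M(I,k)}: rescales small regions so they are not drowned out."""
    return mask.size / max(int(mask.sum()), 1)

def contribution(delta_k, w_a_k, w_b_k, all_w_a, all_w_b):
    """Equations 3-4: normalized weight of section k times the score difference Delta_k."""
    norm = sum(wa * wb for wa, wb in zip(all_w_a, all_w_b))
    return delta_k * (w_a_k * w_b_k) / norm</preformat>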
        <sec id="sec-3-1-1">
          <title>3.1.1. Concepts Extraction</title>
          <p>
            Using Mediapipe for face part extraction provides a human-based semantic segmentation, yet
it may not align with how models perceive faces. To bridge this gap we introduce a model’s
concept extraction process. This involves filtering machine-important parts based on human
semantics. For evaluating the importance of facial parts, we employ KernelSHAP [31], which
combines LIME [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]’s interpretable components with Shapley values [32] from game theory,
which estimate each feature’s contribution to the final result. We extract model importance scores
for each of the semantic parts. For final face representations with 512 features, for example, we
obtain 512 importance scores per human-semantic part. In face verification, every
feature change, negative or positive, matters when determining the faces’ similarity, with emphasis
on the magnitude of the change rather than on its sign. If one feature of a human-semantic
part obtains a negative SHAP value, the lack of this part reduced the feature value, and vice
versa. Therefore, negative and positive SHAP values are equally important in our context. For
this reason, we sum the absolute SHAP values throughout all the representation features to obtain a
single importance value per part.
          </p>
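          <p>A compact sketch of this step with the shap library, assuming an illustrative wrapper that blacks out the parts excluded by a binary coalition vector and returns the model's 512-dimensional features; depending on the shap version, shap_values may return a list of per-dimension arrays (assumed here) or a stacked array.</p>
          <preformat>import numpy as np
import shap

N_PARTS = 13  # semantic face regions from Mediapipe

def coalition_features(z, image, masks, embed):
    """Map binary coalition vectors (which parts are kept) to model embeddings.

    z: (n, N_PARTS) array of 0/1 flags; masks[k] is 1 inside part k.
    The wrapper and its argument names are illustrative, not the authors' code.
    """
    outputs = []
    for row in z:
        perturbed = image.copy()
        for k, keep in enumerate(row):
            if keep == 0:
                perturbed[masks[k] == 1] = 0   # black out the removed part
        outputs.append(embed(perturbed))
    return np.vstack(outputs)

def part_importance(image, masks, embed, nsamples=500):
    f = lambda z: coalition_features(z, image, masks, embed)
    explainer = shap.KernelExplainer(f, np.zeros((1, N_PARTS)))   # baseline: all parts removed
    sv = explainer.shap_values(np.ones((1, N_PARTS)), nsamples=nsamples)
    # sv: one array per embedding dimension; sum |SHAP| over all 512 dimensions.
    return np.sum([np.abs(s).reshape(N_PARTS) for s in sv], axis=0)</preformat>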
          <p>(Figure: panels (a)-(d).) Concept importance rankings vary across images, especially for VGGfaces2; that is why we aggregate ranked importance over 200 images.</p>
          <p>Ultimately, we will have one importance value per semantic part (see Figure 5). However,
this remains a local importance, i.e., an importance score computed from a single image.
To make the extracted concepts more global, we need to include information from multiple
images. Our solution is to combine the importance levels from a set of images through a ranking
combination strategy. Each image yields 13 importance scores (one per human-semantic
part) that can be ordered: more significant scores sit at the top of the ranking, as they were
considered more important by the model. We use 200 images from the CelebA [35] dataset to obtain
200 rankings. We then combine these rankings with a vote-based technique, the Borda
count [36], and obtain a final ranking with the most globally important concepts at the top.</p>
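          <p>The ranking aggregation itself needs only a few lines; the sketch below applies a standard Borda count to per-image importance rankings (the part names in the toy example are placeholders).</p>
          <preformat>from collections import defaultdict

def borda_combine(rankings):
    """Combine per-image rankings of semantic parts with a Borda count.

    rankings: list of lists, each ordering the parts from most to least important.
    Returns the parts sorted by total Borda score (most globally important first).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, part in enumerate(ranking):
            scores[part] += n - position   # top of the ranking earns the most points
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with three images and four parts.
print(borda_combine([["eyes", "nose", "mouth", "chin"],
                     ["nose", "eyes", "chin", "mouth"],
                     ["eyes", "mouth", "nose", "chin"]]))</preformat>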
          <p>The experiments will focus on the model’s top eight concepts determined by this process.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Proposed similarity maps</title>
          <p>
            The algorithm used to generate the similarity maps draws inspiration from the work of [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ],
where six algorithms were presented to create saliency maps. Specifically, we will employ the
single removal approach (S0) and the greedy removal approach (S1), with the possibility of
creating an average map of these two approaches (SAVG). Our approach incorporates significant
changes compared to previous research. First and foremost, we utilize semantically meaningful
masks to perturb the images, diverging from conventional circular or square masks with a fixed
shape. Moreover, since our objective is to generate a common similarity map for the
two images, both images undergo perturbation, contrary to previous approaches that typically
perturb only one of the images, thus aligning more closely with the strategy proposed by [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Single Removal - S0</title>
          <p>We define the two perturbed images as the pixel-wise multiplication of the images and the
corresponding semantic section mask of the same size, with values between 0 and 1:</p>
          <p>A′_k = A ⋅ M(A,k) and B′_k = B ⋅ M(B,k) (5)</p>
          <p>The single removal operation is computed for all the semantic areas. For each mask, the value
of the contribution map H0 is initialized with the contribution C_k associated with the mask:</p>
          <p>H0_(A,k) = C_k ⋅ M(A,k) (6)</p>
          <p>The similarity map is defined as the sum of the negative and the positive contributions,
normalized by Equation 7 for all k ∈ [1, K], to obtain S0_A:</p>
          <p>H0±_(A,k) = H0_(A,k) / Σ_{j : H0_(A,j) ≥ 0} |H0_(A,j)| if H0_(A,k) ≥ 0, and H0±_(A,k) = H0_(A,k) / Σ_{j : H0_(A,j) &lt; 0} |H0_(A,j)| otherwise; S0_A = S0_A + (H0+_(A,k) + H0−_(A,k)) ⋅ M(A,k) (7)</p>
          <p>We use the same Equations 6 and 7 to obtain H0_(B,k), H0+_(B,k), H0−_(B,k)
and S0_B. This means negative contributions are seen as dissimilar areas in the face image, while positive ones are
similar. Algorithm 1 gives us the similarity maps S0_A and S0_B as a result of single removal.</p>
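          <p>Under the same illustrative conventions (per-section contributions C_k and binary masks), a minimal sketch of the single-removal map S0, mirroring the normalization of Equation 7:</p>
          <preformat>import numpy as np

def single_removal_map(contributions, masks, shape):
    """Build the S0 similarity map from per-section contributions (Equations 6-7).

    contributions: list of C_k values (one per semantic section)
    masks        : list of binary masks of the given shape, 1 inside section k
    """
    pos_total = sum(abs(c) for c in contributions if c &gt;= 0) or 1.0
    neg_total = sum(abs(c) for c in contributions if c &lt; 0) or 1.0
    s0 = np.zeros(shape)
    for c, m in zip(contributions, masks):
        norm = pos_total if c &gt;= 0 else neg_total   # Equation 7 normalization
        s0 += (c / norm) * m.astype(float)
    return s0</preformat>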
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Greedy Removal-S1</title>
          <p>In the greedy removal approach, parts are removed iteratively from images A and B: at each
iteration t, S_c^t denotes the similarity between the two images and BestM is the best part removed so far.
In particular, the initial images are represented as A_0 = A and B_0 = B,
and at each iteration, A_t and B_t are obtained by removing the principal parts of A_{t−1} and B_{t−1},
respectively. This means that, at each iteration, the mask removed is defined as the current
section mask summed with the previously removed best mask:</p>
          <p>M(A_t, k) = M(A, k) + BestM_A and M(B_t, k) = M(B, k) + BestM_B (8)</p>
          <p>In greedy removal, calculating the positive and negative contribution maps follows distinct
procedures. We also use Equations 6 and 7 to obtain H1_(⋅,k), H1+_(⋅,k), H1−_(⋅,k), S1_A and S1_B. To be
more concise, Algorithm 1 primarily presents the calculation of the negative
contribution map H1−: in each iteration, a selection flag is set to 1, so the
removed areas correspond to those exhibiting a negative contribution, as the condition S_c′ &lt; S_c
dictates. Conversely, H1+ is computed by setting this flag to 0 at each iteration, with the condition
S_c′ &gt; S_c. The iteration stops when the maximum number of iterations is reached
or when the score difference becomes sufficiently small. In our example, that occurs at t = 7,
where the score difference is only 0.009. After obtaining H1+ and H1−, the similarity map S1 is
obtained following Equation 7.</p>
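          <p>A condensed sketch of the greedy loop for the negative map H1−, under the same illustrative conventions; it accumulates, at each iteration, the section pair whose removal lowers the score the most, and is a simplification rather than the exact Algorithm 1.</p>
          <preformat>import numpy as np

def greedy_negative_map(image_a, image_b, masks_a, masks_b, embed, score,
                        max_iter=13, eps=1e-3):
    """Greedy removal (negative contributions): keep removing the section pair
    that decreases the matching score the most."""
    def sim(a, b):
        fa, fb = embed(a), embed(b)
        return float(np.dot(fa, fb) / (np.linalg.norm(fa) * np.linalg.norm(fb)))

    best_a = np.zeros_like(masks_a[0], dtype=float)
    best_b = np.zeros_like(masks_b[0], dtype=float)
    h1_neg = np.zeros(masks_a[0].shape)
    prev, delta, t = score, 1.0, 0
    while t &lt; max_iter and delta &gt; eps:
        t += 1
        current, cand_a, cand_b = prev, best_a, best_b
        for ma, mb in zip(masks_a, masks_b):
            trial_a = np.clip(ma + best_a, 0, 1)          # accumulate removed area (Eq. 8)
            trial_b = np.clip(mb + best_b, 0, 1)
            s = sim(image_a * (1.0 - trial_a)[..., None],
                    image_b * (1.0 - trial_b)[..., None])
            if s &lt; current:                               # keep the removal with the biggest drop
                current, cand_a, cand_b = s, trial_a, trial_b
        delta = prev - current
        if delta &gt; 0:
            best_a, best_b = cand_a, cand_b
            h1_neg[best_a == 1] = -delta                  # negative contribution of the removed area
        prev = current
    return h1_neg</preformat>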
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.5. Average Similarity map SAVG</title>
        <p>In subsections 3.1.3 and 3.1.4, the processes for determining the similarity maps are outlined. Using the
single and greedy removal techniques makes it possible to assess the significance of each facial
feature individually or in conjunction with others; the overall procedure is summarized in Algorithm 1.</p>
        <preformat>Algorithm 1: Computation of the similarity maps by single and greedy removal
Input:
  A     - face image A
  B     - face image B
  S_c   - initial matching score
  eps   - minimal increment allowed
  T     - maximum number of iterations
N, M ← size(A)                         ▷ height and width of the face images
A_0 ← A                                ▷ initialization of the image
B_0 ← B                                ▷ initialization of the image
H0_A ← zeros(N, M)                     ▷ initialization of the maps
H0_B ← zeros(N, M)                     ▷ initialization of the maps
H1−_A ← zeros(N, M)                    ▷ initialization of the maps
H1−_B ← zeros(N, M)                    ▷ initialization of the maps
BestM_A ← zeros(N, M)                  ▷ initialization of mask A
BestM_B ← zeros(N, M)                  ▷ initialization of mask B
t ← 0                                  ▷ initialization of the iteration counter
S_c^{t−1} ← S_c                        ▷ initial matching score
Δ ← 1                                  ▷ initialization of the difference of scores
while t &lt; T and Δ &gt; eps do
    t ← t + 1
    S_c^t ← S_c^{t−1}                  ▷ score to improve upon in this iteration
    for k in FaceSections do
        M(A_t, k) ← M(A, k) + BestM_A
        M(B_t, k) ← M(B, k) + BestM_B
        A′ ← A_{t−1} ⋅ M(A_t, k)
        B′ ← B_{t−1} ⋅ M(B_t, k)
        S_c′ ← similarity(A′, B′)
        if S_c′ &lt; S_c^t then
            S_c^t ← S_c′
            BestM_A ← M(A_t, k)
            BestM_B ← M(B_t, k)
            Best_wA ← W(A, BestM_A)
            Best_wB ← W(B, BestM_B)
    Δ ← S_c^{t−1} − S_c^t
    if t = 1 then                      ▷ first iteration corresponds to single removal (S0)
        for k in FaceSections do
            C_k ← Δ_k ⋅ ŵ_{(A,B),k}
            H0_A[M(A, k) = 1] ← C_k
            H0_B[M(B, k) = 1] ← C_k
    C_Best ← Δ ⋅ ŵ_Best
    H1−_A[BestM_A = 1] ← C_Best
    H1−_B[BestM_B = 1] ← C_Best</preformat>
        <p>Considering this, analyzing an average map
can provide valuable comprehension of the significance of each facial feature. Incorporating
this information within an average similarity map of the maps S0 (single removal) and S1 (greedy
removal), which we call SAVG, aligns with the notion that humans perceive and interpret faces
in a relational/configurational manner [28] (see Figure 3).</p>
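        <p>The averaged map is then a simple pixel-wise mean of the two maps; a short sketch under the same conventions:</p>
        <preformat>import numpy as np

def average_similarity_map(s0_map, s1_map):
    """SAVG: pixel-wise average of the single-removal and greedy-removal maps."""
    return (np.asarray(s0_map, dtype=float) + np.asarray(s1_map, dtype=float)) / 2.0</preformat>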
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <p>This section shows the experimental results for a selected number of samples extracted from
the CelebA dataset [35] and tested with the FaceNet [37] model trained on Casia-WebFace [33]
and VGGfaces2 [34]. In Figure 8, we show the output generated by the proposed method. It
comprises three maps: the initial single removal map S0, the greedy removal map S1, and the
final average map SAVG. The visualization uses orange to represent semantic areas that
are similar, while purple indicates differences in facial features. After the concept extraction, a
group of semantic areas is selected based on their importance. Table 1 displays the n selected
semantic areas ranked by their importance. In our study n = 8; this number can be changed as
needed.</p>
      <p>
        Figure 8 also features a table showcasing the sections of the face categorized as similar
(orange) and dissimilar (purple), along with their respective contribution values to the final
similarity map. We will focus on analyzing the mean map SAVG, which utilizes the same color
scale. Regarding the nature of masking applied during perturbation, we investigated how it
impacted the algorithm’s output. In Figure 9, we present two distinct case studies for both
models. The examined masking types encompass black masking, random noise masking, and
white masking. Upon observing the images, it becomes evident that, in general, there is minimal
sensitivity to the type of masking, especially between black and white masking. The most
notable deviation is associated with random noise masking, although this divergence remains
relatively modest. The maps reported in this study exclusively employ black masking. Figure
10 presents several instances of the algorithm’s output for both tested models. Specifically,
sections (a) and (c) demonstrate examples where facial comparisons are made between samples
of the same individual, while sections (b) and (d) involve comparisons with imposters. Even
when comparing faces of the same individual, certain areas are assessed as dissimilar, while
conversely, when confronting imposters, not all areas are consistently regarded as dissimilar.
The final score can offer additional insights by contextualizing which facial regions can be
modified to influence the outcome.</p>
      <sec id="sec-4-1">
        <title>4.1. Experiments with Cut-and-Paste Patches</title>
        <p>
We conducted a “Cut-and-Paste Patches” test to validate this outcome, as previously introduced
by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This experiment assesses whether replacing specific facial regions in one image with
a corresponding region from another is effectively detected by our algorithm and described
with high similarity in the similarity maps. We present the results in Figure 11. Specifically,
in column (a), we display the average similarity map of the two original images. In column
(b), one of the two images has been altered with a patch from the other (highlighted in
greenyellow). Finally, in column (c), we present the resultant output. Overall, we observe that regions
previously deemed dissimilar are now perceived as similar in the modified area, accompanied
by an increase in the final score. Additionally, we notice instances where semantic areas change
in their contribution despite not being included in the modification patch. The explanation for
this can be that the patches do not fit the exact dimensions as the semantic areas, and in some
cases, a rectangular patch, mainly centered on one point, may intersect multiple subsequently
affected semantic areas. This observation underscores the sensitivity of the proposed method,
particularly the segmentation carried out by Mediapipe, to facial regions. It is also noteworthy
that certain areas change in color even when they have not been directly modified – for instance,
the right eye in Case I, the left cheek in Case II, the left eye in Case III, and in Case IV, the
patch is not entirely recognized as similar. This discrepancy can be attributed to the fact that
while the test follows a part-based approach, network models tend to perceive faces holistically,
implying that altering a specific patch may lead to a change in perception of the entire face and
not just the modified area. This explanation aligns with the study of Jacob et al. [38], which
demonstrated that models trained on various datasets with the Thatcher effect [39] internalize a
holistic perception of faces. Moreover, it is essential to note that the maps under consideration
focus solely on the most influential areas, albeit their influence on the final score is limited.
      </p>
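        <p>The cut-and-paste check itself reduces to array slicing on aligned images; the patch coordinates below are placeholders.</p>
        <preformat>import numpy as np

def paste_patch(src, dst, top, left, height, width):
    """Copy a rectangular facial patch from src into dst (both HxWx3 arrays, same size)."""
    out = dst.copy()
    out[top:top + height, left:left + width] = src[top:top + height, left:left + width]
    return out

# Typical use: replace, e.g., the nose region of image B with the one from image A,
# recompute the matching score and the similarity maps, and compare them with the originals.</preformat>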
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Method limitations</title>
        <p>While Mediapipe offers valuable tools for semantically segmenting facial features, it displays
a notable sensitivity to variations in facial orientation. Substantial deviations in facial pose
result in increasingly dissimilar masks, leading to proportionally divergent contributions. When
the masks exhibit high similarity, the simultaneous occlusion method gains coherence as it
conceals identical portions of the image. Another limitation arises when comparing a profiled
face with a frontal one. In such instances, Mediapipe can still identify facial features; however,
the application of occlusion to both profiles loses its contextual relevance, rendering the affected
areas visually less comprehensible. Consequently, the most suitable application of the method
pertains to front-facing subjects with poses as closely aligned as possible.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future directions</title>
      <p>In this paper, we have initiated an effort to bridge the gap between computer and human vision,
with the primary goal of improving the interpretability of facial verification algorithms. We
sought to gain insight into how machines perceive the semantic aspects of human faces during
verification, ultimately aligning the system’s output score more closely with human reasoning.</p>
      <p>We employed the Mediapipe tool to identify distinct semantic regions on the human face to
achieve this. These regions, representing human-conceptual knowledge, provided a
comprehensive view of the critical concepts for our models. Leveraging this knowledge, we selected a
subset of the most significant semantic areas for the models. We also introduced a
perturbation algorithm that generated similarity maps, revealing how the models under examination
perceived these concepts as either similar or dissimilar.</p>
      <p>By contextualizing the system’s output score, we can align it more closely with human
reasoning. However, it is essential to note that our work is currently limited to experimentation.
As a result, future research directions could include exploring different segmentation methods,
conducting experiments across diverse models, comparing various methods to ours, or adapting
them to our approach. Additionally, including a user evaluation component could further
validate and enhance the effectiveness of our work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Alfarsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jabbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Tawafak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alsidiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alsinani</surname>
          </string-name>
          ,
          <article-title>Techniques for face verification: Literature review</article-title>
          , in: 2019
          <source>International Arab Conference on Information Technology (ACIT)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <article-title>Face of: Law enforcement use of face recognition technology</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>3909038</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>del Rio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moctezuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Conde</surname>
          </string-name>
          , I. M. de Diego, E. Cabello,
          <article-title>Automated border control e-gates and facial recognition systems</article-title>
          ,
          <source>computers &amp; security 62</source>
          (
          <year>2016</year>
          )
          <fpage>49</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Kramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <article-title>Face averages enhance user recognition for smartphone security</article-title>
          ,
          <source>PloS one 10</source>
          (
          <year>2015</year>
          )
          <article-title>e0119460</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Maslej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brynjolfsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Etchemendy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lyons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Manyika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sellitto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sakhaee</surname>
          </string-name>
          , et al.,
          <article-title>The ai index 2022 annual report</article-title>
          .
          <source>ai index steering committee</source>
          , Stanford Institute for
          <string-name>
            <surname>Human-Centered</surname>
            <given-names>AI</given-names>
          </string-name>
          , Stanford University (
          <year>2022</year>
          )
          <fpage>123</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Garcia-Gathright</surname>
          </string-name>
          , M. de Rijke,
          <string-name>
            <surname>M. D. Ekstrand</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Roegiest</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lipani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beutel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Olteanu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lucic</surname>
            ,
            <given-names>A.-A.</given-names>
          </string-name>
          <string-name>
            <surname>Stoica</surname>
          </string-name>
          , et al.,
          <article-title>Facts-ir: fairness, accountability, confidentiality, transparency, and safety in information retrieval</article-title>
          ,
          <source>in: ACM SIGIR Forum</source>
          , volume
          <volume>53</volume>
          , ACM New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Davide</surname>
          </string-name>
          ,
          <article-title>Can we open the black box of ai</article-title>
          ,
          <source>Nature News</source>
          <volume>538</volume>
          (
          <year>2016</year>
          )
          <fpage>20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (</article-title>
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monroy-Hernández</surname>
          </string-name>
          ,
          <article-title>” help me help the ai”: Understanding how explainability can support human-ai interaction</article-title>
          ,
          <source>in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kretschmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hedström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bareeva</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>M. Höhne, Finding the right XAI method - A guide for the evaluation and ranking of explainable AI methods in climate science</article-title>
          ,
          <source>CoRR abs/2303</source>
          .00652 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lapedriza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <article-title>Learning deep features for discriminative localization</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2921</fpage>
          -
          <lpage>2929</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vedantam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Batra</surname>
          </string-name>
          , Grad-cam:
          <article-title>Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Petsiuk</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Saenko</surname>
          </string-name>
          ,
          <article-title>Rise: Randomized input sampling for explanation of black-box models</article-title>
          , arXiv preprint arXiv:
          <year>1806</year>
          .
          <volume>07421</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Towards interpretable face recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>9348</fpage>
          -
          <lpage>9357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <article-title>On black-box explanation for face verification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3418</fpage>
          -
          <lpage>3427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Knoche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Teepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hörmann</surname>
          </string-name>
          , G. Rigoll,
          <article-title>Explainable model-agnostic similarity and confidence in face verification</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>711</fpage>
          -
          <lpage>718</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , ”
          <article-title>Why should I trust you?” explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ribera</surname>
          </string-name>
          , À. Lapedriza,
          <article-title>Can we do better explanations? a proposal of user-centered explainable ai</article-title>
          ,
          <source>in: IUI Workshops</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID: 84832474.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wexler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sayres</surname>
          </string-name>
          ,
          <article-title>Interpretability beyond feature attribution: Quantitative testing with Concept Activation Vectors (TCAV)</article-title>
          ,
          <source>in: 35th International Conference on Machine Learning (ICML)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2668</fpage>
          -
          <lpage>2677</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Wu</surname>
          </string-name>
          , S.-C. Zhu,
          <article-title>Interpreting cnn knowledge via an explanatory graph</article-title>
          , in:
          <source>32nd AAAI Conference on Artificial Intelligence (AAAI)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4454</fpage>
          -
          <lpage>4463</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Ghorbani, J. Wexler, J. Y. Zou, B. Kim, Towards automatic concept-based explanations, in: Advances in Neural Information Processing Systems, volume 32, 2019.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] R. Tan, L. Gao, N. Khan, L. Guan, Interpretable artificial intelligence through locality guided neural networks, Neural Networks 155 (2022) 58-73.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] W. Zhang, B. Y. Lim, Towards relatable explainable ai with the perceptual process, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] E. C. Carterette, M. P. Friedman (Eds.), Perceptual processing, Handbook of perception, v. 9, Academic Press, New York, 1978.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] M. L. Matthews, Discrimination of identikit constructions of faces: Evidence for a dual processing strategy, Perception &amp; Psychophysics 23 (1978) 153-161.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] G. Davies, H. Ellis, J. Shepherd, Cue saliency in faces as assessed by the ‘photofit’ technique, Perception 6 (1977) 263-269.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Iskra, H. Gabrijelčič Tomc, Eye-tracking analysis of face observing and face recognition, Journal of Graphic Engineering and Design 7 (2016) 5-11.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] G. Rhodes, Configural coding, expertise, and the right hemisphere advantage for face recognition, Brain and Cognition 22 (1993) 19-41.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] P. Karczmarek, W. Pedrycz, A. Kiersztyn, P. Rutka, A study in facial features saliency in face recognition: An analytic hierarchy process approach, Soft Computing 21 (2017) 7503-7517. doi:10.1007/s00500-016-2305-9.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] X. Li, Z. Yang, H. Wu, Face detection based on receptive field enhanced multi-task cascaded convolutional neural networks, IEEE Access 8 (2020) 174922-174930.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 4768-4777.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] J. Castro, D. Gómez, J. Tejada, Polynomial calculation of the shapley value based on sampling, Computers &amp; Operations Research 36 (2009) 1726-1730.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] D. Yi, Z. Lei, S. Liao, S. Z. Li, Learning face representation from scratch, arXiv (2014).</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] F. V. Massoli, G. Amato, F. Falchi, Cross-resolution learning for face recognition, Image and Vision Computing 99 (2020) 103927.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] H. Zhang, W. Chen, J. Tian, Y. Wang, Y. Jin, Show, attend and translate: Unpaired multi-domain image-to-image translation with visual attention (2018) 1-11.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] J. de Borda, Mémoire sur les élections au scrutin, Histoire de L’Académie Royale des Sciences 102 (1781) 657-665.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] G. Jacob, R. T. Pramod, H. Katti, S. P. Arun, Qualitative similarities and differences in visual object representations between brains and deep networks, Nature Communications 12 (2021). doi:10.1038/s41467-021-22078-3.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] P. Thompson, Margaret thatcher: A new illusion, Perception 9 (1980) 483-484.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>