<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Advances of Mobile and Wearable Biometric, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On Unconstrained Ear Recognition For Privacy-Preserving Authentication</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vishesh Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshay Agarwal</string-name>
        </contrib>
        <aff>IISER Bhopal, India</aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>26</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Biometric recognition-based person authentication is an effective means of preventing illegal access to systems; however, privacy concerns have raised serious issues about its use. In this research, we have conducted an extensive study of privacy-preserving person authentication using the ear modality. To that end, we have proposed a novel triplet loss-based convolutional network architecture. To demonstrate its effectiveness, we have compared the proposed algorithm with several convolutional networks as well as a vision transformer (ViT). Interestingly, it is also observed that the existing ear recognition datasets majorly cover non-Indian identities. Therefore, to address the dearth of Indian-ethnicity ear recognition datasets and to advance the research concerning Indians, we have collected the first-ever unconstrained Indian ear recognition dataset, namely I-Ear. On the proposed dataset, the performance of the proposed triplet loss network is at least 5% better than baseline networks including ViT.</p>
      </abstract>
      <kwd-group>
        <kwd>Ear Recognition</kwd>
        <kwd>Indian Ear Dataset</kwd>
        <kwd>Privacy-Preserving Authentication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing demand for secure and automated identity systems has led to extensive research in
the fields of computer vision and intelligent systems. Biometrics, which offer invariance over
time, ease of acquisition, and uniqueness for each individual, have become the preferred choice
for human identification systems. Among various biometric characteristics, the human face has
received significant attention in research and development. However, facial recognition systems
pose several challenges, of which privacy is the most critical. In contrast, the human ear
has emerged as a promising biometric modality with unique advantages. The distinct shape and
features of the ear, such as the contours, ridges, and earlobes, along with non-intrusive acquisition
provide valuable information for identification purposes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Further, ear image recognition can
solve the privacy issue to some extent. Ongoing research and development in ear recognition
technology aim to exploit its capabilities and improve its performance in various applications,
which propelled us to research unconstrained ear datasets [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>In this research, we have utilized ear datasets that comprise both controlled and uncontrolled
environments, including ear images acquired from the web. However, the datasets contain a
limited number of samples per subject, making it impractical to build a deep-learning model from
scratch. To address this limitation, we have employed transfer learning techniques to achieve
person recognition using ear images. In our study, we initially utilized the feature extraction
capability of deep networks directly for ear recognition. We employed a non-trainable method
that involved extracting the features from the deep networks and then computing dissimilarity
metrics to compare and match ear images. This approach allowed us to leverage the discriminative
power of the deep networks’ learned representations. To improve the recognition rate further,
we performed a score fusion of the best-obtained models. Score fusion involves combining
the dissimilarity scores from multiple models to generate a final decision score. By fusing the
scores, we aimed to capture the complementary information and enhance the overall recognition
accuracy.</p>
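      <p>The non-trainable matching pipeline described above, deep features compared with a dissimilarity metric and combined by average score fusion, can be sketched as follows (a minimal NumPy illustration; the pre-trained feature extractors themselves are assumed and represented here by toy feature vectors):</p>

```python
import numpy as np

def euclidean_dissimilarity(a, b):
    # Euclidean distance between two feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_dissimilarity(a, b):
    # 1 - cosine similarity; 0 when the vectors point the same way
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fused_scores(probe_feats, gallery_feats, metric):
    """Average (late) fusion of the dissimilarity scores from several networks.

    probe_feats / gallery_feats: one entry per network; each gallery entry is a
    matrix with one feature row per gallery identity.
    """
    per_model = []
    for p, G in zip(probe_feats, gallery_feats):
        per_model.append(np.array([metric(p, g) for g in G]))
    return np.mean(per_model, axis=0)

# Toy example: two "networks", three gallery identities, probe = identity 1.
rng = np.random.default_rng(0)
gallery = [rng.normal(size=(3, 8)), rng.normal(size=(3, 8))]
probe = [gallery[0][1] + 0.01 * rng.normal(size=8),
         gallery[1][1] + 0.01 * rng.normal(size=8)]
scores = fused_scores(probe, gallery, cosine_dissimilarity)
print(int(np.argmin(scores)))  # index of the minimum fused dissimilarity
```

      <p>The probe is assigned to the gallery identity with the minimum fused dissimilarity, which is how the identification decision is made in the rest of the paper.</p>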
      <p>Later, we proposed a triplet loss model for ear recognition utilizing the best-performing
pretrained model. The triplet loss model aims to learn a discriminative embedding space where the
distance between ear images of the same identity is minimized and the distance between dissimilar
identities’ ear images is maximized. On top of that, we also performed decision fusion using the
score obtained from the multiple classifiers for the triplet loss model and proposed a robust ear
biometric recognition technique. Overall, this research makes the following contributions:
• We propose an ear recognition network based on the amalgamation of multiple pre-trained
deep networks and a proposed triplet loss model. The pre-trained model brings the
computational advantage while performing recognition; whereas, the triplet model accurately
captures the ear-specific features using an effective base deep neural network;
• Extensive experiments are conducted to evaluate and compare different learning
methodologies in our study. The objective also aims to thoroughly analyze the performance of
multiple state-of-the-art deep networks, providing a comprehensive understanding of their
effectiveness for ear recognition;
• We also present a novel ear recognition dataset namely ‘I-Ear’, which will be made publicly
available to the research community.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Ear recognition, while gaining attention, is not as widely studied or popular as other biometric
modalities such as face, iris, and fingerprint recognition. Consequently, the availability of
large-scale datasets specifically designed for ear recognition is still limited. To address the challenge
posed by the relatively small size of ear recognition datasets, Emersic et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a deep
learning-based averaging system. This approach aims to mitigate the risk of overfitting, which can
arise when training deep learning models on small datasets. Zhang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] focused on addressing
the challenges of ear verification under uncontrolled conditions using convolutional neural
networks (CNNs). Zhang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] focused on addressing the challenges of few-shot learning in
the context of ear recognition. Few-shot learning refers to the ability of a model to recognize new
classes or categories with only a limited number of training examples. Sinha et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed
a framework that localizes the outer ear image using a histogram of oriented gradients (HOG)
and support vector machine (SVM) and then uses CNNs to perform ear recognition. Hansley et
al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed a fusion approach that combines both convolutional neural networks (CNNs)
and handcrafted features for ear recognition. They found that this fusion method outperformed
other state-of-the-art CNN-based approaches and concluded that handcrafted features could
complement deep learning methods in the context of ear recognition. The research by Štepec et
al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] contributes to the advancement of ear recognition by introducing a dual-path CNN model
that integrates global and local information and incorporates a constellation-based approach for
accurate ear images. The framework provides insights into effective strategies for encoding and
utilizing different aspects of ear characteristics, leading to improved recognition performance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Ear Recognition Algorithm</title>
      <p>This paper delves into the utilization of state-of-the-art deep learning architectures for encoding
the distinctive features of ear images. To accomplish this, we conducted a thorough evaluation
of various image classification convolutional neural networks (CNNs) and a vision transformer
(ViT). Based on the effectiveness of an individual classification network, the top-performing
classifiers are then employed for score fusion to enhance the overall recognition accuracy. By
selecting the best models and optimizing their computational requirements, we strive to achieve a
balance between accuracy and efficiency in ear recognition. To achieve this objective, we explore
two distinct recognition algorithms for representation learning, which will be discussed in the
subsequent subsections. We aim to identify the most effective representation method, enabling us
to build ensemble models that enhance overall recognition performance. The pseudocode of the
proposed ear recognition algorithms is shown in Algorithm 1 and Algorithm 2 and is described
separately in the following subsections.
3.1. Algorithm 1: Transfer Learning
1. Split each ear dataset into two sets: (i) gallery and (ii) probe. Match each probe image
with the gallery images of different identities for person identification. Following the
standard biometrics recognition setting, a single image of each identity is used as a gallery
image;
2. To compute the biometric representation of a subject, pass each gallery image through K
CNNs and a Transformer. We obtain a d-dimensional feature vector for each subject,
called a biometric template;
3. Repeat step 2 to compute the biometric representation of the probe images at test time;
4. Compute the dissimilarity scores between the feature representation obtained using a
particular CNN or Transformer for a probe image and the N gallery images;
5. After comparing the feature representations of the different networks, we obtain identity
dissimilarity scores using different distance metrics. These scores are then fused, employing the
average fusion technique, to generate a final decision score vector;
6. Once the fusion process is complete, the identity corresponding to the minimum distance
value is assigned as the label of the probe image. If the predicted label matches
the ground-truth label, the correct identification score is incremented by one.</p>
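      <p>The identification loop and CIR computation described in the steps above can be sketched as follows (a minimal NumPy illustration with hypothetical helper names; Euclidean dissimilarity and average fusion are used for concreteness):</p>

```python
import numpy as np

def identify(probe_feat_per_model, gallery_feats_per_model, gallery_labels):
    """Predict the identity of one probe image by minimum fused distance.

    Each list element corresponds to one pre-trained network; gallery feature
    matrices have one row per gallery identity.
    """
    fused = np.zeros(len(gallery_labels))
    for p, G in zip(probe_feat_per_model, gallery_feats_per_model):
        fused += np.linalg.norm(G - p, axis=1)   # Euclidean dissimilarity
    fused /= len(probe_feat_per_model)           # average score fusion
    return gallery_labels[int(np.argmin(fused))]

def correct_identification_rate(predictions, ground_truth):
    # CIR% = (correctly identified probes / total probes) * 100
    hits = sum(p == t for p, t in zip(predictions, ground_truth))
    return 100.0 * hits / len(ground_truth)

print(correct_identification_rate(["a", "b", "c", "a"], ["a", "b", "a", "a"]))  # 75.0
```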
      <p>After accumulating the correct identification scores, the correct identification rate (CIR)
can be computed as follows:
CIR% = (Σ_p 1[ID_p = truth_p] / P) × 100.
The CIR is determined by dividing the total number of correctly identified samples by
the total number of samples in the dataset. This metric provides insight into the accuracy
of the identification process and serves as a measure of the system's performance. By
calculating the CIR, we gain valuable information about the system's effectiveness in
correctly identifying individuals based on their features.</p>
      <p>Algorithm 1: Proposed Ear Recognition Algorithm based on Dissimilarity Metric.
Input: gallery images of N different identities, P probe images of the same identities, and K
pre-trained CNNs and a Transformer (ViT).
Using each CNN and the Transformer, compute the feature representation of the N gallery
images as G = f(g), giving a matrix of size N × d, where d is the length of the feature vector
for each pre-trained network.
For every probe image p:
Compute the feature representation f(p) of p using the pre-trained CNNs and Transformer.
For each gallery image g(i) in G, compute the
Euclidean dissimilarity d(p, g(i)) = √(Σ_j (f_j(p) − f_j(g(i)))²)
or the cosine dissimilarity d(p, g(i)) = 1 − (f(p) · f(g(i))) / (‖f(p)‖ ‖f(g(i))‖).
The computation is repeated for each CNN and Transformer representation.
Perform the dissimilarity score (late) fusion S(i) = ⊕_k d_k(p, g(i)), k = 1, 2, ..., K,
and select ID = argmin_i S(i).</p>
      <p>3.2. Algorithm 2: Triplet Ear Recognition
1. Split each ear dataset into two sets: (i) gallery and (ii) probe. We have used two images
from each identity as gallery images and the remaining images, i.e., 2, 2, and 13 images
per identity, as probe images for the I-Ear, AWE, and KinEar datasets, respectively.
2. Triplet Selection:
• For each identity in the dataset, randomly select one image as the anchor.</p>
      <p>• Randomly select another image from the same identity as the positive example.</p>
      <p>• Create anchor-positive pairs by pairing the anchor image with its corresponding
positive image.
• Randomly select an image from a different identity as the negative example. Ensure
the negative image is dissimilar to the anchor image to create a challenging contrast.
• Repeat the dataset’s anchor-positive pair generation and negative example selection
process for multiple identities.
• Store the generated triplets, including the anchor, positive, and negative images, for
training your triplet loss model.
3. Pass the selected batch of triplets through each CNN and the Transformer (ViT), and then use the
classifier to predict the probability distribution over the classes for each triplet. In the intermediate
steps between the feature extractor and the classifier, we aim to minimize the distance between
the anchor and the positive image while maximizing the distance between the anchor and the
negative images.
4. Then compute the triplet loss and cross-entropy loss using the triplet function and
cross-entropy function defined in the pseudocode.
5. Compute the total loss and update the parameters of the feature extractor (CNNs and
Transformer).
6. Backward pass and update model and classifier parameters:
• Compute the gradients of the triplet loss and cross-entropy loss with respect to the model
and classifier parameters, respectively.
• Use an optimization algorithm (e.g., stochastic gradient descent) to update the model
and classifier parameters based on the computed gradients.</p>
      <p>7. Repeat steps 3-6 for multiple iterations or epochs:</p>
      <p>• Iterate over the training dataset in batches and perform forward and backward passes
to update the model parameters.
8. Monitor the training progress:
• Evaluate the model periodically using a validation dataset to track performance and
detect overfitting.</p>
      <p>• Adjust hyperparameters (e.g., learning rate, margin) if necessary.
9. Once the model is trained, pass the probe images through the model and compute the
probability scores.
10. The probability scores of the different feature extractors are fused (average fusion) to generate a
final decision score vector.</p>
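      <p>The triplet selection procedure of step 2 can be sketched as follows (a simplified random-sampling version; the function and variable names are illustrative, not from the paper):</p>

```python
import random

def make_triplets(images_by_id, n_triplets, seed=0):
    """Randomly build (anchor, positive, negative) triplets.

    images_by_id: dict mapping identity -> list of that identity's images.
    Only identities with at least two images can supply an anchor-positive pair.
    """
    rng = random.Random(seed)
    anchor_ids = [i for i, imgs in images_by_id.items() if len(imgs) >= 2]
    triplets = []
    for _ in range(n_triplets):
        anchor_id = rng.choice(anchor_ids)
        # Two distinct images of the same identity: anchor and positive.
        anchor, positive = rng.sample(images_by_id[anchor_id], 2)
        # Negative: any image of a different identity.
        negative_id = rng.choice([i for i in images_by_id if i != anchor_id])
        negative = rng.choice(images_by_id[negative_id])
        triplets.append((anchor, positive, negative))
    return triplets

data = {"id1": ["a1", "a2"], "id2": ["b1", "b2"], "id3": ["c1"]}
trips = make_triplets(data, n_triplets=4)
print(len(trips))  # 4
```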
      <p>Algorithm 2: Proposed Ear Recognition Algorithm based on Triplet Loss.
Input: gallery images of N different identities, P probe images of the same identities, f
representing the feature extractor (CNNs and Transformer (ViT)), and C representing the
classifier.
Divide the gallery images into Anchor (A), Positive (P), and Negative (N) sets of images,
such that A ∪ P ∪ N = G and, for each triplet, A ∩ P ∩ N = ∅.
Compute the feature representations of A, P, and N using the feature extractor f as
f(A), f(P), and f(N); these are matrices of size m × d each, where d is the length of the
feature vector for each feature extractor.
Use the classifier C to predict the probability of belonging to each class as C(f(A)),
C(f(P)), and C(f(N)); these are all of size 1 × c.
For every feature extractor f and for each epoch:
Triplet loss L1:
L(f(A), f(P), f(N)) = max{d+(f(A), f(P)) − d−(f(A), f(N)) + α, 0},
where d+(f(A), f(P)) = √(Σ (f(A) − f(P))²), d−(f(A), f(N)) = √(Σ (f(A) − f(N))²),
and α is the margin between positive and negative pairs, so
L1 = Σ_{t=1}^{T} L(f(A_t), f(P_t), f(N_t)), where T is the number of triplets.
Cross-entropy loss L2:
H = H(y(A), C(f(A))) + H(y(P), C(f(P))) + H(y(N), C(f(N))), where
y(A), y(P), y(N) are the true probability distributions of A, P, N, respectively, and
H(y(X), C(f(X))) = − Σ y(X) log(C(f(X))).
L2 = Σ_{t=1}^{T} H_t.
Total loss: L = L1 + L2.
Parameter update, with learning rate lr:
f_new = f_old − lr × ∂L/∂f,
C_new = C_old − lr × ∂L/∂C.
Repeat the above computation for each epoch.
After training, compute f_new(p) and C_new(f_new(p)) for each probe image p, and
perform the score (late) fusion S(p) = ⊕ C_new(f_new(p));
ID = argmax{S(p)}.</p>
      <p>11. An identity is selected corresponding to the maximum probability value to label the probe
image.</p>
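      <p>The loss computation in Algorithm 2 can be written out as follows (a NumPy sketch of the stated formulas; the feature vectors and probability distributions below are toy values, and the gradient-based parameter update itself is left to an optimizer):</p>

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=1.0):
    # max{ d+(anchor, positive) - d-(anchor, negative) + margin, 0 }
    d_pos = np.sqrt(np.sum((f_a - f_p) ** 2))
    d_neg = np.sqrt(np.sum((f_a - f_n) ** 2))
    return max(d_pos - d_neg + margin, 0.0)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # H(y, y_hat) = -sum(y * log(y_hat)), clipped for numerical safety
    return float(-np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0))))

def total_loss(f_a, f_p, f_n, y_a, y_p, y_n, p_a, p_p, p_n, margin=1.0):
    # L = L1 (triplet term) + L2 (cross-entropy on anchor, positive, negative)
    l1 = triplet_loss(f_a, f_p, f_n, margin)
    l2 = cross_entropy(y_a, p_a) + cross_entropy(y_p, p_p) + cross_entropy(y_n, p_n)
    return l1 + l2

# Toy triplet: the positive is close to the anchor and the negative far away,
# so the triplet term vanishes and only the classification term remains.
f_a = np.array([0.0, 0.0])
f_p = np.array([0.1, 0.0])
f_n = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])   # true one-hot distribution
p = np.array([0.9, 0.1])   # predicted class probabilities
loss = total_loss(f_a, f_p, f_n, y, y, y, p, p, p)
print(round(loss, 3))  # 0.316 = 3 * (-ln 0.9)
```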
      <p>The schematic diagrams of both algorithms are shown in Figure 1 and Figure 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>This section describes the various components associated with the ear recognition experiments
conducted in this research. We begin by providing an overview of the datasets used to perform ear
recognition. Moving forward, we discuss the deep neural networks used in our experiments. These
networks are chosen based on their proven efficacy in various computer vision tasks including
image classification. In the end, we present the analysis concerning different dissimilarity metrics
and the Triplet Loss Model used for evaluation.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          To ensure an unbiased study, we have conducted our experiments using several datasets that are
collected in uncontrolled environments. This approach allows us to evaluate the performance
of the identification system under real-world conditions, where various factors such as lighting
conditions, pose variations, and occlusions can significantly impact recognition accuracy. The
datasets used in this research are briefly described below:
1. KinEar dataset: The KinEar dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] contains ear images of 19 families with a total of
76 identities [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ];
2. I-Ear dataset: As part of our research efforts, we have prepared a first-ever Indian ear
recognition dataset captured in several unconstrained settings using mobile devices. In
total, the dataset contains ear images of 52 identities. The dataset can be accessed by using
the following link [I-Ear].
3. AWE dataset: Annotated Web Ears (AWE) dataset [
          <xref ref-type="bibr" rid="ref12 ref3 ref4">3, 12, 4</xref>
          ], contains ear images collected
from the web, ensuring a wide range of variability derived from unconstrained environments.
This dataset comprises images of 100 subjects, including some of the most famous people
from diverse ethnicities, genders, and age groups.
        </p>
        <p>The characteristics of each dataset are given in Table 1 and a few samples are shown in Figure
3.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. CNNs and ViT</title>
        <p>
          Various convolutional neural networks (CNNs) and a vision Transformer are explored to find an
effective architecture for the proposed ear recognition algorithm. These networks encompass a
wide spectrum, including sequentially connected networks like VGG [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], residual connected
networks such as ResNet [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], compact CNNs like MobileNet [
          <xref ref-type="bibr" rid="ref15">15</xref>
], EfficientNet [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and
Xception [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Additionally, we have ventured into novel network architecture search models
like NASNetMobile [
          <xref ref-type="bibr" rid="ref18">18</xref>
] for feature extraction. We have also utilized a recent,
state-of-the-art deep network model, namely the vision transformer (ViT), which is currently less explored
for ear recognition. Contrary to CNNs, the Vision Transformer (ViT) does not incorporate any
convolutional layers in its architecture. Instead, ViT adopts a different approach by splitting the
input image into smaller patches and directly feeding them into the ViT network along with the
attention module to find the most distinctive features in the local region of images.
        </p>
        <p>To obtain image representations, we extract features from the final dense layers of these
models. The feature dimensions (d) for the different networks are as follows: (1) VGG16: 4096,
(2) EfficientNet: 1280, (3) Xception: 2048, (4) MobileNetV2: 1280, (5) NASNetMobile: 1056,
(6) ResNet50: 2048, and (7) ViT: 768. These feature representations serve as robust biometric
descriptors that capture the distinctive characteristics of the input images. To perform the matching
for ear recognition, we have used two methods:
1. Non-Trainable Method: Multiple dissimilarity metrics, namely Cosine and Euclidean
are used to measure the distance between the image representation belonging to same and
different identities. These metrics offer a non-training property, enabling computational
efficiency and fast identification of identities.
2. Trainable Method: As described in Algorithm 2, in this method, we have trained the
triplet loss model for ear recognition.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <p>To evaluate the performance of our ear identification experiments, we employed various
performance curves and metrics. These evaluation metrics are listed below:
• CIR%: The CIR score, as defined earlier, represents the rate of correctly classified ear
images by the ear recognition systems.
• ROC: ROC stands for receiver operating characteristic curve and is useful for person
verification scenarios. The verification is a 1:1 matching scenario where along with the ear
image a person also provides the identity. The image is then matched with all the images in
the gallery and the score corresponding to the true identity is considered as genuine score.
If the genuine score is greater than the threshold, we considered a match else discard the
ear images.
• CMC: CMC stands for cumulative matching characteristics curve and is used to measure
the performance of ear recognition systems at different positions (rank) of matching. If
the true dissimilarity score is found at the first position, then it is referred to as a rank-1
matching. Else we keep checking the different positions until the true identity is found but
in turn, this increases the search space.</p>
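        <p>The rank-k CMC values described above can be computed from a dissimilarity score matrix as follows (a small sketch; score_matrix[i, j] is assumed to hold the dissimilarity between probe i and gallery identity j):</p>

```python
import numpy as np

def cmc(score_matrix, probe_labels, gallery_labels, max_rank=10):
    """Cumulative matching characteristic: fraction of probes whose true
    identity appears among the top-k smallest dissimilarity scores."""
    gallery_labels = np.asarray(gallery_labels)
    hits = np.zeros(max_rank)
    for scores, true_label in zip(score_matrix, probe_labels):
        order = np.argsort(scores)  # best (smallest) dissimilarity first
        rank = int(np.where(gallery_labels[order] == true_label)[0][0])
        if rank >= max_rank:
            continue
        hits[rank:] += 1            # a rank-r hit also counts at ranks r+1..max_rank
    return hits / len(probe_labels)

scores = np.array([[0.1, 0.5, 0.9],    # probe 0: true identity ranked first
                   [0.7, 0.6, 0.2]])   # probe 1: true identity ranked second
curve = cmc(scores, ["a", "b"], ["a", "b", "c"], max_rank=3)
print(curve.tolist())  # [0.5, 1.0, 1.0]
```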
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results and Analysis</title>
      <p>
        In this section, we present the results of ear identification experiments conducted on multiple
challenging datasets including the proposed dataset using different models and strategies. The
analysis can be divided into multiple categories based on the ingredients used to perform the
matching such as (i) analysis based on the performance of a CNN, (ii) analysis on the effectiveness
of ViT, (iii) analysis concerning the utilization of non-trainable matching method or impact of
training the proposed triplet loss model. The analysis on CNN reflects the effectiveness of
different CNN in performing ear recognition, later, these best-performing models are selected as a
base model in the triplet loss model. Finally, we used Grad-CAM visualization [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], to understand
and explain the working of CNN and ViT implemented for ear recognition. The visualization
helps in understanding the importance of different ear regions in identifying individuals.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Results and Analysis on KinEar Dataset</title>
        <p>We first present the analysis on the KinEar dataset using both matching approaches, namely
the dissimilarity metrics and the triplet loss. In this research, we have performed ear identification,
which is a 1 : N matching scenario, where N is the total number of identities. The results of person
embedding obtained using the network architecture search (NAS) network (NASNet) performs
worst on the ear identification task. We want to mention that similar to other CNNs, the NASNet
model is also pre-trained on the ImageNet dataset and only used as the image encoder. Out of all
the CNNs, the ResNet50 model performs the best for ear recognition on the KinEar dataset. The
prime reason might be the deep nature of the architecture in comparison to efficient architectures
such as EfficientNet and MobileNetV2. While in most cases, both distance metrics perform the
same, on ResNet embeddings the cosine dissimilarity metric outperforms the Euclidean distance by a
significant margin. In other words, cosine dissimilarity improves the ear recognition performance
by 3% as compared to the Euclidean metric on the KinEar dataset when the ResNet and VGG16
encoders are used for feature extraction. In the first case of improving the performance of
ear recognition, we have performed the average dissimilarity score fusion of a couple of
bestperforming models. In this research, based on the performance of the models, we have performed
the fusion of ResNet50 and VGG16. It is seen that this dissimilarity fusion does not improve
the recognition performance for Euclidean and with cosine slight reduction in the accuracy is
observed.</p>
        <p>In a second attempt at increasing the performance of ear recognition, we have trained the
triplet loss model using CNNs and ViT. While training the triplet loss model increases the
computational complexity of the ear recognition system, it shows a tremendous improvement
in recognition performance. The results reported in Table 3 show that the performance of the
best-performing ResNet model is boosted to 48%, in comparison to the 44% obtained using the cosine
dissimilarity metric. In comparison to other models, ViT and EfficientNet show a significant
jump in performance. For example, the performance of ViT improves more than 2 times using
the triplet model as compared to the dissimilarity metric-based matching. Overall, the proposed
fusion using the triplet model improves the best matching performance on the KinEar dataset
and is 15% better than the fusion algorithm where an average of dissimilarity scores are used for
matching.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Results and Analysis on the Proposed I-Ear Dataset</title>
        <p>We have acquired the first-ever Indian recognition dataset using mobile devices in unconstrained
settings. We have captured the ear images of both the left and right ear and hence individually
both ears are used for matching. Similar to the experiments on the KinEar dataset, we have
performed the ear recognition on the I-Ear dataset using dissimilarity metric and triplet loss
models.</p>
        <p>In comparison to the KinEar dataset, the NASNet architectures show better performance on
the I-Ear dataset. Interestingly, the pre-trained ViT model outperformed the CNN models for
left ear recognition on the proposed I-Ear dataset. Whereas, the ResNet50 model performs the
best when right-ear images are used for matching. The average dissimilarity score fusion of
best-performing models shows a significant boost in recognition performance. Again overall,
the cosine similarity with score fusion performs the best, and for the right ear it yields at least
5% better performance than the best-performing model. While the simple dissimilarity fusion
shows a jump in the recognition performance, the accuracy improves further with a fusion of
triplet loss-trained models. For left ear recognition, the triplet loss model fusion yields 5% better
performance than the dissimilarity score fusion. An improvement of 7% is also observed for right
ear recognition when triplet loss model fusion is implemented contrary to the dissimilarity fusion.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Experimental Results and Analysis on the AWE Dataset</title>
        <p>Since the triplet loss model shows significant improvements on both the earlier datasets namely
KinEar and I-Ear datasets, we have further evaluated it for unconstrained ear recognition to
evaluate its robustness in handling real-world factors. For that, we have used the AWE dataset
collected directly from the web containing the ear images captured in-the-wild settings. The
proposed fusion of the triplet loss model shows a significant jump in the recognition performance
even on this in-the-wild dataset as compared to the individual CNN and ViT. It is observed that
through the experiments on both I-Ear and AWE datasets the left ear yields higher recognition
performance than the right ear. We want to highlight that while the ResNet model consistently
outperforms other networks, for right-ear recognition on the AWE dataset the EfficientNet model
shows the highest accuracy. The results on the AWE dataset are reported in Table 6.</p>
        <p>The above-reported ear recognition performances are reported at rank-1, where it is assumed
that the true match always occurs at the first spot. In other words, in rank-1 identification, it is
assumed that only the true identity yields the lowest dissimilarity score. However, that might
not be the case due to several reasons, including the environmental factors in which the images
are captured, which decrease the inter-class variation but increase the intra-class variations.
To tackle this, we have also performed ear recognition at different ranks, and CMC values are
used to report the performance at each rank. As the rank increases, the search space to
find the identity also increases; therefore, keeping this in mind, we restrict ear recognition to
rank-10. Figure 4 shows the ear recognition accuracy (%) at ranks 1, 5, 8, and 10 on the KinEar,
AWE (left and right), and I-Ear (left and right) datasets using the proposed triplet loss
fusion network. As expected, as soon as the rank increases, the matching performance increases
drastically. As shown in Table 6, the AWE dataset yields only 38% accuracy for the right ear
and 41% accuracy for the left ear, and shows a jump of at least 22% when the rank of identification
increases from 1 to 5. Due to the significantly high performance on the I-Ear dataset at rank-1
itself, the improvement with increasing rank is lower than for the other datasets.</p>
        <p>The proposed triplet loss model-based ear recognition shows improved performance
compared to the individual CNN and ViT models; however, it is not clear which feature or region
leads to this performance. Therefore, to present a responsible ear recognition framework, we
performed Grad-CAM visualization to better understand the regions of the ear on which these
models focus while extracting features. We chose the best-performing networks obtained through
the extensive experiments performed on the multiple ear recognition datasets, including the
proposed I-Ear dataset. Some examples of correctly classified and misclassified images taken
from each dataset are shown in Figure 5. It is observed that when the models take the correct
decision, they focus on the ear region; whereas misclassifications arise from undesired distortion
of the ear shape, blurriness of the images, or the models focusing on the entire image, with the
background dominating the foreground.</p>
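        <p>Grad-CAM weights each channel of a convolutional activation map by its spatially averaged gradient and keeps only the positive evidence. A minimal NumPy sketch of that core computation, assuming the activations and the gradients of the identity score with respect to them have already been extracted from the backbone (the toy arrays are illustrative, not from the paper):</p>

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.
    activations, gradients: arrays of shape (channels, H, W)."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pooled gradients
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over channels
    cam = np.maximum(cam, 0)                           # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                          # normalize to [0, 1] for display
    return cam

# toy example: 2 channels over a 2x2 spatial map
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 2.0], [0.0, 0.0]]])
grads = np.array([[[1.0, 1.0], [1.0, 1.0]],   # channel 0 drives the score
                  [[0.0, 0.0], [0.0, 0.0]]])  # channel 1 is ignored
cam = grad_cam(acts, grads)
```

Overlaying the upsampled heatmap on the input image shows whether the model attends to the ear region or to background, which is the check described above.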
        <p>Figure 6 shows the ROC curve when the proposed triplet loss algorithm is evaluated
using the score fusion of the best-performing networks. On the AWE dataset, at a 10<sup>-3</sup> false accept
rate, the proposed model yields a true accept rate close to 41% for right ear recognition. Left ear
recognition shows a 6% higher true accept rate at the same false accept rate.</p>
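        <p>The operating point read off such an ROC curve, the true accept rate at a fixed false accept rate, can be computed directly from genuine and impostor score distributions. A sketch using synthetic Gaussian scores (purely illustrative; the paper's actual score distributions come from the fused networks):</p>

```python
import numpy as np

def tar_at_far(genuine, impostor, far=1e-3):
    """True accept rate at a fixed false accept rate: pick the score
    threshold that accepts a fraction `far` of impostor pairs, then
    measure the fraction of genuine pairs accepted at that threshold."""
    thr = np.quantile(impostor, 1.0 - far)   # (1 - far) quantile of impostor scores
    return (np.asarray(genuine) >= thr).mean(), thr

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 5000)    # synthetic genuine-pair similarity scores
impostor = rng.normal(0.0, 1.0, 5000)   # synthetic impostor-pair similarity scores
tar, thr = tar_at_far(genuine, impostor, far=1e-3)
```

Sweeping `far` over a grid and plotting the resulting TAR values traces out the full ROC curve.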
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>While biometric recognition is an important and accurate medium of person identification, the
issue of privacy can limit its deployment. Therefore, it is necessary to use a biometric modality
that is not only accurate but also preserves the privacy of individuals. It is observed that ear
recognition has the potential to identify individuals while maintaining privacy; however, it has not
received significant attention. To advance the research in privacy-preserving identity recognition,
we first benchmarked the effectiveness of various state-of-the-art deep networks, including a
vision transformer, for unconstrained ear recognition. Later, based on the effectiveness of the different
deep neural networks, we proposed a triplet model for ear recognition. We have also presented
the first-ever Indian ethnicity ear recognition dataset. We observed that the triplet model shows
better robustness in handling image variations than conventional CNNs. In the future, we aim to
develop a novel CNN inspired by the success of our proposed amalgamated model and to increase
the size of our dataset to cover the wide demographic variation of Indians.</p>
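      <p>The triplet objective underlying the proposed model pulls an anchor embedding toward a same-identity (positive) embedding and pushes it away from a different-identity (negative) embedding by at least a margin. A minimal sketch on hypothetical 2-D embeddings (the actual network operates on learned ear descriptors):</p>

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(d(a, p) - d(a, n) + margin, 0),
    with d the Euclidean distance between embeddings."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor -> small d_ap
n = np.array([1.0, 0.0])   # far from the anchor -> large d_an
loss = triplet_loss(a, p, n)   # margin already satisfied, so loss clamps to 0
```

When the margin is already satisfied the loss is zero and the triplet contributes no gradient; only "hard" triplets that violate the margin shape the embedding space.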
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Alshazly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Linse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Martinetz</surname>
          </string-name>
          ,
          <article-title>Ensembles of deep learning models and transfer learning for ear recognition</article-title>
          ,
          <source>Sensors</source>
          <volume>19</volume>
          (
          <year>2019</year>
          )
          <fpage>4139</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dvoršak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <article-title>Kinship verification from ear images: An explorative study with deep learning models</article-title>
          ,
          <source>International Workshop on Biometrics and Forensics</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <article-title>Ear recognition: More than a survey</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>255</volume>
          (
          <year>2017</year>
          )
          <fpage>26</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Meden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <article-title>Evaluation and analysis of ear recognition models: performance, complexity and resource requirements</article-title>
          ,
          <source>Neural Computing and Applications</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Štepec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <article-title>Training convolutional neural networks with limited training data for ear recognition in the wild</article-title>
          ,
          <source>arXiv preprint arXiv:1711.09952</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Ear verification under uncontrolled conditions with convolutional neural networks</article-title>
          ,
          <source>IET Biometrics</source>
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>185</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Few-shot learning for ear recognition</article-title>
          ,
          <source>in: Proceedings of the 2019 international conference on image, video and signal processing</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Ajmera</surname>
          </string-name>
          ,
          <article-title>Convolutional neural network-based human identification using outer ear images</article-title>
          ,
          <source>in: Soft Computing for Problem Solving: SocProS</source>
          <year>2017</year>
          , Volume
          <volume>2</volume>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. E.</given-names>
            <surname>Hansley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Segundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Employing fusion of learned and handcrafted features for unconstrained ear recognition</article-title>
          ,
          <source>IET Biometrics</source>
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>215</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Štepec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <article-title>Constellation-based deep ear recognition</article-title>
          ,
          <source>Deep Biometrics</source>
          (
          <year>2020</year>
          )
          <fpage>161</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dvoršak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <article-title>Kinship verification from ear images: An explorative study with deep learning models</article-title>
          ,
          <source>in: 2022 International Workshop on Biometrics and Forensics (IWBF)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ž.</given-names>
            <surname>Emeršič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Gabriel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Štruc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peer</surname>
          </string-name>
          ,
          <article-title>Convolutional encoder-decoder networks for pixel-wise ear detection and segmentation</article-title>
          ,
          <source>IET Biometrics</source>
          <volume>7</volume>
          (
          <year>2018</year>
          )
          <fpage>175</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>MobileNetV2: Inverted residuals and linear bottlenecks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>EfficientNetV2: Smaller models and faster training</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10096</fpage>
          -
          <lpage>10106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <article-title>Xception: Deep learning with depthwise separable convolutions</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1251</fpage>
          -
          <lpage>1258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Learning transferable architectures for scalable image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>8697</fpage>
          -
          <lpage>8710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-CAM: Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>