<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ordinal Scale Evaluation of Smiling Intensity using Comparison-Based Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kei Shimonishi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kazuaki Kondo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hirotada Ueda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuichi Nakamura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyoto University</institution>
          ,
          <addr-line>Yoshida-honmachi, Sakyo, Kyoto</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ability to evaluate both explicit facial expressions and intermediate expressions is helpful for human monitoring. Since intermediate facial expressions are out of the scope of traditional studies, evaluation scores obtained from traditional facial expression recognition techniques are unreliable. In this paper, we propose an ordinal scale-based evaluation scheme for facial expression based on comparison. Because the proposed framework is based on an ordinal scale, it is challenging to construct a standard scale that can be applied to multiple individuals. However, it is expected to be effective enough to track changes in the facial expressions of a specific individual, including intermediate expressions. We also propose an algorithm for selecting reference images from the data by taking into account the consistencies of the strong-weak relationships between reference images, because the reference image selection significantly impacts the ordinal evaluation. Our approach is evaluated by conducting experiments with human annotators.</p>
      </abstract>
      <kwd-group>
        <kwd>Facial expression recognition</kwd>
        <kwd>Siamese network</kwd>
        <kwd>ranking</kwd>
        <kwd>ordinal scales</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Monitoring an individual’s Quality of Life (QOL) is becoming increasingly important to maintain good mental conditions and to detect early trends of harmful conditions. Because a direct QOL inquiry is burdensome and it is difficult to accurately represent one’s internal state, estimating the internal state from external nonverbal information is desired. Facial expression is one of the modalities that reflect an individual’s internal state, and its expression is influenced by mental condition. For example, when an individual is not feeling well, the same smile may appear weaker than usual. Therefore, monitoring facial expressions in daily life is a crucial clue to estimating an individual’s QOL.</p>
      <p>[Figure 1: An example of a transition curve of smiling intensity in daily life]</p>
      <p>
        The research field of facial expression recognition (FER) has a long history, and it has already been put into practical use in technologies such as smiling shutters. While traditional FER mainly focuses on recognizing whether a clear facial expression is represented or not, from the viewpoint of monitoring in daily life, evaluating the degree of expression for the individual is rather crucial, especially for patients with dementia who have little or no facial expressions. Based on this point of view, this research aims to draw a curve of transitions of the individual’s degree of facial expression, particularly smiling intensity, as shown in Figure 1.
      </p>
      <p>Though traditional FER algorithms seem able to evaluate intermediate facial expressions as the probability that a specific facial expression is represented, the probability values are not very reliable, especially for evaluating intermediate expressions. This is because intermediate facial expressions were out of the scope of traditional studies; learning is likely to output a value close to the binary value of either no expression (0) or an expression (1). As a result, for example, when the degree of smile expression is estimated for a series of facial expressions, the value may change abruptly over time, as shown in Figure 2. In addition, it is also difficult for a machine learning algorithm to directly learn intermediate facial expressions, since it is difficult even for humans to give appropriate absolute values for intermediate facial expressions.</p>
      <p>
        Kondo et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a network for recognizing smiling based on “comparison” to address the issues of recognizing intermediate facial expressions. Their work is based on the assumption that the problem of relatively evaluating which of two images represents more smiling by comparing the two images is easier than absolutely evaluating a degree of smiling from only one image.
      </p>
      <p>By borrowing this comparison-based idea to evaluate
facial expressions, we propose an approach to evaluate
smiling intensity with an ordinal scale. The basic idea of
this approach is that if we have multiple reference face
images for a specific individual and a method for
comparing facial expressions, we can evaluate the smiling
intensity of a new image of the individual through
pairwise comparison with the reference images, as shown in
Figure 3.</p>
      <p>Since the expression ratings in this method are based
on an ordinal scale, the degree of each rating is not the
same for multiple individuals. However, this ordinal
scale-based approach may satisfy our need to capture
changes in facial expressions for each individual.</p>
      <p>In addition, reference image selection is crucial for this ordinal-based evaluation because the reference images constitute the evaluation space for facial expressions. Therefore, we also propose an algorithm for selecting reference images from a large number of face images of each individual based on the consistencies of the comparison results within the images.</p>
      <p>In summary, the contributions of this paper are as follows:
• We propose an approach to evaluating intermediate smiling intensity by ordinal scales based on comparisons.
• We propose an algorithm for selecting appropriate reference images to construct a reliable evaluation space.</p>
      <sec id="sec-1-1">
        <title>We briefly introduce related work in the next section.</title>
        <p>Then, we introduce an approach to evaluate facial
expressions by ordinal scales and an algorithm of reference
image selection. We evaluate our approach and
algorithm with human annotators, and finally, we conclude
our research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Facial expression recognition</title>
        <sec id="sec-2-1-1">
          <title>Facial expression recognition is widely utilized in several ifelds. Traditional studies mainly focused on determining whether a specific expression is represented or not.</title>
          <p>
            2.1.1. Facial Action Coding Systems
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] is a framework
proposed by Ekman et al. that classifies a face into
several parts (Action Units; AUs) based on the basic action
units of individual muscles and describes facial
expressions as a combination of these AU actions. Many facial
expression recognition applications have used FACS as
features, and for example, OpenFace [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] can analyze
multiple facial expressions in near real-time by automatically
recognizing the actions of AUs.
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Deep neural network based approach</title>
          <p>
            Although the FACS-based FER approach has been
successful, it has the limitation that the final results are
affected by the accuracy of FACS detection. This limitation can become a problem, especially when trying to capture subtle differences in facial expressions, because the effect of observation noise cannot be ignored. On the other hand, an end-to-end approach with a deep neural network can be expected to reduce the effect of such observation noise by eliminating the necessity of
explicit feature detection. For example, VGGNet [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is a relatively traditional deep neural network, but it is known to extract human facial features well, and recent FER research has also utilized VGGNet [
            <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Siamese structure-based recognition technique</title>
        <p>
          Siamese network [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a deep neural network for metric learning. It takes two inputs and returns the distance between them. By applying the same structure and the same weights to the feature extraction layers of the two inputs and feeding the distance between the extracted features to the loss function, the network can learn a distance space. The Siamese network determines whether two inputs are similar or different and has been applied to handwritten signature
recognition [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and used as a framework for anomaly
detection [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. As one of its features, it is known as a
network that can be trained from a small number of training
data compared to conventional networks that perform
multi-valued discrimination and regression [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Kondo et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] proposed an approach to the
evaluation of facial expressions based on comparison inspired
by the Siamese structure. Their approach compares two facial images and returns which image represents more smiling, and they showed that the approach has the potential to distinguish subtle facial expression differences. In addition, Zhang et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] extended their work
from a positive-neutral direction to a negative-neutral
direction.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Comparison-based smiling evaluation by ordinal scales</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the proposed framework</title>
        <p>
          As introduced in the Introduction, the basic idea of our approach is a comparison-based evaluation. Kondo et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] developed a Siamese-based smiling recognition network that takes two face images as input and recognizes which one is expressing smiling more. By borrowing this idea, once we develop a network that can determine which of two images represents more smiles, and if we have multiple reference images, we can evaluate the smiling intensity of a new image through pairwise comparison with the reference images, as also introduced in the Introduction.
        </p>
        <p>When it comes to determining smiling intensity based on ordinal scales, although all the comparison results are ideally consistent, the results are sometimes inconsistent due to the ambiguity of slightly different face images. Therefore, we apply a voting-based evaluation and determine smiling scores by merging multiple comparison results. In addition, we propose an algorithm to select appropriate reference images to reduce the ambiguity between reference images in the following section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. A network for facial expression comparison</title>
        <sec id="sec-3-2-1">
          <title>In this paper, we defined the recognition task as a simple</title>
          <p>
            two-category classification problem (i.e., determining
which of two input images represents the greater degree
of smiling) and construct a Siamese-based network to
recognize smiling similar to the network Kondo et al.
have developed [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
        <p>
          Figure 4 shows the structure of the proposed network, which accepts two input images and returns two likelihood values corresponding to the ascension and descension labels relative to the degree of smiling. We employed the CNN component of VGG16 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and two fully connected layers with rectified linear units, a 0.25 dropout rate, and SoftMax in the proposed method. The ground-truth likelihood values for an input image pair were represented as a two-element one-hot vector, with the element corresponding to the ground-truth label set to 1 and the other element set to 0. We used categorical cross-entropy loss to optimize the network parameters, as follows:

          L_{fwd} = - \sum_{i} \{ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \},   (1)

          where i = {0, 1}, y_i, and \hat{y}_i denote the ascension and descension labels relative to the degree of smiling, the ground-truth label, and the predicted likelihood values, respectively.
        </p>
        <p>
          The previously proposed network by Kondo et al. was not designed to consider the order of inputs, resulting in instances where swapping the order of two inputs led to contradictory outputs. To address this issue, we input a permuted version of the two features extracted from the two input images by the CNN component into the fully connected layer in the latter stage and calculate the categorical cross-entropy loss of the inverted input, L_{inv}, in the same way as L_{fwd}, as shown by the red arrows in Figure 4. Also, a loss for the consistency of these two types of input is calculated as

          L_{cons} = 1 - \{ P^{f}_{asc}(I_a, I_b) \cdot P^{i}_{desc}(I_a, I_b) + P^{f}_{desc}(I_a, I_b) \cdot P^{i}_{asc}(I_a, I_b) \},   (2)

          where P_{asc}(I_a, I_b) and P_{desc}(I_a, I_b) represent the probabilities that the degree of smiling of image I_a is larger or smaller than that of image I_b, respectively; in other words, P(S(I_a) &gt; S(I_b)) and P(S(I_a) &lt; S(I_b)), where S(I) represents the degree of smiling of image I. Also, the superscripts f and i indicate the likelihoods of the forward comparison stream and the inverse comparison stream, respectively.
        </p>
        <p>
          In total, our network is trained to decrease the following loss function:

          L = L_{fwd} + L_{inv} + L_{cons}.   (3)

          Here, we expected that the CNN and the fully connected components would be trained to compare the extracted features and to project the results onto the likelihood values of the ascension and descension labels, respectively.
        </p>
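        <p>The following minimal PyTorch-style sketch illustrates how the shared feature extractor, the order-aware fully connected head, and the losses of Eqs. (1)-(3) fit together. It is an illustration rather than the authors’ released code: the backbone, the layer sizes, the label encoding, and the product form assumed for the consistency term are our assumptions.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComparisonNet(nn.Module):
    """Siamese comparison network: a shared CNN extracts features from both
    images, and a fully connected head over the concatenated pair outputs
    two likelihoods ('ascent', 'descent')."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.backbone = backbone                       # shared CNN (e.g., VGG16 features)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(256, 2),
        )

    def forward(self, img_a, img_b):
        fa = torch.flatten(self.backbone(img_a), 1)    # same weights for both inputs
        fb = torch.flatten(self.backbone(img_b), 1)
        logits_fwd = self.head(torch.cat([fa, fb], dim=1))   # forward comparison stream
        logits_inv = self.head(torch.cat([fb, fa], dim=1))   # inverse (permuted) stream
        return logits_fwd, logits_inv

def comparison_loss(logits_fwd, logits_inv, label):
    """label: 0 = 'ascent', 1 = 'descent' for the forward input order."""
    p_fwd = F.softmax(logits_fwd, dim=1)
    p_inv = F.softmax(logits_inv, dim=1)
    l_fwd = F.cross_entropy(logits_fwd, label)         # Eq. (1)
    l_inv = F.cross_entropy(logits_inv, 1 - label)     # swapping the inputs flips the label
    # Consistency between the two streams, assumed product form of Eq. (2):
    l_cons = 1.0 - (p_fwd[:, 0] * p_inv[:, 1] + p_fwd[:, 1] * p_inv[:, 0]).mean()
    return l_fwd + l_inv + l_cons                      # Eq. (3)
        </preformat>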
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Voting-based evaluation</title>
        <p>Since the reference images may include some ambiguity between neighboring images, it is difficult to directly determine the degree of smiling of a new target image within a reference image set. Therefore, we apply a voting technique to determine the final rank of the image. The algorithm votes for possible ranks using the result of each comparison between the reference images and the target image. As a result, the most likely rank should have the maximum number of votes.</p>
        <p>In particular, the procedure is as follows. Suppose that we have M reference images with a known order of degree of smiling, i.e., S(R_i) &gt; S(R_j), ∀i &lt; j. A new target image I_t is compared to all reference images, and the likelihood that the degree of smiling of the target image is larger than that of a reference image, P_{asc}(I_t, R_k), and the likelihood that it is lower, P_{desc}(I_t, R_k), are obtained for all reference images (k ∈ {1, . . . , M}). Because, if S(R_k) &lt; S(I_t) is estimated, the smile rank of I_t is estimated to be larger than k, large values are voted to the ranks larger than k. In practice, we add the likelihood values of “ascend” and “descend” to the ranks lower than and higher than k, respectively.</p>
        <p>Simply thinking, the degree of smiling relative to the reference images can be determined by searching for the position whose score is maximum. Here, the position r̂ can be derived as

        r̂ = arg max_r v(r),   (4)

        where v(r) denotes the total likelihood voted to rank r.</p>
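        <p>As a rough sketch of the voting procedure above (assuming a helper compare(a, b) that returns the “ascent”/“descent” likelihoods of the trained network, with the vote directions chosen to be consistent with the description in this subsection):</p>
        <preformat>
import numpy as np

def estimate_rank(target, references, compare):
    """Voting-based ordinal evaluation of a target image.

    references: reference images sorted from the strongest to the weakest
    smile. compare(a, b) returns (p_asc, p_desc), the likelihoods that
    image a smiles more / less than image b.
    """
    n = len(references)
    votes = np.zeros(n + 1)               # candidate positions between/around references
    for k, ref in enumerate(references):
        p_asc, p_desc = compare(target, ref)
        votes[:k + 1] += p_asc            # target smiles more: vote for positions above ref
        votes[k + 1:] += p_desc           # target smiles less: vote for positions below ref
    return int(np.argmax(votes))          # Eq. (4): the position with the maximum score
        </preformat>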
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Reference image selection</title>
      <p>To apply the voting-based ordinal-scale evaluation described above, we first need to construct an evaluation space with several reference images. Since the proposed approach utilizes ordinal scales, the construction of the evaluation space is crucial for the capability of the approach. Although a straightforward way is to utilize all the face data as reference images, an evaluation space constructed from very similar or subtly different images is unreliable due to the ambiguity of these images.</p>
      <p>In this paper, we first consider all the data as baseline images, take pair-wise comparisons to sort all the data in a dataset, and construct a baseline ranking. Then, we select several images from the baseline ranking as reference images and quantize the evaluation space by taking consistency into account to address the issues caused by ambiguity.</p>
      <sec id="sec-4-1">
        <title>4.1. Baseline ranking construction</title>
        <p>Figure 6 (a) shows a comparison table of the results of all pair-wise comparisons among the baseline images. Each color shows the probability that a target image has a stronger smile than a reference image, i.e., P_{asc}(I_t, I_r). Blue shows a pair whose target image has a stronger smile than the reference image, i.e., P_{asc}(I_t, I_r) &gt; P_{desc}(I_t, I_r). In contrast, the red area shows a pair whose reference image has a stronger smile than the target image, i.e., P_{asc}(I_t, I_r) &lt; P_{desc}(I_t, I_r). The white area represents that the target and the reference images represent similar facial expressions.</p>
        <p>[Figure: vertical axis “Reference image”; legend “Consistent” / “Inconsistent”; (a) consistency of strong-weak relations in the baseline ranking images, (b) consistency square of neighboring images]</p>
        <p>By sorting the baseline images based on the sum of the probability values in each column of this table, a baseline ranking considering the consistency of the strong-weak relationships can be constructed (Figure 6 (b)). In particular, suppose we have N baseline images {I_1, . . . , I_N} in total, and denote the images sorted in descending order of smiling intensity as {B_1, . . . , B_N}. Since the strong-weak relationships between each image B_j and the other images B_ĵ, ĵ ∈ {1, . . . , N}, are calculated as the probability values P_{asc} and P_{desc}, the total consistency value of the baseline ranking is derived as

        C = \sum_{j} \{ \sum_{ĵ ∈ {1, . . . , j−1}} c_{j,ĵ} + \sum_{ĵ ∈ {j+1, . . . , N}} c_{j,ĵ} \},   (5)

        where the pairwise consistency c_{j,ĵ} is defined in Eq. (6) below. By maximizing this total consistency over possible orderings, the baseline ranking images (B_1, . . . , B_N) can be obtained. From now on, the subscript j will be used to sort the images in descending order of smiling degree.</p>
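        <p>A minimal sketch of the baseline ranking construction follows. The column-sum sorting key mirrors the description above; the array layout and tie handling are assumptions.</p>
        <preformat>
import numpy as np

def build_baseline_ranking(p_asc):
    """p_asc[i, j]: probability that image i smiles more strongly than image j,
    i.e., one cell of the comparison table of Figure 6 (a)."""
    weakness = p_asc.sum(axis=0)      # column sum: how strongly the others beat image j
    order = np.argsort(weakness)      # small column sum = strong smile
    return order                      # order[0] is the strongest smile (B_1)
        </preformat>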
        <p>An example of the consistency of the strong-weak relations in this rearranged table is shown in Figure 6 (c), obtained by replacing P_{asc}(B_j, B_ĵ) with P_{desc}(B_j, B_ĵ) when S(B_j) &lt; S(B_ĵ). We here denote the probabilities c_{j,ĵ} indicating this consistency as follows:

        c_{j,ĵ} = P_{asc}(B_j, B_ĵ) if S(B_j) &gt; S(B_ĵ), and c_{j,ĵ} = P_{desc}(B_j, B_ĵ) if S(B_j) &lt; S(B_ĵ).   (6)

        Ideally, all cells would be blue, i.e., the consistency would be nearly equal to 1. However, due to the ambiguity of comparison results for similar facial expressions, there is also ambiguity in the consistency between neighboring images in the baseline ranking. Therefore, reference image selection is important to construct an appropriate evaluation space.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Reference image selection</title>
        <p>An important factor in selecting reference images is the consistency of the strong-weak relationships within the reference images. That is, when the consistency table is calculated in the same way as the bottom-right figure of Figure 6, a smaller red and white area is a better sign of reference image selection.</p>
        <p>To realize that, we focus on a square region of neighboring images, as shown in Figure 7, and call this square a consistency square. Suppose the differences between images are significant and a strong-weak relationship is evident in the images. In that case, the consistency values in the square are also expected to be large, i.e., the cells in the consistency square become blue. In contrast, when images are similar, and therefore the difference between images is ambiguous, the consistency values in the consistency square become low, i.e., the cells in the square become white and red.</p>
        <p>The basic idea of building a consistent evaluation space is to quantize images with low consistency values in the consistency square into a single class. As a result, the ambiguity between these images becomes “don’t care” in the evaluation space, and the consistency of the evaluated values becomes significant. That is, it is good to select images for which the total consistency value summed over the consistency squares, as shown on the right of Figure 7, becomes low. In addition, neighboring images in the baseline ranking should not be selected as reference images. In other words, to select good reference images, the evaluation space should be divided evenly. To realize that, we select a group of reference images so that the sum of the areas of the consistency squares becomes small. To sum up, it is better to choose a group of images for which both the sum of consistency values and the sum of areas within the consistency squares are small. Here, selecting a group of images with low consistency values within the consistency squares is equivalent to selecting a group with high inconsistency. In summary, a group of reference images should be selected to maximize the total inconsistency in the consistency squares divided by the sum of the areas of the consistency squares.</p>
        <p>In practice, the procedure of reference image selection is as follows. Suppose there are N images in total, and we want to select one image as a reference image. At first, the baseline ranking is constructed as introduced in the previous subsection, and the consistency values c_{j,ĵ} (j, ĵ ≤ N) are obtained for all pairs in the ranking. When the evaluation space is divided into two with image B_k (k &lt; N), the sum of the inconsistency values of the two spaces divided by the sum of their areas is calculated as

        C^{(2)}_k = ( \sum_{j,ĵ ≤ k} (1 − c_{j,ĵ}) + \sum_{j,ĵ &gt; k} (1 − c_{j,ĵ}) ) / ( k^2 + (N − k)^2 ).   (7)

        By searching for the position k whose C^{(2)}_k is maximum, the best reference image B_k that divides the evaluation space into two can be obtained. Similarly, by calculating the sum of the inconsistencies in the consistency squares divided by the sum of the areas for different division numbers M + 1, M reference images can be obtained.</p>
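        <p>A brute-force sketch of this selection criterion is shown below, assuming that the inconsistency of a pair is 1 − c_{j,ĵ} and that the area of a consistency square of size m is m². The dynamic programming scheme mentioned next replaces the exhaustive search over cut positions.</p>
        <preformat>
import itertools
import numpy as np

def split_score(c, cuts):
    """Assumed form of Eq. (7), generalised to several cuts: total inconsistency
    (1 - c) inside the consistency squares divided by their total area.
    c[j, k] is the pairwise consistency of ranked images j and k."""
    n = c.shape[0]
    bounds = [0, *cuts, n]
    inconsistency, area = 0.0, 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        block = c[lo:hi, lo:hi]
        inconsistency += (1.0 - block).sum()
        area += (hi - lo) ** 2
    return inconsistency / area

def select_references(c, n_refs):
    """Choose n_refs cut positions that maximise the score (exhaustive search)."""
    n = c.shape[0]
    return max(itertools.combinations(range(1, n), n_refs),
               key=lambda cuts: split_score(c, cuts))
        </preformat>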
        <p>When it comes to calculating these values, we apply a scheme of dynamic programming to reduce the calculation cost.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <sec id="sec-5-1">
        <title>We conducted an experiment to evaluate the following things:</title>
        <p>training data in this paper. In particular, we manually
an• How a proposed network can evaluate image pair. notated segments that we thought the degree of smiling
• How appropriately reference images can be se- ascended or descended monotonically. We then picked
lected regarding consistency of both network and each segment’s start and end frame to construct one pair
human annotators’ evaluations. with its label. The number of image pairs of each dataset
• How the selected reference images evaluate face were 216, 174, and 123, respectively. Also, all face images
images. in these pairs were utilized as baseline images. That is,
the size of the dataset is twice the size of the image pairs;
432, 348, 246, respectively.</p>
        <sec id="sec-5-1-1">
          <title>5.1. Dataset construction</title>
          <p>At first, the face image dataset was constructed by capturing participants’ face images. We conducted two types of experiments to construct datasets with different situations. In the first type of dataset, we asked a participant to sit in front of the camera and to listen to a funny radio program. In the second type of dataset, we asked a participant to sit in front of a laptop PC and play a simple game. We captured facial images of these participants during the experiments. The second experiment was still experimental but closer to a natural scene than the first one. Each dataset was constructed from only one participant because our focus was to build a model to evaluate each individual. We collected two datasets of the first type and one dataset of the second type.</p>
          <p>Then, we added labels between image pairs that showed which of the two images expressed more smiles. The annotation between images with a slight difference in the degree of smiling was difficult, even for humans. It might cause a mistake in giving the correct labels. Therefore, we utilized image pairs with a clear difference as training data in this paper. In particular, we manually annotated segments in which we thought the degree of smiling ascended or descended monotonically. We then picked each segment’s start and end frame to construct one pair with its label. The numbers of image pairs of the datasets were 216, 174, and 123, respectively. Also, all face images in these pairs were utilized as baseline images. That is, the size of each dataset is twice the number of image pairs: 432, 348, and 246, respectively.</p>
        </sec>
        <sec id="sec-5-2">
          <title>5.2. Evaluation scheme</title>
          <p>The procedure of the evaluation consisted of the following four steps: (1) the proposed network was trained for each individual on the collected data and was evaluated by a cross-validation scheme; (2) we constructed the baseline ranking and evaluated the voting-based algorithm by determining the rank of each image within the baseline ranking; (3) reference images were selected from the baseline images as proposed in the previous section and evaluated by human annotators regarding how consistent they were; (4) we confirmed the smiling intensity of face images in the evaluation space constructed by the reference images.</p>
          <p>
            As for training our network, we utilized the pre-trained feature extraction layers of VGG-Face [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], which was trained on millions of face images for person identification, and trained only the fully connected layers.
          </p>
          <p>Regarding the evaluation of the voting-based algorithm, the rank of each baseline image was determined within the baseline ranking itself. In this evaluation, the ground truth of the rank of each baseline image was given as its original rank in the baseline ranking.</p>
          <p>As for evaluating the selected reference images, nine reference images were selected from the baseline images. Then, human annotators were asked to evaluate which of the two images represented more smiling for pairs of neighboring images in the reference images. The images up to the third nearest neighbor were considered a pair, and all image pairs were annotated twice by swapping the left and right sides of the comparison image. After comparing the image pairs within each dataset, the participants moved on to the next dataset. The order of the evaluated image pairs was randomized for each dataset, but the order of the datasets was constant. We evaluated the consistency of the reference images by how accurately and consistently the annotators evaluated the image pairs. A group of 9 images regularly extracted every N/10 from the baseline images {B_{N/10}, . . . , B_{9N/10}} was used as the reference images for comparison. Seven participants between the ages of 21 and 27 (6 male and 1 female) were recruited as annotators.</p>
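          <p>A minimal PyTorch-style sketch of this transfer-learning setup (freezing the pre-trained feature extractor and training only the fully connected head) is shown below; the optimizer choice and learning rate are assumptions, and ComparisonNet refers to the sketch in Section 3.2.</p>
          <preformat>
import torch

def make_optimizer(model, lr=1e-4):
    """Keep the pre-trained face-feature backbone fixed (VGG-Face in the paper)
    and optimise only the fully connected head."""
    for p in model.backbone.parameters():
        p.requires_grad = False                    # frozen feature extraction layers
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)      # assumed optimiser for the FC head
          </preformat>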
        </sec>
        <sec id="sec-5-1-2">
          <title>5.3. Results</title>
          <sec id="sec-5-3-1">
            <title>5.3.1. Prediction accuracy</title>
            <p>
              We first show the evaluation results of the trained comparison network in terms of accuracy. In this evaluation, five-fold cross-validation was applied, and the prediction accuracies for the three datasets were 99.5% (215/216), 100% (174/174), and 98.3% (121/123), respectively. Figure 8 shows examples of prediction results with Gradient-weighted Class Activation Mapping (Grad-CAM) [
              <xref ref-type="bibr" rid="ref13">13</xref>
              ]. In each figure, the first row shows the ground-truth label and the estimation results, and the second row shows the regions on which the network focuses for the prediction as a heat map. From these results, we can see that the network returns accurate prediction results by correctly focusing on face regions, including the mouth and eyes, which are well known to correspond to smiling, even with the small dataset.
            </p>
          </sec>
          <sec id="sec-5-3-2">
            <title>5.3.2. Consistency of voting-based evaluation</title>
            <p>Figure 9 shows four examples of estimated ranks of images in the baseline ranking, rank 1, 100, 200, and 400 of dataset 1. The total number of baseline images of dataset 1 was 432 (216 × 2), and an almost correct rank can be estimated for each image by the voting algorithm. Figure 10 shows all pairs of estimated rank and ground-truth rank of this evaluation in dataset 1. These results show the consistency and effectiveness of the voting-based algorithm, as it predicted almost consistent values for all baseline images.</p>
            <p>[Figure 9: Selected reference images of dataset 1. More smiling images are located on the left side.]</p>
            <p>[Figure 10: Consistency of estimated rank and original rank in baseline ranking]</p>
          </sec>
          <sec id="sec-5-3-3">
            <title>5.3.3. Selected reference images</title>
            <p>Figure 11 shows the reference images selected by the proposed algorithm and those picked up equally from the baseline ranking. The consistency table corresponding to this result for dataset 1 is shown in Figure 12. In this figure, the green line shows where the algorithm divides the baseline ranking. These results show that there still appears to be some ambiguity between adjacent images, even with the proposed approach. However, it appears to be reduced compared to a group of images acquired at regular intervals.</p>
            <p>[Figure 11: (a) Selected reference images by the proposed algorithm; (b) selected reference images regularly picked up from the baseline ranking. Images range from the strongest to the weakest smile.]</p>
            <p>The consistency table calculated from these selected reference images is shown in Figure 13. We can see that almost all the cells are blue. This result shows that the ambiguities within the reference images are small.</p>
            <p>Figure 14 and Figure 15 show the quantitative evaluation results of the selected reference images by the annotators. In each figure, “proposed” and “baseline” represent the results for the images up to the third nearest neighbor reference images of the proposed algorithm and the baseline algorithm, respectively, and “proposed_adjacent” and “baseline_adjacent” represent the results for only the nearest neighbor reference images; that is, the difficulty of the evaluation becomes higher. Figure 14 shows the prediction accuracy of the proposed network, where the order given by the annotators’ evaluation results is considered the ground truth of the prediction. Therefore, this result also shows the correlation between the network predictions and human perceptions. Figure 15 shows the consistency of each participant’s evaluation. In particular, it shows how much the same evaluation was given when the same image pair was displayed with the left and right sides swapped. This high consistency indicates a low degree of ambiguity between image pairs. In almost all cases, the reference images selected by the proposed algorithm obtain higher accuracies and higher consistencies. Since the smiles expressed during the experimental time were quantized into ten levels and the maximum value of the smile was not very high, both methods have a certain degree of similarity between neighboring reference image pairs. Therefore, evaluation by humans may be somewhat difficult even with the proposed method. However, even in such a situation, we can confirm that the proposed method selects more consistent reference images.</p>
          </sec>
        </sec>
      <sec id="sec-5-2">
        <title>Examples of face images evaluated by selected reference</title>
        <p>images and the proposed network are shown in Figure 16.</p>
        <p>Figure 12: Consistency table of dataset 1. The green line Since it is sometimes hard to qualitatively evaluate two
shows where the algorithm divides baseline ranking. adjacent images in a row, the four reference images skip
one rank at a time. The images with a smile level one
class lower than the reference image are listed, and each
row shows the same evaluation value. The images on the
images selected by the proposed algorithm obtain higher left side of the figure are recognized as having a higher
accuracies and higher consistencies. Since the smiles ex- degree of smiling. This result confirms that the proposed
pressed in the experimental time were quantized into ten method efectively evaluates the degree of smiling within
levels and the maximum value of the smile was not very the ordinal scale.
high, both methods have a certain degree of similarity be- Finally, a part of the transition of the smiling intensity
tween the neighboring reference image pairs. Therefore, during the experiment is shown in Figure 17. In this
evaluation by humans may be somewhat dificult even result, an evaluation score was smoothed by the median
with the proposed method. However, even in such a situ- iflter to trace the trend of transitions. We can see that
ation, we can confirm that the proposed method selects
Evaluated
images</p>
        <p>Strongest smile</p>
        <p>Weakest smile
the participant smiles several times in this period. It can
be seen that smiles of slightly stronger intensity than
the middle level occurred several times in succession in
the first half of this period. In comparison, smiles of
considerably stronger intensity occurred with a short
interval in the second half of this period.</p>
      </sec>
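      <p>A small sketch of the median-filter smoothing used for the transition curve (the kernel size is an assumption):</p>
      <preformat>
import numpy as np
from scipy.signal import medfilt

def smooth_intensity(ranks, kernel_size=9):
    """Median-filter the sequence of estimated smiling ranks to trace the
    trend of transitions, as in Figure 17 (kernel size assumed)."""
    return medfilt(np.asarray(ranks, dtype=float), kernel_size=kernel_size)
      </preformat>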
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose an approach to evaluate the
degree of smiling of individuals by ordinal scales based
on multiple comparisons for the purpose of monitoring
individuals. Supposing that we have enough face image data of each individual, we also propose an algorithm for selecting appropriate reference images for the ordinal evaluation.</p>
      <p>Experimental results show that our ordinal scale-based
evaluation can successfully give the degree of not only
clear smiling but also intermediate facial expressions. In
addition, we can see that an evaluation space constructed from the reference images selected by our algorithm is more consistent and, therefore, considered to be reasonable.</p>
      <p>One of the future works is to map the proposed and
constructed ordinal scale to some physical index.
Although this paper proposed a method of selecting
reference images that are somewhat reasonable when
evaluated by humans, the validity of the scale would be
improved if it could be mapped to some physical index.
For example, by measuring the myoelectricity of facial
muscles, the degree of muscle activity could be used as
an index. In addition, another future work is to apply this technique to people whose facial expressions do not change much, e.g., dementia patients, as we described in the introduction section.</p>
    </sec>
    <sec id="sec-7">
      <title>Ethics</title>
      <p>Our method aims to monitor the daily health conditions
of a specific individual by evaluating the smiling intensity
using a model trained specifically for the individual’s
facial images. Since data for model training and smiling
intensity evaluation can be collected and processed at
terminals installed in each individual’s environment, it
is expected to reduce the risk of leakage of particularly
strong personal information such as facial images being
stored in the cloud in practical applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kondo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satoh</surname>
          </string-name>
          ,
          <article-title>Siamese-structure deep neural network recognizing changes in facial expression according to the degree of smiling</article-title>
          ,
          <source>in: Proc. of ICPR2020</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4605</fpage>
          -
          <lpage>4612</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICPR48806.
          <year>2021</year>
          .
          <volume>9411988</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <article-title>Facial action coding system (FACS)</article-title>
          ,
          <source>A Human Face</source>
          ,
          <year>2002</year>
          . URL: https://ci.nii.ac.jp/naid/10025007347/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Amos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ludwiczuk</surname>
          </string-name>
          , M. Satyanarayanan,
          <article-title>OpenFace: A general-purpose face recognition library with mobile applications</article-title>
          ,
          <source>Technical Report, CMUCS-16-118</source>
          , CMU School of Computer Science,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>in: Proc. of ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Atabansi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Transfer learning technique with vgg-16 for near-infrared facial expression recognition</article-title>
          ,
          <source>Journal of Physics: Conference Series</source>
          <year>1873</year>
          (
          <year>2021</year>
          )
          <article-title>012033</article-title>
          . URL: https://dx. doi.org/10.1088/
          <fpage>1742</fpage>
          -
          <lpage>6596</lpage>
          /
          <year>1873</year>
          /1/012033. doi:
          <volume>10</volume>
          . 1088/
          <fpage>1742</fpage>
          -
          <lpage>6596</lpage>
          /
          <year>1873</year>
          /1/012033.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Facial expression recognition model based on improved vggnet</article-title>
          ,
          <source>in: 2023 4th International Conference on Electronic Communication and Artificial Intelligence (ICECAI)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>408</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICECAI58670.
          <year>2023</year>
          .
          <volume>10177007</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bromley</surname>
          </string-name>
          , I. Guyon,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , E. Säckinger,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Signature verification using a "siamese" time delay neural network</article-title>
          ,
          <source>in: Proc. of NIPS'93</source>
          ,
          <year>1993</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bromley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bentz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , I. Guyon,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lecun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sackinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Signature verification using a "siamese" time delay neural network</article-title>
          ,
          <source>International Journal of Pattern Recognition and Artificial Intelligence</source>
          <volume>7</volume>
          (
          <year>1993</year>
          )
          <article-title>25</article-title>
          . doi:
          <volume>10</volume>
          .1142/ S0218001493000339.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems</article-title>
          ,
          <source>IEEE Transactions on Industrial Informatics</source>
          <volume>17</volume>
          (
          <year>2021</year>
          )
          <fpage>5790</fpage>
          -
          <lpage>5798</lpage>
          . doi:
          <volume>10</volume>
          .1109/TII.
          <year>2020</year>
          .
          <volume>3047675</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Siamese neural networks for one-shot image recognition</article-title>
          ,
          <source>in: Proc. of the deep learning workshop in the 32nd International Conference on Machine Learning</source>
          , volume
          <volume>2</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shimonishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kondo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <article-title>Facial expression change recognition on neutralnegative axis based on siamese-structure deep neural network, in: Cross-Cultural Design. Product and Service Design, Mobility</article-title>
          and
          <string-name>
            <given-names>Automotive</given-names>
            <surname>Design</surname>
          </string-name>
          , Cities,
          <string-name>
            <given-names>Urban</given-names>
            <surname>Areas</surname>
          </string-name>
          , and Intelligent Environments Design: 14th International Conference, CCD 2022,
          <article-title>Held as Part of the 24th HCI International Conference</article-title>
          ,
          <string-name>
            <surname>HCII</surname>
          </string-name>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Deep face recognition</article-title>
          ,
          <source>in: British Machine Vision Conference</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          , Grad-cam:
          <article-title>Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>128</volume>
          (
          <year>2019</year>
          )
          <fpage>336</fpage>
          -
          <lpage>359</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-019-01228-7.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>