Ordinal Scale Evaluation of Smiling Intensity using Comparison-Based Network

Kei Shimonishi1,*, Kazuaki Kondo1, Hirotada Ueda1 and Yuichi Nakamura1
1 Kyoto University, Yoshida-honmachi, Sakyo, Kyoto, Japan

Abstract
The ability to evaluate both explicit facial expressions and intermediate expressions is helpful for human monitoring. Since intermediate facial expressions are out of the scope of traditional studies, evaluation scores obtained from traditional facial expression recognition techniques are unreliable. In this paper, we propose an ordinal scale-based evaluation scheme for facial expressions based on comparison. Because the proposed framework is based on an ordinal scale, it is challenging to construct a standard scale that can be applied across multiple individuals. However, it is expected to be effective enough to track changes in the facial expressions of a specific individual, including intermediate expressions. Because reference image selection significantly impacts the ordinal evaluation, we also propose an algorithm for selecting reference images from the data by taking into account the consistency of the strong-weak relationships between reference images. Our approach is evaluated by conducting experiments with human annotators.

Keywords
Facial expression recognition, Siamese network, ranking, ordinal scales

1. Introduction

Monitoring an individual's Quality of Life (QOL) is becoming increasingly important for maintaining good mental condition and for detecting early signs of harmful conditions. Because direct QOL inquiries are burdensome and it is difficult to accurately report one's own internal state, estimating the internal state from external nonverbal information is desirable. Facial expression is one of the modalities that reflects an individual's internal state and is expressed under the influence of one's mental condition.
For example, when an individual is not feeling well, the same smile may appear weaker than usual. Therefore, monitoring facial expressions in daily life is a crucial clue for estimating an individual's QOL.

Figure 1: An example of a transition curve of smiling intensity in daily life.

The research field of facial expression recognition (FER) has a long history, and it has already been put into practical use, for example as the smile shutter of cameras. While traditional FER mainly focuses on recognizing whether a clear facial expression is present or not, from the viewpoint of monitoring in daily life, evaluating the degree of expression for the individual is rather crucial, especially for patients with dementia who show little or no facial expression. Based on this point of view, this research aims to draw a curve of the transitions of an individual's degree of facial expression, particularly smiling intensity, as shown in Figure 1.

Though a traditional FER algorithm may seem able to evaluate intermediate facial expressions as the probability that a specific facial expression is present, the probability values are not very reliable, especially for evaluating intermediate expressions. This is because intermediate facial expressions were out of the scope of traditional studies; learning is likely to output a value close to the binary value of either no expression (0) or a full expression (1). As a result, when the degree of smiling is estimated for a series of facial expressions, the value may change abruptly over time, as shown in Figure 2. In addition, it is also difficult for a machine learning algorithm to directly learn intermediate facial expressions, since it is difficult even for humans to give appropriate absolute values to intermediate facial expressions.

Figure 2: An example of a sudden jump of evaluation scores for intermediate facial expressions by a traditional facial expression recognition technique.

Kondo et al. [1] proposed a network for recognizing smiling based on "comparison" to address the issues of recognizing intermediate facial expressions. Their work is based on the assumption that relatively evaluating which of two images represents more smiling by comparing them is easier than absolutely evaluating the degree of smiling from a single image.

By borrowing this comparison-based idea, we propose an approach to evaluate smiling intensity on an ordinal scale. The basic idea of this approach is that if we have multiple reference face images for a specific individual and a method for comparing facial expressions, we can evaluate the smiling intensity of a new image of that individual through pairwise comparison with the reference images, as shown in Figure 3. Since the expression ratings in this method are based on an ordinal scale, the degree of each rating is not the same across individuals. However, this ordinal scale-based approach may satisfy our need to capture changes in facial expressions for each individual.

In addition, reference image selection is crucial for this ordinal evaluation because the reference images constitute the evaluation space for facial expressions. Therefore, we also propose an algorithm for selecting reference images from a large number of face images of each individual, based on the consistency of the comparison results within the images.

In summary, the contributions of this paper are as follows:

• We propose an approach for evaluating intermediate smiling intensity by ordinal scales based on comparisons.
• We propose an algorithm for selecting appropriate reference images to construct an appropriate evaluation space.

We briefly introduce related work in the next section. Then, we introduce an approach to evaluate facial expressions by ordinal scales and an algorithm for reference image selection. We evaluate our approach and algorithm with human annotators, and finally, we conclude our research.

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
* Corresponding author.
shimonishi@i.kyoto-u.ac.jp (K. Shimonishi); kondo@ccm.media.kyoto-u.ac.jp (K. Kondo); ueda.hirotada.2r@kyoto-u.ac.jp (H. Ueda); yuichi@media.kyoto-u.ac.jp (Y. Nakamura)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related Work

2.1. Facial expression recognition

Facial expression recognition is widely utilized in several fields. Traditional studies mainly focused on determining whether a specific expression is present or not.

2.1.1. Facial Action Coding System

The Facial Action Coding System (FACS) [2] is a framework proposed by Ekman et al. that decomposes the face into several parts (Action Units; AUs) based on the basic actions of individual muscles and describes facial expressions as combinations of these AU actions. Many facial expression recognition applications have used FACS as features; for example, OpenFace [3] can analyze multiple facial expressions in near real-time by automatically recognizing the actions of AUs.

Figure 3: Overview of the evaluation method of facial expression intensity based on comparison.

2.1.2. Deep neural network based approach

Although the FACS-based FER approach has been successful, it has the limitation that the final results are affected by the accuracy of FACS detection. This limitation can become a problem, especially when trying to capture subtle differences in facial expressions, because the effect of observation noise cannot be ignored. On the other hand, an end-to-end approach using a deep neural network can be expected to reduce the effect of such observation noise by eliminating the need for explicit feature detection. For example, VGGNet [4] is a traditional deep neural network, but it is known to extract human facial features well, and recent FER research also utilizes VGGNet [5, 6].

2.2. Siamese structure-based recognition technique

The Siamese network [7] is a deep neural network for metric learning. It takes two inputs and returns the distance between them. By applying the same structure and the same weights to the feature extraction layers of the two inputs, and by feeding the distance between the two extracted features to the loss function, the network can learn a distance space. The Siamese network determines whether two inputs are similar or different; it has been applied to handwritten signature recognition [8] and used as a framework for anomaly detection [9]. It is also known to be trainable from a small amount of training data compared to conventional networks that perform multi-valued discrimination or regression [10].

Kondo et al. [1] proposed an approach to the evaluation of facial expressions based on comparison, inspired by the Siamese structure. Their approach compares two facial images and returns which image represents more smiling, and they showed that the approach has the potential to distinguish subtle facial expression differences. In addition, Zhang et al. [11] extended their work from the positive-neutral direction to the negative-neutral direction.

3. Comparison-based smiling evaluation by ordinal scales

3.1. Overview of the proposed framework

As introduced in the Introduction, the basic idea of our approach is comparison-based evaluation. Kondo et al. [1] developed a Siamese-based smiling recognition network that takes two face images as input and recognizes which one expresses more smiling. By borrowing this idea, once we develop a network that can determine which of two images represents more smiling, and if we have multiple reference images, we can evaluate the smile intensity of a new image through pairwise comparison with the reference images, as also introduced in the Introduction. When determining smiling intensity on an ordinal scale, although all comparison results are ideally consistent, the results are sometimes inconsistent due to the ambiguity of slightly different face images. Therefore, we apply a voting-based evaluation and determine smiling scores by merging multiple comparison results. In addition, we propose an algorithm to select appropriate reference images to reduce the ambiguity between reference images in the following section.

3.2. A network for facial expression comparison

In this paper, we define the recognition task as a simple two-category classification problem (i.e., determining which of two input images represents the greater degree of smiling) and construct a Siamese-based network to recognize smiling, similar to the network Kondo et al. developed [1].

Figure 4: Siamese-based network to compare face images to evaluate the degree of smiling.

Figure 4 shows the structure of the proposed network, which accepts two input images and returns two likelihood values corresponding to the "ascent" and "descent" labels relative to the degree of smiling. We employed the CNN component of VGG16 [4] and two fully connected layers with rectified linear units, a 0.25 dropout rate, and softmax. The ground-truth likelihood values for an input image pair were represented as a two-element one-hot vector, with the element corresponding to the ground-truth label set to 1 and the other element set to 0. We used the categorical cross-entropy loss to optimize the network parameters:

L_cat = - Σ_i { y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) },   (1)

where i ∈ {0, 1} indexes the ascent and descent labels relative to the degree of smiling, and y_i and ŷ_i denote the ground-truth label and the predicted likelihood value, respectively.

The network previously proposed by Kondo et al. was not designed to consider the order of its inputs, resulting in instances where swapping the order of the two inputs led to contradictory outputs. To address this issue, we feed a permuted version of the two features extracted from the input images by the CNN component into the fully connected layers of the latter stage and calculate the categorical cross-entropy loss of this inverted input, L_inv, in the same way as L_cat, as shown by the red arrows in Figure 4. In addition, a loss for the consistency of these two types of input is calculated as

L_con = 1 - { min(P_f^As(tgt, ref), P_i^As(tgt, ref)) + min(P_f^Des(tgt, ref), P_i^Des(tgt, ref)) },   (2)

where P^As(n, m) and P^Des(n, m) represent the probabilities that the degree of smiling of image I_n is larger or smaller, respectively, than that of image I_m; in other words, P(sn(I_n) > sn(I_m)) and P(sn(I_n) < sn(I_m)), where sn(I) denotes the degree of smiling of image I. P_f and P_i represent the likelihoods of the forward comparison stream and the inverse comparison stream, respectively.

In total, our network is trained to minimize the following loss function:

L = L_cat + L_inv + L_con.   (3)

Here, we expect the CNN and fully connected components to be trained to compare the extracted features and to project the comparison results onto the likelihood values of the ascent and descent labels, respectively.

3.3. Voting-based evaluation

Since the reference images may include some ambiguity between neighboring images, it is difficult to directly determine the degree of smiling of a new target image within a reference image set. Therefore, we apply a voting technique to determine the final rank of the image. The algorithm votes for possible ranks using the result of each comparison between the reference images and the target image. As a result, the most likely rank should receive the maximum number of votes.

Figure 5: Voting-based evaluation.

In particular, the procedure is as follows. Suppose that we have N reference images ordered by degree of smiling, i.e., sn(I_i) > sn(I_j), ∀i < j. A new target image I_new is compared to all reference images, and for every reference image (n ∈ {1, ..., N}) we obtain the likelihood P^As(new, n) that the degree of smiling of the target image is larger than that of reference image I_n, and the likelihood P^Des(new, n) that it is lower. If sn(I_new) < sn(I_n) is estimated, the smile rank of I_new is estimated to be larger than n, so large values are voted to the ranks larger than n. In practice, the likelihood values of "ascent" and "descent" are added to the ranks lower than and higher than n, respectively.

Simply put, the rank of the target image relative to the reference images can then be determined by searching for the position whose score is maximum. This position r is derived as:

r = arg max_r { Σ_{n=1}^{r-1} P^Des(new, n) + Σ_{n=r+1}^{N} P^As(new, n) }.   (4)

In addition, we apply a mean-shift algorithm to determine the evaluation score based on these probability values.

4. Reference image selection

To apply the voting-based ordinal-scale evaluation described above, we first need to construct an evaluation space with several reference images. Since the proposed approach utilizes ordinal scales, the construction of the evaluation space is crucial for the capability of the approach. Although a straightforward way is to utilize all the face data as reference images, an evaluation space constructed from very similar or only subtly different images is unreliable due to the ambiguity of these images.

In this paper, we first treat all the data as baseline images and perform pairwise comparisons to sort all the data in a dataset and construct a baseline ranking. Then, we select several images from the baseline ranking as reference images and quantize the evaluation space, taking consistency into account to address the issues caused by ambiguity.

4.1. Baseline ranking construction

Figure 6: A baseline ranking made from a comparison table. (a) Comparison table; (b) strong-weak relations between images; (c) consistency of strong-weak relations between images.

Figure 6 (a) shows a comparison table of the results of all pairwise comparisons among the baseline images. Each color shows the probability that a target image has a stronger smile than a reference image, i.e., P^As(tgt, ref). Blue indicates a pair whose target image has a stronger smile than the reference image, i.e., P^As(tgt, ref) > P^Des(tgt, ref). In contrast, red indicates a pair whose reference image has a stronger smile than the target image, i.e., P^As(tgt, ref) < P^Des(tgt, ref). White indicates that the target and reference images represent similar facial expressions.

By sorting the baseline images based on the sum of the probability values in each column of this table, a baseline ranking considering the consistency of the strong-weak relationships can be constructed (Figure 6 (b)). In particular, suppose we have N baseline images {I_1, ..., I_N} in total, and denote the images sorted in descending order of smiling intensity as {I_1^rank, ..., I_N^rank}.
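As an illustrative sketch (not the authors' implementation), the column-sum ranking just described and the voting rule of Eq. (4) can be written with NumPy as follows. The matrix layout, the sort direction, and the tie handling are our own assumptions:

```python
import numpy as np

def baseline_ranking(p_as):
    """Sort baseline images into a strong-to-weak smiling order.

    p_as[t, r] is assumed to hold P_As(t, r), the probability that image t
    smiles more strongly than image r.  The larger the sum of column r, the
    more images beat image r, so one consistent choice is to sort the column
    sums in ascending order to obtain a strongest-first ranking.
    """
    return np.argsort(p_as.sum(axis=0))

def vote_rank(p_as_new, p_des_new):
    """Voting-based rank of a new image against N ordered references
    (rank 1 = strongest smile), following Eq. (4): at candidate rank r,
    references above r should vote 'descent' and references below r
    should vote 'ascent'."""
    n = len(p_as_new)
    scores = [p_des_new[:r - 1].sum() + p_as_new[r:].sum()
              for r in range(1, n + 1)]
    return int(np.argmax(scores)) + 1  # rank in {1, ..., N}
```

For example, a target whose comparisons say "weaker than references 1-2, stronger than references 4-5" receives its maximum vote at rank 3.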
Since the strong-weak relationship between each image I_n and the other images I_n̂, n̂ ∈ {1, ..., N} \ {n}, is calculated as the probability values P^As and P^Des, the total consistency of the baseline ranking is derived as

L = Σ_n { Σ_{n̂ : I_n̂ ∈ {I_1^rank, ..., I_{n-1}^rank}} P^Des(n, n̂) + Σ_{n̂ : I_n̂ ∈ {I_{n+1}^rank, ..., I_N^rank}} P^As(n, n̂) }.   (5)

By maximizing this total consistency, the baseline ranking (I_1^rank, ..., I_N^rank) = arg max L can be obtained. From now on, the subscript n is used to index the images sorted in descending order of smiling degree.

An example of the consistency of the strong-weak relations in this rearranged table is shown in Figure 6 (c), obtained by replacing P^As(tgt, ref) with P^Des(tgt, ref) when sn(I_tgt) < sn(I_ref). We denote the probability C_{tgt,ref} indicating this consistency as follows:

C_{tgt,ref} = P^As(tgt, ref)  if sn(tgt) > sn(ref),
C_{tgt,ref} = P^Des(tgt, ref) if sn(tgt) < sn(ref).   (6)

Ideally, all cells would be blue, i.e., the consistency would be nearly equal to 1. However, due to the ambiguity of comparison results for similar facial expressions, there is also ambiguity in the consistency between neighbors in the baseline ranking. Therefore, reference image selection is important for constructing an appropriate evaluation space.

4.2. Reference image selection

An important factor in selecting reference images is the consistency of the strong-weak relationships within the reference images. That is, when the consistency table is calculated as in Figure 6 (c), less red and white area is a better sign for reference image selection. To evaluate this, we focus on square regions of neighboring images, as shown in Figure 7, and call each such square a consistency square.

Figure 7: Consistencies of strong-weak relationships among neighboring images as a part of the consistency table in the baseline ranking.

Suppose the differences between images are significant and the strong-weak relationship is evident in the images. In that case, the consistency values in the consistency square are expected to be large, i.e., the cells in the square become blue. In contrast, when images are similar, and the difference between them is therefore ambiguous, the consistency values in the consistency square become low, i.e., the cells in the square become white or red.

The basic idea of building a consistent evaluation space is to quantize images with low consistency values in the consistency square into a single class. As a result, the ambiguity between these images becomes "don't care" in the evaluation space, and the consistency of the evaluated values becomes significant. That is, it is good to divide the ranking so that the total consistency value within the consistency squares, as shown on the right of Figure 7, becomes low; equivalently, so that the inconsistency captured inside the squares becomes high. In addition, neighboring images in the baseline ranking should not be selected together as reference images; in other words, to select good reference images, the evaluation space should be divided evenly. To realize this, the division should also keep the total area of the consistency squares small (for a fixed number of divisions, this favors an even division). In summary, a group of reference images should be selected so as to maximize the total inconsistency within the consistency squares divided by the sum of the areas of the consistency squares.

In practice, the procedure of reference image selection is as follows. Suppose there are N images in total and we want to select one image as a reference image. First, the baseline ranking is constructed as introduced in the previous subsection, and the consistency values C_{i,j} (i, j ∈ {1, ..., N}) are obtained for all pairs in the ranking. When the evaluation space is divided into two at image I_m (m < N), the sum of the inconsistency values of the two divided spaces, normalized by the sum of their areas, is calculated as:

D_m^(2) = { Σ_{i=1}^{m} Σ_{j=1}^{m} (1 - C_{i,j}) + Σ_{i=m+1}^{N} Σ_{j=m+1}^{N} (1 - C_{i,j}) } / { m^2 + (N - m)^2 }.   (7)

By searching for the position at which D_m^(2) is maximum, the best reference image I_m that divides the evaluation space into two can be obtained. Similarly, by calculating the sum of the inconsistencies in the consistency squares divided by the sum of their areas for different division numbers N_ref + 1, N_ref reference images can be obtained. When calculating these values, we apply a dynamic programming scheme to reduce the computational cost.
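The two-way split score of Eq. (7) can be sketched as follows. This is a rough illustration under our own assumptions about the matrix layout, and it uses exhaustive search over the split position rather than the dynamic programming scheme mentioned above:

```python
import numpy as np

def split_score(c, m):
    """D^(2)_m of Eq. (7): the inconsistency (1 - C) summed inside the two
    consistency squares obtained by dividing an N-image ranking after
    position m, normalized by the total area of the squares."""
    n = c.shape[0]
    incon = 1.0 - c
    top = incon[:m, :m].sum()      # square over ranks 1..m
    bottom = incon[m:, m:].sum()   # square over ranks m+1..N
    return (top + bottom) / (m ** 2 + (n - m) ** 2)

def best_split(c):
    """Split position maximizing D^(2)_m, found by exhaustive search
    (the paper reduces this cost with dynamic programming)."""
    n = c.shape[0]
    return max(range(1, n), key=lambda m: split_score(c, m))
```

With a consistency matrix in which ranks 1-3 and ranks 4-6 are mutually ambiguous (C ≈ 0.5) but clearly ordered across the two groups (C ≈ 0.95), the split lands between ranks 3 and 4, grouping the ambiguous pairs inside the "don't care" squares.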
5. Experiment

We conducted an experiment to evaluate the following:

• How well the proposed network can evaluate an image pair.
• How appropriately reference images can be selected regarding the consistency of both the network's and human annotators' evaluations.
• How well the selected reference images can evaluate face images.

5.1. Dataset construction

First, face image datasets were constructed by capturing participants' face images. We conducted two types of experiments to construct datasets for different situations. For the first type of dataset, we asked a participant to sit in front of the camera and listen to a funny radio program. For the second type, we asked a participant to sit in front of a laptop PC and play a simple game. We captured facial images of the participants during the experiments. The second setting was still experimental, but closer to a natural scene than the first one. Each dataset was constructed from only one participant because our focus was to build a model to evaluate each individual. We collected two datasets of the first type and one dataset of the second type.

Then, we added labels to image pairs indicating which of the two images expressed more smiling. Annotation between images with only a slight difference in the degree of smiling was difficult even for humans and might lead to incorrect labels. Therefore, we utilized image pairs with a clear difference as training data in this paper. In particular, we manually annotated segments in which we judged that the degree of smiling ascended or descended monotonically, and picked each segment's start and end frames to construct one pair with its label. The numbers of image pairs in the three datasets were 216, 174, and 123, respectively. All face images in these pairs were also utilized as baseline images; that is, the size of each dataset is twice the number of image pairs: 432, 348, and 246, respectively.

5.2. Evaluation scheme

The evaluation procedure consisted of the following four steps: (1) the proposed network was trained for each individual on the collected data and was evaluated by a cross-validation scheme; (2) we constructed the baseline ranking and evaluated the voting-based algorithm by determining the rank of each image within the baseline ranking; (3) reference images were selected from the baseline images as proposed in the previous section, and human annotators evaluated how consistent they were; (4) we confirmed the smiling intensity of face images in the evaluation space constructed by the reference images.

For training our network, we utilized the pre-trained feature extraction layers of VGG-Face [12], which was trained on millions of face images for person identification, and trained only the fully connected layers.

Regarding the evaluation of the voting-based algorithm, the rank of each baseline image was determined by the baseline ranking itself. In this evaluation, the ground truth of the rank of each baseline image was given as its original rank in the baseline ranking.

For evaluating the selected reference images, nine reference images were selected from the baseline images. Human annotators were then asked to evaluate which of two images represented more smiling for pairs of neighboring reference images. Images up to the third nearest neighbor were considered as pairs, and all image pairs were annotated twice by swapping the left and right sides of the compared images. After comparing the image pairs within each dataset, the participants moved on to the next dataset. The order of the evaluated image pairs was randomized for each dataset, but the order of the datasets was constant. We evaluated the consistency of the reference images by how accurately and consistently the annotators evaluated the image pairs. As a comparison method, a group of nine images regularly extracted every N/10 images from the baseline ranking was used as reference images. Seven participants between the ages of 21 and 27 (6 male and 1 female) were recruited as annotators.

Figure 9: Selected reference images of dataset 1. More smile images are located on the left side.

5.3. Results

5.3.1. Prediction accuracy

We first show the evaluation results of the trained comparison network in terms of accuracy. In this evaluation, five-fold cross-validation was applied, and the prediction accuracies for the three datasets were 99.5% (215/216), 100% (174/174), and 98.3% (121/123), respectively. Figure 8 shows examples of prediction results with Gradient-weighted Class Activation Mapping (Grad-CAM) [13].

Figure 8: Examples of prediction results by the comparison network. The first row shows labels and prediction results, and the second row shows the regions on which the network focuses, visualized by Grad-CAM.

In each figure, the first row shows the ground-truth label and the estimation result, and the second row shows the regions on which the network focuses for the prediction as a heat map. From these results, we can see that the network returns accurate predictions by correctly focusing on face regions, including the mouth and eyes, which are well known to correspond to smiling, even with the small dataset.

5.3.2. Consistency of voting-based evaluation

Figure 9 shows four examples of estimated ranks of images in the baseline ranking: ranks 1, 100, 200, and 400 of dataset 1. The total number of baseline images was 216 × 2 = 432. These images were estimated as ranks 1, 103, 199, and 398, respectively, so nearly correct ranks can be estimated by the voting algorithm. Figure 10 shows all pairs of estimated rank and ground-truth rank in dataset 1. These results show the consistency and effectiveness of the voting-based algorithm, as it predicted almost consistent values for all baseline images.

Figure 10: Consistency of estimated rank and original rank in the baseline ranking.

5.3.3. Selected reference images

Figure 11 shows the reference images selected by the proposed algorithm and those picked at equal intervals from the baseline ranking.

Figure 11: Selected reference images of dataset 1. More smile images are located on the left side. (a) Selected by the proposed algorithm; (b) regularly picked from the baseline ranking.

The consistency table corresponding to this result for dataset 1 is shown in Figure 12. In this figure, the green lines show where the algorithm divides the baseline ranking. These results show that some ambiguity still appears between adjacent images, even with the proposed approach. However, it appears to be reduced compared to the group of images acquired at regular intervals.

Figure 12: Consistency table of dataset 1. The green lines show where the algorithm divides the baseline ranking.

The consistency table calculated from the selected reference images is shown in Figure 13. We can see that almost all cells are blue. This result shows that the ambiguities within the reference images are small.

Figure 13: Consistency table calculated from the selected reference images of dataset 1.

Figures 14 and 15 show the quantitative evaluation results of the selected reference images by the annotators. In each figure, "proposed" and "baseline" represent the results for image pairs up to the third nearest neighbor among the reference images of the proposed and baseline algorithms, respectively, while "proposed_adjacent" and "baseline_adjacent" represent the results for pairs of only the nearest neighbor reference images; that is, the evaluation becomes more difficult. Figure 14 shows the prediction accuracy of the evaluation results. Here, we consider the order given by the proposed network as the ground truth of the prediction; therefore, this result also shows the correlation between the network's predictions and human perception.

Figure 14: Accuracies of the annotators' evaluations. "Adjacent" means the image pair consists of nearest neighbor images.

Figure 15 shows the consistency of each participant's evaluation; in particular, it shows how often the same evaluation was given when the same image pair was displayed with the left and right sides swapped. High consistency indicates a low degree of ambiguity between image pairs.

Figure 15: Consistencies of the annotators' evaluations of the same image pair. "Adjacent" means the image pair consists of nearest neighbor images.

In almost all cases, the reference images selected by the proposed algorithm obtained higher accuracies and higher consistencies. Since the smiles expressed during the experimental sessions were quantized into ten levels and the maximum smile intensity was not very high, both methods yield a certain degree of similarity between neighboring reference image pairs. Therefore, evaluation by humans may be somewhat difficult even with the proposed method. However, even in such a situation, we can confirm that the proposed method selects image pairs with higher accuracy than the comparison method.

5.3.4. Smiling intensity evaluation

Examples of face images evaluated with the selected reference images and the proposed network are shown in Figure 16. Since it is sometimes hard to qualitatively evaluate two adjacent images in a row, the four reference images skip one rank at a time. The images with a smile level one class lower than the reference image are listed, and each row shows the same evaluation value. The images on the left side of the figure are recognized as having a higher degree of smiling. This result confirms that the proposed method effectively evaluates the degree of smiling within the ordinal scale.

Figure 16: Example results of the evaluation of the degree of smiling.

Finally, a part of the transition of the smiling intensity during the experiment is shown in Figure 17. In this result, the evaluation score was smoothed by a median filter to trace the trend of transitions. We can see that the participant smiles several times in this period. Smiles of slightly stronger intensity than the middle level occurred several times in succession in the first half of this period, while smiles of considerably stronger intensity occurred at short intervals in the second half.

Figure 17: A part of the transition of the degree of smiling. Each grid line of the time axis represents 10 seconds.

6. Conclusion

In this paper, we proposed an approach to evaluate the degree of smiling of individuals by ordinal scales based on multiple comparisons, for the purpose of monitoring individuals. Supposing that we have enough data of an individual's face images, we also proposed an algorithm for selecting appropriate reference images for the ordinal evaluation. Experimental results show that our ordinal scale-based evaluation can successfully give the degree of not only clear smiling but also intermediate facial expressions. In addition, the evaluation space constructed from the reference images selected by our algorithm is more consistent and is therefore considered reasonable.

One direction for future work is to map the proposed ordinal scale to some physical index. Although this paper proposed a method of selecting reference images that is reasonably consistent with human evaluation, the validity of the scale would be improved if it could be mapped to a physical index. For example, by measuring the myoelectric activity of facial muscles, the degree of muscle activity could be used as such an index. Another direction is to apply this technique to people whose facial expressions do not change much, e.g., dementia patients, as described in the Introduction.

Ethics

Our method aims to monitor the daily health condition of a specific individual by evaluating smiling intensity using a model trained specifically on that individual's facial images. Since the data for model training and for smiling intensity evaluation can be collected and processed at terminals installed in each individual's environment, the risk of leaking particularly sensitive personal information, such as facial images stored in the cloud, is expected to be reduced in practical applications.

References

[1] K. Kondo, T. Nakamura, Y. Nakamura, S. Satoh, Siamese-structure deep neural network recognizing changes in facial expression according to the degree of smiling, in: Proc. of ICPR2020, 2021, pp. 4605-4612. doi:10.1109/ICPR48806.2021.9411988.
[2] P. Ekman, Facial action coding system (FACS), A Human Face (2002). URL: https://ci.nii.ac.jp/naid/10025007347/.
[3] B. Amos, B. Ludwiczuk, M. Satyanarayanan, OpenFace: A general-purpose face recognition library with mobile applications, Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.
[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. of ICLR 2015, San Diego, CA, USA, May 7-9, 2015.
[5] C. C. Atabansi, T. Chen, R. Cao, X. Xu, Transfer learning technique with VGG-16 for near-infrared facial expression recognition, Journal of Physics: Conference Series 1873 (2021) 012033. doi:10.1088/1742-6596/1873/1/012033.
[6] Y. Liu, Facial expression recognition model based on improved VGGNet, in: 2023 4th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), 2023, pp. 404-408. doi:10.1109/ICECAI58670.2023.10177007.
[7] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, in: Proc. of NIPS'93, 1993, pp. 737-744.
[8] J. Bromley, J. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence 7 (1993) 25. doi:10.1142/S0218001493000339.
[9] X. Zhou, W. Liang, S. Shimizu, J. Ma, Q.
Jin, Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems, IEEE Transactions on Industrial Informatics 17 (2021) 5790–5798. doi:10.1109/TII.2020.3047675.
[10] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: Proc. of the Deep Learning Workshop at the 32nd International Conference on Machine Learning, volume 2, 2015.
[11] J. Zhang, K. Shimonishi, K. Kondo, Y. Nakamura, Facial expression change recognition on neutral-negative axis based on Siamese-structure deep neural network, in: Cross-Cultural Design. Product and Service Design, Mobility and Automotive Design, Cities, Urban Areas, and Intelligent Environments Design: 14th International Conference, CCD 2022, Held as Part of the 24th HCI International Conference, HCII 2022, 2022, pp. 583–598.
[12] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference, 2015.
[13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, International Journal of Computer Vision 128 (2019) 336–359. doi:10.1007/s11263-019-01228-7.
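The voting-based rank estimation whose consistency is reported above can be illustrated with a small sketch. This is our own minimal illustration, not the authors' implementation: `compare(a, b)` is a stand-in for the comparison network's strong-weak judgment (assumed here to return a positive value when `a` shows the stronger smile), and rank 1 denotes the strongest smile.

```python
def estimate_rank_by_voting(target, references, compare):
    """Estimate the ordinal rank of `target` among `references` by voting.

    `compare(a, b)` stands in for the comparison network: it is assumed
    to return a positive value when `a` shows a stronger smile than `b`.
    Rank 1 corresponds to the strongest smile.
    """
    # Each reference judged weaker than the target is one vote
    # toward a stronger (smaller) rank for the target.
    stronger_votes = sum(1 for ref in references if compare(target, ref) > 0)
    return len(references) - stronger_votes + 1


# Toy usage with numeric smile scores in place of face images:
numeric_compare = lambda a, b: a - b
refs = [0.1, 0.3, 0.5, 0.7, 0.9]  # weakest to strongest
print(estimate_rank_by_voting(0.8, refs, numeric_compare))  # rank 2 of 6
```

With five references, a target can be voted into one of six ordinal slots; the real system aggregates many such network comparisons instead of a single numeric one.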
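The left-right swap consistency reported in Figure 15 can likewise be computed with a short helper. The data layout below is our assumption, not the paper's: for each pair, we record the image an annotator chose as the stronger smile in the original presentation and again with the sides swapped, and consistency is the fraction of pairs judged identically both times.

```python
def swap_consistency(judgments):
    """Fraction of image pairs judged identically in both presentations.

    `judgments` is a list of (choice_original, choice_swapped) tuples,
    where each element identifies the image chosen as the stronger smile.
    """
    agree = sum(1 for original, swapped in judgments if original == swapped)
    return agree / len(judgments)


# Three pairs judged consistently, one inconsistently:
print(swap_consistency([("A", "A"), ("B", "B"), ("A", "B"), ("C", "C")]))  # 0.75
```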
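Finally, the median filtering used to trace the smiling-intensity trend in Figure 17 can be sketched as follows. This is a generic sliding median filter with edge replication; the window size is a parameter of our sketch, as the paper does not state the one it used.

```python
def median_smooth(scores, window=5):
    """Smooth an evaluation-score sequence with a sliding median filter,
    replicating the edge values so the output keeps the input length."""
    half = window // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [sorted(padded[i:i + window])[half] for i in range(len(scores))]


# An isolated spike is suppressed while the overall trend is kept:
print(median_smooth([2, 2, 9, 2, 2, 3, 3], window=3))  # [2, 2, 2, 2, 2, 3, 3]
```

Because the median ignores isolated outliers, single-frame misjudgments by the network do not disturb the transition curve the way they would under mean smoothing.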