=Paper=
{{Paper
|id=Vol-3649/Paper4
|storemode=property
|title=Ordinal Scale Evaluation of Smiling Intensity using Comparison-Based Network
|pdfUrl=https://ceur-ws.org/Vol-3649/Paper4.pdf
|volume=Vol-3649
|authors=Kei Shimonishi,Kazuaki Kondo,Hirotada Ueda,Yuichi Nakamura
|dblpUrl=https://dblp.org/rec/conf/aaai/ShimonishiKU024
}}
==Ordinal Scale Evaluation of Smiling Intensity using Comparison-Based Network==
Kei Shimonishi1,*, Kazuaki Kondo1, Hirotada Ueda1 and Yuichi Nakamura1

1 Kyoto University, Yoshida-honmachi, Sakyo, Kyoto, Japan
Abstract
The ability to evaluate both explicit and intermediate facial expressions is helpful for human monitoring. Since intermediate facial expressions are outside the scope of traditional studies, the evaluation scores obtained from traditional facial expression recognition techniques are unreliable for them. In this paper, we propose an ordinal scale-based evaluation scheme for facial expressions based on comparisons. Because the proposed framework rests on an ordinal scale, it is difficult to construct a standard scale that can be applied across multiple individuals; however, the framework is expected to be effective for tracking changes in the facial expressions of a specific individual, including intermediate expressions. Because the selection of reference images significantly impacts the ordinal evaluation, we also propose an algorithm for selecting reference images from the data that takes into account the consistency of the strong-weak relationships between reference images. Our approach is evaluated through experiments with human annotators.
Keywords
Facial expression recognition, Siamese network, ranking, ordinal scales
1. Introduction

Monitoring an individual's Quality of Life (QOL) is becoming increasingly important for maintaining good mental condition and for detecting early trends toward harmful conditions. Because direct QOL inquiries are burdensome and it is difficult to accurately report one's own internal state, estimating the internal state from external nonverbal information is desirable. Facial expression is one of the modalities that reflects an individual's internal state, and its expression is influenced by mental condition. For example, when an individual is not feeling well, the same smile may appear weaker than usual. Therefore, monitoring facial expressions in daily life provides a crucial clue for estimating an individual's QOL.

The research field of facial expression recognition (FER) has a long history, and the technology has already been put into practical use, for example in smiling shutters. While traditional FER mainly focuses on recognizing whether a clear facial expression is present or not, from the viewpoint of monitoring in daily life, evaluating the degree of expression for the individual is rather crucial, especially for patients with dementia who show little or no facial expression. Based on this point of view, this research aims to draw a curve of the transitions of an individual's degree of facial expression, particularly smiling intensity, as shown in Figure 1.

Figure 1: An example of a transition curve of smiling intensity in daily life (smiling intensity plotted over time).

Though a traditional FER algorithm may seem able to evaluate intermediate facial expressions as the probability that a specific facial expression is present, the probability values are not reliable, especially for evaluating intermediate expressions. This is because intermediate facial expressions were out of the scope of traditional studies; learning is likely to output values close to the binary extremes of either no expression (0) or a full expression (1). As a result, when the degree of smile expression is estimated for a series of facial expressions, the value may change abruptly over time, as shown in Figure 2. In addition, it is also difficult for a machine learning algorithm to directly learn intermediate facial expressions, since it is difficult even for humans to assign appropriate absolute values to them.

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
* Corresponding author.
$ shimonishi@i.kyoto-u.ac.jp (K. Shimonishi); kondo@ccm.media.kyoto-u.ac.jp (K. Kondo); ueda.hirotada.2r@kyoto-u.ac.jp (H. Ueda); yuichi@media.kyoto-u.ac.jp (Y. Nakamura)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Kondo et al. [1] proposed a network for recognizing smiling based on "comparison" to address the issues of recognizing intermediate facial expressions. Their work is based on the assumption that relatively evaluating which of two images represents the stronger smile by comparing them is easier than absolutely evaluating the degree of smiling from a single image.

Figure 2: An example of a sudden jump in the evaluation scores for intermediate facial expressions produced by a traditional facial expression recognition technique (smiling score plotted over time).

By borrowing this comparison-based idea for evaluating facial expressions, we propose an approach that evaluates smiling intensity on an ordinal scale. The basic idea of this approach is that, if we have multiple reference face images of a specific individual and a method for comparing facial expressions, we can evaluate the smiling intensity of a new image of that individual through pairwise comparison with the reference images, as shown in Figure 3.

Figure 3: Overview of the evaluation method of facial expression intensity based on comparison.

Since the expression ratings in this method are based on an ordinal scale, the degree of each rating is not the same across individuals. However, this ordinal scale-based approach may satisfy our need to capture changes in facial expressions for each individual.

In addition, reference image selection is crucial for this ordinal evaluation because the reference images constitute the evaluation space for facial expressions. Therefore, we also propose an algorithm that selects reference images from a large number of face images of each individual based on the consistency of comparison results among the images.

In summary, the contributions of this paper are as follows:

• We propose an approach to evaluating intermediate smiling intensity on ordinal scales based on comparisons.
• We propose an algorithm for selecting appropriate reference images to construct an appropriate evaluation space.

We briefly introduce related work in the next section. Then, we introduce an approach to evaluating facial expressions by ordinal scales and an algorithm for reference image selection. We evaluate our approach and algorithm with human annotators and, finally, conclude our research.

2. Related Work

2.1. Facial expression recognition

Facial expression recognition is widely utilized in several fields. Traditional studies mainly focused on determining whether a specific expression is present or not.

2.1.1. Facial Action Coding System

The Facial Action Coding System (FACS) [2] is a framework proposed by Ekman et al. that decomposes a face into several parts (Action Units; AUs) based on the basic actions of individual muscles and describes facial expressions as combinations of these AU actions. Many facial expression recognition applications have used FACS as features; for example, OpenFace [3] can analyze multiple facial expressions in near real time by automatically recognizing AU actions.

2.1.2. Deep neural network based approach

Although the FACS-based FER approach has been successful, it has the limitation that the final results are affected by the accuracy of FACS detection. This limitation can become a problem especially when trying to capture subtle differences in facial expressions, because the effect of observation noise cannot be ignored. On the other hand, an end-to-end approach with a deep neural network can be expected to reduce the effect of such observation noise by eliminating the need for explicit feature detection. For example, VGGNet [4] is a classic deep neural network, but it is known to extract human facial features well, and recent FER research has also utilized VGGNet [5, 6].

2.2. Siamese structure-based recognition technique

The Siamese network [7] is a deep neural network for metric learning. It takes two inputs and returns the distance between them. By applying the same structure and the same weights to the feature extraction layers of the two inputs and feeding their distance to the loss function, the network can learn a distance space. The Siamese network determines whether two inputs are similar or different; it has been applied to handwritten signature recognition [8] and used as a framework for anomaly detection [9]. One of its notable properties is that it can be trained from a small amount of training data compared with conventional networks that perform multi-valued discrimination or regression [10].
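As a minimal illustration of the shared-weight idea, the following sketch applies one encoder to both inputs and returns the embedding distance that a metric-learning loss would operate on. The architecture and embedding size are illustrative assumptions, not details of the cited works.

```python
# Minimal Siamese-structure sketch: one encoder applied with shared
# weights to both inputs; a metric-learning loss (e.g., contrastive
# loss) would operate on the returned distance. Architecture choices
# here are illustrative assumptions only.
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # The same weights embed both inputs; the pairwise distance
        # between embeddings defines the learned metric space.
        za, zb = self.net(a), self.net(b)
        return torch.norm(za - zb, dim=1)
```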
Figure 4: Siamese-based network to compare face images and evaluate the degree of smiling. The target and reference images pass through shared CNN and FC layers; a forward comparison stream and an inverse comparison stream (with the two extracted features swapped) each output 'Ascent'/'Descent' likelihoods, trained against the ground truth with a categorical cross-entropy loss.
Kondo et al. [1] proposed an approach to the evaluation of facial expressions based on comparison, inspired by the Siamese structure. Their approach compares two facial images and returns which of the two represents the stronger smile, and they showed that the approach has the potential to distinguish subtle differences in facial expression. In addition, Zhang et al. [11] extended their work from the positive-neutral direction to the negative-neutral direction.

3. Comparison-based smiling evaluation by ordinal scales

3.1. Overview of the proposed framework

As introduced in the Introduction, the basic idea of our approach is comparison-based evaluation. Kondo et al. [1] developed a Siamese-based smiling recognition network that takes two face images as input and recognizes which one expresses the stronger smile. Borrowing this idea, once we have a network that can determine which of two images represents the stronger smile, and if we have multiple reference images, we can evaluate the smiling intensity of a new image through pairwise comparison with the reference images, as also introduced in the Introduction. When determining smiling intensity on an ordinal scale, all comparison results are ideally consistent; in practice, however, they are sometimes inconsistent due to the ambiguity of slightly different face images. Therefore, we apply a voting-based evaluation and determine smiling scores by merging multiple comparison results. In addition, we propose an algorithm that selects appropriate reference images to reduce the ambiguity between reference images, described in the following sections.

3.2. A network for facial expression comparison

In this paper, we define the recognition task as a simple two-category classification problem (i.e., determining which of two input images represents the greater degree of smiling) and construct a Siamese-based network to recognize smiling, similar to the network developed by Kondo et al. [1].

Figure 4 shows the structure of the proposed network, which accepts two input images and returns two likelihood values corresponding to the ascent and descent labels relative to the degree of smiling. We employ the CNN component of VGG16 [4] and two fully connected layers with rectified linear units, a 0.25 dropout rate, and a SoftMax layer. The ground-truth likelihood values for an input image pair are represented as a two-element one-hot vector, with the element corresponding to the ground-truth label set to 1 and the other element set to 0. We use the categorical cross-entropy loss to optimize the network parameters:

$$L_{cat} = -\sum_{i} \left\{ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right\}, \qquad (1)$$

where $i \in \{0, 1\}$ indexes the ascent and descent labels relative to the degree of smiling, and $y_i$ and $\hat{y}_i$ denote the ground-truth label and the predicted likelihood value, respectively.

The network previously proposed by Kondo et al. was not designed to consider the order of its inputs, resulting in instances where swapping the order of the two inputs led to contradictory outputs. To address this issue, we also feed a permuted version of the two features extracted from the input images by the CNN component into the fully connected layers in the latter stage and calculate the categorical cross-entropy loss of this inverted input, $L_{inv}$, in the same way as $L_{cat}$, as shown by the red arrows in Figure 4. Also, a consistency loss between these two input orderings is calculated as
$$L_{con} = 1 - \left\{ \min\left(P_f^{As}(tgt, ref),\, P_i^{As}(tgt, ref)\right) + \min\left(P_f^{Des}(tgt, ref),\, P_i^{Des}(tgt, ref)\right) \right\}, \qquad (2)$$

where $P^{As}(n, m)$ and $P^{Des}(n, m)$ represent the probabilities that the degree of smiling of image $I_n$ is larger or smaller than that of image $I_m$, respectively; in other words, $P(sn(I_n) > sn(I_m))$ and $P(sn(I_n) < sn(I_m))$, where $sn(I)$ represents the degree of smiling of image $I$. Also, $P_f$ and $P_i$ represent the likelihoods of the forward comparison stream and the inverse comparison stream, respectively.

In total, our network is trained to decrease the following loss function:

$$L = L_{cat} + L_{inv} + L_{con}. \qquad (3)$$

Here, we expect the CNN component to be trained to extract features suitable for comparison, and the fully connected components to be trained to project the comparison onto the likelihood values of the ascent and descent labels.
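To make the training objective concrete, the following is a minimal sketch of the two-stream comparison network and the combined loss of Eqs. (1)-(3), written for a PyTorch setting. Only the VGG16 CNN component, the two ReLU fully connected layers, the 0.25 dropout, and the SoftMax come from the paper; the feature dimension, layer widths, and function names are our assumptions.

```python
# Hedged sketch of the comparison network of Section 3.2 and the
# combined loss of Eqs. (1)-(3). Layer widths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class ComparisonNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = vgg16(weights=None).features      # shared CNN component
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(                     # shared FC head
            nn.Linear(2 * 512, 256), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(256, 2),                       # 'ascent'/'descent' logits
        )

    def forward(self, img_tgt, img_ref):
        x_tgt = self.pool(self.cnn(img_tgt)).flatten(1)
        x_ref = self.pool(self.cnn(img_ref)).flatten(1)
        logit_f = self.fc(torch.cat([x_tgt, x_ref], 1))  # forward stream
        logit_i = self.fc(torch.cat([x_ref, x_tgt], 1))  # inverse (permuted) stream
        return logit_f, logit_i

def total_loss(logit_f, logit_i, label):
    """label: 0 = 'ascent' (target smiles more), 1 = 'descent'."""
    l_cat = F.cross_entropy(logit_f, label)          # Eq. (1), forward order
    l_inv = F.cross_entropy(logit_i, 1 - label)      # same loss, inverted order
    p_f = logit_f.softmax(1)
    p_i = logit_i.softmax(1).flip(1)                 # align inverse outputs with forward labels
    l_con = 1.0 - torch.minimum(p_f, p_i).sum(1).mean()  # Eq. (2)
    return l_cat + l_inv + l_con                     # Eq. (3)
```

At inference time, the softmax of the forward stream would provide $P^{As}(tgt, ref)$ and $P^{Des}(tgt, ref)$ for a pair.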
3.3. Voting-based evaluation

Since the reference images may include some ambiguity between neighboring images, it is difficult to directly determine the degree of smiling of a new target image within a reference image set. Therefore, we apply a voting technique to determine the final rank of the image. The algorithm casts votes for possible ranks using the result of each comparison between the reference images and the target image. As a result, the most likely rank should receive the maximum number of votes.

In particular, the procedure is as follows. Suppose that we have $N$ reference images ordered by degree of smiling, i.e., $sn(I_i) > sn(I_j), \forall i < j$. A new target image $I_{new}$ is compared with all reference images, and for every reference image ($n \in \{1, \ldots, N\}$) we obtain the likelihood $P^{As}(new, n)$ that the target image smiles more than reference image $n$ and the likelihood $P^{Des}(new, n)$ that it smiles less. If $sn(I_{new}) < sn(I_n)$ is estimated, the smile rank of $I_{new}$ is estimated to be larger than $n$, so large values are voted to the ranks larger than $n$. In practice, we add the likelihood values of "ascent" and "descent" to the ranks lower than and higher than $n$, respectively.

Figure 5: Voting-based evaluation.

Simply put, the degree of smiling relative to the reference images can be determined by searching for the position whose vote score is maximal. This position $r$ can be derived as

$$r = \arg\max_{r} \left\{ \sum_{n=1}^{r-1} P^{Des}(new, n) + \sum_{n=r+1}^{N} P^{As}(new, n) \right\}. \qquad (4)$$

In addition, we apply a mean-shift algorithm to determine the evaluation score based on these probability values.
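The voting rule of Eq. (4) amounts to a short scan over candidate positions. Below is a minimal sketch under the assumption that the two likelihood arrays come from the comparison network of Section 3.2, with 0-based indexing and references sorted from the strongest to the weakest smile.

```python
# Hedged sketch of the voting-based rank assignment (Eq. 4).
# p_as[n]: likelihood the target smiles MORE than reference n;
# p_des[n]: likelihood it smiles LESS. References are assumed sorted
# from the strongest smile (index 0) to the weakest.
import numpy as np

def vote_rank(p_as: np.ndarray, p_des: np.ndarray) -> int:
    n_ref = len(p_as)
    scores = np.zeros(n_ref + 1)
    for r in range(n_ref + 1):
        # references ranked above slot r should answer 'descent',
        # references ranked below slot r should answer 'ascent'
        scores[r] = p_des[:r].sum() + p_as[r:].sum()
    return int(np.argmax(scores))  # most-voted position of the target
```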
Figure 6: A baseline ranking made from a comparison table: (a) the comparison table with ranking scores obtained as column sums, (b) the strong-weak relations between images after re-arranging, and (c) the consistency of the strong-weak relations between images, where cells whose target-reference order disagrees with the ranking are inconsistent.

Figure 7: Consistencies of the strong-weak relationships among neighboring images, forming part of the consistency table of the baseline ranking.

4. Reference image selection

To apply the voting-based ordinal-scale evaluation described above, we first need to construct an evaluation space from several reference images. Since the proposed approach utilizes ordinal scales, the construction of the evaluation space is crucial for the capability of the approach. Although a straightforward way is to utilize all the face data as reference images, an evaluation space constructed from very similar or only subtly different images is unreliable due to the ambiguity of those images.

In this paper, we first consider all the data as baseline images and perform pairwise comparisons to sort all the data in a dataset and construct a baseline ranking. Then, we select several images from the baseline ranking as reference images and quantize the evaluation space, taking consistency into account to address the issues caused by ambiguity.

4.1. Baseline ranking construction

Figure 6 (a) shows a comparison table containing the results of all pairwise comparisons among the baseline images. Each color shows the probability that a target image has a stronger smile than a reference image, i.e., $P^{As}(tgt, ref)$. Blue shows a pair whose target image has a stronger smile than the reference image, i.e., $P^{As}(tgt, ref) > P^{Des}(tgt, ref)$. In contrast, the red area shows a pair whose reference image has a stronger smile than the target image, i.e., $P^{As}(tgt, ref) < P^{Des}(tgt, ref)$. The white area represents pairs in which the target and the reference images show similar facial expressions.
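In code, the comparison table is simply an $N \times N$ matrix of ascent probabilities; `predict_pair` below is an assumed helper (our naming, not the paper's) that wraps the forward stream of the comparison network.

```python
# Hedged sketch: building the pairwise comparison table of Figure 6 (a).
# predict_pair(img_a, img_b) -> (p_ascent, p_descent) is an assumed
# helper wrapping the comparison network's forward stream.
import numpy as np

def comparison_table(images, predict_pair):
    n = len(images)
    p_as = np.zeros((n, n))
    for t in range(n):
        for r in range(n):
            if t != r:
                p_as[t, r], _ = predict_pair(images[t], images[r])
    return p_as  # p_as[t, r] = P_As(tgt=t, ref=r)
```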
By sorting the baseline images based on the sum of the probability values in each column of this table, a baseline ranking that takes the consistency of the strong-weak relationships into account can be constructed (Figure 6 (b)). In particular, suppose we have $N$ baseline images $\{I_1, \ldots, I_N\}$ in total, and denote the images sorted in descending order of smiling intensity by $\{I_1^{rank}, \ldots, I_N^{rank}\}$. Since the strong-weak relationships between each image $I_n$ and the other images $I_{\hat{n}}$, $\hat{n} \in \{1, \ldots, N\} \setminus \{n\}$, are given by the probability values $P^{As}$ and $P^{Des}$, the total consistency value of the baseline ranking is derived as

$$L = \sum_{n} \left\{ \sum_{\hat{n} \,|\, I_{\hat{n}} \in \{I_1^{rank}, \ldots, I_{n-1}^{rank}\}} P^{Des}(n, \hat{n}) + \sum_{\hat{n} \,|\, I_{\hat{n}} \in \{I_{n+1}^{rank}, \ldots, I_N^{rank}\}} P^{As}(n, \hat{n}) \right\}. \qquad (5)$$

By maximizing this total consistency, the baseline ranking $(I_1^{rank}, \ldots, I_N^{rank}) = \arg\max_{I_1^{rank}, \ldots, I_N^{rank}} L$ can be obtained. From now on, the subscript $n$ is used to index the images in descending order of smiling degree.

An example of the consistency of the strong-weak relations in this rearranged table is shown in Figure 6 (c), obtained by replacing $P^{As}(tgt, ref)$ with $P^{Des}(tgt, ref)$ when $sn(I_{tgt}) < sn(I_{ref})$. We denote this consistency by the probability $C_{tgt,ref}$:

$$C_{tgt,ref} = \begin{cases} P^{As}(tgt, ref) & \text{if } sn(tgt) > sn(ref), \\ P^{Des}(tgt, ref) & \text{if } sn(tgt) < sn(ref). \end{cases} \qquad (6)$$

Ideally, all cells would be blue, i.e., the consistency would be nearly equal to 1. However, due to the ambiguity of the comparison results for similar facial expressions, there is also ambiguity in the consistency between neighbors in the baseline ranking. Therefore, reference image selection is important for constructing an appropriate evaluation space.

4.2. Reference image selection

An important factor in selecting reference images is the consistency of the strong-weak relationships within the reference images. That is, when the consistency table is calculated in the same way as the bottom-right panel of Figure 6, less red and white area is a better sign of reference image selection.

To realize this, we focus on a square region of neighboring images, as shown in Figure 7, and call this square a consistency square. Suppose the differences between images are significant and a strong-weak relationship is evident in the images; in that case, the consistency values in the consistency square are expected to be large, i.e., the cells in the square become blue. In contrast, when images are similar and the differences between them are therefore ambiguous, the consistency values in the consistency square become low, i.e., the cells in the square become white and red.

The basic idea of building a consistent evaluation space is to quantize images with low consistency values in the consistency square into a single class. As a result, the ambiguity between these images becomes "don't care" in the evaluation space, and the consistency of the evaluated values becomes significant. That is, it is good to select images such that the total consistency value within the consistency squares, as shown on the right of Figure 7, becomes low.

In addition, neighboring images in the baseline ranking should not both be selected as reference images. In other words, to select good reference images, the evaluation space should be divided evenly. To realize this, we select a group of reference images such that the sum of the areas of the consistency squares becomes small. To sum up, it is better to choose a group of images for which both the sum of the consistency values and the sum of the areas within the consistency squares are small; selecting a group of images with low consistency values within the consistency squares is equivalent to selecting a group with high inconsistency. In summary, a group of reference images should be selected so as to maximize the total inconsistency within the consistency squares divided by the sum of the areas of the consistency squares.
In practice, the procedure of reference image selection is as follows. Suppose there are $N$ images in total and we want to select one image as a reference image. First, the baseline ranking is constructed as introduced in the previous subsection, and the consistency values $C_{i,j}$ $(i, j \le N)$ are obtained for all pairs in the ranking. When the evaluation space is divided into two by image $I_m$ $(m < N)$, the sum of the inconsistency values of the two subspaces divided by the sum of their areas is calculated as

$$D_m^{(2)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} (1 - C_{i,j}) + \sum_{i=m}^{N} \sum_{j=m}^{N} (1 - C_{i,j})}{m^2 + (N - m)^2}. \qquad (7)$$

By searching for the position $m$ at which $D_m^{(2)}$ is maximal, the best reference image $I_m$ dividing the evaluation space into two can be obtained. Similarly, by calculating the sum of the inconsistencies in the consistency squares divided by the sum of the areas for larger division numbers $N_{ref} + 1$, $N_{ref}$ reference images can be obtained. When calculating these values, we apply a dynamic programming scheme to reduce the computational cost.
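Read as code, Eq. (7) scores every split point of a segment; the sketch below evaluates the scores with 2-D prefix sums and selects several references by greedy recursive splitting. The greedy recursion is our simplification for illustration; the paper instead applies a dynamic programming scheme, whose details are not given here.

```python
# Hedged sketch of reference selection via Eq. (7): choose the split m
# maximizing summed inconsistency (1 - C) inside the two consistency
# squares, normalized by their areas. Multi-way selection is shown as
# a simple greedy recursion, not the paper's dynamic programming.
import numpy as np

def split_score(inc_cum: np.ndarray, lo: int, hi: int, m: int) -> float:
    def block(a, b):  # sum of (1 - C) over the square [a, b) x [a, b)
        return inc_cum[b, b] - inc_cum[a, b] - inc_cum[b, a] + inc_cum[a, a]
    num = block(lo, m) + block(m, hi)
    den = (m - lo) ** 2 + (hi - m) ** 2
    return num / den

def select_references(C: np.ndarray, n_ref: int):
    inc = 1.0 - C
    inc_cum = np.zeros((len(C) + 1, len(C) + 1))
    inc_cum[1:, 1:] = inc.cumsum(0).cumsum(1)    # 2-D prefix sums
    segments, refs = [(0, len(C))], []
    for _ in range(n_ref):                        # greedy: take the best split
        best = max(((split_score(inc_cum, lo, hi, m), lo, hi, m)
                    for lo, hi in segments for m in range(lo + 1, hi)),
                   key=lambda t: t[0])
        _, lo, hi, m = best
        segments.remove((lo, hi))
        segments += [(lo, m), (m, hi)]
        refs.append(m)                            # image I_m becomes a reference
    return sorted(refs)
```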
Figure 8: Examples of prediction results by the comparison network. The first row shows the label and the prediction results, and the second row shows the regions on which the network focuses, visualized by Grad-CAM.
5. Experiment

We conducted an experiment to evaluate the following:

• How well the proposed network can evaluate image pairs.
• How appropriately reference images can be selected, with respect to the consistency of both the network's and the human annotators' evaluations.
• How well the selected reference images evaluate face images.
5.1. Dataset construction

First, a face image dataset was constructed by capturing participants' face images. We conducted two types of experiments to construct datasets for different situations. For the first type of dataset, we asked a participant to sit in front of the camera and listen to a funny radio program. For the second type, we asked a participant to sit in front of a laptop PC and play a simple game. We captured the participants' facial images during the experiment. The second setting was still experimental but closer to a natural scene than the first one. Each dataset was constructed from only one participant because our focus was to build a model that evaluates each individual. We collected two datasets of the first type and one dataset of the second type.

Then, we added labels to image pairs indicating which of the two images expressed the stronger smile. Annotation between images with only a slight difference in the degree of smiling is difficult even for humans and might lead to incorrect labels. Therefore, we utilized image pairs with a clear difference as the training data in this paper. In particular, we manually annotated segments in which we judged the degree of smiling to ascend or descend monotonically. We then picked each segment's start and end frames to construct one pair with its label. The numbers of image pairs in the three datasets were 216, 174, and 123, respectively. Also, all face images in these pairs were utilized as baseline images; that is, the size of each dataset is twice the number of image pairs: 432, 348, and 246, respectively.

5.2. Evaluation scheme

The evaluation procedure consisted of the following four steps: (1) the proposed network was trained for each individual on the collected data and evaluated with a cross-validation scheme; (2) we constructed the baseline ranking and evaluated the voting-based algorithm by determining the rank of each image within the baseline ranking; (3) reference images were selected from the baseline images as proposed in the previous section and evaluated by human annotators with respect to their consistency; and (4) we confirmed the smiling intensity of face images in the evaluation space constructed from the reference images.

For training our network, we utilized the pre-trained feature extraction layers of VGG-Face [12], which was trained on millions of face images for person identification, and trained only the fully connected layers.

Regarding the evaluation of the voting-based algorithm, the rank of each baseline image was determined within the baseline ranking itself. In this evaluation, the ground truth of the rank of each baseline image was given by its original rank in the baseline ranking.

As for evaluating the selected reference images, nine reference images were selected from the baseline images. Then, human annotators were asked to evaluate which of two images represented the stronger smile for pairs of neighboring reference images. Images up to the third nearest neighbors were considered as pairs, and all image pairs were annotated twice, swapping the left and right sides of the displayed comparison. After comparing the image pairs within one dataset, the participants moved on to the next dataset. The order of the evaluated image pairs was randomized within each dataset, but the order of the datasets was constant. We evaluated the consistency of the reference images by how accurately and consistently the annotators evaluated the image pairs. As a comparison baseline, a group of 9 images regularly extracted every $N/10$ images from the baseline ranking was used as the reference images. Seven participants between the ages of 21 and 27 (6 male and 1 female) were recruited as annotators.

Figure 9: Selected reference images of dataset 1. Images with stronger smiles are located on the left side.

Figure 10: Consistency of the estimated rank and the original rank in the baseline ranking.
5.3. Results

5.3.1. Prediction accuracy

We first show the evaluation results of the trained comparison network in terms of accuracy. In this evaluation, five-fold cross-validation was applied, and the prediction accuracies for the three datasets were 99.5% (215/216), 100% (174/174), and 98.3% (121/123), respectively. Figure 8 shows examples of prediction results with Gradient-weighted Class Activation Mapping (Grad-CAM) [13]. In each example, the first row shows the ground-truth label and the estimation result, and the second row shows, as a heat map, the regions on which the network focuses for the prediction. From these results, we can see that the network returns accurate predictions by correctly focusing on face regions, including the mouth and eyes, which are well known to correspond to smiling, even with the small dataset.

5.3.2. Consistency of the voting-based evaluation

Figure 9 shows four examples of estimated ranks of images in the baseline ranking: ranks 1, 100, 200, and 400 of dataset 1. The total number of baseline images was 216 × 2 = 432. The images were estimated as ranks 1, 103, 199, and 398, respectively, so an almost correct rank can be estimated by the voting algorithm. Figure 10 shows all pairs of the estimated rank and the ground-truth rank in this evaluation on dataset 1. These results show the consistency and effectiveness of the voting-based algorithm, as it predicted almost consistent values for all baseline images.

5.3.3. Selected reference images

Figure 11 shows the reference images selected by the proposed algorithm and those picked at regular intervals from the baseline ranking. The consistency table corresponding to this result for dataset 1 is shown in Figure 12, in which the green lines show where the algorithm divides the baseline ranking. These results show that some ambiguity still appears between adjacent images, even with the proposed approach. However, it appears to be reduced compared with the group of images acquired at regular intervals.

The consistency table calculated from the selected reference images is shown in Figure 13. We can see that almost all cells are blue. This result shows that the ambiguities within the reference images are small.

Figure 14 and Figure 15 show the quantitative evaluation results of the selected reference images by the annotators. In each figure, "proposed" and "baseline" represent the results for image pairs up to the third nearest neighbor reference images of the proposed algorithm and the baseline algorithm, respectively, while "proposed_adjacent" and "baseline_adjacent" represent the results for pairs of only the nearest neighbor reference images; that is, the difficulty of the evaluation is higher. Figure 14 shows the prediction accuracy of the evaluation results. Here, we consider the order given by the proposed network to be the ground truth of the prediction; therefore, this result also shows the correlation between the network predictions and human perception. Figure 15 shows the consistency of each participant's evaluations; in particular, it shows how often the same evaluation was given when the same image pair was displayed with the left and right sides swapped. High consistency indicates a low degree of ambiguity between image pairs. In almost all cases, the reference images selected by the proposed algorithm obtain higher accuracies and higher consistencies. Since the smiles expressed during the experimental period were quantized into ten levels and the maximum smiling intensity was not very high, both methods yield a certain degree of similarity between neighboring reference image pairs. Therefore, evaluation by humans may be somewhat difficult even with the proposed method. However, even in such a situation, we can confirm that the proposed method selects image pairs with higher accuracy than the comparison method.

Figure 11: Selected reference images of dataset 1, (a) selected by the proposed algorithm and (b) picked at regular intervals from the baseline ranking. Images with stronger smiles are located on the left side.

Figure 12: Consistency table of dataset 1. The green line shows where the algorithm divides the baseline ranking.

Figure 13: Consistency table calculated from the selected reference images of dataset 1.

Figure 14: Accuracies of the annotators' evaluation. "Adjacent" means the image pair consists of nearest neighbor images.

5.3.4. Smiling intensity evaluation

Examples of face images evaluated with the selected reference images and the proposed network are shown in Figure 16. Since it is sometimes hard to qualitatively evaluate two adjacent images in a row, the four reference images shown skip one rank at a time. The images with a smile level one class lower than the reference image are listed, and each row shows the same evaluation value. The images on the left side of the figure are recognized as having a higher degree of smiling. This result confirms that the proposed method effectively evaluates the degree of smiling within the ordinal scale.

Finally, a part of the transition of the smiling intensity during the experiment is shown in Figure 17. In this result, the evaluation score was smoothed by a median filter to trace the trend of the transitions.
We can see that the participant smiles several times in this period. Smiles of slightly stronger intensity than the middle level occurred several times in succession in the first half of this period, while smiles of considerably stronger intensity occurred at short intervals in the second half.

Figure 15: Consistencies of the annotators' evaluation of the same image pair. "Adjacent" means the image pair consists of nearest neighbor images.

Figure 16: Example results of the evaluation of the degree of smiling (reference images and evaluated images, arranged from the strongest smile to the weakest).

Figure 17: A part of the transition of the degree of smiling. Each grid line of the time axis marks 10 seconds.

6. Conclusion

In this paper, we propose an approach to evaluating the degree of smiling of individuals on ordinal scales based on multiple comparisons for the purpose of monitoring individuals. Assuming that we have enough face image data of the individual, we also propose an algorithm for selecting appropriate reference images for the ordinal evaluation.

Experimental results show that our ordinal scale-based evaluation can successfully give the degree of not only clear smiles but also intermediate facial expressions. In addition, the evaluation space constructed from the reference images selected by our algorithm is more consistent and is therefore considered reasonable.

One direction of future work is to map the proposed ordinal scale to some physical index. Although this paper proposed a method of selecting reference images that are somewhat reasonable when evaluated by humans, the validity of the scale would be improved if it could be mapped to a physical index. For example, by measuring the myoelectricity of facial muscles, the degree of muscle activity could be used as such an index. Another direction of future work is to apply this technique to people whose facial expressions change little, e.g., dementia patients, as described in the Introduction.

Ethics

Our method aims to monitor the daily health condition of a specific individual by evaluating smiling intensity with a model trained specifically on that individual's facial images. Since the data for model training and smiling intensity evaluation can be collected and processed on terminals installed in each individual's environment, the approach is expected to reduce the risk of leaking particularly sensitive personal information, such as facial images, through cloud storage in practical applications.

References

[1] K. Kondo, T. Nakamura, Y. Nakamura, S. Satoh, Siamese-structure deep neural network recognizing changes in facial expression according to the degree of smiling, in: Proc. of ICPR 2020, 2021, pp. 4605–4612. doi:10.1109/ICPR48806.2021.9411988.
[2] P. Ekman, Facial Action Coding System (FACS), A Human Face, 2002. URL: https://ci.nii.ac.jp/naid/10025007347/.
[3] B. Amos, B. Ludwiczuk, M. Satyanarayanan, OpenFace: A general-purpose face recognition library with mobile applications, Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.
[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. of ICLR 2015, San Diego, CA, USA, May 7–9, 2015.
[5] C. C. Atabansi, T. Chen, R. Cao, X. Xu, Transfer learning technique with VGG-16 for near-infrared facial expression recognition, Journal of Physics: Conference Series 1873 (2021) 012033. doi:10.1088/1742-6596/1873/1/012033.
[6] Y. Liu, Facial expression recognition model based on improved VGGNet, in: 2023 4th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), 2023, pp. 404–408. doi:10.1109/ICECAI58670.2023.10177007.
[7] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, in: Proc. of NIPS'93, 1993, pp. 737–744.
[8] J. Bromley, J. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence 7 (1993) 25. doi:10.1142/S0218001493000339.
[9] X. Zhou, W. Liang, S. Shimizu, J. Ma, Q. Jin, Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems, IEEE Transactions on Industrial Informatics 17 (2021) 5790–5798. doi:10.1109/TII.2020.3047675.
[10] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: Proc. of the Deep Learning Workshop at the 32nd International Conference on Machine Learning, volume 2, 2015.
[11] J. Zhang, K. Shimonishi, K. Kondo, Y. Nakamura, Facial expression change recognition on neutral-negative axis based on Siamese-structure deep neural network, in: Cross-Cultural Design. Product and Service Design, Mobility and Automotive Design, Cities, Urban Areas, and Intelligent Environments Design: 14th International Conference, CCD 2022, Held as Part of the 24th HCI International Conference, HCII 2022, 2022, pp. 583–598.
[12] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference, 2015.
[13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, International Journal of Computer Vision 128 (2019) 336–359. doi:10.1007/s11263-019-01228-7.