Ordinal Scale Evaluation of Smiling Intensity using Comparison-Based Network

Kei Shimonishi1,*, Kazuaki Kondo1, Hirotada Ueda1 and Yuichi Nakamura1
1 Kyoto University, Yoshida-honmachi, Sakyo, Kyoto, Japan

Abstract
The ability to evaluate both explicit facial expressions and intermediate expressions is helpful for human monitoring. Since intermediate facial expressions are out of the scope of traditional studies, evaluation scores obtained from traditional facial expression recognition techniques are unreliable. In this paper, we propose an ordinal scale-based evaluation scheme for facial expressions based on comparison. Because the proposed framework is based on an ordinal scale, it is challenging to construct a standard scale that can be applied across multiple individuals. However, it is expected to be effective enough to track changes in the facial expressions of a specific individual, including intermediate expressions. Because reference image selection significantly impacts the ordinal evaluation, we also propose an algorithm for selecting reference images from the data by taking into account the consistency of the strong-weak relationships between reference images. Our approach is evaluated by conducting experiments with human annotators.

Keywords
Facial expression recognition, Siamese network, ranking, ordinal scales

1. Introduction

Monitoring an individual's Quality of Life (QOL) is becoming increasingly important for maintaining good mental condition and for detecting early signs of harmful conditions. Because direct QOL inquiries are burdensome and it is difficult to accurately report one's own internal state, estimating the internal state from external nonverbal information is desirable. Facial expression is one of the modalities that reflects an individual's internal state and is expressed under the influence of one's mental condition.
For example, when an individual is not feeling well, the same smile may appear weaker than usual. Therefore, monitoring facial expressions in daily life is a crucial clue for estimating an individual's QOL.

Figure 1: An example of a transition curve of smiling intensity in daily life.

The research field of facial expression recognition (FER) has a long history, and it has already been put into practical use, for example as the smile shutter of cameras. While traditional FER mainly focuses on recognizing whether a clear facial expression is present or not, from the viewpoint of monitoring in daily life, evaluating the degree of expression for the individual is rather crucial, especially for patients with dementia who show little or no facial expression. Based on this point of view, this research aims to draw a curve of the transitions of an individual's degree of facial expression, particularly smiling intensity, as shown in Figure 1.

Though a traditional FER algorithm may seem able to evaluate intermediate facial expressions as the probability that a specific facial expression is present, the probability values are not very reliable, especially for evaluating intermediate expressions. This is because intermediate facial expressions were out of the scope of traditional studies; learning is likely to output a value close to the binary value of either no expression (0) or a full expression (1). As a result, when the degree of smiling is estimated for a series of facial expressions, the value may change abruptly over time, as shown in Figure 2. In addition, it is also difficult for a machine learning algorithm to directly learn intermediate facial expressions, since it is difficult even for humans to give appropriate absolute values to intermediate facial expressions.

Figure 2: An example of a sudden jump of evaluation scores for intermediate facial expressions by a traditional facial expression recognition technique.

Kondo et al. [1] proposed a network for recognizing smiling based on "comparison" to address the issues of recognizing intermediate facial expressions. Their work is based on the assumption that relatively evaluating which of two images represents more smiling by comparing them is easier than absolutely evaluating the degree of smiling from a single image.

By borrowing this comparison-based idea, we propose an approach to evaluate smiling intensity on an ordinal scale. The basic idea of this approach is that if we have multiple reference face images for a specific individual and a method for comparing facial expressions, we can evaluate the smiling intensity of a new image of that individual through pairwise comparison with the reference images, as shown in Figure 3. Since the expression ratings in this method are based on an ordinal scale, the degree of each rating is not the same across individuals. However, this ordinal scale-based approach may satisfy our need to capture changes in facial expressions for each individual.

In addition, reference image selection is crucial for this ordinal evaluation because the reference images constitute the evaluation space for facial expressions. Therefore, we also propose an algorithm for selecting reference images from a large number of face images of each individual, based on the consistency of the comparison results within the images.

In summary, the contributions of this paper are as follows:

• We propose an approach for evaluating intermediate smiling intensity by ordinal scales based on comparisons.
• We propose an algorithm for selecting appropriate reference images to construct an appropriate evaluation space.

We briefly introduce related work in the next section. Then, we introduce an approach to evaluate facial expressions by ordinal scales and an algorithm for reference image selection. We evaluate our approach and algorithm with human annotators, and finally, we conclude our research.

Machine Learning for Cognitive and Mental Health Workshop (ML4CMH), AAAI 2024, Vancouver, BC, Canada
* Corresponding author.
shimonishi@i.kyoto-u.ac.jp (K. Shimonishi); kondo@ccm.media.kyoto-u.ac.jp (K. Kondo); ueda.hirotada.2r@kyoto-u.ac.jp (H. Ueda); yuichi@media.kyoto-u.ac.jp (Y. Nakamura)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related Work

2.1. Facial expression recognition

Facial expression recognition is widely utilized in several fields. Traditional studies mainly focused on determining whether a specific expression is present or not.

2.1.1. Facial Action Coding System

The Facial Action Coding System (FACS) [2] is a framework proposed by Ekman et al. that decomposes the face into several parts (Action Units; AUs) based on the basic actions of individual muscles and describes facial expressions as combinations of these AU actions. Many facial expression recognition applications have used FACS as features; for example, OpenFace [3] can analyze multiple facial expressions in near real-time by automatically recognizing the actions of AUs.

Figure 3: Overview of the evaluation method of facial expression intensity based on comparison.

2.1.2. Deep neural network based approach

Although the FACS-based FER approach has been successful, it has the limitation that the final results are affected by the accuracy of FACS detection. This limitation can become a problem, especially when trying to capture subtle differences in facial expressions, because the effect of observation noise cannot be ignored. On the other hand, an end-to-end approach using a deep neural network can be expected to reduce the effect of such observation noise by eliminating the need for explicit feature detection. For example, VGGNet [4] is a traditional deep neural network, but it is known to extract human facial features well, and recent FER research also utilizes VGGNet [5, 6].

2.2. Siamese structure-based recognition technique

The Siamese network [7] is a deep neural network for metric learning. It takes two inputs and returns the distance between them. By applying the same structure and the same weights to the feature extraction layers of the two inputs, and by feeding the distance between the two extracted features to the loss function, the network can learn a distance space. The Siamese network determines whether two inputs are similar or different; it has been applied to handwritten signature recognition [8] and used as a framework for anomaly detection [9]. It is also known to be trainable from a small amount of training data compared to conventional networks that perform multi-valued discrimination or regression [10].

Kondo et al. [1] proposed an approach to the evaluation of facial expressions based on comparison, inspired by the Siamese structure. Their approach compares two facial images and returns which image represents more smiling, and they showed that the approach has the potential to distinguish subtle facial expression differences. In addition, Zhang et al. [11] extended their work from the positive-neutral direction to the negative-neutral direction.

3. Comparison-based smiling evaluation by ordinal scales

3.1. Overview of the proposed framework

As introduced in the Introduction, the basic idea of our approach is comparison-based evaluation. Kondo et al. [1] developed a Siamese-based smiling recognition network that takes two face images as input and recognizes which one expresses more smiling. By borrowing this idea, once we develop a network that can determine which of two images represents more smiling, and if we have multiple reference images, we can evaluate the smile intensity of a new image through pairwise comparison with the reference images, as also introduced in the Introduction. When determining smiling intensity on an ordinal scale, although all comparison results are ideally consistent, the results are sometimes inconsistent due to the ambiguity of slightly different face images. Therefore, we apply a voting-based evaluation and determine smiling scores by merging multiple comparison results. In addition, we propose an algorithm to select appropriate reference images to reduce the ambiguity between reference images in the following section.

3.2. A network for facial expression comparison

In this paper, we define the recognition task as a simple two-category classification problem (i.e., determining which of two input images represents the greater degree of smiling) and construct a Siamese-based network to recognize smiling, similar to the network Kondo et al. developed [1].

Figure 4: Siamese-based network to compare face images to evaluate the degree of smiling.

Figure 4 shows the structure of the proposed network, which accepts two input images and returns two likelihood values corresponding to the "ascent" and "descent" labels relative to the degree of smiling. We employed the CNN component of VGG16 [4] and two fully connected layers with rectified linear units, a 0.25 dropout rate, and softmax. The ground-truth likelihood values for an input image pair were represented as a two-element one-hot vector, with the element corresponding to the ground-truth label set to 1 and the other element set to 0. We used the categorical cross-entropy loss to optimize the network parameters:

L_cat = - Σ_i { y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) },   (1)

where i ∈ {0, 1} indexes the ascent and descent labels relative to the degree of smiling, and y_i and ŷ_i denote the ground-truth label and the predicted likelihood value, respectively.

The network previously proposed by Kondo et al. was not designed to consider the order of its inputs, resulting in instances where swapping the order of the two inputs led to contradictory outputs. To address this issue, we feed a permuted version of the two features extracted from the input images by the CNN component into the fully connected layers of the latter stage and calculate the categorical cross-entropy loss of this inverted input, L_inv, in the same way as L_cat, as shown by the red arrows in Figure 4. In addition, a loss for the consistency of these two types of input is calculated as

L_con = 1 - { min(P_f^As(tgt, ref), P_i^As(tgt, ref)) + min(P_f^Des(tgt, ref), P_i^Des(tgt, ref)) },   (2)

where P^As(n, m) and P^Des(n, m) represent the probabilities that the degree of smiling of image I_n is larger or smaller, respectively, than that of image I_m; in other words, P(sn(I_n) > sn(I_m)) and P(sn(I_n) < sn(I_m)), where sn(I) denotes the degree of smiling of image I. P_f and P_i represent the likelihoods of the forward comparison stream and the inverse comparison stream, respectively.

In total, our network is trained to minimize the following loss function:

L = L_cat + L_inv + L_con.   (3)

Here, we expect the CNN and fully connected components to be trained to compare the extracted features and to project the comparison results onto the likelihood values of the ascent and descent labels, respectively.

3.3. Voting-based evaluation

Since the reference images may include some ambiguity between neighboring images, it is difficult to directly determine the degree of smiling of a new target image within a reference image set. Therefore, we apply a voting technique to determine the final rank of the image. The algorithm votes for possible ranks using the result of each comparison between the reference images and the target image. As a result, the most likely rank should receive the maximum number of votes.

Figure 5: Voting-based evaluation.

In particular, the procedure is as follows. Suppose that we have N reference images ordered by degree of smiling, i.e., sn(I_i) > sn(I_j), ∀i < j. A new target image I_new is compared to all reference images, and for every reference image (n ∈ {1, ..., N}) we obtain the likelihood P^As(new, n) that the degree of smiling of the target image is larger than that of reference image I_n, and the likelihood P^Des(new, n) that it is lower. If sn(I_new) < sn(I_n) is estimated, the smile rank of I_new is estimated to be larger than n, so large values are voted to the ranks larger than n. In practice, the likelihood values of "ascent" and "descent" are added to the ranks lower than and higher than n, respectively.

Simply put, the rank of the target image relative to the reference images can then be determined by searching for the position whose score is maximum. This position r is derived as:

r = arg max_r { Σ_{n=1}^{r-1} P^Des(new, n) + Σ_{n=r+1}^{N} P^As(new, n) }.   (4)

In addition, we apply a mean-shift algorithm to determine the evaluation score based on these probability values.

4. Reference image selection

To apply the voting-based ordinal-scale evaluation described above, we first need to construct an evaluation space with several reference images. Since the proposed approach utilizes ordinal scales, the construction of the evaluation space is crucial for the capability of the approach. Although a straightforward way is to utilize all the face data as reference images, an evaluation space constructed from very similar or only subtly different images is unreliable due to the ambiguity of these images.

In this paper, we first treat all the data as baseline images and perform pairwise comparisons to sort all the data in a dataset and construct a baseline ranking. Then, we select several images from the baseline ranking as reference images and quantize the evaluation space, taking consistency into account to address the issues caused by ambiguity.

4.1. Baseline ranking construction

Figure 6: A baseline ranking made from a comparison table. (a) Comparison table; (b) strong-weak relations between images; (c) consistency of strong-weak relations between images.

Figure 6 (a) shows a comparison table of the results of all pairwise comparisons among the baseline images. Each color shows the probability that a target image has a stronger smile than a reference image, i.e., P^As(tgt, ref). Blue indicates a pair whose target image has a stronger smile than the reference image, i.e., P^As(tgt, ref) > P^Des(tgt, ref). In contrast, red indicates a pair whose reference image has a stronger smile than the target image, i.e., P^As(tgt, ref) < P^Des(tgt, ref). White indicates that the target and reference images represent similar facial expressions.

By sorting the baseline images based on the sum of the probability values in each column of this table, a baseline ranking considering the consistency of the strong-weak relationships can be constructed (Figure 6 (b)). In particular, suppose we have N baseline images {I_1, ..., I_N} in total, and denote the images sorted in descending order of smiling intensity as {I_1^rank, ..., I_N^rank}.
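As an illustrative sketch (not the authors' implementation), the column-sum ranking just described and the voting rule of Eq. (4) can be written with NumPy as follows. The matrix layout, the sort direction, and the tie handling are our own assumptions:

```python
import numpy as np

def baseline_ranking(p_as):
    """Sort baseline images into a strong-to-weak smiling order.

    p_as[t, r] is assumed to hold P_As(t, r), the probability that image t
    smiles more strongly than image r.  The larger the sum of column r, the
    more images beat image r, so one consistent choice is to sort the column
    sums in ascending order to obtain a strongest-first ranking.
    """
    return np.argsort(p_as.sum(axis=0))

def vote_rank(p_as_new, p_des_new):
    """Voting-based rank of a new image against N ordered references
    (rank 1 = strongest smile), following Eq. (4): at candidate rank r,
    references above r should vote 'descent' and references below r
    should vote 'ascent'."""
    n = len(p_as_new)
    scores = [p_des_new[:r - 1].sum() + p_as_new[r:].sum()
              for r in range(1, n + 1)]
    return int(np.argmax(scores)) + 1  # rank in {1, ..., N}
```

For example, a target whose comparisons say "weaker than references 1-2, stronger than references 4-5" receives its maximum vote at rank 3.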
Since the strong-weak relationship between each image I_n and the other images I_n̂, n̂ ∈ {1, ..., N} \ {n}, is calculated as the probability values P^As and P^Des, the total consistency of the baseline ranking is derived as

L = Σ_n { Σ_{n̂ : I_n̂ ∈ {I_1^rank, ..., I_{n-1}^rank}} P^Des(n, n̂) + Σ_{n̂ : I_n̂ ∈ {I_{n+1}^rank, ..., I_N^rank}} P^As(n, n̂) }.   (5)

By maximizing this total consistency, the baseline ranking (I_1^rank, ..., I_N^rank) = arg max L can be obtained. From now on, the subscript n is used to index the images sorted in descending order of smiling degree.

An example of the consistency of the strong-weak relations in this rearranged table is shown in Figure 6 (c), obtained by replacing P^As(tgt, ref) with P^Des(tgt, ref) when sn(I_tgt) < sn(I_ref). We denote the probability C_{tgt,ref} indicating this consistency as follows:

C_{tgt,ref} = P^As(tgt, ref)  if sn(tgt) > sn(ref),
C_{tgt,ref} = P^Des(tgt, ref) if sn(tgt) < sn(ref).   (6)

Ideally, all cells would be blue, i.e., the consistency would be nearly equal to 1. However, due to the ambiguity of comparison results for similar facial expressions, there is also ambiguity in the consistency between neighbors in the baseline ranking. Therefore, reference image selection is important for constructing an appropriate evaluation space.

4.2. Reference image selection

An important factor in selecting reference images is the consistency of the strong-weak relationships within the reference images. That is, when the consistency table is calculated as in Figure 6 (c), less red and white area is a better sign for reference image selection. To evaluate this, we focus on square regions of neighboring images, as shown in Figure 7, and call each such square a consistency square.

Figure 7: Consistencies of strong-weak relationships among neighboring images as a part of the consistency table in the baseline ranking.

Suppose the differences between images are significant and the strong-weak relationship is evident in the images. In that case, the consistency values in the consistency square are expected to be large, i.e., the cells in the square become blue. In contrast, when images are similar, and the difference between them is therefore ambiguous, the consistency values in the consistency square become low, i.e., the cells in the square become white or red.

The basic idea of building a consistent evaluation space is to quantize images with low consistency values in the consistency square into a single class. As a result, the ambiguity between these images becomes "don't care" in the evaluation space, and the consistency of the evaluated values becomes significant. That is, it is good to divide the ranking so that the total consistency value within the consistency squares, as shown on the right of Figure 7, becomes low; equivalently, so that the inconsistency captured inside the squares becomes high. In addition, neighboring images in the baseline ranking should not be selected together as reference images; in other words, to select good reference images, the evaluation space should be divided evenly. To realize this, the division should also keep the total area of the consistency squares small (for a fixed number of divisions, this favors an even division). In summary, a group of reference images should be selected so as to maximize the total inconsistency within the consistency squares divided by the sum of the areas of the consistency squares.

In practice, the procedure of reference image selection is as follows. Suppose there are N images in total and we want to select one image as a reference image. First, the baseline ranking is constructed as introduced in the previous subsection, and the consistency values C_{i,j} (i, j ∈ {1, ..., N}) are obtained for all pairs in the ranking. When the evaluation space is divided into two at image I_m (m < N), the sum of the inconsistency values of the two divided spaces, normalized by the sum of their areas, is calculated as:

D_m^(2) = { Σ_{i=1}^{m} Σ_{j=1}^{m} (1 - C_{i,j}) + Σ_{i=m+1}^{N} Σ_{j=m+1}^{N} (1 - C_{i,j}) } / { m^2 + (N - m)^2 }.   (7)

By searching for the position at which D_m^(2) is maximum, the best reference image I_m that divides the evaluation space into two can be obtained. Similarly, by calculating the sum of the inconsistencies in the consistency squares divided by the sum of their areas for different division numbers N_ref + 1, N_ref reference images can be obtained. When calculating these values, we apply a dynamic programming scheme to reduce the computational cost.
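The two-way split score of Eq. (7) can be sketched as follows. This is a rough illustration under our own assumptions about the matrix layout, and it uses exhaustive search over the split position rather than the dynamic programming scheme mentioned above:

```python
import numpy as np

def split_score(c, m):
    """D^(2)_m of Eq. (7): the inconsistency (1 - C) summed inside the two
    consistency squares obtained by dividing an N-image ranking after
    position m, normalized by the total area of the squares."""
    n = c.shape[0]
    incon = 1.0 - c
    top = incon[:m, :m].sum()      # square over ranks 1..m
    bottom = incon[m:, m:].sum()   # square over ranks m+1..N
    return (top + bottom) / (m ** 2 + (n - m) ** 2)

def best_split(c):
    """Split position maximizing D^(2)_m, found by exhaustive search
    (the paper reduces this cost with dynamic programming)."""
    n = c.shape[0]
    return max(range(1, n), key=lambda m: split_score(c, m))
```

With a consistency matrix in which ranks 1-3 and ranks 4-6 are mutually ambiguous (C ≈ 0.5) but clearly ordered across the two groups (C ≈ 0.95), the split lands between ranks 3 and 4, grouping the ambiguous pairs inside the "don't care" squares.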
5. Experiment

We conducted an experiment to evaluate the following:

• How well the proposed network can evaluate an image pair.
• How appropriately reference images can be selected regarding the consistency of both the network's and human annotators' evaluations.
• How well the selected reference images can evaluate face images.

5.1. Dataset construction

First, face image datasets were constructed by capturing participants' face images. We conducted two types of experiments to construct datasets for different situations. For the first type of dataset, we asked a participant to sit in front of the camera and listen to a funny radio program. For the second type, we asked a participant to sit in front of a laptop PC and play a simple game. We captured facial images of the participants during the experiments. The second setting was still experimental, but closer to a natural scene than the first one. Each dataset was constructed from only one participant because our focus was to build a model to evaluate each individual. We collected two datasets of the first type and one dataset of the second type.

Then, we added labels to image pairs indicating which of the two images expressed more smiling. Annotation between images with only a slight difference in the degree of smiling was difficult even for humans and might lead to incorrect labels. Therefore, we utilized image pairs with a clear difference as training data in this paper. In particular, we manually annotated segments in which we judged that the degree of smiling ascended or descended monotonically, and picked each segment's start and end frames to construct one pair with its label. The numbers of image pairs in the three datasets were 216, 174, and 123, respectively. All face images in these pairs were also utilized as baseline images; that is, the size of each dataset is twice the number of image pairs: 432, 348, and 246, respectively.

5.2. Evaluation scheme

The evaluation procedure consisted of the following four steps: (1) the proposed network was trained for each individual on the collected data and was evaluated by a cross-validation scheme; (2) we constructed the baseline ranking and evaluated the voting-based algorithm by determining the rank of each image within the baseline ranking; (3) reference images were selected from the baseline images as proposed in the previous section, and human annotators evaluated how consistent they were; (4) we confirmed the smiling intensity of face images in the evaluation space constructed by the reference images.

For training our network, we utilized the pre-trained feature extraction layers of VGG-Face [12], which was trained on millions of face images for person identification, and trained only the fully connected layers.

Regarding the evaluation of the voting-based algorithm, the rank of each baseline image was determined by the baseline ranking itself. In this evaluation, the ground truth of the rank of each baseline image was given as its original rank in the baseline ranking.

For evaluating the selected reference images, nine reference images were selected from the baseline images. Human annotators were then asked to evaluate which of two images represented more smiling for pairs of neighboring reference images. Images up to the third nearest neighbor were considered as pairs, and all image pairs were annotated twice by swapping the left and right sides of the compared images. After comparing the image pairs within each dataset, the participants moved on to the next dataset. The order of the evaluated image pairs was randomized for each dataset, but the order of the datasets was constant. We evaluated the consistency of the reference images by how accurately and consistently the annotators evaluated the image pairs. As a comparison method, a group of nine images regularly extracted every N/10 images from the baseline ranking was used as reference images. Seven participants between the ages of 21 and 27 (6 male and 1 female) were recruited as annotators.

Figure 9: Selected reference images of dataset 1. More smile images are located on the left side.

5.3. Results

5.3.1. Prediction accuracy

We first show the evaluation results of the trained comparison network in terms of accuracy. In this evaluation, five-fold cross-validation was applied, and the prediction accuracies for the three datasets were 99.5% (215/216), 100% (174/174), and 98.3% (121/123), respectively. Figure 8 shows examples of prediction results with Gradient-weighted Class Activation Mapping (Grad-CAM) [13].

Figure 8: Examples of prediction results by the comparison network. The first row shows labels and prediction results, and the second row shows the regions on which the network focuses, visualized by Grad-CAM.

In each figure, the first row shows the ground-truth label and the estimation result, and the second row shows the regions on which the network focuses for the prediction as a heat map. From these results, we can see that the network returns accurate predictions by correctly focusing on face regions, including the mouth and eyes, which are well known to correspond to smiling, even with the small dataset.

5.3.2. Consistency of voting-based evaluation

Figure 9 shows four examples of estimated ranks of images in the baseline ranking: ranks 1, 100, 200, and 400 of dataset 1. The total number of baseline images was 216 × 2 = 432. These images were estimated as ranks 1, 103, 199, and 398, respectively, so nearly correct ranks can be estimated by the voting algorithm. Figure 10 shows all pairs of estimated rank and ground-truth rank in dataset 1. These results show the consistency and effectiveness of the voting-based algorithm, as it predicted almost consistent values for all baseline images.

Figure 10: Consistency of estimated rank and original rank in the baseline ranking.

5.3.3. Selected reference images

Figure 11 shows the reference images selected by the proposed algorithm and those picked at equal intervals from the baseline ranking.

Figure 11: Selected reference images of dataset 1. More smile images are located on the left side. (a) Selected by the proposed algorithm; (b) regularly picked from the baseline ranking.

The consistency table corresponding to this result for dataset 1 is shown in Figure 12. In this figure, the green lines show where the algorithm divides the baseline ranking. These results show that some ambiguity still appears between adjacent images, even with the proposed approach. However, it appears to be reduced compared to the group of images acquired at regular intervals.

Figure 12: Consistency table of dataset 1. The green lines show where the algorithm divides the baseline ranking.

The consistency table calculated from the selected reference images is shown in Figure 13. We can see that almost all cells are blue. This result shows that the ambiguities within the reference images are small.

Figure 13: Consistency table calculated from the selected reference images of dataset 1.

Figures 14 and 15 show the quantitative evaluation results of the selected reference images by the annotators. In each figure, "proposed" and "baseline" represent the results for image pairs up to the third nearest neighbor among the reference images of the proposed and baseline algorithms, respectively, while "proposed_adjacent" and "baseline_adjacent" represent the results for pairs of only the nearest neighbor reference images; that is, the evaluation becomes more difficult. Figure 14 shows the prediction accuracy of the evaluation results. Here, we consider the order given by the proposed network as the ground truth of the prediction; therefore, this result also shows the correlation between the network's predictions and human perception.

Figure 14: Accuracies of the annotators' evaluations. "Adjacent" means the image pair consists of nearest neighbor images.

Figure 15 shows the consistency of each participant's evaluation; in particular, it shows how often the same evaluation was given when the same image pair was displayed with the left and right sides swapped. High consistency indicates a low degree of ambiguity between image pairs.

Figure 15: Consistencies of the annotators' evaluations of the same image pair. "Adjacent" means the image pair consists of nearest neighbor images.

In almost all cases, the reference images selected by the proposed algorithm obtained higher accuracies and higher consistencies. Since the smiles expressed during the experimental sessions were quantized into ten levels and the maximum smile intensity was not very high, both methods yield a certain degree of similarity between neighboring reference image pairs. Therefore, evaluation by humans may be somewhat difficult even with the proposed method. However, even in such a situation, we can confirm that the proposed method selects image pairs with higher accuracy than the comparison method.

5.3.4. Smiling intensity evaluation

Examples of face images evaluated with the selected reference images and the proposed network are shown in Figure 16. Since it is sometimes hard to qualitatively evaluate two adjacent images in a row, the four reference images skip one rank at a time. The images with a smile level one class lower than the reference image are listed, and each row shows the same evaluation value. The images on the left side of the figure are recognized as having a higher degree of smiling. This result confirms that the proposed method effectively evaluates the degree of smiling within the ordinal scale.

Figure 16: Example results of the evaluation of the degree of smiling.

Finally, a part of the transition of the smiling intensity during the experiment is shown in Figure 17. In this result, the evaluation score was smoothed by a median filter to trace the trend of transitions. We can see that the participant smiles several times in this period. Smiles of slightly stronger intensity than the middle level occurred several times in succession in the first half of this period, while smiles of considerably stronger intensity occurred at short intervals in the second half.

Figure 17: A part of the transition of the degree of smiling. Each grid line of the time axis represents 10 seconds.

6. Conclusion

In this paper, we proposed an approach to evaluate the degree of smiling of individuals by ordinal scales based on multiple comparisons, for the purpose of monitoring individuals. Supposing that we have enough data of an individual's face images, we also proposed an algorithm for selecting appropriate reference images for the ordinal evaluation. Experimental results show that our ordinal scale-based evaluation can successfully give the degree of not only clear smiling but also intermediate facial expressions. In addition, the evaluation space constructed from the reference images selected by our algorithm is more consistent and is therefore considered reasonable.

One direction for future work is to map the proposed ordinal scale to some physical index. Although this paper proposed a method of selecting reference images that is reasonably consistent with human evaluation, the validity of the scale would be improved if it could be mapped to a physical index. For example, by measuring the myoelectric activity of facial muscles, the degree of muscle activity could be used as such an index. Another direction is to apply this technique to people whose facial expressions do not change much, e.g., dementia patients, as described in the Introduction.

Ethics

Our method aims to monitor the daily health condition of a specific individual by evaluating smiling intensity using a model trained specifically on that individual's facial images. Since the data for model training and for smiling intensity evaluation can be collected and processed at terminals installed in each individual's environment, the risk of leaking particularly sensitive personal information, such as facial images stored in the cloud, is expected to be reduced in practical applications.

References

[1] K. Kondo, T. Nakamura, Y. Nakamura, S. Satoh, Siamese-structure deep neural network recognizing changes in facial expression according to the degree of smiling, in: Proc. of ICPR2020, 2021, pp. 4605-4612. doi:10.1109/ICPR48806.2021.9411988.
[2] P. Ekman, Facial action coding system (FACS), A Human Face (2002). URL: https://ci.nii.ac.jp/naid/10025007347/.
[3] B. Amos, B. Ludwiczuk, M. Satyanarayanan, OpenFace: A general-purpose face recognition library with mobile applications, Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.
[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. of ICLR 2015, San Diego, CA, USA, May 7-9, 2015.
[5] C. C. Atabansi, T. Chen, R. Cao, X. Xu, Transfer learning technique with VGG-16 for near-infrared facial expression recognition, Journal of Physics: Conference Series 1873 (2021) 012033. doi:10.1088/1742-6596/1873/1/012033.
[6] Y. Liu, Facial expression recognition model based on improved VGGNet, in: 2023 4th International Conference on Electronic Communication and Artificial Intelligence (ICECAI), 2023, pp. 404-408. doi:10.1109/ICECAI58670.2023.10177007.
[7] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, in: Proc. of NIPS'93, 1993, pp. 737-744.
[8] J. Bromley, J. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah, Signature verification using a "siamese" time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence 7 (1993) 25. doi:10.1142/S0218001493000339.
[9] X. Zhou, W. Liang, S. Shimizu, J. Ma, Q.
Jin, Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems, IEEE Transactions on Industrial Informatics 17 (2021) 5790–5798. doi:10.1109/TII.2020.3047675.
[10] G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: Proc. of the Deep Learning Workshop at the 32nd International Conference on Machine Learning, volume 2, 2015.
[11] J. Zhang, K. Shimonishi, K. Kondo, Y. Nakamura, Facial expression change recognition on neutral-negative axis based on Siamese-structure deep neural network, in: Cross-Cultural Design. Product and Service Design, Mobility and Automotive Design, Cities, Urban Areas, and Intelligent Environments Design: 14th International Conference, CCD 2022, Held as Part of the 24th HCI International Conference, HCII 2022, 2022, pp. 583–598.
[12] O. M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference, 2015.
[13] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, International Journal of Computer Vision 128 (2019) 336–359. doi:10.1007/s11263-019-01228-7.
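The voting-based rank estimation whose consistency is reported above can be illustrated with a small sketch. This is our own minimal illustration, not the authors' implementation: `compare(a, b)` is a stand-in for the comparison network's strong-weak judgment (assumed here to return a positive value when `a` shows the stronger smile), and rank 1 denotes the strongest smile.

```python
def estimate_rank_by_voting(target, references, compare):
    """Estimate the ordinal rank of `target` among `references` by voting.

    `compare(a, b)` stands in for the comparison network: it is assumed
    to return a positive value when `a` shows a stronger smile than `b`.
    Rank 1 corresponds to the strongest smile.
    """
    # Each reference judged weaker than the target is one vote
    # toward a stronger (smaller) rank for the target.
    stronger_votes = sum(1 for ref in references if compare(target, ref) > 0)
    return len(references) - stronger_votes + 1


# Toy usage with numeric smile scores in place of face images:
numeric_compare = lambda a, b: a - b
refs = [0.1, 0.3, 0.5, 0.7, 0.9]  # weakest to strongest
print(estimate_rank_by_voting(0.8, refs, numeric_compare))  # rank 2 of 6
```

With five references, a target can be voted into one of six ordinal slots; the real system aggregates many such network comparisons instead of a single numeric one.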
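The left-right swap consistency reported in Figure 15 can likewise be computed with a short helper. The data layout below is our assumption, not the paper's: for each pair, we record the image an annotator chose as the stronger smile in the original presentation and again with the sides swapped, and consistency is the fraction of pairs judged identically both times.

```python
def swap_consistency(judgments):
    """Fraction of image pairs judged identically in both presentations.

    `judgments` is a list of (choice_original, choice_swapped) tuples,
    where each element identifies the image chosen as the stronger smile.
    """
    agree = sum(1 for original, swapped in judgments if original == swapped)
    return agree / len(judgments)


# Three pairs judged consistently, one inconsistently:
print(swap_consistency([("A", "A"), ("B", "B"), ("A", "B"), ("C", "C")]))  # 0.75
```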
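Finally, the median filtering used to trace the smiling-intensity trend in Figure 17 can be sketched as follows. This is a generic sliding median filter with edge replication; the window size is a parameter of our sketch, as the paper does not state the one it used.

```python
def median_smooth(scores, window=5):
    """Smooth an evaluation-score sequence with a sliding median filter,
    replicating the edge values so the output keeps the input length."""
    half = window // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [sorted(padded[i:i + window])[half] for i in range(len(scores))]


# An isolated spike is suppressed while the overall trend is kept:
print(median_smooth([2, 2, 9, 2, 2, 3, 3], window=3))  # [2, 2, 2, 2, 2, 3, 3]
```

Because the median ignores isolated outliers, single-frame misjudgments by the network do not disturb the transition curve the way they would under mean smoothing.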