<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ordinal Scale Evaluation of Smiling Intensity using Comparison-Based Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kei Shimonishi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kazuaki Kondo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hirotada Ueda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuichi Nakamura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyoto University</institution>
          ,
          <addr-line>Yoshida-honmachi, Sakyo, Kyoto</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ability to evaluate both explicit facial expressions and intermediate expressions is helpful for human monitoring. Since intermediate facial expressions are out of the scope of traditional studies, evaluation scores obtained from traditional facial expression recognition techniques are unreliable. In this paper, we propose an ordinal scale-based evaluation scheme for facial expression based on comparison. Because the proposed framework is based on an ordinal scale, it is challenging to construct a standard scale that can be applied to multiple individuals. However, it is expected to be effective enough to track changes in the facial expressions of a specific individual, including intermediate expressions. We also propose an algorithm for selecting reference images from the data by taking into account the consistencies of the strong-weak relationships between reference images, because the reference image selection significantly impacts the ordinal evaluation. Our approach is evaluated by conducting experiments with human annotators.</p>
      </abstract>
      <kwd-group>
        <kwd>Facial expression recognition</kwd>
        <kwd>Siamese network</kwd>
        <kwd>ranking</kwd>
        <kwd>ordinal scales</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Monitoring an individual’s Quality of Life (QOL) is becoming increasingly important to maintain good mental conditions and to detect early trends of harmful conditions. Because a direct QOL inquiry is burdensome and it is difficult to accurately represent one’s internal state, estimating the internal state from external nonverbal information is desired. Facial expression is one of the modalities that reflect an individual’s internal state, and its expression is influenced by mental condition. For example, when an individual is not feeling well, the same smile may appear weaker than usual. Therefore, monitoring facial expressions in daily life is a crucial clue to estimating an individual’s QOL.</p>
      <p>[Figure 1: An example of a transition curve of smiling intensity in daily life]</p>
      <p>
        The research field of facial expression recognition (FER) has a long history, and it has already been put into practical use in technologies such as smiling shutters. While traditional FER mainly focuses on recognizing whether a clear facial expression is represented or not, from the viewpoint of monitoring in daily life, evaluating the degree of expression for the individual is rather crucial, especially for patients with dementia who have little or no facial expressions. Based on this point of view, this research aims to draw a curve of transitions of the individual’s degree of facial expression, particularly smiling intensity, as shown in Figure 1.
      </p>
      <p>Though traditional FER algorithms seem able to evaluate intermediate facial expressions as the probability that a specific facial expression is represented, the probability values are not very reliable, especially for evaluating intermediate expressions. This is because intermediate facial expressions were out of the scope of traditional studies; learning is likely to output a value close to the binary value of either no expression (0) or an expression (1). As a result, for example, when the degree of smile expression is estimated for a series of facial expressions, the value may change abruptly over time, as shown in Figure 2. In addition, it is also difficult for a machine learning algorithm to directly learn intermediate facial expressions, since it is difficult even for humans to give appropriate absolute values for intermediate facial expressions.</p>
      <p>
        Kondo et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a network for recognizing smiling based on “comparison” to address the issues of recognizing intermediate facial expressions. Their work is based on the assumption that the problem of relatively evaluating which of two images represents more smiling by comparing the two images is easier than absolutely evaluating a degree of smiling from only one image.
      </p>
      <p>By borrowing this comparison-based idea to evaluate
facial expressions, we propose an approach to evaluate
smiling intensity with an ordinal scale. The basic idea of
this approach is that if we have multiple reference face
images for a specific individual and a method for
comparing facial expressions, we can evaluate the smiling
intensity of a new image of the individual through
pairwise comparison with the reference images, as shown in
Figure 3.</p>
      <p>Since the expression ratings in this method are based
on an ordinal scale, the degree of each rating is not the
same for multiple individuals. However, this ordinal
scale-based approach may satisfy our need to capture
changes in facial expressions for each individual.</p>
      <p>In addition, reference image selection is crucial for this ordinal-based evaluation because the reference images constitute the evaluation space for facial expressions. Therefore, we also propose an algorithm for selecting reference images from a large number of face images of each individual based on the consistencies of the comparison results within the images.</p>
      <p>In summary, the contributions of this paper are as follows:
• We propose an approach to evaluating intermediate smiling intensity by ordinal scales based on comparisons.
• We propose an algorithm for selecting appropriate reference images to construct a reliable evaluation space.</p>
      <sec id="sec-1-1">
        <title>We briefly introduce related work in the next section.</title>
        <p>Then, we introduce an approach to evaluate facial
expressions by ordinal scales and an algorithm of reference
image selection. We evaluate our approach and
algorithm with human annotators, and finally, we conclude
our research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Facial expression recognition</title>
        <sec id="sec-2-1-1">
          <title>Facial expression recognition is widely utilized in several ifelds. Traditional studies mainly focused on determining whether a specific expression is represented or not.</title>
          <p>
            2.1.1. Facial Action Coding Systems
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] is a framework
proposed by Ekman et al. that classifies a face into
several parts (Action Units; AUs) based on the basic action
units of individual muscles and describes facial
expressions as a combination of these AU actions. Many facial
expression recognition applications have used FACS as
features, and for example, OpenFace [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] can analyze
multiple facial expressions in near real-time by automatically
recognizing the actions of AUs.
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Deep neural network based approach</title>
          <p>
            Although the FACS-based FER approach has been
successful, it has the limitation that the final results are
affected by the accuracy of FACS detection. This limitation can become a problem, especially when trying to capture subtle differences in facial expressions, because the effect of observation noise cannot be ignored. On the other hand, an end-to-end approach with a deep neural network can be expected to reduce the effect of such observation noise by eliminating the necessity of
explicit feature detection. For example, VGGNet [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is a relatively traditional deep neural network, but it is known to extract human facial features well, and recent FER research has also utilized VGGNet [
            <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Siamese structure-based recognition technique</title>
        <p>
          Siamese network [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a deep neural network for metric learning. It takes two inputs and returns the distance between them. By applying the same structure and the same weights to the feature extraction layers of the two inputs and feeding the distance between the extracted features to the loss function, the network can learn a distance space. The Siamese network determines whether two inputs are similar or different and has been applied to handwritten signature
recognition [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and used as a framework for anomaly
detection [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. As one of its features, it is known as a
network that can be trained from a small number of training
data compared to conventional networks that perform
multi-valued discrimination and regression [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Kondo et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] proposed an approach to the
evaluation of facial expressions based on comparison inspired
by the Siamese structure. Their approach compares two facial images and returns which image represents more smiling, and they showed that the approach has the potential to distinguish subtle facial expression differences. In addition, Zhang et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] extended their work
from a positive-neutral direction to a negative-neutral
direction.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Comparison-based smiling evaluation by ordinal scales</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the proposed framework</title>
        <p>
          As introduced in the Introduction, the basic idea of our approach is a comparison-based evaluation. Kondo et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] developed a Siamese-based smiling recognition network that takes two face images as input and recognizes which one is expressing smiling more. By borrowing this idea, once we develop a network that can determine which of two images represents more smiles, and if we have multiple reference images, we can evaluate the smiling intensity of a new image through pairwise comparison with the reference images, as also introduced in the Introduction.
        </p>
        <p>When it comes to determining smiling intensity based on ordinal scales, although all the comparison results are ideally consistent, the results are sometimes inconsistent due to the ambiguity of slightly different face images. Therefore, we apply a voting-based evaluation and determine smiling scores by merging multiple comparison results. In addition, we propose an algorithm to select appropriate reference images to reduce the ambiguity between reference images in the following section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. A network for facial expression comparison</title>
        <sec id="sec-3-2-1">
          <title>In this paper, we defined the recognition task as a simple</title>
          <p>
            two-category classification problem (i.e., determining
which of two input images represents the greater degree
of smiling) and construct a Siamese-based network to
recognize smiling similar to the network Kondo et al.
have developed [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
        <p>
          Figure 4 shows the structure of the proposed network, which accepts two input images and returns two likelihood values corresponding to the ascension and descension labels relative to the degree of smiling. We employed the CNN component of VGG16 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and two fully connected layers with rectified linear units, a 0.25 dropout rate, and SoftMax in the proposed method. The ground-truth likelihood values for an input image pair were represented as a two-element one-hot vector, with the element corresponding to the ground-truth label set to 1 and the other element set to 0. We used categorical cross-entropy loss to optimize the network parameters, as follows:

          L_{fwd} = - \sum_{i} \{ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \},   (1)

          where i = {0, 1}, y_i, and \hat{y}_i denote the ascension and descension labels relative to the degree of smiling, the ground-truth label, and the predicted likelihood values, respectively.
        </p>
        <p>
          The previously proposed network by Kondo et al. was not designed to consider the order of inputs, resulting in instances where swapping the order of two inputs led to contradictory outputs. To address this issue, we input a permuted version of the two features extracted from the two input images by the CNN component into the fully connected layer in the latter stage and calculate the categorical cross-entropy loss of the inverted input, L_{inv}, in the same way as L_{fwd}, as shown by the red arrows in Figure 4. Also, a loss for the consistency of these two types of input is calculated as

          L_{cons} = 1 - \{ P^{f}_{asc}(I_a, I_b) \cdot P^{i}_{desc}(I_a, I_b) + P^{f}_{desc}(I_a, I_b) \cdot P^{i}_{asc}(I_a, I_b) \},   (2)

          where P_{asc}(I_a, I_b) and P_{desc}(I_a, I_b) represent the probabilities that the degree of smiling of image I_a is larger or smaller than that of image I_b, respectively; in other words, P(S(I_a) &gt; S(I_b)) and P(S(I_a) &lt; S(I_b)), where S(I) represents the degree of smiling of image I. Also, the superscripts f and i indicate the likelihoods of the forward comparison stream and the inverse comparison stream, respectively.
        </p>
        <p>
          In total, our network is trained to decrease the following loss function:

          L = L_{fwd} + L_{inv} + L_{cons}.   (3)

          Here, we expected that the CNN and the fully connected components would be trained to compare the extracted features and to project the results onto the likelihood values of the ascension and descension labels, respectively.
        </p>
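        <p>The following minimal PyTorch-style sketch illustrates how the shared feature extractor, the order-aware fully connected head, and the losses of Eqs. (1)-(3) fit together. It is an illustration rather than the authors’ released code: the backbone, the layer sizes, the label encoding, and the product form assumed for the consistency term are our assumptions.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComparisonNet(nn.Module):
    """Siamese comparison network: a shared CNN extracts features from both
    images, and a fully connected head over the concatenated pair outputs
    two likelihoods ('ascent', 'descent')."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.backbone = backbone                       # shared CNN (e.g., VGG16 features)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(256, 2),
        )

    def forward(self, img_a, img_b):
        fa = torch.flatten(self.backbone(img_a), 1)    # same weights for both inputs
        fb = torch.flatten(self.backbone(img_b), 1)
        logits_fwd = self.head(torch.cat([fa, fb], dim=1))   # forward comparison stream
        logits_inv = self.head(torch.cat([fb, fa], dim=1))   # inverse (permuted) stream
        return logits_fwd, logits_inv

def comparison_loss(logits_fwd, logits_inv, label):
    """label: 0 = 'ascent', 1 = 'descent' for the forward input order."""
    p_fwd = F.softmax(logits_fwd, dim=1)
    p_inv = F.softmax(logits_inv, dim=1)
    l_fwd = F.cross_entropy(logits_fwd, label)         # Eq. (1)
    l_inv = F.cross_entropy(logits_inv, 1 - label)     # swapping the inputs flips the label
    # Consistency between the two streams, assumed product form of Eq. (2):
    l_cons = 1.0 - (p_fwd[:, 0] * p_inv[:, 1] + p_fwd[:, 1] * p_inv[:, 0]).mean()
    return l_fwd + l_inv + l_cons                      # Eq. (3)
        </preformat>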
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Voting-based evaluation</title>
        <p>Since the reference images may include some ambiguity between neighboring images, it is difficult to directly determine the degree of smiling of a new target image within a reference image set. Therefore, we apply a voting technique to determine the final rank of the image. The algorithm votes for possible ranks using the result of each comparison between the reference images and the target image. As a result, the most likely rank should have the maximum number of votes.</p>
        <p>In particular, the procedure is as follows. Suppose that we have M reference images with a known order of degree of smiling, i.e., S(R_i) &gt; S(R_j), ∀i &lt; j. A new target image I_t is compared to all reference images, and the likelihood that the degree of smiling of the target image is larger than that of a reference image, P_{asc}(I_t, R_k), and the likelihood that it is lower, P_{desc}(I_t, R_k), are obtained for all reference images (k ∈ {1, . . . , M}). Because, if S(R_k) &lt; S(I_t) is estimated, the smile rank of I_t is estimated to be larger than k, large values are voted to the ranks larger than k. In practice, we add the likelihood values of “ascend” and “descend” to the ranks lower than and higher than k, respectively.</p>
        <p>Simply thinking, the degree of smiling relative to the reference images can be determined by searching for the position whose score is maximum. Here, the position r̂ can be derived as

        r̂ = arg max_r v(r),   (4)

        where v(r) denotes the total likelihood voted to rank r.</p>
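        <p>As a rough sketch of the voting procedure above (assuming a helper compare(a, b) that returns the “ascent”/“descent” likelihoods of the trained network, with the vote directions chosen to be consistent with the description in this subsection):</p>
        <preformat>
import numpy as np

def estimate_rank(target, references, compare):
    """Voting-based ordinal evaluation of a target image.

    references: reference images sorted from the strongest to the weakest
    smile. compare(a, b) returns (p_asc, p_desc), the likelihoods that
    image a smiles more / less than image b.
    """
    n = len(references)
    votes = np.zeros(n + 1)               # candidate positions between/around references
    for k, ref in enumerate(references):
        p_asc, p_desc = compare(target, ref)
        votes[:k + 1] += p_asc            # target smiles more: vote for positions above ref
        votes[k + 1:] += p_desc           # target smiles less: vote for positions below ref
    return int(np.argmax(votes))          # Eq. (4): the position with the maximum score
        </preformat>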
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Reference image selection</title>
      <p>To apply the voting-based ordinal-scale evaluation described above, we first need to construct an evaluation space with several reference images. Since the proposed approach utilizes ordinal scales, the construction of the evaluation space is crucial for the capability of the approach. Although a straightforward way is to utilize all the face data as reference images, an evaluation space constructed from very similar or subtly different images is unreliable due to the ambiguity of these images.</p>
      <p>In this paper, we first consider all the data as baseline images, take pair-wise comparisons to sort all the data in a dataset, and construct a baseline ranking. Then, we select several images from the baseline ranking as reference images and quantize the evaluation space by taking consistency into account to address the issues caused by ambiguity.</p>
      <sec id="sec-4-1">
        <title>4.1. Baseline ranking construction</title>
        <p>Figure 6 (a) shows a comparison table of the results of all pair-wise comparisons among the baseline images. Each color shows the probability that a target image has a stronger smile than a reference image, i.e., P_{asc}(I_t, I_r). Blue shows a pair whose target image has a stronger smile than the reference image, i.e., P_{asc}(I_t, I_r) &gt; P_{desc}(I_t, I_r). In contrast, the red area shows a pair whose reference image has a stronger smile than the target image, i.e., P_{asc}(I_t, I_r) &lt; P_{desc}(I_t, I_r). The white area represents that the target and the reference images represent similar facial expressions.</p>
        <p>[Figure: vertical axis “Reference image”; legend “Consistent” / “Inconsistent”; (a) consistency of strong-weak relations in the baseline ranking images, (b) consistency square of neighboring images]</p>
        <p>By sorting the baseline images based on the sum of the probability values in each column of this table, a baseline ranking considering the consistency of the strong-weak relationships can be constructed (Figure 6 (b)). In particular, suppose we have N baseline images {I_1, . . . , I_N} in total, and denote the images sorted in descending order of smiling intensity as {B_1, . . . , B_N}. Since the strong-weak relationships between each image B_j and the other images B_ĵ, ĵ ∈ {1, . . . , N}, are calculated as the probability values P_{asc} and P_{desc}, the total consistency value of the baseline ranking is derived as

        C = \sum_{j} \{ \sum_{ĵ ∈ {1, . . . , j−1}} c_{j,ĵ} + \sum_{ĵ ∈ {j+1, . . . , N}} c_{j,ĵ} \},   (5)

        where the pairwise consistency c_{j,ĵ} is defined in Eq. (6) below. By maximizing this total consistency over possible orderings, the baseline ranking images (B_1, . . . , B_N) can be obtained. From now on, the subscript j will be used to sort the images in descending order of smiling degree.</p>
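        <p>A minimal sketch of the baseline ranking construction follows. The column-sum sorting key mirrors the description above; the array layout and tie handling are assumptions.</p>
        <preformat>
import numpy as np

def build_baseline_ranking(p_asc):
    """p_asc[i, j]: probability that image i smiles more strongly than image j,
    i.e., one cell of the comparison table of Figure 6 (a)."""
    weakness = p_asc.sum(axis=0)      # column sum: how strongly the others beat image j
    order = np.argsort(weakness)      # small column sum = strong smile
    return order                      # order[0] is the strongest smile (B_1)
        </preformat>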
        <p>An example of the consistency of the strong-weak relations in this rearranged table is shown in Figure 6 (c), obtained by replacing P_{asc}(B_j, B_ĵ) with P_{desc}(B_j, B_ĵ) when S(B_j) &lt; S(B_ĵ). We here denote the probabilities c_{j,ĵ} indicating this consistency as follows:

        c_{j,ĵ} = P_{asc}(B_j, B_ĵ) if S(B_j) &gt; S(B_ĵ), and c_{j,ĵ} = P_{desc}(B_j, B_ĵ) if S(B_j) &lt; S(B_ĵ).   (6)

        Ideally, all cells would be blue, i.e., the consistency would be nearly equal to 1. However, due to the ambiguity of comparison results for similar facial expressions, there is also ambiguity in the consistency between neighboring images in the baseline ranking. Therefore, reference image selection is important to construct an appropriate evaluation space.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Reference image selection</title>
        <p>An important factor in selecting reference images is the consistency of the strong-weak relationships within the reference images. That is, when the consistency table is calculated in the same way as the bottom-right figure of Figure 6, a smaller red and white area is a better sign of reference image selection.</p>
        <p>To realize that, we focus on a square region of neighboring images, as shown in Figure 7, and call this square a consistency square. Suppose the differences between images are significant and a strong-weak relationship is evident in the images. In that case, the consistency values in the square are also expected to be large, i.e., the cells in the consistency square become blue. In contrast, when images are similar, and therefore the difference between images is ambiguous, the consistency values in the consistency square become low, i.e., the cells in the square become white and red.</p>
        <p>The basic idea of building a consistent evaluation space is to quantize images with low consistency values in the consistency square into a single class. As a result, the ambiguity between these images becomes “don’t care” in the evaluation space, and the consistency of the evaluated values becomes significant. That is, it is good to select images for which the total consistency value summed over the consistency squares, as shown on the right of Figure 7, becomes low. In addition, neighboring images in the baseline ranking should not be selected as reference images. In other words, to select good reference images, the evaluation space should be divided evenly. To realize that, we select a group of reference images so that the sum of the areas of the consistency squares becomes small. To sum up, it is better to choose a group of images for which both the sum of consistency values and the sum of areas within the consistency squares are small. Here, selecting a group of images with low consistency values within the consistency squares is equivalent to selecting a group with high inconsistency. In summary, a group of reference images should be selected to maximize the total inconsistency in the consistency squares divided by the sum of the areas of the consistency squares.</p>
        <p>In practice, the procedure of reference image selection is as follows. Suppose there are N images in total, and we want to select one image as a reference image. At first, the baseline ranking is constructed as introduced in the previous subsection, and the consistency values c_{j,ĵ} (j, ĵ ≤ N) are obtained for all pairs in the ranking. When the evaluation space is divided into two with image B_k (k &lt; N), the sum of the inconsistency values of the two spaces divided by the sum of their areas is calculated as

        C^{(2)}_k = ( \sum_{j,ĵ ≤ k} (1 − c_{j,ĵ}) + \sum_{j,ĵ &gt; k} (1 − c_{j,ĵ}) ) / ( k^2 + (N − k)^2 ).   (7)

        By searching for the position k whose C^{(2)}_k is maximum, the best reference image B_k that divides the evaluation space into two can be obtained. Similarly, by calculating the sum of the inconsistencies in the consistency squares divided by the sum of the areas for different division numbers M + 1, M reference images can be obtained.</p>
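        <p>A brute-force sketch of this selection criterion is shown below, assuming that the inconsistency of a pair is 1 − c_{j,ĵ} and that the area of a consistency square of size m is m². The dynamic programming scheme mentioned next replaces the exhaustive search over cut positions.</p>
        <preformat>
import itertools
import numpy as np

def split_score(c, cuts):
    """Assumed form of Eq. (7), generalised to several cuts: total inconsistency
    (1 - c) inside the consistency squares divided by their total area.
    c[j, k] is the pairwise consistency of ranked images j and k."""
    n = c.shape[0]
    bounds = [0, *cuts, n]
    inconsistency, area = 0.0, 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        block = c[lo:hi, lo:hi]
        inconsistency += (1.0 - block).sum()
        area += (hi - lo) ** 2
    return inconsistency / area

def select_references(c, n_refs):
    """Choose n_refs cut positions that maximise the score (exhaustive search)."""
    n = c.shape[0]
    return max(itertools.combinations(range(1, n), n_refs),
               key=lambda cuts: split_score(c, cuts))
        </preformat>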
        <p>When it comes to calculating these values, we apply a scheme of dynamic programming to reduce the calculation cost.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <sec id="sec-5-1">
        <title>We conducted an experiment to evaluate the following things:</title>
        <p>training data in this paper. In particular, we manually
an• How a proposed network can evaluate image pair. notated segments that we thought the degree of smiling
• How appropriately reference images can be se- ascended or descended monotonically. We then picked
lected regarding consistency of both network and each segment’s start and end frame to construct one pair
human annotators’ evaluations. with its label. The number of image pairs of each dataset
• How the selected reference images evaluate face were 216, 174, and 123, respectively. Also, all face images
images. in these pairs were utilized as baseline images. That is,
the size of the dataset is twice the size of the image pairs;
432, 348, 246, respectively.</p>
        <sec id="sec-5-1-1">
          <title>5.1. Dataset construction</title>
          <p>At first, the face image dataset was constructed by capturing participants’ face images. We conducted two types of experiments to construct datasets with different situations. In the first type of dataset, we asked a participant to sit in front of the camera and to listen to a funny radio program. In the second type of dataset, we asked a participant to sit in front of a laptop PC and play a simple game. We captured facial images of these participants during the experiments. The second experiment was still experimental but closer to a natural scene than the first one. Each dataset was constructed from only one participant because our focus was to build a model to evaluate each individual. We collected two datasets of the first type and one dataset of the second type.</p>
          <p>Then, we added labels between image pairs that showed which of the two images expressed more smiles. The annotation between images with a slight difference in the degree of smiling was difficult, even for humans. It might cause a mistake in giving the correct labels. Therefore, we utilized image pairs with a clear difference as training data in this paper. In particular, we manually annotated segments in which we thought the degree of smiling ascended or descended monotonically. We then picked each segment’s start and end frame to construct one pair with its label. The numbers of image pairs of the datasets were 216, 174, and 123, respectively. Also, all face images in these pairs were utilized as baseline images. That is, the size of each dataset is twice the number of image pairs: 432, 348, and 246, respectively.</p>
        </sec>
        <sec id="sec-5-2">
          <title>5.2. Evaluation scheme</title>
          <p>The procedure of the evaluation consisted of the following four steps: (1) the proposed network was trained for each individual on the collected data and was evaluated by a cross-validation scheme; (2) we constructed the baseline ranking and evaluated the voting-based algorithm by determining the rank of each image within the baseline ranking; (3) reference images were selected from the baseline images as proposed in the previous section and evaluated by human annotators regarding how consistent they were; (4) we confirmed the smiling intensity of face images in the evaluation space constructed by the reference images.</p>
          <p>
            As for training our network, we utilized the pre-trained feature extraction layers of VGG-Face [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], which was trained on millions of face images for person identification, and trained only the fully connected layers.
          </p>
          <p>Regarding the evaluation of the voting-based algorithm, the rank of each baseline image was determined within the baseline ranking itself. In this evaluation, the ground truth of the rank of each baseline image was given as its original rank in the baseline ranking.</p>
          <p>As for evaluating the selected reference images, nine reference images were selected from the baseline images. Then, human annotators were asked to evaluate which of the two images represented more smiling for pairs of neighboring images in the reference images. The images up to the third nearest neighbor were considered a pair, and all image pairs were annotated twice by swapping the left and right sides of the comparison image. After comparing the image pairs within each dataset, the participants moved on to the next dataset. The order of the evaluated image pairs was randomized for each dataset, but the order of the datasets was constant. We evaluated the consistency of the reference images by how accurately and consistently the annotators evaluated the image pairs. A group of 9 images regularly extracted every N/10 from the baseline images {B_{N/10}, . . . , B_{9N/10}} was used as the reference images for comparison. Seven participants between the ages of 21 and 27 (6 male and 1 female) were recruited as annotators.</p>
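          <p>A minimal PyTorch-style sketch of this transfer-learning setup (freezing the pre-trained feature extractor and training only the fully connected head) is shown below; the optimizer choice and learning rate are assumptions, and ComparisonNet refers to the sketch in Section 3.2.</p>
          <preformat>
import torch

def make_optimizer(model, lr=1e-4):
    """Keep the pre-trained face-feature backbone fixed (VGG-Face in the paper)
    and optimise only the fully connected head."""
    for p in model.backbone.parameters():
        p.requires_grad = False                    # frozen feature extraction layers
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)      # assumed optimiser for the FC head
          </preformat>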
        </sec>
        <sec id="sec-5-1-2">
          <title>5.3. Results</title>
          <sec id="sec-5-3-1">
            <title>5.3.1. Prediction accuracy</title>
            <p>
              We first show the evaluation results of the trained comparison network in terms of accuracy. In this evaluation, five-fold cross-validation was applied, and the prediction accuracies for the three datasets were 99.5% (215/216), 100% (174/174), and 98.3% (121/123), respectively. Figure 8 shows examples of prediction results with Gradient-weighted Class Activation Mapping (Grad-CAM) [
              <xref ref-type="bibr" rid="ref13">13</xref>
              ]. In each figure, the first row shows the ground-truth label and the estimation results, and the second row shows the regions on which the network focuses for the prediction as a heat map. From these results, we can see that the network returns accurate prediction results by correctly focusing on face regions, including the mouth and eyes, which are well known to correspond to smiling, even with the small dataset.
            </p>
          </sec>
          <sec id="sec-5-3-2">
            <title>5.3.2. Consistency of voting-based evaluation</title>
            <p>Figure 9 shows four examples of estimated ranks of images in the baseline ranking, rank 1, 100, 200, and 400 of dataset 1. The total number of baseline images of dataset 1 was 432 (216 × 2), and an almost correct rank can be estimated for each image by the voting algorithm. Figure 10 shows all pairs of estimated rank and ground-truth rank of this evaluation in dataset 1. These results show the consistency and effectiveness of the voting-based algorithm, as it predicted almost consistent values for all baseline images.</p>
            <p>[Figure 9: Selected reference images of dataset 1. More smiling images are located on the left side.]</p>
            <p>[Figure 10: Consistency of estimated rank and original rank in baseline ranking]</p>
          </sec>
          <sec id="sec-5-3-3">
            <title>5.3.3. Selected reference images</title>
            <p>Figure 11 shows the reference images selected by the proposed algorithm and those picked up equally from the baseline ranking. The consistency table corresponding to this result for dataset 1 is shown in Figure 12. In this figure, the green line shows where the algorithm divides the baseline ranking. These results show that there still appears to be some ambiguity between adjacent images, even with the proposed approach. However, it appears to be reduced compared to a group of images acquired at regular intervals.</p>
            <p>[Figure 11: (a) Selected reference images by the proposed algorithm; (b) selected reference images regularly picked up from the baseline ranking. Images range from the strongest to the weakest smile.]</p>
            <p>The consistency table calculated from these selected reference images is shown in Figure 13. We can see that almost all the cells are blue. This result shows that the ambiguities within the reference images are small.</p>
            <p>Figure 14 and Figure 15 show the quantitative evaluation results of the selected reference images by the annotators. In each figure, “proposed” and “baseline” represent the results for the images up to the third nearest neighbor reference images of the proposed algorithm and the baseline algorithm, respectively, and “proposed_adjacent” and “baseline_adjacent” represent the results for only the nearest neighbor reference images; that is, the difficulty of the evaluation becomes higher. Figure 14 shows the prediction accuracy of the proposed network, where the order given by the annotators’ evaluation results is considered the ground truth of the prediction. Therefore, this result also shows the correlation between the network predictions and human perceptions. Figure 15 shows the consistency of each participant’s evaluation. In particular, it shows how much the same evaluation was given when the same image pair was displayed with the left and right sides swapped. This high consistency indicates a low degree of ambiguity between image pairs. In almost all cases, the reference images selected by the proposed algorithm obtain higher accuracies and higher consistencies. Since the smiles expressed during the experimental time were quantized into ten levels and the maximum value of the smile was not very high, both methods have a certain degree of similarity between neighboring reference image pairs. Therefore, evaluation by humans may be somewhat difficult even with the proposed method. However, even in such a situation, we can confirm that the proposed method selects more consistent reference images.</p>
          </sec>
        </sec>
      <sec id="sec-5-2">
        <title>Examples of face images evaluated by selected reference</title>
        <p>images and the proposed network are shown in Figure 16.</p>
        <p>Figure 12: Consistency table of dataset 1. The green line Since it is sometimes hard to qualitatively evaluate two
shows where the algorithm divides baseline ranking. adjacent images in a row, the four reference images skip
one rank at a time. The images with a smile level one
class lower than the reference image are listed, and each
row shows the same evaluation value. The images on the
images selected by the proposed algorithm obtain higher left side of the figure are recognized as having a higher
accuracies and higher consistencies. Since the smiles ex- degree of smiling. This result confirms that the proposed
pressed in the experimental time were quantized into ten method efectively evaluates the degree of smiling within
levels and the maximum value of the smile was not very the ordinal scale.
high, both methods have a certain degree of similarity be- Finally, a part of the transition of the smiling intensity
tween the neighboring reference image pairs. Therefore, during the experiment is shown in Figure 17. In this
evaluation by humans may be somewhat dificult even result, an evaluation score was smoothed by the median
with the proposed method. However, even in such a situ- iflter to trace the trend of transitions. We can see that
ation, we can confirm that the proposed method selects
Evaluated
images</p>
        <p>Strongest smile</p>
        <p>Weakest smile
the participant smiles several times in this period. It can
be seen that smiles of slightly stronger intensity than
the middle level occurred several times in succession in
the first half of this period. In comparison, smiles of
considerably stronger intensity occurred with a short
interval in the second half of this period.</p>
      </sec>
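      <p>A small sketch of the median-filter smoothing used for the transition curve (the kernel size is an assumption):</p>
      <preformat>
import numpy as np
from scipy.signal import medfilt

def smooth_intensity(ranks, kernel_size=9):
    """Median-filter the sequence of estimated smiling ranks to trace the
    trend of transitions, as in Figure 17 (kernel size assumed)."""
    return medfilt(np.asarray(ranks, dtype=float), kernel_size=kernel_size)
      </preformat>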
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose an approach to evaluate the
degree of smiling of individuals by ordinal scales based
on multiple comparisons for the purpose of monitoring
individuals. Supposing that we have enough face image data of each individual, we also propose an algorithm for selecting appropriate reference images for the ordinal evaluation.</p>
      <p>Experimental results show that our ordinal scale-based
evaluation can successfully give the degree of not only
clear smiling but also intermediate facial expressions. In
addition, we can see that an evaluation space constructed from the reference images selected by our algorithm is more consistent and, therefore, considered to be reasonable.</p>
      <p>One of the future works is to map the proposed and
constructed ordinal scale to some physical index.
Although this paper proposed a method of selecting
reference images that are somewhat reasonable when
evaluated by humans, the validity of the scale would be
improved if it could be mapped to some physical index.
For example, by measuring the myoelectricity of facial
muscles, the degree of muscle activity could be used as
an index. In addition, another future work is to apply this technique to people whose facial expressions do not change much, e.g., dementia patients, as we described in the introduction section.</p>
    </sec>
    <sec id="sec-7">
      <title>Ethics</title>
      <p>Our method aims to monitor the daily health conditions
of a specific individual by evaluating the smiling intensity
using a model trained specifically for the individual’s
facial images. Since data for model training and smiling
intensity evaluation can be collected and processed at
terminals installed in each individual’s environment, it
is expected to reduce the risk of leakage of particularly
strong personal information such as facial images being
stored in the cloud in practical applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kondo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satoh</surname>
          </string-name>
          ,
          <article-title>Siamese-structure deep neural network recognizing changes in facial expression according to the degree of smiling</article-title>
          ,
          <source>in: Proc. of ICPR2020</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4605</fpage>
          -
          <lpage>4612</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICPR48806.
          <year>2021</year>
          .
          <volume>9411988</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <article-title>Facial action coding system (FACS)</article-title>
          ,
          <source>A Human Face</source>
          ,
          <year>2002</year>
          . URL: https://ci.nii.ac.jp/naid/10025007347/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Amos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ludwiczuk</surname>
          </string-name>
          , M. Satyanarayanan,
          <article-title>OpenFace: A general-purpose face recognition library with mobile applications</article-title>
          ,
          <source>Technical Report, CMUCS-16-118</source>
          , CMU School of Computer Science,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>in: Proc. of ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Atabansi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Transfer learning technique with vgg-16 for near-infrared facial expression recognition</article-title>
          ,
          <source>Journal of Physics: Conference Series</source>
          <year>1873</year>
          (
          <year>2021</year>
          )
          <article-title>012033</article-title>
          . URL: https://dx. doi.org/10.1088/
          <fpage>1742</fpage>
          -
          <lpage>6596</lpage>
          /
          <year>1873</year>
          /1/012033. doi:
          <volume>10</volume>
          . 1088/
          <fpage>1742</fpage>
          -
          <lpage>6596</lpage>
          /
          <year>1873</year>
          /1/012033.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Facial expression recognition model based on improved vggnet</article-title>
          ,
          <source>in: 2023 4th International Conference on Electronic Communication and Artificial Intelligence (ICECAI)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>408</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICECAI58670.
          <year>2023</year>
          .
          <volume>10177007</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bromley</surname>
          </string-name>
          , I. Guyon,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , E. Säckinger,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Signature verification using a "siamese" time delay neural network</article-title>
          ,
          <source>in: Proc. of NIPS'93</source>
          ,
          <year>1993</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bromley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bentz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , I. Guyon,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lecun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sackinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Signature verification using a "siamese" time delay neural network</article-title>
          ,
          <source>International Journal of Pattern Recognition and Artificial Intelligence</source>
          <volume>7</volume>
          (
          <year>1993</year>
          )
          <article-title>25</article-title>
          . doi:
          <volume>10</volume>
          .1142/ S0218001493000339.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems</article-title>
          ,
          <source>IEEE Transactions on Industrial Informatics</source>
          <volume>17</volume>
          (
          <year>2021</year>
          )
          <fpage>5790</fpage>
          -
          <lpage>5798</lpage>
          . doi:
          <volume>10</volume>
          .1109/TII.
          <year>2020</year>
          .
          <volume>3047675</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Siamese neural networks for one-shot image recognition</article-title>
          ,
          <source>in: Proc. of the deep learning workshop in the 32nd International Conference on Machine Learning</source>
          , volume
          <volume>2</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shimonishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kondo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <article-title>Facial expression change recognition on neutralnegative axis based on siamese-structure deep neural network, in: Cross-Cultural Design. Product and Service Design, Mobility</article-title>
          and
          <string-name>
            <given-names>Automotive</given-names>
            <surname>Design</surname>
          </string-name>
          , Cities,
          <string-name>
            <given-names>Urban</given-names>
            <surname>Areas</surname>
          </string-name>
          , and Intelligent Environments Design: 14th International Conference, CCD 2022,
          <article-title>Held as Part of the 24th HCI International Conference</article-title>
          ,
          <string-name>
            <surname>HCII</surname>
          </string-name>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Deep face recognition</article-title>
          ,
          <source>in: British Machine Vision Conference</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          , Grad-cam:
          <article-title>Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>128</volume>
          (
          <year>2019</year>
          )
          <fpage>336</fpage>
          -
          <lpage>359</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11263-019-01228-7.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>