<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context-Dependent Models for Predicting and Characterizing Facial Expressiveness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Jeffrey M. Girard</string-name>
          <email>jgirard2g@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louis-Philippe Morency</string-name>
          <email>morency@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA 15213</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, extensive research has emerged in affective computing on topics like automatic emotion recognition and determining the signals that characterize individual emotions. Much less studied, however, is expressiveness: the extent to which someone shows any feeling or emotion. Expressiveness is related to personality and mental health and plays a crucial role in social interaction. As such, the ability to automatically detect or predict expressiveness can facilitate significant advancements in areas ranging from psychiatric care to artificial social intelligence. Motivated by these potential applications, we present an extension of the BP4D+ dataset [27] with human ratings of expressiveness and develop methods for (1) automatically predicting expressiveness from visual data and (2) defining relationships between interpretable visual signals and expressiveness. In addition, we study the emotional context in which expressiveness occurs and hypothesize that different sets of signals are indicative of expressiveness in different contexts (e.g., in response to surprise or in response to pain). Analysis of our statistical models confirms our hypothesis. Consequently, by looking at expressiveness separately in distinct emotional contexts, our predictive models show significant improvements over baselines and achieve comparable results to human performance in terms of correlation with the ground truth.</p>
      </abstract>
      <kwd-group>
        <kwd>expressiveness</kwd>
        <kwd>emotion</kwd>
        <kwd>facial expression</kwd>
<kwd>affective computing</kwd>
        <kwd>machine learning</kwd>
        <kwd>computer vision</kwd>
        <kwd>statistical models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Although humans constantly experience internal reactions to the stimuli around
them, they do not always externally display or communicate those reactions. We
refer to the degree to which a person does show his or her thoughts, feelings,
or responses at a given point in time as expressiveness. That is, a person
being highly expressive at a given moment can be said to be passionate or even
dramatic, whereas a person being low in expressiveness can be said to be stoic
or impassive. In addition to varying moment-to-moment, a person's tendency
toward high or low expressiveness in general can also be considered a trait or
disposition [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In this paper, we study momentary expressiveness, or expressiveness at a
given moment in time. This quantity has not been previously explored in detail.
We have two primary goals: (1) to automatically predict momentary
expressiveness from visual data and (2) to learn and understand interpretable signals of
expressiveness and how they vary in different emotional contexts. In the following
subsections, we motivate the need for research on these two topics.
Prediction of Expressiveness The ability to automatically sense and
predict a person's expressiveness is important for applications in artificial social
intelligence and especially healthcare. As an example from artificial social
intelligence: as many customer-facing areas become increasingly automated, the
computers, robots, and virtual agents that now interact with humans must be aware
of expressiveness in order to interact in appropriate ways (e.g., a highly expressive
display might need to be afforded more attention than a less expressive one). With regard to
healthcare, expressiveness holds promise as an indicator of mental health
conditions like depression, mania, and schizophrenia, which have all been linked to
distinct changes in expressiveness. Depression is associated with reduced
expressiveness of positive emotions and increased expressiveness of certain negative
emotions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]; mania is associated with increased overall expressiveness [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]; and
schizophrenia is associated with blunted expressiveness and inappropriate affect,
or expressiveness for the "wrong" emotion given the context [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Because these
relationships are known, predicting an individual's expressiveness can provide a
supplemental measure of the presence or severity of specific mental health
conditions. An automatic predictor of expressiveness therefore has the potential to
support clinical diagnosis and assessment of behavioral symptoms.
Understanding Signals of Expressiveness Intuitively, overall impressions
of expressiveness are grounded in visual signals like facial expression, gestures,
body posture, and motion. However, the signals that correspond to high
expressiveness in a particular emotional context do not necessarily correspond to high
expressiveness in a different emotional context. For example, a person who has
just been startled may express his or her reaction strongly by flinching, which
results in a fast and large amount of body movement. On the other hand, a
person who is in pain may show that feeling by moving slowly and minimally
because he or she is attempting to regulate that emotion. In the former
scenario, quick movement corresponds to high expressiveness, whereas in the latter
scenario, quick movement corresponds to low expressiveness.
      </p>
      <p>We aim to formalize the relationship between interpretable visual signals and
expressiveness through statistical analysis. Furthermore, we hypothesize that the
specific signals that contribute to expressiveness vary somewhat under different
contexts and seek to confirm this hypothesis by modeling expressiveness in
different emotional states.
Contributions To realize our goals, we must collect data about how
expressiveness is perceived in spontaneous (i.e., not acted) behavior and develop techniques
to analyze, model, and predict it. As such, we address the gap in the literature
through the following contributions.</p>
      <p>
        We introduce an extension of the BP4D+ emotion elicitation dataset [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
with human ratings of central aspects of expressiveness: response strength,
emotion intensity, and body and facial movement. We also describe a method
for generating a single expressiveness score from these ratings using a latent
variable representation of expressiveness.
      </p>
      <p>We present statistical and deep learning models that are able to predict
expressiveness from visual data. We perform experiments on a test set of
the BP4D+ extended dataset, establish baselines, and show that our models
are able to significantly outperform those baselines and for some metrics
even approach human performance, particularly when taking context into
consideration.</p>
<p>We present context-specific and context-agnostic statistical models that
reveal interpretable relationships between visual signals and expressiveness. We
conduct an analysis of these relationships over three emotional contexts
(startle, pain, and disgust) that supports our hypothesis that the set of visual
signals that are important to expressiveness varies depending on the
emotional context.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Although little prior work has been conducted on direct prediction of
expressiveness, advances have been made in the adjacent field of emotion recognition.
Likewise, within the scope of psychology, there exists a substantial body of
literature dedicated to determining the visual features that characterize different
emotions; however, to our knowledge, little to no similar work has been
conducted on the visual features that characterize how strongly those emotions
are shown (i.e., expressiveness). We describe the current state of these areas of
research, as we draw from this related work to define our own approaches to
predicting and characterizing expressiveness.</p>
      <p>
Emotion Recognition Because the task derives from similar visual features
(facial landmarks and movement, for example), advancements in deep learning
for the field of emotion recognition are highly informative and provide much of
the guiding direction for our predictive deep learning models. A number of
architectures have achieved high accuracy for multiclass emotion classification in
a variety of settings, including still images, videos, and small datasets. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] used
an ensemble of CNNs with either log-likelihood or hinge loss to classify images
of faces from movie stills as belonging to 1 of 7 basic emotions. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] extended a
similar architecture to accurately predict emotions even with little task-specific
training data by performing sequential fine-tuning of a CNN pretrained on
ImageNet, first with a facial expression dataset and then with the target dataset,
a small movie still dataset. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] designed a 3D-CNN that predicts the presence
of an emotion (as opposed to a neutral expression) in each frame of a video.
Finally, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a hybrid approach for emotion recognition in video. After
first training a CNN on two separate datasets of static images of facial emotions,
the authors used the CNN to obtain embeddings of each frame, which they used
as sequential inputs to an RNN to classify emotion.
      </p>
      <p>
        Interpretable Signals of Emotion The three emotional contexts of startle,
pain, and disgust all have well-studied behavioral responses that could serve as
visual signals of emotion and therefore expressiveness. Previous observational
research has found that the human startle response is characterized by blinking,
hunching the shoulders, pushing the head forward, grimacing, baring the teeth,
raising the arms, tightening the abdomen, and bending the knees [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]; the human
pain response is characterized by facial grimacing, frowning, wincing, increased
muscle tension, increased body movement/agitation, and eye closure [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; and
the human disgust response is characterized by furrowed eyebrows, eye closure,
pupil constriction, nose wrinkling, upper lip retraction, upward movement of the
lower lip and chin, and drawing the corners of the mouth down and back [
        <xref ref-type="bibr" rid="ref20 ref25">25,20</xref>
        ].
These responses have notable similarities, such as the presence of grimacing, eye
closure, and withdrawal from an unpleasant stimulus. However, they also have
unique aspects, such as pushing the head forward in startle, increased muscle
tension in pain, and nose wrinkling in disgust.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Expressiveness Dataset</title>
      <p>
        We describe the data collection pipeline and engineering process for the dataset
we used to perform our modeling and analysis of expressiveness.
Video Data The BP4D+ dataset contains video and metadata of 140
participants performing ten tasks meant to elicit ten different emotional states [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
Participants were mostly college-aged (M = 21.0, SD = 4.9) and included a mix
of genders and ethnicities (59% female, 41% male; 46% White, 33% Asian, 11%
Black, 10% Latinx). A camera captured high-definition images of participants'
faces during each task at a rate of 25 frames per second. On average, tasks lasted
44.5 seconds (SD = 31.4).
      </p>
      <p>In this study, we focus on the tasks meant to elicit startle, pain, and disgust.
Example frames from each of these tasks can be found in Figure 1. These tasks
were selected because they did not involve the participant talking; we wanted to
avoid tasks involving talking because the audio recordings are not available as
part of the released dataset. In the startle task, participants unexpectedly heard
a loud noise behind them; in the pain task, participants submerged their hands
in ice water for as long as possible; and in the disgust task, participants smelled
an unpleasant odor similar to rotten eggs.
</p>
      <p>[Figure 1: Example frames from the startle, pain, and disgust tasks, showing clips rated as high expressiveness and as low to moderate expressiveness.]</p>
      <p>Because a person's expressiveness may change moment-to-moment and we
wanted to have a fine-grained analysis, we segmented each task video into
multiple 3-second clips. Because task duration varied between tasks and participants,
and we did not want examples with longer durations to dominate those with
shorter durations, we decided to focus on a standardized subset of video clips
from each task. For the startle task, we focused on the five clips ranging from
second 3 to second 18 as this range would capture time before, during, and after
the loud noise. For the pain task, we focused on the first three clips when pain
was relatively low and the final four clips when pain was relatively high. Finally,
for the disgust task, we focused on the four clips ranging from second 3 to second
15 as this range would capture time before, during, and after the unpleasant odor
was introduced. In a few cases, missing or dropped video frames were replaced
with empty black images to ensure a consistent length of 3 seconds per clip.
Human Annotation We defined expressiveness as the degree to which others
would perceive a person to be feeling and expressing emotion. Thus, we needed
to have human annotators watch each video clip and judge how expressive the
person in it appeared to be. To accomplish this goal, we recruited six
crowdworkers from Amazon's Mechanical Turk platform to watch and rate each video
clip. We required that raters be based in the United States and have approval
ratings of 99% or greater on all previous tasks. Raters were compensated at a
rate approximately equal to $7.25 per hour.</p>
<p>Because raters may have different understandings of the word
"expressiveness," we did not want to simply ask them to rate how expressive each clip was.
Instead, we generated three questions intended to directly capture important
aspects of expressiveness. Specifically, we asked: (1) How strong is the emotional
response of the person in this video clip to [the stimulus] compared to how
strongly a typical person would respond? (2) How much of any emotion does the
person show in this video clip? (3) How much does the person move any part
of their body/head/face in this video clip? Each question was answered using a
five-point ordered scale from 0 to 4 (see the appendix for details).</p>
      <p>
        To assess the inter-rater reliability of the ratings (i.e., their consistency across
raters), we calculated intraclass correlation coefficients (ICC) for each question
in each task and across all tasks. Because each video clip was rated by a
potentially different group of raters, and we ultimately analyzed the average of
all raters' responses (as described in the next subsection), the appropriate ICC
formulation is the one-way average score model [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. ICC coefficients at or above
0.75 are often considered evidence of "excellent" inter-rater reliability [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As
shown in Table 1, all the ICC estimates, and even the lower bounds of their
95% confidence intervals, exceeded this threshold. Thus, inter-rater reliability
was excellent.</p>
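      <p>As a rough illustration of the one-way average-score ICC computation described above, the following sketch uses the pingouin package (one library that implements the McGraw and Wong formulations); the long-format layout, column names, and toy values are assumptions for illustration, and the ICC1k row corresponds to the one-way average score model.</p>
      <preformat>
import pandas as pd
import pingouin as pg

# Long-format ratings: one row per (clip, rater) pair for a single question (toy values).
ratings = pd.DataFrame({
    "clip":  ["c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3"],
    "rater": ["r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"],
    "score": [3, 4, 3, 1, 0, 1, 2, 2, 3],
})

icc = pg.intraclass_corr(data=ratings, targets="clip", raters="rater", ratings="score")
print(icc[icc["Type"] == "ICC1k"][["ICC", "CI95%"]])  # one-way average score model
      </preformat>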
<p>Expressiveness Scores For each video clip, we wanted to summarize the
answers to each of the three questions asked as a single expressiveness score to
use as our target in machine learning and statistical analysis, as each question
captured an important aspect of expressiveness. Each of the six raters assigned
to each video clip provided three answers. The simplest approach to
aggregating these 18 scores would be to average them. However, this would assume that
all three questions are equally important to our construct of expressiveness and
equally well-measured. To avoid this assumption, we first calculated the average
answer to each question across all six raters and then used confirmatory factor
analysis (CFA) to estimate a latent variable that explains the variance shared
amongst the questions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>[Figure 2: Path diagram of the confirmatory factor analysis model: observed question variables x1, x2, x3 with residuals ε1, ε2, ε3 load (λ1, λ2, λ3) onto a latent expressiveness factor η with mean 0 and variance 1.]</p>
      <p>
        In Figure 2, the observed question variables are depicted as squares (x) and
the aforementioned latent variable is depicted as a circle (η) with zero mean and
unit variance. The factor loadings (λ) represent how much each question variable
was composed of shared variance, and the residuals (ε) represent how much each
question variable was composed of non-shared variance (including measurement
error). We fit this same CFA model for each task separately and across all tasks
using the lavaan package [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>The resulting estimates are provided in Table 2. Three patterns in the results
are notable. First, all the standardized loadings were higher than 0.85 (and
most were higher than 0.95), which suggests that there is a great deal of shared
variance between these questions and they are all measuring the same thing
(i.e., expressiveness). Second, there were some factor loading differences within
tasks, which suggests that there is value in aggregating the question responses
using CFA rather than averaging them. Third, there were some factor loading
differences between tasks, especially for the motion question, which suggests that
the relationship between motion and expressiveness depends upon context.</p>
      <p>
        Finally, we estimated each video clip's standing on the latent variable (i.e., as
a continuous, real-valued number) by extracting factor score estimates from the
CFA model; this was done using the Bartlett method, which produces unbiased
estimates of the true factor scores [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These estimates were then used as ground
truth expressiveness labels in our further analyses.
      </p>
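      <p>A minimal sketch of the Bartlett factor score computation is shown below; it assumes standardized loadings and residual variances have already been estimated by the CFA (the numeric values and the function name bartlett_scores are placeholders, not the fitted estimates from Table 2).</p>
      <preformat>
import numpy as np

# Placeholder CFA estimates (not the fitted values from Table 2):
# standardized loadings for the three questions and their residual variances.
loadings = np.array([[0.95], [0.97], [0.90]])        # Lambda, shape (3, 1)
residual_var = 1.0 - loadings.ravel() ** 2           # Psi diagonal for standardized items

def bartlett_scores(x, loadings, residual_var):
    """Bartlett factor scores: F = (L' Psi^-1 L)^-1 L' Psi^-1 (x - mean)."""
    psi_inv = np.diag(1.0 / residual_var)
    lhs = np.linalg.inv(loadings.T @ psi_inv @ loadings)
    weights = lhs @ loadings.T @ psi_inv              # shape (1, 3)
    centered = x - x.mean(axis=0)
    return (weights @ centered.T).ravel()

# x: per-clip averages of the three question ratings, shape (n_clips, 3)
x = np.array([[2.0, 1.5, 1.0], [3.5, 3.0, 2.5], [0.5, 0.5, 1.0]])
print(bartlett_scores(x, loadings, residual_var))
      </preformat>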
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
<p>We selected our models with our two primary goals in mind: we wanted to find
a model that would perform well in predicting expressiveness, and we wanted
at least one interpretable model so that we could understand the relationships
between the behavioral signals and the expressiveness scores. We experimented
with three primary architectures (ElasticNet, LSTM, and 3D-CNN) and
describe our approaches in greater detail below.</p>
      <p>
        ElasticNet We chose ElasticNet [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] as an approach because it is suitable for
both our goals of prediction and interpretation. ElasticNet is essentially linear
regression with regularization by a mixture of L1 and L2 priors. This
regularization mitigates the overfitting and multicollinearity common to
linear regression with many features and achieves robust generalizability.
However, ElasticNet is still fully interpretable: examination of the feature weights
provides insight into the relationships between features and labels.
      </p>
      <p>
        We engineered visual features from the raw video data to use as input for
our ElasticNet model. For each clip, we used the OpenFace [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] toolkit to extract
per-frame descriptors of gaze, head pose, and facial landmarks (e.g., eyebrows,
eyes, mouth), as well as estimates of the occurrence and intensity for a number
of action units from the Facial Action Coding System [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. To reduce the effects
of jitter, which may produce differences from frame to frame due simply to noise,
we downsampled our data to 5 Hz from the original 25 Hz.
      </p>
      <p>From this data, we computed frame-to-frame displacement (i.e., distance
travelled) and velocity (i.e., the derivative of displacement) for each facial
landmark. We also calculated frame-to-frame changes in gaze angle and head position
with regard to translation and scale ("head"); pitch; yaw; and roll. For each clip,
we used the averages over all frames of these quantities as our features. We also
counted the total number of action units and calculated the mean intensity of
action units occurring in the clip. We selected these features to represent both
amount and speed of facial, head, and gaze movement.</p>
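      <p>A rough sketch of the per-clip feature computation described above follows, assuming the OpenFace output for one clip has been loaded into a pandas DataFrame; the column-name conventions (x_*, y_* landmarks, AU*_c occurrence, AU*_r intensity), the helper name clip_features, and the exact aggregation choices are illustrative assumptions rather than the authors' exact pipeline.</p>
      <preformat>
import numpy as np
import pandas as pd

def clip_features(df, fps=25, target_hz=5):
    """Summarize one clip's OpenFace frames into per-clip features."""
    step = fps // target_hz                      # downsample 25 Hz to 5 Hz to reduce jitter
    df = df.iloc[::step].reset_index(drop=True)

    # Landmark displacement: frame-to-frame distance summed over all 2D landmark points.
    xs = df[[c for c in df.columns if c.startswith("x_")]].to_numpy()
    ys = df[[c for c in df.columns if c.startswith("y_")]].to_numpy()
    disp = np.sqrt(np.diff(xs, axis=0) ** 2 + np.diff(ys, axis=0) ** 2).sum(axis=1)
    velocity = np.diff(disp)                     # change in displacement between frames

    # Action unit summaries: number of AUs ever occurring and mean intensity of nonzero frames.
    au_occ = df[[c for c in df.columns if c.endswith("_c")]].to_numpy()
    au_int = df[[c for c in df.columns if c.endswith("_r")]].to_numpy()
    au_count = au_occ.max(axis=0).sum()
    au_mean_intensity = float(au_int.sum() / max(np.count_nonzero(au_int), 1))

    return {
        "points_displacement": disp.mean(),
        "points_velocity": velocity.mean(),
        "au_count": au_count,
        "au_mean_intensity": au_mean_intensity,
    }
      </preformat>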
      <p>
We used an out-of-the-box implementation of ElasticNet from sklearn and
tuned the hyperparameters by searching over α ∈ {0.01, 0.05, 0.1, 0.5, 1.0} for
the penalty term and over {0.0, 0.1, ..., 1.0} for the L1 ratio. For the
final models on the startle task, pain task, disgust task, and all tasks, α was
0.01, 0.1, 0.1, and 0.05, respectively; the L1 ratio was 0.0, 0.0, 0.7, and 0.7, respectively.
When the L1 ratio is 0.0, ElasticNet corresponds to Ridge regression, and when it is 1.0,
ElasticNet corresponds to Lasso regression [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
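      <p>The hyperparameter search described above could look roughly like the following with scikit-learn; the feature matrix X and label vector y are assumed to hold the per-clip features and expressiveness scores, and cross-validation stands in for the paper's fixed validation split, so this is a sketch rather than the exact setup.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

param_grid = {
    "elasticnet__alpha": [0.01, 0.05, 0.1, 0.5, 1.0],   # penalty strength
    "elasticnet__l1_ratio": np.arange(0.0, 1.01, 0.1),  # 0.0 = Ridge, 1.0 = Lasso
}
model = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))
search = GridSearchCV(model, param_grid, scoring="neg_root_mean_squared_error", cv=5)
# X: per-clip visual features, y: CFA-derived expressiveness scores
# search.fit(X, y); print(search.best_params_)
      </preformat>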
      <p>
        OpenFace-LSTM We also explored several deep learning approaches to
determine whether we could achieve better predictive performance by sacrificing some
interpretability. Due to the small size of the training dataset and the need to
capture the temporal component of the data, we proposed the use of a relatively
simple deep architecture suitable for modeling sequences of data, LSTM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
We implemented a stacked LSTM using the PyTorch framework and tuned over
learning rate, number of layers, and hidden dimension of each layer. In our final
implementation, we used learning rate 0.005 with 2 layers of hidden dimension
128. Rather than engineering summary features as we did for ElasticNet, we
used a tensor representation of the raw OpenFace facial landmark point
tracking descriptors for each clip as input for the LSTM. Because the LSTM is more
capable of handling high-dimensional data than a linear model, we retained the
original sample rate of 25 Hz to reduce loss of information. Each clip with 75
frames was represented as a [75 × 614] 2-dimensional tensor, where we
standardized each [75 × 1] feature by subtracting its mean and dividing by its standard
deviation.
      </p>
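      <p>A minimal sketch of a stacked LSTM regressor with the stated hyperparameters (2 layers, hidden dimension 128, learning rate 0.005) is given below; how the sequence outputs are pooled into a single prediction is an assumption, since only the architecture and hyperparameters are reported.</p>
      <preformat>
import torch
import torch.nn as nn

class OpenFaceLSTM(nn.Module):
    def __init__(self, input_dim=614, hidden_dim=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):                         # x: (batch, 75 frames, 614 features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)  # regress from the final time step

model = OpenFaceLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
loss_fn = nn.MSELoss()

# Example forward/backward pass on dummy standardized clip tensors.
clips = torch.randn(8, 75, 614)
scores = torch.randn(8)
loss = loss_fn(model(clips), scores)
loss.backward()
optimizer.step()
      </preformat>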
      <p>
        3D-CNN Although manual feature engineering can be useful for directing
models to use relevant visual characteristics to make their predictions, it can also
result in the loss of large amounts of information and furthermore has the
potential to introduce noise. Consequently, we also explored the predictive
performance of deep learning models that learn their own feature representations
from the raw video data. Drawing on past successes with similar architectures
in the related topic of emotion recognition, we selected as our model the 3D-CNN
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which is also capable of handling the temporal aspect of our data. Our
3D-CNN predicts expressiveness directly from a video clip. We modified the 18-layer
ResNet3D available through PyTorch's torchvision [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to perform prediction
of a continuous value rather than classification, while retaining the
hyperparameter values of the original implementation. We experimented both with training
the model from scratch on the BP4D+ extension dataset and with using the
BP4D+ extension only for fine-tuning of a 3D-CNN pretrained on the Kinetics
400 action recognition dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
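      <p>The torchvision adaptation could be sketched as follows: the classification layer of the 18-layer ResNet3D is swapped for a single-output regression head, and the pretrained flag controls whether the backbone starts from Kinetics-400 weights; the helper name and input sizes here are assumptions for illustration.</p>
      <preformat>
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

def build_expressiveness_r3d(pretrained=True):
    """18-layer ResNet3D with its classification layer replaced by a regression head."""
    model = r3d_18(pretrained=pretrained)          # pretrained=True loads Kinetics-400 weights
    model.fc = nn.Linear(model.fc.in_features, 1)  # predict a single expressiveness score
    return model

model = build_expressiveness_r3d(pretrained=True)
# Video clips are expected as (batch, channels, frames, height, width) tensors.
clip = torch.randn(2, 3, 16, 112, 112)
print(model(clip).shape)   # torch.Size([2, 1])
      </preformat>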
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this section, we describe the evaluation metrics, data partitions, and
baselines that we used to evaluate the performance of our models and to conduct
our analysis of the interpretable visual features relevant to expressiveness. Code
for our evaluation and analyses is available at https://osf.io/bp7df/?view_only=70e91114627742d7888fbdd36a314ee9.</p>
      <p>
        Evaluation Metrics and Dataset We selected RMSE and correlation of
model predictions with the ground truth expressiveness scores as the evaluation
metrics for our model performance. For ease of interpretability and comparison,
we report normalized RMSE [
        <xref ref-type="bibr" rid="ref16">16</xref>
], which we define as the RMSE divided by the
scale of the theoretical range of the expressiveness scores. The value of the
normalized RMSE ranges from 0 to 1, with 0 being the best performance and 1
being the worst performance.
      </p>
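      <p>As a small illustration, the normalized RMSE can be computed as below; the theoretical range of 7 is an assumption based on the −3.5 to 3.5 score range used for the baselines later in this section.</p>
      <preformat>
import numpy as np

def normalized_rmse(y_true, y_pred, score_range=7.0):
    """RMSE divided by the theoretical range of the expressiveness scores."""
    rmse = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return rmse / score_range

print(normalized_rmse([0.5, -1.0, 2.0], [0.0, -0.5, 1.5]))  # 0.5 / 7, about 0.071
      </preformat>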
      <p>
To determine whether differences in performance between models and
baselines were statistically significant, we used the cluster bootstrap [
        <xref ref-type="bibr" rid="ref21 ref7">7,21</xref>
        ] to generate
95% confidence intervals and p-values for the differences in RMSEs and
correlations between models. This approach does not make parametric assumptions
about the distribution of the difference scores and accounts for the hierarchical
dependency of video clips within subjects (software to conduct this procedure is
available at https://github.com/jmgirard/mlboot).
      </p>
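      <p>A rough sketch of a cluster (subject-level) bootstrap for the difference in a metric between two models is shown below; resampling whole subjects keeps all of a subject's clips together, and the percentile interval is one common choice (assumed here) for forming the 95% confidence interval.</p>
      <preformat>
import numpy as np

def cluster_bootstrap_diff(metric, y_true, pred_a, pred_b, subject_ids, n_boot=2000, seed=0):
    """Bootstrap the metric difference (model A minus model B), resampling whole subjects."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    diffs = []
    for _ in range(n_boot):
        sample = rng.choice(subjects, size=len(subjects), replace=True)
        idx = np.concatenate([np.where(subject_ids == s)[0] for s in sample])
        diffs.append(metric(y_true[idx], pred_a[idx]) - metric(y_true[idx], pred_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return np.mean(diffs), (lo, hi)

# Example metric: Pearson correlation between predictions and ground truth (arrays assumed).
corr = lambda y, p: np.corrcoef(y, p)[0, 1]
      </preformat>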
<p>Because we suspected that expressiveness might manifest differently in different
emotions, we wanted to see whether training separate models for each
emotion elicitation task would produce better predictive performance than training
a single model over all tasks. Furthermore, fitting separate ElasticNet models
for each task would allow us to understand whether the feature set relevant to
expressiveness is different depending on the emotional context, which would test
our hypothesis. Therefore, we separated the BP4D+ dataset by task and created
60/20/20 train/validation/test splits for each of these task-specific datasets and
a separate split in the same proportions over the entire dataset. This partitioning
was done such that no subject appeared in multiple splits. For each model, we
report results from training and evaluating on each task-speci c dataset and on
the entire dataset.</p>
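      <p>One way to build subject-disjoint 60/20/20 partitions is sketched below using scikit-learn's GroupShuffleSplit; the two-stage split (first carving off the training set, then dividing the remainder in half) is an assumption about how the proportions could be implemented, not necessarily the authors' exact procedure.</p>
      <preformat>
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def subject_disjoint_split(indices, subject_ids, seed=0):
    """Return train/val/test index arrays with no subject shared across splits."""
    outer = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=seed)
    train_idx, rest_idx = next(outer.split(indices, groups=subject_ids))
    inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=subject_ids[rest_idx]))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]

# Example: 100 clips from 20 subjects, 5 clips each.
indices = np.arange(100)
subject_ids = np.repeat(np.arange(20), 5)
train, val, test = subject_disjoint_split(indices, subject_ids)
      </preformat>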
<p>Baselines We defined several baselines against which to compare our models'
performance:</p>
      <p>Uniform baseline: This baseline samples randomly from a uniform
distribution over the theoretical range of the expressiveness scores (i.e., −3.5 to
3.5).</p>
<p>Normal baseline: This baseline samples randomly from a normal
distribution with mean and variance equal to the theoretical mean and
variance of the expressiveness scores (i.e., mean 0 and variance 1).
Human baseline: This baseline represents the performance of a single
randomly selected human crowdworker. We calculated an estimated factor score
for each rater by weighting their answers to each question by that question's
factor loading and summing the weighted values. These weighted sums were
then standardized and compared to the average of the remaining 5 raters'
estimated factor scores to assess each rater's solitary performance. Finally,
these performance scores were averaged over all crowdworkers to capture the
performance of a randomly selected crowdworker.</p>
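      <p>A sketch of the human-baseline computation described above: each rater's answers are combined with the factor loadings into a per-rater score, standardized, and correlated with the mean of the remaining raters' scores; the array layout and the choice of correlation as the metric shown here are illustrative assumptions.</p>
      <preformat>
import numpy as np

def human_baseline_correlation(ratings, loadings):
    """ratings: (n_clips, n_raters, 3) answers; loadings: (3,) factor loadings."""
    n_clips, n_raters, _ = ratings.shape
    rater_scores = ratings @ loadings                                  # weighted sum per clip, per rater
    rater_scores = (rater_scores - rater_scores.mean(axis=0)) / rater_scores.std(axis=0)
    corrs = []
    for r in range(n_raters):
        others = np.delete(rater_scores, r, axis=1).mean(axis=1)       # average of remaining raters
        corrs.append(np.corrcoef(rater_scores[:, r], others)[0, 1])
    return float(np.mean(corrs))                                       # performance of a random rater
      </preformat>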
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
<p>In the following subsections, we present the results of our experiments, first
comparing our model approaches and baselines and then visualizing and interpreting
the feature weights of the ElasticNet model.</p>
      <p>Despite achieving NRMSEs well below those of the normal baseline
(Figure 3, Table 3), the proposed deep learning models had relatively poor performance in
most tasks according to the correlation metric. For example, OpenFace-LSTM
attained a reasonable correlation compared to the human baseline on the
startle and disgust tasks but produced essentially no correlation with the ground
truth on the pain task. Likewise, pretrained 3D-CNN and 3D-CNN trained from
scratch yielded little and no correlation, respectively, of their predictions with
the ground truth. We suspect that such results may be the product of the small
dataset on which the models were trained, as the data quantity may be
insufficient to allow the models to generalize and learn the appropriate predictive
signals from complex data without human intervention.</p>
      <p>As such, of the proposed models, we consider ElasticNet to demonstrate the
best performance overall. Its NRMSEs were consistently lower than those of the
other proposed models, and its correlations were much higher than those of any
other proposed model and come close to (and in the case of the disgust task,
slightly exceed) those of the human baseline. Statistical analyses of the
differences in performance between ElasticNet and all other models and baselines, the
results of which are shown in Table 4, support our intuition. Specifically, when
trained across all tasks, ElasticNet attains significantly lower NRMSE and
significantly higher correlation of its predictions with the ground truth compared
to all other models and baselines except the human baseline. However, the same
comparison also shows that ElasticNet has significantly higher NRMSE and
significantly lower correlation of its predictions with the ground truth compared to
the human baseline, indicating that there is still room for improvement.
Understanding Signals of Expressiveness Because our best-performing
model, ElasticNet, is an interpretable linear model, we were able to determine
the relationship between the visual features in our dataset and overall
expressiveness by examining the feature weights of the model trained over all tasks.
Furthermore, by doing the same for the feature weights of models trained over
individual tasks, we were able to explore the hypothesis that the set of signals
indicative of expressiveness varies from context to context. These visualizations
are shown in Figure 4. We directly interpret those features with a standardized
weight close to or greater than 0.2 in absolute value.</p>
      <p>[Figure 4: Standardized ElasticNet feature weights for the startle, pain, and disgust tasks and across all tasks. Features shown: action unit count, action unit intensity, points displacement, points velocity, head displacement, head velocity, pitch displacement, pitch velocity, yaw displacement, yaw velocity, roll displacement, and roll velocity.]</p>
      <p>From the weights of the model trained over all tasks, we can see that three
primary features contribute to predicting overall expressiveness: action unit count,
action unit intensity, and points displacement (i.e., the distance traveled by all
facial landmark points). This suggests that there are some behavioral signals that
index expressiveness across emotional contexts, and these are generally related
to the amount and intensity of facial motion. Notably, features related to head
motion and the velocity of motion did not have high feature weights for overall
expressiveness.</p>
      <p>We also observe that each individual task had its own unique set of features
that were important to predicting expressiveness within that context. These
features make intuitive sense when considering the nature of the tasks and are
consistent with the psychological literature we reviewed.</p>
      <p>In the startle task, higher expressiveness was associated with more points
displacement, higher points velocity, less head displacement, and higher action unit
count. These features are consistent with components of the hypothesized
startle response, including blinking, hunching the shoulders, grimacing, and baring
the teeth. The negative weight for head displacement was somewhat surprising,
but we think this observation may be related to subjects freezing in response to
being startled.</p>
      <p>In the pain task, higher expressiveness was associated with higher action
unit count, higher action unit intensity, and less points velocity. These features
are consistent with components of the hypothesized pain response, including
grimacing, frowning, wincing, and eye closure. Although the existing literature
hypothesizes that body motion increases in response to pain, we found that
points velocity has a negative weight. However, we think this finding may be
related to increased muscle tension and/or the nature of this specific pain
elicitation task (e.g., decreased velocity may be related to the regulation of pain in
particular).</p>
      <p>Finally, in the disgust task, higher expressiveness was associated with higher
action unit count, higher action unit intensity, higher points displacement, and
higher head displacement. These features are consistent with components of
the hypothesized disgust response, including furrowed brows, eye closure, nose
wrinkling, upper lip retraction, upward movement of the lower lip and chin, and
drawing the corners of the mouth down and back. We believe that the observed
head displacement weight may be related to subjects recoiling from the source of
the unpleasant odor, which would produce changes in head scale and translation.
</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
<p>In this paper, we define expressiveness as the extent to which an individual
shows his or her feelings, thoughts, or reactions in a given moment. Following
this definition, we present a dataset that can be used to model or analyze
expressiveness in different emotional contexts using human labels of attributes relevant
to visual expressiveness. We propose and test a series of deep learning and
statistical models to predict expressiveness from visual data; we also use the latter
to understand the relationship between interpretable visual features derived from
OpenFace and expressiveness. We find that training models for specific emotional
contexts results in better predictive performance than training across contexts.
We also find support for our hypothesis that expressiveness is associated with
unique features in each context, although several features are also important
across all contexts (e.g., the amount and intensity of facial movement). Future
work would benefit from attending to the similarities and differences in signals of
expressiveness across emotional contexts to construct a more robust predictive
model.
</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This material is based upon work partially supported by the National Science
Foundation (Awards #1734868 and #1722822) and the National Institutes of Health.
We are grateful to Jeffrey F. Cohn and Lijun Yin for the use of the BP4D+
dataset. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the authors and do not necessarily reflect the views
of the National Science Foundation or the National Institutes of Health, and no official
endorsement should be inferred.</p>
    </sec>
    <sec id="sec-9">
      <title>Appendix</title>
      <p>Amazon Mechanical Turk Questions Four questions were proposed to
capture aspects of expressiveness:
1. How strong is the emotional response of the person in this video to [the
stimulus] compared to how strongly a typical person would respond?
2. How much of any emotion does the person show in this video clip?
3. How much does the person move any part of their body/head/face in this
video clip?
4. How much does any part of the person's face become or stay tense in this
video clip?
Amazon Mechanical Turk Ratings For the first question, the Likert scale
was anchored for raters as follows:
0 - No emotional response / Nothing to respond to
1 - Weak response
2 - Typical strength response
3 - Strong response
4 - Extreme response
For the remaining questions, the Likert scale was anchored:
0 - A little / None
1
2 - Some
3
4 - A lot
Video Segmentation For each task, the following segments were sampled
from each full subject/task video combination. Timestamps are in SS (seconds)
format. The notation -SS refers to a timestamp SS seconds from the end of the
video. Frames do not overlap between segments (that is, the last frame of a
segment ending at 03 is the frame prior to the first frame of a segment starting
at 03).</p>
      <p>
        Sadness: [00, 03], [03, 06], [30, 33], [33, 36], [-12, -09], [-09, -06], [-06, -03], [-03, -00]
Startle: [03, 06], [06, 09], [09, 12], [12, 15], [15, 18]
Fear: [00, 03], [03, 06], [06, 09], [09, 12], [12, 15], [15, 18], [18, 21]
Pain: [00, 03], [03, 06], [06, 09], [-12, -09], [-09, -06], [-06, -03], [-03, -00]
Disgust: [03, 06], [06, 09], [09, 12], [12, 15]
      </p>
      <p>Pilot Studies on Human Rating Reliability To determine which tasks and
questions could be annotated with adequate inter-rater reliability, we conducted
a pilot study with 3 crowdworkers rating the video clips from 5 subjects on 4
questions. The results of this study are provided in Table 5. The ICC scores for
the sadness and fear tasks looked poor overall, and these tasks were excluded.
The ICC scores looked good for the disgust task, and we thought that increasing
the number of raters to 6 might increase the reliability of the startle and pain
tasks to adequate levels. The results of a follow-up study with 6 crowdworkers
are provided in Table 6. The ICC scores indicate that the first three questions
could be annotated with adequate reliability, but the fourth question had poor
reliability and was excluded. As such, the final study included 6 raters of the
startle, pain, and disgust tasks with the first three questions only.
</p>
      <p>[Table residue: ICC estimates with 95% confidence intervals for questions 1-4: 0.632 [0.290, 0.825], 0.616 [0.248, 0.821], 0.749 [0.515, 0.861], 0.280 [-0.391, 0.658]; and 0.197 [-0.403, 0.567], 0.391 [-0.168, 0.639], 0.368 [-0.103, 0.659], 0.086 [-0.596, 0.507].]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baltrusaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>Openface 2.0: Facial behavior analysis toolkit</article-title>
          .
          <source>In: 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG</source>
          <year>2018</year>
          ). pp.
          <volume>59</volume>
          {
          <fpage>66</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Byeon</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>K.C.</given-names>
          </string-name>
          :
          <article-title>Facial expression recognition using 3d convolutional neural network</article-title>
          .
          <source>International journal of advanced computer science and applications</source>
          <volume>5</volume>
          (
          <issue>12</issue>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cicchetti</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          :
          <article-title>Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology</article-title>
          .
          <source>Psychological Assessment</source>
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <volume>284</volume>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>DiStefano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mindrila</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Understanding and using factor scores: Considerations for the applied researcher</article-title>
          .
          <source>Practical assessment, research &amp; evaluation</source>
          <volume>14</volume>
          (
          <issue>20</issue>
          ),
          <volume>1</volume>
          {
          <fpage>11</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Ebrahimi</given-names>
            <surname>Kahou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            , Michalski, V.,
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Memisevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Recurrent neural networks for emotion recognition in video</article-title>
          .
          <source>In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction</source>
          . pp.
          <volume>467</volume>
          {
          <fpage>474</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ekman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friesen</surname>
            ,
            <given-names>W.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hager</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Facial Action Coding System: A Technique for the Measurement of Facial Movement</article-title>
          . Research Nexus (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Field</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welsh</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Bootstrapping clustered data</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          <volume>69</volume>
          (
          <issue>3</issue>
          ),
          <volume>369</volume>
          {
          <fpage>390</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fleeson</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Toward a structure-and process-integrated view of personality: Traits as density distributions of states</article-title>
          .
          <source>Journal of personality and social psychology 80(6)</source>
          ,
          <volume>1011</volume>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Girard</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahoor</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mavadati</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammal</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenwald</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          :
          <article-title>Nonverbal social withdrawal in depression: Evidence from manual and automatic analyses</article-title>
          .
          <source>Image and vision computing</source>
          <volume>32</volume>
          (
          <issue>10</issue>
          ),
          <volume>641</volume>
          {
          <fpage>647</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hamm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohler</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gur</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verma</surname>
          </string-name>
          , R.:
          <article-title>Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders</article-title>
          .
          <source>Journal of neuroscience methods 200</source>
          (
          <issue>2</issue>
          ),
          <volume>237</volume>
          {
          <fpage>256</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <volume>1735</volume>
          {
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>3d convolutional neural networks for human action recognition</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>35</volume>
          (
          <issue>1</issue>
          ),
          <volume>221</volume>
          {
          <fpage>231</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carreira</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hillier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vijayanarasimhan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Back</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natsev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>The kinetics human action video dataset</article-title>
          .
          <source>arXiv preprint arXiv:1705.06950</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kline</surname>
          </string-name>
          , R.:
          <article-title>Principles and Practice of Structural Equation Modeling</article-title>
          . Guilford Press, 4th edn. (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kunz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meixner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lautenbacher</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Facial muscle movements encoding pain|a systematic review</article-title>
          .
          <source>Pain</source>
          <volume>160</volume>
          (
          <issue>3</issue>
          ),
          <volume>535</volume>
          {
          <fpage>549</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phung</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karmakar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shilton</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yearwood</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>T.B.</given-names>
          </string-name>
          , et al.:
          <article-title>Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view</article-title>
          .
          <source>Journal of medical Internet research</source>
          <volume>18</volume>
          (
          <issue>12</issue>
          ),
          <year>e323</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>McGraw</surname>
            ,
            <given-names>K.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          :
          <article-title>Forming inferences about some intraclass correlation coefficients</article-title>
          .
          <source>Psychological Methods</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>30</fpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. National Institute of Mental Health:
          <article-title>Bipolar disorder</article-title>
          (
          <year>2016</year>
          ), https://www.nimh.nih.gov/health/topics/bipolar-disorder/index.shtml
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>H.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vonikakis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Deep learning for emotion recognition on small datasets using transfer learning</article-title>
          .
          <source>In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction</source>
          . pp.
          <fpage>443</fpage>
          -
          <lpage>449</lpage>
          . ACM
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Olatunji</surname>
            ,
            <given-names>B.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sawchuk</surname>
            ,
            <given-names>C.N.</given-names>
          </string-name>
          :
          <article-title>Disgust: Characteristic features, social manifestations, and clinical implications</article-title>
          .
          <source>Journal of Social and Clinical Psychology</source>
          <volume>24</volume>
          (
          <issue>7</issue>
          ),
          <fpage>932</fpage>
          -
          <lpage>962</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aminzadeh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Nonparametric bootstrapping for hierarchical data</article-title>
          .
          <source>Journal of Applied Statistics</source>
          <volume>37</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1487</fpage>
          -
          <lpage>1498</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Rosseel</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>lavaan: An R package for structural equation modeling</article-title>
          .
          <source>Journal of Statistical Software</source>
          <volume>48</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Sillar</surname>
            ,
            <given-names>K.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Picton</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heitler</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          :
          <article-title>The mammalian startle response</article-title>
          .
          <source>In: The Neuroethology of Predation and Escape</source>
          . John Wiley &amp; Sons, Ltd (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torresani</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paluri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A closer look at spatiotemporal convolutions for action recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>6450</fpage>
          -
          <lpage>6459</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tybur</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lieberman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurzban</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeScioli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Disgust: Evolved function and structure</article-title>
          .
          <source>Psychological Review</source>
          <volume>120</volume>
          (
          <issue>1</issue>
          ),
          <fpage>65</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Image based static facial expression recognition with multiple deep network learning</article-title>
          .
          <source>In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction</source>
          . pp.
          <fpage>435</fpage>
          -
          <lpage>442</lpage>
          . ACM
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girard</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciftci</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canavan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reale</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horowitz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , et al.:
          <article-title>Multimodal spontaneous emotion corpus for human behavior analysis</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>3438</fpage>
          -
          <lpage>3446</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Regularization and variable selection via the elastic net</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>
          <volume>67</volume>
          (
          <issue>2</issue>
          ),
          <fpage>301</fpage>
          -
          <lpage>320</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>0.517 [0</source>
          .
          <issue>156</issue>
          ,
          <issue>0</issue>
          .739]
          <article-title>Table 6. Intraclass correlation (ICC) 0</article-title>
          .
          <fpage>311</fpage>
          [{0.
          <issue>203</issue>
          ,
          <issue>0</issue>
          .628]
          <article-title>among Amazon Turk raters (n = 6 raters 0</article-title>
          .
          <source>562 [0</source>
          .
          <issue>235</issue>
          ,
          <issue>0</issue>
          .764]
          <article-title>per question) in 5-subject pilot studies</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>