         A Two-stage Multi-modal Affect Analysis Framework for Children with
                             Autism Spectrum Disorder

                                       Jicheng Li, Anjana Bhat, Roghayeh Barmaki
                                                            University of Delaware
                                                       {lijichen, abhat, rlb}@udel.edu



                            Abstract
   Autism spectrum disorder (ASD) is a developmental disorder that affects a person's communication and social behavior: those on the spectrum have difficulty perceiving other people's facial expressions, as well as presenting and communicating emotions and affect through their own faces and bodies. Some efforts have been made to predict and improve the affect states of children with ASD in play therapy, a common method for improving children's social skills through play and games. However, many previous works only applied models pre-trained on benchmark emotion datasets and failed to consider the distinction in emotional expression between typically developing children and children with autism. In this paper, we present an open-source two-stage multi-modal approach that leverages acoustic and visual cues to predict three main affect states of children with ASD (positive, negative, and neutral) in real-world play therapy scenarios, achieving an overall accuracy of 72.40%. This work presents a novel way to combine human expertise and machine intelligence for ASD affect recognition by proposing a two-stage schema.

Figure 1: The workflow of our proposed affect prediction framework atop the ASD-Affect dataset. Distinct emotional habits of children with ASD inspire the two-stage schema, as they tend to scream and shout more in negative affect states and smile more in positive ones. Given a video input, we first classify negative data via speech emotion recognition (SER). Non-negatives are passed to the second stage, where the model decides whether the affect is positive or neutral based on facial expressions.

Copyright (c) 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                                             1    Introduction

Autism is the fastest-growing developmental disorder in the United States: approximately 1 in 54 children is on the autism spectrum (Baio et al. 2018). Individuals with ASD are characterized by significant social communication impairments such as inefficient use of social gaze, gestures, and verbal communication (National Institute of Health 2018). Thus, individuals on the spectrum have difficulty perceiving and presenting communication cues such as emotion. Previous research has shown that play therapy can improve children's social and emotional skills and help them perceive their internal emotional world better (Chethik 2003). Video recordings of play therapy interventions provide a rich source for analyzing children's emotion or affect states during treatment sessions.

   In this paper, we present a two-stage affect prediction method using video data (see Figure 1). We use a subset of the ASD-affect dataset from (Kaur and Bhat 2019), which includes more than four different therapeutic games for children to play. Sample settings of ASD-affect are shown in Figure 2.

   Emotion recognition is the process of identifying human emotion from multiple cues, including facial and spoken expressions as well as physiological and biological signals. Facilitated by machine learning, computer vision, and speech and signal processing techniques, the process of emotion recognition can be automated. Researchers have shown that messages pertaining to feelings, affects, and attitudes in interpersonal communication reside significantly in facial expressions and speech (Mehrabian et al. 1971; Dhall et al. 2012). Inspired by that, in this paper, we define the problem as automating emotion recognition for children with ASD using multi-modal inputs, especially visual and audio signals.
Figure 2: Sample data of ASD-affect. ASD-affect features rich diversity in subjects and scenarios. Children with ASD were conducting various activities under therapists' instructions, including singing, dancing, drumming, yoga, and interaction with robots. The dataset consists of more than 20 hours of therapy recordings.

   However, there are several challenges in the process of automatic emotion recognition on the ASD-affect dataset:

   Insufficient public datasets: having adequate labeled training data that include as many variations of the populations and environments as possible is important for the design of a deep expression recognition system. However, due to privacy concerns, ASD datasets, especially those involving children, are very scarce. ASD-related multi-modal datasets that record children's behaviors in play therapy are even more sparse.

   Domain shifts: existing methods (Doyran et al. 2019) have directly applied models pre-trained on typically developing people to play therapy analysis, either explicitly or implicitly under the assumption that emotional traits such as facial expressions are the same or similar for typically developing people and children with ASD. We argue that such simplifications are not always appropriate, given that children with ASD experience affect and communication disorders and cannot always express their emotions appropriately, especially children diagnosed at level three (Weitlauf et al. 2014).

   Data noise: many existing benchmarks for emotion recognition (Lucey et al. 2010; Valstar and Pantic 2010; Burkhardt et al. 2005) are posed and collected in controlled laboratory settings. In contrast, our dataset was collected in an in-the-wild manner featuring various backgrounds, people, activities, and durations. Therefore, ASD-affect contains a great deal of noise, which requires substantial data cleaning and post-processing.

   Sparse labeling: unlike other multi-modal benchmark datasets (Dhall et al. 2012; Nojavanasghari et al. 2016), where the duration of data samples is on the scale of seconds (usually less than 5 seconds), examples in the ASD-affect dataset may last for minutes, which is equivalent to lacking ground truth or introducing excessive noise into the dataset. This is because benchmark datasets were intentionally collected and labeled for autonomous recognition by machines, whereas ASD-affect was initially compiled and annotated to serve human experts.

   Despite all these challenges, we used transfer learning, fine-tuning, and data post-processing - listed in the following sections - to prepare ASD-affect for further analysis using speech and facial emotion recognition methods.

   In this paper, we propose a two-stage framework to evaluate affect states of children in play therapy scenarios using multi-modal emotion cues. This method effectively combines prior knowledge from human experts with machine intelligence. To distinguish children among three different affect states - neutral, positive, and negative - in stage 1 the model predicts whether children are in a negative state based on negative symptoms (shouting and screaming) residing in speech. In stage 2, children's emotions in positive and neutral states are recognized by their distinct facial expressions. The workflow of our framework is presented in Figure 1. Our approach enables physical therapists to analyze the effectiveness of play therapy interventions better and more efficiently, since human professionals require a fair amount of training to understand the behaviors and emotional states of children with ASD. This method can be further applied to data annotation and label verification for other ASD datasets, as the actions of children with ASD resemble one another relatively well.

   This paper is organized as follows. We first summarize related work in emotion recognition and play therapy analysis in Section 2. Section 3 describes the proposed method, followed by the experiment and results discussion presented in Sections 4, 5, and 6. Lastly, Section 7 outlines the conclusion and future steps for this research.

                                            2    Related Work

Multi-modal Emotion Recognition
Emotions are convoluted psychological states composed of several components: personal experience and physiological, behavioral, and communicative reactions. There are two mainstream emotion representations: the discrete model (Ekman 1994) and the dimensional model. In this paper, we use the discrete emotion model.

   Emotions can be carried in various modalities of input. Mehrabian shows that 55% of the messages pertaining to feelings and attitudes in interpersonal communication is conveyed in facial expressions (Mehrabian et al. 1971). Besides, Dhall suggests that the audio modality can bring extra gains in emotion recognition accuracy (Dhall et al. 2012). Thus, multi-modal emotion recognition approaches usually outperform unimodal ones. Two main sub-fields of multi-modal emotion recognition are facial expression recognition (FER) and speech emotion recognition (SER), which are also the main focus of this work.

Facial Expression Recognition   FER systems can be divided into two main categories based on their feature representations: static and dynamic. In static-based methods, the feature representation is encoded with only spatial information from a single image frame. In contrast, dynamic-based approaches consider temporal relations among contiguous frames in the input facial expression sequence (Li and Deng 2018). Li proposed a bi-modality method (Li et al. 2019),
where convolutional neural networks (CNNs) were used to recognize static facial expressions, while a bi-directional long short-term memory (Bi-LSTM) network was employed to learn dynamic facial expression sequences extracted by the CNNs. Liu also embedded facial landmarks in an FER system (Liu et al. 2018). However, these works were conducted on benchmark datasets (Dhall et al. 2012) where the sequential relation of images is well preserved, so sequential methods are able to function. Conversely, our ASD dataset was recorded in natural, in-the-wild settings, so we could only use a static-based method to classify facial expressions in each frame, without considering temporal information.

Speech Emotion Recognition   Speech is a rich, dense form of communication that can convey information effectively. There are two classical ways to extract emotional features from speech. The first is to obtain low-level discriminative features of speech, such as Mel-frequency cepstral coefficients (Yeh, Lin, and Lee 2019; Yoon et al. 2020). Another way is to convert audio to spectrograms and then use CNNs as feature extractors (Zhang et al. 2018; Zhao, Mao, and Chen 2019). In this paper, we use spectrograms as audio representations.

Play Therapy Analysis
Play therapy is an approach to psychotherapy in which a child engages in play activities. Doyran and colleagues (Doyran et al. 2019) proposed a visual and text-based framework to track the affective state of a child during a play therapy session. However, the audio modality was less explored in their work, and categorical representations of facial expressions needed more investigation. Bangerter investigated the spontaneous production of facial expressions by individuals with ASD in response to entertaining videos (Bangerter et al. 2020). It turned out that individuals with ASD showed less evidence of facial action units relating to positive facial expressions than typically developing children. Due to the small face sizes and low resolution of the ASD-affect dataset, using a facial action unit approach in the current work was not feasible, but we are looking into it in future work.

                                               3    Method

In this paper, we propose an open-source two-stage multi-modal framework to predict children's affect states in play therapy leveraging visual and audio information (the source code is available to download at GitHub: https://github.com/Li-Jicheng/Autism-Affect-and-Emotion-Recognition). First, we distinguished negative videos from non-negative ones (neutral and positive) using spectrograms generated from audio. Next, to differentiate between positive and neutral videos, we used static-based facial expression recognition methods. The workflow of this method is illustrated in Figure 1.

Two-stage Schema
Our data assessment on ASD-affect inspires the two-stage approach. Children who participate negatively and passively in play interventions tend to shout and scream more often, and this characteristic is manifest in speech. However, there are no significant differences in speech emotion between neutral and positive recordings. Instead, children smile when positively engaged in therapy, while their facial expressions more often remain neutral in neutral states. Therefore, we chose to leverage the variance in facial expressions to distinguish between positive and neutral data.

Stage 1: Negative vs Non-negative
Since distinct speech emotions exhibit different patterns in the energy spectrum, to capture emotion features from speech we selected log-Mel spectrograms, which have been effective in speech emotion recognition tasks in the past (Zhao, Mao, and Chen 2019; Zhang et al. 2018; Chen et al. 2018). A spectrogram is a visual representation of the spectrum of a signal's frequencies as it varies with time. It is a graph with two geometric dimensions: time and frequency. The amplitude of a particular frequency at a particular time is represented by the intensity or color of each pixel in the spectrogram. A Mel-spectrogram is a spectrogram in which the frequencies are converted to the Mel scale, a perceptual scale of pitches judged by listeners to be equal in distance from one another (Stevens, Volkmann, and Newman 1937). We used the logarithmic form of the Mel-spectrogram to better reflect emotions, since humans perceive sound on a logarithmic scale (Venkataramanan and Rajamohan 2019).

Stage 2: Neutral vs Positive
As noted earlier, due to image resolution constraints, temporal information was not well preserved: adjacent frames were discarded frequently in the data cleaning stage, causing sequential models to fail to converge. Therefore, we needed static-based methods that depend solely on one frame to predict facial expression. We chose ResNet-18 (He et al. 2016) with a decreased input size to better fit the average face sizes detected in ASD-affect. We pre-trained the model using EmoReact (Nojavanasghari et al. 2016), a multi-modal emotion dataset of children, and fine-tuned it on the ASD-affect dataset.
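   A minimal PyTorch sketch of such a stage-2 backbone, assuming torchvision's ResNet-18; the checkpoint file name, the 44 x 44 crop size, and the single fine-tuning step shown here are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stage-2 classifier: ResNet-18 with a two-way head (neutral vs. positive).
model = resnet18(num_classes=2)

# Hypothetical checkpoint produced by pre-training on EmoReact (path is a placeholder):
# state = torch.load("emoreact_pretrained.pth", map_location="cpu")
# model.load_state_dict(state, strict=False)   # tolerate the new 2-class head

# Fine-tune on ASD-affect face crops; small inputs work because torchvision's
# ResNet ends with adaptive average pooling.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

faces = torch.randn(8, 3, 44, 44)      # placeholder batch of face crops
labels = torch.randint(0, 2, (8,))     # 0 = neutral, 1 = positive
loss = criterion(model(faces), labels)
loss.backward()
optimizer.step()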
                                              4    Experiment

Data Processing

ASD-affect Dataset   Bhat and colleagues proposed that the use of embodied, multisystem interventions can address various social communication, perceptuo-motor, and cognitive-behavioral impairments of children with ASD (Kaur and Bhat 2019). They have studied the effects of various embodied creative interventions, including robotic, musical, physical activity, yoga, and dance therapy interventions for children with ASD. The video recordings of such interventions, known as the ASD-affect dataset, provide a rich source for analyzing children's affect states in play therapy. In this paper, we used a subset of ASD-affect covering six children. Sample data of ASD-affect are shown in Figure 2.

Data Reconstruction   Originally, there were eight different types of labels in ASD-affect: neutral, interested, positive, positive and talking, odd positive, runs away, camera difficulties, and negative. For our work, we reconstructed the dataset by excluding some labels (runs away, camera difficulties, and odd positive) and merging others (interested, positive, and positive and talking were all considered positive). After this reconstruction step, we had a total of 471 clips from six children in three classes: positive (68 clips), neutral (384 clips), and negative (19 clips). Clip lengths varied; see Figure 3 for the reconstructed data distribution.

Figure 3: A comparison of ASD-affect and the benchmark dataset AFEW (Dhall et al. 2012). Left: sample distribution of reconstructed AFEW, where we relabeled fear, disgust, angry, and sad as negative. Middle: data count distribution of ASD-affect. Right: data duration percentage of ASD-affect.
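   A small sketch of this label reconstruction, assuming each clip's annotation is available as a string; the mapping mirrors the merging and exclusion rules described above, and the helper function name is ours.

# Map the original eight ASD-affect labels onto three classes; None = excluded.
LABEL_MAP = {
    "neutral": "neutral",
    "interested": "positive",
    "positive": "positive",
    "positive and talking": "positive",
    "negative": "negative",
    "odd positive": None,
    "runs away": None,
    "camera difficulties": None,
}

def reconstruct(clips):
    """Keep only clips whose original label maps onto one of the three classes."""
    kept = []
    for clip_id, label in clips:
        mapped = LABEL_MAP.get(label)
        if mapped is not None:
            kept.append((clip_id, mapped))
    return kept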
Log-Mel Spectrograms   We first extracted audio tracks from the video recordings. Audio files were stored in Waveform Audio File Format to retain high fidelity. We then applied noise reduction to the audio files and removed silent utterances. Afterwards, we split each audio file into equal-length segments of 3 seconds, and zero-padding was applied to utterances shorter than 3 seconds. We chose this sequence length because the average audio duration in selected benchmark datasets for SER is 3 seconds (Burkhardt et al. 2005; Livingstone and Russo 2018). After that, log-Mel spectrograms were generated from each audio segment using the librosa toolkit (McFee et al. 2015). We set the Fast Fourier Transform (FFT) window length and hop length to 2,048 and 512, respectively, and used 64 Mel bands in the spectrogram generation. A total of 9,968 log-Mel spectrograms were generated, including 134 negative samples and 9,834 non-negative samples. We then downsampled the non-negative spectrograms to even out the data and reduce class imbalance. In both the training and testing phases, all log-Mels were normalized by the global mean and standard deviation of the training set. All spectrograms were resized to 224 × 224 to match the network's input size.
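   A minimal sketch of this feature extraction, assuming librosa and OpenCV are available; the 3-second segmentation, 2,048-sample FFT window, hop length of 512, and 64 Mel bands follow the settings above, while the file path and the normalization statistics are placeholders.

import librosa
import numpy as np
import cv2

SEG_SECONDS, N_FFT, HOP, N_MELS = 3, 2048, 512, 64

def logmel_segments(wav_path, target_size=224):
    """Split a wav file into 3-second chunks and return resized log-Mel spectrograms."""
    y, sr = librosa.load(wav_path, sr=None)        # keep the native sampling rate
    seg_len = SEG_SECONDS * sr
    specs = []
    for start in range(0, len(y), seg_len):
        chunk = y[start:start + seg_len]
        if len(chunk) < seg_len:                   # zero-pad short utterances
            chunk = np.pad(chunk, (0, seg_len - len(chunk)))
        mel = librosa.feature.melspectrogram(
            y=chunk, sr=sr, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
        logmel = librosa.power_to_db(mel)          # logarithmic (dB) Mel spectrogram
        specs.append(cv2.resize(logmel.astype(np.float32), (target_size, target_size)))
    return specs

# Normalization with training-set statistics (placeholders):
# specs = [(s - train_mean) / train_std for s in specs]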
Facial Images   We first extracted image frames from the raw video clips at a specific sampling rate. Considering that neutral clips were typically longer than positive ones in ASD-affect, we set the sampling rate to 3 frames per second (FPS) for positive video clips and 1 FPS for neutral ones to stratify the data proportionally. We then used MTCNN (Zhang et al. 2016) to detect human faces in each frame. We selected 1,756 template faces of children (about 2% of the total detected faces) to create a facial expression database for ASD-affect, consisting of 1,159 neutral and 706 positive faces. Each selected face was manually labeled as either neutral or positive based on its facial expression. This children's face dataset served as the training, validation, and test set for the FER model used in the second stage via 5-fold cross-validation. We used random crop, rotation, shifting, illumination adjustment, and normalization techniques for data augmentation and noise reduction. Before training, all facial images were resized to 48 × 48 offline, then randomly cropped to 44 × 44 on-the-fly during training. In testing, faces were directly resized to 44 × 44.

   Detected faces may belong to children or to other persons in the scene, or may result from noise. To localize children's faces properly, we leveraged the children's face dataset to create a face embedding database, where each face was encoded as a 128-dimensional vector. Whenever a new face is encountered, we compare its embedding against the established embedding database to find matches. A 'match' is declared when the new face embedding is close enough to a known face embedding under the cosine metric, i.e., within a given confidence threshold. Only matched faces were used for predictions, and unmatched faces were excluded.
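   A brief sketch of this matching step, assuming the embeddings are numpy vectors; the 0.4 cosine-distance threshold and the randomly generated database below are illustrative, not the values used in the paper.

import numpy as np

def is_match(new_embed, known_embeds, threshold=0.4):
    """True if the new 128-d face embedding is within the cosine-distance
    threshold of any embedding in the children's face database."""
    new_embed = new_embed / np.linalg.norm(new_embed)
    known = known_embeds / np.linalg.norm(known_embeds, axis=1, keepdims=True)
    cosine_dist = 1.0 - known @ new_embed      # one distance per known face
    return bool(np.min(cosine_dist) < threshold)

# Example: a database of 1,756 known children's faces and one detected face.
database = np.random.randn(1756, 128)
query = np.random.randn(128)
print(is_match(query, database))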
Speech Emotion Recognition

Since the whole dataset is imbalanced, with far fewer negative video clips than non-negative ones, we applied weighted sampling to increase the occurrence of negative samples while working with the spectrograms. We chose a batch size of 32, and the network was trained for 25 epochs. We used the Adam optimizer (Kingma and Ba 2014), and the learning rate was set to 0.001.
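   A minimal PyTorch sketch of the weighted sampling used for the imbalanced spectrogram data, assuming integer class labels (0 = non-negative, 1 = negative); the tensors, class ratio, and dataset wrapper are placeholders.

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder spectrogram tensors and labels (0 = non-negative, 1 = negative).
specs = torch.randn(1000, 1, 224, 224)
labels = torch.from_numpy(np.random.choice([0, 1], size=1000, p=[0.95, 0.05]))

# Weight each sample inversely to its class frequency so that the rare
# negative samples are drawn more often within every epoch.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(specs, labels), batch_size=32, sampler=sampler)
# model = ...; optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Train for 25 epochs over `loader` as described above.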
Facial Expression Recognition

The training set was the selected ASD children's faces described above. We set the batch size to 64, trained for a total of 25 epochs, and chose the Adam optimizer (Kingma and Ba 2014) with an initial learning rate of 0.001. The learning rate was decreased by a factor of 0.1 every 20 epochs.
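   A short sketch of this training configuration, assuming torchvision-style transforms and a PyTorch training loop; the rotation range, illumination jitter, and the stand-in model are illustrative assumptions.

import torch
from torch import nn, optim
from torchvision import transforms

# Augmentation described above: resize to 48 x 48, then random 44 x 44 crops on-the-fly.
train_tf = transforms.Compose([
    transforms.Resize((48, 48)),
    transforms.RandomCrop(44),
    transforms.RandomRotation(10),           # illustrative rotation range
    transforms.ColorJitter(brightness=0.2),  # illustrative illumination adjustment
    transforms.ToTensor(),
])

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 44 * 44, 2))  # stand-in for the FER model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(25):
    # ... one pass over the training loader (batch size 64) goes here ...
    scheduler.step()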
   Unlike the training phase, in testing, input images were captured every five frames from the videos on-the-fly. Note that the inputs in testing were not face crops but whole image frames, so a face detector had to be applied to capture human faces from the frames. MTCNN was applied to the test images, and the detected faces were compared with the established children's face database. Once children's faces were matched and located, the trained model predicted the children's facial expressions, and such predictions were counted as valid votes. Frames were discarded if no target children's faces were detected, including frames with no faces at all or with faces only from others (e.g., therapists or parents). At the end of each video, if the portion of positive predictions among all valid votes exceeds a certain threshold, the whole video is predicted as positive, and otherwise as neutral. In this experiment, we set the threshold to 0.5, which is equivalent to majority voting. The workflow of the test phase is given in Algorithm 1.

Algorithm 1 Stage 2 testing (input video v, face detector face_det, children's face embeddings child_embeds, classifier model, threshold T)
 1: vote_list = []                        . Initialize a list to store valid votes
 2: Load video v and capture one image every 5 frames; store all captured images in frame_list
 3: for img in frame_list do
 4:     fl = face_det(img)                . Detected face list
 5:     for each face f in fl do
 6:         if f ∈ child_embeds then      . Locate children
 7:             pred = model(f)
 8:             vote_list.append(pred)
 9:         end if
10:     end for
11: end for
12: pos_ratio = count(pred == Positive in vote_list) / length(vote_list)
13: if pos_ratio >= T then                . Majority voting
14:     return Positive
15: else
16:     return Neutral
17: end if
                                               5    Results

We used 5-fold cross-validation to report findings from our participant videos (recordings from two children were merged because of their small number of video clips), totalling five batches of data from six children. Since we had imbalanced classes, in addition to accuracy we reported recall, F1 score, G-mean value, and ROC-AUC score for a more in-depth analysis.
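   A brief sketch of how these metrics can be computed per fold with scikit-learn and numpy; the label and score arrays below are placeholders, and the G-mean is taken as the geometric mean of the per-class recalls.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0])                 # placeholder ground truth
y_pred = np.array([0, 0, 1, 0, 0, 0])                 # placeholder predictions
y_score = np.array([0.1, 0.2, 0.9, 0.3, 0.4, 0.2])    # placeholder class-1 probabilities

accuracy = accuracy_score(y_true, y_pred)
recall_per_class = recall_score(y_true, y_pred, average=None)
f1 = f1_score(y_true, y_pred)
g_mean = float(np.sqrt(np.prod(recall_per_class)))    # geometric mean of recalls
roc_auc = roc_auc_score(y_true, y_score)
print(accuracy, recall_per_class, f1, g_mean, roc_auc)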
Stage 1: Negative vs. Non-Negative

We achieved an accuracy of 94.48% and an F1 score of 0.97. The recall of the negative and non-negative labels is 68.42% and 95.57%, respectively. In addition, the G-mean value and ROC-AUC score are 0.92 and 0.93, respectively. The confusion matrix of stage 1 is shown in Figure 4, and the classification results for the recordings of each participant are shown in Figure 5.

Figure 4: Confusion matrix for the prediction of negative and non-negative videos from the audio modality and SER.

Figure 5: Classification results per participant in stage 1, negative vs. non-negative. The y-axis shows the percentages of predicted labels for video clips from every participant (data are normalized by the total video clips from every participant, because video counts vary significantly). The legend format is: true label to predicted label (e.g., neg to non-neg means a negative video is misclassified as a non-negative one).

Stage 2: Neutral vs. Positive

We reached an overall accuracy of 75.93%, where the recall for the neutral and positive classes was 78.29% and 63.24%, respectively. The confusion matrix is shown in Figure 6, and the F1 score is 0.79. The results from each participant's videos are shown in Figure 7.

Figure 6: Confusion matrix for the prediction of positive and neutral videos from FER.

Figure 7: Classification results per participant in stage 2, neutral vs. positive. The y-axis shows the percentages of predicted labels for video clips from every participant (data are normalized by the total video clips from every participant, since video counts vary significantly). The legend format is: true label to predicted label (e.g., neu to pos means a neutral video is misclassified as a positive one). Blue and green represent correct predictions for positive and neutral video clips, while red and orange stand for misclassified ones. The larger the combined green and blue areas, the better the accuracy our model achieves. This figure illustrates that our model generalizes well among all individuals in ASD-affect in distinguishing neutral and positive videos.

Overall Accuracy

According to the confusion matrix for all three classes (Figure 8), we correctly classified 285 neutral videos, 43 positive videos, and 13 negative ones. Adding them up, we had 341 out of 471 correctly classified clips, leading to an overall accuracy of 72.40% and an F1 score of 0.75.

Figure 8: Confusion matrix for all three classes.

                                              6    Discussion

Overall, our method achieves acceptable performance in both stages. However, there is a noticeable accuracy gap between stage 1 and stage 2 and between the dominant and non-dominant classes of each stage. If we view our problem as two binary classification problems (stage one separating negative from non-negative samples, and stage two classifying the non-negative samples into positive and neutral), the recall rates for the non-dominant labels of both stages, negatives in stage 1 and positives in stage 2, are very comparable: 68.42% and 63.24%, respectively. On the other hand, among the dominant classes of stages 1 and 2, the recall for non-negatives is significantly higher than that for neutral labels (95.57% vs. 78.29%). This may be because speech emotion features, such as shouting and screaming, are more distinct and recognizable and describe negative videos better. Moreover, distinguishing positive from neutral labels in stage 2 was very difficult even for subject matter experts due to data noise and low video resolution. As such, SER performed relatively better than FER on our ASD-Affect dataset.

                                              7    Conclusion

This paper proposed a novel framework for automatic emotion recognition of children with ASD using multi-modal information (facial and speech emotion), providing a baseline model for affect state analysis in play therapy. This work also has implications for automated affect annotation of play therapy video recordings. Moreover, the framework leverages human expertise to a great extent by proposing a two-stage schema, a novel way to combine human knowledge and machine intelligence in ASD-related research.

   We anticipate expanding this project in multiple directions. We aim to offer a semi-automated annotation framework to help subject-matter experts swiftly annotate recordings from children with autism. We discussed some challenges of the pre-recorded videos in our dataset, especially the low-resolution issue. To overcome this problem, we plan to collect more audio-visual data at higher resolution so that we can deploy other FER techniques, including the sequential and action-unit based approaches mentioned in this paper. Furthermore, deficits in mutual and shared gaze are known to be a strong predictor of autism among children (Zhao et al. 2017), which we are interested in investigating in the future as a next step from our previous research (Guo and Barmaki 2020) on automatic detection of mutual gaze among adults.

                                           8    Acknowledgments

We wish to acknowledge the support of the entire research team and of the study participants and their caregivers in collecting the ASD-affect dataset. We also thank our sponsor, the Amazon Research Awards Program, for its generous support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
                                              References

Baio, J.; Wiggins, L.; Christensen, D. L.; Maenner, M. J.; Daniels, J.; Warren, Z.; Kurzius-Spencer, M.; Zahorodny, W.; Rosenberg, C. R.; White, T.; et al. 2018. Prevalence of autism spectrum disorder among children aged 8 years - autism and developmental disabilities monitoring network, 11 sites, United States, 2014. MMWR Surveillance Summaries 67(6): 1.

Bangerter, A.; Chatterjee, M.; Manfredonia, J.; Manyakov, N. V.; Ness, S.; Boice, M. A.; Skalkin, A.; Goodwin, M. S.; Dawson, G.; Hendren, R.; et al. 2020. Automated recognition of spontaneous facial expression in individuals with autism spectrum disorder: parsing response variability. Molecular Autism 11(1): 1–15.

Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W. F.; and Weiss, B. 2005. A database of German emotional speech. In Ninth European Conference on Speech Communication and Technology.

Chen, M.; He, X.; Yang, J.; and Zhang, H. 2018. 3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition. IEEE Signal Processing Letters 25(10): 1440–1444.

Chethik, M. 2003. Techniques of child therapy: Psychodynamic strategies. Guilford Press.

Dhall, A.; Goecke, R.; Lucey, S.; and Gedeon, T. 2012. Collecting Large, Richly Annotated Facial-Expression Databases from Movies. IEEE MultiMedia 19(3): 34–41.

Doyran, M.; Türkmen, B.; Oktay, E. A.; Halfon, S.; and Salah, A. A. 2019. Video and Text-Based Affect Analysis of Children in Play Therapy. In 2019 International Conference on Multimodal Interaction, ICMI '19, 26–34.

Ekman, P. 1994. Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique.

Guo, Z.; and Barmaki, R. 2020. Deep neural networks for collaborative learning analytics: Evaluating team collaborations using student gaze point prediction. Australasian Journal of Educational Technology 36(6): 53–71.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Kaur, M.; and Bhat, A. 2019. Creative Yoga Intervention Improves Motor and Imitation Skills of Children With Autism Spectrum Disorder. Physical Therapy 99(11): 1520–1534.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li, S.; and Deng, W. 2018. Deep facial expression recognition: A survey. arXiv preprint arXiv:1804.08348.

Li, S.; Zheng, W.; Zong, Y.; Lu, C.; Tang, C.; Jiang, X.; Liu, J.; and Xia, W. 2019. Bi-Modality Fusion for Emotion Recognition in the Wild. In 2019 International Conference on Multimodal Interaction, ICMI '19, 589–594.

Liu, C.; Tang, T.; Lv, K.; and Wang, M. 2018. Multi-Feature Based Emotion Recognition for Video Clips. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI '18, 630–634.

Livingstone, S. R.; and Russo, F. A. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391.

Lucey, P.; Cohn, J. F.; Kanade, T.; Saragih, J.; Ambadar, Z.; and Matthews, I. 2010. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 94–101.

McFee, B.; Raffel, C.; Liang, D.; Ellis, D. P.; McVicar, M.; Battenberg, E.; and Nieto, O. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8.

Mehrabian, A.; et al. 1971. Silent messages, volume 8. Wadsworth, Belmont, CA.

National Institute of Health. 2018. Autism Spectrum Disorder. https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/index.shtml.

Nojavanasghari, B.; Baltrušaitis, T.; Hughes, C. E.; and Morency, L.-P. 2016. EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, ICMI '16, 137–144.

Stevens, S. S.; Volkmann, J.; and Newman, E. B. 1937. A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8(3): 185–190.

Valstar, M.; and Pantic, M. 2010. Induced disgust, happiness and surprise: an addition to the MMI facial expression database. In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, 65. Paris, France.

Venkataramanan, K.; and Rajamohan, H. R. 2019. Emotion Recognition from Speech.

Weitlauf, A. S.; Gotham, K. O.; Vehorn, A. C.; and Warren, Z. E. 2014. Brief report: DSM-5 "levels of support:" a comment on discrepant conceptualizations of severity in ASD. Journal of Autism and Developmental Disorders 44(2): 471–476.

Yeh, S.; Lin, Y.; and Lee, C. 2019. An Interaction-aware Attention Network for Speech Emotion Recognition in Spoken Dialogs. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6685–6689.

Yoon, S.; Dey, S.; Lee, H.; and Jung, K. 2020. Attentive modality hopping mechanism for speech emotion recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3362–3366. IEEE.

Zhang, K.; Zhang, Z.; Li, Z.; and Qiao, Y. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23(10): 1499–1503.

Zhang, S.; Zhang, S.; Huang, T.; and Gao, W. 2018. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching. IEEE Transactions on Multimedia 20(6): 1576–1590.

Zhao, J.; Mao, X.; and Chen, L. 2019. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control 47: 312–323.

Zhao, S.; Uono, S.; Yoshimura, S.; Kubota, Y.; and Toichi, M. 2017. Atypical gaze cueing pattern in a complex environment in individuals with ASD. Journal of Autism and Developmental Disorders 47(7): 1978–1986.