<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Two-stage Multi-modal Affect Analysis Framework for Children with Autism Spectrum Disorder</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jicheng Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anjana Bhat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roghayeh Barmaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Delaware</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Autism spectrum disorder (ASD) is a developmental disorder that affects a person's communication and social behavior, such that those on the spectrum have difficulty perceiving other people's facial expressions, as well as presenting and communicating emotions and affect through their own faces and bodies. Some efforts have been made to predict and improve the affect states of children with ASD in play therapy, a common method to improve children's social skills via play and games. However, many previous works only used models pre-trained on benchmark emotion datasets and failed to consider the distinction in emotion expression between typically developing children and children with autism. In this paper, we present an open-source two-stage multi-modal approach leveraging acoustic and visual cues to predict three main affect states (positive, negative, and neutral) of children with ASD in real-world play therapy scenarios, achieving an overall accuracy of 72.40%. This work presents a novel way to combine human expertise and machine intelligence for ASD affect recognition by proposing a two-stage schema.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Autism is the fastest-growing developmental disorder in the
United States: approximately 1 in 54 children is on the
autism spectrum
        <xref ref-type="bibr" rid="ref1">(Baio et al. 2018)</xref>
        . Individuals with ASD
are characterized by significant social
communication impairments, such as inefficient use of social gaze,
gestures, and verbal communication
        <xref ref-type="bibr" rid="ref22">(National Institute of
Health 2018)</xref>
        . Thus, individuals on the spectrum have
difficulty perceiving and presenting communication cues such
as emotion. Previous research has shown that play therapy
can improve children’s social and emotional skills and
help them better perceive their internal emotional world
        <xref ref-type="bibr" rid="ref6">(Chethik 2003)</xref>
        .
The video recordings of play therapy interventions can
provide a rich source to analyze children’s emotion or affect
states during treatment sessions.
      </p>
      <p>
        In this paper, we present a two-stage affect prediction
method using video data (see Figure 1). We use a subset
of the ASD-affect dataset from
        <xref ref-type="bibr" rid="ref13 ref26">(Kaur and Bhat 2019)</xref>
        , which
includes more than four different therapeutic games for
children to play. Sample settings of ASD-affect are shown in
Figure 2.
      </p>
      <p>
        Emotion recognition is the process of identifying human
emotion from multiple cues, including facial and spoken
expressions as well as physiological and biological signals.
Machine learning, computer vision, and speech and signal
processing techniques allow us to automate the process of
emotion recognition. Researchers have shown that messages
pertaining to feelings, affects, and attitudes in interpersonal
communication largely reside in facial expressions
and speech
        <xref ref-type="bibr" rid="ref21 ref7">(Mehrabian et al. 1971; Dhall et al. 2012)</xref>
        .
Inspired by that, in this paper, we define the problem as
automating emotion recognition for children with ASD
using multi-modal inputs, especially from visual and audio
signals.
      </p>
      <p>However, there are several challenges in the process of
automatic emotion recognition on the ASD-affect dataset:</p>
      <p>Insufficient public datasets: having adequate labeled
training data that include as many variations of the
populations and environments as possible is important for the
design of a deep expression recognition system. However,
due to privacy concerns, ASD datasets, especially those from
children, are very scarce. ASD-related multi-modal datasets, which
record children’s behaviors in play therapy, are even more
sparse.</p>
      <p>
        Domain shifts: existing methods
        <xref ref-type="bibr" rid="ref9">(Doyran et al. 2019)</xref>
        have directly applied models pre-trained on typically developing people to
play therapy analysis, either explicitly or implicitly under
the assumption that emotional traits such as facial
expressions of typically developing people and children with ASD are
the same or similar. We argue that such simplifications are
not always appropriate given that children with ASD have
affect and communication disorders and may not express their
emotions in typical ways, especially children diagnosed at
level three
        <xref ref-type="bibr" rid="ref27">(Weitlauf et al. 2014)</xref>
        .
      </p>
      <p>
        Data noise: many existing benchmarks for emotion
recognition
        <xref ref-type="bibr" rid="ref19 ref25 ref4">(Lucey et al. 2010; Valstar and Pantic 2010;
Burkhardt et al. 2005)</xref>
        are posed and collected in controlled
laboratory settings. In contrast, our dataset was collected in
an in-the-wild manner featuring various backgrounds,
people, activities, and durations. Therefore, ASD-affect contains
considerable noise, which requires substantial data cleaning and
post-processing.
      </p>
      <p>
        Sparse labeling: unlike other multi-modal benchmark
datasets
        <xref ref-type="bibr" rid="ref23 ref7">(Dhall et al. 2012; Nojavanasghari et al. 2016)</xref>
        where the durations of data samples are on the scale of
seconds (usually less than 5 seconds), samples of the ASD-affect
dataset may last for minutes, which amounts to lacking ground
truth or introducing excessive noise into the dataset. This is
because benchmark datasets were intentionally collected and
labeled for automatic recognition by machines, whereas
ASD-affect was initially compiled and annotated to serve human
experts.
      </p>
      <p>Despite all these challenges, we used transfer learning,
fine-tuning, and data post-processing (detailed in the following
sections) to prepare ASD-affect for further analysis using
speech and facial emotion recognition methods.</p>
      <p>In this paper, we propose a two-stage framework to
evaluate affect states of children in play therapy scenarios using
multi-modal emotion cues. This method effectively
combines prior knowledge from human experts with machine
intelligence. To distinguish between three
different affect states (neutral, positive, and negative), in stage
1 the model predicts whether children are in a negative state
based on negative symptoms (shouting and screaming)
residing in speech. In stage 2, children’s emotions in
positive and neutral states are recognized by distinct facial
expressions. The workflow of our framework is presented in
Figure 1. Our approach enables physical therapists to
better and more efficiently analyze the effectiveness of play
therapy interventions since human professionals require a
fair amount of training to better understand the behaviors
and emotional states of children with ASD. This method can be
further applied to data annotation and label verification for
other ASD datasets, as the behaviors of children with ASD
resemble one another relatively well.</p>
      <p>This paper is organized as follows. We first summarize
related works in emotion recognition and play therapy analysis
in Section 2. Section 3 describes our proposed method,
followed by the experiments and results discussion presented in
Sections 4, 5, and 6. Lastly, Section 7 outlines the conclusion
and future steps for this research.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Multi-modal Emotion Recognition</title>
        <p>
          Emotions are complex psychological states composed
of several components: personal experience and physiological,
behavioral, and communicative reactions. There are two
mainstream emotion representations: the discrete model
          <xref ref-type="bibr" rid="ref10">(Ekman 1994)</xref>
          and the dimensional model. In this paper, we use the
discrete emotion model.
        </p>
        <p>
          Emotions can be carried in various modalities of input.
Mehrabian shows that 55% of messages pertaining to
feelings and attitudes in interpersonal communication is conveyed
through facial expressions
          <xref ref-type="bibr" rid="ref21">(Mehrabian et al. 1971)</xref>
          . In addition, Dhall et al.
suggest that the audio modality can bring extra gains in
emotion recognition accuracy
          <xref ref-type="bibr" rid="ref7">(Dhall et al. 2012)</xref>
          . Thus,
multi-modal emotion recognition approaches usually outperform
unimodal ones. Two main sub-fields of multi-modal emotion
recognition are facial expression recognition (FER)
and speech emotion recognition (SER), which are also the
main focus of this work.
        </p>
        <p>
          Facial Expression Recognition FER systems can be
divided into two main categories based on their feature
representations: static and dynamic. In static-based methods, the
feature representation is encoded with only spatial
information from a single image frame. In contrast, dynamic-based
approaches consider temporal relations among contiguous
frames in the input facial expression sequence
          <xref ref-type="bibr" rid="ref15 ref18">(Li and Deng
2018)</xref>
          . Li et al. proposed a bi-modality method
          <xref ref-type="bibr" rid="ref16">(Li et al. 2019)</xref>
          in which convolutional neural networks (CNNs) were used to
recognize static facial expressions while a bi-directional long
short-term memory (Bi-LSTM) network was employed to learn dynamic
facial expression sequences extracted by CNNs. Liu et al. also
incorporated facial landmarks in their FER system
          <xref ref-type="bibr" rid="ref17">(Liu et al.
2018)</xref>
          . However, these works were conducted on benchmark
datasets
          <xref ref-type="bibr" rid="ref7">(Dhall et al. 2012)</xref>
          where the sequential relation of
images is well preserved, so sequential methods are able to
function. Conversely, our ASD dataset was recorded in
natural, in-the-wild settings, so we could only use a
static-based method to classify facial expressions in each frame,
without considering temporal information.
        </p>
        <p>
          Speech Emotion Recognition Speech is a rich, dense
form of communication that can convey information
effectively. There are two classical ways to extract emotional
features from speech. The first is to obtain low-level descriptor
features of speech, such as Mel-frequency cepstral
coefficients
          <xref ref-type="bibr" rid="ref13 ref26 ref29 ref30">(Yeh, Lin, and Lee 2019; Yoon et al. 2020)</xref>
          . The other
is to convert audio to spectrograms and then use CNNs as
feature extractors
          <xref ref-type="bibr" rid="ref13 ref26 ref32 ref34">(Zhang et al. 2018; Zhao, Mao, and Chen
2019)</xref>
          . In this paper, we use spectrograms as audio
representations.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Play Therapy Analysis</title>
        <p>
          Play therapy is an approach to psychotherapy in which a
child engages in play activities. Doyran and colleagues
          <xref ref-type="bibr" rid="ref9">(Doyran et al. 2019)</xref>
          proposed a visual and text-based
framework to track the affective state of a child during a play
therapy session. However, the audio modality was less explored in
their work, and categorical representations of facial
expressions needed more investigation. Bangerter et al. investigated the
spontaneous production of facial expressions by individuals
with ASD in response to entertaining videos
          <xref ref-type="bibr" rid="ref2">(Bangerter
et al. 2020)</xref>
          . They found that individuals with ASD showed
less evidence of facial action units relating to positive facial
expressions than typically developing children. Due to the small
face sizes and low resolution of the ASD-affect dataset, a facial
action unit approach was not feasible in the current work, but
we plan to explore it in the future.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
<title>Method</title>
      <p>In this paper, we propose an open-source two-stage
multi-modal framework to predict children’s affect states in play
therapy leveraging visual and audio information (the source code
is available to download at GitHub:
https://github.com/Li-Jicheng/Autism-Affect-and-EmotionRecognition). First, we
distinguished negative videos from non-negative ones
(neutral and positive) using spectrograms generated from audio.
Next, to differentiate between positive and neutral videos,
we used static-based facial expression recognition methods.
The workflow of this method is illustrated in Figure 1.</p>
      <sec id="sec-3-1">
        <title>Two-stage Schema</title>
        <p>Our data assessment on ASD-affect inspired the two-stage
approach. Children who participated negatively or passively
in play interventions tended to shout and scream more
often, and this characteristic is manifest in speech. However,
there are no significant differences in speech emotion
between neutral and positive recordings. Instead, children are
smiling when positively engaged in therapy, while their
facial expressions remain neutral more often in neutral states.
Therefore, we chose to leverage the variance in facial
expressions to distinguish between positive and neutral data.</p>
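        <p>The following minimal sketch illustrates this two-stage decision logic in Python; the helper functions (extract_audio, predict_speech_negative, and predict_face_positive) are hypothetical wrappers around the speech and facial models described in the next two subsections.</p>
        <preformat>
def predict_affect(video):
    """Two-stage affect prediction (sketch; all helpers are hypothetical)."""
    audio = extract_audio(video)          # pull the audio track from the clip
    if predict_speech_negative(audio):    # stage 1: log-Mel spectrogram + CNN
        return "negative"
    if predict_face_positive(video):      # stage 2: frame-level FER + voting
        return "positive"
    return "neutral"
        </preformat>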
      </sec>
      <sec id="sec-3-2">
        <title>Stage 1: Negative vs Non-negative</title>
        <p>
          Since distinct speech emotions exhibit different patterns
in the energy spectrum, to capture emotion features from
speech we selected log-Mel spectrograms, which have been
effective in speech emotion recognition tasks in the past
          <xref ref-type="bibr" rid="ref13 ref26 ref32 ref34 ref5">(Zhao, Mao, and Chen 2019; Zhang et al. 2018; Chen et al.
2018)</xref>
          . A spectrogram is a visual representation of the
spectrum of a signal’s frequencies as it varies with time. It is a
graph with two geometric dimensions: time and frequency.
The amplitude of a particular frequency at a particular time
is represented by the intensity or color of each pixel in the
spectrogram. A Mel-spectrogram is a spectrogram whose
frequencies are converted to the Mel scale, a perceptual
scale of pitches judged by listeners to be equal in distance
from one another
          <xref ref-type="bibr" rid="ref24">(Stevens, Volkmann, and Newman 1937)</xref>
          .
We used the logarithmic form of the Mel-spectrogram to better
reflect emotions, since humans perceive sound on a
logarithmic scale
          <xref ref-type="bibr" rid="ref13 ref26">(Venkataramanan and Rajamohan 2019)</xref>
          .
        </p>
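        <p>As a concrete illustration, the sketch below computes a log-Mel spectrogram with librosa; the FFT window, hop length, and number of Mel bands follow the settings reported in the Experiment section, while the sample rate is an assumption.</p>
        <preformat>
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=64):
    """Load audio and return its log-Mel spectrogram in decibels."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # logarithmic (dB) scaling
        </preformat>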
      </sec>
      <sec id="sec-3-3">
        <title>Stage 2: Neutral vs Positive</title>
        <p>
          As noted earlier, due to image resolution constraints,
temporal information was not well preserved: adjacent frames
were frequently discarded in the data cleaning stage, causing
sequential models to fail to converge. Therefore, we needed
to use static-based methods that depend solely on one frame
to predict facial expressions. We chose ResNet-18
          <xref ref-type="bibr" rid="ref12">(He et al.
2016)</xref>
          with a decreased input size to better fit the average
face sizes detected in ASD-affect. We pre-trained the model
on EmoReact
          <xref ref-type="bibr" rid="ref23">(Nojavanasghari et al. 2016)</xref>
          , a multi-modal
emotion dataset of children, and fine-tuned it on the ASD-affect
dataset.
        </p>
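        <p>A minimal sketch of such an adapted ResNet-18 in PyTorch follows; the smaller first convolution and removed max-pooling are a common adaptation for low-resolution inputs and are assumptions here, as is feeding it the 44 × 44 crops described in the Experiment section.</p>
        <preformat>
import torch.nn as nn
from torchvision.models import resnet18

def build_fer_model(num_classes=2):
    """ResNet-18 adapted for small face crops (sketch, not the exact model)."""
    model = resnet18()
    # A 3x3 stride-1 first conv and no initial max-pool preserve spatial
    # detail in small inputs (assumed adaptation for the reduced input size).
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1,
                            padding=1, bias=False)
    model.maxpool = nn.Identity()
    # Two output classes: neutral vs. positive.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
        </preformat>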
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <sec id="sec-4-1">
        <title>Data Processing</title>
        <p>
          ASD-affect Dataset Bhat and colleagues proposed that the
use of embodied, multisystem interventions can help
ameliorate various social communication, perceptuo-motor, and
cognitive-behavioral impairments of children with ASD
          <xref ref-type="bibr" rid="ref13 ref26">(Kaur and Bhat 2019)</xref>
          . They have studied the effects of
various embodied creative interventions, including robotic,
musical, physical activity, yoga, and dance therapy
interventions for children with ASD. The video recordings
of such interventions, known as the ASD-affect dataset, have
provided a rich source for analyzing children’s affect states
in play therapy. In this paper, we used a subset of ASD-affect
from six children. Sample data of ASD-affect are shown in
Figure 2.
        </p>
        <p>
          Data Reconstruction Originally, there were eight
different types of labels in ASD-affect: neutral, interested,
positive, positive and talking, odd positive, runs away,
camera difficulties, and negative. For our work, we reconstructed
the dataset by excluding some labels (runs away, camera
difficulties, and odd positive) and merging others (interested,
positive, and positive and talking were all treated as
positive). After this reconstruction step, we had a
total of 471 clips from six children in three classes: positive
(68 clips), neutral (384 clips), and negative (19 clips). Clip
lengths varied. See Figure 3 for the reconstructed data
distribution.
        </p>
        <p>
          Log-Mel Spectrograms We first extracted audio tracks
from the video recordings. Audio files were stored in
Waveform Audio File Format to retain high fidelity. We then
applied noise reduction to the audio files and removed silent
utterances. Afterwards, we split each audio file into
equal-length segments of 3 seconds, and zero-padding was
applied to utterances whose duration was less than 3 seconds.
We chose this sequence length since the average audio
duration in selected benchmark datasets for SER is 3 seconds
          <xref ref-type="bibr" rid="ref15 ref18 ref4">(Burkhardt et al. 2005; Livingstone and Russo 2018)</xref>
          . After
that, log-Mel spectrograms were generated from each audio
segment using the librosa toolkit
          <xref ref-type="bibr" rid="ref20">(McFee et al. 2015)</xref>
          . We set
the Fast Fourier Transform (FFT) window length and hop
length to 2,048 and 512, respectively, and used 64 Mel bands
in the spectrogram generation. A total of 9,968 log-Mel
spectrograms were generated, including 134 negative
samples and 9,834 non-negative samples. We then
down-sampled the non-negative spectrograms to even out the data
and reduce class imbalance. In both the training and testing
phases, all log-Mel spectrograms were normalized by the global
mean and standard deviation of the training set. All spectrograms
were resized to 224 × 224 to match the network’s input size.
        </p>
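        <p>A minimal sketch of the segmentation and zero-padding step, assuming the waveform has already been denoised and stripped of silent utterances:</p>
        <preformat>
import numpy as np

def split_into_segments(y, sr, seg_seconds=3):
    """Split a waveform into equal-length 3 s segments,
    zero-padding the last segment if it is shorter."""
    seg_len = seg_seconds * sr
    segments = []
    for start in range(0, len(y), seg_len):
        seg = y[start:start + seg_len]
        if seg_len > len(seg):
            seg = np.pad(seg, (0, seg_len - len(seg)))  # pad with zeros
        segments.append(seg)
    return segments
        </preformat>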
        <p>
          Facial Images We first extracted image frames from the raw
video clips at a specific sampling rate. Considering that
the durations of neutral clips were typically longer than positive
ones in ASD-affect, we set the sampling rate to 3 frames per
second (FPS) for positive video clips and 1 FPS for neutral
ones to stratify the data proportionally. Then we used MTCNN
          <xref ref-type="bibr" rid="ref12 ref31">(Zhang et al. 2016)</xref>
          to detect human faces in each frame. We
selected 1,756 template faces of children (about 2% of the
total detected faces) to create a facial expression database
for ASD-affect, consisting of 1,159 neutral and 706 positive
faces. Each selected face was manually labeled as either
neutral or positive based on its facial expression. This children’s
face dataset served as the training, validation, and test set for
the FER model used in the second stage via 5-fold cross-validation.
We used random crop, rotation, shifting, illumination
adjustment, and normalization techniques for data augmentation
and noise reduction. Before training, all facial images were
resized to 48 × 48 offline, then randomly cropped to 44 × 44
on-the-fly during training. In testing, faces were directly
resized to 44 × 44.
        </p>
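        <p>The augmentation pipeline could be sketched with torchvision as below; the rotation angle, brightness range, and normalization statistics are assumptions, since only the augmentation types and crop sizes are specified above.</p>
        <preformat>
import torchvision.transforms as T

# Training-time augmentation for 48 x 48 face crops (sketch).
train_transform = T.Compose([
    T.Resize((48, 48)),
    T.RandomCrop(44),                 # random 44 x 44 crop (shift effect)
    T.RandomRotation(degrees=10),     # rotation (angle assumed)
    T.ColorJitter(brightness=0.2),    # illumination adjustment (range assumed)
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed stats
])

# At test time, faces are resized directly to 44 x 44.
test_transform = T.Compose([
    T.Resize((44, 44)),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
        </preformat>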
        <p>Detected faces may belong to the children or to other persons
in the scene, or may be false positives due to noise. To localize
children’s faces properly, we leveraged the children’s face dataset
to create a face embedding database, where each face was encoded as a
128-dimensional vector. Whenever a new face is encountered, we
compare its embedding with the established embedding database
to find matches. A ’match’ is declared
when the cosine distance between the new face embedding
and a known face embedding falls below a given confidence
threshold. Only matched faces were used for predictions,
and unmatched faces were excluded.</p>
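        <p>A minimal sketch of this matching step follows; the embedding function that produces the 128-dimensional vectors and the 0.4 distance threshold are hypothetical.</p>
        <preformat>
import numpy as np

def is_child_face(new_embedding, known_embeddings, threshold=0.4):
    """Return True if the face matches any entry in the children's
    embedding database (cosine distance below the threshold)."""
    for known in known_embeddings:
        cos_sim = np.dot(new_embedding, known) / (
            np.linalg.norm(new_embedding) * np.linalg.norm(known))
        if threshold > 1.0 - cos_sim:  # cosine distance = 1 - similarity
            return True
    return False
        </preformat>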
      </sec>
      <sec id="sec-4-2">
        <title>Speech Emotion Recognition</title>
        <p>
          Since the whole dataset is imbalanced, with far fewer negative
video clips than non-negative ones, we applied
weighted sampling to increase the occurrence of negative samples
while working with spectrograms. We chose a batch size of
32, and the network was trained for 25 epochs. We used the
Adam
          <xref ref-type="bibr" rid="ref14">(Kingma and Ba 2014)</xref>
          optimizer, and the learning
rate was set to 0.001.
        </p>
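        <p>A sketch of this training setup in PyTorch, assuming a labeled spectrogram dataset train_set and a model already defined:</p>
        <preformat>
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Inverse-frequency weights make rare negative samples appear more often.
labels = torch.tensor([label for _, label in train_set])
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels), replacement=True)

loader = DataLoader(train_set, batch_size=32, sampler=sampler)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        </preformat>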
      </sec>
      <sec id="sec-4-3">
        <title>Facial Expression Recognition</title>
        <p>
          The training set comprised the selected faces of children with
ASD, as mentioned above. We set the batch size to 64, trained for
a total of 25 epochs, and chose the Adam
          <xref ref-type="bibr" rid="ref14">(Kingma and Ba 2014)</xref>
          optimizer with an initial learning
rate of 0.001. The learning rate was decreased by a factor of
0.1 every 20 epochs. Unlike the training phase, in testing,
input images were captured every five frames from videos
on-the-fly. Note that inputs in testing were not face crops
but full image frames, so MTCNN was
applied to the test images to capture human faces. Detected
faces were compared with the established children’s face
database. Once children’s faces were matched and located,
the trained model predicted the children’s facial expressions,
and such predictions were considered valid votes. Frames
were discarded if no target children’s faces were detected,
including frames with no faces or with only faces of others
(e.g., therapists or parents). At the end of each video, if the
proportion of positive predictions among all valid votes exceeds a
certain threshold, the whole video is predicted as positive,
and otherwise as neutral. In this experiment, we set the threshold
to 0.5, equivalent to majority voting. The workflow of the
test phase is explained in Algorithm 1.
        </p>
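        <p>The test-phase workflow can be sketched as below, mirroring Algorithm 1; the frame iterator and the embed helper are hypothetical, and face_det stands in for the MTCNN detector.</p>
        <preformat>
def stage2_predict(frames, face_det, child_embeds, model, T=0.5):
    """Video-level neutral/positive prediction by majority voting
    over per-frame facial expression predictions (sketch)."""
    votes = []
    for frame in frames[::5]:                    # capture every five frames
        for face in face_det(frame):             # detect faces (e.g., MTCNN)
            if is_child_face(embed(face), child_embeds):
                votes.append(model(face))        # 1 = positive, 0 = neutral
    if not votes:
        return "neutral"                         # fallback (assumption)
    return "positive" if sum(votes) / len(votes) > T else "neutral"
        </preformat>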
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We used 5-fold cross-validation to report findings from
our participant videos (recordings from two children were
merged together due to the small number of video clips,
totalling five batches of data from six children). Since we had
imbalanced classes, in addition to accuracy we reported
recall, F1 score, G-mean value, and ROC-AUC score for more
in-depth analysis.</p>
      <p>[Algorithm 1: Stage 2 testing (input video v, face detector face_det, children's face embeddings child_embeds, classifier model, threshold T)]</p>
      <sec id="sec-5-1">
        <title>Stage 1: Negative vs. Non-Negative</title>
        <p>We achieved an accuracy of 94.48% and an F1 score of 0.97.
The recall of the negative and non-negative labels is 68.42% and
95.57%, respectively. Besides, the G-mean value and ROC-AUC score
are 0.92 and 0.93, respectively. The confusion matrix of
stage 1 is shown in Figure 4. The classification results from
the recordings of each participant are shown in Figure 5.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Stage 2: Neutral vs. Positive</title>
        <p>We reached an overall accuracy of 75.93%, where the recall for
the neutral and positive classes was 78.29% and 63.24%,
respectively. The confusion matrix is shown in Figure 6. The F1
score is 0.79. The results from each
participant’s videos are shown in Figure 7.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Overall Accuracy</title>
        <p>According to the confusion matrix for all three classes
(Figure 8), we correctly classified 285 neutral videos,
43 positive videos, and 13 negative ones. Adding them up,
we had 341 out of 471 correctly classified clips, leading to
an overall accuracy of 72.40% and an F1 score of 0.75.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>Overall, our method achieves acceptable performance in
both stages. However, there is a noticeable accuracy gap
between stage 1 and stage 2, and between the dominant and
non-dominant classes of each stage. If we view our problem
as two binary classification problems (stage 1 separating
negative from non-negative samples, and stage 2 classifying
non-negative samples into positive and neutral), the recall rates
for the non-dominant labels of the two stages, negatives in stage
1 and positives in stage 2, are very
comparable: 68.42% and 63.24%, respectively. On the other hand,
among the dominant classes of stages 1 and 2, the recall
for non-negatives is significantly higher than for neutral labels
(95.57% vs. 78.29%). This may be because speech emotion
features, such as shouting and screaming, are more distinct
and recognizable and describe negative videos better.
Moreover, distinguishing positive from neutral labels in stage 2
was very difficult even for subject-matter experts due to data
noise and low video resolution. As such, SER performed
relatively better than FER on our ASD-affect dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>This paper proposed a novel framework for automatic
emotion recognition of children with ASD using multi-modal
information (facial and speech emotion), providing a baseline
model for affect state analysis in play therapy. This work
also has implications for automated affect annotation of play
therapy video recordings. In addition, the framework leverages
human expertise to a great extent through its two-stage
schema, a novel way to combine human knowledge and
machine intelligence in ASD-related research.</p>
      <p>
        We anticipate expanding this project in
multiple directions. We aim to offer a semi-automated annotation
framework to assist subject-matter experts in swiftly
annotating recordings of children with autism. We discussed
some challenges of the pre-recorded videos in our dataset,
especially the low-resolution issue. To overcome this
problem, we plan to collect more audio-visual data at higher
resolution to deploy other FER techniques, including the
sequential and action-unit-based approaches mentioned in this
paper. Furthermore, deficits in mutual and shared gaze
are also known to be strong predictors of autism among children
        <xref ref-type="bibr" rid="ref36">(Zhao et al. 2017)</xref>
        , which we are interested in investigating
in the future, as a next line of our previous research
        <xref ref-type="bibr" rid="ref11">(Guo
and Barmaki 2020)</xref>
        on automatic detection of mutual gaze
among adults.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We wish to acknowledge the support of the entire research
team, the study participants, and their caregivers in collecting
the ASD-affect dataset. We also thank our sponsor, the Amazon
Research Awards Program, for its generous support. Any opinions,
findings, and conclusions or recommendations expressed in
this material are those of the authors and do not necessarily
reflect the views of the sponsors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wiggins</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Christensen</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maenner</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Daniels</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Warren</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kurzius-Spencer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zahorodny</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Rosenberg,
          <string-name>
            <surname>C. R.</surname>
          </string-name>
          ; White,
          <string-name>
            <surname>T.</surname>
          </string-name>
          ; et al.
          <year>2018</year>
          .
          <article-title>Prevalence of autism spectrum disorder among children aged 8 years-autism and developmental disabilities monitoring network, 11 sites</article-title>
          , United States,
          <year>2014</year>
          .
          <source>MMWR Surveillance Summaries</source>
          <volume>67</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bangerter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chatterjee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Manfredonia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Manyakov,
          <string-name>
            <given-names>N. V.</given-names>
            ;
            <surname>Ness</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Boice,
          <string-name>
            <given-names>M. A.</given-names>
            ;
            <surname>Skalkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Goodwin</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. S.</surname>
          </string-name>
          ; Dawson,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; Hendren,
          <string-name>
            <surname>R.</surname>
          </string-name>
          ; et al.
          <year>2020</year>
          .
          <article-title>Automated recognition of spontaneous facial expression in individuals with autism spectrum disorder: parsing response variability</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Molecular autism 11(1)</source>
          :
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Burkhardt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Paeschke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rolfes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sendlmeier</surname>
            ,
            <given-names>W. F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>A database of German emotional speech</article-title>
          .
          <source>In Ninth European Conference on Speech Communication and Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.;</given-names>
          </string-name>
          and Zhang,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          <volume>25</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1440</fpage>
          -
          <lpage>1444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Chethik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Techniques of child therapy: Psychodynamic strategies</article-title>
          . Guilford Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dhall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goecke</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Lucey,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; and Gedeon,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>Collecting Large, Richly Annotated Facial-Expression Databases from Movies</article-title>
          .
          <source>IEEE MultiMedia</source>
          <volume>19</volume>
          (
          <issue>3</issue>
          ):
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Doyran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; Türkmen, B.;
          <string-name>
            <surname>Oktay</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Halfon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Salah</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Video and Text-Based Affect Analysis of Children in Play Therapy</article-title>
          .
          <source>In 2019 International Conference on Multimodal Interaction</source>
          , ICMI '
          <volume>19</volume>
          ,
          <fpage>26</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Ekman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and Barmaki,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Deep neural networks for collaborative learning analytics: Evaluating team collaborations using student gaze point prediction</article-title>
          .
          <source>Australasian Journal of Educational Technology</source>
          <volume>36</volume>
          (
          <issue>6</issue>
          ):
          <fpage>53</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Kaur</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bhat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Creative Yoga Intervention Improves Motor and Imitation Skills of Children With Autism Spectrum Disorder</article-title>
          .
          <source>Physical Therapy</source>
          <volume>99</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1520</fpage>
          -
          <lpage>1534</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep facial expression recognition: A survey</article-title>
          .
          <source>arXiv preprint arXiv:1804.08348</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Zheng,
          <string-name>
            <surname>W.</surname>
          </string-name>
          ; Zong,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            ; Liu, J.; and
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Bi-Modality Fusion for Emotion Recognition in the Wild</article-title>
          .
          <source>In 2019 International Conference on Multimodal Interaction</source>
          , ICMI '
          <volume>19</volume>
          ,
          <fpage>589</fpage>
          -
          <lpage>594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Multi-Feature Based Emotion Recognition for Video Clips</article-title>
          .
          <source>In Proceedings of the 20th ACM International Conference on Multimodal Interaction</source>
          , ICMI '
          <volume>18</volume>
          ,
          <fpage>630</fpage>
          -
          <lpage>634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Livingstone</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Russo</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English</article-title>
          .
          <source>PloS one 13</source>
          <volume>(5)</volume>
          :
          <fpage>e0196391</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Lucey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kanade</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Saragih</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ambadar</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Matthews</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotionspecified expression</article-title>
          .
          <source>In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops</source>
          ,
          <fpage>94</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>McFee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Raffel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McVicar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Battenberg</surname>
          </string-name>
          , E.; and
          <string-name>
            <surname>Nieto</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>librosa: Audio and music signal analysis in python</article-title>
          .
          <source>In Proceedings of the 14th python in science conference</source>
          , volume
          <volume>8</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Mehrabian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; et al.
          <year>1971</year>
          .
          <article-title>Silent messages</article-title>
          , volume
          <volume>8</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          National Institute of Health.
          <year>2018</year>
          .
          <article-title>Autism Spectrum Disorder</article-title>
          . https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/index.shtml.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Nojavanasghari</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; Baltrušaitis, T.;
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Morency</surname>
          </string-name>
          , L.-P.
          <year>2016</year>
          .
          <article-title>EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children</article-title>
          .
          <source>In Proceedings of the 18th ACM International Conference on Multimodal Interaction</source>
          , ICMI '
          <volume>16</volume>
          ,
          <fpage>137</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Volkmann</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>E. B.</given-names>
          </string-name>
          <year>1937</year>
          .
          <article-title>A scale for the measurement of the psychological magnitude pitch</article-title>
          .
          <source>The Journal of the Acoustical Society of America</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>185</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Valstar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Pantic,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Induced disgust, happiness and surprise: an addition to the mmi facial expression database</article-title>
          .
          <source>In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect</source>
          ,
          <volume>65</volume>
          . Paris, France.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Venkataramanan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and Rajamohan,
          <string-name>
            <surname>H. R.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Emotion Recognition from Speech</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Weitlauf</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gotham</surname>
            ,
            <given-names>K. O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vehorn</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Warren</surname>
            ,
            <given-names>Z. E.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Brief report: DSM-5 “levels of support”: a comment on discrepant conceptualizations of severity in ASD.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>Journal of autism and developmental disorders 44</source>
          <volume>(2)</volume>
          :
          <fpage>471</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Yeh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>An Interaction-aware Attention Network for Speech Emotion Recognition in Spoken Dialogs</article-title>
          .
          <source>In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>6685</fpage>
          -
          <lpage>6689</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Attentive modality hopping mechanism for speech emotion recognition</article-title>
          .
          <source>In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>3362</fpage>
          -
          <lpage>3366</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and Qiao,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          <volume>23</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1499</fpage>
          -
          <lpage>1503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Zhang,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Huang,
          <string-name>
            <given-names>T.</given-names>
            ; and
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>IEEE Transactions on Multimedia</source>
          <volume>20</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1576</fpage>
          -
          <lpage>1590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Speech emotion recognition using deep 1D &amp; 2D CNN LSTM networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>Biomedical Signal Processing and Control</source>
          <volume>47</volume>
          :
          <fpage>312</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uono</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yoshimura</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Kubota,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          ; and Toichi,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Atypical gaze cueing pattern in a complex environment in individuals with ASD</article-title>
          .
          <source>Journal of autism and developmental disorders</source>
          <volume>47</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1978</fpage>
          -
          <lpage>1986</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>