Comparison of Human Experts and AI in Predicting Autism from Facial Behavior

Evangelos Sariyanidi¹, Casey J. Zampella¹, Ellis DeJardin¹, John D. Herrington¹,², Robert T. Schultz¹,² and Birkan Tunc¹,²

¹ Center for Autism Research, The Children's Hospital of Philadelphia, United States
² University of Pennsylvania, United States


Abstract

Advances in computational behavior analysis via artificial intelligence (AI) promise to improve mental healthcare services by providing clinicians with tools to assist diagnosis or measurement of treatment outcomes. This potential has spurred an increasing number of studies in which automated pipelines predict diagnoses of mental health conditions. However, a fundamental question remains unanswered: How do the predictions of AI algorithms correspond and compare with the predictions of humans? This is a critical question if AI technology is to be used as an assistive tool, because the utility of an AI algorithm would be negligible if it provides little information beyond what clinicians can readily infer. In this paper, we compare the performance of 19 human raters (8 autism experts and 11 non-experts) and that of an AI algorithm in predicting autism diagnosis from short (3-minute) videos of N = 42 participants in a naturalistic conversation. Results show that the AI algorithm achieves an average accuracy of 80.5%, which is comparable to that of clinicians with expertise in autism (83.1%) and clinical research staff without specialized expertise (78.3%). Critically, diagnoses that were inaccurately predicted by most humans (experts and non-experts alike) were typically correctly predicted by the AI. Our results highlight the potential of AI as an assistive tool that can augment clinician diagnostic decision-making.

Keywords
autism, assistive healthcare technologies, digital phenotyping



1. Introduction

Modern medical disciplines typically rely on a variety of technological tools to assist in diagnosis and monitor treatment progress. From brain imaging technologies to blood and genetic tests, instruments that assist medical decision-makers are a cornerstone of modern medicine. In the domain of psychiatry and psychology, however, medical decision-making relies nearly exclusively on observational or paper-and-pencil instruments. Thus, recent advances in computer vision and artificial intelligence (AI) are poised to rapidly advance research and clinical decision-making in psychiatry by introducing reliable and granular tools within a new paradigm: computational behavior analysis [1, 2, 3, 4, 5]. Such tools can capture and quantify human behavior with extraordinary precision, even from brief video recordings.

Autism spectrum disorder (ASD), like nearly all psychiatric conditions, is defined by observable behavioral cues – what a person does well or not well, too little or too much. Its core traits include observable differences in social communication, social reciprocity, nonverbal communication, and relationships, as well as restricted patterns of interests and activities [6]. The current reliance on assessment and interpretation of overt behavior makes autism an excellent candidate for computational behavior analysis approaches. Coupling computationally derived biomarkers with expert clinician judgment may provide an extremely potent approach to autism care, by enhancing the currently limited reliability of clinical assessments (e.g., DSM-5 field trials Kappa = 0.69) [7], shortening lengthy diagnostic evaluations, and improving sensitivity for capturing change over the course of treatment and development.

This potential has spurred a plethora of studies that aim to diagnose autism via AI pipelines based on various behavioral modalities and sensors [8]. Notably, to our knowledge, no study has directly compared AI algorithms and human raters with respect to overall predictive capacity or specific decisions on individual cases. A comparison of this kind is important when it comes to using AI as an assistive technology for clinical decision-making, as it can determine whether or not AI provides significant incremental utility beyond existing tools. AI algorithms can work synergistically with human assessment by complementing and augmenting human decisions.
On the other hand, clinicians would have little interest in or benefit from incorporating AI algorithms if their decisions – and errors – highly overlap with their own. We aim to address this issue by examining whether or not AI detects diagnostic indicators that may go unnoticed by human observation.

In this paper, our main contribution is comparing the performance of AI and humans with knowledge of autism in accurately classifying autism from a 3-minute get-to-know-you conversation with a non-clinician conversation partner. Specifically, we implemented a computer vision pipeline for predicting autism using features of facial behavior during conversations with a sample of N = 42 adults – 15 individuals with autism spectrum disorder (ASD) and 27 neurotypical (NT) individuals. We then recruited a total of 19 human raters (8 expert clinicians, 11 non-experts with experience with autism) to predict the diagnostic status of the same participants. The expert raters were doctoral-level clinicians with extensive training on autism, while most of the non-experts were BA-level researchers still learning about autism. Raters watched the same videos of participants' faces during conversations that were fed to the computer vision pipeline, without sound, to allow for a fairer comparison with the AI algorithm.

Results suggest that the AI pipeline based on participant facial behavior predicts diagnostic status with 80.5% accuracy. This accuracy was comparable to the 80.3% overall accuracy achieved by human raters (83.1% for experts and 78.3% for non-experts), demonstrating the potential of AI to detect facial behavioral patterns that differentiate adults with autism from neurotypical peers in the context of a casual, get-to-know-you conversation. Moreover, we show that the prediction errors of AI and humans had little overlap, indicating that the AI can provide complementary information that could prompt and assist clinicians with their evaluations and decision-making. The fact that all the results of this paper are extracted from a brief naturalistic conversation is a significant contribution, as a 3-minute conversation with a non-expert is a highly scalable paradigm, and thus a promising option as a screening or (preliminary) diagnostic procedure. The results of this paper motivate further research efforts to understand the decision mechanisms of AI algorithms, particularly for uncovering subtle behavioral patterns in psychiatric conditions.

2. Participants and Procedure

Forty-four adults participated in the present study (ASD: n=17, NT: n=27, all native and fluent English speakers). Participant groups did not differ significantly on mean chronological age, full-scale IQ estimates (WASI-II) [9], verbal IQ estimates, or sex ratio (Table 1). Participant diagnostic status (ASD or NT) was confirmed as part of this study using the Clinical Best Estimate process [10], informed by the Autism Diagnostic Observation Schedule - 2nd Edition, Module 4 (ADOS-2) [11] and adhering to DSM-5 criteria for ASD [12]. All aspects of the study were approved by the Institutional Review Board of The Children's Hospital of Philadelphia (CHOP). Two participants were excluded from analysis due to their lack of consent for this particular set of experiments or their data being unavailable for processing, yielding a final sample of 42 participants (ASD: N=15, NT: N=27).

Participants underwent a battery of tasks that assessed social communication competence, including a slightly modified version of the Contextual Assessment of Social Skills (CASS) [13]. The CASS is a semi-structured assessment of conversational ability designed to mimic real-life first-time encounters. Participants engaged in two 3-minute face-to-face conversations with two different confederates (research staff, blind to participant diagnostic status and unaware of the dependent variables of interest). In the first conversation (interested condition), the confederate demonstrates social interest by engaging both verbally and non-verbally in the conversation. In the second conversation (bored condition), the confederate indicates boredom and disengagement both verbally (e.g., one-word answers, limited follow-up questions) and physically (e.g., neutral affect, limited eye contact and gestures). All analyses throughout this paper are based on the interested condition only.

During the CASS, participants and confederates were seated facing one another. Audio and video of the CASS were recorded using an in-house device comprising two 1080p HD (30 fps) cameras (Fig. 1), which was placed between the participant and confederate on a floor stand. The two cameras of the device point in opposite directions to allow simultaneous recording of the participant and the confederate. However, the AI analyses in this paper are conducted on the video data of the participant only. In other words, even though the context of the conversation is dyadic, our AI-based analysis is not dyadic, since it discards the information from the confederate and focuses only on the participant. We refer to this type of analysis as monadic analysis.

CASS confederates included 10 undergraduate students or BA-level research assistants (3 males, 7 females, all native English speakers). Confederates were semi-randomly selected, based on availability and clinical judgment. In order to provide opportunities for participants to initiate and develop the conversation, confederates were trained to speak for no more than 50% of the time and to wait 10 s before initiating the conversation. If conversational pauses occurred, confederates were trained to wait 5 s before re-initiating the conversation. Otherwise, confederates were told to simply engage naturally in the conversation.
Table 1
Participant characterization within our sample. Wilcoxon rank sum tests with continuity correction were used for statistical
group comparisons, except for sex ratio where the Chi-squared test was used. One NT participant had missing ADOS-2 scores.
RRB=Repetitive Behaviors and Restricted Interests subscore of the ADOS-2. *Statistically significant difference between
diagnostic groups, p<0.05.
                    Variable               ASD Mean (SD)       NT Mean (SD)      Statistics   p-value
                    Age (years)               26.9 (7.3)         28.1 (8.4)      W = 234       0.923
                    Sex (Male, Female)         15m, 2f            23m, 4f        χ² = 0.08     0.774
                    Full-Scale IQ            102.1 (19.8)        111.7 (9.5)     W = 157       0.080
                    Verbal IQ                112.6 (22.1)       112.4 (11.2)     W = 215       0.736
                    ADOS Total                13.1 (3.0)          1.1 (0.9)      W = 442      < 2e-8*
                    ADOS Social Affect         9.8 (2.3)          1.0 (0.9)      W = 442      < 1e-8*
                    ADOS RRB                   3.3 (1.5)          0.1 (0.3)      W = 441      < 1e-9*




Figure 1: Left: The device used to record the conversation. The device has two cameras, each pointing to one party of the conversation. Right: Example of the videos shown to the human raters. Each video contains synchronized views of the heads/faces of both the participant and the confederate, as recorded by the device on the left. Only the video of the participant's face served as input to the AI pipeline.



Prior to each conversation, study staff provided the following prompt to the participants and confederates before leaving the room: "Thank you both so much for coming in today. Right now, you will have 3 minutes to talk and get to know each other, and then I will come back into the room."

3. Prediction of Autism Diagnosis

3.1. Human Raters

We recruited a total of 19 human raters to view the videos from the sample of N = 42 participants. Eight of the raters were autism clinical experts, doctoral-level clinicians with extensive training at the Center for Autism Research (CAR) of CHOP. The remaining 11 (non-expert) raters had some familiarity with autism but no specialized training and worked at CAR. Most of these non-expert raters were BA-level psychology students learning about autism.

The videos that were shown to the human raters were prepared as follows: First, we cropped the videos of the participant and their corresponding confederate conversation partner so that only the heads and necks were visible. Next, we combined the synchronized videos of the heads/faces of the participant and confederate into a single video file per participant, such that participant and confederate were positioned side by side (Fig. 1, right). The audio was removed in order to allow human raters to focus on the facial behavior, as was the case for the AI algorithm. The videos for all N = 42 participants were presented to human raters in a random order on high-resolution monitors.

Raters were instructed to watch each video just once and to make a decision as to whether the study participant had autism or not. They were told that all participants were either confirmed to have autism through clinical evaluation by a licensed expert, or were recruited specifically as neurotypical controls (i.e., clear cases of individuals without autism). Raters were not allowed to go back and review earlier videos. They were instructed to watch all videos within 1 to 3 viewing sessions, with nearly all being completed in 1 or 2 sessions.

3.2. Computer vision

3.2.1. Quantification of facial behavior

Our goal is to quantify all observable facial behavior of a participant, which includes facial expressions and head movements. We also did not want to limit the analysis to emotion-related expressions (e.g., the six basic emotions), as other kinds of facial movements (e.g., communicative expressions, speech-related mouth movements) are also important for diagnosing autism [14]. Therefore, we quantify behavior using a 3D morphable model (3DMM) [15], as 3DMMs contain expression bases (e.g., [16]) that can quantify any facial movement.
Figure 2: We divide the facial mesh of P points into four groups: (1) brows and forehead; (2) eyes; (3) nose and cheeks; and (4) mouth and chin. Each of the P mesh points is assigned to one of these four groups by first computing the distance of the point to all 51 facial landmarks (iBUG-51 [19]) and then identifying the facial feature (i.e., brow, eye, nose or mouth) corresponding to the closest landmark. The expression basis that we use has a total of 60 components, distributed across these groups as shown in the figure.

Moreover, 3DMMs can simultaneously model facial identity, pose, and expression. This increases the precision of parsing facial expressions and head movements, since the effect of identity (i.e., identity bias [17]) is reduced when modeled and thus explained away. Specifically, we use the 3DI method [18], as it can learn identity from multiple frames and thus model and remove its effect more accurately. Moreover, 3DI can take the parameters of the camera as input, which is critical for increasing the accuracy with which facial expressions and pose are decoupled [19].

A 3DMM method produces a dense mesh of P three-dimensional points X ∈ R^{3×P} to represent the face in a given video frame I (P is 23,660 for the 3DI method). This 3D mesh is a function of the facial pose (i.e., a rotation matrix R ∈ R^{3×3} and a translation vector τ ∈ R^{3×1}), the facial identity of the person X̄, and the facial expression variation in the image ΔX ∈ R^{3×P}:

    X = R(X̄ + ΔX) + T,    (1)

where the columns of the matrix T ∈ R^{3×P} are identically τ. The matrices of interest in the scope of our study are the head rotation matrix R and the expression variation ΔX. 3DMMs represent expression variation as a linear sum, ΔX = Wε, where ε ∈ R^{K×1} is the vector representing the expression. The expression basis W used by the 3DI method is constructed via PCA [16], which limits interpretability, as PCA components are not localized – we cannot associate any PCA component with a specific facial region. To make the results of our study more interpretable, we modified the expression model so that the resultant expression model, W′, contains 60 localized basis components, as shown in Fig. 2. Using this model, we represent the expression variation in the image with the vector ε′ that minimizes the norm ||ΔX − W′ε′||². We ignore the 7 components that correspond to the nose and cheek regions (Fig. 2), and we finally represent the expression variation in a video of T frames with a matrix E of size 53 × T, obtained by horizontally concatenating the expression vectors from all the frames. Finally, using the rotation matrix R estimated at each frame, we compute the yaw, pitch and roll angles per frame, and represent head rotation throughout the video with a matrix Φ of size 3 × T. The facial movement variation and head rotation of a person throughout the video are represented together with a matrix Y of size 56 × T, obtained by stacking E on top of Φ:

    Y = [E; Φ].    (2)

Alternatively, one can consider using the Action Units (AUs) of the Facial Action Coding System instead of the 3DMM-based expression features that we used above. However, our analysis is based on correlation of time series (Section 3.2.2), which requires a representation that provides AU intensity – binary AU labels would be very limiting. Since automated AU detection systems (e.g., OpenFace [20]) provide AU intensity only for a relatively small number of AUs, we preferred to use the 3DMM-based features instead of the AUs. One could also consider adding the AU features to the features Y above, but we refrained from doing so, because the number of our correlation features grows quadratically with the number of rows of Y (Section 3.2.2). This also explains why we refrained from adding the features from the nose and cheek regions, as the potential extra information provided by these regions may not justify the quadratic increase in the dimensionality of the feature space. That said, the utility of all such extra information should be explored in future AI pipelines that can be trained with data from larger samples.
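To make the construction of Y concrete, the following Python sketch illustrates how the localized expression coefficients ε′ can be obtained by least squares and stacked with the head rotation angles as in Eq. (2). This is not the authors' code: delta_X, W_loc and angles are hypothetical arrays standing in for the per-frame 3DI outputs described above, with the expression offsets flattened to 3P-dimensional columns.

    # Illustrative sketch, not the authors' code: localized expression
    # coefficients by least squares, then the behavior matrix Y of Eq. (2).
    # delta_X: (3P, T) expression offsets; W_loc: (3P, 53) localized basis;
    # angles: (3, T) yaw, pitch, roll per frame. All are assumed inputs.
    import numpy as np

    def expression_coefficients(delta_X, W_loc):
        """Per-frame least-squares solution of min ||delta_X - W_loc @ eps||^2."""
        eps, *_ = np.linalg.lstsq(W_loc, delta_X, rcond=None)
        return eps  # (53, T): one 53-dimensional expression vector per frame

    def behavior_matrix(delta_X, W_loc, angles):
        """Stack expression coefficients and head rotation angles into Y (56 x T)."""
        E = expression_coefficients(delta_X, W_loc)  # (53, T)
        Phi = angles                                 # (3, T)
        return np.vstack([E, Phi])                   # (56, T), as in Eq. (2)
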
Table 2
Average prediction accuracy of all human raters, non-expert raters, expert raters and AI.
                              All human raters      Non-expert raters            Expert raters                     AI
                                    80.3%                78.3%                      83.1%                         80.5%


3.2.2. Correlation features

An important aspect of social communication is how different modalities of communicative behavior are integrated and coordinated. For example, the ADOS, the gold-standard clinical assessment for autism diagnosis, includes criteria that evaluate how an individual combines speech with gestures and eye contact with facial expression [14]. Similarly, the coordination of behavior within a communicative modality (e.g., movements across different parts of the face) is important; for example, atypical aspects of facial expressions can be characteristic of autism [21, 22]. Thus, to capture coordination across different types of facial and head movements within a person, we apply windowed cross-correlation [23] on the matrix Y. That is, considering the i-th and j-th rows of Y as two time series, we compute the cross-correlation between the two over time windows of length T_w and a step size of T_w/2 (i.e., consecutive time windows have an overlap of 50%). We then compute the average μ_{i,j} and standard deviation σ_{i,j} of the maximal cross-correlation values (w.r.t. lag) per window. To distinguish the case where, say, a mouth movement is followed by a pose variation from the case where the order is reversed, we allow only forward lags on the second time series in the pair; thus (μ_{i,j}, σ_{i,j}) is in general different from (μ_{j,i}, σ_{j,i}). In sum, since Y has 56 rows, we have 56 × 56 ordered pairs, and with 2 features (i.e., mean and standard deviation) per pair, the total number of features that represent the behavior of a participant is M = 6272.
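A minimal sketch of this feature extraction is given below. It is an illustration rather than the authors' implementation: the maximum lag and the treatment of constant windows are assumptions, and Y is the 56 × T matrix defined above.

    # Illustrative sketch of the windowed cross-correlation features: for every
    # ordered pair of rows of Y, take the maximal normalized correlation over
    # forward lags in each 50%-overlapping window, then keep the mean and the
    # standard deviation across windows (mu_ij, sigma_ij).
    import numpy as np

    def max_forward_xcorr(x, y, max_lag):
        """Maximal correlation of x with y shifted forward by 0..max_lag frames."""
        best = -1.0
        for lag in range(max_lag + 1):
            n = len(x) - lag
            a, b = x[:n], y[lag:lag + n]
            if np.std(a) == 0 or np.std(b) == 0:
                continue  # skip degenerate (constant) windows
            best = max(best, np.corrcoef(a, b)[0, 1])
        return best

    def correlation_features(Y, fps=30, win_sec=2.0, max_lag=15):
        """Return the M = 2 * 56 * 56 = 6272-dimensional feature vector."""
        n_rows, T = Y.shape
        win = int(win_sec * fps)
        step = win // 2  # 50% overlap between consecutive windows
        feats = []
        for i in range(n_rows):
            for j in range(n_rows):
                vals = [max_forward_xcorr(Y[i, s:s + win], Y[j, s:s + win], max_lag)
                        for s in range(0, T - win + 1, step)]
                feats.extend([np.mean(vals), np.std(vals)])  # mu_ij, sigma_ij
        return np.asarray(feats)

With T_w = 2 s at 30 fps, for instance, each window would span 60 frames and consecutive windows would share 30 frames.
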
3.2.3. Classification

We predict the diagnostic group of participants (ASD vs. NT) using a linear SVM classifier with the default value of the SVM cost parameter (i.e., C = 1). We report results based on nested cross-validation, where the only hyperparameter that is optimized is the time window length T_w, and we optimize over the values T_w = 1, 2, 4, 6 seconds. The time window length that was selected in most cross-validation folds was T_w = 2 s.

While more advanced AI models based on deep learning could be used, the sample size is insufficient for reliably training deep learning models from scratch. Moreover, to our knowledge, there is no publicly available pre-trained deep learning model that is directly applicable to our problem, so taking an existing model and re-training only a part of it (e.g., the classification layer) with our data is also not within reach.
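The sketch below shows one way such a nested cross-validation could be implemented, under stated assumptions (scikit-learn; a hypothetical dictionary features_by_window mapping each candidate T_w to an N × 6272 feature matrix built with the previous sketch). It is not the authors' code.

    # Minimal sketch of nested cross-validation: the outer folds estimate
    # accuracy, while the inner loop selects T_w using the training fold only.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    def nested_cv_accuracy(features_by_window, labels, n_outer=10, n_inner=5, seed=0):
        y = np.asarray(labels)
        any_X = next(iter(features_by_window.values()))
        outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed)
        accuracies = []
        for train, test in outer.split(any_X, y):
            # Inner loop: pick the window length with the best training-fold CV score.
            best_tw = max(
                features_by_window,
                key=lambda tw: cross_val_score(
                    SVC(kernel="linear", C=1.0),
                    features_by_window[tw][train], y[train], cv=n_inner).mean())
            clf = SVC(kernel="linear", C=1.0)  # linear SVM with default C = 1
            clf.fit(features_by_window[best_tw][train], y[train])
            accuracies.append(clf.score(features_by_window[best_tw][test], y[test]))
        return float(np.mean(accuracies))

Keeping the selection of T_w inside the inner loop ensures that the reported accuracy is never computed on data that influenced the hyperparameter choice.
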
4. Results and Discussion

Table 2 shows the prediction accuracy of the human raters and the AI method. The results for the AI method are obtained via 10-fold cross-validation (repeated 100 times with shuffling of the participant order). The average accuracy of expert clinicians is slightly higher than that of non-experts. Of note, the average accuracy of all human raters (expert and non-expert) is similar to that of the AI approach. The average positive predictive value, negative predictive value, sensitivity and specificity of the AI model are, respectively, 0.86, 0.79, 0.55 and 0.95.
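For reference, these four summary statistics can be computed from cross-validated predictions as in the small sketch below (assuming scikit-learn; y_true and y_pred are hypothetical label arrays with 1 = ASD and 0 = NT).

    # Small sketch of the summary statistics quoted above.
    from sklearn.metrics import confusion_matrix

    def diagnostic_metrics(y_true, y_pred):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return {
            "PPV": tp / (tp + fp),          # positive predictive value
            "NPV": tn / (tn + fn),          # negative predictive value
            "sensitivity": tp / (tp + fn),  # true positive rate (ASD recall)
            "specificity": tn / (tn + fp),  # true negative rate (NT recall)
        }
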
We next investigate whether the errors of the human raters coincide with the errors of the AI algorithm. Table 3 shows the participants whose diagnoses were inaccurately predicted by most human raters (i.e., average prediction accuracy < 50%), along with the correct diagnosis and the diagnosis predicted by the AI. Results show that four out of these five human mispredictions were correctly predicted by the AI, including the first participant in the list, whose diagnosis was predicted correctly by only 21% of the human raters. In other words, participants that were difficult for most human raters to accurately classify were not particularly difficult for the AI. This suggests that the decision mechanism of the AI is different from that of the humans, and the following results further support this point of view.

Fig. 3 plots the average prediction accuracy of human raters against the average accuracy of the AI algorithm per participant. The correlation between these quantities is not strong (ρ = 0.35) and is mostly driven by the participants that are correctly classified by both humans and the AI (i.e., the top-right points of the plot). For example, if we remove the subjects that are correctly classified by at least 95% of the human raters, the correlation drops to ρ = 0.19. The lack of points in the lower-left quadrant of Fig. 3 supports the conclusion that the diagnoses that were difficult to predict for humans were not typically difficult for the AI, and vice versa.

Figure 3: The average prediction accuracy of human raters against the average prediction accuracy of the AI pipeline, per participant. The average accuracy for the AI in this figure is computed by repeating 5-fold cross-validation 1000 times and averaging over the resulting 1000 predictions per participant.
Table 3
The five participants whose diagnosis (dx) was mispredicted by most human raters (i.e., average prediction accuracy < 50%),
with the corresponding average accuracy by the AI (computed by repeating 5-fold cross-validation 1000 times) and the diagnosis
predicted by AI via leave-one-out cross-validation.

            dx      Average accuracy (humans)      Average accuracy (AI)     predicted dx (AI, leave-one-out CV)
            ASD               21.1%                       88.1%                             ASD
            ASD               31.6%                       77.4%                             ASD
            ASD               42.1%                       82.1%                             ASD
            ASD               47.4%                       89.4%                             ASD
            ASD               47.4%                       12.8%                              NT



This outcome further supports the view that the decision mechanism of the AI is different from that of the humans, which is a desirable outcome if AI is to be used as an assistive technology for human clinical decision-making, since it implies that human decisions can be augmented with the help of AI. For example, in a potential application for autism screening from similar short social videos, humans and AI could simultaneously make predictions, and humans could re-evaluate their decision if it is inconsistent with the decision of the AI algorithm. However, arguably, a scenario of this kind is conceivable only if the AI algorithm produces a semantically interpretable output – that is, if the algorithm lists the detected behavioral patterns that lead to a diagnostic decision of autism vs. NT. Otherwise, without any explanation of the prediction, it would be difficult for a clinician to determine to what degree the result of the AI algorithm should be taken into account.

In order to shed some light on the decision mechanism of the AI, we analyze the features that were dominant in the SVM classifier, i.e., the features that had greater weight. Fig. 4 shows the weights of all the features, and Fig. 5a shows the 10 features that had the greatest (absolute) weight across cross-validation folds, along with their names. While a complete analysis of the semantic interpretation of each feature is a difficult task, we can still gain some insight into the SVM decisions by inspecting these results. First, note that pose-pose features (i.e., features that summarize the correlation between two head rotation angles) have the greatest weight on average (Fig. 4, top), indicating that head movements are important for distinguishing behavioral patterns of autistic vs. NT participants. Moreover, correlation features combining pose and eye movements emerge as important both in Fig. 4 and in Fig. 5a, supporting previous literature suggesting that blinking and nodding are important nonverbal behaviors in conversations [24], and that head and eye movements are indicators of social attention [25]. Second, mouth-related features also emerged as important. For example, six out of the 10 correlation features in Fig. 5a are related to the mouth, with three of them being mouth-mouth pairs.

We next analyze which, if any, of the four feature categories (eyes, brows, mouth, pose) have greater presence among the top k features. Fig. 5b plots the proportion of the eye-, brow-, mouth- and pose-related features in the top-10, top-100 and top-1000 most important features, as well as their proportion in the entire pool of 6272 features. For example, while the baseline rate of pose features is only ~5.3% (i.e., ~5.3% of the entire set of 6272 features are pose-related), the top-10 features contain pose-related features at a ratio of ~13.3% (see the caption of Fig. 5 for how feature categories are counted), indicating that pose features have ~2.5 times more presence in the top-10 features compared to their baseline. Similarly, the baseline rate of mouth-related features is ~25.5%, but ~40% of the top-10 features are related to the mouth, indicating that mouth features also have greater representation in the set of important features compared to their baseline. In sum, our analyses suggest that the AI algorithm places high emphasis on pose- and mouth-related features when classifying between the autism and NT groups. Further analysis to uncover why these features are important is beyond the scope of this study, as it would require more granular expression models (e.g., 3D versions of localized bases [26]); the approach that we designed from an existing model does not allow us to pinpoint the facial movements of interest beyond the level of the partitioned regions in Fig. 2 – for example, we cannot distinguish between parts of the mouth, such as the upper lip or the mouth corner. Still, our analyses allowed a degree of interpretation that corroborates previous findings on the importance of mouth-related movements [2, 4], as well as the central role that head movements have in social orienting, attention and backchannel feedback (e.g., nodding) [27, 28, 24, 25, 29].
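The sketch below shows one plausible way to compute such top-k category ratios from the learned SVM weights. It is an illustration only: weights and pair_regions are hypothetical inputs matching the feature ordering of the earlier sketches, and the exact counting convention used for Fig. 5b may differ.

    # Illustrative sketch of a Fig. 5b-style analysis: count how often each facial
    # region appears among the k features with the largest average absolute SVM
    # weight. weights has length 6272; pair_regions holds one (region_i, region_j)
    # tuple per ordered row pair of Y.
    import numpy as np
    from collections import Counter

    def region_ratios_topk(weights, pair_regions, k):
        top = np.argsort(-np.abs(weights))[:k]     # indices of the top-k features
        counts = Counter()
        for idx in top:
            regions = set(pair_regions[idx // 2])  # two features (mean, std) per pair
            for region in regions:                 # a mouth-pose feature counts for both
                counts[region] += 1
        return {r: counts[r] / k for r in ("brow", "eye", "mouth", "pose")}
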
Figure 4: Top: Average and standard deviation of the correlation feature weights per facial region pair (y-axis: average weight ± st. dev.); e.g., the statistics for eye-pose are computed from the correlation features extracted from these two regions (Section 3.2.2). Bottom: Manhattan plot showing the weights of all correlation features.



5. Conclusions and Future Work

In this paper, we studied the prediction of autism from facial movement behavior during casual conversations. Specifically, we compared the predictive accuracy of expert and non-expert human raters with that of an AI algorithm. Results show that, while both humans and the AI are capable of distinguishing individuals with autism spectrum disorder (ASD) from neurotypical (NT) individuals with high accuracy, their errors do not overlap, suggesting that the decision mechanism of an AI algorithm may be different from that of a human. Thus, AI technologies have the potential to provide complementary information to a clinician and become an assistive tool for decision-making. Arguably, the most immediate application based on our results is a new, semi-automatic screening technology for autism, where an individual is advised to undergo further diagnostic evaluation in the event that a (non-expert) human or the AI model predicts that the individual exhibits autism-specific behavior. However, in a real-life scenario, the problem of interest would be more difficult, as a potential patient may not be NT but may not have ASD either. Thus, future research is needed to identify the performance of humans and AI models in predicting ASD diagnosis from neurodiverse samples.

Our results directly motivate future research in multiple directions. The most pressing direction from the perspective of making AI an effective assistive tool is the examination of the behaviors that lead to a predicted diagnosis. Having interpretable outputs is necessary for using AI technologies in clinics, as clinicians should understand how the AI algorithm makes a prediction before taking this prediction into account. Furthermore, research on younger participants is needed, given that early diagnosis improves access to effective early interventions and thus can improve developmental outcomes. Another future direction is to investigate the benefits of dyadic analysis, where, unlike in our monadic analysis (Section 2), the behavior of the confederate is also taken into account. Finally, user research is necessary to test if and to what degree clinician diagnoses can be improved through the use of AI assistive tools.

Acknowledgments

This work is partially funded by the National Institutes of Health (NIH), Office of the Director (OD), National Institute of Child Health and Human Development (NICHD), and National Institute of Mental Health (NIMH) of the US, under grants R01MH118327, R01MH122599, 5P50HD105354-02 and R21HD102078.
Figure 5: (a) The labels and weights of the top 10 features, along with the standard error (across cross-validation folds). (b) The ratio of each of the four feature categories (brow, eye, mouth, pose) among the top k features (i.e., the k features with the highest average SVM weight), plotted against k. The graphs are computed on the basis of a feature category appearing on either side of a correlation feature; for example, if a correlation feature is extracted from the correlation between a mouth and a pose feature, it is considered to be both a mouth and a pose feature. The rightmost value of each graph shows the baseline rate for each feature category – the ratio of the feature category in the entire set of 6272 features – highlighting the importance of the mouth and pose features, since they appear more frequently in the top-10, top-100 and top-1000 features compared to their baseline rate.



References

[1] M. S. Mast, D. Gatica-Perez, D. Frauendorfer, L. Nguyen, T. Choudhury, Social Sensing for Psychology: Automated Interpersonal Behavior Assessment, Psychological Science 24 (2015) 154–160. doi:10.1177/0963721414560811.
[2] C. J. Zampella, L. Bennetto, J. D. Herrington, Computer Vision Analysis of Reduced Interpersonal Affect Coordination in Youth With Autism Spectrum Disorder, Autism Research 13 (2020) 2133–2142. doi:10.1002/AUR.2334.
[3] C. J. Zampella, E. Sariyanidi, A. G. Hutchinson, G. K. Bartley, R. T. Schultz, B. Tunç, Computational Measurement of Motor Imitation and Imitative Learning Differences in Autism Spectrum Disorder, in: Companion Publication of the International Conference on Multimodal Interaction, Association for Computing Machinery, 2021, pp. 362–370. doi:10.1145/3461615.3485426.
[4] J. Parish-Morris, E. Sariyanidi, C. Zampella, G. K. Bartley, E. Ferguson, A. A. Pallathra, L. Bateman, S. Plate, M. Cola, J. Pandey, E. S. Brodkin, R. T. Schultz, B. Tunc, Oral-Motor and Lexical Diversity During Naturalistic Conversations in Adults with Autism Spectrum Disorder, in: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, Association for Computational Linguistics, Stroudsburg, PA, USA, 2018, pp. 147–157. doi:10.18653/v1/W18-0616.
[5] D. Q. McDonald, C. J. Zampella, E. Sariyanidi, A. Manakiwala, E. DeJardin, J. D. Herrington, R. T. Schultz, B. Tunç, Head Movement Patterns during Face-to-Face Conversations Vary with Age, in: Companion Publication of the International Conference on Multimodal Interaction, 2022.
[6] S. E. Levy, D. S. Mandell, R. T. Schultz, Autism, Lancet 374 (2009) 1627–1638.
[7] D. A. Regier, W. E. Narrow, D. E. Clarke, H. C. Kraemer, S. J. Kuramoto, E. A. Kuhl, D. J. Kupfer, DSM-5 field trials in the United States and Canada, Part II: test-retest reliability of selected categorical diagnoses, The American Journal of Psychiatry 170 (2013) 59–70. doi:10.1176/appi.ajp.2012.12070999.
[8] R. A. J. de Belen, T. Bednarz, A. Sowmya, D. Del Favero, Computer vision in autism spectrum disorder research: a systematic review of published studies from 2009 to 2019, Translational Psychiatry 10 (2020) 1–20.
[9] D. Wechsler, Wechsler Abbreviated Scale of Intelligence - Second Edition (WASI-II), Pearson Clinical, San Antonio, TX, 2011.
[10] C. Lord, E. Petkova, V. Hus, W. Gan, F. Lu, D. M. Martin, O. Ousley, L. Guy, R. Bernier, J. Gerdts, M. Algermissen, A. Whitaker, J. S. Sutcliffe, Z. Warren, A. Klin, C. Saulnier, E. Hanson, R. Hundley, J. Piggot, E. Fombonne, M. Steiman, J. Miles, S. M. Kanne, R. P. Goin-Kochel, S. U. Peters, E. H. Cook, S. Guter, J. Tjernagel, L. A. Green-Snyder, S. Bishop, A. Esler, K. Gotham, R. Luyster, F. Miller, J. Olson, J. Richler, S. Risi, A Multisite Study of the Clinical Diagnosis of Different Autism Spectrum Disorders, Archives of General Psychiatry 69 (2012) 306. doi:10.1001/archgenpsychiatry.2011.148.
[11] C. Lord, M. Rutter, P. S. DiLavore, S. Risi, K. Gotham, S. L. Bishop, Autism Diagnostic Observation Schedule, Second Edition (ADOS-2), Western Psychological Services, Torrance, CA, 2012.
[12] APA, Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5, American Psychiatric Association, Washington, D.C., 2013.
[13] A. B. Ratto, L. Turner-Brown, B. M. Rupp, G. B. Mesibov, D. L. Penn, Development of the Contextual Assessment of Social Skills (CASS): a role play measure of social skill for individuals with high-functioning autism, Journal of Autism and Developmental Disorders 41 (2011) 1277–1286. doi:10.1007/s10803-010-1147-z.
[14] V. Hus, C. Lord, The autism diagnostic observation schedule, module 4: revised algorithm and standardized severity scores, Journal of Autism and Developmental Disorders 44 (2014) 1996–2012.
[15] B. Egger, W. A. Smith, A. Tewari, S. Wuhrer, M. Zollhoefer, T. Beeler, F. Bernard, T. Bolkart, A. Kortylewski, S. Romdhani, et al., 3D morphable face models – past, present, and future, ACM Trans. Graph. 39 (2020) 1–38.
[16] C. Cao, Y. Weng, S. Zhou, Y. Tong, K. Zhou, FaceWarehouse: A 3D facial expression database for visual computing, IEEE Trans. Vis. Comput. Graph. 20 (2013) 413–425.
[17] E. Sariyanidi, H. Gunes, A. Cavallaro, Automatic analysis of facial affect: A survey of registration, representation, and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015) 1113–1133. doi:10.1109/TPAMI.2014.2366127.
[18] E. Sariyanidi, C. J. Zampella, R. T. Schultz, B. Tunc, Inequality-constrained and robust 3D face model fitting, in: Eur. Conf. Comput. Vis., 2020, pp. 433–449.
[19] E. Sariyanidi, C. J. Zampella, R. T. Schultz, B. Tunc, Can facial pose and expression be separated with weak perspective camera?, in: IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 7173–7182.
[20] T. Baltrusaitis, A. Zadeh, Y. C. Lim, L.-P. Morency, OpenFace 2.0: Facial behavior analysis toolkit, in: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), IEEE, 2018, pp. 59–66.
[21] A. Metallinou, R. B. Grossman, S. Narayanan, Quantifying atypicality in affective facial expressions of children with autism spectrum disorders, in: 2013 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2013, pp. 1–6.
[22] T. Guha, Z. Yang, A. Ramakrishna, R. B. Grossman, D. Hedley, S. Lee, S. S. Narayanan, On quantifying facial expression-related atypicality of children with autism spectrum disorder, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 803–807.
[23] S. M. Boker, J. L. Rotondo, M. Xu, K. King, Windowed cross-correlation and peak picking for the analysis of variability in the association between behavioral time series, Psychological Methods 7 (2002) 338–355. doi:10.1037/1082-989X.7.3.338.
[24] A. Gupta, F. L. Strivens, B. Tag, K. Kunze, J. A. Ward, Blink as you sync: Uncovering eye and nod synchrony in conversation using wearable sensing, in: Proceedings of the 23rd International Symposium on Wearable Computers, 2019, pp. 66–71.
[25] T. Foulsham, M. Gejdosova, L. Caunt, Reading and misleading: Changes in head and eye movements reveal attentional orienting in a social context, Vision 3 (2019) 43.
[26] E. Sariyanidi, H. Gunes, A. Cavallaro, Learning bases of activity for facial expression recognition, IEEE Transactions on Image Processing 26 (2017) 1965–1978.
[27] A. Krogsager, N. Segato, M. Rehm, Backchannel head nods in Danish first meeting encounters with a humanoid robot: The role of physical embodiment, Lecture Notes in Computer Science 8511 (2014) 651–662. doi:10.1007/978-3-319-07230-2_62.
[28] K. B. Martin, Z. Hammal, G. Ren, J. F. Cohn, J. Cassell, M. Ogihara, J. C. Britton, A. Gutierrez, D. S. Messinger, Objective measurement of head movement differences in children with and without autism spectrum disorder, Molecular Autism 9 (2018) 14. doi:10.1186/s13229-018-0198-4.
[29] J. Hale, J. A. Ward, F. Buccheri, D. Oliver, A. F. C. Hamilton, Are You on My Wavelength? Interpersonal Coordination in Dyadic Conversations, Journal of Nonverbal Behavior 44 (2020) 63–83. doi:10.1007/s10919-019-00320-3.