Sensor-based Data Fusion for Multimodal Affect Detection in Game-based Learning Environments

Nathan L. Henderson, Jonathan P. Rowe, Bradford W. Mott, and James C. Lester
North Carolina State University
Raleigh, North Carolina, 27695, USA
{nlhender, jprowe, bwmott, lester}@ncsu.edu

ABSTRACT
Affect detection is central to educational data mining because of its potential contribution to predicting learning processes and outcomes. Using multiple modalities has been shown to increase the performance of affect detection. With the rise of sensor-based modalities due to their relatively low cost and high level of flexibility, there has been a marked increase in research efforts pertaining to sensor-based, multimodal systems for affective computing problems. In this paper, we demonstrate the impact that multimodal systems can have when using Microsoft Kinect-based posture data and electrodermal activity data for the analysis of affective states displayed by students engaged with a game-based learning environment. We compare the effectiveness of both support vector machines and deep neural networks as affect classifiers. Additionally, we evaluate different types of data fusion to determine which method for combining the separate modalities yields the highest classification rate. Results indicate that multimodal approaches outperform unimodal baseline classifiers, and feature-level concatenation offers the highest performance among the data fusion techniques.

Keywords
Affect detection, data fusion, deep learning, posture, electrodermal activity, sensor-based learning

1. INTRODUCTION
Affect detection plays a role of growing importance in educational data mining. Accurately detecting affect is vital to understanding learning. While states such as confusion or engagement have been previously correlated with positive learning outcomes [20], other emotions such as boredom have been associated with negative learning outcomes [5]. Similarly, it has been found that affect detection can potentially be used to avoid negative learning outcomes [10].

To more closely model the human cognitive perception and recognition of certain states, affective modeling techniques have expanded to include multiple parallel data streams that are processed simultaneously to form a single affect prediction or approximation; such systems are referred to as "multimodal" [2]. Each data stream, or "modality," can be provided by a wide array of sources ranging from user interaction logs to eye gaze tracking. The processing of multiple independent modalities has been shown to boost affect classifier performance [6] and provide additional insight into the various aspects of a student's interaction with an intelligent tutoring system [11]. Multimodal computing can be highly beneficial to affective computing and educational data mining tasks by providing multiple complementary perspectives on a single subject or event [3].

A common implementation of multimodal affect detection systems utilizes sensors as perceptors to capture physical data and activity. This enables the system to process different types of physiological and positional information that signify different affective states of students. Sensors are commonly deployed within multimodal systems due to their relatively low expense, flexibility with regard to hardware and software requirements, and generalization across a variety of domains. Consequently, sensor-based multimodal systems have been the focus of several research efforts in recent years. Examples of sensor-based modalities include facial expression [1], posture [9], electroencephalogram (EEG) data [24], and electrodermal activity (EDA) [15].
Sensor-based systems are not without inherent challenges [7]. Such systems can be plagued by issues such as calibration problems, mistracking, noise, irregular behavior, inconsistent data transfer, and synchronization issues. Cultural and social behaviors of participants engaged with a sensor-based system can also impact performance. In certain instances, a sensor may malfunction for an extended period of time, resulting in large intervals of missing or invalid data for one or more modalities.

In this paper, we investigate sensor-based multimodal models for affect detection using data from students engaged with a game-based learning environment for emergency medicine. We utilize student posture information captured by a Microsoft Kinect, as well as EDA data captured by an Affectiva Q-Sensor. We compare the performance of support vector machine (SVM) and deep feedforward neural network models as affect classifiers using unimodal data, as well as multimodal data combining the posture and EDA data channels. Finally, we evaluate three different variations of data fusion for the multimodal affect classifiers. Results suggest improved performance of multimodal classifiers as compared to unimodal classifiers trained on separate Kinect and Q-Sensor modalities, and they reveal the impact that different data fusion techniques have on a classifier's accuracy with multimodal datasets.

2. RELATED WORK
Because of their domain independence, sensors have been integrated into a wide selection of multimodal affect detection systems. Pei et al. [23] utilize long short-term memory (LSTM) recurrent neural networks for a binary affect classification task with audio and visual recordings. Nazari et al. [18] implement a multimodal system to detect instances of narcissism in individuals using modalities such as facial expressions, dialogue, vocal acoustics, and behavioral cues. Facial tracking is paired with self-assessment post-tests to detect student engagement with MetaTutor, an adaptive learning system with a curricular focus on the human circulatory system [12]. Additionally, Muller et al. [17] implement a multimodal affect detection system based on human pose, motion tracking, and speech to classify instances of four affective states (anger, happiness, sadness, and surprise) as well as estimate continuous valence and arousal levels. Other sensor-based systems use modalities such as eye gaze to predict learning outcomes using gradient tree boosting algorithms [25].

The use of posture data within affect detection systems has experienced a significant increase in recent years. Low-cost sensors such as the Microsoft Kinect have allowed this modality to be easily integrated into multimodal systems. As shown in [22], Kinect-based posture data can be used by supervised and rule-based algorithms to detect various affective states. Likewise, Grafsgaard et al. [9] use Kinect data to estimate student engagement in computer-based tutoring systems used to teach introductory programming concepts. Shifts in posture have been linked to affective states such as frustration, and thus have been associated with negative learning outcomes [9]. When used in conjunction with other modalities such as facial expression and gesture tracking, posture can also be indicative of engagement, learning, and self-efficacy, as [10] demonstrates through the use of stepwise linear regression techniques. Finally, Kinect data has also been utilized for tasks involving anger detection [21] and biometric identification [24].

In addition to posture and pose-related data, advances in multimodal systems have also extended to biosignal modalities. Examples of such work include [24], where Kinect-based posture data is combined with EEG data through sensor fusion to construct a reliable biometric identification model. Additional low-cost sensors were used to capture EEG, EDA, and electromyography (EMG) data, where results indicate that a multimodal approach outperformed unimodal detectors for arousal and valence levels [8]. Using support vector machines, EEG data as well as eye gaze data was used to predict emotional response to videos [27]. The combination of EDA and EEG data has likewise been applied to the problems of stress detection [15] and frustration detection [7]. EDA has been paired with Kinect-based posture data and webcam-based facial expression data to predict students' instances of frustration and engagement in response to tutor questions in an educational environment [29].
3. DATASET
We investigate different multimodal affect classifiers within the context of a game-based environment for emergency medical training, the Tactical Combat Casualty Care Simulation (TC3Sim). Developed by Engineering and Computer Simulations (ECS), TC3Sim is widely used by the U.S. Army to provide realistic combat medic simulations for soldiers. Students assume the first-person perspective of a combat medic involved in different scenarios alongside a variety of non-player characters (NPCs). During a training scenario, participants are faced with different tasks in real time such as securing the area, applying appropriate medical care to combat victims, and preparing for evacuation. The Kinect-based posture data and the EDA data collected by the Q-Sensor are captured during four different training scenarios: a leg injury scenario, an introductory training scenario, a story-driven narrative scenario, and a patient expiration scenario that portrays a combat victim expiring regardless of the actions of the player. A screenshot of a player's first-person perspective when engaged with TC3Sim is shown in Figure 1.

[Figure 1. TC3Sim game-based learning environment.]

The dataset used in this work was collected from a study with 119 cadets from the United States Military Academy (83% male, 17% female) who participated in different training sessions with TC3Sim. All participants completed the same training materials, which were administered through the Generalized Intelligent Framework for Tutoring (GIFT). GIFT is a service-oriented software framework designed to aid in the development and deployment of computer-based adaptive training systems [28]. Each participant worked individually at a single workstation, and each session lasted approximately one hour. The posture activity for each participant was captured using a Microsoft Kinect for Windows 1.0 sensor. The head and torso positions and movements were captured using skeleton-tracking features contained in the GIFT framework. The data from the Kinect was sampled at a rate of 10-12 Hz. This modality contained timestamped feature vectors containing coordinates of 91 vertices. For this effort, three vertices were selected in accordance with prior research regarding affect detection with Kinect data [9]: top_skull, center_shoulder, and head. 73 additional features were engineered from this modality during the post-processing stage. These features were summary statistics such as the mean, variance, and standard deviation of the different vertices over time windows of 5, 10, and 20 seconds prior to each observation.
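To make the posture feature engineering concrete, the following is a minimal pandas sketch of how such windowed summary statistics could be computed; the DataFrame layout, column names, and helper function are illustrative assumptions rather than the actual GIFT log schema or processing pipeline.

```python
import pandas as pd

def posture_window_features(df: pd.DataFrame, windows=(5, 10, 20)) -> pd.DataFrame:
    """Trailing-window summary statistics for each tracked coordinate.

    `df` is assumed to be indexed by a DatetimeIndex and to contain one column
    per vertex coordinate, e.g. 'top_skull_x', 'center_shoulder_y', 'head_z'
    (illustrative names). Each output row holds the statistics computed over
    the w seconds preceding that timestamp.
    """
    features = {}
    for w in windows:
        rolled = df.rolling(f"{w}s")  # trailing window of w seconds
        for col in df.columns:
            features[f"{col}_mean_{w}s"] = rolled[col].mean()
            features[f"{col}_var_{w}s"] = rolled[col].var()
            features[f"{col}_std_{w}s"] = rolled[col].std()
    return pd.DataFrame(features, index=df.index)
```

At each BROMP observation timestamp, the row of this feature frame closest to the observation could then be joined with the label to form one posture feature vector.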
In addition to the postural modality, electrodermal activity was captured from each user using an Affectiva Q-Sensor bracelet worn by each participant. The Q-Sensor captured each user's skin temperature, electrodermal activity, and the sensor's acceleration vectors as determined by an onboard accelerometer. However, in this study, only the EDA readings were used for affect detectors. In a similar fashion to the posture modality, summary statistics were calculated for the EDA modality, such as the min, max, and variance of the EDA values for each session, as well as the summary statistics across time windows of the prior 5, 10, and 20 seconds. The net changes in the EDA levels across the previous 3 and 20 seconds were also calculated. However, the Q-Sensors experienced highly inconsistent behavior with regard to the data capture, which affected approximately half of the collected data. Additionally, the interaction trace log data from each session was captured by the GIFT framework, but because this work focuses exclusively on sensor-based modalities, this data was not utilized.
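A rough sketch of how the EDA summary and net-change features described above could be derived is shown below, assuming a time-indexed pandas Series of EDA readings; the feature names and the nearest-timestamp lookup are illustrative choices, not the exact pipeline used in the study.

```python
import pandas as pd

def eda_features(eda: pd.Series) -> pd.DataFrame:
    """Windowed summary statistics and net-change features for an EDA series.

    `eda` is assumed to hold EDA readings (microsiemens) with a sorted
    DatetimeIndex; the column names below are illustrative.
    """
    feats = {}
    for w in (5, 10, 20):  # trailing summary windows
        rolled = eda.rolling(f"{w}s")
        feats[f"eda_min_{w}s"] = rolled.min()
        feats[f"eda_max_{w}s"] = rolled.max()
        feats[f"eda_var_{w}s"] = rolled.var()
    for w in (3, 20):      # net change relative to the reading ~w seconds earlier
        past = eda.reindex(eda.index - pd.Timedelta(seconds=w), method="nearest")
        feats[f"eda_delta_{w}s"] = pd.Series(eda.values - past.values, index=eda.index)
    return pd.DataFrame(feats)
```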
To obtain ground truth labels of each student's affective states, two trained observers marked instances of different displays of affect in accordance with the BROMP protocol [19]. BROMP is a quantitative observation protocol for run-time coding of student affect and behavior during classroom-based interactions [19]. During this process, the two observers walked around the perimeter of the classroom and discreetly marked instances of affect in 20-second intervals using a handheld device. Affective states recorded include bored, confused, engaged, frustrated, and surprised. A total of 3,066 separate BROMP observations were collected. Only observations that were collected during students' actual engagement with TC3Sim were kept, and observations where there was disagreement between the two observers were discarded. Agreeing BROMP observations were treated as a single label, and only BROMP observations recorded during the TC3Sim exercise were preserved, excluding instances during pre- and post-test surveys, as well as instances occurring during the instructional PowerPoint presentation. Additional factors contributing to the significant reduction in BROMP observations were the subtlety of instances of affect in the cadets compared to classroom participants, as well as cases of multiple different affective states being observed within the same 20-second window. The resulting dataset contained 755 distinct BROMP observations; the distribution of affect instances is shown in Figure 2. Instances of engagement were by far the most common occurrence, while instances of frustration and surprise were sparse. As stated previously, the Q-Sensor experienced frequent stops in data logging. This issue resulted in 333 BROMP observations containing missing EDA information, while a subset of 422 data samples contained both the posture and the EDA modalities. The posture-based modality did not appear to suffer any data loss from the Kinect sensor.

[Figure 2. Distribution of affect instances from BROMP observations (number of instances per affective state).]

4. METHODOLOGY
The primary goal of this paper is to demonstrate the effectiveness of a multimodal classification system for affect detection using two modalities: Kinect-based posture data and electrodermal activity data. To ensure that both modalities are present in each data sample, any BROMP observation with missing or invalid EDA data was removed from the dataset. Therefore, our classifiers were trained on a dataset of 422 BROMP observations containing correlated posture and EDA data.

4.1 Data Preprocessing
After the aforementioned BROMP observations were removed from the dataset, five separate datasets were created through oversampling of each affective state. The oversampling was accomplished using a minority class cloning technique. Additionally, feature data was scaled using z-score standardization. This method ensures that each attribute of the feature vectors has the same mean and standard deviation but allows for different ranges.
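As a sketch of this preprocessing step, the minority class cloning and z-score standardization could be implemented along the following lines (NumPy-based, with one-vs-rest binary labels per affective state; the function names are illustrative, not the RapidMiner operators used in the study).

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma

def clone_minority(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Oversample the minority class by cloning (sampling rows with replacement)
    until both classes of the binary 0/1 one-vs-rest label are balanced."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    minority = counts.argmin()
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```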
4.2 Feature Selection
Prior to training the classifiers, each dataset underwent forward selection for the purpose of feature selection. This reduces the number of attributes in each dataset through a greedy algorithm that trains a model and selects the best [0, k] features based on each model's Cohen's Kappa [4]. For our work, a k value of 10 was chosen. The model used in feature selection was the sequential minimal optimization (SMO) support vector machine [7]. This polynomial-kernel model was selected due to its linear memory requirements and scalability, as a high number of models were trained to obtain the best features. An attribute was not considered unless it showed positive improvement over the currently selected feature set, and the attribute showing the highest improvement was kept as a selected feature. The feature selection was implemented using RapidMiner 9.0 [16]. This platform was selected due to its convenience as a toolkit for implementing the data processing pipeline, as well as its use in prior work in affect detection [7].
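The greedy forward selection described above could be sketched as follows, with a scikit-learn polynomial-kernel SVM standing in for RapidMiner's SMO operator and cross-validated Cohen's Kappa as the selection criterion; the function and its parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def forward_select(X: np.ndarray, y: np.ndarray, k: int = 10, cv: int = 10) -> list:
    """Add one feature at a time, keeping the candidate that most improves
    cross-validated Cohen's Kappa; stop at k features or when no candidate
    yields a positive improvement over the current subset."""
    selected = []
    best_kappa = 0.0  # a new feature must produce a positive improvement
    while len(selected) < k:
        best_feat, best_feat_kappa = None, best_kappa
        for f in range(X.shape[1]):
            if f in selected:
                continue
            preds = cross_val_predict(SVC(kernel="poly"), X[:, selected + [f]], y, cv=cv)
            kappa = cohen_kappa_score(y, preds)
            if kappa > best_feat_kappa:
                best_feat, best_feat_kappa = f, kappa
        if best_feat is None:  # no remaining feature improves Kappa
            break
        selected.append(best_feat)
        best_kappa = best_feat_kappa
    return selected
```

The loop fits on the order of k times the number of candidate features models, which is consistent with the paper's preference for a classifier with modest memory requirements during selection.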
4.3 Classifiers
Prior work has demonstrated the effectiveness of deep neural networks in affect classification tasks [14]. We utilize the same neural network approach and compare it with SVM models. The SVMs contain a radial kernel function with a convergence epsilon of 0.001 for a maximum of 100,000 iterations. The artificial neural network (ANN) architecture contained feed-forward layers of 800, 800, 500, 100, and 50 nodes, respectively, in addition to a binary classification layer. Each layer's activation function was a Rectified Linear Unit (ReLU). Each network was trained for 10 epochs with the ADADELTA adaptive learning rate [30]. A separate classifier was trained for each affective state, using the selected features of the oversampled data as described in Section 4.1.
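For illustration, the ANN architecture and SVM configuration described above might look roughly like the following in Keras and scikit-learn; this is a sketch under the stated hyperparameters, not the RapidMiner models used in the study, and default Adadelta settings may differ from the original implementation.

```python
import tensorflow as tf
from sklearn.svm import SVC

def build_ann(n_features: int) -> tf.keras.Model:
    """Feed-forward network with 800-800-500-100-50 ReLU layers and a
    sigmoid output for the binary (positive/negative) affective state label."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(800, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(800, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adadelta", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# RBF-kernel SVM with the convergence tolerance and iteration cap given above.
svm = SVC(kernel="rbf", tol=1e-3, max_iter=100_000, probability=True)

# Example usage (one binary classifier per affective state):
# ann = build_ann(n_features=X_train.shape[1])
# ann.fit(X_train, y_train, epochs=10)
```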
4.4 Data Fusion
To evaluate different methods of integrating the two modalities for affect classification, we implement several variations of data fusion techniques. We test two types of data fusion: feature-level fusion ("Early Fusion") and decision-level fusion ("Late Fusion"). Early Fusion involves the concatenation of features from the posture and EDA modalities prior to training the affect classifier. Late Fusion calls for the training of separate classifiers for each modality, and the predicted confidence levels of each binary class (positive or negative label of the affective state) are processed by a voting schematic to produce a singular prediction of the affective state. The voting schematic can be implemented in different ways, such as majority voting, averaging, or weighting [2]. For this paper, we take the highest confidence value across the two classifiers and use the associated class as our final representative prediction. Two different variations of Early Fusion are also evaluated. The first variation, referred to in this paper as "Early Fusion 1", concatenates the features prior to the feature selection process. The other variation, referred to as "Early Fusion 2", performs separate feature selection on the separate modalities, and only the selected features are concatenated prior to training the classifiers. A visual representation of the various data fusion pipelines is shown in Figure 3.

[Figure 3. Data pipelines for the data fusion variations: (A) Early Fusion 1, (B) Early Fusion 2, (C) Late Fusion.]
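A compact sketch of the two fusion strategies is shown below; the helper names are illustrative, and the late-fusion voting follows the max-confidence rule described above, assuming scikit-learn-style classifiers that expose predict_proba.

```python
import numpy as np

def early_fusion(posture_feats: np.ndarray, eda_feats: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate the per-observation feature vectors.
    In Early Fusion 1 this happens before feature selection; in Early Fusion 2
    it is applied to the already-selected features of each modality."""
    return np.hstack([posture_feats, eda_feats])

def late_fusion(posture_clf, eda_clf, posture_feats, eda_feats) -> np.ndarray:
    """Decision-level fusion: the modality whose classifier is more confident
    in its own predicted class determines the final label."""
    p_post = posture_clf.predict_proba(posture_feats)  # shape (n_samples, 2)
    p_eda = eda_clf.predict_proba(eda_feats)
    conf = np.stack([p_post.max(axis=1), p_eda.max(axis=1)], axis=1)
    labels = np.stack([p_post.argmax(axis=1), p_eda.argmax(axis=1)], axis=1)
    winner = conf.argmax(axis=1)  # 0 = posture more confident, 1 = EDA more confident
    return labels[np.arange(len(winner)), winner]
```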
5. RESULTS AND DISCUSSION
The classifiers were evaluated using 10-fold cross validation, with the data split on a per-session basis to ensure that all data from individual training sessions were kept in the same fold. The same batches of data were maintained across all modeling approaches to ensure fair comparisons across classifiers. The unimodal baseline classifiers and Early Fusion pipelines were implemented using RapidMiner 9.0. RapidMiner does not support decision-level fusion, so the Late Fusion pipeline was implemented using Python 3.6, while the classifiers were still implemented in RapidMiner.
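The session-level cross-validation and the evaluation metrics reported below could be reproduced with a grouped splitter along these lines; scikit-learn stands in for the RapidMiner pipeline, and session_ids is an assumed array identifying the training session of each observation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from sklearn.model_selection import GroupKFold

def evaluate(clf, X, y, session_ids, n_splits=10):
    """10-fold cross-validation with folds split on session boundaries, so all
    observations from one training session land in the same fold."""
    kappas, accs, f1s = [], [], []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=session_ids):
        clf.fit(X[train_idx], y[train_idx])
        preds = clf.predict(X[test_idx])
        kappas.append(cohen_kappa_score(y[test_idx], preds))
        accs.append(accuracy_score(y[test_idx], preds))
        f1s.append(f1_score(y[test_idx], preds))
    return np.mean(kappas), np.mean(accs), np.mean(f1s)
```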
Unimodal classifiers were trained on the posture and EDA modalities independently to provide a baseline for the multimodal classifiers' performance. The results for the posture-based and EDA-based unimodal classifiers for each affective state are shown in Tables 1 and 2, respectively. Evaluation metrics include Cohen's Kappa, raw accuracy, and F1 score. Particular focus is given to Cohen's Kappa due to its ability to account for the possibility of correct classification due to random chance.

TABLE 1: Classifier Performance for Affective States (Posture)

Bored
Classifier    Kappa     Accuracy   F1 Score
SVM            0.004     0.607      0.013
ANN           -0.001     0.408      0.530

Confused
Classifier    Kappa     Accuracy   F1 Score
SVM            0.002     0.566      0.040
ANN           -0.003     0.566      0.040

Engaged
Classifier    Kappa     Accuracy   F1 Score
SVM            0.065     0.484      0.523
ANN            0.020     0.484      0.523

Frustrated
Classifier    Kappa     Accuracy   F1 Score
SVM            0.092     0.553      0.441
ANN            0.063     0.501      0.650

Surprised
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.236     0.632      0.040
ANN            0.020     0.270      0.431

TABLE 2: Classifier Performance for Affective States (EDA)

Bored
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.042     0.500      0.286
ANN           -0.047     0.360      0.478

Confused
Classifier    Kappa     Accuracy   F1 Score
SVM            0.033     0.533      0.319
ANN           -0.083     0.387      0.529

Engaged
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.108     0.449      0.437
ANN           -0.013     0.541      0.682

Frustrated
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.046     0.491      0.539
ANN            0.011     0.387      0.641

Surprised
Classifier    Kappa     Accuracy   F1 Score
SVM            0.086     0.607      0.357
ANN           -0.001     0.222      0.478

The posture-based SVM returned the highest Kappa for four of the five affective states, and the EDA-based SVM outperformed the ANN for three of the five affective states. The ANN model performed poorly on a majority of the evaluations, returning a negative Kappa on two of the posture-based states and four of the five EDA-based states, indicating that the ANN is no better than a random classifier for a majority of states.

The posture classifiers performed relatively poorly on bored, confused, and surprised. It is worth noting that surprised contains the lowest number of positive instances within the dataset, which may contribute to the poor performance. Additionally, it is possible that postural behavior may not distinguishably change between positive instances of boredom and confusion, leading to common misclassifications across the two states. The EDA classifiers also performed poorly on the affective states of bored, engaged, and frustrated. However, the EDA modality contains significantly fewer features than the posture modality, and this may have caused additional misclassifications. It is also possible that the EDA modality may not contain enough variance for the classifiers to distinguish between positive and negative instances of affective states. Additionally, the EDA classifiers face the task of distinguishing between different changes in the EDA measurements and determining whether such changes can be attributed to a particular affective state or another cause. However, this proves to be more difficult than for the posture modality due to the singular dimensionality of the EDA channel. To further illustrate this issue, a graphical representation of the change in EDA throughout a session is shown in Figure 4.

[Figure 4. EDA (microsiemens) measured over the duration of a single training session (time in seconds).]

The SVM was selected as the classifier used to implement and evaluate the data fusion methods discussed in Section 4.4. The same feature selection algorithm and classifier configuration were used as in the unimodal approach, and the same session-level groupings were also maintained. The three different data fusion approaches were evaluated for each affective state, and the results for each state are shown in Table 3.

TABLE 3: Performance of Early Fusion 1, Early Fusion 2, and Late Fusion for Affective States using SVM

Bored
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1    -0.082     0.466      0.164
Early Fusion 2     0.041     0.532      0.356
Late Fusion       -0.056     0.583      0.145

Confused
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1     0.049     0.566      0.300
Early Fusion 2    -0.004     0.515      0.321
Late Fusion        0.032     0.597      0.148

Engaged
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1    -0.064     0.446      0.393
Early Fusion 2     0.068     0.542      0.491
Late Fusion       -0.035     0.481      0.459

Frustrated
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1     0.191     0.657      0.656
Early Fusion 2     0.246     0.594      0.483
Late Fusion        0.119     0.568      0.490

Surprised
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1    -0.021     0.590      0.053
Early Fusion 2     0.013     0.682      0.080
Late Fusion       -0.192     0.514      0.124

Early Fusion 2 returned the highest Kappa for bored, engaged, and frustrated. Early Fusion 1 returned the highest value for confused, while the Q-Sensor baseline returned the highest value for surprised.

Of note is the performance of the multimodal classifier on the frustration dataset compared to the other affective states, as the classifier achieved substantially higher Kappa scores. One possible explanation for this behavior is that negative, high-arousal emotions such as frustration or anger have been shown to occur relatively infrequently in students engaged with computer-based learning environments [13]. This may mean that the recorded instances of frustration contain more distinguishable features compared to other common, low-arousal affective states such as boredom and engagement, encouraging higher performance from the frustration-based classifier. Additionally, frustration has been demonstrated to elicit higher EDA levels [26], indicating that the inclusion of the EDA modality with the posture modality provides additional informative features to the feature vectors, contributing to the relatively high performance of the classifier.

One possible reason that Early Fusion 2 is the highest-performing data fusion method is that feature selection is performed separately on each modality prior to training the classifier. This means that if each feature selection algorithm selects up to the k best features, then the combined feature vector can contain up to 2*k features, twice as many features as allowed by Early Fusion 1. This increase in features may boost the performance of the classifier. Late Fusion can also work with 2*k features, but the features are split between the two unimodal classifiers before decision-level fusion. Early Fusion 2 also explores the correlations between inter-modal attributes more deeply than Early Fusion 1. The complex relationships between intra-modal features are explicitly modeled in the feature selection performed on each independent modality, while the correlations between the selected inter-modal features are explored when training the primary classifier following feature selection. However, these two stages are performed simultaneously in Early Fusion 1, and certain complex relationships may not be detected as a result.

Late Fusion provides the ability to "correct" a possibly incorrect prediction across the two modalities. For example, if the postural classifier produces an incorrect prediction of TRUE with a confidence level of 0.6, but the EDA classifier produces an accurate prediction of FALSE with a confidence level of 0.8, then the EDA modality overrides the incorrect prediction because of our selected voting schematic. However, Late Fusion was not the optimal fusion method for any of the affective states, though its effectiveness as a multimodal fusion technique has been demonstrated in other affective computing tasks [14].

Although the multimodal classifiers generally outperformed unimodal classifiers, the highest-performing model returned a relatively low Kappa compared to the performance of a human BROMP labeler (~0.6). However, this threshold can vary depending on the affective state and the intervention associated with each state. For example, identifying instances of engagement can be viewed as a lower priority than identifying instances of frustration or boredom, as these affective states often necessitate a dynamic intervention to improve learning outcomes. However, the Kappas for most of the classifiers fall below 0.05, indicating significant difficulty for several classifiers in achieving consistent performance across multiple affective states.

Previous research efforts have demonstrated that the EDA modality does not have a tightly-coupled relationship with different affective states when compared to other higher-dimensionality modalities such as facial expression and gesture [13]. The results of our work also indicate that the EDA modality resulted in at least one classifier returning a negative Kappa for all five affective states. Possible explanations for this behavior include an inadequate amount of training data, lack of variance or distinguishable trends across the observed time windows, or lack of useful features (17 EDA features vs. 75 posture features). However, our results indicate that the EDA modality does generally improve classifier performance when used in conjunction with the posture modality.

6. CONCLUSION
In this paper, we demonstrate the effectiveness of a multimodal affect detection system based on sensor data capturing a user's posture and EDA while engaged with a game-based learning environment. We show the improvement that multimodal classifiers achieve compared with unimodal classifiers for both modalities. We also demonstrate that SVMs outperform ANNs as unimodal classifiers in this particular domain. Finally, we demonstrate that data fusion is an effective way to combine multiple modalities, either prior to or following classification.

Results suggest several promising directions for future work. To improve model performance on smaller datasets or data containing instances of missing modalities, more sophisticated feature engineering approaches can be evaluated. The evaluation of our data fusion techniques with additional modalities can further indicate the effectiveness of this approach in a variety of multimodal systems. Additional exploration of generalizable multimodal systems should be undertaken to further utilize the flexibility of sensor-based systems. Further evaluation of classification algorithms can also be investigated, in particular algorithms designed for the processing of temporal data, such as recurrent neural networks. The inclusion of additional biosignal modalities such as EEG or EMG data would provide a more in-depth perspective on the effect such modalities have on multimodal affect detection systems. Finally, the integration of multimodal affect detection into a run-time learning environment would enable adaptive pedagogical functionalities that address potentially negative learning outcomes through the use of dynamic interventions and user-tailored feedback based on learners' affective states.

7. ACKNOWLEDGMENTS
We wish to thank Dr. Jeanine DeFalco, Dr. Benjamin Goldberg, and Dr. Keith Brawner of the U.S. Army Combat Capabilities Development Command, Dr. Mike Matthews and COL James Ness of the U.S. Military Academy, Dr. Robert Sottilare of SoarTech, and Dr. Ryan Baker of the University of Pennsylvania for their assistance in facilitating this research. The research was supported by the U.S. Army Research Laboratory under cooperative agreement #W911NF-13-2-0008. Any opinions, findings, and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of the U.S. Army.
8. REFERENCES
[1] Arroyo, I., Cooper, D.G., Burleson, W., Woolf, B.P., Muldner, K. and Christopherson, R. 2009. Emotion sensors go to school. In Proceedings of the 14th International Conference on Artificial Intelligence in Education (2009), 17–24.
[2] Baltrušaitis, T., Ahuja, C. and Morency, L.-P. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 41, 2 (2018), 423–443. DOI: https://doi.org/10.1109/TPAMI.2018.2798607.
[3] Chang, C.M., Su, B.H., Lin, S.C., Li, J.L. and Lee, C.C. 2017. A bootstrapped multi-view weighted kernel fusion framework for cross-corpus integration of multimodal emotion recognition. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII) (2017), 377–382.
[4] Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 20, 1 (1960), 37–46.
[5] Craig, S., Graesser, A., Sullins, J. and Gholson, B. 2005. Affect and learning: An exploratory look into the role of affect in learning with AutoTutor. Journal of Educational Media. 29, 3 (2005), 241–250. DOI: https://doi.org/10.1080/1358165042000283101.
[6] D'Mello, S. and Kory, J. 2012. Consistent but modest: A meta-analysis on unimodal and multimodal affect detection accuracies from 30 studies. In Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI '12) (2012), 31–38. DOI: https://doi.org/10.1145/2388676.2388686.
[7] DeFalco, J.A., Rowe, J.P., Paquette, L., Georgoulas-Sherry, V., Brawner, K., Mott, B.W., Baker, R.S. and Lester, J.C. 2018. Detecting and addressing frustration in a serious game for military training. International Journal of Artificial Intelligence in Education. 28, 2 (2018), 152–193. DOI: https://doi.org/10.1007/s40593-017-0152-1.
[8] Girardi, D., Lanubile, F. and Novielli, N. 2017. Emotion detection using noninvasive low cost sensors. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (2017), 125–130.
[9] Grafsgaard, J., Boyer, K., Wiebe, E. and Lester, J. 2012. Analyzing posture and affect in task-oriented tutoring. In International Conference of the Florida Artificial Intelligence Research Society (2012), 438–443.
[10] Grafsgaard, J.F., Wiggins, J.B., Boyer, K.E., Wiebe, E.N. and Lester, J.C. 2014. Predicting learning and affect from multimodal data streams in task-oriented tutorial dialogue. In Proceedings of the Seventh International Conference on Educational Data Mining (London, UK, 2014), 122–129.
[11] Grafsgaard, J.F., Wiggins, J.B., Vail, A.K., Boyer, K.E., Wiebe, E.N. and Lester, J.C. 2014. The additive value of multimodal features for predicting engagement, frustration, and learning during tutoring. In Proceedings of the Sixteenth ACM International Conference on Multimodal Interaction (2014), 42–49.
[12] Harley, J.M., Bouchet, F. and Azevedo, R. 2013. Aligning and comparing data on emotions experienced during learning with MetaTutor. In International Conference on Artificial Intelligence in Education (2013), 61–70.
[13] Harley, J.M., Bouchet, F., Hussain, M.S., Azevedo, R. and Calvo, R. 2015. A multi-componential analysis of emotions during complex learning with an intelligent multi-agent system. Computers in Human Behavior. 48 (2015), 615–625. DOI: https://doi.org/10.1016/j.chb.2015.02.013.
[14] Henderson, N.L., Rowe, J.P., Mott, B.W., Brawner, K., Baker, R.S. and Lester, J.C. 2019. 4D affect detection: Improving frustration detection in game-based learning with posture-based temporal data fusion. In Proceedings of the 20th International Conference on Artificial Intelligence in Education (2019).
[15] Kalimeri, K. and Saitis, C. 2016. Exploring multimodal biosignal features for stress detection during indoor mobility. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016), 53–60.
[16] Mierswa, I., Wurst, M., Klinkenberg, R. and Scholz, M. 2006. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006), 935–940.
[17] Muller, P.M., Amin, S., Verma, P., Andriluka, M. and Bulling, A. 2015. Emotion recognition from embedded bodily expressions and speech during dyadic interactions. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), 663–669. DOI: https://doi.org/10.1109/ACII.2015.7344640.
[18] Nazari, Z., Lucas, G. and Gratch, J. 2015. Multimodal approach for automatic recognition of machiavellianism. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), 215–221. DOI: https://doi.org/10.1109/ACII.2015.7344574.
[19] Ocumpaugh, J., Baker, R.S. and Rodrigo, M.T. 2015. Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP) 2.0 Technical and Training Manual.
[20] Pardos, Z., Baker, R., Pedro, M.S., Gowda, S.M. and Gowda, S.M. 2014. Affective states and state tests: Investigating how affect and engagement during the school year predict end-of-year learning outcomes. Journal of Learning Analytics. 1, 1 (2014), 107–128. DOI: https://doi.org/10.1145/2460296.2460320.
[21] Patwardhan, A. and Knapp, G. 2017. Aggressive actions and anger detection from multiple modalities using Kinect. CoRR (2017).
[22] Patwardhan, A. and Knapp, G. 2016. Multimodal affect recognition using Kinect. arXiv preprint arXiv:1607.02652 (2016).
[23] Pei, E., Yang, L., Jiang, D. and Sahli, H. 2015. Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), 208–214.
[24] Rahman, W. and Gavrilova, M.L. 2017. Emerging EEG and Kinect face fusion for biometric identification. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI) (2017), 1–8.
[25] Rajendran, R., Carter, K.E. and Levin, D.T. 2018. Predicting learning by analyzing eye-gaze data of reading behavior. International Educational Data Mining Society (2018).
[26] Ramachandran, B.R.N., Pinto, S.A.R., Born, J., Winkler, S. and Ratnam, R. 2017. Measuring neural, physiological and behavioral effects of frustration. In Proceedings of the 16th International Conference on Biomedical Engineering (2017), 43–46.
[27] Soleymani, M., Pantic, M. and Pun, T. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing. 3, 2 (2012), 211–223.
[28] Sottilare, R.A., Baker, R.S., Graesser, A.C. and Lester, J.C. 2018. Special issue on the Generalized Intelligent Framework for Tutoring (GIFT): Creating a stable and flexible platform for innovations in AIED research. International Journal of Artificial Intelligence in Education. 28, 2 (2018), 139–151. DOI: https://doi.org/10.1007/s40593-017-0149-9.
[29] Vail, A.K., Wiggins, J.B., Grafsgaard, J.F., Boyer, K.E., Wiebe, E.N. and Lester, J.C. 2016. The affective impact of tutor questions: Predicting frustration and engagement. International Educational Data Mining Society (2016), 247–254.
[30] Zeiler, M.D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).