Sensor-based Data Fusion for Multimodal Affect Detection in Game-based Learning Environments

Nathan L. Henderson, Jonathan P. Rowe, Bradford W. Mott, and James C. Lester
North Carolina State University
Raleigh, North Carolina, 27695, USA
{nlhender, jprowe, bwmott, lester}@ncsu.edu

ABSTRACT
Affect detection is central to educational data mining because of its potential contribution to predicting learning processes and outcomes. Using multiple modalities has been shown to increase the performance of affect detection. With the rise of sensor-based modalities due to their relatively low cost and high level of flexibility, there has been a marked increase in research efforts pertaining to sensor-based, multimodal systems for affective computing problems. In this paper, we demonstrate the impact that multimodal systems can have when using Microsoft Kinect-based posture data and electrodermal activity data for the analysis of affective states displayed by students engaged with a game-based learning environment. We compare the effectiveness of both support vector machines and deep neural networks as affect classifiers. Additionally, we evaluate different types of data fusion to determine which method for combining the separate modalities yields the highest classification rate. Results indicate that multimodal approaches outperform unimodal baseline classifiers, and feature-level concatenation offers the highest performance among the data fusion techniques.

Keywords
Affect detection, data fusion, deep learning, posture, electrodermal activity, sensor-based learning

1. INTRODUCTION
Affect detection plays a role of growing importance in educational data mining. Accurately detecting affect is vital to understanding learning. While states such as confusion or engagement have been previously correlated with positive learning outcomes [20], other emotions such as boredom have been associated with negative learning outcomes [5]. Similarly, it has been found that affect detection can potentially be used to avoid negative learning outcomes [10].

To more closely model the human cognitive perception and recognition of certain states, affective modeling techniques have expanded to include multiple parallel data streams that are processed simultaneously to form a single affect prediction or approximation; such systems are referred to as "multimodal" [2]. Each data stream, or "modality," can be provided by a wide array of sources ranging from user interaction logs to eye gaze tracking. The processing of multiple independent modalities has been shown to boost affect classifier performance [6] and provide additional insight into the various aspects of a student's interaction with an intelligent tutoring system [11]. Multimodal computing can be highly beneficial to affective computing and educational data mining tasks by providing multiple complementary perspectives on a single subject or event [3].

A common implementation of multimodal affect detection systems utilizes sensors as perceptors to capture physical data and activity. This enables the system to process different types of physiological and positional information that signify different affective states of students. Sensors are commonly deployed within multimodal systems due to their relatively low expense, flexibility with regard to hardware and software requirements, and generalization across a variety of domains. Consequently, sensor-based multimodal systems have been the focus of several research efforts in recent years. Examples of sensor-based modalities include facial expression [1], posture [9], electroencephalogram (EEG) data [24], and electrodermal activity (EDA) [15].
Sensor-based systems are not without inherent challenges [7]. Such systems can be plagued by issues such as calibration problems, mistracking, noise, irregular behavior, inconsistent data transfer, and synchronization issues. Cultural and social behaviors of participants engaged with a sensor-based system can also impact performance. In certain instances, a sensor may malfunction for an extended period of time, resulting in large intervals of missing or invalid data for one or more modalities.

In this paper, we investigate sensor-based multimodal models for affect detection using data from students engaged with a game-based learning environment for emergency medicine. We utilize student posture information captured by a Microsoft Kinect, as well as EDA data captured by an Affectiva Q-Sensor. We compare the performance of support vector machine (SVM) and deep feedforward neural network models as affect classifiers using unimodal data, as well as multimodal data combining the posture and EDA data channels. Finally, we evaluate three different variations of data fusion for the multimodal affect classifiers. Results suggest improved performance of multimodal classifiers as compared to unimodal classifiers trained on separate Kinect and Q-Sensor modalities, and they reveal the impact that different data fusion techniques have on a classifier's accuracy with multimodal datasets.

2. RELATED WORK
Because of their domain independence, sensors have been integrated into a wide selection of multimodal affect detection systems. Pei et al. [23] utilize long short-term memory (LSTM) recurrent neural networks for a binary affect classification task with audio and visual recordings. Nazari et al. [18] implement a multimodal system to detect instances of narcissism in individuals using modalities such as facial expressions, dialogue, vocal acoustics, and behavioral cues. Facial tracking is paired with self-assessment post-tests to detect student engagement with MetaTutor, an adaptive learning system with a curricular focus on the human circulatory system [12]. Additionally, Muller et al. [17] implement a multimodal affect detection system based on human pose, motion tracking, and speech to classify instances of four affective states (anger, happiness, sadness, and surprise) as well as estimate continuous valence and arousal levels. Other sensor-based systems use modalities such as eye gaze to predict learning outcomes using gradient tree boosting algorithms [25].

The use of posture data within affect detection systems has experienced a significant increase in recent years. Low-cost sensors such as the Microsoft Kinect have allowed this modality to be easily integrated into multimodal systems. As shown in [22], Kinect-based posture data can be used by supervised and rule-based algorithms to detect various affective states. Likewise, Grafsgaard et al. [9] use Kinect data to estimate student engagement in computer-based tutoring systems used to teach introductory programming concepts. Shifts in posture have been linked to affective states such as frustration, and thus have been associated with negative learning outcomes [9]. When used in conjunction with other modalities such as facial expression and gesture tracking, posture can also be indicative of engagement, learning, and self-efficacy, as [10] demonstrates through the use of stepwise linear regression techniques. Finally, Kinect data has also been utilized for tasks involving anger detection [21] and biometric identification [24].

In addition to posture and pose-related data, advances in multimodal systems have also extended to biosignal modalities. Examples of such work include [24], where Kinect-based posture data is combined with EEG data through sensor fusion to construct a reliable biometric identification model. Additional low-cost sensors were used to capture EEG, EDA, and electromyography (EMG) data, where results indicate that a multimodal approach outperformed unimodal detectors for arousal and valence levels [8]. Using support vector machines, EEG data as well as eye gaze data was used to predict emotional response to videos [27]. The combination of EDA and EEG data has likewise been applied to the problems of stress detection [15] and frustration detection [7]. EDA has been paired with Kinect-based posture data and webcam-based facial expression data to predict students' instances of frustration and engagement in response to tutor questions in an educational environment [29].
3. DATASET
We investigate different multimodal affect classifiers within the context of a game-based environment for emergency medical training, the Tactical Combat Casualty Care Simulation (TC3Sim). Developed by Engineering and Computer Simulations (ECS), TC3Sim is widely used by the U.S. Army to provide realistic combat medic simulations for soldiers. Students assume the first-person perspective of a combat medic involved in different scenarios alongside a variety of non-player characters (NPCs). During a training scenario, participants are faced with different tasks in real time such as securing the area, applying appropriate medical care to combat victims, and preparing for evacuation. The Kinect-based posture data and the EDA data collected by the Q-Sensor are captured during four different training scenarios: a leg injury scenario, an introductory training scenario, a story-driven narrative scenario, and a patient expiration scenario that portrays a combat victim expiring regardless of the actions of the player. A screenshot of a player's first-person perspective when engaged with TC3Sim is shown in Figure 1.

[Figure 1. TC3Sim game-based learning environment.]

The dataset used in this work was collected from a study with 119 cadets from the United States Military Academy (83% male, 17% female) who participated in different training sessions with TC3Sim. All participants completed the same training materials, which were administered through the Generalized Intelligent Framework for Tutoring (GIFT). GIFT is a service-oriented software framework designed to aid in the development and deployment of computer-based adaptive training systems [28]. Each participant worked individually at a single workstation, and each session lasted approximately one hour. The posture activity for each participant was captured using a Microsoft Kinect for Windows 1.0 sensor. The head and torso positions and movements were captured using skeleton-tracking features contained in the GIFT framework. The data from the Kinect was sampled at a rate of 10-12 Hz. This modality contained timestamped feature vectors containing coordinates of 91 vertices. For this effort, three vertices were selected in accordance with prior research regarding affect detection with Kinect data [9]: top_skull, center_shoulder, and head. 73 additional features were engineered from this modality during the post-processing stage. These features were summary statistics such as the mean, variance, and standard deviation of the different vertices over time windows of 5, 10, and 20 seconds prior to each observation.
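To make the posture feature engineering concrete, the following is a minimal pandas sketch of how such windowed summary statistics could be computed; the DataFrame layout, column names, and helper function are illustrative assumptions rather than the actual GIFT log schema or processing pipeline.

```python
import pandas as pd

def posture_window_features(df: pd.DataFrame, windows=(5, 10, 20)) -> pd.DataFrame:
    """Trailing-window summary statistics for each tracked coordinate.

    `df` is assumed to be indexed by a DatetimeIndex and to contain one column
    per vertex coordinate, e.g. 'top_skull_x', 'center_shoulder_y', 'head_z'
    (illustrative names). Each output row holds the statistics computed over
    the w seconds preceding that timestamp.
    """
    features = {}
    for w in windows:
        rolled = df.rolling(f"{w}s")  # trailing window of w seconds
        for col in df.columns:
            features[f"{col}_mean_{w}s"] = rolled[col].mean()
            features[f"{col}_var_{w}s"] = rolled[col].var()
            features[f"{col}_std_{w}s"] = rolled[col].std()
    return pd.DataFrame(features, index=df.index)
```

At each BROMP observation timestamp, the row of this feature frame closest to the observation could then be joined with the label to form one posture feature vector.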
In addition to the postural modality, electrodermal activity was captured from each user using an Affectiva Q-Sensor bracelet worn by each participant. The Q-Sensor captured each user's skin temperature, electrodermal activity, and the sensor's acceleration vectors as determined by an onboard accelerometer. However, in this study, only the EDA readings were used for affect detectors. In a similar fashion to the posture modality, summary statistics were calculated for the EDA modality, such as the min, max, and variance of the EDA values for each session, as well as the summary statistics across time windows of the prior 5, 10, and 20 seconds. The net changes in the EDA levels across the previous 3 and 20 seconds were also calculated. However, the Q-Sensors experienced highly inconsistent behavior with regard to the data capture, which affected approximately half of the collected data. Additionally, the interaction trace log data from each session was captured by the GIFT framework, but because this work focuses exclusively on sensor-based modalities, this data was not utilized.
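A rough sketch of how the EDA summary and net-change features described above could be derived is shown below, assuming a time-indexed pandas Series of EDA readings; the feature names and the nearest-timestamp lookup are illustrative choices, not the exact pipeline used in the study.

```python
import pandas as pd

def eda_features(eda: pd.Series) -> pd.DataFrame:
    """Windowed summary statistics and net-change features for an EDA series.

    `eda` is assumed to hold EDA readings (microsiemens) with a sorted
    DatetimeIndex; the column names below are illustrative.
    """
    feats = {}
    for w in (5, 10, 20):  # trailing summary windows
        rolled = eda.rolling(f"{w}s")
        feats[f"eda_min_{w}s"] = rolled.min()
        feats[f"eda_max_{w}s"] = rolled.max()
        feats[f"eda_var_{w}s"] = rolled.var()
    for w in (3, 20):      # net change relative to the reading ~w seconds earlier
        past = eda.reindex(eda.index - pd.Timedelta(seconds=w), method="nearest")
        feats[f"eda_delta_{w}s"] = pd.Series(eda.values - past.values, index=eda.index)
    return pd.DataFrame(feats)
```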
To obtain ground truth labels of each student's affective states, two trained observers marked instances of different displays of affect in accordance with the BROMP protocol [19]. BROMP is a quantitative observation protocol for run-time coding of student affect and behavior during classroom-based interactions [19]. During this process, the two observers walked around the perimeter of the classroom and discreetly marked instances of affect in 20-second intervals using a handheld device. Affective states recorded include bored, confused, engaged, frustrated, and surprised. A total of 3,066 separate BROMP observations were collected. Only observations that were collected during students' actual engagement with TC3Sim were kept, and observations where there was disagreement between the two observers were discarded. Agreeing BROMP observations were treated as a single label, and only BROMP observations recorded during the TC3Sim exercise were preserved, excluding instances during pre- and post-test surveys, as well as instances occurring during the instructional PowerPoint presentation. Additional factors contributing to the significant reduction in BROMP observations were the subtlety of instances of affect in the cadets compared to classroom participants, as well as cases of multiple different affective states being observed within the same 20-second window. The resulting dataset contained 755 distinct BROMP observations; the distribution of affect instances is shown in Figure 2. Instances of engagement were by far the most common occurrence, while instances of frustration and surprise were sparse. As stated previously, the Q-Sensor experienced frequent stops in data logging. This issue resulted in 333 BROMP observations containing missing EDA information, while a subset of 422 data samples contained both the posture and the EDA modalities. The posture-based modality did not appear to suffer any data loss from the Kinect sensor.

[Figure 2. Distribution of affect instances from BROMP observations (number of instances per affective state).]

4. METHODOLOGY
The primary goal of this paper is to demonstrate the effectiveness of a multimodal classification system for affect detection using two modalities: Kinect-based posture data and electrodermal activity data. To ensure that both modalities are present in each data sample, any BROMP observation with missing or invalid EDA data was removed from the dataset. Therefore, our classifiers were trained on a dataset of 422 BROMP observations containing correlated posture and EDA data.

4.1 Data Preprocessing
After the aforementioned BROMP observations were removed from the dataset, five separate datasets were created through oversampling of each affective state. The oversampling was accomplished using a minority class cloning technique. Additionally, feature data was scaled using z-score standardization. This method ensures that each attribute of the feature vectors has the same mean and standard deviation but allows for different ranges.
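As a sketch of this preprocessing step, the minority class cloning and z-score standardization could be implemented along the following lines (NumPy-based, with one-vs-rest binary labels per affective state; the function names are illustrative, not the RapidMiner operators used in the study).

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma

def clone_minority(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Oversample the minority class by cloning (sampling rows with replacement)
    until both classes of the binary 0/1 one-vs-rest label are balanced."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    minority = counts.argmin()
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```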
4.2 Feature Selection
Prior to training the classifiers, each dataset underwent forward selection for the purpose of feature selection. This reduces the number of attributes in each dataset through a greedy algorithm that trains a model and selects the best [0, k] features based on each model's Cohen's Kappa [4]. For our work, a k value of 10 was chosen. The model used in feature selection was the sequential minimal optimization (SMO) support vector machine [7]. This polynomial-kernel model was selected due to its linear memory requirements and scalability, as a high number of models were trained to obtain the best features. An attribute was not considered unless it showed positive improvement over the currently selected feature set, and the attribute showing the highest improvement was kept as a selected feature. The feature selection was implemented using RapidMiner 9.0 [16]. This platform was selected due to its convenience as a toolkit for implementing the data processing pipeline, as well as its use in prior work in affect detection [7].
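The greedy forward selection described above could be sketched as follows, with a scikit-learn polynomial-kernel SVM standing in for RapidMiner's SMO operator and cross-validated Cohen's Kappa as the selection criterion; the function and its parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def forward_select(X: np.ndarray, y: np.ndarray, k: int = 10, cv: int = 10) -> list:
    """Add one feature at a time, keeping the candidate that most improves
    cross-validated Cohen's Kappa; stop at k features or when no candidate
    yields a positive improvement over the current subset."""
    selected = []
    best_kappa = 0.0  # a new feature must produce a positive improvement
    while len(selected) < k:
        best_feat, best_feat_kappa = None, best_kappa
        for f in range(X.shape[1]):
            if f in selected:
                continue
            preds = cross_val_predict(SVC(kernel="poly"), X[:, selected + [f]], y, cv=cv)
            kappa = cohen_kappa_score(y, preds)
            if kappa > best_feat_kappa:
                best_feat, best_feat_kappa = f, kappa
        if best_feat is None:  # no remaining feature improves Kappa
            break
        selected.append(best_feat)
        best_kappa = best_feat_kappa
    return selected
```

The loop fits on the order of k times the number of candidate features models, which is consistent with the paper's preference for a classifier with modest memory requirements during selection.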
4.3 Classifiers
Prior work has demonstrated the effectiveness of deep neural networks in affect classification tasks [14]. We utilize the same neural network approach and compare it with SVM models. The SVMs contain a radial kernel function with a convergence epsilon of 0.001 for a maximum of 100,000 iterations. The artificial neural network (ANN) architecture contained feed-forward layers of 800, 800, 500, 100, and 50 nodes, respectively, in addition to a binary classification layer. Each layer's activation function was a Rectified Linear Unit (ReLU). Each network was trained for 10 epochs with the ADADELTA adaptive learning rate [30]. A separate classifier was trained for each affective state, using the selected features of the oversampled data as described in Section 4.1.
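For illustration, the ANN architecture and SVM configuration described above might look roughly like the following in Keras and scikit-learn; this is a sketch under the stated hyperparameters, not the RapidMiner models used in the study, and default Adadelta settings may differ from the original implementation.

```python
import tensorflow as tf
from sklearn.svm import SVC

def build_ann(n_features: int) -> tf.keras.Model:
    """Feed-forward network with 800-800-500-100-50 ReLU layers and a
    sigmoid output for the binary (positive/negative) affective state label."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(800, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(800, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adadelta", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# RBF-kernel SVM with the convergence tolerance and iteration cap given above.
svm = SVC(kernel="rbf", tol=1e-3, max_iter=100_000, probability=True)

# Example usage (one binary classifier per affective state):
# ann = build_ann(n_features=X_train.shape[1])
# ann.fit(X_train, y_train, epochs=10)
```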
4.4 Data Fusion
To evaluate different methods of integrating the two modalities for affect classification, we implement several variations of data fusion techniques. We test two types of data fusion: feature-level fusion ("Early Fusion") and decision-level fusion ("Late Fusion"). Early Fusion involves the concatenation of features from the posture and EDA modalities prior to training the affect classifier. Late Fusion calls for the training of separate classifiers for each modality, and the predicted confidence levels of each binary class (positive or negative label of the affective state) are processed by a voting schematic to produce a singular prediction of the affective state. The voting schematic can be implemented in different ways, such as majority voting, averaging, or weighting [2]. For this paper, we take the highest confidence value across the two classifiers and use the associated class as our final representative prediction. Two different variations of Early Fusion are also evaluated. The first variation, referred to in this paper as "Early Fusion 1", concatenates the features prior to the feature selection process. The other variation, referred to as "Early Fusion 2", performs separate feature selection on the separate modalities, and only the selected features are concatenated prior to training the classifiers. A visual representation of the various data fusion pipelines is shown in Figure 3.

[Figure 3. Data pipelines for the data fusion variations: (A) Early Fusion 1, (B) Early Fusion 2, (C) Late Fusion.]
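A compact sketch of the two fusion strategies is shown below; the helper names are illustrative, and the late-fusion voting follows the max-confidence rule described above, assuming scikit-learn-style classifiers that expose predict_proba.

```python
import numpy as np

def early_fusion(posture_feats: np.ndarray, eda_feats: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate the per-observation feature vectors.
    In Early Fusion 1 this happens before feature selection; in Early Fusion 2
    it is applied to the already-selected features of each modality."""
    return np.hstack([posture_feats, eda_feats])

def late_fusion(posture_clf, eda_clf, posture_feats, eda_feats) -> np.ndarray:
    """Decision-level fusion: the modality whose classifier is more confident
    in its own predicted class determines the final label."""
    p_post = posture_clf.predict_proba(posture_feats)  # shape (n_samples, 2)
    p_eda = eda_clf.predict_proba(eda_feats)
    conf = np.stack([p_post.max(axis=1), p_eda.max(axis=1)], axis=1)
    labels = np.stack([p_post.argmax(axis=1), p_eda.argmax(axis=1)], axis=1)
    winner = conf.argmax(axis=1)  # 0 = posture more confident, 1 = EDA more confident
    return labels[np.arange(len(winner)), winner]
```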
5. RESULTS AND DISCUSSION
The classifiers were evaluated using 10-fold cross validation, with the data split on a per-session basis to ensure that all data from individual training sessions were kept in the same fold. The same batches of data were maintained across all modeling approaches to ensure fair comparisons across classifiers. The unimodal baseline classifiers and Early Fusion pipelines were implemented using RapidMiner 9.0. RapidMiner does not support decision-level fusion, so the Late Fusion pipeline was implemented using Python 3.6, while the classifiers were still implemented in RapidMiner.
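The session-level cross-validation and the evaluation metrics reported below could be reproduced with a grouped splitter along these lines; scikit-learn stands in for the RapidMiner pipeline, and session_ids is an assumed array identifying the training session of each observation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from sklearn.model_selection import GroupKFold

def evaluate(clf, X, y, session_ids, n_splits=10):
    """10-fold cross-validation with folds split on session boundaries, so all
    observations from one training session land in the same fold."""
    kappas, accs, f1s = [], [], []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=session_ids):
        clf.fit(X[train_idx], y[train_idx])
        preds = clf.predict(X[test_idx])
        kappas.append(cohen_kappa_score(y[test_idx], preds))
        accs.append(accuracy_score(y[test_idx], preds))
        f1s.append(f1_score(y[test_idx], preds))
    return np.mean(kappas), np.mean(accs), np.mean(f1s)
```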
Unimodal classifiers were trained on the posture and EDA modalities independently to provide a baseline for the multimodal classifiers' performance. The results for the posture-based and EDA-based unimodal classifiers for each affective state are shown in Tables 1 and 2, respectively. Evaluation metrics include Cohen's Kappa, raw accuracy, and F1 score. Particular focus is given to Cohen's Kappa due to its ability to account for the possibility of correct classification due to random chance.

TABLE 1: Classifier Performance for Affective States (Posture)

Bored
Classifier    Kappa     Accuracy   F1 Score
SVM            0.004     0.607      0.013
ANN           -0.001     0.408      0.530

Confused
Classifier    Kappa     Accuracy   F1 Score
SVM            0.002     0.566      0.040
ANN           -0.003     0.566      0.040

Engaged
Classifier    Kappa     Accuracy   F1 Score
SVM            0.065     0.484      0.523
ANN            0.020     0.484      0.523

Frustrated
Classifier    Kappa     Accuracy   F1 Score
SVM            0.092     0.553      0.441
ANN            0.063     0.501      0.650

Surprised
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.236     0.632      0.040
ANN            0.020     0.270      0.431

TABLE 2: Classifier Performance for Affective States (EDA)

Bored
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.042     0.500      0.286
ANN           -0.047     0.360      0.478

Confused
Classifier    Kappa     Accuracy   F1 Score
SVM            0.033     0.533      0.319
ANN           -0.083     0.387      0.529

Engaged
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.108     0.449      0.437
ANN           -0.013     0.541      0.682

Frustrated
Classifier    Kappa     Accuracy   F1 Score
SVM           -0.046     0.491      0.539
ANN            0.011     0.387      0.641

Surprised
Classifier    Kappa     Accuracy   F1 Score
SVM            0.086     0.607      0.357
ANN           -0.001     0.222      0.478

The posture-based SVM returned the highest Kappa for four of the five affective states, and the EDA-based SVM outperformed the ANN for three of the five affective states. The ANN model performed poorly on a majority of the evaluations, returning a negative Kappa on two of the posture-based states and four of the five EDA-based states, indicating that the ANN is no better than a random classifier for a majority of states.

The posture classifiers performed relatively poorly on bored, confused, and surprised. It is worth noting that surprised contains the lowest number of positive instances within the dataset, which may contribute to the poor performance. Additionally, it is possible that postural behavior may not distinguishably change between positive instances of boredom and confusion, leading to common misclassifications across the two states. The EDA classifiers also performed poorly on the affective states of bored, engaged, and frustrated. However, the EDA modality contains significantly fewer features than the posture modality, and this may have caused additional misclassifications. It is also possible that the EDA modality may not contain enough variance for the classifiers to distinguish between positive and negative instances of affective states. Additionally, the EDA classifiers face the task of distinguishing between different changes in the EDA measurements and determining whether such changes can be attributed to a particular affective state or another cause. However, this proves to be more difficult than for the posture modality due to the singular dimensionality of the EDA channel. To further illustrate this issue, a graphical representation of the change in EDA throughout a session is shown in Figure 4.

[Figure 4. EDA (microsiemens) measured over the duration of a single training session (time in seconds).]

The SVM was selected as the classifier used to implement and evaluate the data fusion methods discussed in Section 4.4. The same feature selection algorithm and classifier configuration were used as in the unimodal approach, and the same session-level groupings were also maintained. The three different data fusion approaches were evaluated for each affective state, and the results for each state are shown in Table 3.

TABLE 3: Performance of Early Fusion 1, Early Fusion 2, and Late Fusion for Affective States using SVM

Bored
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1    -0.082     0.466      0.164
Early Fusion 2     0.041     0.532      0.356
Late Fusion       -0.056     0.583      0.145

Confused
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1     0.049     0.566      0.300
Early Fusion 2    -0.004     0.515      0.321
Late Fusion        0.032     0.597      0.148

Engaged
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1    -0.064     0.446      0.393
Early Fusion 2     0.068     0.542      0.491
Late Fusion       -0.035     0.481      0.459

Frustrated
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1     0.191     0.657      0.656
Early Fusion 2     0.246     0.594      0.483
Late Fusion        0.119     0.568      0.490

Surprised
Fusion Method     Kappa     Accuracy   F1 Score
Early Fusion 1    -0.021     0.590      0.053
Early Fusion 2     0.013     0.682      0.080
Late Fusion       -0.192     0.514      0.124

Early Fusion 2 returned the highest Kappa for bored, engaged, and frustrated. Early Fusion 1 returned the highest value for confused, while the Q-Sensor baseline returned the highest value for surprised.

Of note is the performance of the multimodal classifier on the frustration dataset compared to the other affective states, as the classifier achieved substantially higher Kappa scores. One possible explanation for this behavior is that negative, high-arousal emotions such as frustration or anger have been shown to occur relatively infrequently in students engaged with computer-based learning environments [13]. This may mean that the recorded instances of frustration contain more distinguishable features compared to other common, low-arousal affective states such as boredom and engagement, encouraging higher performance from the frustration-based classifier. Additionally, frustration has been demonstrated to elicit higher EDA levels [26], indicating that the inclusion of the EDA modality with the posture modality provides additional informative features to the feature vectors, contributing to the relatively high performance of the classifier.

One possible reason that Early Fusion 2 is the highest-performing data fusion method is that feature selection is performed separately on each modality prior to training the classifier. This means that if each feature selection algorithm selects up to the k best features, then the combined feature vector can contain up to 2*k features, twice as many features as allowed by Early Fusion 1. This increase in features may boost the performance of the classifier. Late Fusion can also work with 2*k features, but the features are split between the two unimodal classifiers before decision-level fusion. Early Fusion 2 also explores the correlations between inter-modal attributes more deeply than Early Fusion 1. The complex relationships between intra-modal features are explicitly modeled in the feature selection performed on each independent modality, while the correlations between the selected inter-modal features are explored when training the primary classifier following feature selection. However, these two stages are performed simultaneously in Early Fusion 1, and certain complex relationships may not be detected as a result.

Late Fusion provides the ability to "correct" a possibly incorrect prediction across the two modalities. For example, if the postural classifier produces an incorrect prediction of TRUE with a confidence level of 0.6, but the EDA classifier produces an accurate prediction of FALSE with a confidence level of 0.8, then the EDA modality overrides the incorrect prediction because of our selected voting schematic. However, Late Fusion was not the optimal fusion method for any of the affective states, though its effectiveness as a multimodal fusion technique has been demonstrated in other affective computing tasks [14].

Although the multimodal classifiers generally outperformed unimodal classifiers, the highest-performing model returned a relatively low Kappa compared to the performance of a human BROMP labeler (~0.6). However, this threshold can vary depending on the affective state and the intervention associated with each state. For example, identifying instances of engagement can be viewed as a lower priority than identifying instances of frustration or boredom, as these affective states often necessitate a dynamic intervention to improve learning outcomes. However, the Kappas for most of the classifiers fall below 0.05, indicating significant difficulty for several classifiers in achieving consistent performance across multiple affective states.

Previous research efforts have demonstrated that the EDA modality does not have a tightly-coupled relationship with different affective states when compared to other higher-dimensionality modalities such as facial expression and gesture [13]. The results of our work also indicate that the EDA modality resulted in at least one classifier returning a negative Kappa for all five affective states. Possible explanations for this behavior include an inadequate amount of training data, lack of variance or distinguishable trends across the observed time windows, or lack of useful features (17 EDA features vs. 75 posture features). However, our results indicate that the EDA modality does generally improve classifier performance when used in conjunction with the posture modality.

6. CONCLUSION
In this paper, we demonstrate the effectiveness of a multimodal affect detection system based on sensor data capturing a user's posture and EDA while engaged with a game-based learning environment. We show the improvement that multimodal classifiers achieve compared with unimodal classifiers for both modalities. We also demonstrate that SVMs outperform ANNs as unimodal classifiers in this particular domain. Finally, we demonstrate that data fusion is an effective way to combine multiple modalities, either prior to or following classification.

Results suggest several promising directions for future work. To improve model performance on smaller datasets or data containing instances of missing modalities, more sophisticated feature engineering approaches can be evaluated. The evaluation of our data fusion techniques with additional modalities can further indicate the effectiveness of this approach in a variety of multimodal systems. Additional exploration of generalizable multimodal systems should be undertaken to further utilize the flexibility of sensor-based systems. Further evaluation of classification algorithms can also be investigated, in particular algorithms designed for the processing of temporal data, such as recurrent neural networks. The inclusion of additional biosignal modalities such as EEG or EMG data would provide a more in-depth perspective on the effect such modalities have on multimodal affect detection systems. Finally, the integration of multimodal affect detection into a run-time learning environment would enable adaptive pedagogical functionalities that address potentially negative learning outcomes through the use of dynamic interventions and user-tailored feedback based on learners' affective states.

7. ACKNOWLEDGMENTS
We wish to thank Dr. Jeanine DeFalco, Dr. Benjamin Goldberg, and Dr. Keith Brawner of the U.S. Army Combat Capabilities Development Command, Dr. Mike Matthews and COL James Ness of the U.S. Military Academy, Dr. Robert Sottilare of SoarTech, and Dr. Ryan Baker of the University of Pennsylvania for their assistance in facilitating this research. The research was supported by the U.S. Army Research Laboratory under cooperative agreement #W911NF-13-2-0008. Any opinions, findings, and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of the U.S. Army.
8. REFERENCES
[1] Arroyo, I., Cooper, D.G., Burleson, W., Woolf, B.P., Muldner, K. and Christopherson, R. 2009. Emotion sensors go to school. In Proceedings of the 14th International Conference on Artificial Intelligence in Education (2009), 17–24.
[2] Baltrušaitis, T., Ahuja, C. and Morency, L.-P. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence. 41, 2 (2018), 423–443. DOI: https://doi.org/10.1109/TPAMI.2018.2798607.
[3] Chang, C.M., Su, B.H., Lin, S.C., Li, J.L. and Lee, C.C. 2017. A bootstrapped multi-view weighted kernel fusion framework for cross-corpus integration of multimodal emotion recognition. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII) (2017), 377–382.
[4] Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 20, 1 (1960), 37–46.
[5] Craig, S., Graesser, A., Sullins, J. and Gholson, B. 2005. Affect and learning: An exploratory look into the role of affect in learning with AutoTutor. Journal of Educational Media. 29, 3 (2005), 241–250. DOI: https://doi.org/10.1080/1358165042000283101.
[6] D'Mello, S. and Kory, J. 2012. Consistent but modest: A meta-analysis on unimodal and multimodal affect detection accuracies from 30 studies. In Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI '12) (2012), 31–38. DOI: https://doi.org/10.1145/2388676.2388686.
[7] DeFalco, J.A., Rowe, J.P., Paquette, L., Georgoulas-Sherry, V., Brawner, K., Mott, B.W., Baker, R.S. and Lester, J.C. 2018. Detecting and addressing frustration in a serious game for military training. International Journal of Artificial Intelligence in Education. 28, 2 (2018), 152–193. DOI: https://doi.org/10.1007/s40593-017-0152-1.
[8] Girardi, D., Lanubile, F. and Novielli, N. 2017. Emotion detection using noninvasive low cost sensors. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (2017), 125–130.
[9] Grafsgaard, J., Boyer, K., Wiebe, E. and Lester, J. 2012. Analyzing posture and affect in task-oriented tutoring. In International Conference of the Florida Artificial Intelligence Research Society (2012), 438–443.
[10] Grafsgaard, J.F., Wiggins, J.B., Boyer, K.E., Wiebe, E.N. and Lester, J.C. 2014. Predicting learning and affect from multimodal data streams in task-oriented tutorial dialogue. In Proceedings of the Seventh International Conference on Educational Data Mining (London, UK, 2014), 122–129.
[11] Grafsgaard, J.F., Wiggins, J.B., Vail, A.K., Boyer, K.E., Wiebe, E.N. and Lester, J.C. 2014. The additive value of multimodal features for predicting engagement, frustration, and learning during tutoring. In Proceedings of the Sixteenth ACM International Conference on Multimodal Interaction (2014), 42–49.
[12] Harley, J.M., Bouchet, F. and Azevedo, R. 2013. Aligning and comparing data on emotions experienced during learning with MetaTutor. In International Conference on Artificial Intelligence in Education (2013), 61–70.
[13] Harley, J.M., Bouchet, F., Hussain, M.S., Azevedo, R. and Calvo, R. 2015. A multi-componential analysis of emotions during complex learning with an intelligent multi-agent system. Computers in Human Behavior. 48 (2015), 615–625. DOI: https://doi.org/10.1016/j.chb.2015.02.013.
[14] Henderson, N.L., Rowe, J.P., Mott, B.W., Brawner, K., Baker, R.S. and Lester, J.C. 2019. 4D affect detection: Improving frustration detection in game-based learning with posture-based temporal data fusion. In Proceedings of the 20th International Conference on Artificial Intelligence in Education (2019).
[15] Kalimeri, K. and Saitis, C. 2016. Exploring multimodal biosignal features for stress detection during indoor mobility. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016), 53–60.
[16] Mierswa, I., Wurst, M., Klinkenberg, R. and Scholz, M. 2006. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006), 935–940.
[17] Muller, P.M., Amin, S., Verma, P., Andriluka, M. and Bulling, A. 2015. Emotion recognition from embedded bodily expressions and speech during dyadic interactions. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), 663–669. DOI: https://doi.org/10.1109/ACII.2015.7344640.
[18] Nazari, Z., Lucas, G. and Gratch, J. 2015. Multimodal approach for automatic recognition of machiavellianism. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), 215–221. DOI: https://doi.org/10.1109/ACII.2015.7344574.
[19] Ocumpaugh, J., Baker, R.S. and Rodrigo, M.T. 2015. Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP) 2.0 Technical and Training Manual.
[20] Pardos, Z., Baker, R., Pedro, M.S., Gowda, S.M. and Gowda, S.M. 2014. Affective states and state tests: Investigating how affect and engagement during the school year predict end-of-year learning outcomes. Journal of Learning Analytics. 1, 1 (2014), 107–128. DOI: https://doi.org/10.1145/2460296.2460320.
[21] Patwardhan, A. and Knapp, G. 2017. Aggressive actions and anger detection from multiple modalities using Kinect. CoRR (2017).
[22] Patwardhan, A. and Knapp, G. 2016. Multimodal affect recognition using Kinect. arXiv preprint arXiv:1607.02652 (2016).
[23] Pei, E., Yang, L., Jiang, D. and Sahli, H. 2015. Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), 208–214.
[24] Rahman, W. and Gavrilova, M.L. 2017. Emerging EEG and Kinect face fusion for biometric identification. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI) (2017), 1–8.
[25] Rajendran, R., Carter, K.E. and Levin, D.T. 2018. Predicting learning by analyzing eye-gaze data of reading behavior. International Educational Data Mining Society (2018).
[26] Ramachandran, B.R.N., Pinto, S.A.R., Born, J., Winkler, S. and Ratnam, R. 2017. Measuring neural, physiological and behavioral effects of frustration. In Proceedings of the 16th International Conference on Biomedical Engineering (2017), 43–46.
[27] Soleymani, M., Pantic, M. and Pun, T. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing. 3, 2 (2012), 211–223.
[28] Sottilare, R.A., Baker, R.S., Graesser, A.C. and Lester, J.C. 2018. Special issue on the Generalized Intelligent Framework for Tutoring (GIFT): Creating a stable and flexible platform for innovations in AIED research. International Journal of Artificial Intelligence in Education. 28, 2 (2018), 139–151. DOI: https://doi.org/10.1007/s40593-017-0149-9.
[29] Vail, A.K., Wiggins, J.B., Grafsgaard, J.F., Boyer, K.E., Wiebe, E.N. and Lester, J.C. 2016. The affective impact of tutor questions: Predicting frustration and engagement. International Educational Data Mining Society (2016), 247–254.
[30] Zeiler, M.D. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).