A Preliminary Assessment of Game Event Detection in Emotional Mario Task at MediaEval 2021

Van-Tu Ninh¹, Tu-Khiem Le¹, Manh-Duy Nguyen¹, Sinéad Smyth², Graham Healy¹, Cathal Gurrin¹
¹ School of Computing, Dublin City University, Ireland
² School of Psychology, Dublin City University, Ireland
tu.ninhvan@adaptcentre.ie, tukhiem.le4@mail.dcu.ie, manh.nguyen5@mail.dcu.ie, sinead.smyth@dcu.ie, graham.healy@dcu.ie, cathal.gurrin@dcu.ie

ABSTRACT

The Emotional Mario task at MediaEval 2021 presents a new challenge: analysing the gameplay of ten participants playing the well-known Super Mario Bros video game by detecting key events from facial and biometric data. Our purpose in this work is to evaluate how emotion-related features used in other domains of affective computing perform in game event detection. In this working notes paper, we present our work on in-game event detection using a conventional Random Forest model that takes a combination of Blood Volume Pulse and Electrodermal Activity statistical features together with the player's facial expressions as input. In addition, we evaluate in-game visual features in a second pipeline built on the same Random Forest model, in order to compare their effectiveness against the emotion-related features. The source code of our work can be found at https://github.com/nvtu/Emotional-Mario-Analysis.

1 INTRODUCTION

Often referred to as engines of experience, games act as a source of external stimuli that can trigger emotional responses (e.g., a person might feel intense stress when fighting a boss in a game). However, the connection between games and human emotions has not been comprehensively studied, which leaves an open area of research. The Emotional Mario task was therefore initiated to analyse this relationship [5]. The task employed ten volunteers who played various stages of the Super Mario Bros video game while their reactions were captured using a webcam and an Empatica E4 wristband. The ultimate goals are to (1) predict five key events in the game and (2) summarise the gameplay by aggregating its best moments. In this work, we focus mainly on the first task. Our aim is to analyse the contribution of facial expressions and physiological signals recorded from wearable devices to the detection and classification of the five key game events.

2 APPROACH

2.1 Data Processing and Feature Extraction

2.1.1 Face, Game Frame, and Sensor Synchronisation and Processing. The dataset contains three types of data captured by different devices at different sampling rates: face video, in-game video, and sensor data. In addition to the data-synchronisation code provided by the task organisers, we modify the source code in the GitHub repository accompanying [10] to extract all frames corresponding to the actions in the game. For the sensor data recorded by the Empatica E4 device, the Blood Volume Pulse (BVP) and accelerometer signals are resampled to 60 Hz from their original sampling rates of 64 Hz and 32 Hz, respectively, to match the frame rate of the video. For the facial data, the Face Emotion Recognition (FER) features provided by the task organisers [5], extracted using the FER package [3], are input to the model as a 7-dimensional vector. Even though the use of in-game video is not recommended in this task, we also extract deep game-frame features from a ResNet-50 model pre-trained on the ImageNet dataset. These 2048-dimensional features are the same as those used in the preliminary work on the same dataset [10].
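To make the alignment step concrete, the following sketch resamples the two wearable channels to the 60 Hz video rate. It is a minimal illustration rather than our released code: the variable names are placeholders, and scipy.signal.resample_poly is one standard choice of resampler.

```python
# Minimal sketch of the resampling step in Section 2.1.1 (illustrative;
# not necessarily the exact code in the released repository).
from math import gcd

import numpy as np
from scipy.signal import resample_poly

VIDEO_FPS = 60  # target rate: one sensor sample per video frame


def to_video_rate(signal: np.ndarray, native_hz: int) -> np.ndarray:
    """Resample a 1-D signal from its native rate to the 60 Hz video rate."""
    # Polyphase resampling by the rational factor VIDEO_FPS/native_hz,
    # e.g. 60/64 = 15/16 for BVP and 60/32 = 15/8 for the accelerometer.
    g = gcd(VIDEO_FPS, native_hz)
    return resample_poly(signal, up=VIDEO_FPS // g, down=native_hz // g)


# Placeholder arrays standing in for ten seconds of the Empatica E4 export.
rng = np.random.default_rng(0)
bvp_64hz = rng.standard_normal(64 * 10)  # BVP channel, 64 Hz
acc_32hz = rng.standard_normal(32 * 10)  # one accelerometer axis, 32 Hz

bvp_60 = to_video_rate(bvp_64hz, 64)  # 600 samples, aligned to 600 frames
acc_60 = to_video_rate(acc_32hz, 32)
```

Both conversions are simple rational factors, so any alignment-preserving resampler would serve the same purpose.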
2.1.2 Blood Volume Pulse (BVP). For BVP feature extraction, we extract statistical features commonly used for stress detection and emotion recognition from physiological signals. We use the NeuroKit2 library (https://github.com/neuropsychology/NeuroKit), which employs the Elgendi processing pipeline to clean the photoplethysmogram (PPG) signal [6] and to detect systolic peaks [2]. We then compute heart rate (HR) and time-domain and frequency-domain heart rate variability (HRV) features from the extracted systolic peaks, using a window size of 60 seconds. For the frequency-domain HRV features, we use the same low (LF: 0.04-0.15 Hz) and high (HF: 0.15-0.4 Hz) frequency bands as in [9]. This feature extraction process results in a 27-dimensional vector, which is finally standardised.

2.1.3 Electrodermal Activity (EDA). We follow previous research on stress detection [7] to extract statistical EDA features. Using the NeuroKit2 library, we extract the components of the EDA signal, comprising the Skin Conductance Response (SCR), Skin Conductance Level (SCL), SCR peaks, SCR onsets, and SCR amplitude. We then compute the statistical EDA features combined from four prior works [1, 4, 8, 9], except for the slope of the EDA signal along the time axis, which results in a 35-dimensional vector. Finally, the feature vector is standardised. A sketch of both extraction pipelines is given below.
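The sketch below computes illustrative BVP and EDA features for a single 60-second window using NeuroKit2. It is an approximation rather than our exact extraction code: the helper names and the particular statistics gathered here are assumptions of the sketch, whereas the full 27- and 35-dimensional vectors follow the works cited above; standardisation is applied afterwards over the whole dataset.

```python
# Illustrative sketch of the BVP (Section 2.1.2) and EDA (Section 2.1.3)
# feature extraction for one 60-second window. The statistics chosen here
# are examples only; the full feature sets follow [1, 4, 8, 9].
import numpy as np
import neurokit2 as nk


def bvp_features(bvp_window: np.ndarray, sampling_rate: int = 64) -> np.ndarray:
    """HR plus time- and frequency-domain HRV statistics for one BVP window."""
    # ppg_process cleans the PPG signal and detects systolic peaks; the
    # Elgendi methods cited in the paper are NeuroKit2's defaults for PPG.
    signals, info = nk.ppg_process(bvp_window, sampling_rate=sampling_rate)
    hrv_time = nk.hrv_time(info, sampling_rate=sampling_rate)
    # NeuroKit2's default LF (0.04-0.15 Hz) and HF (0.15-0.4 Hz) bands match
    # those adopted from [9]; short windows can yield NaNs, hence the fill.
    hrv_freq = nk.hrv_frequency(info, sampling_rate=sampling_rate)
    feats = [signals["PPG_Rate"].mean(), signals["PPG_Rate"].std()]
    feats += hrv_time.iloc[0].tolist() + hrv_freq.iloc[0].tolist()
    return np.nan_to_num(np.asarray(feats, dtype=float))


def eda_features(eda_window: np.ndarray, sampling_rate: int = 4) -> np.ndarray:
    """Statistics of the EDA components (SCL, SCR, peaks) for one window."""
    # eda_process decomposes the signal into tonic (SCL) and phasic (SCR)
    # components and marks SCR onsets, peaks, and amplitudes. Note that
    # 4 Hz is the E4's native EDA rate; the default cleaning filter may
    # require a recent NeuroKit2 version (or prior upsampling) at this rate.
    signals, _ = nk.eda_process(eda_window, sampling_rate=sampling_rate)
    feats = []
    for component in ("EDA_Clean", "EDA_Tonic", "EDA_Phasic"):
        x = signals[component].to_numpy()
        feats += [x.mean(), x.std(), x.min(), x.max()]
    feats.append(signals["SCR_Peaks"].sum())      # number of SCR peaks
    feats.append(signals["SCR_Amplitude"].max())  # largest SCR amplitude
    return np.nan_to_num(np.asarray(feats, dtype=float))


# Usage on one window: bvp_vec = bvp_features(bvp_60s); eda_vec = eda_features(eda_60s)
```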
2.2 Game Event Detection Models

In total, we develop two models, named A and B. Model A, which detects game events from different combinations of emotion-related features, comprises two stages. As illustrated in Figure 1 (black arrows), the first stage detects whether a game event happens at a given timestamp, while the second stage classifies the corresponding game event (flag reached, life lost, status up, status down, new stage). Both stages employ a Random Forest model implemented with the scikit-learn (https://scikit-learn.org) and incremental-trees (https://github.com/garethjns/IncrementalTrees) libraries, using the same parameter configuration. For the first-stage training, the numbers of game-event and no-game-event samples are heavily imbalanced, which hampers the learning process; we therefore shuffle the no-game-event samples, divide them into batches whose size equals the number of game-event samples, and train the Random Forest incrementally on these balanced batches, as sketched below. The non-default parameter values that we employ in model A are shown in Table 1.

Figure 1: Overview of the game event detection model. The protocol of model A, which uses emotion-related features for training, is illustrated with black arrows. The protocol of model B, which uses a combination of visual and biometric features, is shown with red arrows. The classification output of model B includes one additional no-event category, shown in the red box.

Table 1: Non-default parameter values of the Random Forest model in model A

Parameter                      Value
Number of estimators           500
Minimum samples for splitting  4
Maximum depth                  8
Best-split max features        √(number of features)
Bootstrap samples              True
Out-of-bag samples             True
Class weight                   balanced subsample

Model B, which classifies game events using deep visual features extracted from game frames combined with the BVP statistical features, is a single incrementally trained Random Forest with the same parameter values as in Table 1, except for the number of estimators (100), the minimum samples for splitting (default value), and the maximum depth (default value).
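The first-stage training scheme can be summarised in a short sketch. Our implementation relies on the incremental-trees package; purely for illustration, the version below approximates the same balanced-batch scheme with scikit-learn's warm_start mechanism, which keeps previously grown trees and fits only the newly added ones. The helper name and the number of trees grown per batch are assumptions of the sketch.

```python
# Sketch of the first-stage balanced-batch training in Section 2.2,
# approximating the incremental-trees behaviour with warm_start.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_stage_one(X_event, X_noevent, trees_per_batch=50, seed=0):
    """Incrementally grow a forest over balanced event/no-event batches."""
    rng = np.random.default_rng(seed)
    X_noevent = X_noevent.copy()
    rng.shuffle(X_noevent)  # shuffle the majority (no-event) samples
    n = len(X_event)        # batch size = number of event samples
    # Table 1's "balanced subsample" class weight and out-of-bag scoring are
    # omitted here: each batch is balanced by construction, and scikit-learn
    # discourages the balanced presets in combination with warm_start.
    clf = RandomForestClassifier(
        n_estimators=0,       # trees are added batch by batch below
        warm_start=True,      # keep previously grown trees between fits
        min_samples_split=4,  # Table 1
        max_depth=8,          # Table 1
        max_features="sqrt",  # Table 1: sqrt(number of features)
        bootstrap=True,       # Table 1
        random_state=seed,
    )
    y_event = np.ones(n, dtype=int)
    for start in range(0, len(X_noevent), n):
        batch = X_noevent[start:start + n]  # (at most) n no-event rows
        X = np.vstack([X_event, batch])
        y = np.concatenate([y_event, np.zeros(len(batch), dtype=int)])
        clf.n_estimators += trees_per_batch  # grow the forest on this batch
        clf.fit(X, y)
    return clf
```

The second stage then classifies each detected event into one of the five categories using a Random Forest with the same configuration.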
3 RESULTS AND ANALYSIS

3.1 Evaluation Metrics

The organisers evaluate the runs using precision, recall, and F1 score, both for exact event matching and for event time-frame matching within ranges of +/- one second and +/- five seconds [5]. In this paper, we report the results of both exact event matching and time-frame matching within +/- five seconds.

3.2 Results

In total, we submitted three runs to the task. As described in Section 2, model A uses emotion-related features as input, while model B additionally uses ResNet-50 visual features of the gameplay. In a preliminary experiment, we also tried using model B with only emotion-related features as input, without success, potentially due in part to the highly imbalanced nature of the dataset. The results in Tables 2 and 3 both show a large gap in event detection precision between the emotion-related features extracted from physiological signals and the visual features extracted from game frames. This suggests that the game frames carry considerably more information about the events than the non-visual data. As Tables 2 and 3 also demonstrate, the precision of model A is extremely low while its recall is considerably higher than the other attempts at the task, indicating a very large number of false-positive predictions. This means that a successful approach to event detection using emotion-related features has yet to be constructed, and further research on this task is needed.

Table 2: Evaluation results of our approaches compared to other teams for event timestamp matching within +/- 5 seconds

Run                        Precision  Recall  F1 score
Model A (BVP + EDA)        0.0021     0.8903  0.0041
Model A (BVP + EDA + FER)  0.0019     0.7975  0.0039
Model B (ResNet50 + BVP)   0.3991     0.3001  0.3426
GSE-AAU                    0.0242     0.0812  0.0373
Random                     0.2847     0.2847  0.2847

Table 3: Evaluation results of our approaches compared to other teams for exact event matching within +/- 5 seconds

Run                        Precision  Recall  F1 score
Model A (BVP + EDA)        0.0014     0.5709  0.0028
Model A (BVP + EDA + FER)  0.0012     0.4998  0.0025
Model B (ResNet50 + BVP)   0.2068     0.1522  0.1753
GSE-AAU                    0.0112     0.0849  0.0197
Random                     0.0667     0.0667  0.0667

ACKNOWLEDGMENTS

This publication is funded by Dublin City University's Research Committee and by research grants from Science Foundation Ireland, co-funded by the European Regional Development Fund, under grant numbers SFI/13/RC/2106, SFI/13/RC/2106_P2, SFI/12/RC/2289_P2, and 18/CRT/6223.

REFERENCES

[1] Jongyoon Choi, Beena Ahmed, and Ricardo Gutierrez-Osuna. 2011. Development and evaluation of an ambulatory stress monitor based on wearable sensors. IEEE Transactions on Information Technology in Biomedicine 16, 2 (2011), 279–286.
[2] Mohamed Elgendi, Ian Norton, Matt Brearley, Derek Abbott, and Dale Schuurmans. 2013. Systolic peak detection in acceleration photoplethysmograms measured from emergency responders in tropical conditions. PLoS One 8, 10 (2013), e76585.
[3] Justin Shenk et al. 2021. Facial Expression Recognition with a deep neural network as a PyPI package. (2021). https://github.com/justinshenk/fer
[4] Jennifer Healey and Rosalind W. Picard. 2005. Detecting stress during real-world driving tasks using physiological sensors. IEEE Transactions on Intelligent Transportation Systems 6 (2005), 156–166.
[5] Mathias Lux, M. Riegler, Henrik Svoren, S. Hicks, Duc-Tien Dang-Nguyen, Kristine Jorgensen, Vajira Thambawita, and P. Halvorsen. 2021. Emotional Mario Task at MediaEval 2021. In MediaEval.
[6] Mohsen Nabian, Yu Yin, Jolie Wormwood, Karen S. Quigley, Lisa F. Barrett, and Sarah Ostadabbas. 2018. An open-source feature extraction tool for the analysis of peripheral physiological data. IEEE Journal of Translational Engineering in Health and Medicine 6 (2018), 1–11.
[7] Van-Tu Ninh, Sinéad Smyth, Minh-Triet Tran, and Cathal Gurrin. 2021. Analysing the Performance of Stress Detection Models on Consumer-Grade Wearable Devices. In SoMeT.
[8] Kizito Nkurikiyeyezu, Anna Yokokubo, and Guillaume Lopez. 2020. Effect of Person-Specific Biometrics in Improving Generic Stress Predictive Models. Sensors and Materials 32 (2020), 703–722. https://doi.org/10.18494/SAM.2020.2650
[9] Philip Schmidt, Attila Reiss, Robert Duerichen, Claus Marberger, and Kristof Van Laerhoven. 2018. Introducing WESAD, a multimodal dataset for wearable stress and affect detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction. 400–408.
[10] Henrik Svoren, Vajira Thambawita, P. Halvorsen, Petter Jakobsen, Enrique Alejandro García Ceja, Farzan Majeed Noori, Hugo Lewi Hammer, Mathias Lux, M. Riegler, and S. Hicks. 2020. Toadstool: A Dataset for Training Emotional Intelligent Machines Playing Super Mario Bros.