A Preliminary Assessment of Game Event Detection in Emotional Mario Task at MediaEval 2021

Van-Tu Ninh¹, Tu-Khiem Le¹, Manh-Duy Nguyen¹, Sinéad Smyth², Graham Healy¹, Cathal Gurrin¹
¹ School of Computing, Dublin City University, Ireland
² School of Psychology, Dublin City University, Ireland
tu.ninhvan@adaptcentre.ie, tukhiem.le4@mail.dcu.ie, manh.nguyen5@mail.dcu.ie, sinead.smyth@dcu.ie, graham.healy@dcu.ie, cathal.gurrin@dcu.ie

ABSTRACT

The Emotional Mario task at MediaEval 2021 presents a new challenge: analysing the gameplay of ten participants playing the well-known Super Mario Bros video game by detecting key events from facial and biometric data. Our purpose in this work is to evaluate how emotion-related features used in other domains of affective computing perform in game event detection. In this working notes paper, we present our work on in-game event detection using a conventional Random Forest model that takes a combination of Blood Volume Pulse and Electrodermal Activity statistical features together with the player's facial expressions as input. In addition, we evaluate in-game visual features in a second pipeline built on the same Random Forest model, in order to compare their effectiveness against the emotion-related features. The source code of our work can be found at https://github.com/nvtu/Emotional-Mario-Analysis.

1 INTRODUCTION

Often referred to as engines of experience, games act as a source of external stimuli that can trigger emotional responses (e.g., a person might feel intense stress when fighting a boss in a game). However, the connection between games and human emotions has not been comprehensively studied, which leaves an open area of research. The Emotional Mario task was therefore initiated to analyse this relationship [5]. The task employed ten volunteers who played various stages of the Super Mario Bros video game while their reactions were captured using a webcam and an Empatica E4 wristband. The ultimate goals are to (1) predict five key events in the game and (2) summarise the gameplay by aggregating its best moments. In this work, we focus mainly on the first task. Our aim is to analyse the contribution of facial expressions and physiological signals recorded from wearable devices to the detection and classification of the five key game events.

2 APPROACH

2.1 Data Processing and Feature Extraction

2.1.1 Face, Game Frame, and Sensor Synchronisation and Processing. The dataset contains three types of data captured by different devices at different sampling rates: face video, in-game video, and sensor data. In addition to the data-synchronisation code provided by the task organisers, we modify the source code in the GitHub repository accompanying [10] to extract all frames corresponding to the actions in the game. For the sensor data recorded by the Empatica E4 device, the Blood Volume Pulse (BVP) and accelerometer signals are resampled to 60 Hz from their original sampling rates of 64 Hz and 32 Hz, respectively, to match the frame rate of the video. For the facial data, the Face Emotion Recognition (FER) features provided by the task organisers [5], extracted using the FER package [3], are input to the model as a 7-dimensional vector. Even though the use of in-game video is not recommended in this task, we also extract deep game-frame features from a ResNet-50 model pre-trained on the ImageNet dataset. These 2048-dimensional features are the same as those used in the preliminary work on the same dataset [10].
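To make the alignment step concrete, the following sketch resamples the two wearable channels to the 60 Hz video rate. It is a minimal illustration rather than our released code: the variable names are placeholders, and scipy.signal.resample_poly is one standard choice of resampler.

```python
# Minimal sketch of the resampling step in Section 2.1.1 (illustrative;
# not necessarily the exact code in the released repository).
from math import gcd

import numpy as np
from scipy.signal import resample_poly

VIDEO_FPS = 60  # target rate: one sensor sample per video frame


def to_video_rate(signal: np.ndarray, native_hz: int) -> np.ndarray:
    """Resample a 1-D signal from its native rate to the 60 Hz video rate."""
    # Polyphase resampling by the rational factor VIDEO_FPS/native_hz,
    # e.g. 60/64 = 15/16 for BVP and 60/32 = 15/8 for the accelerometer.
    g = gcd(VIDEO_FPS, native_hz)
    return resample_poly(signal, up=VIDEO_FPS // g, down=native_hz // g)


# Placeholder arrays standing in for ten seconds of the Empatica E4 export.
rng = np.random.default_rng(0)
bvp_64hz = rng.standard_normal(64 * 10)  # BVP channel, 64 Hz
acc_32hz = rng.standard_normal(32 * 10)  # one accelerometer axis, 32 Hz

bvp_60 = to_video_rate(bvp_64hz, 64)  # 600 samples, aligned to 600 frames
acc_60 = to_video_rate(acc_32hz, 32)
```

Both conversions are simple rational factors, so any alignment-preserving resampler would serve the same purpose.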
2.1.2 Blood Volume Pulse (BVP). For BVP feature extraction, we extract statistical features commonly used for stress detection and emotion recognition from physiological signals. We use the NeuroKit2 library (https://github.com/neuropsychology/NeuroKit), which employs the Elgendi processing pipeline to clean the photoplethysmogram (PPG) signal [6] and to detect systolic peaks [2]. We then compute heart rate (HR) and time-domain and frequency-domain heart rate variability (HRV) features from the extracted systolic peaks, using a window size of 60 seconds. For the frequency-domain HRV features, we use the same low (LF: 0.04-0.15 Hz) and high (HF: 0.15-0.4 Hz) frequency bands as in [9]. This feature extraction process results in a 27-dimensional vector, which is finally standardised.

2.1.3 Electrodermal Activity (EDA). We follow previous research on stress detection [7] to extract statistical EDA features. Using the NeuroKit2 library, we extract the components of the EDA signal, comprising the Skin Conductance Response (SCR), Skin Conductance Level (SCL), SCR peaks, SCR onsets, and SCR amplitude. We then compute the statistical EDA features combined from four prior works [1, 4, 8, 9], except for the slope of the EDA signal along the time axis, which results in a 35-dimensional vector. Finally, the feature vector is standardised. A sketch of both extraction pipelines is given below.
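The sketch below computes illustrative BVP and EDA features for a single 60-second window using NeuroKit2. It is an approximation rather than our exact extraction code: the helper names and the particular statistics gathered here are assumptions of the sketch, whereas the full 27- and 35-dimensional vectors follow the works cited above; standardisation is applied afterwards over the whole dataset.

```python
# Illustrative sketch of the BVP (Section 2.1.2) and EDA (Section 2.1.3)
# feature extraction for one 60-second window. The statistics chosen here
# are examples only; the full feature sets follow [1, 4, 8, 9].
import numpy as np
import neurokit2 as nk


def bvp_features(bvp_window: np.ndarray, sampling_rate: int = 64) -> np.ndarray:
    """HR plus time- and frequency-domain HRV statistics for one BVP window."""
    # ppg_process cleans the PPG signal and detects systolic peaks; the
    # Elgendi methods cited in the paper are NeuroKit2's defaults for PPG.
    signals, info = nk.ppg_process(bvp_window, sampling_rate=sampling_rate)
    hrv_time = nk.hrv_time(info, sampling_rate=sampling_rate)
    # NeuroKit2's default LF (0.04-0.15 Hz) and HF (0.15-0.4 Hz) bands match
    # those adopted from [9]; short windows can yield NaNs, hence the fill.
    hrv_freq = nk.hrv_frequency(info, sampling_rate=sampling_rate)
    feats = [signals["PPG_Rate"].mean(), signals["PPG_Rate"].std()]
    feats += hrv_time.iloc[0].tolist() + hrv_freq.iloc[0].tolist()
    return np.nan_to_num(np.asarray(feats, dtype=float))


def eda_features(eda_window: np.ndarray, sampling_rate: int = 4) -> np.ndarray:
    """Statistics of the EDA components (SCL, SCR, peaks) for one window."""
    # eda_process decomposes the signal into tonic (SCL) and phasic (SCR)
    # components and marks SCR onsets, peaks, and amplitudes. Note that
    # 4 Hz is the E4's native EDA rate; the default cleaning filter may
    # require a recent NeuroKit2 version (or prior upsampling) at this rate.
    signals, _ = nk.eda_process(eda_window, sampling_rate=sampling_rate)
    feats = []
    for component in ("EDA_Clean", "EDA_Tonic", "EDA_Phasic"):
        x = signals[component].to_numpy()
        feats += [x.mean(), x.std(), x.min(), x.max()]
    feats.append(signals["SCR_Peaks"].sum())      # number of SCR peaks
    feats.append(signals["SCR_Amplitude"].max())  # largest SCR amplitude
    return np.nan_to_num(np.asarray(feats, dtype=float))


# Usage on one window: bvp_vec = bvp_features(bvp_60s); eda_vec = eda_features(eda_60s)
```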
2.2 Game Event Detection Models

In total, we develop two models, named A and B. Model A, which detects game events from different combinations of emotion-related features, comprises two stages. As illustrated in Figure 1 (black arrows), the first stage detects whether a game event happens at a given timestamp, while the second stage classifies the corresponding game event (flag reached, life lost, status up, status down, new stage). Both stages employ a Random Forest model implemented with the scikit-learn (https://scikit-learn.org) and incremental-trees (https://github.com/garethjns/IncrementalTrees) libraries, using the same parameter configuration. For the first-stage training, the numbers of game-event and no-game-event samples are heavily imbalanced, which hampers the learning process; we therefore shuffle the no-game-event samples, divide them into batches whose size equals the number of game-event samples, and train the Random Forest incrementally on these balanced batches, as sketched below. The non-default parameter values that we employ in model A are shown in Table 1.

Figure 1: Overview of the game event detection model. The protocol of model A, which uses emotion-related features for training, is illustrated with black arrows. The protocol of model B, which uses a combination of visual and biometric features, is shown with red arrows. The classification output of model B includes one additional no-event category, shown in the red box.

Table 1: Non-default parameter values of the Random Forest model in model A

Parameter                      Value
Number of estimators           500
Minimum samples for splitting  4
Maximum depth                  8
Best-split max features        √(number of features)
Bootstrap samples              True
Out-of-bag samples             True
Class weight                   balanced subsample

Model B, which classifies game events using deep visual features extracted from game frames combined with the BVP statistical features, is a single incrementally trained Random Forest with the same parameter values as in Table 1, except for the number of estimators (100), the minimum samples for splitting (default value), and the maximum depth (default value).
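The first-stage training scheme can be summarised in a short sketch. Our implementation relies on the incremental-trees package; purely for illustration, the version below approximates the same balanced-batch scheme with scikit-learn's warm_start mechanism, which keeps previously grown trees and fits only the newly added ones. The helper name and the number of trees grown per batch are assumptions of the sketch.

```python
# Sketch of the first-stage balanced-batch training in Section 2.2,
# approximating the incremental-trees behaviour with warm_start.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_stage_one(X_event, X_noevent, trees_per_batch=50, seed=0):
    """Incrementally grow a forest over balanced event/no-event batches."""
    rng = np.random.default_rng(seed)
    X_noevent = X_noevent.copy()
    rng.shuffle(X_noevent)  # shuffle the majority (no-event) samples
    n = len(X_event)        # batch size = number of event samples
    # Table 1's "balanced subsample" class weight and out-of-bag scoring are
    # omitted here: each batch is balanced by construction, and scikit-learn
    # discourages the balanced presets in combination with warm_start.
    clf = RandomForestClassifier(
        n_estimators=0,       # trees are added batch by batch below
        warm_start=True,      # keep previously grown trees between fits
        min_samples_split=4,  # Table 1
        max_depth=8,          # Table 1
        max_features="sqrt",  # Table 1: sqrt(number of features)
        bootstrap=True,       # Table 1
        random_state=seed,
    )
    y_event = np.ones(n, dtype=int)
    for start in range(0, len(X_noevent), n):
        batch = X_noevent[start:start + n]  # (at most) n no-event rows
        X = np.vstack([X_event, batch])
        y = np.concatenate([y_event, np.zeros(len(batch), dtype=int)])
        clf.n_estimators += trees_per_batch  # grow the forest on this batch
        clf.fit(X, y)
    return clf
```

The second stage then classifies each detected event into one of the five categories using a Random Forest with the same configuration.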
3 RESULTS AND ANALYSIS

3.1 Evaluation Metrics

The organisers evaluate the runs using precision, recall, and F1 score, both for exact event matching and for event time-frame matching within ranges of +/- one second and +/- five seconds [5]. In this paper, we report the results of both exact event matching and time-frame matching within +/- five seconds.

3.2 Results

In total, we submitted three runs to the task. As described in Section 2, model A uses emotion-related features as input, while model B additionally uses ResNet-50 visual features of the gameplay. In a preliminary experiment, we also tried using model B with only emotion-related features as input, without success, potentially due in part to the highly imbalanced nature of the dataset. The results in Tables 2 and 3 both show a large gap in event detection precision between the emotion-related features extracted from physiological signals and the visual features extracted from game frames. This suggests that the game frames carry considerably more information about the events than the non-visual data. As Tables 2 and 3 also demonstrate, the precision of model A is extremely low while its recall is considerably higher than the other attempts at the task, indicating a very large number of false-positive predictions. This means that a successful approach to event detection using emotion-related features has yet to be constructed, and further research on this task is needed.

Table 2: Evaluation results of our approaches compared to other teams for event timestamp matching within +/- 5 seconds

Run                        Precision  Recall  F1 score
Model A (BVP + EDA)        0.0021     0.8903  0.0041
Model A (BVP + EDA + FER)  0.0019     0.7975  0.0039
Model B (ResNet50 + BVP)   0.3991     0.3001  0.3426
GSE-AAU                    0.0242     0.0812  0.0373
Random                     0.2847     0.2847  0.2847

Table 3: Evaluation results of our approaches compared to other teams for exact event matching within +/- 5 seconds

Run                        Precision  Recall  F1 score
Model A (BVP + EDA)        0.0014     0.5709  0.0028
Model A (BVP + EDA + FER)  0.0012     0.4998  0.0025
Model B (ResNet50 + BVP)   0.2068     0.1522  0.1753
GSE-AAU                    0.0112     0.0849  0.0197
Random                     0.0667     0.0667  0.0667

ACKNOWLEDGMENTS

This publication is funded by Dublin City University's Research Committee and by research grants from Science Foundation Ireland, co-funded by the European Regional Development Fund, under grant numbers SFI/13/RC/2106, SFI/13/RC/2106_P2, SFI/12/RC/2289_P2, and 18/CRT/6223.

REFERENCES

[1] Jongyoon Choi, Beena Ahmed, and Ricardo Gutierrez-Osuna. 2011. Development and evaluation of an ambulatory stress monitor based on wearable sensors. IEEE Transactions on Information Technology in Biomedicine 16, 2 (2011), 279–286.
[2] Mohamed Elgendi, Ian Norton, Matt Brearley, Derek Abbott, and Dale Schuurmans. 2013. Systolic peak detection in acceleration photoplethysmograms measured from emergency responders in tropical conditions. PLoS One 8, 10 (2013), e76585.
[3] Justin Shenk et al. 2021. Facial Expression Recognition with a deep neural network as a PyPI package. (2021). https://github.com/justinshenk/fer
[4] Jennifer Healey and Rosalind W. Picard. 2005. Detecting stress during real-world driving tasks using physiological sensors. IEEE Transactions on Intelligent Transportation Systems 6 (2005), 156–166.
[5] Mathias Lux, M. Riegler, Henrik Svoren, S. Hicks, Duc-Tien Dang-Nguyen, Kristine Jorgensen, Vajira Thambawita, and P. Halvorsen. 2021. Emotional Mario Task at MediaEval 2021. In MediaEval.
[6] Mohsen Nabian, Yu Yin, Jolie Wormwood, Karen S. Quigley, Lisa F. Barrett, and Sarah Ostadabbas. 2018. An open-source feature extraction tool for the analysis of peripheral physiological data. IEEE Journal of Translational Engineering in Health and Medicine 6 (2018), 1–11.
[7] Van-Tu Ninh, Sinéad Smyth, Minh-Triet Tran, and Cathal Gurrin. 2021. Analysing the Performance of Stress Detection Models on Consumer-Grade Wearable Devices. In SoMeT.
[8] Kizito Nkurikiyeyezu, Anna Yokokubo, and Guillaume Lopez. 2020. Effect of Person-Specific Biometrics in Improving Generic Stress Predictive Models. Sensors and Materials 32 (2020), 703–722. https://doi.org/10.18494/SAM.2020.2650
[9] Philip Schmidt, Attila Reiss, Robert Duerichen, Claus Marberger, and Kristof Van Laerhoven. 2018. Introducing WESAD, a multimodal dataset for wearable stress and affect detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction. 400–408.
[10] Henrik Svoren, Vajira Thambawita, P. Halvorsen, Petter Jakobsen, Enrique Alejandro García Ceja, Farzan Majeed Noori, Hugo Lewi Hammer, Mathias Lux, M. Riegler, and S. Hicks. 2020. Toadstool: A Dataset for Training Emotional Intelligent Machines Playing Super Mario Bros.