<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Deep Learning Approach Towards Multimodal Stress Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristian Paul Bara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michalis Papakostas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rada Mihalcea</string-name>
          <email>mihalcea@umich.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Michigan</institution>
          ,
          <addr-line>Electrical Engineering &amp; Computer Science, Ann Arbor, MI 48109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Several studies that emerged from the fields of psychology and medical sciences during recent years have highlighted the impact that stress can have on human health and behavior. Wearable technologies and sensor-based monitoring have shown promising results towards assessing, monitoring and potentially preventing high-risk situations that may occur as a result of fatigue, poor health, or other similar conditions caused by excessive amounts of stress. In this paper, we present our initial steps in developing a deep-learning based approach that can assist with the task of multimodal stress detection. Our results indicate the promise of this direction, and point to the need for further investigations to better understand the role that deep-learning approaches can play in developing generalizable architectures for multimodal affective computing. For our experiments we use the MuSE dataset, a rich resource designed to understand the correlations between stress and emotion, and evaluate our methods on eight different information signals captured from 28 individuals.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal stress detection</kwd>
        <kwd>representation learning</kwd>
        <kwd>affective computing</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Stress is a normal reaction of the body, mostly observed under situations where
we struggle to cope with the conditions or the changes that occur in our
environment [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Its effects and symptoms affect our body physically as well
as mentally and emotionally. Stress thus plays a very significant role
in shaping our overall behavior, well-being and potentially our personal and
professional success [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Affective computing is the field of science that studies and develops
technologies able to capture, characterize and reproduce affects, i.e., experiences of
feelings or emotions. It is a highly interdisciplinary domain, mostly influenced
by the fields of computer science, psychology and cognitive science, and was
initially introduced in the late 90s by Rosalind Picard [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Contributions of
affective computing have traditionally played a very important role towards
designing more engaging and effective Human-Computer Interaction systems that
can adapt their responses according to the underlying human behavior [
        <xref ref-type="bibr" rid="ref17 ref22">17, 22</xref>
        ].
      </p>
      <p>
        However, despite the dramatic evolution of affective computing and computer
science in general during the last decade, capturing and analyzing stress factors
and their effects on the body remains a very challenging problem, primarily due
to the multidimensional impact that stress can have on human behavior. In the
past, several works have tried to address the problem of stress detection using
different types of sensors, tasks and processing techniques [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">5, 6, 4</xref>
        ].
      </p>
      <p>
        Through our work, we target three main contributions. Firstly, we evaluate
our approach using the MuSE Dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], to our knowledge one of the
richest databases for stress and emotion analysis in terms of stimuli and recorded
modalities. One of the most valuable characteristics of MuSE compared to other
available resources is that stress is not induced in the subjects through a
specific task that they need to complete during data collection. In contrast, the
dataset aims to capture stress as experienced by the participants in their real
lives during the time of the recordings. We discuss the characteristics of the
dataset in more detail in the corresponding section later in the paper. Secondly, we
aim to overcome the process of manually handpicking features for each
individual modality by proposing and evaluating a set of different deep learning-based
configurations for affective modeling and stress detection. We showcase the
potential of such approaches and we highlight the advantage of designing modular
deep architectures that can learn unsupervised features in a task-agnostic
manner, hence increasing the generalizability and applicability of pretrained
components across different tasks and applications. Lastly, we propose a preliminary
approach towards learning modality-agnostic representations. Different sensors
introduce different limitations that can relate to variations in computational
demands, sampling rate, data availability, and most importantly, a modality-based
preprocessing and feature design process. Overcoming these obstacles is one of
the greatest challenges in most multimodal processing tasks, and our modular
method demonstrates a potential solution in that direction.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Various computational methods have been proposed over the years aiming to
capture and characterize human behaviors related to stress. Technologies related
to sensor-based monitoring of the human body have the lion's share in this
domain, and different modalities and stress stimuli scenarios have been explored.</p>
      <p>
        In one of the most impactful works in the area, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] showed that multimodal
monitoring can effectively capture changes in human behavior related to
stress, and that specific factors such as body acceleration, intensity and duration
of touch, as well as the overall amount of body movement, can be greatly affected
when acting under stress. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] enhanced these preliminary findings by identifying
that fluctuations in physiological features like blood volume pulse and heart
rate may indicate significant changes in the levels of stress. The very insightful
review study published by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] emphasized the importance of considering
psychological, physiological, behavioral and contextual information when assessing
stress, primarily due to the intricate implications that it can have on behavior.
Their review suggested a plethora of features, extracted from various modalities
including cameras, thermal imaging, physiological indicators, environmental and
life factors and several others, as important predictors of stress. In a more recent
study, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] investigated the impact that stress can have on driver behavior by
monitoring some of the physiological signals indicated by the previous studies.
The authors explored a series of temporal and spectral hand-crafted features and
evaluated their methods using traditional machine learning approaches. All the
aforementioned findings have been consistently revisited, reevaluated and most
of the time reconfirmed by a series of survey studies that addressed multimodal
stress detection over the last few years [
        <xref ref-type="bibr" rid="ref21 ref3 ref9">21, 9, 3</xref>
        ].
      </p>
      <p>
        This work has been significantly inspired by these past research studies.
However, in contrast to the works discussed above, we aim to approach
multimodal stress detection using deep learning modeling. Our motivation stems
from the very inspiring results that deep learning has offered to the computer
science community. In the past, very few studies explored deep learning as a
tool for stress classification and feature extraction, primarily due to the limited
amount of available resources, a factor that can become a very hard constraint
given the excessive amounts of data that most deep learning algorithms require.
Some of the most popular deep learning-based studies related to stress include
the works by [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] on textual data extracted from social media and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] on
audio data generated by actors simulating stress and non-stress behaviors.
      </p>
      <p>
        In contrast to those techniques, we perform multimodal processing using
spatiotemporal analysis on eight different information channels with minimal data
preprocessing [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] captured from 28 individuals, a subject set larger than in
all the research studies mentioned above. We explore the potential of Recurrent
Neural Networks [
        <xref ref-type="bibr" rid="ref15 ref7">15, 7</xref>
        ] and Convolutional Autoencoders [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for learning
affective representations and we perform an in-depth evaluation of our techniques
using the MuSE dataset.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>
        MuSE is a multimodal database that has been specifically designed to address
the problem of stress detection and its relation to human emotion [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
dataset consists of 224 recordings coming from 28 subjects, each of whom
participated in two recording sessions. All subjects were undergraduate or graduate
students from different majors. The first recording session took place during
a final exam period (considered to be a high-stress period), while the second
one was conducted after the exams ended (considered to be a low-stress period).
During each session, subjects were exposed to four different stimuli, which aimed
to elicit a variety of emotional responses.
      </p>
      <p>The stimuli used in each recording session were the following:
1. Neutral: Subjects were simply sitting while multimodal data were being
collected. No emotional stimulus was provided.
2. Question Answering (QA): Subjects were asked to answer a series of
controversial questions that appeared on a screen. Questions were designed to
elicit either a positive, a negative or a neutral feeling as an aftereffect.
3. Video: Subjects were asked to watch a series of emotionally provocative
videos. Similar to QA, the videos aimed to trigger a variety of emotions.
4. Monologues: After the end of each video, subjects were asked to comment for
30 seconds on the video they had just watched.</p>
      <p>During all four steps in both recording sessions the following eight different
streams of multimodal data were collected:
1. Thermal Imaging: Thermal imaging data of the subject's face were collected
during the whole period of each session.
2. RGB Closeup Video: Regular RGB video of the subject's face was recorded
during the whole duration of a session.
3. RGB Wideangle Video: RGB video recordings showing a full-body view of
the subject were also captured.
4. Audio: User verbal responses were recorded for the interactive sections of
each recording, i.e., the QA and monologues.
5. Physiological: Four different types of physiological data were recorded using
contact sensors attached to the subject's fingers and core body. The
physiological signals captured were: (1) heart rate; (2) body temperature; (3) skin
conductance; (4) breathing rate.
6. Text: Transcripts extracted from the QA. For the purposes of this study we
did not conduct any experiments using this modality.</p>
      <p>Table 1 summarizes the statistics of the final version of MuSE as curated
for our experiments. In this study, we consider as a sample of Stress or Non-Stress
any segment that was captured during an exam or post-exam recording session,
respectively. Thus, data points that belong to the same class may originate from
different stimuli as long as they have been captured during the same period.</p>
      <p>[Table 1: Statistics of the MuSE data as curated for our experiments, per modality: percentage of Non-Stress (N) and Stress (S) samples, and totals.]</p>
    </sec>
    <sec id="sec-3a">
      <title>Methodology</title>
      <p>
        For our experiments we propose a deep-learning architecture that is based on
Convolutional Autoencoders and Recurrent Neural Networks. As briefly
discussed in Section 2, Convolutional Autoencoders (CAs) are popular for their
ability to learn meaningful unsupervised feature representations by significantly
shrinking the dimensionality of the original signal [<xref ref-type="bibr" rid="ref23">23</xref>]. On the other hand,
Recurrent Neural Networks have shown state-of-the-art results in a series of
applications across different domains and are mostly popular for their
benefits in sequential information modeling [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. For our implementation we used a
particular recurrent unit known as the Gated Recurrent Unit (GRU) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The novelty of our approach stems from the fact that we use multiple
identical copies of the same architecture to model each modality individually, while
applying minimal preprocessing steps on the original signals. Figure 2 illustrates
the basic components of this architecture.</p>
      <p>In addition, we propose a novel approach towards modality-independent
multimodal representations using a modified version of our original architecture that
allows weight-sharing across all the available modalities while taking into account
modality-dependent characteristics. For the purposes of this paper we refer to
this approach as "a-modal" and we visualize it in Figure 3.</p>
      <p>In the following subsections we will discuss in more detail the exact steps
used for preprocessing each modality, as well as the individual components of
each architecture.</p>
      <p>
        We try to minimize the computational workload of our framework by
significantly simplifying the preprocessing steps applied to each modality. Below we
describe the computations applied to each information signal before it enters
the initial encoder unit of our deep architectures.
1. Thermal: The thermal video included in MuSE was captured at a frame rate
of 30 fps. To minimize the amount of information we clamp all temperatures
between 0 and 50 degrees Celsius. Before passing the thermal video frames
through the network we resize each frame to 128 × 128 and convert each
frame to grayscale.
2. RGB: The wide-angle &amp; closeup video streams had an original frame rate of
25 fps. The frames from these modalities were directly rescaled to 128 × 128
and converted to grayscale without any additional edits.
3. Physiological: All four physiological indicators described in Section 3 were
captured at a sampling rate of 2048 Hz. For each of the signals we extract 2
sec. windows with a 98% overlap and we compute a Fast Fourier Transform
(FFT) on each of the individual segments of each signal. Finally,
the four spectra are stacked vertically to form a 4 × 4096 matrix
representation for all the physiological signals combined. This representation is
used as the final input to the network.
4. Audio: The audio signal is recorded at a sample rate of 44.1 kHz. Similar
to the physiological signals, we extract overlapping windows of size 0.37
sec. and compute the FFT on each of the windows to create a final audio
representation of size 1 × 16384, which is passed through the deep architecture.
The window overlap for the audio signal is equal to 92% (a minimal sketch of
this windowed-FFT step is given right after this list). This is a common
way of representing audio, used in previous work [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ].
      </p>
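      <p>
        Below is a minimal, hypothetical NumPy sketch of the windowed-FFT preprocessing
described above for the physiological signals (the audio stream follows the same recipe
with its own window length and overlap). The function name, the dummy data and the use
of the full FFT magnitude are illustrative assumptions; only the window sizes and
overlaps come from the description above.
      </p>
      <preformat>
import numpy as np

def windowed_fft_features(signal, sample_rate, window_sec, overlap):
    """Slice a 1-D signal into overlapping windows and keep the FFT magnitude
    of each window. Hypothetical sketch; the authors' exact code may differ."""
    win = int(round(window_sec * sample_rate))        # samples per window
    hop = max(1, int(round(win * (1.0 - overlap))))   # step between window starts
    spectra = []
    for start in range(0, len(signal) - win + 1, hop):
        segment = signal[start:start + win]
        spectra.append(np.abs(np.fft.fft(segment)))   # full FFT magnitude spectrum
    return np.stack(spectra)                          # (num_windows, win)

# Physiological example: 2 s windows at 2048 Hz with 98% overlap give 4096-point
# spectra; stacking the four channels per window yields the 4 x 4096 input to 'M1'.
channels = [np.random.randn(2048 * 60) for _ in range(4)]   # 60 s of dummy data
physio = np.stack([windowed_fft_features(c, 2048, 2.0, 0.98) for c in channels], axis=1)
# physio has shape (num_windows, 4, 4096)
      </preformat>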
      <p>
        Unsupervised Feature Learning ('M1' Module) As explained at the
beginning of this section, we propose two different architectures which share some
common core characteristics. The main component shared by both designs is the
'M1' module, which can be seen in detail in Figure 2. This module consists of
a Convolutional Autoencoder with 14 symmetrical convolutional layers: 7 layers
for encoding and 7 for decoding. The encoding portion has kernel sizes of 3 × 3,
with the number of filters per layer as follows: 2, 4, 8, 16, 32, 64, 128. The decoding
section is a mirror image of the encoder. Every convolutional layer is followed
by a ReLU layer, except for the last encoding layer which uses a sigmoid. All
convolutional layers have a stride and padding of 1. We used an adaptive learning
rate starting at 0.01; every time the loss fails to improve for 5 epochs, the learning
rate is halved. The Autoencoder is trained independently for each modality
with the objective of minimizing the L1 difference between the original inputs and
the output matrices generated by the decoder. After training an Autoencoder
on each modality we discard the decoding part and use the encoder to produce
modality-specific vectorized representations with a fixed size of 1 × 128. The
Autoencoder architecture is almost identical for all modalities, with the only
difference being the size of the initial input layer, in order to accommodate the specific
representation of each modality as discussed in the previous section. In our
a-modal architecture (Figure 3), multiple copies of 'M1' are used depending
on the number of available modalities.
      </p>
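      <p>
        As a rough illustration of the 'M1' encoder described above, the following PyTorch
sketch stacks the seven convolutional layers (3 × 3 kernels, stride and padding of 1,
filter counts 2 through 128, ReLU activations and a final sigmoid). The 2 × 2 max pooling
used here to shrink a 128 × 128 input to a 1 × 128 bottleneck is our own assumption, since
the paper does not state how the spatial reduction is achieved; the mirrored decoder and
the L1 reconstruction training loop are omitted.
      </p>
      <preformat>
import torch
import torch.nn as nn

class M1Encoder(nn.Module):
    """Sketch of the 'M1' encoder half of the Convolutional Autoencoder."""
    def __init__(self, in_channels=1):
        super().__init__()
        filters = [2, 4, 8, 16, 32, 64, 128]
        layers, prev = [], in_channels
        for i, f in enumerate(filters):
            layers.append(nn.Conv2d(prev, f, kernel_size=3, stride=1, padding=1))
            layers.append(nn.Sigmoid() if i == len(filters) - 1 else nn.ReLU())
            layers.append(nn.MaxPool2d(2))   # assumption: halves H and W at every stage
            prev = f
        self.encoder = nn.Sequential(*layers)

    def forward(self, x):                    # x: (B, 1, 128, 128) preprocessed frame
        z = self.encoder(x)                  # (B, 128, 1, 1) after seven pooling stages
        return z.flatten(1)                  # (B, 128) modality-specific representation

# Training sketch: L1 reconstruction loss against a mirrored decoder (not shown),
# with a learning rate starting at 0.01, halved when the loss plateaus for 5 epochs:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
      </preformat>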
      <p>Learning A-modal Representations ('M2' Module) The goal of the a-modal
architecture is to create a feature space that can sufficiently describe all modalities,
not only in the spatial but also in the temporal domain, without being
restricted by the nature or the number of the available information signals.</p>
      <p>One of the most common obstacles in multimodal representation learning is
the frame-rate mismatch across the different signals, which makes it difficult
to temporally align the various data streams at the processing level. In our case,
we try to match all modalities to the maximum frame rate of 30 fps provided by
the Thermal camera. The Closeup and Wideangle RGB videos have a frame rate of
25 fps. To "correct" the frame rate of these two sources we simply up-sample the
signal by duplicating every 5th frame of the original video. For the physiological
and audio signals, we extract 30 windows per second based on the principles
described previously in Section 4.1.</p>
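      <p>
        A minimal sketch of this up-sampling rule (duplicating every 5th frame of a 25 fps
stream to obtain 30 fps); repeating the frame in place is our own simplification:
      </p>
      <preformat>
def upsample_25_to_30(frames):
    """Repeat every 5th frame of a 25 fps sequence so that one second
    of video yields 30 frames instead of 25."""
    out = []
    for i, frame in enumerate(frames):
        out.append(frame)
        if i % 5 == 4:          # after every 5th frame...
            out.append(frame)   # ...repeat it once
    return out

assert len(upsample_25_to_30(list(range(25)))) == 30
      </preformat>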
      <p>After fixing all modalities to the same frame rate, we use the 'M1' module as
shown in Figure 3 to extract a vector of size 1 × 128 from each of them. Thus, at
every frame we get a set of N feature vectors of dimension 128 (an N × 128 matrix),
where N equals the number of available modalities. The main component that
discriminates the original architecture of Figure 2 from the a-modal architecture is
module 'M2', shown in Figure 3. The goal of this module is to project the multimodal
representations into a new, modality-agnostic feature space, while maintaining their
temporal coherence.</p>
      <p>
        The 'M2' module consists of two GRU components. GRU-A is a unidirectional
RNN responsible for projecting the spatio-temporal, multimodal representations into
the new a-modal space. To tune the parameters of GRU-A we use another,
bidirectional GRU (GRU-B), which aims to solve a frame-sorting problem using
the a-modal representations generated by GRU-A. This step is implemented
using self-supervision and was inspired by the work of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Thus, no
task-specific annotations are required to train 'M2'. The two components are
trained together using a shared objective function that aims to optimize the
sorting task by improving the quality of the projected representations. Similarly
to 'M1', 'M2' was trained independently. The loss function of 'M2' is shown
below:
      </p>
      <p>L = Σ<sub>m,n</sub> ||p<sub>0</sub> − p<sub>n</sub>|| + ||P − P^||</p>
      <p>Where P^ is the reference temporal permutation matrix, P is the output of
GRU-B and represents the predicted temporal permutation matrix, p<sub>0</sub> is the
output of GRU-A over a single modality, and p<sub>n</sub> is the output of GRU-A
over all modalities.</p>
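      <p>
        The following PyTorch fragment is a heavily simplified, hypothetical sketch of how
the two terms of this objective could be computed for one 20-second window (T = 600
frames at 30 fps). The linear "order head", the way a single modality is isolated, and
the use of L1 norms for both terms are our assumptions; only the overall structure
(GRU-A projections, a bidirectional GRU-B predicting a permutation matrix over a
shuffled window) follows the description above.
      </p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

T, N, D, H = 600, 5, 128, 128                    # frames, modalities, feature dim, hidden
gru_a = nn.GRU(N * D, H, batch_first=True)       # projects multimodal frames to the a-modal space
gru_b = nn.GRU(H, H, batch_first=True, bidirectional=True)
order_head = nn.Linear(2 * H, T)                 # per-frame position logits (assumption)

all_mods = torch.randn(1, T, N, D)               # stacked 'M1' outputs for one window
single_mod = torch.zeros_like(all_mods)
single_mod[:, :, 0] = all_mods[:, :, 0]          # keep only one modality

p_n, _ = gru_a(all_mods.flatten(2))              # projection over all modalities
p_0, _ = gru_a(single_mod.flatten(2))            # projection over a single modality

perm = torch.randperm(T)                         # shuffle the frames of the window
shuffled_out, _ = gru_b(p_n[:, perm])
P = order_head(shuffled_out)                     # predicted permutation matrix (1, T, T)
P_hat = torch.eye(T)[perm].unsqueeze(0)          # reference permutation matrix

loss = F.l1_loss(p_0, p_n) + F.l1_loss(P, P_hat)   # || p0 - pn || + || P - P_hat ||
loss.backward()
      </preformat>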
      <p>During testing, the pretrained GRU-A is used to produce a-modal projections
of the new/unknown multimodal samples, while GRU-B is omitted from the
pipeline.</p>
      <p>Stress Classification ('M3' Module) In both the modality-dependent and
modality-independent architectures, classification takes place using a "time-aware"
unidirectional GRU, shown as Module 'M3' in both Figures 2 and 3. 'M3' is the only
component that is trained in a fully supervised manner.</p>
      <p>In the first case of modality-dependent classification, 'M3' takes as input a
matrix of size h × 128, where h is a hyperparameter representing the temporal
window on which we make classification decisions and depends on the frame rate
of each modality. This matrix is generated by stacking consecutive vectorized
representations generated by the pretrained Encoder of 'M1'. For our experiments
we make classification decisions based on 20-second-long overlapping windows
with a 5 second step. We also perform early and late fusion experiments by
combining all the available information signals. In the first case, early fusion
happens by concatenating the modality-based 1D representations generated by
the 'M1' Encoder. In the latter case of late fusion, we vote over the available
unimodal decisions using each model's individual average accuracy as a weighting
factor.</p>
      <p>In the case of a-modal classification, the input to 'M3' is again a stack of
feature vectors of size h × 128, with the main difference being that h=600 in all
scenarios, given that all modalities have a fixed frame rate of 30 fps and that
we still classify 20-second-long windows.</p>
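      <p>
        To make the classification step concrete, here is a hypothetical sketch of an
'M3'-style GRU classifier over a window of h stacked 1 × 128 representations, together
with the accuracy-weighted late-fusion vote described above. The hidden size, the use of
the final hidden state, and the exact voting arithmetic are illustrative assumptions.
      </p>
      <preformat>
import torch
import torch.nn as nn

class M3Classifier(nn.Module):
    """Sketch of the 'M3' classifier: a unidirectional GRU over a window of h
    stacked 128-dimensional representations, followed by a linear output layer."""
    def __init__(self, feat_dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, window):              # window: (B, h, 128)
        _, last_hidden = self.gru(window)   # (1, B, hidden)
        return self.out(last_hidden[-1])    # (B, 2) stress / non-stress logits

def late_fusion(unimodal_probs, accuracies):
    """Weighted vote over unimodal class probabilities, using each model's
    individual average accuracy as its weight (sketch of the rule above)."""
    weights = torch.tensor(accuracies) / sum(accuracies)
    return sum(w * p for w, p in zip(weights, unimodal_probs))

# Example: a 20-second a-modal window at 30 fps gives h = 600 stacked vectors.
logits = M3Classifier()(torch.randn(1, 600, 128))
      </preformat>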
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>We have conducted two categories of experiments that differ in the number of
input modalities considered for the final decision making. For all our
experiments we perform subject-based leave-one-out cross validation and we report
the average performance metrics across all 28 subjects.</p>
      <p>Since meaningful verbal interaction was present only in parts of the recordings,
specifically for the audio modality we performed the analysis only on the
QA recording segments, where plenty of meaningful audio samples were available.
We excluded the Monologues, since in most cases the audio samples were very short
and poor in verbal and linguistic information, with very long pauses.</p>
      <p>Table 2 illustrates the final stress classification results using only a single
modality. For these experiments we deploy exclusively the architecture of Figure
2, as described in Section 4.</p>
      <p>As can be observed from the stability across all the reported metrics, the
classification results are well balanced between the two classes in all modalities,
a result that is in line with the balanced nature of the dataset as shown in Table 1
and that demonstrates the ability of the general architecture of Figure 2 to capture
and discriminate the valuable information in most scenarios.</p>
      <p>However, despite the classification improvement observed compared to
random choice in all cases, not all modalities were equally good at detecting stress.
In particular, Closeup video and Physiological sensors showed the minimum
improvement, with 1.3% and 2.4% increases over random respectively, while
Wideangle video was by far the best indicator of stress, with an increase of 35%.
The superiority of Wideangle imaging can be attributed to the fact that overall
body, arm and head motion is captured from this point of view, features
known for their high correlation to stress as explained in Section 2. On the other
hand, we believe that the poor performance of the physiological sensors is due to
the modality preprocessing performed before passing the signals through module
"M1", as our results are contradictory to most related research. In the future
we would like to investigate more temporally based or combined spectral-temporal
physiological signal representations, as others have done in the past, since
focusing explicitly on spectral information seems to ignore very important
characteristics of the individual signals. Other aspects such as signal segmentation and
examination of the unsupervised features learned should also be revisited and
reexamined. With respect to the Closeup video, our post-analysis revealed that
the Autoencoder of module "M1" failed in the vast majority of cases to recreate
facial features that could be indicative of stress. In most scenarios, the images
recreated by the decoder were lacking the presence of eyes and lips, and only the
head posture could be partially reproduced. We suspect that a reason for this
effect might have been the variability of features present in our training data,
in combination with the limited amount of available samples, causing the
Autoencoder to overfit on the background information. In the future, we would
like to experiment with transfer-learning approaches by fine-tuning a pretrained
Autoencoder model on facial data, since such methods have shown promising
results in a variety of applications. Lastly, the Thermal and Audio signals also
provided noticeable improvements against random choice, with 9.9% and 19.5%
increases respectively. It has to be noted that since Audio was considered only in the
QA sections, the available samples were significantly limited. This emphasizes
the effectiveness of the proposed method in capturing impactful affective audio
features without the need for vast amounts of data.</p>
      <p>In Figure 4 we illustrate sample images as they were recreated by the
Autoencoder of "M1" for the Wideangle, Thermal and Closeup videos. It is
easy to observe that the more detail included in the reconstructed image, the
higher the performance of the individual modality. In the case of Wideangle,
body postures can be depicted quite well, while in Thermal images the warmer
areas of the face (a feature that we can intuitively expect to be similar across
different subjects under stress) have been satisfactorily captured.
However, as explained above, Closeup images could not be represented efficiently.
We conducted two sets of fusion experiments: one on the QA segments, where all
the available modalities were considered, and one on all the available recordings,
where we considered only Thermal, Wideangle, Closeup and Physiological information.</p>
      <p>Our results indicate that late fusion of the individual decisions provided by
far the best results in all cases. Early fusion could not scale up its performance
when audio features were not available and showed overall inferior performance
compared to the other two fusion techniques. A-modal fusion also provided
relatively poor results compared to the modality-dependent late fusion approach.
However, a-modal results were overall slightly better than early fusion in terms of
average accuracy across the two types of experiments (QA and All recordings).
Moreover, a-modal fusion performed better than the Closeup and
Physiological unimodal models and provided results slightly inferior to the Thermal
unimodal approach. In addition, the results provided by the a-modal approach
are very stable between the two experiments (similarly to late fusion), despite
the fact that completely different modalities were used. This observation may be
indicative of the stability of the learned a-modal representations, but further
experimentation is needed. However, this was not the case for early fusion, since the
two experiments (QA vs All) had a 5.6% difference in performance despite the
fact that the majority of the modalities were the same. These results indicate the
ability of the a-modal method to learn robust, modality-agnostic representations
that carry and combine affective knowledge from all the available resources. Our
findings indicate that there is obviously a long way to go until models of general
affective awareness become a reality. However, they highlight the possibilities of
such methods and motivate us towards investigating this topic further.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions &amp; Future Work</title>
      <p>In this paper we investigated the abilities of deep-learning methods in producing
affective multimodal representations related to stress. We proposed a modular
approach that enables learning unsupervised spatial or spatio-temporal features,
depending on the way the different modules are combined. We showed
how each module can be trained and reused independently and how different
combinations of the modules can lead to different ways of combining modalities,
each coming with its own advantages and disadvantages.</p>
      <p>In particular, we demonstrated an architecture (Figure 2) able to learn spatial
modality-dependent representations in a modality-agnostic way, and we evaluated
the ability of each information channel to capture signs of stress. Additionally,
we proposed a variation of this original architecture (Figure 3) able to produce
modality-independent representations by operating on an arbitrary number of
input signals that can be highly unrelated to each other but very informative
towards understanding the targeted task; in this case, the detection of stress.</p>
      <p>One of the main assets of the proposed method is its ability to provide promising
results across all the evaluated experiments while minimizing the preprocessing
steps for all the available signals and completely avoiding manual feature
engineering. The presented results showcase that deep-learning methods can
produce rich affective representations related to stress, despite the relatively limited
amount of data. Moreover, they show that such methods can function as mechanisms to
process, extract and combine information coming from multiple sources
without the need to explicitly tailor each classifier to the characteristics of each
individual modality. These findings motivate us towards researching these topics
in greater depth.</p>
      <p>In the future we would like to investigate alternative approaches to
representing the different modalities before processing them through the deep
architectures, as we believe this can highly impact the performance of the model.
However, our priority is to do so without compromising the minimal
computational preprocessing cost discussed in this paper. Furthermore, we plan to
apply our methods to other applications in the spectrum of affective
computing, such as alertness, fatigue and deception detection. Finally, we would like to
investigate alternative architectures that can lead to improved results both in
terms of classification and computational performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This material is based in part upon work supported by the Toyota Research
Institute (TRI). Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily reflect
the views of TRI or any other Toyota entity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aigrain</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spodenkiewicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubuiss</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Detyniecki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chetouani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal stress detection from multiple assessments</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>9</volume>
          (
          <issue>4</issue>
          ),
          <volume>491</volume>
          {
          <fpage>506</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alberdi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aztiria</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basarab</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Towards an automatic early stress recognition system for office environments based on multimodal measurements: A review</article-title>
          .
          <source>Journal of biomedical informatics 59</source>
          ,
          <volume>49</volume>
          {
          <fpage>75</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Can</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arnrich</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ersoy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Stress detection in daily life scenarios using smart phones and wearable sensors: A survey</article-title>
          .
          <source>Journal of biomedical</source>
          informatics p.
          <volume>103139</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Can</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chalabianloo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekiz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ersoy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Continuous stress detection using wearable sensors in real life: Algorithmic programming contest case study</article-title>
          .
          <source>Sensors</source>
          <volume>19</volume>
          (
          <issue>8</issue>
          ),
          <year>1849</year>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Carneiro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Novais</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez-Caballero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neves</surname>
          </string-name>
          , J.:
          <article-title>Multimodal behavioral analysis for non-invasive stress detection</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>39</volume>
          (
          <issue>18</issue>
          ),
          <volume>13376</volume>
          {
          <fpage>13389</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.l.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>P.f.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>J.z.</given-names>
          </string-name>
          :
          <article-title>Detecting driving stress in physiological signals based on multimodal feature analysis and kernel classifiers</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>85</volume>
          , 279{
          <fpage>291</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          .
          <source>arXiv preprint arXiv:1412.3555</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dobson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>What is stress, and how does it affect reproduction?</article-title>
          <source>Animal reproduction science 60</source>
          ,
          <volume>743</volume>
          {
          <fpage>752</fpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Greene</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thapliyal</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caban-Holt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A survey of affective computing for stress detection: Evaluating technologies in stress detection for better health</article-title>
          .
          <source>IEEE Consumer Electronics Magazine</source>
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <volume>44</volume>
          {
          <fpage>56</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
          </string-name>
          , J.:
          <article-title>Deep clustering with convolutional autoencoders</article-title>
          .
          <source>In: International Conference on Neural Information Processing</source>
          . pp.
          <volume>373</volume>
          {
          <fpage>382</fpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jaiswal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aldeneh</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bara</surname>
            ,
            <given-names>C.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burzo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Provost</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          :
          <article-title>Muse-ing on the impact of utterance ordering on crowdsourced emotion annotations</article-title>
          .
          <source>In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . pp.
          <volume>7415</volume>
          {
          <fpage>7419</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lane</surname>
            ,
            <given-names>N.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Georgiev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qendro</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deepear: robust smartphone audio sensing in unconstrained acoustic environments using deep learning</article-title>
          .
          <source>In: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing</source>
          . pp.
          <volume>283</volume>
          {
          <fpage>294</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          :
          <article-title>Unsupervised representation learning by sorting sequences</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <volume>667</volume>
          {
          <issue>676</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>User-level psychological stress detection from social media using deep neural network</article-title>
          .
          <source>In: Proceedings of the 22nd ACM international conference on Multimedia</source>
          . pp.
          <volume>507</volume>
          {
          <fpage>516</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mandic</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambers</surname>
          </string-name>
          , J.:
          <article-title>Recurrent neural networks for prediction: learning algorithms, architectures and stability</article-title>
          . John Wiley &amp; Sons, Inc. (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>McEwen</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          :
          <article-title>Protective and damaging effects of stress mediators</article-title>
          .
          <source>New England journal of medicine 338(3)</source>
          ,
          <volume>171</volume>
          {
          <fpage>179</fpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pantic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebe</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Affective multimodal human-computer interaction</article-title>
          .
          <source>In: Proceedings of the 13th annual ACM international conference on Multimedia</source>
          . pp.
          <volume>669</volume>
          {
          <fpage>676</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Papakostas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannakopoulos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Speech-music discrimination using deep visual feature extractors</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>114</volume>
          , 334{
          <fpage>344</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Papakostas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spyrou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannakopoulos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siantikos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sgouropoulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mylonas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makedon</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep visual attributes vs. hand-crafted audio features on multidomain speech emotion recognition</article-title>
          .
          <source>Computation</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <volume>26</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>R.W.:</given-names>
          </string-name>
          <article-title>Affective computing</article-title>
          . MIT press (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gedeon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Objective measures, sensors and computational techniques for stress recognition and classification: A survey</article-title>
          .
          <source>Computer methods and programs in biomedicine 108(3)</source>
          ,
          <volume>1287</volume>
          {
          <fpage>1301</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Personal affective trait computing using multiple data sources</article-title>
          .
          <source>In: 2019 International Conference on Internet of Things (iThings)</source>
          and
          <article-title>IEEE Green Computing and Communications (GreenCom) and</article-title>
          IEEE Cyber,
          <article-title>Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData)</article-title>
          . pp.
          <volume>66</volume>
          {
          <fpage>73</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Tewari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zollhofer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garrido</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernard</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobalt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Mofa:
          <article-title>Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <volume>1274</volume>
          {
          <issue>1283</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>