Adversarial and Cooperative Correlated Domain Adaptation based Multimodal Emotion Recognition

Jie-Lin Qiu1, Xiaoshi Chen2, and Kai Hu1*

1 Shanghai Jiao Tong University, Shanghai, China
2 Southeast University, Nanjing, China
sjtu_hukai@sjtu.edu.cn
* corresponding author

Abstract. In this paper, we propose a new model, Adversarial and Cooperative Correlated Domain Adaptation (ACCDA), for multimodal emotion recognition. ACCDA unifies adversarial discriminative domain adaptation and cooperative generative domain adaptation with deep canonical correlation analysis to train highly correlated domains of multiple physiological modalities (EEG and eye movement signals), making use of their complementarity and relevance. In experiments on two real-world datasets, we find that our model significantly contributes to higher emotion classification accuracy when higher correlation is acquired. Our experimental results indicate that the ACCDA model performs better than state-of-the-art methods, with a mean accuracy of 88.64% for four-class emotion classification on the SEED IV dataset. It also outperforms the state-of-the-art results on the DEAP dataset, with a mean accuracy of 86.15% over the two dichotomous classification tasks.

Keywords: Emotion Recognition, EEG, Eye Movement, Domain Adaptation

1 Introduction and Related Work

Multimodal emotion recognition from electroencephalography (EEG) and eye movement features has attracted increasing interest. Integrating this information with fusion technologies is attractive for constructing robust emotion recognition models. The combination of signals from the central nervous system (EEG) and external behaviors (eye movements) has been reported to be a promising approach [1,2,3,4,5,6,7,8,9,21].

Domain adaptation methods attempt to mitigate the harmful effects of domain shift. Recent domain adaptation methods learn deep neural transformations that map both domains into a common feature space. This is generally achieved by optimizing the representation to minimize some measure of domain shift, such as maximum mean discrepancy [11,12] or correlation distances [13,14]. An alternative is to reconstruct the target domain from the source representation [15]. Adversarial adaptation methods have become an increasingly popular incarnation of this type of approach; they seek to minimize an approximate domain discrepancy distance through an adversarial objective with respect to a domain discriminator. These methods are closely related to generative adversarial learning [16], which pits two networks against each other: a generator and a discriminator. The generator is trained to produce images in a way that confuses the discriminator, which in turn tries to distinguish them from real image examples. In domain adaptation, this principle has been employed to ensure that the network cannot distinguish between the distributions of its training and test domain examples [17,18]. However, each algorithm makes different design choices, such as whether to use a generator, which loss function to employ, and whether to share weights across domains. For example, [17] share weights and learn a symmetric mapping of both source and target images to the shared feature space, while [18] decouple some layers, thus learning a partially asymmetric mapping. Tzeng et al.
combined discriminative modeling, untied weight sharing, and a GAN loss to form the ADDA model, which outperformed previous unsupervised adaptation results [19]. However, since the gradient computation requires back-propagation through the generator's output, a GAN can only model the distribution of continuous variables, making it inapplicable to discrete sequence generation. Researchers then proposed the Sequence Generative Adversarial Network (SeqGAN) [18], which uses a model-free policy gradient algorithm to optimize the original GAN objective. With SeqGAN, the expected Jensen-Shannon divergence (JSD) between the current and target discrete data distributions is minimized if the training is perfect. SeqGAN shows observable improvements in many tasks, and many variants have since been proposed to improve its performance. However, SeqGAN is not an ideal algorithm for this problem, and current algorithms based on it cannot show stable, reliable, and observable improvements that cover all scenarios. Lu et al. therefore proposed CoT for training generative models that measure a tractable density function for target data [20]. For multimodal emotion recognition, Lu et al. used both EEG signals and eye movement signals to recognize three types of emotions [21]. Liu et al. further used a Bimodal Deep AutoEncoder (BDAE) to extract high-level representation features [22]. Tang et al. adopted the Bimodal Deep Denoising AutoEncoder model and further took the Bimodal-LSTM model into account [23].

In this paper, we combine adversarial and cooperative networks to propose a new domain adaptation framework, named Adversarial and Cooperative Correlated Domain Adaptation (ACCDA), in which the correlation between different modalities is computed in a high-dimensional space; this takes advantage of the complementarity of multiple modalities and turns out to achieve remarkable results. Our results demonstrate the complementarity of adversarial and cooperative networks, which indicates a new direction for multimodal tasks.

2 Adversarial and Cooperative Correlated Domain Adaptation

2.1 Background

Domain Adaptation

DA is a branch of transfer learning (i.e., transductive learning within the same feature space [24]). The source domain is denoted by $D_s = \{X_s, Y_s\}$, in which $X_s = \{x_{s1}, x_{s2}, ..., x_{sn}\}$ is the input and $Y_s = \{y_{s1}, y_{s2}, ..., y_{sn}\}$ is the corresponding label set. The values of $X_s$ and $Y_s$ are drawn from the joint distribution $P(X_s, Y_s)$. Similarly, the target domain, denoted by $D_t = \{X_t, Y_t\}$, corresponds to data and labels drawn from the joint distribution $P(X_t, Y_t)$. In this paper, we consider unsupervised domain adaptation, which means label information from the target domain is not required. Typically, the marginal distributions of the input data differ between the source domain and the target domain: $P(X_s) \neq P(X_t)$. This is usually referred to as domain shift and is considered to be the key problem that leads to poor performance when a model is trained and tested on data from different domains. To eliminate the influence of domain shift, feature-based domain adaptation methods try to find a proper transformation function $\phi(\cdot)$ that aligns the data into a new feature space where $P(\phi(X_s)) = P(\phi(X_t))$.
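The introduction mentions maximum mean discrepancy (MMD) [11,12] as one measure of domain shift. As an illustration only (this is not part of ACCDA), a minimal NumPy sketch of a kernel MMD estimate between source and target features follows; all names and the bandwidth choice are our own.

```python
# Illustrative sketch (not part of the ACCDA model): estimating domain shift
# between source and target features with maximum mean discrepancy (MMD),
# one of the discrepancy measures cited in the introduction [11,12].
import numpy as np

def rbf_kernel(a, b, gamma):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2(xs, xt, gamma=1.0):
    # Biased estimate of squared MMD between P(X_s) and P(X_t)
    kss = rbf_kernel(xs, xs, gamma).mean()
    ktt = rbf_kernel(xt, xt, gamma).mean()
    kst = rbf_kernel(xs, xt, gamma).mean()
    return kss + ktt - 2 * kst
```

A transformation $\phi(\cdot)$ that removes the domain shift should drive such an estimate toward zero on the mapped features.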
Domain-Adversarial Neural Network

DANN was first proposed in [17], and its properties and applications were then further explored in [19]. The model can be divided into three parts: a feature extractor $G_f$, a label predictor $G_y$, and a domain classifier $G_d$. There is an adversarial relationship between the feature extractor and the domain classifier. The feature extractor, as the name implies, extracts new features from the input features: $f = G_f(x; \theta_f)$, where $x$ denotes an input feature vector and $f$ the corresponding output feature vector in a new feature space. The outputs are then fed into the label predictor and the domain classifier. The label predictor provides predictions of the corresponding labels, $\hat{y} = G_y(f; \theta_y)$, and the domain classifier distinguishes which domain the input comes from, $\hat{d} = G_d(f; \theta_d)$. The three parts are updated simultaneously with the objective function

$E(\theta_f, \theta_y, \theta_d) = \sum_{i=1}^{N} L_y(\hat{y}_i, y_i) - \lambda \sum_{i=1}^{N} L_d(\hat{d}_i, d_i)$   (1)

where the first term $L_y(\cdot,\cdot)$ is the loss for label prediction and $L_d(\cdot,\cdot)$ is the loss for domain classification. The update rule is designed as follows:

$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)$   (2)

The label predictor and domain classifier are trained so that their corresponding losses are minimized. The feature extractor is trained so that the label prediction loss is minimized while the domain classification loss is maximized. The feature extractor therefore tries to extract features that are good for label prediction but make it hard to tell which domain the features come from. In this way, the feature extractor learns domain-invariant features, and the domain shift is eliminated.

Adversarial Discriminative Domain Adaptation

Similar to DANN, ADDA can also be divided into three parts, except that there are two feature extractors, one for source-domain data and another for target-domain data [19]. Let $G_{f0}$ and $G_{f1}$ be the feature extractors for the source domain and the target domain, respectively. The training procedure has two stages. In the first stage, $G_{f0}$ and the label predictor $G_y$ are trained with source-domain data so that the prediction loss is minimized. After this training, the parameters of $G_{f0}$ and $G_y$ are fixed for the rest of the procedure. In the second stage, $G_{f1}$ is initialized with the parameters of $G_{f0}$, and $G_{f1}$ and $G_d$ are trained adversarially: $G_d$ is trained to discriminate source-domain data from target-domain data, while $G_{f1}$ is trained to fool $G_d$. After training, the feature extractor $G_{f1}$ aligns the distribution of the target-domain data to that of the source-domain data.

Canonical Correlation Analysis

Canonical correlation analysis is an algorithm that learns linear transformations of two random vectors in order to maximize the correlation between them [25]. Let $(X_1, X_2) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ denote random vectors with covariances $(\Sigma_{11}, \Sigma_{22})$ and cross-covariance $\Sigma_{12}$. CCA finds maximally correlated pairs of linear projections of the two views, $(\omega_1' X_1, \omega_2' X_2)$:

$(\omega_1^*, \omega_2^*) = \arg\max_{\omega_1, \omega_2} \mathrm{corr}(\omega_1' X_1, \omega_2' X_2) = \arg\max_{\omega_1, \omega_2} \dfrac{\omega_1' \Sigma_{12} \omega_2}{\sqrt{\omega_1' \Sigma_{11} \omega_1 \, \omega_2' \Sigma_{22} \omega_2}}$   (3)

Since the objective is invariant to scaling of $\omega_1$ and $\omega_2$, we can constrain the projections to have unit variance:

$(\omega_1^*, \omega_2^*) = \arg\max_{\omega_1' \Sigma_{11} \omega_1 = \omega_2' \Sigma_{22} \omega_2 = 1} \omega_1' \Sigma_{12} \omega_2$   (4)

When finding multiple pairs of vectors $(\omega_1^i, \omega_2^i)$, subsequent projections are also constrained to be uncorrelated with previous ones, i.e., $\omega_1^{i\prime} \Sigma_{11} \omega_1^j = \omega_2^{i\prime} \Sigma_{22} \omega_2^j = 0$ for $i < j$. Assembling the top $k$ projection vectors $\omega_1^i$ into the columns of a matrix $A_1 \in \mathbb{R}^{n_1 \times k}$, and similarly placing the $\omega_2^i$ into $A_2 \in \mathbb{R}^{n_2 \times k}$, we obtain the following formulation to identify the top $k \le \min(n_1, n_2)$ projections:

maximize $\mathrm{tr}(A_1' \Sigma_{12} A_2)$, subject to $A_1' \Sigma_{11} A_1 = A_2' \Sigma_{22} A_2 = I$   (5)
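As a numerical illustration of Eqs. (3)-(5) (not the authors' code; the variable names and the small regularization term are our assumptions), the top-k canonical correlations and projection matrices can be obtained from the singular value decomposition of the whitened cross-covariance matrix:

```python
# Illustrative sketch: solving the CCA problem in Eq. (5) for two data views.
# Standard result: the optimal correlations are the top singular values of
# T = Sigma11^{-1/2} Sigma12 Sigma22^{-1/2}. Variable names are ours.
import numpy as np

def cca(x1, x2, k, reg=1e-4):
    # x1: (m, n1), x2: (m, n2) -- m paired samples, centered below
    x1 = x1 - x1.mean(0)
    x2 = x2 - x2.mean(0)
    m = x1.shape[0]
    s11 = x1.T @ x1 / (m - 1) + reg * np.eye(x1.shape[1])  # Sigma_11 (regularized)
    s22 = x2.T @ x2 / (m - 1) + reg * np.eye(x2.shape[1])  # Sigma_22
    s12 = x1.T @ x2 / (m - 1)                              # Sigma_12

    def inv_sqrt(s):
        w, v = np.linalg.eigh(s)
        return v @ np.diag(w ** -0.5) @ v.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    u, sing, vt = np.linalg.svd(t)
    a1 = inv_sqrt(s11) @ u[:, :k]    # projection matrix A_1
    a2 = inv_sqrt(s22) @ vt[:k].T    # projection matrix A_2
    return a1, a2, sing[:k]          # top-k canonical correlations
```

The singular values returned here are the canonical correlations; their sum is exactly the total correlation objective that DCCA later maximizes at its output layer (Eq. 12).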
Cooperative Generative Model

Lu et al. proposed Cooperative Training (CoT) for training generative models that measure a tractable density function for target data [20]. CoT coordinately trains a generator $G$ and an auxiliary predictive mediator $M$. The training target of $M$ is to estimate a mixture density of the learned distribution $G$ and the target distribution $P$, and that of $G$ is to minimize the Jensen-Shannon divergence estimated through $M$. CoT achieves independent success without the necessity of pre-training via Maximum Likelihood Estimation or involving high-variance algorithms like REINFORCE. This low-variance algorithm is theoretically proved to be unbiased for both generative and predictive tasks.

Fig. 1: Adversarial and Cooperative Correlated Domain Adaptation networks. The left part is the ADDA network, the right part is the CGDA network, and the DCCA network is in the middle.

2.2 Our Model

The overall architecture of the Adversarial and Cooperative Correlated Domain Adaptation (ACCDA) model is shown in Figure 1. There are two views of networks, which contain EEG signals and eye movement signals respectively. The model consists of three parts: an adversarial discriminative domain adaptation (ADDA) network, a cooperative generative domain adaptation (CGDA) network, and a deep canonical correlation analysis (DCCA) network. The two domain adaptation networks are trained simultaneously and independently, and the DCCA network is applied to extract more highly correlated source and target maps for both views. We describe the details of the different components in the following paragraphs.
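The paper does not release code, so the following is only a hypothetical structural sketch, under our own assumptions, of how the three components of Figure 1 could be wired together. Every class name and hidden size here is ours; only the input dimensions (310 EEG features, 39 eye movement features, four classes) come from the experimental setup described later.

```python
# Hypothetical skeleton of the ACCDA wiring (our assumptions, not the authors' code):
# per-view source/target encoders feed an ADDA discriminator, a CGDA mediator branch,
# and DCCA projection heads that correlate the two views.
import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hidden=128):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))

class ACCDASketch(nn.Module):
    def __init__(self, d_eeg=310, d_eye=39, d_feat=60, d_corr=20, n_classes=4):
        super().__init__()
        # source/target encoders for each view (M_s, M_t in the text)
        self.enc_src = nn.ModuleDict({"eeg": mlp(d_eeg, d_feat), "eye": mlp(d_eye, d_feat)})
        self.enc_tgt = nn.ModuleDict({"eeg": mlp(d_eeg, d_feat), "eye": mlp(d_eye, d_feat)})
        self.classifier = mlp(d_feat * 2, n_classes)      # source classifier C on fused views
        self.discriminator = mlp(d_feat * 2, 1)           # ADDA domain discriminator D
        self.mediator = mlp(d_feat * 2, 1)                # CGDA mediator M_phi
        # DCCA projection heads, one per view
        self.dcca = nn.ModuleDict({"eeg": mlp(d_feat, d_corr), "eye": mlp(d_feat, d_corr)})

    def encode(self, x_eeg, x_eye, domain="src"):
        enc = self.enc_src if domain == "src" else self.enc_tgt
        return enc["eeg"](x_eeg), enc["eye"](x_eye)
```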
For the adversarial discriminative domain adaptation network, we first pre-train a source encoder network using labeled source examples. Next, we perform adversarial adaptation by learning a target encoder network such that a discriminator that sees encoded source and target examples cannot reliably predict their domain label. During testing, target samples are mapped with the target encoder into the shared feature space and classified by the source classifier. The pre-trained parameters of the source network are fixed and transmitted to the target network. In unsupervised adaptation, we assume access to source samples $X_s$ and labels $Y_s$ drawn from a source domain distribution $p_s(x, y)$, as well as target samples $X_t$ drawn from a target distribution $p_t(x, y)$, where there are no label observations. Domain adaptation instead learns a source representation mapping, $M_s$, along with a source classifier, $C_s$, and then learns to adapt that model for use in the target domain. We regularize the learning of the source and target mappings, $M_s$ and $M_t$, so as to minimize the distance between the empirical source and target mapping distributions, $M_s(X_s)$ and $M_t(X_t)$. The source classification model is trained using the standard supervised loss:

$\min_{M_s, C} L_{cls}(X_s, Y_s) = -\mathbb{E}_{(x_s, y_s)\sim(X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}[k = y_s] \log C(M_s(x_s))$   (6)

A domain discriminator $D$ is optimized according to a standard supervised loss:

$L_{advD}(X_s, X_t, M_s, M_t) = -\mathbb{E}_{x_s\sim X_s}[\log D(M_s(x_s))] - \mathbb{E}_{x_t\sim X_t}[\log(1 - D(M_t(x_t)))]$   (7)

Once the mapping parameterization is determined for the source, the target mapping is set so as to minimize the distance between the source and target domains under their respective mappings, while crucially also maintaining a target mapping that is category discriminative. Consider a layered representation in which the parameters of each layer are denoted $M_s^l$ or $M_t^l$ for a given set of equivalent layers $\{l_1, ..., l_n\}$. The space of constraints explored in the literature can then be described through layerwise equality constraints:

$\psi(M_s, M_t) \triangleq \{\psi_{l_i}(M_s^{l_i}, M_t^{l_i})\}_{i\in\{1,...,n\}}$   (8)

where each individual layer can be constrained independently. A very common form of constraint is source and target layerwise equality:

$\psi_{l_i}(M_s^{l_i}, M_t^{l_i}) = (M_s^{l_i} = M_t^{l_i})$   (9)

We choose to allow independent source and target mappings by untying the weights. This is a more flexible learning paradigm, as it allows more domain-specific feature extraction to be learned. However, the target domain has no label access, so without weight sharing a target model may quickly learn a degenerate solution if proper initialization and training procedures are not used. Therefore, we use the pre-trained source model as an initialization for the target representation space and fix the source model during adversarial training. In doing so, we effectively learn an asymmetric mapping, in which we modify the target model so as to match the source distribution. This is most similar to the original generative adversarial learning setting, where a generated space is updated until it is indistinguishable from a fixed real space. We therefore choose the inverted-label GAN loss:

$L_{advM}(X_s, X_t, D) = -\mathbb{E}_{x_t\sim X_t}[\log D(M_t(x_t))]$   (10)

As for the cooperative generative domain adaptation network, inspired by Lu et al.'s work [20], we give CGDA a structure similar to that of the adversarial discriminative domain adaptation network, only replacing the discriminator with a generator. The domain generator loss is:

$L_{genG}(X_s, X_t, M_s, M_t) = \mathbb{E}_{x_s\sim p_{data}}[\log M_\phi(x_s)] + \mathbb{E}_{x_s\sim G_\theta}[\log M_\phi(x_s)]$   (11)

where $M_\phi$ is the mediator, a predictive module that measures a mixture distribution of the learned generative distribution $G_\theta$ and the target latent distribution $P = p_{data}$ as $M_\phi = \frac{1}{2}(P + G_\theta)$.
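To make Eqs. (7) and (10) concrete, here is a minimal PyTorch-style sketch of one adversarial adaptation step, assuming a binary discriminator with a single logit output; the function and variable names are ours, not the authors'.

```python
# Illustrative PyTorch sketch of the adversarial adaptation step (Eqs. 7 and 10):
# the discriminator D separates source from target embeddings, while the target
# mapping M_t is updated with the inverted-label loss. Names and shapes are ours.
import torch
import torch.nn.functional as F

def adda_step(m_s, m_t, d, opt_d, opt_mt, x_s, x_t):
    # --- update D (Eq. 7): source -> label 1, target -> label 0 ---
    with torch.no_grad():
        f_s, f_t = m_s(x_s), m_t(x_t)
    logit_s, logit_t = d(f_s), d(f_t)
    loss_d = F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) + \
             F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- update M_t (Eq. 10): inverted labels, fool D into predicting "source" ---
    logit_t = d(m_t(x_t))
    loss_mt = F.binary_cross_entropy_with_logits(logit_t, torch.ones_like(logit_t))
    opt_mt.zero_grad(); loss_mt.backward(); opt_mt.step()
    return loss_d.item(), loss_mt.item()
```

In line with the description above, $M_s$ stays frozen and only $D$ and $M_t$ receive gradient updates in this step.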
Table 1: Value sets for hyperparameter tuning.

Type                         Value set
Subspace dimension           {10, 20, 40, 60, 80, 100, 120}
λ for ADDA and CGDA          {2^n | n ∈ {−10, −9, ..., 10}}
Regularization parameters    {10^n | n ∈ {−10, −9, ..., −3}}
Learning rate for Adam       {2^n × 10^−4 | n ∈ {−10, −9, ..., 10}}
Learning rate for DCCA       {10^n | n ∈ {−5, −4, ..., −1}}
DCCA layer sizes             [400 ± 40, 200 ± 20, 150 ± 20, 120 ± 10, 60 ± 10, 20 ± 2]

For deep canonical correlation analysis, we take advantage of the complementarity of the EEG and eye movement signals and train a DCCA network to extract highly correlated domains for both views. DCCA was proposed by Andrew et al. as a non-linear version of CCA that uses neural networks as the mapping functions instead of linear transforms [26]. DCCA directly optimizes the correlation between the two views' learned latent representations. Retrieval can be performed by the cosine distance once the correlated embedding representations of the two views are given. We regard the source and target domains of the two views as the respective inputs. The layer sizes of both views are the same, consisting of an input layer $L_1$, hidden layers $L_2$, and an output layer $L_3$, with the nodes of adjacent layers fully connected. During training, we first use the deep networks to extract features and then calculate the correlation at the output layer with canonical correlation analysis. The goal is to jointly learn the parameters $W$ and $b$ of both views, where $W \in \mathbb{R}^{c_1\times n_1}$ is a matrix of weights, $b \in \mathbb{R}^{c_1}$ is a vector of biases, and $c_1$ is the number of units in each intermediate layer of the network for the first view, such that $\mathrm{corr}(f_1(X_1), f_2(X_2))$ is as high as possible, where $f(\cdot)$ is the overall function of each view's network. We define $H_1$ and $H_2$ as the matrices whose columns are the top-level representations produced by the deep models on the two views in layer $L_3$; with $k$ retained dimensions, the total correlation of $H_1$ and $H_2$ is the sum of the top $k$ singular values of the matrix $T$:

$T = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}, \qquad l_{corr} = \mathrm{corr}(H_1, H_2) = \|T\|_{tr} = \mathrm{tr}\big((T'T)^{1/2}\big)$   (12)

The weights of the nodes are updated using back-propagation. The DCCA parameters $W_l^v$ and $b_l^v$ are trained with gradient-based optimization to maximize this quantity:

$\dfrac{\partial\, \mathrm{corr}(H_1, H_2)}{\partial H_1} = \dfrac{1}{m-1}\big(2\nabla_{11}\bar{H}_1 + \nabla_{12}\bar{H}_2\big)$   (13)

where

$\nabla_{12} = \hat{\Sigma}_{11}^{-1/2} U V' \hat{\Sigma}_{22}^{-1/2}, \qquad \nabla_{11} = -\dfrac{1}{2}\hat{\Sigma}_{11}^{-1/2} U D U' \hat{\Sigma}_{11}^{-1/2}$   (14)

3 Experiment

3.1 Dataset

We evaluate the performance of the approaches on two real-world datasets: the SEED IV dataset¹ and the DEAP dataset² [27]. The SEED IV dataset contains EEG and eye movement signals for four emotions in total [28]. There were 72 film clips in total for the four emotions, and forty-five experiments were taken by participants to assess their emotions when watching the film clips, with emotion keywords and ratings out of ten points for two dimensions: valence and arousal. The valence scale ranges from sad to happy; the arousal scale ranges from calm to excited. The EEG signals were recorded with the ESI NeuroScan System at a sampling rate of 1000 Hz with a 62-channel electrode cap. The eye movement signals were recorded with SMI ETG eye-tracking glasses. The DEAP dataset contains EEG signals and peripheral physiological signals of 32 participants. The signals were collected while the participants were watching one-minute-long emotional music videos. We chose 5 as the threshold to divide the trials into two classes according to the rated levels of arousal and valence. We used 5-fold cross-validation to compare with Liu et al. [22] and Yin et al. [29].

¹ http://bcmi.sjtu.edu.cn/~seed/
² http://www.eecs.qmul.ac.uk/mmv/datasets/deap/

3.2 Feature Extraction

For the SEED IV dataset, we extracted Differential Entropy (DE) features from each EEG channel in five frequency bands: δ (1-4 Hz), θ (4-8 Hz), α (8-14 Hz), β (14-31 Hz), and γ (31-50 Hz). The Hanning window used when extracting the EEG features was 4 s long. At each time step there were 310 (5 bands × 62 channels) dimensions of EEG features. As for the eye movement data, the features used are shown in Fig. 2 and Table 2; there were 39 dimensions in total at each time step, including both Power Spectral Density (PSD) and DE features of the pupil diameters. Before training the model, the features were normalized to zero mean. One view contains the EEG features and the other contains the eye movement features.
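As an illustration of the DE feature extraction described above (band-pass filtering, a 4 s Hanning window, and the Gaussian-form differential entropy DE = ½ log(2πeσ²) per band and channel), a small sketch follows; the filter order and windowing details are our assumptions, not necessarily the authors' exact pipeline.

```python
# Illustrative sketch of band-wise differential entropy (DE) features: for a
# Gaussian signal, DE = 0.5 * log(2*pi*e*sigma^2) per window, band, and channel.
# The band-pass filter design below is our assumption, not the authors' code.
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 50)}

def de_features(eeg, fs=1000, win_sec=4):
    # eeg: (n_channels, n_samples); returns (n_windows, n_bands * n_channels)
    win = int(win_sec * fs)
    hann = np.hanning(win)
    filtered = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered[name] = filtfilt(b, a, eeg, axis=1)            # band-pass each channel once
    feats = []
    for start in range(0, eeg.shape[1] - win + 1, win):
        row = []
        for name in BANDS:
            seg = filtered[name][:, start:start + win] * hann   # windowed segment
            var = seg.var(axis=1) + 1e-8
            row.append(0.5 * np.log(2 * np.pi * np.e * var))    # DE per channel
        feats.append(np.concatenate(row))
    return np.array(feats)
```

With 62 channels and five bands this yields the 310-dimensional EEG feature vector per time step mentioned above.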
For the DEAP dataset, we extracted DE features from the EEG signals in four frequency bands: θ (4-8 Hz), α (8-14 Hz), β (14-31 Hz), and γ (31-50 Hz), since a band-pass filter from 4-45 Hz was applied during pre-processing. The Hanning window was 2 s long. There were therefore 128 (4 bands × 32 channels) dimensions of extracted 32-channel EEG features. As for the peripheral physiological signals, six time-domain features were extracted to describe the signals from different perspectives, including the maximum value, minimum value, mean value, standard deviation, variance, and squared sum. There were thus 48 (6 features × 8 channels) dimensions of extracted peripheral physiological features.

Fig. 2: Illustration of various eye movement parameters: pupil diameter, fixation dispersion, saccade amplitude, saccade duration, and blink.

3.3 Experiment

Our objective is to perform domain adaptation between different subjects. The leave-one-subject-out cross-validation procedure is applied: for each domain adaptation method there are several runs, and in each run the data from one subject are regarded as the target domain while the data from the other subjects serve as the source domain. Multi-layer perceptrons (MLPs) are used for the feature extractors, the label predictors, and the domain classifiers in the adversarial domain adaptation networks. The Adam optimizer [30] was adopted for training the networks to obtain faster convergence. We performed a randomized search of the hyperparameters over predefined sets of values. For each method, the hyperparameter settings were evaluated with the leave-one-subject-out cross-validation procedure and the best setting was chosen to generate the final results. For the DCCA networks, we use grid search to find the optimal hyperparameters. The predefined value sets for some of the hyperparameters are listed in Table 1. After extracting features with DCCA, we apply an SVM for classification.

Table 2: Details of the extracted eye movement features.

Eye movement parameters      Extracted features
Pupil diameter (X and Y)     Mean, standard deviation, DE in four bands (0-0.2 Hz, 0.2-0.4 Hz, 0.4-0.6 Hz, 0.6-1 Hz)
Dispersion (X and Y)         Mean, standard deviation
Fixation duration (ms)       Mean, standard deviation
Blink duration (ms)          Mean, standard deviation
Saccade                      Mean and standard deviation of saccade duration (ms) and saccade amplitude
Event statistics             Blink frequency, fixation frequency, fixation dispersion total, fixation duration maximum, fixation dispersion maximum, saccade frequency, saccade duration average, saccade latency average, saccade amplitude average

4 Results and Discussion

4.1 Results on Different Datasets

For the SEED IV dataset, we regard Zheng et al.'s multimodal deep learning results as our baseline [28]. We compare several kinds of methods with our model. Table 3 demonstrates that BDAE achieved better results than SVM-based feature fusion. Compared with the CCA-based approach and the other methods, we conclude that the ACCDA model, which coordinates the signals, achieved better results. Table 4 shows the comparison of different methods on the DEAP dataset. For the two dichotomous classification tasks, Liu et al.'s multimodal autoencoder model achieved accuracy about 2% higher than the AutoEncoder. Yin et al. used an ensemble of deep classifiers, building higher-level abstractions of the physiological features [29]. Then Tang et al.
used Bimodal-LSTM and achieved the state-of-the-art accuracy for the two dichotomous classification tasks [23]. As for our ACCDA method, we learned the correlation of multi-domain signals and achieved better results than the state-of-the-art method, with mean accuracies of 85.86% and 86.45% for the arousal and valence classification tasks.

Table 3: Average accuracies (%) and standard deviations of different approaches for four-class emotion classification on the SEED IV dataset.

               CCA      SVM      BDAE     ACCDA
Accuracy (%)   49.56    75.88    85.11    88.64
Std            19.24    16.14    11.79    8.53

Table 4: Comparison of average accuracies (%) of different approaches on the DEAP dataset for the two dichotomies.

              CCA     AutoEncoder   Liu et al.   Yin et al.   Bimodal-LSTM [23]   ACCDA
Arousal (%)   61.25   74.49         80.5         84.18        83.23               85.86
Valence (%)   69.58   75.69         85.2         83.04        83.82               86.45

4.2 Discussion

In comparison with previous feature-level fusion and multimodal deep learning methods, it is very difficult to relate the original features in one modality to the features in another modality, and such methods usually learn unimodal features [31]. Moreover, the relations across the various modalities are deep rather than shallow. In our model, we can learn coordinated representations from high-level signals and make the two views of signals more complementary, which in turn improves the classification performance of the fusion features. To examine the complementarity of adversarial and cooperative domain adaptation, we verified the performance of multiple network configurations, i.e., both views with ADDA networks and both views with CGDA networks, compared against the ADDA-CGDA and CGDA-ADDA methods.

Fig. 3: (a) Confusion graph of EEG and eye movements on the SEED IV dataset, which shows their complementary characteristics. The numbers denote the percentage of samples in the class at the arrow tail classified as the class at the arrow head. (b) Accuracies of three methods as the number of epochs increases on the DEAP dataset.

5 Conclusion

In this paper, we proposed a new method, Adversarial and Cooperative Correlated Domain Adaptation (ACCDA), for multimodal emotion recognition on two real-world datasets. The model learns correlations from high-level domains owing to the complementarity and relevance of multiple signals. The experimental results have shown that our model contributes to higher classification accuracy of emotion recognition when high correlation is obtained.

References

1. Soleymani, M.; Pantic, M.; and Pun, T. 2012. Multimodal emotion recognition in response to videos. IEEE Transactions on Affective Computing 3: 211-223.
2. D'Mello, S. K., and Westlund, J. K. 2015. A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. 47: 43:1-43:36.
3. Picard, R. W. 1997. Affective Computing.
4. Bocharov, A. V.; Knyazev, G. G.; and Savostyanov, A. N. 2017. Depression and implicit emotion processing: An EEG study. Neurophysiologie Clinique 47(3): 225-230.
5. Tzirakis, P.; Trigeorgis, G.; Nicolaou, M. A.; Schuller, B. W.; and Zafeiriou, S. 2017. End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing 11: 1301-1309.
6. Hassib, M.; Schneegass, S.; Eiglsperger, P.; Henze, N.; Schmidt, A.; and Alt, F. 2017b. EngageMeter: A system for implicit audience engagement sensing using electroencephalography. In CHI.
7. Wang, X.-W.; Nie, D.; and Lu, B.-L. 2014. Emotional state classification from EEG data using machine learning approach. Neurocomputing 129: 94-106.
8. Hassib, M.; Pfeiffer, M.; Schneegass, S.; Rohs, M.; and Alt, F. 2017a. EmotionActuator: Embodied emotional feedback through electroencephalography and electrical muscle stimulation. In CHI.
9. Zheng, W.-L.; Dong, B.-N.; and Lu, B.-L. 2014. Multimodal emotion recognition using EEG and eye tracking data. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 5040-5043.
10. Lu, Y.; Zheng, W.-L.; Li, B.; and Lu, B.-L. 2015. Combining eye movements and EEG to enhance emotion recognition. In IJCAI.
11. Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. CoRR abs/1412.3474.
12. Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. In ICML.
13. Sun, B.; Feng, J.; and Saenko, K. 2016. Return of frustratingly easy domain adaptation. In AAAI.
14. Sun, B., and Saenko, K. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV Workshops.
15. Ghifary, M.; Kleijn, W. B.; Zhang, M.; Balduzzi, D.; and Li, W. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV.
16. Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
17. Ganin, Y., and Lempitsky, V. S. 2015. Unsupervised domain adaptation by backpropagation. In ICML.
18. Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.
19. Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2962-2971.
20. Lu, S.; Yu, L.; Zhang, W.; and Yu, Y. 2018. CoT: Cooperative training for generative modeling. CoRR abs/1804.03782.
21. Lu, Y.; Zheng, W.-L.; Li, B.; and Lu, B.-L. 2015. Combining eye movements and EEG to enhance emotion recognition. In IJCAI.
22. Liu, W.; Zheng, W.-L.; and Lu, B.-L. 2016. Emotion recognition using multimodal deep learning. In ICONIP.
23. Tang, H.; Liu, W.; Zheng, W.-L.; and Lu, B.-L. 2017. Multimodal emotion recognition using deep neural networks. In ICONIP.
24. Qiao, R.; Qing, C.; Zhang, T.; Xing, X.; and Xu, X. 2017. A novel deep-learning based framework for multi-subject emotion recognition. In 2017 4th International Conference on Information, Cybernetics and Computational Social Systems (ICCSS), 181-185.
25. Hotelling, H. 1936. Relations between two sets of variates. Biometrika.
26. Andrew, G.; Arora, R.; Bilmes, J. A.; and Livescu, K. 2013. Deep canonical correlation analysis. In ICML.
27. Koelstra, S.; Mühl, C.; Soleymani, M.; Lee, J.-S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; and Patras, I. 2012. DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing 3: 18-31.
28. Zheng, W.-L.; Liu, W.; Lu, Y.; Lu, B.-L.; and Cichocki, A. 2018. EmotionMeter: A multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics.
29. Yin, Z.; Zhao, M.; Wang, Y.; Yang, J.; and Zhang, J. 2017. Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Computer Methods and Programs in Biomedicine 140: 93-110.
30. Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
31. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In ICML.