<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparison and Analysis of Deep Audio Embeddings for Music Emotion Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eunjeong Koh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shlomo Dubnov</string-name>
          <email>sdubnovg@ucsd.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Music Department, University of California, San Diego</institution>
          ,
          <addr-line>La Jolla, CA 92039</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Emotion is a complicated notion present in music that is hard to capture even with fine-tuned feature engineering. In this paper, we investigate the utility of state-of-the-art pre-trained deep audio embedding methods for the Music Emotion Recognition (MER) task. Deep audio embedding methods allow us to efficiently compress high-dimensional audio features into a compact representation. We implement several multi-class classifiers with deep audio embeddings to predict emotion semantics in music. We investigate the effectiveness of the L3-Net and VGGish deep audio embedding methods for music emotion inference over four music datasets. The experiments with several classifiers on the task show that deep audio embedding solutions can improve the performance of previous baseline MER models. We conclude that deep audio embeddings represent musical emotion semantics for the MER task without expert human engineering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Understanding emotional information in music is an essential step
for music indexing and recommendation tasks.
Previous Music Emotion Recognition (MER) studies
explore sound components, such as duration, pitch, velocity, and
melodic interval, that can be used to analyze emotions.
Those representations are high-level acoustic features based
on domain knowledge
        <xref ref-type="bibr" rid="ref28 ref30 ref31 ref47 ref7">(Wang, Wang, and Lanckriet 2015;
Madhok, Goel, and Garg 2018; Chen et al. 2016; Lin, Chen,
and Yang 2013)</xref>
        .
      </p>
      <p>
        Relying on human expertise to design the acoustic
features for pre-processing large amounts of new data is not
always feasible. Furthermore, existing emotion-related
features are often fine-tuned for the target dataset based on
music domain expertise and are not generalizable across
different datasets
        <xref ref-type="bibr" rid="ref30 ref35 ref36 ref7">(Panda, Malheiro, and Paiva 2018b)</xref>
        .
      </p>
      <p>
        Advances in deep neural networks now allow us
to learn useful domain-agnostic representations, known as
deep audio embeddings, from raw audio input data with no
human intervention. Furthermore, it has been reported that
deep audio embeddings frequently outperform hand-crafted
feature representations in other signal processing problems
such as Sound Event Detection (SED) and video tagging
        <xref ref-type="bibr" rid="ref14 ref50">(Wilkinghoff 2020; DCASE 2019)</xref>
        .
      </p>
      <p>
        The power of deep audio embeddings is to automatically
identify predominant aspects in the data at scale.
Specifically, the Mel-based Look, Listen, and Learn network
(L3-Net) embedding method recently matched state-of-the-art
performance on the SED task
        <xref ref-type="bibr" rid="ref13">(Cramer et al. 2019)</xref>
        . Using
a sufficient amount of training data (around 60M training
samples) and carefully designed training choices, Cramer
et al. were able to detect novel sound features in each
audio clip using the L3-Net audio embeddings
        <xref ref-type="bibr" rid="ref13">(Cramer et al.
2019)</xref>
        . Cramer et al. released their optimal pre-trained
L3-Net model, which can now be extended to new tasks.
      </p>
      <p>
        In this paper, we compare and analyze the deep audio
embeddings, L3-Net and VGGish, for representing musical
emotion semantics. VGGish is also a type of deep audio
embedding method based on a VGG-like structure trained to
predict video tags from the YouTube-8M dataset
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref4">(Abu-ElHaija et al. 2016; Hershey et al. 2017; Jansen et al. 2017,
2018)</xref>
        . We repurpose the two deep audio embeddings,
originally designed for the SED task, to the task of MER. In
evaluating the performance of the embedding methods over four
different music emotion datasets, we use the embeddings in
several classification models and evaluate their efficacy for
the MER task on each dataset.
      </p>
      <p>Our results show that the embedding methods provide
an effective knowledge transfer mechanism between SED
and MER domains without any additional training samples.
More importantly, the deep audio embedding does not
require expert human engineering of the sound features for
the emotion prediction task. Our study reveals that
audio-domain knowledge from the SED task can be extended to
the MER task.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        One of the goals of the MER task is to automatically
recognize the emotional information conveyed in music
        <xref ref-type="bibr" rid="ref25">(Kim
et al. 2010)</xref>
        . Although there are many studies in the MER
field
        <xref ref-type="bibr" rid="ref30 ref44 ref51 ref52 ref7">(Soleymani et al. 2013; Yang and Chen 2012; Yang,
Dong, and Li 2018)</xref>
        , it is difficult to compare
the features and performance of these studies because of
technical differences in data representation, emotion labeling, and
feature selection algorithms. In addition, different studies are
difficult to reproduce as many of them use different
public datasets or private datasets with small amounts of music
clips and different levels of features.</p>
      <p>
        Previous studies have utilized neural networks to
efficiently extract emotional information and analyze the salient
semantics of the acoustic features. Recent works explore
neural networks given the significant improvements over
hand-crafted feature-based methods
        <xref ref-type="bibr" rid="ref14 ref26 ref38 ref39 ref42 ref43 ref49 ref5">(Piczak 2015; Salamon
and Bello 2017; Pons and Serra 2019; Simonyan and
Zisserman 2014)</xref>
        . Specifically, using Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs)
based models, several studies attempt to extract necessary
parameters for emotion prediction and reduce the
dimensionality of the corresponding emotional features
        <xref ref-type="bibr" rid="ref10 ref14 ref14 ref15 ref29 ref39 ref39 ref45">(Cheuk
et al. 2020; Thao, Herremans, and Roig 2019; Dong et al.
2019; Liu, Fang, and Huang 2019)</xref>
        . After careful feature
engineering, these methods are suitable for a target dataset
for emotion prediction; however, a considerable amount of
training and optimization is still required.
      </p>
      <p>Deep audio embeddings are audio features
computed by a neural network that takes audio data as input.
The advantage of
deep audio embedding representations is that they
summarize high-dimensional spectrograms into a compact
representation. Using deep audio embeddings, 1)
information can be extracted without being limited to
specific kinds of data, and 2) time and resources can be saved.</p>
      <p>
        Several studies have used deep audio embedding
methods in music classification tasks. For example, Choi et al.
implemented a convnet feature-based deep audio
embedding and showed how it can be used in six different
music tagging tasks such as dance genre classification, genre
classification, speech/music classification, emotion
prediction, vocal/non-vocal classification, and audio event
classification
        <xref ref-type="bibr" rid="ref11">(Choi et al. 2017)</xref>
        . Kim et al. proposed several
statistical methods to understand deep audio embeddings for
usage in learning tasks
        <xref ref-type="bibr" rid="ref23">(Kim et al. 2019)</xref>
        . However, there
are currently no studies analyzing the use of deep audio
embeddings in the MER task across multiple datasets.
      </p>
      <p>
        Knowledge transfer is receiving increased attention in
Music Information Retrieval (MIR) research as a method to
enhance sound features. Recent MIR studies report
considerable performance improvements in music analysis,
indexing, and classification tasks by using cross-domain
knowledge transfer
        <xref ref-type="bibr" rid="ref19 ref46">(Hamel and Eck 2010; Van den Oord,
Dieleman, and Schrauwen 2013)</xref>
        . For automatic emotion
recognition in speech data, Feng and Chaspari used a Siamese
neural network for optimizing pairwise differences between
source and target data
        <xref ref-type="bibr" rid="ref16">(Feng and Chaspari 2020)</xref>
        . In the
context of SED, where the goal is to detect different sound
events in audio streams, Cramer et al.
        <xref ref-type="bibr" rid="ref13">(Cramer et al. 2019)</xref>
        propose a new audio analysis method, using deep audio
embeddings, based on computer vision techniques. It remains
to be seen whether knowledge transfer can be successfully applied
to deep audio embeddings from the SED domain to the MIR
domain for the task of MER.
      </p>
      <p>In this study, we use deep audio embedding methods
designed for the SED task and apply them to four music emotion
datasets for learning emotion features in music.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <sec id="sec-3-1">
        <title>Downstream Task: Music Emotion Recognition</title>
        <p>We employ a two-step experimental approach (see Figure 1).</p>
        <p>Step 1. Given a song as an input, a deep audio embedding
model extracts the deep audio embeddings that indicate the
acoustic features of the song.</p>
        <p>Step 2. After extracting deep audio embeddings, the
selected classification model predicts the corresponding
emotion category that indicates the emotion label of the song.</p>
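        <p>As an illustration of this two-step setup, the following minimal Python sketch uses random placeholder embeddings and labels; in practice the embeddings would come from L3-Net or VGGish, and any of the classifiers discussed later could replace the SVM.</p>
        <preformat>
import numpy as np
from sklearn.svm import SVC

n_songs, emb_dim, n_emotions = 900, 512, 4     # e.g. 4Q dataset size, L3-Net embedding size
X = np.random.randn(n_songs, emb_dim)          # Step 1: one deep audio embedding per song
y = np.random.randint(0, n_emotions, n_songs)  # emotion category labels (Q1-Q4)

clf = SVC().fit(X, y)                          # Step 2: classifier predicts the emotion label
print(clf.predict(X[:5]))
        </preformat>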
      </sec>
      <sec id="sec-3-2">
        <title>Deep Audio Embeddings</title>
        <p>
          We choose two deep audio embedding methods, L3-Net
and VGGish, which are state-of-the-art audio
representations pre-trained on the 60M-sample AudioSet
          <xref ref-type="bibr" rid="ref17 ref21">(Gemmeke et al. 2017)</xref>
          and YouTube-8M data
          <xref ref-type="bibr" rid="ref4">(Abu-El-Haija et al. 2016)</xref>
          . AudioSet
and YouTube-8M are large labeled training datasets that are
widely used in audio and video learning with deep neural
networks.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Look, Listen, and Learn network (L3-Net)</title>
        <p>
          L3-Net is an audio embedding method
          <xref ref-type="bibr" rid="ref13">(Cramer et al. 2019)</xref>
          motivated by
the original work of Look, Listen, and Learn (L3)
          <xref ref-type="bibr" rid="ref42 ref5">(Arandjelovic and Zisserman 2017)</xref>
          , which addresses the Audio-Visual
Correspondence learning task in computer vision research.
The key differences between the original L3 (by
Arandjelović and Zisserman) and L3-Net (by Cramer et al.) are
(1) input data format (video vs. audio), (2) final embedding
dimensionality, and (3) training sample size.
        </p>
        <p>
          The L3-Net audio embedding method consists of 2D
convolutional layers and 2D max-pooling layers, and each
convolution layer is followed by batch normalization and a
ReLU nonlinearity (see Figure 2). For the last layer, a
max-pooling layer is applied to produce a single 512-dimensional
feature vector (L3-Net offers output embedding
sizes of 6144 or 512, and we choose 512 as
our embedding size). The L3-Net method is pre-trained on
60M training samples from Google AudioSet, containing mostly
musical performances
          <xref ref-type="bibr" rid="ref17 ref21">(Gemmeke et al. 2017)</xref>
          .
        </p>
        <p>We follow the design choices of the L3-Net study which
result in the best performance in their SED task. We use Mel
spectrograms with 256 Mel bins spanning the entire audible
frequency range, resulting in a 512-dimensional feature
vector. We revise the OpenL3 open-source implementation1 for our
experiments.</p>
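        <p>A minimal sketch of this extraction step with the OpenL3 library is shown below; the file path is hypothetical and the clip-level mean pooling is our assumption.</p>
        <preformat>
import librosa
import numpy as np
import openl3

audio, sr = librosa.load("clip.wav", sr=None, mono=True)   # hypothetical audio clip
# Mel-256 input representation, music content type, 512-dimensional output,
# matching the design choices described above.
emb, timestamps = openl3.get_audio_embedding(
    audio, sr, input_repr="mel256", content_type="music", embedding_size=512)
clip_vector = np.mean(emb, axis=0)                          # one 512-d vector per clip
        </preformat>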
        <p>
          VGGish: We also verify another deep audio embedding
method, VGGish
          <xref ref-type="bibr" rid="ref26 ref43 ref49">(Simonyan and Zisserman 2014)</xref>
          , a
VGG-structure (VGGNet) based deep audio embedding model.
VGGish is a 128-dimensional audio embedding method,
motivated by VGGNet
          <xref ref-type="bibr" rid="ref26 ref43 ref49">(Simonyan and Zisserman 2014)</xref>
          , and
pre-trained on a large YouTube-8M dataset
          <xref ref-type="bibr" rid="ref4">(Abu-El-Haija
et al. 2016)</xref>
          . The original VGGNet targets large-scale
image classification tasks, while VGGish targets extracting
acoustic features from audio waveforms. The VGGish audio
embedding method consists of 2D convolutional layers and
2D max-pooling layers to produce a single 128-dimensional
feature vector (see Figure 2). We modify a VGGish
open-source implementation2 for our experiments.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Music Emotion Classifiers</title>
        <p>From the computed deep audio embeddings, we predict an
emotion category corresponding to each audio vector as a
multi-class classification problem. We employ six
different classification models: Support Vector Machine (SVM),
Naive Bayes (NB), Random Forest (RF), Multilayer
Perceptron (MLP), Convolutional Neural Network (CNN), and
Recurrent Neural Network (RNN).</p>
        <p>
          For each classification task, we use 80% of the data for
training, 10% for testing, and 10% for validation. All six
classification models are implemented in Scikit-learn
          <xref ref-type="bibr" rid="ref37">(Pedregosa et al. 2011)</xref>
          , Keras
          <xref ref-type="bibr" rid="ref12">(Chollet et al. 2015)</xref>
          , and
Tensorflow
          <xref ref-type="bibr" rid="ref1">(Abadi et al. 2016)</xref>
          . In the case of MLP, CNN, and
RNN classification models, we share some implementation
details below.
        </p>
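        <p>A short sketch of this 80/10/10 split follows, assuming arrays X (deep audio embeddings) and y (emotion labels); the random seed and placeholder data are ours.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(900, 512)      # placeholder embeddings
y = np.random.randint(0, 4, 900)   # placeholder emotion labels

# 80% train, then split the remaining 20% evenly into test and validation
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
        </preformat>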
        <p>1 OpenL3 open-source library: https://openl3.readthedocs.io/en/latest/index.html
2 VGGish: https://github.com/tensorflow/models/tree/master/research/audioset/vggish</p>
        <p>
          MLP: We implement the MLP model with two layers: a
single hidden layer with 512 nodes and a ReLU activation
function, and an output layer with the number of emotion categories
and a softmax activation function. The model is trained
using the categorical cross-entropy loss function and we use
Adam stochastic gradient descent
          <xref ref-type="bibr" rid="ref26 ref43 ref49">(Kingma and Ba 2014)</xref>
          .
We fit the model for 1000 training epochs with the default
batch size of 32 samples and evaluate the performance at the
end of each training epoch on the test dataset.
        </p>
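        <p>A hedged Keras sketch of this MLP configuration follows; the placeholder data are assumptions, while the layer sizes, loss, optimizer, epochs, and batch size follow the description above.</p>
        <preformat>
import numpy as np
from tensorflow import keras

num_classes, emb_dim = 4, 512
X_train = np.random.randn(720, emb_dim)                       # placeholder embeddings
y_train = keras.utils.to_categorical(
    np.random.randint(0, num_classes, 720), num_classes)      # placeholder labels

model = keras.Sequential([
    keras.layers.Dense(512, activation="relu", input_shape=(emb_dim,)),  # hidden layer
    keras.layers.Dense(num_classes, activation="softmax"),               # output layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=1000, batch_size=32, verbose=0)
        </preformat>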
        <p>
          CNN: For the CNN classification model, we revise the
convolutional filter design proposed by Abdoli et al.
          <xref ref-type="bibr" rid="ref14 ref3 ref39">(Abdoli,
Cardinal, and Koerich 2019)</xref>
          , which includes four 1D
convolution layers and a 1D max-pooling operation layer. Each
layer processes 64 convolutional filters. The input to the
network is the feature vector
extracted from a deep audio embedding method. The input
size varies depending on the type of embedding
method: in the case of L3-Net, the embedding size
is 512, while the VGGish embedding size is 128. ReLU activation
functions are applied to the convolutional layers to reduce
the backpropagation errors and accelerate the learning
process
          <xref ref-type="bibr" rid="ref18 ref33">(Goodfellow, Bengio, and Courville 2016)</xref>
          . The
softmax function is used as the output activation function, with the
number of emotion categories as outputs. The Adam optimizer, the categorical
cross-entropy loss function, and a batch size of 32
samples are used. The stopping criterion is set to 1000 epochs
with an early-stopping rule if there is no improvement in the
score during the last 100 learning epochs.
        </p>
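        <p>A sketch of such a 1D-CNN in Keras is given below; the kernel size, pooling placement, and placeholder data are our assumptions, while the number of layers, filters, optimizer, loss, batch size, and stopping rule follow the description above.</p>
        <preformat>
import numpy as np
from tensorflow import keras

num_classes, emb_dim = 4, 512                       # 128 for VGGish embeddings
X_train = np.random.randn(720, emb_dim, 1)          # placeholder embeddings as 1D signals
y_train = keras.utils.to_categorical(
    np.random.randint(0, num_classes, 720), num_classes)

model = keras.Sequential([keras.layers.Input(shape=(emb_dim, 1))])
for _ in range(4):                                  # four 1D convolution layers, 64 filters each
    model.add(keras.layers.Conv1D(64, kernel_size=3, activation="relu"))
model.add(keras.layers.MaxPooling1D(pool_size=2))   # 1D max-pooling operation
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(num_classes, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(monitor="loss", patience=100)
model.fit(X_train, y_train, epochs=1000, batch_size=32, callbacks=[early_stop], verbose=0)
        </preformat>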
        <p>
          RNN: Weninger et al.
          <xref ref-type="bibr" rid="ref26 ref43 ref48 ref49">(Weninger, Eyben, and Schuller
2014)</xref>
          propose an LSTM-RNN design as an automaton-like
structure mapping from an observation sequence to an
output feature sequence. We use LSTM networks with a
pointwise softmax function based on the number of emotion
categories. The Adam optimizer, the categorical cross-entropy loss
function, and a batch size of 32 samples are used. The
same stopping criterion as for the CNN is used.
        </p>
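        <p>A corresponding Keras LSTM sketch is shown below; treating the per-frame embeddings of a clip as the input sequence, the sequence length, and the hidden size of 128 are assumptions on our part.</p>
        <preformat>
import numpy as np
from tensorflow import keras

num_classes, timesteps, emb_dim = 4, 96, 512        # placeholder sequence length
X_train = np.random.randn(720, timesteps, emb_dim)  # per-frame embeddings per clip
y_train = keras.utils.to_categorical(
    np.random.randint(0, num_classes, 720), num_classes)

model = keras.Sequential([
    keras.layers.LSTM(128, input_shape=(timesteps, emb_dim)),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(monitor="loss", patience=100)
model.fit(X_train, y_train, epochs=1000, batch_size=32, callbacks=[early_stop], verbose=0)
        </preformat>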
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>Four different datasets are selected for computing the
emotional features in music data. In Table 1, we show the number
of music files in each dataset by emotion category.</p>
        <p>
          4Q Audio Emotion Dataset: This dataset is introduced
by Panda et al.
          <xref ref-type="bibr" rid="ref22 ref30 ref35 ref36 ref7">(Panda, Malheiro, and Paiva 2018a)</xref>
          , who
annotated each music clip into four Arousal-Valence (A-V)
quadrants based on Russell's model
          <xref ref-type="bibr" rid="ref41">(Russell 2003)</xref>
          : Q1 (A+V+),
Q2 (A+V-), Q3 (A-V-), Q4 (A-V+). Each emotion category
has 225 music clips, and each music clip is 30 seconds long.
The dataset contains 900 music clips in total.
        </p>
        <p>
          Bi-modal Emotion Dataset: This dataset is introduced
by Malheiro et al.
          <xref ref-type="bibr" rid="ref32">(Malheiro et al. 2016)</xref>
          in the context of
bimodal emotion recognition with audio and
lyric information. The emotion categories are also annotated
into four A-V quadrants according to Russell's model. In this dataset,
each emotion category has a different number of music clips,
Q1: 52 clips; Q2: 45 clips; Q3: 31 clips, and Q4: 34 clips,
and each music clip is 30 seconds long. The dataset contains
162 music clips in total and is the
smallest dataset in our experiments.
        </p>
        <p>
          Emotion in Music: Using a crowdsourcing platform,
Soleymani et al.
          <xref ref-type="bibr" rid="ref44">(Soleymani et al. 2013)</xref>
          release a music
emotion dataset with 20,000 arousal and valence annotations on
1,000 music clips. For our experiments, we map the arousal
and valence annotations into four A-V quadrants, following
Russell's model settings described above. Each emotion category
has a different number of music clips, Q1: 305 clips; Q2: 87
clips; Q3: 241 clips, and Q4: 111 clips, and each music clip
is 45 seconds long. We use 744 music clips of the dataset in
our experiments. This dataset is one of the most frequently
used datasets for the MER task.
        </p>
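        <p>The quadrant mapping we apply to these arousal-valence annotations can be sketched as follows, assuming annotations re-centered around a neutral point of zero.</p>
        <preformat>
def av_to_quadrant(arousal: float, valence: float) -> str:
    """Map an arousal-valence pair onto Russell's four quadrants."""
    high_arousal = arousal >= 0
    high_valence = valence >= 0
    if high_arousal and high_valence:
        return "Q1"  # A+V+
    if high_arousal:
        return "Q2"  # A+V-
    if high_valence:
        return "Q4"  # A-V+
    return "Q3"      # A-V-
        </preformat>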
        <p>
          Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS): This dataset is introduced by
Livingstone et al.
          <xref ref-type="bibr" rid="ref30 ref7">(Livingstone and Russo 2018)</xref>
          for understanding
the emotional context in speech and singing data. The singing
portion includes recording clips of human singing with
different emotional contexts, where 24 different actors were asked
to sing in six different emotional states: neutral, calm, happy,
sad, angry, and fearful. We use only the singing data for our
experiments. Each emotion category has a different number of
music clips, neutral: 92 clips; calm: 184 clips; happy: 184
clips, sad: 184 clips, angry: 184 clips, and fearful: 20 clips,
and each music clip is 5 seconds long. The dataset contains
848 music clips in total.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Baseline Audio Features</title>
        <p>
          As a baseline feature, we use Mel-Frequency Cepstral
Coefficients (MFCCs), which are known to be efficient low-level
descriptors for timbre analysis and are commonly used as features in music
tagging tasks
          <xref ref-type="bibr" rid="ref11 ref24 ref30 ref7">(Choi et al. 2017; Kim, Lee, and Nam 2018)</xref>
          .
MFCCs describe the overall shape of a spectral envelope.
We first calculate the time derivatives of the given MFCCs
and then take the mean and standard deviation over the time
axis. Finally, we concatenate all statistics into one vector.
We generate the MFCC features of each music clip into a
matrix of 20 x 1500. Librosa is used for MFCCs extraction
and audio processing
          <xref ref-type="bibr" rid="ref12 ref34">(McFee et al. 2015)</xref>
          .
For classification problems, classifier performance is
typically defined according to the confusion matrix associated
with the classifier. We use accuracy as the primary
evaluation criterion. We also calculate F1-score and r2 score
for comparison with other baseline models.
        </p>
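        <p>A librosa sketch of this MFCC baseline is shown below; the 20 coefficients follow the text, while the file path and remaining settings are defaults or assumptions.</p>
        <preformat>
import librosa
import numpy as np

y, sr = librosa.load("clip.wav")                    # hypothetical audio clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # 20 x n_frames matrix
delta = librosa.feature.delta(mfcc)                 # time derivatives of the MFCCs
stats = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),            # mean and std over the time axis
    delta.mean(axis=1), delta.std(axis=1),
])                                                  # one concatenated vector per clip
        </preformat>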
      </sec>
      <sec id="sec-4-3">
        <title>Evaluation of Music Emotion Recognition</title>
        <p>In Figure 3, we show the performance of deep audio
embeddings over four music emotion datasets. We empirically
analyze deep audio embeddings in several settings against
baseline MFCC features. The experiments are validated with 20
repetitions of cross-validation, and we report the average
results. We share key observations in the next sections.</p>
        <p>
Performance Analyzed by Features: The L3-Net
embedding has the best performance in all considered cases except
for two, CNN classifier accuracy both in Bi-modal Emotion
and Emotion in Music dataset (see Figure 3). Even though
the L3-Net embedding is generally not a descriptor for any
music-related tasks before, the performance convinces us to
use a pre-trained L3-Net audio embedding model for the
MER task.</p>
        <p>Since the direct use of the L3-Net embedding shows
better performance, we also investigate the
different embedding dimensions of L3-Net and compare the
performance between 512 and 6144. Interestingly, we
observe decreasing results with the 6144-dimensional L3
embeddings. This indicates that those extra features might not
be relevant and may instead introduce noise. While the 512-dimensional L3
embeddings show consistently higher performance in many cases,
based on our observations, even when we increase the depth and
number of parameters, the 6144-dimensional L3-Net embeddings perform
slightly worse on this MER task. Thus, we have not included
their performance in the figure. Note that the results reported in
Figure 3 consider only the 512-dimensional embeddings.</p>
        <p>Comparing L3-Net and VGGish, L3-Net
outperforms VGGish across the datasets. This could be because
L3-Net was pre-trained on both visual and audio data mapped onto the same
embedded space, which can include more features. The
performance of VGGish is better than the MFCC baseline features
with the rest of the classification models, even though it has
a smaller embedding dimension of 128. This justifies our use of L3-Net as the
main deep audio embedding choice for the MER task, with
VGGish as an option for some cases.</p>
        <p>
          It is generally known that decision trees, and in our case
RF, are better than other neural-network-based classifiers
when the data comprises a large set of categories
          <xref ref-type="bibr" rid="ref14 ref39">(Pons and
Serra 2019)</xref>
          . They also deal better with dependence between
variables, which might increase the error and cause some
significant features to become insignificant during training.
SVM uses the kernel trick to solve non-linear problems, whereas
decision trees derive hyper-rectangles in input space to solve
the problem. This is why decision trees are better for
categorical data and deal with co-linearity better than SVM.
We still find that SVM outperforms RF in some cases. The
reason may be that SVM deals better with margins and thus
better handles outliers. Although these are tangential
considerations, they seem to support the overall notion that MER is
a higher-level recognition problem that first needs to address
the division of the data into multiple acoustic categories, also
requiring the learning of a rather non-trivial partition
structure within these sub-categories.
        </p>
        <p>
          Performance Analyzed by Datasets: For comparison
with prior works studying emotions in audio signals, we
analyze the performance of previous studies on each dataset we
used. We choose four baseline MER models for our
experiments: 1) Panda et al.
          <xref ref-type="bibr" rid="ref22 ref30 ref35 ref36 ref7">(Panda, Malheiro, and Paiva 2018a)</xref>
          release the 4Q Music Emotion dataset and present the study
of musical texture and expressivity features, 2) Malheiro
et al.
          <xref ref-type="bibr" rid="ref32">(Malheiro et al. 2016)</xref>
          present novel lyrical features
for MER task and release the Bi-modal Emotion dataset,
3) Choi et al.
          <xref ref-type="bibr" rid="ref11">(Choi et al. 2017)</xref>
          present a pre-trained
convnet feature for music classification and regression tasks and
evaluate the model using Emotion in Music dataset, 4) Arora
and Chaspari
          <xref ref-type="bibr" rid="ref30 ref7">(Arora and Chaspari 2018)</xref>
          present the method
of a siamese network for speech emotion classification and
evaluate the method using RAVDESS dataset. We compare
those baseline models to the performance of our proposed
method (see Table 2).
        </p>
        <p>
          In the case of the 4Q Audio Emotion dataset, the
previous study by Panda et al. obtained its best result of 73.5%
F1-score with a high number of 800 features. In Table 3,
Domain Knowledge means a feature set defined by domain
knowledge in the study. To achieve the performance of
the previous study, the following steps are needed. First,
we need to pre-process standard or baseline audio features
of each audio clip. The study used Marsyas, MIR Toolbox,
and PsySound3 audio frameworks to extract a total of 1702
features. Second, we need to calculate the correlation
between each pair of features for normalization. After the
pre-processing, the number of features can be decreased to 898
features. Third, after computing these baseline audio
features, we also need to compute novel features of each
audio clip proposed by the study. Those features were
carefully designed based on domain expertise, such as glissando
features, vibrato, and tremolo features. Finally, baseline
features and extracted novel features are combined for the MER
task. For the evaluation, the study conducted post-processing
of the features with the ReliefF feature selection algorithm
          <xref ref-type="bibr" rid="ref40">(Robnik-Sˇ ikonja and Kononenko 2003)</xref>
          , ranked the features
and evaluated its best-suited features. Since the performance
has been evaluated by hyperparameter tuning and feature
selection algorithms, these factors may influence the
performance of the MER task significantly. Note that in our
proposed approach, we show the performance without any
post-processing.
        </p>
        <p>
          In the case of the Bi-modal Emotion dataset, the previous
study by Malheiro et al.
          <xref ref-type="bibr" rid="ref32">(Malheiro et al. 2016)</xref>
          presented
its best classification result of 72.6% F1-score on the dataset
which is lower than our result of 88% F1-score
obtained with the L3-Net embedding and the SVM classifier.
        </p>
        <p>
          In the case of the Emotion in Music dataset, previous
studies predicted the time-varying arousal and valence
annotation and calculated r2 score as a performance
measure
          <xref ref-type="bibr" rid="ref11 ref24 ref26 ref27 ref30 ref43 ref48 ref49 ref7">(Weninger, Eyben, and Schuller 2014; Lee et al. 2019;
Kim, Lee, and Nam 2018; Choi et al. 2017)</xref>
          . As described above, we
map these time-varying annotations into four A-V quadrants
based on Russell's model and show our prediction
performance with four emotion categories (see Figure 3-(c)). For a
fair comparison, we also verify the original time-varying
dynamic annotations from the dataset
          <xref ref-type="bibr" rid="ref44">(Soleymani et al. 2013)</xref>
          and compare the result with the baseline model. Using the
Emotion in Music dataset, Choi et al. reported r2 scores
of 0.656 for arousal annotation and 0.462 for valence annotation
          <xref ref-type="bibr" rid="ref11">(Choi et al. 2017)</xref>
          . The best performance of L3-Net
embeddings achieves a 0.671 r2 score on arousal and a 0.556 r2
score on valence annotation. The result shows that we achieve
considerably higher performance on arousal and
valence annotation, and confirms that the L3-Net
embedding method performs more favorably than the previous
embedding features on the Emotion in Music data.
        </p>
        <p>
          In the case of RAVDESS data, the study by Arora and
Chaspari
          <xref ref-type="bibr" rid="ref30 ref7">(Arora and Chaspari 2018)</xref>
          reported a best
classification accuracy of 63.8% on the dataset, which is lower
than our accuracy of 71.0% obtained with the L3-Net
embedding and the CNN classifier (see Figure 3-(d)).
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Performance Analyzed by A-V Quadrants</title>
        <p>In Table 3, we show the results analyzed by each quadrant. This
classification report gives us a further understanding of the
characteristic of each emotion category in music. The meaning
of each quadrant (Q1, Q2, Q3, Q4) is described
in Table 1.</p>
        <p>
          In the case of the 4Q Audio Emotion dataset, Q2 and Q3
categories obtain a higher score compared to the Q1 and Q4.
This indicates that emotional features in music clips with
lower valence components are easier to recognize.
Specifically, the Q2 category shows distinctly higher performance
than the others. According to the dataset description
          <xref ref-type="bibr" rid="ref30 ref35 ref36 ref7">(Panda,
Malheiro, and Paiva 2018b)</xref>
          , music clips of
the Q2 category belong to specific genres, such as heavy
metal, which have more recognizable acoustic features than
others.
        </p>
        <p>Lower results in Q1 and Q4 categories may also reflect
the characteristics of music clips. For instance, the Q1
category indicates happy emotions, which are typically energetic
based on positive arousal and positive valence components.
Since Q1 and Q4 categories share the same valence axis
based on Rusell’s model, if the intensity of the song is not
intense, the difference between the two quadrants (Q1&amp;Q4
or Q2&amp;Q3) may not be apparent. This aspect results in
similar behaviors on the Q2 and Q3 categories’ performances as
well.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Conclusion</title>
      <p>In this paper, we evaluate L3-Net and VGGish pre-trained
deep audio embedding methods for MER task over 4Q
Audio Emotion, Bi-modal Emotion, Emotion in Music, and
RAVDESS datasets. Even though L3-Net was not
intended for emotion recognition, we find that L3-Net is the
best representation for the MER task. Note that we achieve
this performance without any additional domain-knowledge
feature selection method, feature training process, or
fine-tuning process. Compared to MFCC baseline features, the
empirical analysis shows that L3-Net is robust across
multiple datasets with favorable performance. Overall, the
result using L3-Net shows improvement compared to
baseline models for Bi-modal Emotion, Emotion in Music, and
RAVDESS dataset. In the case of the 4Q Audio Emotion
dataset, complex hand-crafted features (over 100 features)
still seem to perform better. Specifically, our work does not
consider rhythm or specific musical parameters over the time
axis that the 4Q Audio Emotion study used; looking into time-based
aspects could be a next step for future research.</p>
      <p>In order to gain deeper insight into the meaning of
acoustic features for emotional recognition, we use T-SNE
visualization (see Figure 4). In both cases of L3-Net and VGGish,
two main clusters on the left and right side of the figure
correspond to male/female singer groups. We can also see a
relatively smooth grouping of samples by emotions with
different colors. In the case of L3-Net embeddings (top figure of
Figure 4), multiple small groups in each cluster indicate
individual singers, each of whom has audio recordings in different
emotions. L3-Net data seems to cluster into multiple smaller
groups according to gender and individual categories, and
this shows that L3-Net outperforms VGGish at detecting different
timbre information. This pattern seems to be
consistent over a wide range of T-SNE perplexity parameters.
This also shows that our study provides an empirical
justification that L3-Net outperforms VGGish, with the intuition
discussed in the paper based on the clustering shown in
Figure 4.</p>
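      <p>A minimal sketch of this T-SNE visualization, assuming clip-level embeddings and integer emotion labels (both placeholders here); the perplexity value is illustrative.</p>
      <preformat>
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.randn(848, 512)   # placeholder L3-Net clip embeddings
emotions = np.random.randint(0, 6, 848)  # placeholder RAVDESS emotion labels

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=emotions, cmap="tab10", s=8)
plt.show()
      </preformat>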
      <p>
        Accordingly, for the next step, a possible direction to
validate different classifiers is to explore a combination of
discrete neural learning methods, such as VQ-VAE, to first
solve the categorical problem, and only later learn a
smoother decision surface. VQ-VAE has been recently
explored for spectrogram-based music inpainting
        <xref ref-type="bibr" rid="ref8">(Bazin et al.
2020)</xref>
        . It would be interesting to explore similar high-level
parameterization using L3-Net embeddings.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Devin,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Irving,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; Isard,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ; et al.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          .
          <source>In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)</source>
          ,
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Abdoli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Cardinal,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Koerich</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. L.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>End-to-end environmental sound classification using a 1d convolutional neural network</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>136</volume>
          :
          <fpage>252</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Abu-El-Haija</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Kothari,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; Natsev,
          <string-name>
            <surname>P.</surname>
          </string-name>
          ; Toderici,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Varadarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ; and
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Youtube-8m: A large-scale video classification benchmark</article-title>
          .
          <source>arXiv preprint arXiv:1609</source>
          .
          <fpage>08675</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Arandjelovic</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Look, listen and learn</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <fpage>609</fpage>
          -
          <lpage>617</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and Chaspari,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Exploring siamese neural network architectures for preserving speaker identity in speech emotion classification</article-title>
          .
          <source>In Proceedings of the 4th International Workshop on Multimodal Analyses Enabling Artificial Agents in HumanMachine Interaction</source>
          ,
          <fpage>15</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Bazin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hadjeres</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Esling,
          <string-name>
            <surname>P.</surname>
          </string-name>
          ; and Malt,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Spectrogram Inpainting for Interactive Generation of Instrument Sounds</article-title>
          . URL https://boblsturm.github.io/aimusic2020/papers/ CSMC MuMe 2020 paper 49.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          2016.
          <article-title>A scheme of MIDI music emotion classification based on fuzzy theme extraction and neural network</article-title>
          .
          <source>In 2016 12th International Conference on Computational Intelligence and Security (CIS)</source>
          ,
          <fpage>323</fpage>
          -
          <lpage>326</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Cheuk</surname>
            ,
            <given-names>K. W.</given-names>
          </string-name>
          ; Luo, Y.-J.;
          <string-name>
            <surname>Balamurali</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Roig</surname>
          </string-name>
          , G.; and
          <string-name>
            <surname>Herremans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Regression-based music emotion prediction using triplet neural networks</article-title>
          .
          <source>In 2020 International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fazekas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Sandler,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Transfer learning for music classification and regression tasks</article-title>
          .
          <source>arXiv preprint arXiv:1703</source>
          .
          <fpage>09179</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; et al.
          <year>2015</year>
          . keras.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Cramer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Wu, H.-H.; Salamon, J.; and
          <string-name>
            <surname>Bello</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          <year>2019</year>
          . Look, Listen, and Learn More:
          <article-title>Design Choices for Deep Audio Embeddings</article-title>
          .
          <source>In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>3852</fpage>
          -
          <lpage>3856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>DCASE.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Detection and Classification of Acoustic Scenes and Events. Task 4: Sound Event Detection in Domestic Environments</article-title>
          . URL http://dcase.community/challenge2019/task-sound-event-detection-in-domestic-environments.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Bidirectional Convolutional Recurrent Sparse Network (BCRSN): An Efficient Model for Music Emotion Recognition</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>21</volume>
          (
          <issue>12</issue>
          ):
          <fpage>3150</fpage>
          -
          <lpage>3163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and Chaspari,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>A Siamese Neural Network with Modified Distance Loss For Transfer Learning in Speech Emotion Recognition</article-title>
          . arXiv preprint arXiv:2006.03001.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Gemmeke</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Freedman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lawrence</surname>
          </string-name>
          , W.;
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Plakal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Ritter,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Audio set: An ontology and human-labeled dataset for audio events</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>776</fpage>
          -
          <lpage>780</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Bengio,
          <string-name>
            <given-names>Y.</given-names>
            ; and
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Deep learning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Hamel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Eck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Learning features from music audio with deep belief networks</article-title>
          .
          <source>In ISMIR</source>
          , volume
          <volume>10</volume>
          ,
          <fpage>339</fpage>
          -
          <lpage>344</lpage>
          . Utrecht, The Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Hershey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gemmeke</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Plakal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Saurous,
          <string-name>
            <given-names>R. A.</given-names>
            ;
            <surname>Seybold</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          ; et al.
          <year>2017</year>
          .
          <article-title>CNN architectures for large-scale audio classification</article-title>
          .
          <source>In 2017 ieee international conference on acoustics, speech and signal processing (icassp)</source>
          ,
          <fpage>131</fpage>
          -
          <lpage>135</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gemmeke</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lawrence</surname>
          </string-name>
          , W.; and
          <string-name>
            <surname>Freedman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Large-scale audio event discovery in one million youtube videos</article-title>
          .
          <source>In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>786</fpage>
          -
          <lpage>790</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Plakal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pandya</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Ellis,
          <string-name>
            <given-names>D. P.</given-names>
            ;
            <surname>Hershey</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Liu,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. C.</surname>
          </string-name>
          ; and Saurous,
          <string-name>
            <surname>R. A.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Unsupervised learning of semantic audio representations</article-title>
          .
          <source>In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>126</fpage>
          -
          <lpage>130</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Urbano</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Liem,
          <string-name>
            <given-names>C.</given-names>
            ; and
            <surname>Hanjalic</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Are Nearby Neighbors Relatives? Testing Deep Music Embeddings</article-title>
          .
          <source>Frontiers in Applied Mathematics and Statistics</source>
          <volume>5</volume>
          :
          <fpage>53</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Nam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Sample-level CNN architectures for music auto-tagging using raw waveforms</article-title>
          .
          <source>In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          ,
          <fpage>366</fpage>
          -
          <lpage>370</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Migneco</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Morton</surname>
            ,
            <given-names>B. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Speck</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Turnbull</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Music emotion recognition: A state of the art review</article-title>
          .
          <source>In Proc. ISMIR</source>
          , volume
          <volume>86</volume>
          ,
          <fpage>937</fpage>
          -
          <lpage>952</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Enhancing music features by knowledge transfer from user-item log data</article-title>
          .
          <source>In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>386</fpage>
          -
          <lpage>390</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Exploration of Music Emotion Recognition Based on MIDI</article-title>
          .
          <source>In ISMIR</source>
          ,
          <fpage>221</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Music emotion recognition using a variant of recurrent neural network</article-title>
          .
          <source>In 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018)</source>
          . Atlantis Press.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Livingstone</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Russo</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English</article-title>
          .
          <source>PloS One</source>
          <volume>13</volume>
          (
          <issue>5</issue>
          ):
          <fpage>e0196391</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Madhok</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>SentiMozart: Music Generation based on Emotions</article-title>
          .
          <source>In ICAART (2)</source>
          ,
          <fpage>501</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Malheiro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gomes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Paiva</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Bi-modal music emotion recognition: Novel lyrical features and dataset</article-title>
          .
          <source>In 9th International Workshop on Music and Machine Learning (MML 2016)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>McFee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Raffel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ellis</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McVicar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Battenberg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Nieto</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>librosa: Audio and music signal analysis in python</article-title>
          .
          <source>In Proceedings of the 14th python in science conference</source>
          , volume
          <volume>8</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Malheiro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Paiva</surname>
            ,
            <given-names>R. P.</given-names>
          </string-name>
          2018a.
          <article-title>Musical Texture and Expressivity Features for Music Emotion Recognition</article-title>
          .
          <source>In ISMIR</source>
          ,
          <fpage>383</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Panda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Malheiro</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Paiva</surname>
            ,
            <given-names>R. P.</given-names>
          </string-name>
          2018b.
          <article-title>Novel audio features for music emotion recognition</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; et al.
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Piczak</surname>
            ,
            <given-names>K. J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Environmental sound classification with convolutional neural networks</article-title>
          .
          <source>In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>Pons</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Serra</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Randomly weighted CNNs for (music) audio classification</article-title>
          .
          <source>In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>336</fpage>
          -
          <lpage>340</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Robnik-Šikonja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kononenko</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Theoretical and empirical analysis of ReliefF and RReliefF</article-title>
          .
          <source>Machine Learning</source>
          <volume>53</volume>
          (
          <issue>1-2</issue>
          )
          :
          <fpage>23</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Core affect and the psychological construction of emotion</article-title>
          .
          <source>Psychological Review</source>
          <volume>110</volume>
          (
          <issue>1</issue>
          )
          :
          <fpage>145</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Salamon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bello</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Deep convolutional neural networks and data augmentation for environmental sound classification</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          <volume>24</volume>
          (
          <issue>3</issue>
          ):
          <fpage>279</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Soleymani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Caro</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sha</surname>
            ,
            <given-names>C.-Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.-H.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>1000 songs for emotional analysis of music</article-title>
          .
          <source>In Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Thao</surname>
            ,
            <given-names>H. T. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Herremans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Roig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Multimodal Deep Models for Predicting Affective Responses Evoked by Movies</article-title>
          .
          <source>In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)</source>
          ,
          <fpage>1618</fpage>
          -
          <lpage>1627</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Van den Oord</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schrauwen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Deep content-based music recommendation</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>2643</fpage>
          -
          <lpage>2651</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.-M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lanckriet</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>A histogram density modeling approach to music emotion recognition</article-title>
          .
          <source>In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>698</fpage>
          -
          <lpage>702</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>Weninger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>On-line continuous-time music mood regression with deep recurrent neural networks</article-title>
          .
          <source>In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>5412</fpage>
          -
          <lpage>5416</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>Wilkinghoff</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>On open-set classification with L3-Net embeddings for machine listening applications</article-title>
          .
          <source>In 28th European Signal Processing Conference (EUSIPCO)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Review of data features-based music emotion recognition methods</article-title>
          .
          <source>Multimedia Systems</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>365</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.-H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H. H.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Machine recognition of music emotion: A review</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>