Single and Multi Column Neural Networks for Content-based Music Genre Recognition

Chang Wook Kim², Jaehun Kim³, Kwangsub Kim¹, Minz Won¹*
¹ Kakao Corp., Republic of Korea
² Kakao Brain, Republic of Korea
³ Delft University of Technology, Netherlands
* Authors' names are listed alphabetically; the authors contributed equally to this work.

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
This working note reports the approaches of team KART to the MediaEval 2017 AcousticBrainz Genre Task and their results. To solve the problem, we mainly considered the sparsity and noise of the data, the network design for multi-label classification, and the implementation of successful Deep Neural Network (DNN) models. We propose three preprocessing steps and describe two different approaches: a single-column model and a multi-column model.

1 INTRODUCTION
A music genre is a class, type, or category defined by convention [6]. However, taxonomies of music genres can differ across communities. The MediaEval 2017 AcousticBrainz Genre Task aims to predict the genres and subgenres of unlabeled music recordings using four different datasets, which follow four different genre/subgenre taxonomies [1]. Each dataset includes music audio features precomputed with the Essentia library [2] and genre/subgenre annotations that follow its own taxonomy.

We approached the problem based on careful consideration of the following questions:
• How to handle the noisy and sparse data?
• How to solve the multi-label classification task?
• How to apply a variety of successful deep neural network models to our task?

2 PREPROCESSING
Before training the models, we conducted three feature preprocessing steps: (i) feature vectorization, (ii) fitting outlier feature values to the outlier boundaries, and (iii) selecting features by analyzing their value distributions.

Essentially, we tried to use the features as raw as possible. The underlying assumption is that a deep neural network can learn useful representations of raw data if there is a sufficient number of samples. We omitted all information under the 'metadata' keys. Further, we applied PCA to the covariance matrices of the filter banks; for computational convenience, we kept only the eigenvector corresponding to the largest eigenvalue. In addition, we encoded categorical features into binary vectors in a one-hot manner.

For some features, there are outliers whose values are extremely high or low compared to their medians (Figure 1, left). We suppressed or boosted these outlier values to the outlier boundaries. The lower and upper limits for judging outliers are defined as:

LowerLimit = FirstQuartile − 1.5 · IQR
UpperLimit = ThirdQuartile + 1.5 · IQR    (1)

where the interquartile range (IQR) is the difference between the third and the first quartile. Values larger than the upper limit were clipped to the upper limit, and values smaller than the lower limit were boosted to the lower limit. Figure 1 (right) shows the resulting distribution after fitting the data shown in Figure 1 (left).

[Figure 1: Box plots of High Frequency Content (HFC) mean for 30 genres before clipping (left) and after clipping (right).]

The upper and lower limits for the training set were derived from the value distributions of each feature and each genre. However, since the genre labels of the test set must not be used, we thresholded the test set at the maximum upper limit and the minimum lower limit found in the training set.

After the outlier fitting, we regarded features that concentrate around the same values for every genre as useless and removed 133 such features from the training and test sets.
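For concreteness, the following is a minimal sketch of the per-genre outlier fitting described above. It assumes the feature vectors are held in a pandas DataFrame with a genre column; the function names and the DataFrame layout are illustrative and not taken from our actual implementation.

```python
import numpy as np
import pandas as pd

def iqr_limits(values):
    """Return the lower/upper outlier boundaries of Eq. (1) for one feature."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def fit_outliers_train(df, feature_cols, genre_col="genre"):
    """Clip each feature to its per-genre IQR boundaries (training set)."""
    limits = {}  # (genre, feature) -> (lower, upper)
    df = df.copy()
    for genre, group in df.groupby(genre_col):
        for col in feature_cols:
            lo, hi = iqr_limits(group[col].to_numpy())
            limits[(genre, col)] = (lo, hi)
            df.loc[group.index, col] = group[col].clip(lo, hi)
    return df, limits

def fit_outliers_test(df, feature_cols, limits):
    """Test set: genre labels are unknown, so clip at the loosest training limits."""
    df = df.copy()
    for col in feature_cols:
        lows = [lo for (g, c), (lo, hi) in limits.items() if c == col]
        highs = [hi for (g, c), (lo, hi) in limits.items() if c == col]
        df[col] = df[col].clip(min(lows), max(highs))
    return df
```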
3 MODEL
We implemented two feed-forward neural networks (FNNs). The main difference between the two architectures is whether the label hierarchy between genre and sub-genre is considered explicitly. Since the provided input features are already processed, the models are designed to encode the interdependency among labels.

3.1 Single Column Model
As a baseline, we implemented a Single Column FNN (SCNN) whose output dimensions correspond to the entire set of labels. The label hierarchy between genre and sub-genre is not considered in this model; genres and sub-genres are treated equally as independent labels. We applied a weight vector w in the loss function to penalize more frequent labels less, as follows:

w_i = 1 + 1 / log(1 + f_i)    (2)

where f_i denotes the raw count of label i in the training dataset. In this way, the error from less frequent labels counts relatively more than that from more frequent labels, which leads the model to learn less frequent labels more sensitively. The loss function of this model is:

L_SCNN = (1/M) Σ_{m,i} w_i H(ŷ_i^{(m)}, y_i^{(m)})    (3)

where H denotes the binary cross-entropy between the true label and the prediction, y_i^{(m)} is the binary target for label i of observation m, ŷ_i^{(m)} is the prediction for label i of observation m, and M denotes the mini-batch size. The prediction ŷ^{(m)} is inferred as:

ŷ^{(m)} = f(x^{(m)}; θ)    (4)

where f(x^{(m)}; θ) denotes an FNN with model parameters θ, and x^{(m)} is the feature vector of observation m. We applied the ReLU [5] activation function to the hidden layers and the sigmoid function to the output layer. We used dropout [7] for every hidden layer with a dropout probability of 0.5, and applied batch normalization [4] to the hidden layers to accelerate training. The details are depicted in Figure 2.

During inference, we only predict labels whose probability exceeds a threshold α. We set α to 0.2 and 0.3 for the SCNN runs; these values were found to be optimal by cross-validation.

[Figure 2: The architectures of the suggested models. (left) The SCNN, which has three hidden fully-connected (FC) layers of 2000, 2000, and 1000 units and one output layer over all labels. (right) The MCNN, which has 5 highway blocks (HW) and an output layer in each column, predicting P(genre) and P(subgenre|genre). The number of units in each highway block is identical to the input dimensionality. The red dotted lines indicate the threshold for each model.]
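As an illustration of Eqs. (2)-(4), the sketch below implements the label weights, the weighted binary cross-entropy, and the thresholded prediction. The note does not name a framework, so PyTorch is assumed here; the layer sizes follow Figure 2, while everything else (names, training loop details) is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_weights(label_counts):
    """Eq. (2): w_i = 1 + 1 / log(1 + f_i), so rare labels weigh more."""
    f = torch.as_tensor(label_counts, dtype=torch.float32)
    return 1.0 + 1.0 / torch.log(1.0 + f)

class SCNN(nn.Module):
    """Three hidden FC layers (2000, 2000, 1000) with batch norm, ReLU, dropout."""
    def __init__(self, in_dim, n_labels, p_drop=0.5):
        super().__init__()
        dims = [in_dim, 2000, 2000, 1000]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(p_drop)]
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(dims[-1], n_labels)

    def forward(self, x):
        return torch.sigmoid(self.out(self.hidden(x)))  # Eq. (4)

def scnn_loss(y_hat, y_true, w):
    """Eq. (3): label-weighted binary cross-entropy, averaged over the batch.
    y_true is a float multi-hot matrix of shape (M, n_labels)."""
    bce = F.binary_cross_entropy(y_hat, y_true, reduction="none")
    return (w * bce).sum(dim=1).mean()

def predict(model, x, alpha=0.2):
    """Keep only labels whose predicted probability exceeds the threshold alpha."""
    model.eval()
    with torch.no_grad():
        return (model(x) > alpha).int()
```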
3.2 Multi Column Model
To explicitly reflect the label dependency between sub-genre and genre, we implemented a Multi Column FNN (MCNN). It has a parallel column for each of the genre and sub-genre label sets, and the two columns are merged by the Bayes rule on the top layer, as follows:

ŷ_g^{(m)} = f(x^{(m)}; θ_g)    (5)
ŷ_sg^{(m)} = ŷ_{g*}^{(m)} · f(x^{(m)}; θ_sg)    (6)

where ŷ_g^{(m)} and ŷ_sg^{(m)} denote the estimated probabilities of genre and sub-genre from each column. The posterior probability of sub-genre ŷ_sg^{(m)} is conditioned on ŷ_{g*}^{(m)}, where g* denotes the genre to which the sub-genre sg belongs. The loss function of this model is:

L_MCNN = (1/M) Σ_m [ w_g H(ŷ_g^{(m)}, y_g^{(m)}) + w_sg H(ŷ_sg^{(m)}, y_sg^{(m)}) ]    (7)

where w_g and w_sg are scalar weights that balance the learning of the genre column and the sub-genre column. We used a ratio of 1:9 between w_g and w_sg, considering that sub-genre labels are sparser than genre labels. We used batch normalization only after the input layer, and dropout was not applied. We used the thresholds α_sg = 0.25 for sub-genres and α_g = 0.4 for genres.

Assuming that the given feature set is already sufficiently processed, we also applied a highway network [8] architecture. It controls the gradient flow through a parametric gate at each layer, similar to the Long Short-Term Memory [3]. We applied this architecture not only to obtain a deeper structure, but also to let the network use information closer to the input features.
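A minimal sketch of the highway block and of the genre-conditioned sub-genre output of Eqs. (5)-(7) follows, again assuming PyTorch. The parent_of mapping from each sub-genre to its parent genre index, the class names, and the exact placement of batch normalization are illustrative assumptions rather than a record of our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighwayBlock(nn.Module):
    """Highway layer [8]: y = T(x) * H(x) + (1 - T(x)) * x, with gate T."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

class MCNN(nn.Module):
    """Two columns of highway blocks; the sub-genre output is multiplied by
    the probability of its parent genre, as in Eqs. (5)-(6)."""
    def __init__(self, in_dim, n_genres, n_subgenres, parent_of, depth=5):
        super().__init__()
        # parent_of[j] = index of the genre g* that sub-genre j belongs to.
        self.register_buffer("parent_of", torch.as_tensor(parent_of))
        self.genre_col = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            *[HighwayBlock(in_dim) for _ in range(depth)],
            nn.Linear(in_dim, n_genres))
        self.sub_col = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            *[HighwayBlock(in_dim) for _ in range(depth)],
            nn.Linear(in_dim, n_subgenres))

    def forward(self, x):
        y_genre = torch.sigmoid(self.genre_col(x))       # Eq. (5)
        y_sub = torch.sigmoid(self.sub_col(x))
        y_sub = y_genre[:, self.parent_of] * y_sub       # Eq. (6)
        return y_genre, y_sub

def mcnn_loss(y_g_hat, y_g, y_sg_hat, y_sg, w_g=0.1, w_sg=0.9):
    """Eq. (7) with the 1:9 genre/sub-genre weight ratio used in the note."""
    bce_g = F.binary_cross_entropy(y_g_hat, y_g, reduction="none").sum(dim=1)
    bce_sg = F.binary_cross_entropy(y_sg_hat, y_sg, reduction="none").sum(dim=1)
    return (w_g * bce_g + w_sg * bce_sg).mean()
```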
4 RESULTS AND ANALYSIS
A partial result of our runs and of the baselines obtained on the test set is presented in Table 1. The scores are the mean F1 per track and the mean F1 per label, averaged over all datasets. Baseline1 is a random predictor and Baseline2 is a majority predictor. The SCNN run presented in Table 1 uses the threshold α = 0.2.

Table 1: Mean F1 scores on the test set: F1 values averaged over all datasets.

Runs        F1_track   F1_label
Baseline1   0.1095     0.007
Baseline2   0.2378     0.003
SCNN        0.2526     0.0085
MCNN        0.1828     0.0084

Comparing the scores, we notice that the SCNN is better than the baselines overall, whereas the MCNN outperforms the baselines only in the per-label scores. However, the recall scores, which are not presented in this note due to space limitations, show that both suggested models achieve better recall than Baseline2. This indicates that both models predict sparse sub-genres better than the baselines, suggesting that our weighted losses work as intended.

Compared to the validation accuracy, the test accuracy is worse. Since the training and validation sets are skewed and sparse, our models failed to learn well-generalized parameters. Experiments with data augmentation should be explored to overcome this drawback.

The large size of the MCNN may be another reason for its worse test accuracy. A highway layer has about twice as many parameters as a standard fully-connected layer, and the MCNN has two columns to model the genre and sub-genre predictors separately. This structure makes the model roughly four times bigger than the SCNN. A multi-column architecture with fewer units and standard fully-connected layers may therefore be worth exploring.

REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Julian Urbano, and Hendrik Schreiber. 2017. MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources.
[2] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José R. Zapata, Xavier Serra, and others. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493–498.
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
[4] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning. 448–456.
[5] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[6] Jim Samson. 2017. Genre. In Grove Music Online. Oxford Music Online. Oxford University Press. http://www.oxfordmusiconline.com/subscriber/article/grove/music/40599.
[7] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[8] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway Networks. arXiv preprint arXiv:1505.00387 (2015).