Single and Multi Column Neural Networks for Content-based Music Genre Recognition

Chang Wook Kim², Jaehun Kim³, Kwangsub Kim¹, Minz Won¹*
¹ Kakao Corp., Republic of Korea
² Kakao Brain, Republic of Korea
³ Delft University of Technology, Netherlands
* Authors' names are listed alphabetically; the authors contributed equally to this work.

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
This working note reports the approaches of team KART to the MediaEval 2017 AcousticBrainz Genre Task and their results. To solve the problem, we mainly considered the sparsity and noise of the data, the network design for multi-label classification, and the implementation of successful Deep Neural Network (DNN) models. We propose three preprocessing steps and describe two different approaches: a single-column model and a multi-column model.

1 INTRODUCTION
A music genre is a class, type, or category defined by convention [6]. However, taxonomies of music genres can differ across communities. The MediaEval 2017 AcousticBrainz Genre Task aims to predict the genres and subgenres of unlabeled music recordings using four different datasets, which follow four different genre/subgenre taxonomies [1]. Each dataset includes music audio features precomputed with the Essentia library [2] and genre/subgenre annotations that follow its own taxonomy.

We approached the problem based on careful consideration of the following questions:
• How to handle the noisy and sparse data?
• How to solve the multi-label classification task?
• How to apply a variety of successful deep neural network models to our task?

2 PREPROCESSING
Before training the models, we conducted three feature preprocessing steps: (i) feature vectorization, (ii) fitting outlier feature values to the outlier boundaries, and (iii) selecting features by analyzing their value distributions.

Essentially, we tried to use the features as raw as possible. The underlying assumption is that a deep neural network can learn useful representations of raw data if there is a sufficient number of samples. We omitted all information under the 'metadata' keys. Further, we applied PCA to the covariance matrices of the filter banks; for computational convenience, we kept only the eigenvector corresponding to the largest eigenvalue. In addition, we encoded categorical features into binary vectors in a one-hot manner.

For some features, there are outliers whose values are extremely high or low compared to their medians (Figure 1, left). We suppressed or boosted these outlier values to the outlier boundaries. The lower and upper limits for judging outliers are defined as:

LowerLimit = FirstQuartile − 1.5 · IQR
UpperLimit = ThirdQuartile + 1.5 · IQR    (1)

where the interquartile range (IQR) is the difference between the third and the first quartile. Values larger than the upper limit were clipped to the upper limit, and values smaller than the lower limit were boosted to the lower limit. Figure 1 (right) shows the resulting distribution after fitting the data shown in Figure 1 (left).

[Figure 1: Box plots of High Frequency Content (HFC) mean for 30 genres before clipping (left) and after clipping (right).]

The upper and lower limits for the training set were derived from the value distributions of each feature and each genre. However, since the genre labels of the test set must not be used, we thresholded the test set at the maximum upper limit and the minimum lower limit found in the training set.

After the outlier fitting, we regarded features that concentrate around the same values for every genre as useless and removed 133 such features from the training and test sets.
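For concreteness, the following is a minimal sketch of the per-genre outlier fitting described above. It assumes the feature vectors are held in a pandas DataFrame with a genre column; the function names and the DataFrame layout are illustrative and not taken from our actual implementation.

```python
import numpy as np
import pandas as pd

def iqr_limits(values):
    """Return the lower/upper outlier boundaries of Eq. (1) for one feature."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def fit_outliers_train(df, feature_cols, genre_col="genre"):
    """Clip each feature to its per-genre IQR boundaries (training set)."""
    limits = {}  # (genre, feature) -> (lower, upper)
    df = df.copy()
    for genre, group in df.groupby(genre_col):
        for col in feature_cols:
            lo, hi = iqr_limits(group[col].to_numpy())
            limits[(genre, col)] = (lo, hi)
            df.loc[group.index, col] = group[col].clip(lo, hi)
    return df, limits

def fit_outliers_test(df, feature_cols, limits):
    """Test set: genre labels are unknown, so clip at the loosest training limits."""
    df = df.copy()
    for col in feature_cols:
        lows = [lo for (g, c), (lo, hi) in limits.items() if c == col]
        highs = [hi for (g, c), (lo, hi) in limits.items() if c == col]
        df[col] = df[col].clip(min(lows), max(highs))
    return df
```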
3 MODEL
We implemented two feed-forward neural networks (FNNs). The main difference between the two architectures is whether the label hierarchy between genre and sub-genre is considered explicitly. Since the provided input features are already processed, the models are designed to encode the interdependency among labels.

3.1 Single Column Model
As a baseline, we implemented a Single Column FNN (SCNN) whose output dimensions correspond to the entire set of labels. The label hierarchy between genre and sub-genre is not considered in this model; genres and sub-genres are treated equally as independent labels. We applied a weight vector w in the loss function to penalize more frequent labels less, as follows:

w_i = 1 + 1 / log(1 + f_i)    (2)

where f_i denotes the raw count of label i in the training dataset. In this way, the error from less frequent labels counts relatively more than that from more frequent labels, which leads the model to learn less frequent labels more sensitively. The loss function of this model is:

L_SCNN = (1/M) Σ_{m,i} w_i H(ŷ_i^{(m)}, y_i^{(m)})    (3)

where H denotes the binary cross-entropy between the true label and the prediction, y_i^{(m)} is the binary target for label i of observation m, ŷ_i^{(m)} is the prediction for label i of observation m, and M denotes the mini-batch size. The prediction ŷ^{(m)} is inferred as:

ŷ^{(m)} = f(x^{(m)}; θ)    (4)

where f(x^{(m)}; θ) denotes an FNN with model parameters θ, and x^{(m)} is the feature vector of observation m. We applied the ReLU [5] activation function to the hidden layers and the sigmoid function to the output layer. We used dropout [7] for every hidden layer with a dropout probability of 0.5, and applied batch normalization [4] to the hidden layers to accelerate training. The details are depicted in Figure 2.

During inference, we only predict labels whose probability exceeds a threshold α. We set α to 0.2 and 0.3 for the SCNN runs; these values were found to be optimal by cross-validation.

[Figure 2: The architectures of the suggested models. (left) The SCNN, which has three hidden fully-connected (FC) layers of 2000, 2000, and 1000 units and one output layer over all labels. (right) The MCNN, which has 5 highway blocks (HW) and an output layer in each column, predicting P(genre) and P(subgenre|genre). The number of units in each highway block is identical to the input dimensionality. The red dotted lines indicate the threshold for each model.]
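As an illustration of Eqs. (2)-(4), the sketch below implements the label weights, the weighted binary cross-entropy, and the thresholded prediction. The note does not name a framework, so PyTorch is assumed here; the layer sizes follow Figure 2, while everything else (names, training loop details) is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_weights(label_counts):
    """Eq. (2): w_i = 1 + 1 / log(1 + f_i), so rare labels weigh more."""
    f = torch.as_tensor(label_counts, dtype=torch.float32)
    return 1.0 + 1.0 / torch.log(1.0 + f)

class SCNN(nn.Module):
    """Three hidden FC layers (2000, 2000, 1000) with batch norm, ReLU, dropout."""
    def __init__(self, in_dim, n_labels, p_drop=0.5):
        super().__init__()
        dims = [in_dim, 2000, 2000, 1000]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(p_drop)]
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(dims[-1], n_labels)

    def forward(self, x):
        return torch.sigmoid(self.out(self.hidden(x)))  # Eq. (4)

def scnn_loss(y_hat, y_true, w):
    """Eq. (3): label-weighted binary cross-entropy, averaged over the batch.
    y_true is a float multi-hot matrix of shape (M, n_labels)."""
    bce = F.binary_cross_entropy(y_hat, y_true, reduction="none")
    return (w * bce).sum(dim=1).mean()

def predict(model, x, alpha=0.2):
    """Keep only labels whose predicted probability exceeds the threshold alpha."""
    model.eval()
    with torch.no_grad():
        return (model(x) > alpha).int()
```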
3.2 Multi Column Model
To explicitly reflect the label dependency between sub-genre and genre, we implemented a Multi Column FNN (MCNN). It has a parallel column for each of the genre and sub-genre label sets, and the two columns are merged by the Bayes rule on the top layer, as follows:

ŷ_g^{(m)} = f(x^{(m)}; θ_g)    (5)
ŷ_sg^{(m)} = ŷ_{g*}^{(m)} · f(x^{(m)}; θ_sg)    (6)

where ŷ_g^{(m)} and ŷ_sg^{(m)} denote the estimated probabilities of genre and sub-genre from each column. The posterior probability of sub-genre ŷ_sg^{(m)} is conditioned on ŷ_{g*}^{(m)}, where g* denotes the genre to which the sub-genre sg belongs. The loss function of this model is:

L_MCNN = (1/M) Σ_m [ w_g H(ŷ_g^{(m)}, y_g^{(m)}) + w_sg H(ŷ_sg^{(m)}, y_sg^{(m)}) ]    (7)

where w_g and w_sg are scalar weights that balance the learning of the genre column and the sub-genre column. We used a ratio of 1:9 between w_g and w_sg, considering that sub-genre labels are sparser than genre labels. We used batch normalization only after the input layer, and dropout was not applied. We used the thresholds α_sg = 0.25 for sub-genres and α_g = 0.4 for genres.

Assuming that the given feature set is already sufficiently processed, we also applied a highway network [8] architecture. It controls the gradient flow through a parametric gate at each layer, similar to the Long Short-Term Memory [3]. We applied this architecture not only to obtain a deeper structure, but also to let the network use information closer to the input features.
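A minimal sketch of the highway block and of the genre-conditioned sub-genre output of Eqs. (5)-(7) follows, again assuming PyTorch. The parent_of mapping from each sub-genre to its parent genre index, the class names, and the exact placement of batch normalization are illustrative assumptions rather than a record of our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighwayBlock(nn.Module):
    """Highway layer [8]: y = T(x) * H(x) + (1 - T(x)) * x, with gate T."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

class MCNN(nn.Module):
    """Two columns of highway blocks; the sub-genre output is multiplied by
    the probability of its parent genre, as in Eqs. (5)-(6)."""
    def __init__(self, in_dim, n_genres, n_subgenres, parent_of, depth=5):
        super().__init__()
        # parent_of[j] = index of the genre g* that sub-genre j belongs to.
        self.register_buffer("parent_of", torch.as_tensor(parent_of))
        self.genre_col = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            *[HighwayBlock(in_dim) for _ in range(depth)],
            nn.Linear(in_dim, n_genres))
        self.sub_col = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            *[HighwayBlock(in_dim) for _ in range(depth)],
            nn.Linear(in_dim, n_subgenres))

    def forward(self, x):
        y_genre = torch.sigmoid(self.genre_col(x))       # Eq. (5)
        y_sub = torch.sigmoid(self.sub_col(x))
        y_sub = y_genre[:, self.parent_of] * y_sub       # Eq. (6)
        return y_genre, y_sub

def mcnn_loss(y_g_hat, y_g, y_sg_hat, y_sg, w_g=0.1, w_sg=0.9):
    """Eq. (7) with the 1:9 genre/sub-genre weight ratio used in the note."""
    bce_g = F.binary_cross_entropy(y_g_hat, y_g, reduction="none").sum(dim=1)
    bce_sg = F.binary_cross_entropy(y_sg_hat, y_sg, reduction="none").sum(dim=1)
    return (w_g * bce_g + w_sg * bce_sg).mean()
```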
4 RESULTS AND ANALYSIS
A partial result of our runs and of the baselines obtained on the test set is presented in Table 1. The scores are the mean F1 per track and the mean F1 per label, averaged over all datasets. Baseline1 is a random predictor and Baseline2 is a majority predictor. The SCNN run presented in Table 1 uses the threshold α = 0.2.

Table 1: Mean F1 scores on the test set: F1 values averaged over all datasets.

Runs        F1_track   F1_label
Baseline1   0.1095     0.007
Baseline2   0.2378     0.003
SCNN        0.2526     0.0085
MCNN        0.1828     0.0084

Comparing the scores, we notice that the SCNN is better than the baselines overall, whereas the MCNN outperforms the baselines only in the per-label scores. However, the recall scores, which are not presented in this note due to space limitations, show that both suggested models achieve better recall than Baseline2. This indicates that both models predict sparse sub-genres better than the baselines, suggesting that our weighted losses work as intended.

Compared to the validation accuracy, the test accuracy is worse. Since the training and validation sets are skewed and sparse, our models failed to learn well-generalized parameters. Experiments with data augmentation should be explored to overcome this drawback.

The large size of the MCNN may be another reason for its worse test accuracy. A highway layer has about twice as many parameters as a standard fully-connected layer, and the MCNN has two columns to model the genre and sub-genre predictors separately. This structure makes the model roughly four times bigger than the SCNN. A multi-column architecture with fewer units and standard fully-connected layers may therefore be worth exploring.

REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Julian Urbano, and Hendrik Schreiber. 2017. MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources.
[2] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José R. Zapata, Xavier Serra, and others. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493–498.
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
[4] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning. 448–456.
[5] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[6] Jim Samson. 2017. Genre. In Grove Music Online. Oxford Music Online. Oxford University Press. http://www.oxfordmusiconline.com/subscriber/article/grove/music/40599.
[7] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[8] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway Networks. arXiv preprint arXiv:1505.00387 (2015).