MediaEval 2017 AcousticBrainz Genre Task: Multilayer Perceptron Approach

Khaled Koutini, Alina Imenina, Matthias Dorfer, Alexander Rudolf Gruber, Markus Schedl
Johannes Kepler University Linz, Austria
khaled.koutini@jku.at, markus.schedl@jku.at

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This report describes the approach developed by the JKU team for the MediaEval 2017 AcousticBrainz Genre Task. After experimenting with various classifiers on the development dataset, our final approach is based on multilayer perceptron classifiers.

1 INTRODUCTION
We present an approach for recognizing the genre of unknown music recordings given the data provided in the AcousticBrainz dataset [5]. Details about the data, task, and evaluation are described in [2].

Our work covers both subtasks of the MediaEval 2017 AcousticBrainz Genre Task. For the single-source classification subtask, a multilayer perceptron is applied to each source. For the multiple-source classification subtask, we use similarity measures between sources to adjust the probability of a recording belonging to a certain genre in each source.

2 APPROACH
We split the ground truth of each source, using the script provided by the organizers, into a training set and a validation set comprising 80% and 20% of the original data, respectively. The split also ensures that no recording from the same recording group appears in both the training and validation sets, in order to avoid the album effect.

2.1 Feature Selection
As stated in the overview paper [2], we are given for each recording a set of features extracted using Essentia [3]. Given the large number of provided features, a fine-grained manual inspection of individual features is not feasible. Instead, we pick broad feature groups high in the Essentia feature hierarchy: all low-level features, the rhythm features except beats_count and beats_position, and the tonal features except chords_key, chords_scale, key_key and key_scale. Overall, this yields 2646 numerical features per recording.

2.2 Neural Network
We tried various neural network architectures and compared their performance based on the mean label-wise F-score of batches, using the Lastfm dataset. The best performing architecture is outlined in Table 1.

Table 1: Model specifications. ReLU: Rectified Linear Unit [4]; k: the number of output units, which is the number of possible labels (unique genres + unique sub-genres). For training, a constant batch size of 500 samples and a learning rate of 0.001 are used.

  Input:         2646
  First layer:   4000 Dense(ReLU)    + Drop-Out(0.5)
                 4000 Dense(tanh)    + Drop-Out(0.5)
                 4000 Dense(sigmoid) + Drop-Out(0.5)
  Second layer:  Concat layer
                 8000 Dense + Drop-Out(0.6)
                 Batch-Normalization layer
                 Non-linearity (ReLU)
  Output layer:  k-bins sigmoid

2.2.1 Input layer. As stated in Section 2.1, there are 2646 input features. We normalize the input using z-score normalization.

2.2.2 First hidden layer. The first hidden layer is a dense layer consisting of 12000 units, where the first 4000 units have a rectified linear [4] activation function, the next 4000 units have a tanh activation function, and the last 4000 units have a sigmoid activation function. As shown in Table 1, each group of units is followed by a dropout layer with a dropout probability of 0.5.

2.2.3 Second hidden layer. The second hidden layer consists of 8000 batch-normalized rectified linear units. As input to this layer, we concatenate the outputs of the 3 groups of the first layer; the dense layer itself has no activation function or bias. We again apply dropout, this time with a probability of 0.6.

2.2.4 Output layer. The output layer consists of k units, where k is source-specific, denoting the number of labels (genres and sub-genres) of the source. The activation function of the output layer is sigmoid.

2.2.5 Loss function. We use mean binary cross-entropy as the loss function for the network.
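To make the architecture concrete, the following is a minimal sketch of the model in Table 1. The paper does not state which framework was used; PyTorch, the optimizer choice, and all names here (e.g., GenreMLP) are our assumptions.

    import torch
    import torch.nn as nn

    class GenreMLP(nn.Module):  # hypothetical name
        def __init__(self, n_features=2646, k=100):  # k is source-specific
            super().__init__()
            # First hidden layer: three parallel groups of 4000 units with
            # different activations, each followed by dropout (Section 2.2.2).
            self.group_relu = nn.Sequential(nn.Linear(n_features, 4000), nn.ReLU(), nn.Dropout(0.5))
            self.group_tanh = nn.Sequential(nn.Linear(n_features, 4000), nn.Tanh(), nn.Dropout(0.5))
            self.group_sigm = nn.Sequential(nn.Linear(n_features, 4000), nn.Sigmoid(), nn.Dropout(0.5))
            # Second hidden layer: dense with no activation or bias, then
            # dropout, batch normalization, and ReLU (Section 2.2.3, Table 1).
            self.second = nn.Sequential(
                nn.Linear(3 * 4000, 8000, bias=False),
                nn.Dropout(0.6),
                nn.BatchNorm1d(8000),
                nn.ReLU(),
            )
            # Output layer: k sigmoid units, one per label (Section 2.2.4).
            self.out = nn.Linear(8000, k)

        def forward(self, x):
            h = torch.cat([self.group_relu(x), self.group_tanh(x), self.group_sigm(x)], dim=1)
            return torch.sigmoid(self.out(self.second(h)))

    # Training setup per Table 1 and Section 2.2.5 (the optimizer is our assumption):
    # model = GenreMLP(k=k)
    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # loss = nn.BCELoss()(model(x_batch), y_batch)  # mean binary cross-entropy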
2.3 Adjusting Thresholds
The output of our neural network is k numerical values for each recording, as stated in Section 2.2.4. Each output lies in the range [0, 1] and represents the probability of the label (genre or sub-genre) corresponding to the respective output neuron. If the probability of a label for a given recording is larger than a predefined threshold, we assign that label to the recording. In our experiments, we found that using a threshold of 0.5 for all labels results in high precision but low recall. Since the goal of the task is to optimize precision, recall, and F-score, we adjusted the threshold for each label individually to obtain the best values for these evaluation measures. The best results are obtained either by using a static threshold of 0.2 or 0.3 for all labels, or by using a dynamic threshold for each label, estimated by maximizing the mean F-score.

Table 2 shows the evaluation results for the mentioned threshold setups on the Lastfm validation set, using the evaluation scripts provided by the task organizers [2].

Table 2: Lastfm validation set evaluation results. P: Precision, R: Recall, F: F-score.

                                     Threshold
  Average per                  0.2    0.3    dynamic
  track (all labels)      P    0.54   0.60   0.59
                          R    0.64   0.59   0.59
                          F    0.55   0.55   0.54
  track (genre labels)    P    0.69   0.71   0.70
                          R    0.79   0.76   0.73
                          F    0.71   0.71   0.70
  label (all labels)      P    0.39   0.46   0.44
                          R    0.36   0.32   0.35
                          F    0.35   0.35   0.37
  label (genre labels)    P    0.51   0.58   0.60
                          R    0.58   0.53   0.53
                          F    0.54   0.54   0.56
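As an illustration of the dynamic threshold estimation, the sketch below selects, for each label independently, the threshold that maximizes that label's F-score on the validation set. The candidate grid and the exhaustive search are our assumptions; the paper only states that dynamic thresholds are estimated by maximizing the mean F-score.

    import numpy as np

    def dynamic_thresholds(probs, truth, candidates=np.arange(0.05, 1.0, 0.05)):
        # probs: (n_recordings, k) network outputs on the validation set
        # truth: (n_recordings, k) binary ground-truth label matrix
        # candidates: hypothetical search grid (not specified in the paper)
        thresholds = np.full(probs.shape[1], 0.5)
        for j in range(probs.shape[1]):
            best_f = -1.0
            for t in candidates:
                pred = probs[:, j] >= t
                tp = np.sum(pred & (truth[:, j] == 1))
                fp = np.sum(pred & (truth[:, j] == 0))
                fn = np.sum(~pred & (truth[:, j] == 1))
                p = tp / (tp + fp) if tp + fp > 0 else 0.0
                r = tp / (tp + fn) if tp + fn > 0 else 0.0
                f = 2 * p * r / (p + r) if p + r > 0 else 0.0
                if f > best_f:
                    best_f, thresholds[j] = f, t
        return thresholds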
2.4 Combining Different Sources
The second part of the task [2] consists of combining information from multiple sources to predict the labels of one source. To achieve this, we calculate the similarity between every label of one source and every label of all other sources, in order to adjust the probability of assigning a label of one source to a recording using the label probabilities produced by models trained on the other sources. As stated in the overview paper [2], the datasets of the different sources intersect. We exploit this intersection to estimate the similarity between the labels of different sources.

Labels are modeled as vectors, where each label is represented by the vector of recordings annotated with it in the ground truth. The similarity between labels from different sources is measured as the cosine similarity between these label vectors. Based on that, we compute similarity matrices M_{i,j} between pairs of sources, where the element at position (a, b) holds the similarity of label a of source i and label b of source j. We use these pairwise similarities as conversion matrices to project probabilities produced by a model trained on one source onto the labels of another source.

For a specific recording, the probabilities P_i of the labels of source i form a vector of length n_i, produced by the model trained on the training set of source i. To also make use of the models trained on other sources, we compute P_j · M_{j,i}, which is a vector of the same length n_i and likewise represents probabilities for the labels of source i; however, this vector is produced by the model trained on source j, by projecting the probabilities P_j through the respective conversion matrix. The final label probabilities (task 2) for a specific recording of source k are the weighted average of the label probabilities produced by the model trained on the recording's source training set and the projected label probabilities of all other sources, as given in Equation (1):

  Y_k = (P_k + (1/3) Σ_{i≠k} P_i · M_{i,k}) / 2    (1)
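The following sketch shows one way to implement this projection under our reading of the text: annotation matrices are restricted to the recordings shared by two ground truths, conversion matrices hold the pairwise cosine similarities of the label columns, and Equation (1) averages a source's own predictions with the mean of the three projected predictions. All function and variable names are hypothetical.

    import numpy as np

    def conversion_matrix(ann_i, ann_k):
        # ann_i: (n_shared, n_i) and ann_k: (n_shared, n_k) binary annotation
        # matrices of sources i and k, restricted to the recordings that both
        # ground truths share. Element (a, b) of the result is the cosine
        # similarity of label a of source i and label b of source k.
        u = ann_i / (np.linalg.norm(ann_i, axis=0, keepdims=True) + 1e-12)
        v = ann_k / (np.linalg.norm(ann_k, axis=0, keepdims=True) + 1e-12)
        return u.T @ v  # shape (n_i, n_k)

    def combined_probabilities(P, M, k):
        # P: dict mapping each source name to its model's probability vector
        #    for one recording; M: dict mapping (i, k) to conversion matrices.
        # Implements Equation (1) for target source k.
        projected = [P[i] @ M[(i, k)] for i in P if i != k]
        return (P[k] + sum(projected) / len(projected)) / 2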
3 RESULTS AND ANALYSIS
Table 3 shows the evaluation results on the validation sets of the different sources, using models trained on the respective training sets with dynamic thresholds, as produced by the evaluation scripts provided by the task organizers [2]. We can clearly observe that predicting sub-genre labels is harder than predicting genre labels, which might be a result of the fewer training examples for those sub-genres in the dataset.

Table 3: Validation set evaluation results of the 4 sources using dynamic thresholds (Section 2.3). P: Precision, R: Recall, F: F-score.

                                          Source
  Average per                  Allmusic  Tagtraum  Lastfm  Discogs
  track (all labels)      P    0.53      0.53      0.59    0.50
                          R    0.56      0.59      0.59    0.62
                          F    0.48      0.53      0.54    0.51
  track (genre labels)    P    0.71      0.72      0.70    0.78
                          R    0.74      0.76      0.73    0.80
                          F    0.70      0.72      0.70    0.76
  label (all labels)      P    0.49      0.32      0.44    0.28
                          R    0.36      0.32      0.35    0.29
                          F    0.40      0.30      0.37    0.27
  label (genre labels)    P    0.56      0.55      0.60    0.59
                          R    0.48      0.51      0.53    0.54
                          F    0.51      0.52      0.56    0.56

We submitted 3 runs for the first task [2]: two runs using static thresholds of 0.2 and 0.3, and one run using dynamic thresholds as described in Section 2.3. We also submitted 5 runs for the second task [2]: two runs identical to the static-threshold runs of task 1, and 3 runs based on the probabilities calculated as described in Section 2.4, using static thresholds of 0.2 and 0.3 as well as dynamic thresholds. Table 4 summarizes the official F-score results [1] of our best run for each source.

Table 4: Official results [1] (F-score).

                                          Source
  Average per                  Allmusic  Tagtraum  Lastfm  Discogs
  track (all labels)           0.43      0.53      0.55    0.51
  track (genre labels)         0.67      0.73      0.72    0.77
  label (all labels)           0.20      0.28      0.35    0.25
  label (genre labels)         0.41      0.50      0.55    0.56

REFERENCES
[1] D. Bogdanov, A. Porter, J. Urbano, and H. Schreiber. 2017. Official results of the MediaEval 2017 AcousticBrainz Genre Task. https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/results/. [Online; accessed 09-September-2017].
[2] D. Bogdanov, A. Porter, J. Urbano, and H. Schreiber. 2017. The MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In Working Notes Proceedings of the MediaEval 2017 Workshop. Dublin, Ireland.
[3] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra. 2013. Essentia: An Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval (ISMIR'13) Conference. Curitiba, Brazil, 493–498.
[4] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[5] A. Porter, D. Bogdanov, R. Kaye, R. Tsukanov, and X. Serra. 2015. AcousticBrainz: A Community Platform for Gathering Music Information Obtained from Audio. In Proceedings of the 16th International Society for Music Information Retrieval Conference. Malaga, Spain, 786–792.