Hierarchical Multilabel Classification and Voting for Genre Classification Benjamin Murauer, Maximilian Mayerl, Michael Tschuggnall, Eva Zangerle, Martin Pichl, Günther Specht University of Innsbruck, Austria firstname.lastname@uibk.ac.at ABSTRACT Multilabel classification. The fact that any track may feature mul- This paper summarizes our contribution (team DBIS) to the Acous- tiple genres and subgenres complicates the classification problem, ticBrainz Genre Task: Content-based music genre recognition from since not all classification algorithm inherently support multilabel multiple sources as part of MediaEval 2017. We utilize a hierarchical classification. We solved this problem by applying the one-vs.-the- set of multilabel classifiers to predict genres and subgenres and rely rest strategy, effectively training a separate binary classifier for on a voting scheme to predict labels across datasets. every label. Different genre labels across data sets. As subtask 2 allows to combine all datasets for training, the (vastly) differing genre labels 1 INTRODUCTION used in the four available training sets posed a challenge. We tack- In the MediaEval AcousticBrainz Genre Task, the goal is to classify led this problem by computing a direct mapping between the main tracks into main and subgenres, using content-based features com- class labels of all training sets aiming to find equivalent genre labels puted with Essentia [2] and collected by AcousticBrainz [11]. Four across all datasets. Therefore, we applied the Levenshtein string dis- separate training and test sets of tracks were provided, stemming tance measure [9] (as previously used for e.g., entity matching [6]) from four different sources (AllMusic, Discogs, Lastfm, and Tag- to find all labels with a distance of at most 1. This slightly fuzzy traum). The task features two subtasks, which differ in the amount matching approach allows us to neglect minor syntactic differences of data that can be used for solving them: In subtask 1, only training in the labels (e.g., hip hop vs. hiphop). Preliminary experiments and data from the same source as the current test data may be used for manual inspection showed that this allows to increase the number the classification; in subtask 2, all provided datasets can be utilized of matching labels while still avoiding false positives. We did not for training. However, the evaluation is performed on a per-dataset match sub-genres, as our experiments showed that those diverged basis. Further details can be found in [1]. to a far greater extent. Classification Algorithms. We implemented our solution with 2 CLASSIFICATION AND CHALLENGES two different classification methods2 : (1) a linear C-support vector There are multiple factors that make the posed task difficult to machine [12] and (2) an extra-trees classifier [7]. In addition, multi- solve, particularly the large amount of data to handle and the mul- layer neural networks, that are known to work well for this task tilabel nature of the classification problem make the tasks highly (c.f. [4, 5, 8]), and extreme gradient boosting [3] showed promising challenging. Subtask 2 is further complicated by the fact that genre preliminary results, but were deemed infeasible due to the compu- and subgenre labels are hardly consistent across the four provided tational resources required to train full-scale models. training sets, hence providing a heterogeneous set of labels. In the following, we firstly sketch our approach to mitigate these 2.1 Subtask 1 difficulties. Next, we detail the classification approaches we used throughout subtasks 1 and 2 and lastly, present the obtained results. Train Classifier For Main Genres We make our implementation available for reproducibility and for promoting research in this direction1 . Train separate subgenre classifier for each main genre Reducing the amount of data. To reduce the amount of data and make the task computationally feasible within the limited time Predict main genres for each track frame, we at first skipped detailed features describing low level energy bands of the energy spectrum and verified on a preliminary Predict subgenres for each main genre predicted for each track basis that the respective central moments are sufficient in terms of classification accuracy. This allowed us to reduce the number Fallback: Assign most popular main genre to tracks with no predicted label of features used for training the genre classifiers to 395 (from over 3,000 features originally provided). The full list of features can be Figure 1: Classification workflow for subtask 1. found in our GitHub repository1 . 1 https://github.com/dbis-uibk/MusicGenreClassification The workflow underlying our approach for subtask 1 is outlined in Figure 1. First, we train one classifier for main genres and a separate classifier for each main genre’s subgenres. After that, we Copyright held by the owner/author(s). MediaEval 2017 Workshop, Sept. 13-15, 2017, Dublin, Ireland. 2 We relied on the python library scikit-learn [10] for implementing the machine learning parts of the tasks. MediaEval 2017 Workshop, Sept. 13-15, 2017, Dublin, Ireland. B. Murauer, M. Mayerl, M. Tschuggnall, E. Zangerle, M. Pichl, G. Specht utilize the main genre classifier to predict the main genres of every was predicted by 75% of the classifiers and is retained (run #4); track in the test set. Following that, for every track in the test set (2) double the weight of the prediction of the classifier trained and every main genre predicted for that track, the corresponding specifically on the training set corresponding to the current test set subgenre classifier is used to predict the subgenre labels for the and retain genres predicted by at least 60% of the usable classifiers track. Lastly, as it is possible in multilabel classification that no (e.g., if we did predictions for the Lastfm test set and the Lastfm label is assigned to a track (i.e., if every binary classifier predicts and Discogs classifiers predicted rock/pop, then that label was a ’no’ for its respective label), we apply a ‘most popular genre’ assigned three votes out of five (i.e., 60%) and retained (run #5)). fallback approach and assign the most common main genre label This puts more emphasis on the predictions of the training set and for the respective dataset to ensure that each track is assigned a hence, classifier that is trained on the naturally best training data main genre. (stemming from the same data source as the current test set). To exploit the possibility to submit five submission runs, this Prediction of subgenres and handling of tracks with no predicted basic approach was implemented with the following configurations labels was handled the same way as in subtask 1. For this subtask, of classification algorithms for main and subgenre, which are also support vector machines were used as classifiers, with C = 10.0 and listed in Table 1: balanced class weights as determined in preliminary experiments. • Run #1 uses a SVM with C = 1.0 and no class weight balancing for thepmain genre classifier; an extra-trees classifier with 50 3 RESULTS AND OUTLOOK trees, | f eatures | features considered when searching for the The results of the evaluation of our approach for subtasks 1 and best split and balanced class weights for the subgenre classifiers. 2 can be found in Tables 2 and 3, respectively. Table 2 shows the results of run #3, which provided the best overall performance • Run #2 uses a SVM with C = 1.0 and balanced class weights for in both subtasks. Table 3 contains the results of run #5, which the p main genre classifier; an extra-trees classifier with 50 trees, performed better in some measures (in bold font) compared to run | f eatures | features considered when searching for the best #3 in subtask 2. Due to space limitations, the results of the other split and balanced class weights for the subgenre classifiers. runs are omitted. • Run #3 includes a SVM with C = 10.0 and balanced class weights Possible improvements of the presented approaches include dif- for the main- and subgenre classifiers. ferent classifying methods such as deep neural networks and a The C value for the SVMs was selected after a grid search on more detailed feature selection process. These steps were rendered a smaller test set of 10,000 randomly sampled tracks. The chosen impossible due to time constraints and technical limitations of the amount of features and trees for the extra trees classifier was a trade available hardware. off between classification runtime and accuracy, as more features would possibly have provided more accuracy. For runs #4 and #5, Table 1: Submitted Runs the results of run #3 were used. Run # Subtask 1 Subtask 2 1 unbalanced SVM + ET unbalanced SVM + ET 2.2 Subtask 2 2 balanced SVM + ET balanced SVM + ET For subtask 2, the set of all provided datasets could be utilized to 3 bal. SVM + bal. SVM bal. SVM + bal. SVM classify each of the four test sets. We chose to implement this using 4 bal. SVM + bal. SVM bal. SVM + bal. SVM + voting 50 a voting mechanism. First, SVMs as main genre classifiers were 5 bal. SVM + bal. SVM bal. SVM + bal. SVM + voting 60 trained as in subtask 1, independently for every training set. These classifiers were then used to predict the main genres of a given Table 2: F-scores for subtask 1 with run # 3. track as follows: Goal AllMusic Discogs Lastfm Tagtraum (1) Predict the main genres of the track with all four classifiers. Per Track (all) 0.249 0.374 0.340 0.363 (2) Utilize the genre mapping as described above to map the pre- Per Track (genre) 0.587 0.680 0.512 0.478 dicted genres to the genre labels of the current test set (other- Per Track (subgenre) 0.193 0.219 0.251 0.303 wise, the predicted labels would not be compatible and hence, Per Label (all) 0.070 0.144 0.155 0.153 create false positives). Thereby, classification results where no Per Label (genre) 0.266 0.441 0.313 0.345 class label was contained in (or could be mapped to) the test Per Label (subgenre) 0.065 0.129 0.139 0.131 set were discarded. (3) For every genre predicted by any of the four classifiers, count Table 3: F-scores for subtask 2 with run # 5. Bold numbers the number of classifiers that predicted this genre (using two mark better results than in subtask 1. different weighing schemes) and weigh this by the number of Goal AllMusic Discogs Lastfm Tagtraum classifiers that produced a usable result. Per Track (all) 0.183 0.426 0.366 0.401 To arrive at the final set of main genres for every track, we Per Track (genre) 0.516 0.668 0.523 0.629 applied two different variants, which can be seen in Table 1 for runs Per Track (subgenre) 0.065 0.014 0.056 0.166 #4 and #5: (1) weigh every prediction equally and retain genres Per Label (all) 0.019 0.026 0.055 0.059 predicted by at least 50% of the usable classifiers—for example, if Per Label (genre) 0.230 0.395 0.309 0.272 Per Label (subgenre) 0.013 0.008 0.030 0.034 three of the four classifiers predict the label rock/pop, that label Multi-Genre Music MediaEval 2017 Workshop, Sept. 13-15, 2017, Dublin, Ireland. REFERENCES [1] Dmitry Bogdanov, Alastair Porter, Juliàn Urbano, and Hendrik Schreiber. 2017. The MediaEval 2017 AcousticBrainz Genre Task: Content-based Music Genre Recognition from Multiple Sources. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13-15, 2017. [2] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Per- fecto Herrera, O. Mayor, Gerard Roma, Justin Salamon, J. R. Zapata, and Xavier Serra. 2013. ESSENTIA: an Audio Analysis Library for Music Information Retrieval. In International Society for Music In- formation Retrieval Conference (ISMIR’13). Curitiba, Brazil, 493–498. http://hdl.handle.net/10230/32252 [3] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 785–794. [4] Sander Dieleman, Philemon Brakel, and Benjamin Schrauwen. 2011. Audio-based Music Classification with a Pretrained Convolutional Network.. In In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011). 669–674. [5] Shyamala Doraisamy, Shahram Golzari, Noris Mohd. Norowi, Md Nasir Sulaiman, and Nur Izura Udzir. 2008. A Study on Feature Selection and Classification Techniques for Automatic Genre Classification of Traditional Malay Music. In In Proceedings of the 9th International Society for Music Information Retrieval Conference (ISMIR 2008). 331– 336. [6] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (Jan 2007), 1–16. https://doi.org/10.1109/TKDE.2007. 250581 [7] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine learning 63, 1 (2006), 3–42. [8] Arijit Ghosal, Rudrasis Chakraborty, Bibhas Chandra Dhara, and San- joy Kumar Saha. 2015. Perceptual feature-based song genre clas- sification using RANSAC. International Journal of Computational Intelligence Studies 4, 1 (2015), 31–49. [9] Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, Vol. 10. 707–710. [10] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Pret- tenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830. [11] Alastair Porter, Dmitry Bogdanov, Robert Kaye, Roman Tsukanov, and Xavier Serra. 2015. AcousticBrainz: a community platform for gathering music information obtained from audio. In 16th International Society for Music Information Retrieval Conference (ISMIR 2015). Malaga, Spain, 786–792. http://dblp.org/rec/html/conf/ismir/PorterBKTS15 [12] Ting-Fan Wu, Chih-Jen Lin, and Ruby C Weng. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, Aug (2004), 975–1005.