ICSI in MediaEval 2017 Multi-Genre Music Task

Kijung Kim (University of California, Berkeley, CA, USA)
Jaeyoung Choi (International Computer Science Institute, Berkeley, CA, USA)
kijung@berkeley.edu, jaeyoung@icsi.berkeley.edu

MediaEval'17, 13-15 September 2017, Dublin, Ireland. Copyright held by the owner/author(s).

ABSTRACT
We present our approach and results for the MediaEval 2017 AcousticBrainz content-based music genre recognition task. Experimental results show that the best results come from a random forest with partial feature selection.

1 INTRODUCTION
The 2017 Content-based Music Genre Recognition from Multiple Sources task [1] consists of two subtasks: single-source classification and multiple-source classification. We focused on the first subtask, whose goal is to predict genres using a single source of ground truth, with broad genre categories as class labels. In the following sections, we describe our feature formulation, models, and experiments in detail.

2 TECHNICAL APPROACH
The proposed framework can be divided into three phases: (1) feature formulation, (2) standardization, and (3) model selection and prediction.

(1) Feature Formulation. The dataset provides each song with three groups of pre-extracted features: tonal, rhythm, and low-level. A feature vector for each song was formed by concatenating all the individual features from each group; features with summary statistics such as mean, max, and min were simply concatenated together. For the sake of simplicity, categorical features were not considered, so "key_key", "key_scale", "chords_key", and "chords_scale" were excluded. The "beats_position" feature was also excluded because its length varies from song to song, and we assumed that the "bpm" and "beats_count" features were sufficient. This resulted in a 2647-dimensional feature vector for each song.

(2) Standardization. We randomly sampled a subset of 100,000 songs from each dataset, formulated their feature vectors, and computed the mean and standard deviation of every index of the feature vector. At test time, each feature was then standardized using the pre-computed mean and standard deviation.

(3) Model Selection and Prediction. From scikit-learn [6], the two classifiers used in our approach were the Stochastic Gradient Descent (SGD) classifier with hinge loss and the Random Forest classifier with 16 estimators [2]. A binary classifier was trained for each genre and subgenre; their outputs were then combined, with the prediction for each genre/subgenre made independently. The first two runs concatenated all provided features (except those excluded above) and used the SGD classifier; the last three used random forests with partial feature selection.
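To make phases (1) and (2) concrete, the following is a minimal Python sketch, assuming each song's descriptors arrive as a nested JSON file with top-level groups "tonal", "rhythm", and "lowlevel"; the helper names and the sample_paths list are illustrative assumptions, not part of the task release:

    import json
    import numpy as np

    # Categorical descriptors and the variable-length beat positions are
    # excluded; all remaining numeric leaves are concatenated.
    EXCLUDED = {"key_key", "key_scale", "chords_key", "chords_scale",
                "beats_position"}

    def flatten(node, out):
        # Recursively collect numeric values in a stable (sorted-key) order,
        # so every song yields the same feature layout.
        for key in sorted(node):
            if key in EXCLUDED:
                continue
            value = node[key]
            if isinstance(value, dict):    # e.g. {"mean": ..., "max": ..., "min": ...}
                flatten(value, out)
            elif isinstance(value, list):  # fixed-length arrays such as MFCC means
                out.extend(float(v) for v in value)
            elif isinstance(value, (int, float)):
                out.append(float(value))
            # string-valued leaves are skipped

    def song_vector(path):
        with open(path) as f:
            doc = json.load(f)
        out = []
        for group in ("tonal", "rhythm", "lowlevel"):
            flatten(doc[group], out)
        return np.asarray(out)             # ~2647 values per song

    # Standardization: estimate mean/std once on a random 100,000-song
    # sample (sample_paths is a placeholder), then reuse them at test time.
    sample = np.stack([song_vector(p) for p in sample_paths])
    mu = sample.mean(axis=0)
    sigma = sample.std(axis=0) + 1e-12     # guard against zero-variance features

    def standardize(x):
        return (x - mu) / sigma

Sorting the keys is what keeps the concatenated vectors index-aligned across songs, which the per-index standardization relies on.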
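The per-label setup of phase (3) can be sketched as follows; the helper functions are our own illustration, not an API from the paper or scikit-learn:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import SGDClassifier

    def make_classifier(kind):
        # The two model families used in our runs.
        if kind == "sgd":
            return SGDClassifier(loss="hinge")           # linear SVM trained by SGD
        return RandomForestClassifier(n_estimators=16)   # RFC with 16 estimators

    def predict_labels(classifiers, x):
        # classifiers: dict mapping each genre/subgenre label to its own
        # fitted binary classifier; a song's prediction is the set of
        # labels whose classifier fires, each decided independently.
        x = x.reshape(1, -1)
        return {label for label, clf in classifiers.items()
                if clf.predict(x)[0] == 1}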
2.1 Stochastic Gradient Descent Classifier: Runs 1 and 2
In Run 1, each song was given a concatenated feature vector of all features except those excluded above, and the SGDClassifier was used. To accommodate the large dataset, batch training with batches of 80,000 songs was used (a sketch of this follows Section 2.2).

Run 2's feature formulation and model are the same as Run 1's; the difference is in the prediction process. The procedure was to trust the main-genre predictions and prune subgenre predictions accordingly. For example, suppose main genre A has subgenres B and C, and main genre D has subgenres E and F. If the classifiers label a song with main genre A and subgenres C and F, but not with main genre D, then subgenre F is discarded because its parent D was not predicted, and the final prediction is genre A with subgenre C. In short, subgenre predictions were ignored whenever their "parent" main genre was not predicted. This hierarchy weighs genre predictions above subgenre predictions and was intended to decrease the chance of false positives for subgenres (see the filtering sketch after Section 2.2).

2.2 Random Forest with Partial Feature Selection: Runs 3, 4, and 5
For the next three runs, we used a random forest classifier (RFC) with partial feature selection, based on the feature importances [4] of a trained RFC. We first took a subset of the training data (around 100,000 songs), formulated a concatenated feature vector for each song, and fit an RFC for each genre and subgenre. We then used the ranked feature importance list from each classifier to select the top x% of features, which yields a different set of best features for each genre and subgenre [5]. From there, we trained one RFC per genre and subgenre, each on its own top-x% features, using a subset of the training data (around 150,000 songs). Finally, genre predictions were made by combining the outputs of all the RFCs. Runs 3, 4, and 5 used the top 25%, 50%, and 75% of the features from the ranked importance list, resulting in 661-, 1323-, and 1985-dimensional feature vectors per song, respectively.
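Run 1's batch training can be realized with scikit-learn's partial_fit; a sketch for a single label (how our code actually chunked the data is an assumption):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def train_one_label(batches):
        # batches: iterable of (X, y) chunks of up to 80,000 songs, with
        # y in {0, 1} indicating membership in one genre or subgenre.
        clf = SGDClassifier(loss="hinge")
        first = True
        for X, y in batches:
            if first:
                # partial_fit must see the full class list on its first call
                clf.partial_fit(X, y, classes=np.array([0, 1]))
                first = False
            else:
                clf.partial_fit(X, y)
        return clf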
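Run 2's hierarchical pruning is straightforward to state in code; a sketch, assuming a parent map from each subgenre to its main genre:

    def filter_subgenres(predicted, parent):
        # Keep main genres (absent from `parent`) and any subgenre whose
        # parent main genre was itself predicted.
        return {label for label in predicted
                if label not in parent or parent[label] in predicted}

    # The example from Section 2.1: A and D are main genres, C is a
    # subgenre of A, F a subgenre of D.  D was not predicted, so F drops.
    parent = {"B": "A", "C": "A", "E": "D", "F": "D"}
    assert filter_subgenres({"A", "C", "F"}, parent) == {"A", "C"}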
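The two-stage selection of Runs 3-5 can be sketched per label as follows; the 16-estimator setting matches Section 2, while the exact ranking and training splits are assumptions:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def top_feature_indices(X_rank, y_rank, fraction):
        # Stage 1: fit an RFC on the ranking subset (~100,000 songs) and
        # keep the indices of the top `fraction` of features by importance.
        ranker = RandomForestClassifier(n_estimators=16).fit(X_rank, y_rank)
        order = np.argsort(ranker.feature_importances_)[::-1]
        k = int(fraction * X_rank.shape[1])  # 0.25/0.50/0.75 of 2647 -> 661/1323/1985
        return order[:k]

    def train_with_selection(X_rank, y_rank, X_train, y_train, fraction):
        # Stage 2: train the final per-label RFC on the training subset
        # (~150,000 songs), restricted to that label's selected columns.
        idx = top_feature_indices(X_rank, y_rank, fraction)
        clf = RandomForestClassifier(n_estimators=16).fit(X_train[:, idx], y_train)
        return clf, idx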
3 RESULTS AND ANALYSIS
In this section, we report accumulated results on the subtask for our two approaches (full results: https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/results/). Our results are reported in Table 1. The test set is composed of three different databases (Discogs, Lastfm, Tagtraum), and we averaged precision, recall, and F-score to obtain a single number per metric.

Table 1: Subtask 1 results (P = precision, R = recall, F = F-score; "all" = all labels, "gen" = main genres only, "sub" = subgenres only).

Per-track metrics:

    dataset   run    P_all   R_all   F_all   P_gen   R_gen   F_gen   P_sub   R_sub   F_sub
    discogs   Run 1  0.0109  0.65    0.0214  0.0989  0.791   0.172   0.006   0.5171  0.0117
    discogs   Run 2  0.0136  0.6083  0.0264  0.0989  0.791   0.172   0.0069  0.4322  0.0135
    discogs   Run 3  0.032   0.3337  0.0568  0.2059  0.4023  0.2475  0.0181  0.2735  0.0331
    discogs   Run 4  0.028   0.3621  0.0508  0.1725  0.4595  0.2326  0.0153  0.2823  0.0284
    discogs   Run 5  0.0216  0.3493  0.0401  0.0999  0.3943  0.1513  0.0136  0.3092  0.0257
    lastfm    Run 1  0.0065  0.4746  0.0129  0.0454  0.4887  0.0824  0.0048  0.4981  0.0096
    lastfm    Run 2  0.0097  0.3715  0.0187  0.0454  0.4887  0.0824  0.0054  0.2648  0.0106
    lastfm    Run 3  0.0385  0.2619  0.0637  0.0993  0.3053  0.1406  0.0274  0.219   0.0461
    lastfm    Run 4  0.0338  0.2711  0.0575  0.0883  0.3338  0.1331  0.0219  0.202   0.0379
    lastfm    Run 5  0.0273  0.277   0.0483  0.0691  0.3284  0.1098  0.0191  0.2187  0.0341
    tagtraum  Run 1  0.0114  0.5375  0.0222  0.0372  0.3977  0.0673  0.0095  0.6141  0.0187
    tagtraum  Run 2  0.0129  0.3272  0.0246  0.0371  0.3981  0.0673  0.0082  0.2519  0.0157
    tagtraum  Run 3  0.0487  0.2655  0.0782  0.1213  0.3366  0.1654  0.0318  0.2137  0.0528
    tagtraum  Run 4  0.0456  0.3201  0.0774  0.1276  0.4357  0.1876  0.0273  0.2432  0.0477
    tagtraum  Run 5  0.0336  0.2968  0.0589  0.0636  0.3376  0.1038  0.0259  0.2562  0.0458

Per-label metrics:

    dataset   run    P_all   R_all   F_all   P_gen   R_gen   F_gen   P_sub   R_sub   F_sub
    discogs   Run 1  0.0104  0.5702  0.0163  0.0963  0.7399  0.1471  0.0061  0.5617  0.0098
    discogs   Run 2  0.0105  0.4496  0.0164  0.0963  0.7399  0.1471  0.0062  0.435   0.0099
    discogs   Run 3  0.0247  0.2484  0.0349  0.1375  0.3149  0.1638  0.0191  0.2451  0.0285
    discogs   Run 4  0.0191  0.2585  0.0293  0.1095  0.351   0.1409  0.0146  0.2538  0.0237
    discogs   Run 5  0.016   0.2764  0.025   0.0889  0.3908  0.117   0.0123  0.2707  0.0204
    lastfm    Run 1  0.0089  0.5392  0.0105  0.0553  0.4076  0.0466  0.0042  0.5525  0.0068
    lastfm    Run 2  0.0117  0.268   0.0089  0.0553  0.4076  0.0466  0.0073  0.2539  0.0051
    lastfm    Run 3  0.0307  0.2075  0.0406  0.0926  0.2585  0.1026  0.0244  0.2023  0.0343
    lastfm    Run 4  0.0264  0.1978  0.0353  0.0815  0.2831  0.1049  0.0208  0.1892  0.0282
    lastfm    Run 5  0.0217  0.206   0.0311  0.0591  0.277   0.0795  0.0179  0.1989  0.0262
    tagtraum  Run 1  0.0129  0.5859  0.0185  0.0509  0.4937  0.0514  0.0084  0.5967  0.0146
    tagtraum  Run 2  0.0151  0.2915  0.018   0.0509  0.494   0.0513  0.011   0.2678  0.0141
    tagtraum  Run 3  0.0384  0.2099  0.0488  0.111   0.3349  0.1327  0.0299  0.1953  0.039
    tagtraum  Run 4  0.0326  0.2257  0.0443  0.0957  0.3291  0.1232  0.0253  0.2136  0.0351
    tagtraum  Run 5  0.0262  0.2324  0.0374  0.0722  0.3845  0.094   0.0208  0.2146  0.0307

We observe that the approaches based on Random Forest classifiers (Runs 3, 4, 5) outperform the SGD classifier approaches (Runs 1, 2). In particular, recall in Runs 1 and 2 is especially high while precision is especially low, which means the classifiers assigned almost every label to each song. For Runs 3, 4, and 5, we observe significantly lower recall with better precision.

Among Runs 3, 4, and 5 we observe a trend: adding more features improves recall at the cost of precision. Run 5 partially breaks this trend, however, as it improves recall only in the per-label results while performing worse on all metrics in the per-track results.

4 CONCLUSION
Runs 1 and 2 clearly suffered from oversampling, which led the classifiers for most genres to predict positive and thus produced high recall and low precision. Runs 3, 4, and 5 did not suffer in this way, but inspecting precision, recall, and F-scores per genre shows that their classifiers did far worse on unpopular genres and subgenres, which lowered overall precision and recall.

For Runs 1 and 2, the shortcomings came from errors in sampling; these errors were technical, originating in our code. For Runs 3, 4, and 5, the shortcomings came from the lack of a principled system for combining the results of the different classifiers. For one, we could have exploited the probabilities the model produces for each prediction to set a decision threshold per genre and subgenre, as sketched below; this would have helped especially for sparse subgenres.

For future work, it would be interesting to see whether selecting a smaller top percentage of features from the per-label feature importance lists further improves precision. It may also be worth trying majority voting over several random forest classifiers trained with different feature importance lists. Lastly, it would be interesting to try the imbalanced-learn package [3], which is compatible with scikit-learn and may fix the class imbalance affecting Runs 1 and 2.
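As one illustration of the per-label thresholding proposed above (not something we implemented for the submitted runs), a sketch using the RFC's predicted probabilities and held-out validation data:

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def best_threshold(clf, X_val, y_val):
        # Choose, per label, the probability cutoff that maximizes F-score
        # on validation data instead of the default 0.5 decision rule.
        proba = clf.predict_proba(X_val)[:, 1]
        precision, recall, thresholds = precision_recall_curve(y_val, proba)
        f1 = 2 * precision * recall / (precision + recall + 1e-12)
        return thresholds[np.argmax(f1[:-1])]  # the last P/R point has no threshold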
ACKNOWLEDGMENTS
This work was supported in part by AWS Research Grants.

REFERENCES
[1] Dmitry Bogdanov, Alastair Porter, Julian Urbano, and Hendrik Schreiber. 2017. The MediaEval 2017 AcousticBrainz Genre Task: Content-based music genre recognition from multiple sources. In Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.
[2] Ben Hoyle, Markus Michael Rau, Roman Zitlau, Stella Seitz, and Jochen Weller. 2015. Feature importance for machine learning redshifts applied to SDSS galaxies. Monthly Notices of the Royal Astronomical Society 449, 2 (2015), 1275-1283.
[3] Guillaume Lemaitre, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18, 17 (2017), 1-5.
[4] Gilles Louppe. 2014. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502 (2014).
[5] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. 2013. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems. 431-439.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.