<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ICSI in MediaEval 2017 Multi-Genre Music Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kijung Kim</string-name>
          <email>kijung@berkeley.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaeyoung Choi</string-name>
          <email>jaeyoung@icsi.berkeley.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Computer Science Institute</institution>
          ,
          <addr-line>Berkeley, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Berkeley, CA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We present our approach and results for the MediaEval 2017 AcousticBrainz content-based music genre recognition task. Experimental results show that the best performance comes from a random forest classifier with partial feature selection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The 2017 Content-based music genre recognition from multiple
sources task [1] consists of two subtasks: single-source
classification and multiple-source classification. We focused on the first
subtask, whose goal was to predict genres using a single source
of ground truth with broad genre categories as class labels. In the
following sections, we describe our feature formulation, models,
and experiments in detail.</p>
    </sec>
    <sec id="sec-2">
      <title>TECHNICAL APPROACH</title>
      <p>The proposed framework can be divided into three phases: (1)
feature formulation, (2) standardization, and (3) model selection
and prediction.</p>
      <p>(1) Feature Formulation. The dataset provides each song
with three groups of pre-extracted features: tonal, rhythm, and
low-level. A feature vector for each song was formed as a
concatenation of all the individual features from each group. Features
with specific statistic labels such as mean, max, and min were simply
concatenated together. For the sake of simplicity, categorical
features were not considered; the excluded features are "key_key",
"key_scale", "chords_key", and "chords_scale". The "beats_position"
feature was also excluded, as its length varies from song to song, and
we assumed that the "bpm" and "beats_count" features were
sufficient. This resulted in a 2647-dimensional feature vector per
song.</p>
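      <p>The concatenation step can be sketched as follows. This is a minimal illustration, not the run code: the nested dict mimics the AcousticBrainz JSON layout, and the key names other than the excluded ones are made up for the example.</p>

```python
# Flatten a nested feature dict into a fixed-order vector, skipping the
# excluded categorical / variable-length features described above.

EXCLUDED = {"key_key", "key_scale", "chords_key", "chords_scale",
            "beats_position"}

def flatten(features, prefix=""):
    """Recursively flatten a nested feature dict into (name, value) pairs."""
    out = []
    for key in sorted(features):          # fixed order across songs
        if key in EXCLUDED:
            continue
        value = features[key]
        name = prefix + key
        if isinstance(value, dict):       # e.g. {"mean": ..., "max": ...}
            out.extend(flatten(value, name + "."))
        elif isinstance(value, list):     # fixed-length vectors
            out.extend((name + "[%d]" % i, float(v))
                       for i, v in enumerate(value))
        else:
            out.append((name, float(value)))
    return out

song = {
    "rhythm": {"bpm": 120.0, "key_key": "A"},  # "key_key" is skipped
    "lowlevel": {"spectral_centroid": {"mean": 950.2, "max": 4100.0}},
}
vector = [v for _, v in flatten(song)]    # [4100.0, 950.2, 120.0]
```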
      <p>(2) Standardization. We randomly sampled a subset of 100,000
songs from each dataset, formulated the feature vectors, and computed
the mean and standard deviation for every index of the feature
vector. Then, at test time, each feature was standardized using
the pre-computed mean and standard deviation.</p>
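      <p>A minimal sketch of this two-step standardization, assuming plain Python lists as feature vectors (the runs operated on much larger arrays):</p>

```python
# Estimate per-index mean and standard deviation on a random sample of
# songs, then reuse those statistics to standardize any later vector.
import math
import random

def fit_standardizer(sample_vectors):
    """Per-index mean and std over a sample of feature vectors."""
    n = len(sample_vectors)
    dim = len(sample_vectors[0])
    means = [sum(v[i] for v in sample_vectors) / n for i in range(dim)]
    stds = [math.sqrt(sum((v[i] - means[i]) ** 2 for v in sample_vectors) / n)
            or 1.0                        # guard against constant features
            for i in range(dim)]
    return means, stds

def standardize(vector, means, stds):
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

random.seed(0)
sample = [[random.gauss(5.0, 2.0), random.uniform(0.0, 1.0)]
          for _ in range(1000)]           # stand-in for the 100,000 songs
means, stds = fit_standardizer(sample)
z = standardize(sample[0], means, stds)   # test-time usage
```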
      <p>(3) Model Selection and Predictions. From scikit-learn [6],
the two classifiers used in our approach were the Stochastic Gradient
Descent (SGD) classifier with hinge loss and the Random Forest
classifier with 16 estimators [2]. A binary classifier was trained for
each genre/subgenre; the results were then aggregated, and the
prediction for each genre/subgenre was made independently.</p>
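      <p>The per-label setup can be sketched with the scikit-learn classifiers named above. This is an illustrative toy, not the run code: the data is synthetic and the label names are invented.</p>

```python
# One independent binary classifier per genre/subgenre; predictions are
# aggregated by querying every classifier for each song.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # 200 songs, 8 features
labels = ["rock", "rock---indie", "jazz"]     # hypothetical label set
Y = {g: (X[:, i] > 0).astype(int) for i, g in enumerate(labels)}

models = {}
for genre, y in Y.items():
    clf = RandomForestClassifier(n_estimators=16, random_state=0)
    # The SGD variant would instead use SGDClassifier(loss="hinge").
    models[genre] = clf.fit(X, y)

# Each label is predicted independently for a given song.
song = X[:1]
predicted = [g for g, clf in models.items() if clf.predict(song)[0] == 1]
```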
      <p>The first two Runs consisted of concatenating all the provided
features (except the ones mentioned above) and using the SGD
classifier.</p>
    </sec>
    <sec id="sec-2-1">
      <title>Stochastic Gradient Descent Classifier: Run 1 and 2</title>
      <p>In Run 1, each song was represented by a concatenated feature
vector of all features minus the ones mentioned above, fed to the
SGDClassifier. To accommodate the large dataset, batch training with
batches of 80,000 songs was used.</p>
      <p>Run 2 used the same feature formulation and model as Run 1; the
difference lies in the prediction process. Subgenre predictions were
ignored if their "parent" main genre was not predicted. For example,
given main genre A with subgenres B and C, and main genre D with
subgenres E and F: if the classifiers labeled a song as main genre A
with subgenres C and F but did not predict main genre D, subgenre F
was dropped, yielding a final prediction of genre A with subgenre C.
This hierarchy weighs main-genre predictions above subgenre
predictions and was intended to decrease the chance of false positives
for subgenres.</p>
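      <p>Run 2's rule, dropping subgenre predictions whose parent main genre was not itself predicted, can be sketched as follows. The parent-lookup dict is an assumed representation for illustration.</p>

```python
# Keep a predicted subgenre only if its parent main genre was also
# predicted; main genres (no parent) are always kept.

def filter_predictions(predicted, parent_of):
    """Drop subgenres whose parent main genre is absent from `predicted`."""
    kept = set()
    for label in predicted:
        parent = parent_of.get(label)     # None => label is a main genre
        if parent is None or parent in predicted:
            kept.add(label)
    return kept

# Main genre A has subgenres B, C; main genre D has subgenres E, F.
parent_of = {"B": "A", "C": "A", "E": "D", "F": "D"}

# Predicted: main genre A plus subgenres C and F, but not main genre D,
# so F is dropped.
final = filter_predictions({"A", "C", "F"}, parent_of)
```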
    </sec>
    <sec id="sec-3">
      <title>RandomForest with Partial Feature Selection: Run 3, 4 and 5</title>
      <p>For the next three Runs, we used a random forest classifier (RFC)
with partial feature selection, based on the feature importance [4]
of the trained random forest classifier. We first took a subset of
the training data (around 100,000 songs), formulated a concatenated
feature vector for each song, and fit the features to an RFC for each
genre and subgenre. We then used the ranked feature importance
list from each classifier to select the x% best features, which resulted
in a different set of best features for each genre and subgenre [5]. From
there, we trained one RFC per genre and subgenre using its own top x%
features, on a subset of the training data (around 150,000 songs).
Finally, predictions of the genres were made by aggregating the
outputs of all the RFCs.</p>
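      <p>The selection step can be sketched as follows. The importance scores here are illustrative; in the actual runs they come from a trained RandomForestClassifier's feature_importances_ attribute.</p>

```python
# Rank features by importance and keep the top fraction, best first.

def top_fraction(importances, fraction):
    """Indices of the `fraction` most important features."""
    k = int(len(importances) * fraction)
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return ranked[:k]

# Toy importance scores: feature 1 dominates.
selected = top_fraction([0.1, 0.5, 0.2, 0.2], 0.5)   # [1, 2]

# With the 2647-dimensional vectors, the 25%/50%/75% cuts yield the
# 661-, 1323-, and 1985-dimensional vectors used in Runs 3, 4, and 5.
dims = [int(2647 * f) for f in (0.25, 0.50, 0.75)]
```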
      <p>Runs 3, 4, and 5 used the top 25%, 50%, and 75% of the features
from the ranked feature importance list of the trained RFC, which
resulted in 661-, 1323-, and 1985-dimensional feature vectors
per song, respectively.</p>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>In this section, we report accumulated results on the subtask
for our two different approaches; the official results are available
at https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/results/.
Our results are reported in Figure 1. The test set is composed of
three different databases (Discogs, Lastfm, Tagtraum), and we took the
average of precision, recall, and F-score over them to obtain a single
number.</p>
      <p>[Figure 1: precision (P), recall (R), and F-score (F), per track
and per label, overall and broken down by genre and subgenre, for each
run.]</p>
      <p>We observe that the approaches based on the Random Forest
Classifier (Runs 3, 4, and 5) outperform the SGD Classifier approaches
(Runs 1 and 2). In particular, we note that the recall in Runs 1 and 2
is especially high while the precision is especially low, meaning
the classifiers assigned almost every label to each song. For Runs
3, 4, and 5, we observe a significantly lower recall with a better
precision.</p>
      <p>For Runs 3, 4, and 5, we observe a trend that adding more
features improves recall at the cost of precision. However, Run 5
deviates from this trend: it yields better recall only in the
per-label results, while showing worse results on all per-track
metrics.</p>
      <p>Runs 1 and 2 clearly suffered from oversampling, which led
the classifiers for most genres to predict positive, resulting in
high recall and low precision. Runs 3, 4, and 5 did not suffer in the
same way, but upon examining precision, recall, and F-scores for
each genre, the classifiers did far worse on less popular genres and
subgenres, which led to lower overall precision and recall.</p>
      <p>For Runs 1 and 2, the shortcomings came from sampling errors,
which were technical in nature, originating from bugs in our code. For
Runs 3, 4, and 5, the shortcomings came from the lack of a system
for combining results from the different classifiers. For one,
we could have exploited the probabilities generated by the model
for each prediction to determine a threshold for each genre and
subgenre. This would have helped especially for sparse subgenres.</p>
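      <p>The thresholding idea can be sketched as below: per genre, pick the probability cutoff that maximizes F-score on held-out data. The toy probabilities and labels are invented for illustration.</p>

```python
# Select a per-genre decision threshold by maximizing F-score on
# validation data, instead of a fixed 0.5 cutoff.

def f_score(truth, preds):
    tp = sum(t and p for t, p in zip(truth, preds))
    fp = sum((not t) and p for t, p in zip(truth, preds))
    fn = sum(t and (not p) for t, p in zip(truth, preds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, truth, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Threshold maximizing F-score for one genre on validation data."""
    return max(candidates,
               key=lambda t: f_score(truth, [p >= t for p in probs]))

probs = [0.95, 0.80, 0.40, 0.30, 0.10]   # model confidence per song
truth = [True, True, True, False, False]
threshold = best_threshold(probs, truth)  # 0.3 here: catches all positives
```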
      <p>As future work for Runs 3, 4, and 5, it would be interesting
to see whether taking a smaller top percentage of the features from
the feature importance list for each genre and subgenre would improve
precision. It may also be worth trying majority voting over several
different random forest classifiers trained using the feature
importance list. Lastly, it would be interesting to try the
imbalanced-learn package [3], which is compatible with scikit-learn
and may fix the class imbalance for Runs 1 and 2.</p>
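      <p>As a pure-Python illustration of the kind of rebalancing that imbalanced-learn automates, one can randomly undersample the majority class before training each binary classifier; the data below is made up.</p>

```python
# Keep all minority-class samples and an equal-sized random subset of
# the majority class, so both classes are equally represented.
import random

def undersample(samples, labels, seed=0):
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority = min(pos, neg, key=len)
    majority = max(neg, pos, key=len)
    rng = random.Random(seed)
    chosen = minority + rng.sample(majority, len(minority))
    return [samples[i] for i in chosen], [labels[i] for i in chosen]

X = [[float(i)] for i in range(10)]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]        # 2 positives, 8 negatives
X_bal, y_bal = undersample(X, y)          # 2 positives, 2 negatives
```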
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported in part by AWS Research Grants.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Dmitry Bogdanov, Alastair Porter, Julián Urbano, and Hendrik Schreiber. 2017. The MediaEval 2017 AcousticBrainz Genre Task: Content-based music genre recognition from multiple sources. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13-15, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Ben Hoyle, Markus Michael Rau, Roman Zitlau, Stella Seitz, and Jochen Weller. 2015. Feature importance for machine learning redshifts applied to SDSS galaxies. Monthly Notices of the Royal Astronomical Society (2015).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research 18, 17 (2017), 1-5.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Gilles Louppe. 2014. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502 (2014).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. 2013. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems. 431-439.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>