=Paper=
{{Paper
|id=Vol-1263/paper88
|storemode=property
|title=MediaEval 2014: A Multimodal Approach to Drop Detection in Electronic Dance Music
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_88.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiSWV14
}}
==MediaEval 2014: A Multimodal Approach to Drop Detection in Electronic Dance Music==
Anna Aljanaki (Information and Computing Sciences, Utrecht University, the Netherlands, a.aljanaki@uu.nl), Mohammad Soleymani (Computer Science Dept., University of Geneva, Switzerland, mohammad.soleymani@unige.ch), Frans Wiering (Information and Computing Sciences, Utrecht University, the Netherlands, F.Wiering@uu.nl), Remco C. Veltkamp (Information and Computing Sciences, Utrecht University, the Netherlands, R.C.Veltkamp@uu.nl)

* The first two authors contributed equally to this work and appear in alphabetical order.

ABSTRACT

We predict drops in electronic dance music (EDM), employing different multimodal approaches. We combine three sources of data: noisy labels collected through crowdsourcing, timed comments from SoundCloud, and audio content analysis. We predict the correct labels from the noisy labels using the majority vote and Dawid-Skene methods. We also employ timed comments from SoundCloud users to count the occurrence of specific terms near the potential drop event, and, finally, we conduct an acoustic analysis of the audio excerpts. The best results are obtained when annotations, metadata and audio are all combined, though the differences between the runs are not significant.

1. INTRODUCTION

This working notes paper describes a submission to the CrowdSorting brave new task in the MediaEval 2014 benchmark. The main aim of the task is to detect drops in electronic music. According to the Wikipedia definition: "Drop or climax is the point in a music track where a switch of rhythm or bassline occurs and usually follows a recognizable build section and break" [1]. The task involves categorizing 15-second electronic music excerpts into three categories: those containing a drop, those containing part of a drop, and those without a drop. The organizers provide three types of data: unreliable crowdsourced annotations, timed comments from SoundCloud users, and audio. Acoustic analysis is optional for the task. For more detail we refer to the task overview paper [3].

Due to the social attention that the drop phenomenon receives in electronic music, the task of drop detection is naturally suited to a combined approach using both metadata and acoustic features. An acoustic-only approach is rather challenging, because there are many informal descriptions of what constitutes a drop, including rhythmic and dynamic changes, or specific patterns in the bass line. Moreover, the presence or absence of a drop in a specific case can be debatable.

2. RELATED WORK

Karthik Yadati et al. [4] (the organisers of the MediaEval 2014 CrowdSorting task) conducted an acoustic analysis to detect drops in EDM. The audio was first segmented under the assumption that a drop moment must be an important structural boundary. Then, each of the segmentation boundaries was classified based on the analysis of several features in a time window around the potential drop. MFCCs, spectrogram and rhythmical features were used, based on the notion that a drop event is usually characterized by a sudden change of rhythm and timbre.

3. APPROACH

For each of the excerpts, three annotations from MTurk workers were provided. Fleiss' kappa for these labels was 0.24 (calculated without songs from the fourth category, "absent sound file"). Around 30% of the excerpts were rated unanimously by the annotators; for about 60%, two of the three annotators agreed; and for the remaining 10%, all the annotators provided different answers. We mainly sought to improve the categorization of the second and especially the last group.

3.1 Metadata analysis and improving ground truth

We submitted four runs: three based on annotations and other metadata, and one based on a combination of metadata and acoustic features.

The first run employs a simple majority vote. In case all the annotators categorize the segment differently, we label it as containing part of the drop.

In the second run, we use the Dawid-Skene algorithm [2] to compute the probability of each label and the quality of the workers, based on their agreement with the other workers. The Dawid-Skene model estimates a confusion matrix for each worker via maximum-likelihood estimation. We use the Get-Another-Label implementation of Dawid-Skene (https://github.com/ipeirotis/Get-Another-Label), and then combine the calculated probabilities with the given labels to predict the actual labels.

In the third run, we count the number of timed comments from SoundCloud users which include the term "drop" near the moment of the hypothetical drop (the 15-second time window defined by the organizers). We use a Naïve Bayes classifier to train a model based on the number of such comments in addition to the three noisy labels. The model is only used to categorize the excerpts with no agreement between annotators.
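As an illustration of the second run, the sketch below implements a minimal Dawid-Skene EM loop in the spirit of [2]. It is a sketch under stated assumptions, not the Get-Another-Label code used for the submission: the vote-proportion initialization, the smoothing constants and the stopping rule are choices made for the example.

<syntaxhighlight lang="python">
# Minimal Dawid-Skene label aggregation via EM (illustrative, not the
# Get-Another-Label implementation). Assumes every item has at least one label.
import numpy as np

def dawid_skene(triples, n_classes, n_iter=50, tol=1e-6):
    """triples: iterable of (item, worker, label), all zero-indexed."""
    triples = np.asarray(triples)
    n_items = triples[:, 0].max() + 1
    n_workers = triples[:, 1].max() + 1

    # Initialize item-class posteriors T with per-item vote proportions.
    T = np.zeros((n_items, n_classes))
    for i, w, l in triples:
        T[i, l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker,
        # pi[w, k, l] = P(worker w answers l | true class k).
        priors = T.mean(axis=0)
        pi = np.full((n_workers, n_classes, n_classes), 1e-9)  # smoothing
        for i, w, l in triples:
            pi[w, :, l] += T[i]                                # soft counts
        pi /= pi.sum(axis=2, keepdims=True)

        # E-step: recompute posteriors from priors and worker reliabilities.
        T_new = np.tile(np.log(priors + 1e-12), (n_items, 1))
        for i, w, l in triples:
            T_new[i] += np.log(pi[w, :, l])
        T_new = np.exp(T_new - T_new.max(axis=1, keepdims=True))
        T_new /= T_new.sum(axis=1, keepdims=True)

        if np.abs(T_new - T).max() < tol:
            T = T_new
            break
        T = T_new

    return T.argmax(axis=1), T  # hard labels and per-class posteriors

# Toy usage: three workers label two excerpts (0=drop, 1=part, 2=no drop).
votes = [(0, 0, 0), (0, 1, 0), (0, 2, 1),
         (1, 0, 2), (1, 1, 2), (1, 2, 2)]
hard_labels, posteriors = dawid_skene(votes, n_classes=3)
</syntaxhighlight>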
3.2 Audio analysis

As training data, we employed the excerpts on which all three workers agreed. There were 164 such excerpts in total: 105 for which the workers indicated that the excerpt contained an entire drop, 54 for which they indicated there was no drop, and 4 for which they agreed that part of the drop was present. We decided to exclude the excerpts labeled "part of the drop", as it is not possible to learn to recognize that category from just four samples.

The acoustic approach was based on the observation that during a drop there is usually a moment of silence, or that the loudness level sometimes changes drastically after the drop. We analyzed the energy of the signal in non-overlapping windows of 100 ms. The obtained time series was smoothed using a weighted moving average, and the smoothed time series was segmented at its local maxima and minima (Figure 1). To predict the presence of a drop event, we used the following statistics on these extrema (sketched below):

1. The value of the biggest local minimum in an excerpt
2. The ratio of the biggest minimum to the average minimum
3. The number of potential drop events, detected as decreases in loudness bigger than a threshold
4. The dynamic range of the excerpt

[Figure 1: A smoothed and segmented time series of an excerpt with a drop. Axes: loudness (ERBs) against time window index; local maxima, local minima, the drop, and the smoothed and unsmoothed series are marked.]

Based on these characteristics and a ground truth of 160 excerpts, we trained a logistic regression classifier to predict the presence of drops, and obtained 80% precision with 10-fold cross-validation. The model was used to predict the presence of drops for the excerpts where all three workers gave different ratings (i.e., "drop is present", "part of the drop is present" and "drop is not present"). The biggest limitation of this approach is that the model does not incorporate the "part of the drop" category.
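The feature-extraction step above can be sketched as follows. This is a sketch under stated assumptions: it substitutes RMS energy in decibels for the ERB loudness scale used in the paper, the smoothing weights and the drop threshold `drop_db` are assumed values rather than the submission's, and "biggest local minimum" is read as the deepest dip in the contour.

<syntaxhighlight lang="python">
# Loudness-contour features for drop detection (illustrative stand-in for the
# paper's ERB loudness analysis; weights and thresholds are assumptions).
import numpy as np

def loudness_contour(signal, sr, win_s=0.1):
    """Energy in dB over non-overlapping 100 ms windows."""
    hop = int(sr * win_s)
    n = len(signal) // hop
    frames = signal[: n * hop].reshape(n, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return 20 * np.log10(rms + 1e-10)

def smooth(x, weights=(1, 2, 3, 2, 1)):
    """Weighted moving average; the weight profile is an assumed choice."""
    w = np.asarray(weights, dtype=float)
    return np.convolve(x, w / w.sum(), mode="same")

def drop_features(contour, drop_db=12.0):
    """The four statistics of Section 3.2, computed on local extrema."""
    # Indices of local minima of the (smoothed) contour.
    interior = contour[1:-1]
    idx = np.where((interior < contour[:-2]) & (interior < contour[2:]))[0] + 1
    minima = contour[idx] if len(idx) else contour[[contour.argmin()]]
    deepest = minima.min()
    return np.array([
        deepest,                               # 1. deepest local minimum
        deepest / minima.mean(),               # 2. ratio to the average minimum
        np.sum(np.diff(contour) < -drop_db),   # 3. falls bigger than threshold
        contour.max() - contour.min(),         # 4. dynamic range
    ])
</syntaxhighlight>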
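The classifier step can then be sketched with scikit-learn. The solver settings below are library defaults rather than choices reported in the paper, and `excerpts` and `labels` are hypothetical stand-ins for the 160-excerpt ground truth described above.

<syntaxhighlight lang="python">
# Logistic regression over the four contour statistics with 10-fold CV
# (sketch; `excerpts` and `labels` are hypothetical placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One 4-dimensional feature row per unanimously labeled excerpt;
# y: 1 = "drop", 0 = "no drop" (the "part of the drop" excerpts excluded).
X = np.vstack([drop_features(smooth(loudness_contour(sig, sr)))
               for sig, sr in excerpts])
y = np.asarray(labels)

clf = LogisticRegression(max_iter=1000)
precision = cross_val_score(clf, X, y, cv=10, scoring="precision")
print(f"10-fold CV precision: {precision.mean():.2f}")
</syntaxhighlight>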
4. EVALUATION

The evaluation metric for this task is the F1 score, calculated against high-fidelity expert labels used as ground truth. Though there are some differences between the submissions, none of them are statistically significant under a one-sided Wilcoxon rank-sum test. As usual, the majority vote scores are hard to beat. Using comments from SoundCloud users results in some improvement, and using acoustic features performs similarly. Looking at the accuracy per category, we can see that the acoustic submission suffers from imprecision in the category "part of the drop", which is natural, because it does not model that category. On the other hand, its precision on "no drop" labels is higher than that of all other submissions.

Table 1: Overall and per-category F1 scores for the four runs.

  Run    Name           F1    Drop  Part  No drop
  Run 1  Majority Vote  0.69  0.72  0.31  0.75
  Run 2  DS             0.69  0.72  0.31  0.75
  Run 3  MV+SoundCloud  0.70  0.73  0.28  0.76
  Run 4  MV+Audio       0.71  0.72  0.27  0.79

5. CONCLUSION

In this task, we achieved only a marginal improvement over the baseline, i.e., the majority vote. Both the acoustic analysis and the use of SoundCloud metadata resulted in a small but insignificant prediction improvement. This shows that, given enough labels from MTurk workers, we could not significantly improve the accuracy using the audio content or social media metadata. They are nevertheless useful in cold-start scenarios.

6. ACKNOWLEDGEMENTS

This publication was supported by the Dutch national program COMMIT.

7. REFERENCES

[1] M. J. Butler. Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music. 2006.
[2] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979.
[3] K. Yadati, P. S. N. Chandrasekaran Ayyanathan, and M. Larson. Crowdsorting timed comments about music: Foundations for a new crowdsourcing task. In MediaEval Workshop, Barcelona, Spain, October 16-17, 2014.
[4] K. Yadati, M. A. Larson, C. C. Liem, and A. Hanjalic. Detecting drops in electronic dance music: Content-based approaches to a socially significant music event. In Proceedings of the 15th International Society for Music Information Retrieval Conference, 2014.