=Paper= {{Paper |id=Vol-1263/paper88 |storemode=property |title=MediaEval 2014: A Multimodal Approach to Drop Detection in Electronic Dance Music |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_88.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/AljanakiSWV14 }} ==MediaEval 2014: A Multimodal Approach to Drop Detection in Electronic Dance Music== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_88.pdf
MediaEval 2014: A Multimodal Approach to Drop Detection in Electronic Dance Music∗

Anna Aljanaki, Information and Computing Sciences, Utrecht University, the Netherlands (a.aljanaki@uu.nl)
Mohammad Soleymani, Computer Science Dept., University of Geneva, Switzerland (mohammad.soleymani@unige.ch)
Frans Wiering, Information and Computing Sciences, Utrecht University, the Netherlands (F.Wiering@uu.nl)
Remco C. Veltkamp, Information and Computing Sciences, Utrecht University, the Netherlands (R.C.Veltkamp@uu.nl)


∗First two authors contributed equally to this work and appear in alphabetical order.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
We predict drops in electronic dance music (EDM), employing different multimodal approaches. We combine three sources of data: noisy labels collected through crowdsourcing, timed comments from SoundCloud, and audio content analysis. We predict the correct labels from the noisy labels using the majority-vote and Dawid-Skene methods. We also employ timed comments from SoundCloud users to count the occurrence of specific terms near the potential drop event, and, finally, we conduct an acoustic analysis of the audio excerpts. The best results are obtained when annotations, metadata, and audio are all combined, though the differences between the approaches are not significant.

1. INTRODUCTION
This working notes paper describes a submission to the CrowdSorting brave new task in the MediaEval 2014 benchmark. The main aim of the task is to detect drops in electronic music. According to the Wikipedia definition: “Drop or climax is the point in a music track where a switch of rhythm or bassline occurs and usually follows a recognizable build section and break” [1]. The task involves categorizing 15-second electronic music excerpts into three categories: those containing a drop, those containing part of a drop, and those without a drop. The organizers provide three types of data: unreliable crowdsourced annotations, timed comments from SoundCloud users, and audio. Acoustic analysis is optional for the task. For more detail we refer to the task overview paper [3].

We submitted four runs: three are based on annotations and other metadata, and one is based on a combination of metadata and acoustic features. Due to the social attention that the drop phenomenon receives in electronic music, the task of drop detection naturally lends itself to a combined approach using both metadata and acoustic features. An acoustic-only approach is rather challenging, because there are many informal descriptions of what constitutes a drop, including rhythmic and dynamic changes or specific patterns in the bass line. Moreover, the presence or absence of a drop in a specific case can be debatable.

2. RELATED WORK
Karthik Yadati et al. [4] (the organisers of the MediaEval 2014 CrowdSorting task) conducted an acoustic analysis to detect drops in EDM. The audio was first segmented under the assumption that a drop moment must be an important structural boundary. Then, each of the segmentation boundaries was classified based on an analysis of several features in a time window around the potential drop. MFCCs, the spectrogram, and rhythmic features were used, based on the notion that a drop event is usually characterized by a sudden change of rhythm and timbre.

3. APPROACH
For each of the excerpts, three annotations from MTurk workers were provided. Fleiss’ kappa for these labels was 0.24 (calculated without songs from the fourth category, “absent sound file”). Around 30% of the excerpts were rated unanimously by the annotators. For about 60%, two of the three annotators agreed. For the remaining 10% of the excerpts, all the annotators provided different answers. We mainly sought to improve the categorization of the second and especially the last of these groups.

3.1 Metadata analysis and improving ground truth
The first run employs a simple majority vote. In case all the annotators categorize the segment differently, we label it as containing part of the drop.

In the second run, we use the Dawid-Skene algorithm [2] to compute the probability of each label and the quality of the workers, based on their agreement with the other workers. The Dawid-Skene model calculates a confusion matrix for each worker using maximum-likelihood estimation based on their agreement with the other workers. We use the Get-Another-Label toolbox implementation of Dawid-Skene (https://github.com/ipeirotis/Get-Another-Label). Then, we use the calculated probabilities combined with the given labels to predict the actual labels.
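For illustration, the aggregation behind the first two runs can be sketched in a few lines of Python. This is a self-contained, minimal re-implementation under our own assumptions (the label names and the item-to-worker data layout are hypothetical); the actual submission used the Get-Another-Label toolbox.

```python
from collections import Counter

LABELS = ["drop", "part", "no_drop"]  # hypothetical label names

def majority_vote(votes):
    """Run 1: take the majority label; a three-way disagreement
    is labelled as containing part of the drop."""
    label, count = Counter(votes).most_common(1)[0]
    return "part" if count == 1 else label

def dawid_skene(annotations, labels=LABELS, n_iter=50):
    """Run 2 (sketch): EM estimation of class priors and per-worker
    confusion matrices (Dawid & Skene, 1979).
    `annotations` maps item -> {worker: observed label}."""
    items = list(annotations)
    workers = sorted({w for votes in annotations.values() for w in votes})
    # Initialise item posteriors from raw vote fractions.
    post = {i: {t: sum(v == t for v in annotations[i].values()) / len(annotations[i])
                for t in labels} for i in items}
    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices.
        prior = {t: sum(post[i][t] for i in items) / len(items) for t in labels}
        conf = {w: {t: {o: 1e-6 for o in labels} for t in labels} for w in workers}
        for i in items:
            for w, obs in annotations[i].items():
                for t in labels:
                    conf[w][t][obs] += post[i][t]
        for w in workers:
            for t in labels:
                norm = sum(conf[w][t].values())
                for o in labels:
                    conf[w][t][o] /= norm
        # E-step: recompute item posteriors given priors and confusions.
        for i in items:
            score = {}
            for t in labels:
                p = prior[t]
                for w, obs in annotations[i].items():
                    p *= conf[w][t][obs]
                score[t] = p
            norm = sum(score.values()) or 1.0
            post[i] = {t: score[t] / norm for t in labels}
    return {i: max(post[i], key=post[i].get) for i in items}
```

Unlike the plain majority vote, the EM loop downweights workers whose answers rarely agree with the consensus, which is what allows it to break three-way ties in a principled way.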
In the third run, we count the number of timed comments from SoundCloud users that include the term ”drop” near the hypothesized drop moment (the 15-second time window defined by the organizers). We use a Naïve Bayes classifier to train a model based on this comment count in addition to the three noisy labels. The model is only used to categorize the excerpts with no agreement between the annotators.
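As an illustration of this comment feature, assuming the timed comments are given as (timestamp in seconds, text) pairs, which is our own layout rather than the task’s actual data format, the count could be computed as:

```python
def drop_comment_count(comments, window_start, window_len=15.0):
    """Count timed comments containing the term 'drop' that fall
    inside the 15-second window of a candidate excerpt."""
    return sum(
        1
        for t, text in comments
        if window_start <= t < window_start + window_len and "drop" in text.lower()
    )

# Hypothetical timed comments for one track:
comments = [(62.0, "THE DROP!!"), (64.5, "here it comes"), (71.0, "best drop ever")]
print(drop_comment_count(comments, 60.0))  # window [60, 75) -> 2
```

This count, alongside the three noisy labels, would then form the input features for the Naïve Bayes classifier.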




3.2 Audio analysis
As training data, we employed the excerpts on which all three workers agreed. There were 164 such excerpts in total: 105 for which the workers indicated that the excerpt contained an entire drop, 54 for which they indicated there was no drop, and 4 for which they agreed there was part of the drop present. We decided to exclude the excerpts labeled ”part of the drop”, as it is not possible to learn to recognize this category from just four samples.

The acoustic approach was based on the observation that during a drop there is usually a moment of silence, or the loudness level sometimes changes drastically after the drop. We analyzed the energy of the signal in non-overlapping windows of 100 ms. The obtained time series was smoothed using a weighted moving average, and the smoothed time series was segmented at its local maxima and minima (see Figure 1). To predict the presence of a drop event, we used the following statistics on these events:

  1. The value of the largest local minimum in an excerpt
  2. The ratio of the largest minimum to the average minimum
  3. The number of potential drop events, detected as decreases in loudness larger than a threshold
  4. The dynamic range of the excerpt

[Figure omitted: loudness in ERB units plotted against time window index, showing the unsmoothed and smoothed time series, local maxima and minima, and the drop location.]
Figure 1: A smoothed and segmented time series of an excerpt with a drop.

Based on these characteristics and a ground truth of 160 excerpts, we trained a logistic regression classifier to predict the presence of drops, and obtained 80% precision with 10-fold cross-validation. The model was used to predict the presence of drops for the excerpts where all three workers gave different ratings (i.e., ”drop is present”, ”part of the drop is present”, ”drop is not present”). The biggest limitation of this approach is that the model does not incorporate the ”part of the drop” category.

4. EVALUATION
The evaluation metric for this task is the F1 score, calculated against high-fidelity labels from experts, used as ground truth. Though there are some differences between the submissions, none of them are statistically significant under a one-sided Wilcoxon rank-sum test. Majority-vote scores are, as usual, hard to beat. Using comments from SoundCloud users results in some improvement, and using acoustic features performs similarly. Looking at the accuracy per category, we can see that the acoustic submission suffers from imprecision in the category ”part of the drop”, which is natural, because it does not model that category. On the other hand, its precision on the ”no drop” label is higher than that of all other submissions.

  Run     Name           F1     Drop   Part   No drop
  Run 1   Majority Vote  0.69   0.72   0.31   0.75
  Run 2   DS             0.69   0.72   0.31   0.75
  Run 3   MV+SoundCloud  0.70   0.73   0.28   0.76
  Run 4   MV+Audio       0.71   0.72   0.27   0.79

5. CONCLUSION
In this task, we achieved only a marginal improvement over the baseline, i.e., the majority vote. Both the acoustic analysis and the use of SoundCloud metadata resulted in a small but insignificant improvement in prediction. This shows that, in the presence of enough labels from MTurk workers, we could not significantly improve the accuracy based on the audio content or social media metadata. They may nevertheless be useful in cold-start scenarios.

6. ACKNOWLEDGEMENTS
This publication was supported by the Dutch national program COMMIT.

7. REFERENCES
[1] M. J. Butler. Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music. 2006.
[2] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979.
[3] K. Yadati, P. S. N. Chandrasekaran Ayyanathan, and M. Larson. Crowdsorting timed comments about music: Foundations for a new crowdsourcing task. In MediaEval Workshop, Barcelona, Spain, October 16-17, 2014.
[4] K. Yadati, M. A. Larson, C. C. Liem, and A. Hanjalic. Detecting drops in electronic dance music: Content-based approaches to a socially significant music event. In Proceedings of the 15th International Society for Music Information Retrieval Conference, 2014.